This tutorial is a literate Haskell program whwre we start by specifying the language pragmas and imports we will need for this module.
{-# LANGUAGE QuasiQuotes #-}
import Text.RE.Replace
import Text.RE.TDFA.String
regex supports the replacement of matched text with alternative text. This section will cover replacement text specified with templates. More flexible tools that allow functions calculate the replacement text are covered below.
Capture sub-expressions, whose matched text can be inserted into the replacement template, can be specified as follows:
$(
… )
identifies a capture that can be identified by its left-to-right position relative to the other captures in the replacement template, with $1
being used to represent the leftmost capture, $2
the next leftmost capture, and so on;
${foo}(
… )
can be used to identify a capture by name. Such captures can be identified either by their left-to-right position in the regular expression or by ${foo}
in the template.
uk_dates :: String -> String
=
uk_dates src "${d}/${m}/${y}" $ src *=~ [re|${y}([0-9]{4})-${m}([0-9]{2})-${d}([0-9]{2})|] replaceAll
> uk_dates "2016-01-09 2015-12-5 2015-10-05"
ghci"09/01/2016 2015-12-5 05/10/2015"
uk_dates' :: String -> String
=
uk_dates' src "$3/$2/$1" $ src *=~ [re|$([0-9]{4})-$([0-9]{2})-$([0-9]{2})|] replaceAll
> uk_dates' "2016-01-09 2015-12-5 2015-10-05"
ghci"09/01/2016 2015-12-5 05/10/2015"
yielding the same result.
(Most regex conventions use plain parentheses, (
… )
, to mark captures but we would like to reserve those exclusively for grouping in regex REs.)
The types returned by the ?=~
and *=~
form the foundations of the package. Understandingv these simple types is the key to understanding the package.
*=~
in this module (imported from Text.RE.TDFA.String
) is:
(*=~) :: String -> RE -> Matches String
with Matches
defined in Text.RE.ZeInternals.Types.Capture
thus:
-- | the result of matching a RE against a text (with @*=~@), retaining
-- the text that was matched against
data Matches a =
Matches
matchesSource :: !a -- ^ the source text being matched
{ allMatches :: ![Match a] -- ^ all 'Match' instances found, left to right
,
}deriving (Show,Eq,Typeable)
The critical component of the Matches
type is the [Match a]
in allMatches
, containing the details all of each substring matched by the RE. The matchSource
component also retains a copy of the original search string but the critical information is in allmatches
.
?=~
in this module (imported from Text.RE.TDFA.String
) is:
(?=~) :: String -> RE -> Match String
with Match
(referenced in the definition of Matches
above) defined in Text.RE.ZeInternals.Types.Capture
thus:
-- | the result of matching a RE to a text once (with @?=~@), retaining
-- the text that was matched against
data Match a =
Match
matchSource :: !a -- ^ the whole source text
{ captureNames :: !CaptureNames -- ^ the RE's capture names
, matchArray :: !(Array CaptureOrdinal (Capture a))
,-- ^ 0..n-1 captures,
-- starting with the
-- text matched by the
-- whole RE
}deriving (Show,Eq,Typeable)
Like matchesSource
above, matchSource
retains the original search string, but also a CaptureNames
field listing all of the capture names in the RE (needed by the text replacemnt tools).
But the ‘real’ content of Match
is to be found in the MatchArray
, enumerating all of the substrings captured by this match, starting with 0
for the substring captured by the whole RE, 1
for the leftmost explicit capture in the RE, 2
for the next leftmost capture, and so on.
Each captured substring is represented by the following Capture
type:
-- | the matching of a single sub-expression against part of the source
-- text
data Capture a =
Capture
captureSource :: !a -- ^ the whole text that was searched
{ capturedText :: !a -- ^ the text that was matched
, captureOffset :: !Int -- ^ the number of characters preceding the
,-- match with -1 used if no text was captured
-- by the RE (not even the empty string)
captureLength :: !Int -- ^ the number of chacter in the captured
,-- sub-string
}deriving (Show,Eq)
Here we list the whole original search string in captureSource
and the text of the sub-string captured in capturedText
. captureOffset
contains the number of characters preceding the captured substring, or is negative if no substring was captured (which is a different situation from epsilon, the empty string, being captured). captureLength
gives the length of the captured string in capturedText
.
The test suite in examples/re-tests.lhs contains extensive worked-out examples of these Matches
/Match
/Capture
types.
replaceAllCaptures
takes a REContext
, a substitution function and a Matches
and applies the function to each captured substring according to the REContext
, as we can see in the following example function to clean up all of the mis-formatted dates in the argument string,
fixup_dates :: String -> String
=
fixup_dates src SUB phi $ src *=~ [re|([0-9]+)-([0-9]+)-([0-9]+)|]
replaceAllCaptures where
= Just $ case locationCapture loc of
phi _ loc cap 1 -> fmt 4 $ read s
2 -> fmt 2 $ read s
3 -> fmt 2 $ read s
-> error "fixup_dates"
_ where
= capturedText cap
s
fmt :: Int -> Int -> String
= replicate (max 0 $ w - length x_s ) '0' ++ x_s
fmt w x where
= show x x_s
> fixup_dates "2016-01-09 2015-12-5 2015-10-05"
ghci"2016-01-09 2015-12-05 2015-10-05"
The replaceAllCaptures
function is of type
-- | substitutes using a function that takes the full Match
-- context and returns the same replacement text as the _phi_phi
-- context.
replaceAllCaptures :: Replace a
=> REContext
-> (Match a->RELocation->Capture a->Maybe a)
-> Matches a
-> a
and the REContext
and RELocation
types are defined in Text.RE.Replace
as follows,
-- | @REContext@ specifies which contexts the substitutions should be applied
data REContext
= TOP -- ^ substitutions should be applied to the top-level only,
-- the text that matched the whole RE
| SUB -- ^ substitutions should only be applied to the text
-- captured by bracketed sub-REs
| ALL -- ^ the substitution function should be applied to all
-- captures, the top level and the sub-expression captures
deriving (Show)
-- | the @RELocation@ information passed into the substitution function
-- specifies which sub-expression is being substituted
data RELocation =
RELocation
locationMatch :: Int
{-- ^ the zero-based, i-th string to be matched,
-- when matching all strings, zero when only the
-- first string is being matched
locationCapture :: CaptureOrdinal
,-- ^ 0, when matching the top-level string
-- matched by the whole RE, 1 for the top-most,
-- left-most redex captured by bracketed
-- sub-REs, etc.
}deriving (Show)
The processing function gets applied to the captures specified by the REContext
, which can be directed to process ALL
of the captures, including the substring captured by the whole RE and all of the subsidiary capture, or just the TOP
, 0
capture that the whole RE matches, or just the SUB
(subsidiary) captures, as was the case above.
The substitution function takes the Match
corresponding to the current redex being processed, the RELocation
information specifying redex n redex and capure i, and the Capure
being substituted. Our substitution function didn’t need the Match
context so it ignored it.
The substition function either return Nothing
to indicate that no substitution should be made or the replacement text.
ALL
context and a 0
case for the substitution function.
fixup_and_reformat_dates :: String -> String
=
fixup_and_reformat_dates src ALL f $ src *=~ [re|([0-9]+)-([0-9]+)-([0-9]+)|]
replaceAllCaptures where
= Just $ case locationCapture loc of
f _ loc cap 0 -> "["++txt++"]"
1 -> fmt 4 $ read txt
2 -> fmt 2 $ read txt
3 -> fmt 2 $ read txt
-> error "fixup_date"
_ where
= capturedText cap txt
fixup_and_reformat_dates
applied to our running example,
> fixup_and_reformat_dates "2016-01-09 2015-12-5 2015-10-05"
ghci"[2016-01-09] [2015-12-05] [2015-10-05]"
Text.RE.Replace
provides analagous functions for replacing the test of a single Match
returned from ?=~
.