re-tutorial-replacing.lhs

The regex Replacing Tutorial

Language Options and Imports

This tutorial is a literate Haskell program whwre we start by specifying the language pragmas and imports we will need for this module.

{-# LANGUAGE QuasiQuotes                      #-}
import           Text.RE.Replace
import           Text.RE.TDFA.String

Simple Text Replacement

regex supports the replacement of matched text with alternative text. This section will cover replacement text specified with templates. More flexible tools that allow functions calculate the replacement text are covered below.

Capture sub-expressions, whose matched text can be inserted into the replacement template, can be specified as follows:

A function to convert ISO format dates into a UK-format date could be written thus:
uk_dates :: String -> String
uk_dates src =
  replaceAll "${d}/${m}/${y}" $ src *=~ [re|${y}([0-9]{4})-${m}([0-9]{2})-${d}([0-9]{2})|]
with
ghci> uk_dates "2016-01-09 2015-12-5 2015-10-05"
"09/01/2016 2015-12-5 05/10/2015"
The same function written with numbered captures:
uk_dates' :: String -> String
uk_dates' src =
  replaceAll "$3/$2/$1" $ src *=~ [re|$([0-9]{4})-$([0-9]{2})-$([0-9]{2})|]
with
ghci> uk_dates' "2016-01-09 2015-12-5 2015-10-05"
"09/01/2016 2015-12-5 05/10/2015"

yielding the same result.

(Most regex conventions use plain parentheses, (), to mark captures but we would like to reserve those exclusively for grouping in regex REs.)

Matches/Match/Capture

The types returned by the ?=~ and *=~ form the foundations of the package. Understandingv these simple types is the key to understanding the package.

The type of *=~ in this module (imported from Text.RE.TDFA.String) is:
(*=~) :: String -> RE -> Matches String

with Matches defined in Text.RE.ZeInternals.Types.Capture thus:

-- | the result of matching a RE against a text (with @*=~@), retaining
-- the text that was matched against
data Matches a =
  Matches
    { matchesSource :: !a          -- ^ the source text being matched
    , allMatches    :: ![Match a]  -- ^ all 'Match' instances found, left to right
    }
  deriving (Show,Eq,Typeable)

The critical component of the Matches type is the [Match a] in allMatches, containing the details all of each substring matched by the RE. The matchSource component also retains a copy of the original search string but the critical information is in allmatches.

The type of ?=~ in this module (imported from Text.RE.TDFA.String) is:
(?=~) :: String -> RE -> Match String

with Match (referenced in the definition of Matches above) defined in Text.RE.ZeInternals.Types.Capture thus:

-- | the result of matching a RE to a text once (with @?=~@), retaining
-- the text that was matched against
data Match a =
  Match
    { matchSource  :: !a                -- ^ the whole source text
    , captureNames :: !CaptureNames     -- ^ the RE's capture names
    , matchArray   :: !(Array CaptureOrdinal (Capture a))
                                        -- ^ 0..n-1 captures,
                                        -- starting with the
                                        -- text matched by the
                                        -- whole RE
    }
  deriving (Show,Eq,Typeable)

Like matchesSource above, matchSource retains the original search string, but also a CaptureNames field listing all of the capture names in the RE (needed by the text replacemnt tools).

But the ‘real’ content of Match is to be found in the MatchArray, enumerating all of the substrings captured by this match, starting with 0 for the substring captured by the whole RE, 1 for the leftmost explicit capture in the RE, 2 for the next leftmost capture, and so on.

Each captured substring is represented by the following Capture type:

-- | the matching of a single sub-expression against part of the source
-- text
data Capture a =
  Capture
    { captureSource  :: !a    -- ^ the whole text that was searched
    , capturedText   :: !a    -- ^ the text that was matched
    , captureOffset  :: !Int  -- ^ the number of characters preceding the
                              -- match with -1 used if no text was captured
                              -- by the RE (not even the empty string)
    , captureLength  :: !Int  -- ^ the number of chacter in the captured
                              -- sub-string
    }
  deriving (Show,Eq)

Here we list the whole original search string in captureSource and the text of the sub-string captured in capturedText. captureOffset contains the number of characters preceding the captured substring, or is negative if no substring was captured (which is a different situation from epsilon, the empty string, being captured). captureLength gives the length of the captured string in capturedText.

The test suite in examples/re-tests.lhs contains extensive worked-out examples of these Matches/Match/Capture types.

Using Functions to Replace Text

Sometimes you will need to process each string captured by an RE with a function. replaceAllCaptures takes a REContext, a substitution function and a Matches and applies the function to each captured substring according to the REContext, as we can see in the following example function to clean up all of the mis-formatted dates in the argument string,
fixup_dates :: String -> String
fixup_dates src =
    replaceAllCaptures SUB phi $ src *=~ [re|([0-9]+)-([0-9]+)-([0-9]+)|]
  where
    phi _ loc cap = Just $ case locationCapture loc of
        1 -> fmt 4 $ read s
        2 -> fmt 2 $ read s
        3 -> fmt 2 $ read s
        _ -> error "fixup_dates"
      where
        s = capturedText cap

fmt :: Int -> Int -> String
fmt w x = replicate (max 0 $ w - length x_s ) '0' ++ x_s
  where
    x_s = show x
which will fix up our running example
ghci> fixup_dates "2016-01-09 2015-12-5 2015-10-05"
"2016-01-09 2015-12-05 2015-10-05"

The replaceAllCaptures function is of type

-- | substitutes using a function that takes the full Match
-- context and returns the same replacement text as the _phi_phi
-- context.
replaceAllCaptures :: Replace a
                   => REContext
                   -> (Match a->RELocation->Capture a->Maybe a)
                   -> Matches a
                   -> a

and the REContext and RELocation types are defined in Text.RE.Replace as follows,

-- | @REContext@ specifies which contexts the substitutions should be applied
data REContext
  = TOP   -- ^ substitutions should be applied to the top-level only,
          -- the text that matched the whole RE
  | SUB   -- ^ substitutions should only be applied to the text
          -- captured by bracketed sub-REs
  | ALL   -- ^ the substitution function should be applied to all
          -- captures, the top level and the sub-expression captures
  deriving (Show)

-- | the @RELocation@ information passed into the substitution function
-- specifies which sub-expression is being substituted
data RELocation =
  RELocation
    { locationMatch   :: Int
          -- ^ the zero-based, i-th string to be matched,
          -- when matching all strings, zero when only the
          -- first string is being matched
    , locationCapture :: CaptureOrdinal
          -- ^ 0, when matching the top-level string
          -- matched by the whole RE, 1 for the top-most,
          -- left-most redex captured by bracketed
          -- sub-REs, etc.
    }
  deriving (Show)

The processing function gets applied to the captures specified by the REContext, which can be directed to process ALL of the captures, including the substring captured by the whole RE and all of the subsidiary capture, or just the TOP, 0 capture that the whole RE matches, or just the SUB (subsidiary) captures, as was the case above.

The substitution function takes the Match corresponding to the current redex being processed, the RELocation information specifying redex n redex and capure i, and the Capure being substituted. Our substitution function didn’t need the Match context so it ignored it.

The substition function either return Nothing to indicate that no substitution should be made or the replacement text.

The above fixup function could be extended to enclose whole date in square brackets by specifing an ALL context and a 0 case for the substitution function.
fixup_and_reformat_dates :: String -> String
fixup_and_reformat_dates src =
    replaceAllCaptures ALL f $ src *=~ [re|([0-9]+)-([0-9]+)-([0-9]+)|]
  where
    f _ loc cap = Just $ case locationCapture loc of
        0 -> "["++txt++"]"
        1 -> fmt 4 $ read txt
        2 -> fmt 2 $ read txt
        3 -> fmt 2 $ read txt
        _ -> error "fixup_date"
      where
        txt = capturedText cap
The fixup_and_reformat_dates applied to our running example,
ghci> fixup_and_reformat_dates "2016-01-09 2015-12-5 2015-10-05"
"[2016-01-09] [2015-12-05] [2015-10-05]"

Text.RE.Replace provides analagous functions for replacing the test of a single Match returned from ?=~.