re-tutorial-replacing.lhs

The regex Replacing Tutorial

Language Options and Imports

This tutorial is a literate Haskell program whwre we start by specifying the language pragmas and imports we will need for this module.

{-# LANGUAGE QuasiQuotes                      #-}
import           Text.RE.Replace
import           Text.RE.TDFA.String

Simple Text Replacement

regex supports the replacement of matched text with alternative text. This section will cover replacement text specified with templates. More flexible tools that allow functions calculate the replacement text are covered below.

Capture sub-expressions, whose matched text can be inserted into the replacement template, can be specified as follows:

A function to convert ISO format dates into a UK-format date could be written thus:
uk_dates :: String -> String
uk_dates src =
  replaceAll "${d}/${m}/${y}" $ src *=~ [re|${y}([0-9]{4})-${m}([0-9]{2})-${d}([0-9]{2})|]
with
ghci> uk_dates "2016-01-09 2015-12-5 2015-10-05"
"09/01/2016 2015-12-5 05/10/2015"
The same function written with numbered captures:
uk_dates' :: String -> String
uk_dates' src =
  replaceAll "$3/$2/$1" $ src *=~ [re|$([0-9]{4})-$([0-9]{2})-$([0-9]{2})|]
with
ghci> uk_dates' "2016-01-09 2015-12-5 2015-10-05"
"09/01/2016 2015-12-5 05/10/2015"

yielding the same result.

(Most regex conventions use plain parentheses, (), to mark captures but we would like to reserve those exclusively for grouping in regex REs.)

Matches/Match/Capture

The types returned by the ?=~ and *=~ form the foundations of the package. Understandingv these simple types is the key to understanding the package.

The type of *=~ in this module (imported from Text.RE.TDFA.String) is:

with Matches defined in Text.RE.ZeInternals.Types.Capture thus:

The critical component of the Matches type is the [Match a] in allMatches, containing the details all of each substring matched by the RE. The matchSource component also retains a copy of the original search string but the critical information is in allmatches.

The type of ?=~ in this module (imported from Text.RE.TDFA.String) is:

with Match (referenced in the definition of Matches above) defined in Text.RE.ZeInternals.Types.Capture thus:

Like matchesSource above, matchSource retains the original search string, but also a CaptureNames field listing all of the capture names in the RE (needed by the text replacemnt tools).

But the ‘real’ content of Match is to be found in the MatchArray, enumerating all of the substrings captured by this match, starting with 0 for the substring captured by the whole RE, 1 for the leftmost explicit capture in the RE, 2 for the next leftmost capture, and so on.

Each captured substring is represented by the following Capture type:

Here we list the whole original search string in captureSource and the text of the sub-string captured in capturedText. captureOffset contains the number of characters preceding the captured substring, or is negative if no substring was captured (which is a different situation from epsilon, the empty string, being captured). captureLength gives the length of the captured string in capturedText.

The test suite in examples/re-tests.lhs contains extensive worked-out examples of these Matches/Match/Capture types.

Using Functions to Replace Text

Sometimes you will need to process each string captured by an RE with a function. replaceAllCaptures takes a REContext, a substitution function and a Matches and applies the function to each captured substring according to the REContext, as we can see in the following example function to clean up all of the mis-formatted dates in the argument string,
fixup_dates :: String -> String
fixup_dates src =
    replaceAllCaptures SUB phi $ src *=~ [re|([0-9]+)-([0-9]+)-([0-9]+)|]
  where
    phi _ loc cap = Just $ case locationCapture loc of
        1 -> fmt 4 $ read s
        2 -> fmt 2 $ read s
        3 -> fmt 2 $ read s
        _ -> error "fixup_dates"
      where
        s = capturedText cap

fmt :: Int -> Int -> String
fmt w x = replicate (max 0 $ w - length x_s ) '0' ++ x_s
  where
    x_s = show x
which will fix up our running example
ghci> fixup_dates "2016-01-09 2015-12-5 2015-10-05"
"2016-01-09 2015-12-05 2015-10-05"

The replaceAllCaptures function is of type

and the REContext and RELocation types are defined in Text.RE.Replace as follows,

The processing function gets applied to the captures specified by the REContext, which can be directed to process ALL of the captures, including the substring captured by the whole RE and all of the subsidiary capture, or just the TOP, 0 capture that the whole RE matches, or just the SUB (subsidiary) captures, as was the case above.

The substitution function takes the Match corresponding to the current redex being processed, the RELocation information specifying redex n redex and capure i, and the Capure being substituted. Our substitution function didn’t need the Match context so it ignored it.

The substition function either return Nothing to indicate that no substitution should be made or the replacement text.

The above fixup function could be extended to enclose whole date in square brackets by specifing an ALL context and a 0 case for the substitution function.
fixup_and_reformat_dates :: String -> String
fixup_and_reformat_dates src =
    replaceAllCaptures ALL f $ src *=~ [re|([0-9]+)-([0-9]+)-([0-9]+)|]
  where
    f _ loc cap = Just $ case locationCapture loc of
        0 -> "["++txt++"]"
        1 -> fmt 4 $ read txt
        2 -> fmt 2 $ read txt
        3 -> fmt 2 $ read txt
        _ -> error "fixup_date"
      where
        txt = capturedText cap
The fixup_and_reformat_dates applied to our running example,
ghci> fixup_and_reformat_dates "2016-01-09 2015-12-5 2015-10-05"
"[2016-01-09] [2015-12-05] [2015-10-05]"

Text.RE.Replace provides analagous functions for replacing the test of a single Match returned from ?=~.