re-tutorial.lhs

The regex Tutorial

This tutorial is a self-testing literate Haskell programme introducing the vanilla API of the regex package. There are other tutorials explaining the more specialist aspects of regex, and you can load them into your Haskell REPL of choice: see the regex Tutorials page for details.

Language Pragmas

The first thing you will have to do is enable QuasiQuotes as regex uses them to check that REs are well-formed at compile time.

{-# LANGUAGE QuasiQuotes                      #-}

If you are trying out examples interactively at the ghci prompt then you will need

:seti -XQuasiQuotes

Importing the API

Before importing the regex API into your Haskell script you will need to answer two questions:

Which flavour of REs do I need? If you need Posix REs then the TDFA back end is for you; otherwise it is the PCRE back end, which is housed in a separate regex-with-pcre package.
Which Haskell type is being used for the text I need to match? This can influence the choice of back end as, at the time of writing, the PCRE regex back end does not support the Text types.

The import statement will in general look like this

  import Text.RE.<back-end>.<text-type>

As we have no interest in Posix/PCRE distinctions or performance here, we have chosen to work with the TDFA back end with String types.

import Text.RE.TDFA.String

You could also import Text.RE.TDFA or Text.RE.PCRE to get an API in which the operators are overloaded over all text types accepted by each of these back ends: see the Tools Tutorial for details.

Single `Match` with `?=~`

The regex API provides two matching operators: one for looking for the first match in its search string and the other for finding all of the matches. The first-match operator, ?=~, yields the result of attempting to find the first match.

(?=~) :: String -> RE -> Match String

The boolean matched function,

matched :: Match a -> Bool

can be used to test whether a match was found:

ghci> matched $ "2016-01-09 2015-12-5 2015-10-05" ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
True

To get the matched text use matchedText,

matchedText :: Match a -> Maybe a

which returns Nothing if no match was found in the search string:

ghci> matchedText $ "2016-01-09 2015-12-5 2015-10-05" ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
Just "2016-01-09"

ghci> matchedText $ "2015-12-5" ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
Nothing
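
Outside the REPL the same two functions can be wrapped in an ordinary definition. The following is a minimal sketch, assuming only the Text.RE.TDFA.String import shown earlier; the helper name firstDate and the "<no date>" placeholder are hypothetical and not part of regex.

```haskell
{-# LANGUAGE QuasiQuotes #-}
module Main (main) where

import Data.Maybe (fromMaybe)
import Text.RE.TDFA.String

-- | Hypothetical helper: the first yyyy-mm-dd date in the input, if any.
firstDate :: String -> Maybe String
firstDate s = matchedText $ s ?=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]

main :: IO ()
main = do
  -- A date is present, so the matched text is printed.
  putStrLn $ fromMaybe "<no date>" $ firstDate "2016-01-09 2015-12-5 2015-10-05"
  -- No well-formed date here, so the placeholder is printed instead.
  putStrLn $ fromMaybe "<no date>" $ firstDate "2015-12-5"
```

Returning Maybe String leaves it to the caller to decide what a missing match means, which is why the placeholder is applied only in main.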

Multiple `Matches` with `*=~`

Use *=~ to locate all of the non-overlapping substrings that match a RE,

(*=~)      :: String -> RE -> Matches String
anyMatches :: Matches a -> Bool

anyMatches can be used to determine if any matches were found

ghci> anyMatches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
True

and countMatches will tell us how many sub-strings matched:

ghci> countMatches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
2

matches will return all of the matches.

matches :: Matches a -> [a]

ghci> matches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|[0-9]{4}-[0-9]{2}-[0-9]{2}|]
["2016-01-09","2015-10-05"]

The `regex` Macros and Parsers

regex supports macros in regular expressions. There are a bunch of standard macros that you can just use, and you can define your own.

RE macros are enclosed in ‘@{’ … ‘}’. By convention the macros in the standard environment start with a ‘%’. @{%date} will match an ISO 8601 date, so this

ghci> countMatches $ "2016-01-09 2015-12-5 2015-10-05" *=~ [re|@{%date}|]
2

will pick out the two dates.

There are also parsing functions for analysing the matched text. The @{%string} macro will match quoted strings (in which double quotes can be escaped with backslashes in the usual way) and its companion parseString function will extract the string that was being quoted, interpreting any escaped double quotes:

ghci> map parseString $ matches $ "\"foo\", \"bar\" and a quote \"\\\"\"" *=~ [re|@{%string}|]
[Just "foo",Just "bar",Just "\""]

See the macro tables page for details of the standard macros and their parsers.

See the testbench tutorial for more on how you can develop, document and test RE macros with the regex test bench.

Search and Replace

If you need to edit a string then SearchReplace [ed| … |] templates can be used with ?=~/ to replace a single instance or *=~/ to replace all matching instances.

ghci> "0000 40AA fab0" ?=~/ [ed|${adr}([0-9A-Fa-f]{4}):?///0x${adr}:|]
"0x0000: 40AA fab0"

ghci> "0000: 40AA fab0" *=~/ [ed|[0-9A-Fa-f]{4}///0x$0|]
"0x0000: 0x40AA 0xfab0"

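Named captures can also be reordered in the replacement template. The sketch below assumes the ${name}(…) capture syntax used in the first example above; the helper name europeanDates and the capture names y, m and d are arbitrary choices made for illustration.

```haskell
{-# LANGUAGE QuasiQuotes #-}
module Main (main) where

import Text.RE.TDFA.String

-- | Hypothetical helper: rewrite every yyyy-mm-dd date as dd/mm/yyyy.
-- The capture names y, m and d are labels chosen for this sketch.
europeanDates :: String -> String
europeanDates s =
  s *=~/ [ed|${y}([0-9]{4})-${m}([0-9]{2})-${d}([0-9]{2})///${d}/${m}/${y}|]

main :: IO ()
main = putStrLn $ europeanDates "2016-01-09 2015-12-5 2015-10-05"
```

The malformed middle date does not match the RE, so it is expected to pass through unchanged while the other two are rewritten.
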
Specifying Options

By default regular expressions are of the multi-line case-sensitive variety so this

ghci> countMatches $ "0a\nbb\nFe\nA5" *=~ [re|[0-9a-f]{2}$|]
2

will find 2 matches: the ‘$’ anchor matches at each of the newlines, but only the first two (lowercase) hex numbers match the RE. The case sensitivity and multiline-ness can be controlled by selecting alternative parsers.

long name               short forms  multiline  case sensitive
reMultilineSensitive    reMS, re     yes        yes
reMultilineInsensitive  reMI         yes        no
reBlockSensitive        reBS         no         yes
reBlockInsensitive      reBI         no         no

So while the default setup

ghci> countMatches $ "0a\nbb\nFe\nA5" *=~ [reMultilineSensitive|[0-9a-f]{2}$|]
2

finds 2 matches, a case-insensitive RE

ghci> countMatches $ "0a\nbb\nFe\nA5" *=~ [reMultilineInsensitive|[0-9a-f]{2}$|]
4

finds 4 matches, while a non-multiline RE

ghci> countMatches $ "0a\nbb\nFe\nA5" *=~ [reBlockSensitive|[0-9a-f]{2}$|]
0

finds no matches but a non-multiline, case-insensitive match

ghci> countMatches $ "0a\nbb\nFe\nA5" *=~ [reBlockInsensitive|[0-9a-f]{2}$|]
1

finds the final match.

For the hard of typing the short forms are available.

ghci> matched $ "SuperCaliFragilisticExpialidocious" ?=~ [reMI|supercalifragilisticexpialidocious|]
True

Compiling and Escaping

It is possible to compile a dynamically acquired RE string at run-time using compileRegex:

compileRegex :: (Functor m, Monad m) => String -> m RE

ghci> matches $ "2016-01-09 2015-12-5 2015-10-05" *=~ (maybe (error "evalme_CPL_01") id $ compileRegex "[0-9]{4}-[0-9]{2}-[0-9]{2}")
["2016-01-09","2015-10-05"]
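
Rather than calling error, a script will often want to handle a failed compilation. The following is a minimal sketch that instantiates the monad in the compileRegex signature above to Maybe, as the ghci example does; the helper name matchesOf is hypothetical.

```haskell
module Main (main) where

import Text.RE.TDFA.String

-- | Hypothetical helper: all matches of a run-time RE string in the text,
-- or Nothing if the RE string could not be compiled.
matchesOf :: String -> String -> Maybe [String]
matchesOf pat txt = do
  rex <- compileRegex pat          -- compiled in the Maybe monad
  return $ matches $ txt *=~ rex

main :: IO ()
main = do
  -- A well-formed RE: the matches are returned inside Just.
  print $ matchesOf "[0-9]{4}-[0-9]{2}-[0-9]{2}" "2016-01-09 2015-12-5 2015-10-05"
  -- An unterminated bracket expression: compilation should fail.
  print $ matchesOf "[0-9" "2016-01-09"
```

Keeping the failure in Maybe lets the caller report a bad pattern to the user instead of crashing.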

compileRegex compiles the RE using the default multiline, case-sensitive options, but you can specify the options dynamically using compileRegexWith:

compileRegexWith :: (Functor m, Monad m) => SimpleREOptions -> String -> m RE

where SimpleREOptions is a simple enumerated type.

-- | the default API uses these simple, universal RE options,
-- which get auto-converted into the appropriate back-end 'REOptions_'
data SimpleREOptions
  = MultilineSensitive        -- ^ case-sensitive with ^ and $ matching the start and end of a line
  | MultilineInsensitive      -- ^ case-insensitive with ^ and $ matching the start and end of a line
  | BlockSensitive            -- ^ case-sensitive with ^ and $ matching the start and end of the input text
  | BlockInsensitive          -- ^ case-insensitive with ^ and $ matching the start and end of the input text
  deriving (Bounded,Enum,Eq,Ord,Show)

ghci> matches $ "2016-01-09 2015-12-5 2015-10-05" *=~ (maybe (error "evalme_CPL_01") id $ compileRegexWith MultilineSensitive "[0-9]{4}-[0-9]{2}-[0-9]{2}")
["2016-01-09","2015-10-05"]
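
Because the options are an ordinary value, they can be chosen at run time. The sketch below assumes that the SimpleREOptions constructors are in scope from Text.RE.TDFA.String, as the ghci example above suggests; optionsFor and countHexLines are hypothetical helpers, and the test string is the one from the Specifying Options section.

```haskell
module Main (main) where

import Text.RE.TDFA.String

-- | Hypothetical helper: choose the RE options from a run-time flag.
optionsFor :: Bool -> SimpleREOptions
optionsFor ignoreCase
  | ignoreCase = MultilineInsensitive
  | otherwise  = MultilineSensitive

-- | Hypothetical helper: count the lines ending in two hex digits,
-- or Nothing if the RE could not be compiled.
countHexLines :: Bool -> String -> Maybe Int
countHexLines ignoreCase txt = do
  rex <- compileRegexWith (optionsFor ignoreCase) "[0-9a-f]{2}$"
  return $ countMatches $ txt *=~ rex

main :: IO ()
main = do
  print $ countHexLines False "0a\nbb\nFe\nA5"   -- case-sensitive
  print $ countHexLines True  "0a\nbb\nFe\nA5"   -- case-insensitive
```

The two calls should mirror the reMultilineSensitive and reMultilineInsensitive quasi quoter results shown earlier.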

If you need to compile SearchReplace templates for use with ?=~/ and *=~/ then compileSearchReplace and compileSearchReplaceWith,

compileSearchReplace     :: (Monad m, Functor m, IsRegex RE s) => String -> String -> m (SearchReplace RE s)
compileSearchReplaceWith :: (Monad m, Functor m, IsRegex RE s) => SimpleREOptions -> String -> String -> m (SearchReplace RE s)

work analogously to compileRegex and compileRegexWith, with the RE and replacement template (either side of the ‘///’ in the [ed|...///...|] quasi quoters) being passed into these functions in two separate strings, to compile to the SearchReplace type expected by the ?=~/ and *=~/ operators.

-- | contains a compiled RE and replacement template
data SearchReplace re s =
  SearchReplace
    { getSearch   :: !re    -- ^ the RE to match a string to replace
    , getTemplate :: !s     -- ^ the replacement template with ${cap}
                            -- used to identify a capture (by number or
                            -- name if one was given) and '$$' being
                            -- used to escape a single '$'
    }
  deriving (Show)
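
As a minimal sketch of the run-time route, the following compiles the RE and template from the second [ed| … |] example in the Search and Replace section as two separate strings. The helper name prefixHex is hypothetical, and the monad is assumed to be Maybe as in the compileRegex examples.

```haskell
module Main (main) where

import Text.RE.TDFA.String

-- | Hypothetical helper: prefix every four-digit hex number with "0x",
-- or return Nothing if the RE or the template could not be compiled.
prefixHex :: String -> Maybe String
prefixHex txt = do
  -- first argument: the RE; second argument: the replacement template
  sr <- compileSearchReplace "[0-9A-Fa-f]{4}" "0x$0"
  return $ txt *=~/ sr

main :: IO ()
main = print $ prefixHex "0000: 40AA fab0"
```

The result should agree with the quasi-quoted version, wrapped in Just.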

The escape and escapeWith functions are special compilers that compile a string into a RE that should match itself, which is assumed to be embedded in a complex RE to be compiled.

escape :: (Functor m, Monad m) => (String->String) -> String -> m RE

The function passed in the first argument to escape takes the RE string that will match the string passed in the second argument and yields the RE to be compiled, which is returned from the parsing action.

ghci> "fooe{0}bar" *=~/ SearchReplace (maybe (error "evalme_CPL_03") id $ escape id "e{0}") ""
"foobar"

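A common use of escape is searching for text supplied by a user, where RE metacharacters must be taken literally. The sketch below is one way this might look, again assuming the Maybe monad; countLiteral is a hypothetical helper.

```haskell
module Main (main) where

import Text.RE.TDFA.String

-- | Hypothetical helper: count the literal occurrences of a string,
-- treating any RE metacharacters in it (such as '.') as plain text.
countLiteral :: String -> String -> Maybe Int
countLiteral lit txt = do
  rex <- escape id lit             -- id: use the escaped RE as it stands
  return $ countMatches $ txt *=~ rex

main :: IO ()
main = print $ countLiteral "a.b" "a.b axb a.b"
```

Here "axb" is not counted because the escaped dot matches only a literal dot, so two occurrences are expected.
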
The Classic regex-base Match Operators

The original =~ and =~~ match operators are still available for those that have mastered them.

ghci> "bar"    =~  [re|(foo|bar)|] :: Bool
True

ghci> "quux"   =~  [re|(foo|bar)|] :: Bool
False

ghci> "foobar" =~  [re|(foo|bar)|] :: Int
2

ghci> "foo"    =~~ [re|bar|]       :: Maybe String
Nothing