The regex Options Tutorial

Language Options and Imports

This tutorial is a literate Haskell program whwre we start by specifying the language pragmas and imports we will need for this module.

{-# LANGUAGE QuasiQuotes                      #-}

For this module we will work with the PCRE nativeoptions which are based on bit masks, so Data.Bits will be needed.

Text.RE.REOptions provides the generic regex types and functions for handling options, regardless of the selected back end.

Note that we import the PCRE String APi Text.RE.PCRE.String and the general regex PCRE back end, Text.RE.PCRE, needed for the the types and functions is supplies for accessing the PCRE native options. We could have just imported Text.RE.PCRE but it is useful to see which extra types and functions being used from this module (they will be qualified with PCRE.).

We also import Text.Regex.PCRE from the regex-pcre package for the native pcre-regex types and functions themselves.

import           Data.Bits
import qualified Text.RE.PCRE                 as PCRE
import           Text.RE.PCRE.String
import           Text.RE.REOptions
import           Text.Regex.PCRE

Compiling REs with the Complete REOptions

Each regex-tools back end — TDFA and PCRE — has it own complile-time options and execution-time options, called in each case CompOption and ExecOption. The SimpleREOptions selected with the RE parser, e.g.,
ghci> countMatches $ "0a\nbb\nFe\nA5" *=~ [reBlockInsensitive|[0-9a-f]{2}$|]
configures the RE back end accordingly so that you don't have to, but you may need full access to your chosen back end's options, in which case you will need to know about the REOptions type, defined by each of the back ends in terms of the REOptions_ type of Text.RE.REOptions as follows.
type REOptions = REOptions_ RE CompOption ExecOption

(Bear in mind that RE, CompOption and ExecOption are defined differently in the TDFA and PCRE back ends.)

The REOptions_ type is defined in Text.RE.REOptions as follows:

-- | the general options for an RE are dependent on which back end is
-- being used and are parameterised over the @RE@ type for the back end,
-- and its @CompOption@ and @ExecOption@ types (the compile-time and
-- execution time options, respectively); each back end will define an
-- @REOptions@ type that fills out these three type parameters with the
-- apropriate types (see, for example, "Text.RE.TDFA")
data REOptions_ r c e =
    { optionsMacs :: !(Macros r)    -- ^ the available TestBench RE macros
    , optionsComp :: !c             -- ^ the back end compile-time options
    , optionsExec :: !e             -- ^ the back end execution-time options
  deriving (Show)

(For more information on the options provided by the back ends see the decumentation for the regex-tdfa and regex-pcre packages as apropriate.)

Each back end provides a function to compile REs from some options and a string containing the RE as follows:
compileRegexWithOptions :: (IsOption o, Functor m, Monad   m)
                        => o
                        -> String
                        -> m RE

where o is one of the following RE-configuring types:

The compilation takes place in a monad to allow for failure. In the following examples we will use this helper, which will extract the compiled RE using error to deal with any failures.
check :: Either String a -> a
check = either error id
ghci> countMatches $ "0a\nbb\nFe\nA5" *=~ check (compileRegexWith BlockInsensitive "[0-9a-f]{2}$")

This will allow you to compile regular expressions when the either the text to be compiled or the options have been dynamically determined.

If you need to build SearchReplace templates then there is an analagous compilation function for that:

compileSearchReplaceWithOptions :: (Monad m,Functor m,IsRegex RE s)
                                => REOptions
                                -> String
                                -> String
                                -> m (SearchReplace RE s)

Specifying REOptions with re_ and ed_

If you just need to specify some non-standard options with a static RE, you can use the re_ quasi quoter, which yields a function takes an option type and returns the RE compiled with the given options:
ghci> countMatches $ "0a\nbb\nFe\nA5" *=~ [re_|[0-9a-f]{2}$|] BlockInsensitive

Any option o such that IsOption o RE CompOption ExecOption (i.e., any option type accepted by compileRegex above) can be used with [re_ ... |].

The [ed_ ... /// ... |] for compiling SearchReplace templates works analagously, yielding a function that takes an option type and returns the SearchReplace template comoiled with those RE options.

Configuring Native (PCRE) Options

The function unpackSimpleREOptions, used to generate PCRE native options from the generic SimpleREOptions is defined like this. (We have made some minor organizational changes for this presentaion, but this is equivalent to the library code used for PCRE.unpackSimpleREOptions.)

unpackSimpleREOptions :: SimpleREOptions -> PCRE.REOptions
unpackSimpleREOptions sro =
    { optionsMacs = PCRE.prelude      -- the standard 'prelude' macro environment
    , optionsComp = comp              -- our calculated PCRE compile options
    , optionsExec = defaultExecOpt    -- the default PCRE run-time options
    comp =
      wiggle ml compMultiline $
      wiggle ci compCaseless

    (ml,ci) = case sro of
        MultilineSensitive    -> (,) True  False
        MultilineInsensitive  -> (,) True  True
        BlockSensitive        -> (,) False False
        BlockInsensitive      -> (,) False True

-- set or clear a PCRE option bit according to the
-- Bool in its first argument using the bit mask
-- passed in the second argument
wiggle :: Bits a => Bool -> a -> a -> a
wiggle True  m v = v .|.            m
wiggle False m v = v .&. complement m

Now we will set up a apecial set of PCRE options based on BlockInsensitive, but with the PCRE DOTALL option bit set.

myOptions :: PCRE.REOptions
myOptions =
    { optionsComp = wiggle True compDotAll $ optionsComp biOptions

biOptions :: PCRE.REOptions
biOptions = unpackSimpleREOptions BlockInsensitive

Now we can test myOptions with the [re_| ... |] quasi quoter as follows.

ghci> matched $ "0a\nbbxFe&A5 " ?=~ [re_|^([0-9a-f]{2}.){4}$|] myOptions

That test matched, but if we provide just BlockInsensitive options set up in biOptions above,

ghci> matched $ "0a\nbbxFe&A5 " ?=~ [re_|^([0-9a-f]{2}.){4}$|] biOptions

the match fails.