Combinating Parser Combinators

Introduction

We have everything we need to start build all the parsers we could possibly want. This is where is gets exciting so I wont spend any time giving an introduction today and we can just jump straight into it. The things covered in this chapter were mostly led by looking at the Parsec Hackage docsand choosing the parsers that I thought were important.

For motivation, I thought it would be fun if we tried to construct parsers for an actual grammar. I decided to go with a grammar that defines a calculator. In a sort of adapted EBNF, here it is

  SimpleExp := Num | '(' Exp ')'
  Exp       := SimpleExp | UnOpExpr | BinOpExpr | FuncApp
  UnOpExpr  := '-' SimpleExp
             | 'sin' SimpleExp 
             | 'cos' SimpleExp 
             | 'tan' SimpleExp
  BinOpExpr := Exp + Exp  -- lowest precedence
             | Exp - Exp 
             | Exp * Exp 
             | Exp / Exp  
             | Exp ^ Exp  -- highest precedence
  FuncDef   := Ident '(' [Var {', ' Var}] ')' '=' Exp
  Var       := Ident
  FuncApp   := Ident '(' [Exp {, Exp}] ')'
  Ident     := String

So a few examples of things we would be able to write are:

  f(x, y) = x + y 
  5 + Sin (5 * 3) + 3 * 7
  g(x) = x
  1 + (2 + 3) + g(4) + f(2, 3)

The final thing we need is a the data types to encode this structure in Haskell

  data Expr    = Num Float 
               | Var String
               | UnOpExpr UnOp Expr 
               | BinOpExpr BinOp Expr Expr 
               | FuncApp String [Expr]
  data UnOp    = Negate | Sin | Cos | Tan
  data BinOp   = Add    | Sub | Mul | Div | Exp
  data FuncDef = FuncDef String [String] Expr

Now we are ready to start, I’m going to try and explain the theory behind these parsers, but I will try and motivate it by specific examples in our calculator language.

satisfy

To begin, we will implement what could arguably be described as one of the most fundemental parsers. It is satisfy :: (Char -> Bool) -> Parser Char which takes a function and successfully parsers the first character of the input only if the function returns True when applied to that function. From the first chapter, we should already be able to do this.

  satisfy f = Parser $ \inp -> 
    case inp of 
      c:cs -> if f c then Just (c, cs) else Nothing 
      _    -> Nothing

If your memory is good, you might realise that this looks very similar to pChar , and in fact we can now redefine it!

pChar (again)

Now that we have satisfy , it means that pChar is trivial. This is the fundemental idea of parser combinators: we gradually add complexity to small parsers to create incredibly complex parsers.

  pChar :: Char -> Parser Char
  pChar c = satisfy (c==)

pString (again)

This one is going to be a bit more complex, but I want to use it as an oppotunity to show off all of the work that we did last chapter. We’re going to follow the same sort of structure as before, where we parsed each character of the string sequentially. So let’s consider the base case pString "" , this is just a parser that parses no input and returns "" . Fortunately though, that sounds suspiciously like pure "" , so we’re done with the base case!

Next, we need to work out what pString (c:cs) should do. The idea now is that we’re going to create a parser for c and then combine that with the parser for cs . I think for this one it’s best if I write it out first and then explain afterwards, so

  pString :: String -> Parser String 
  pString ""     = pure ""
  pString (c:cs) = (:) <$> pChar c <*> pString cs

The base case should already be clear. For the recursive case, let’s assume that pString cs does exactly what we expect, which is return a parser that parses exactly the string cs , then we also have pChar c . All we need now is some way to combine the two parsers… The logic here is that if we try and parse some input string “combinator” to pString "combin" , then pChar "c" will return Just ('c', "ombinator") and then pString "ombin" will return Just ("ombin", "ator") . So combining these two parsers should correspond to appending “c” to “ombin”, ie. (:) .

Still it may not be clear, so first let’s consider what (:) <$> pChar c does. By our definition of fmap , we apply the function to the parsed result of the parser, so in this case, we will now have the result Just ('c' :, "ombinator") . By looking at the types of each of these functions, make sure your confident with the fact that

  (:) <$> pChar c :: Parser (String -> String)
  pString cs      :: Parser String

Now, it should be quite easy to see just from the types that we need to use <*> . This is a design pattern that you will get used to. If you have a function of two parameters that you need to apply to the values contained inside two applicatives, then this pattern of f <$> app1 <*> app2 is generally what you’d want to use - we’re going to use this a lot more so we will have plenty more practise. I think this parser is much more elegant than our original one, and we didn’t have to write any low level logic, all we have to do now is consider what our parsers should parser and how we have to combine the results.

Sidenote: `<:>`

A small sidenote is that the (:) <$> px <*> py pattern can be made slightly more elegant, we do this by defining a new function.

  (<:>) :: Parser a -> Parser [a] -> Parser a
  px <:> py = (:) <$> px <*> py

But you should be able to see that this is not a function that is specific to parsers, and we should always aim to make function types as general as possible, all we really need here is an applicative function, so we can rewrite this as

  (<:>) :: Applicative f => f a -> f [a] -> f[a]
  px <:> py = (:) <$> px <*> py

Hence, the inductive case of our pString parser becomes

  pString (c:cs) = pChar c <:> pString cs

It’s cleaner than what we had before, and its an additional tool that we can use in the future - what we’ve done is called “lifting” - we’ve lifted (:) to an applicative context.

pSpan

Another really useful small parser that we will need is pSpan :: (Char -> Bool) -> Parser String It works in a similar way to the normal span :: (a -> Bool) -> [a] -> ([a], [a]) . It will continue to parse a string as long as the predicate is satisfied, for example

  runParser (pSpan isDigit) "123abc45" == Just ("123", "abc45")
  runParser (pSpan isSpace) "    abc"  == Just ("    ", "abc")
  runParser (pSpan isSpace) "abc"      == Nothing

You can see that what this parser does is repeatedly check if the predicate is satisfied after parsing each character, which is like repeatedly applying a satisfy parser until it fails. Finally we get to use our Alternative instance that we defined previously, one of the functions the Alternative class provides is some , which attemps to run the Alternative at least once (which in this case corresponds to successfully parsing a character), so we get

  pSpan :: (Char -> Bool) -> Parser String 
  pSpan f = some (satisfy f)
  -- or pSpan = some . satisfy

pVar

Finally we are ready to parse are first part of the grammar! We are going to parse strings, but only strings that contain alphanumeric characters and underscores - however, a variable name cannot start with a number. So the way we can think about this is first we have to parse a letter or an underscore, then we can parse any alphanumeric character or an underscore.

First let’s construct our small parsers

  pUnderscore :: Parser Char 
  pUnderscore = pChar '_'

  pAlpha :: Parser Char 
  pAlpha = satisfy isAlpha

  pAlphanumeric :: Parser Char 
  pAlphanumeric = satisfy isAlphaNum

First let’s think how to parse any alphanumeric character or an underscore. It turns out that the answer is exactly in the question, using <|> from Alternative, we can very easily express this as

  pAlphanumeric <|> pUnderscore :: Parser Char

After our first character, there may be more than one alphanumeric/”_” (there may also be zero), so to allow for parsing some arbitrary number of characters, we can use many , which is related to some from before, except it also succeeds if the supplied parser succeeds 0 times.

  many (pAlphanumeric <|> pUnderscore) :: Parser [Char]

However, we need to guarantee that the first character is a alphanumeric character, and then we just combine that with the rest of the parsed string, looking at the types of pAlpha and many (pAlphanumeric <|> pUnderscore) , we can see that we simply need to cons the result of the first parser to the second.

The final step is to wrap this all in a Var . The type constructor is a function, so we can simply fmap it into our result to turn Parser String into Parser Expr

  pVar :: Parser Expr 
  pVar = Var <$> pAlpha <:> many (pAlphanumeric <|> pUnderscore)

So finally, we can parse our first part of the grammar, let’s try it !

  > runParser pVar "num1 = 5"
  Just (Var "num1"," = 5")
  > runParser pVar "num1_add_5 + 5"
  Just (Var "num1_add_5"," + 5")
  > runParser pVar "1num"
  Nothing

This is our first big success ! A lot of the other parsers will follow a similar process, we essential build up the grammar from small parsers and some intelligent combining.

pNum

This has been quite a long tutorial, so I think we will do one more parser thats similar to pVar to finish off. So what is a float, well we can think of it as an integer that might have a decimal point and some numbers after the decimal point. Let’s start with the easiest bit and do the integer part. What we want to do is just parse a contiguous sequence of numbers, which is exactly the job of pSpan , we do this with

  span isDigit

Let’s see how it works:

  > runParser (pSpan isDigit) "12345" 
  Just ("12345","")
  > runParser (pSpan isDigit) "12345.6789"
  Just ("12345",".6789")

So it works for integers, but as soon as the number has a decimal place, it doesnt do what we want. So let’s construct a parser for the non-integer part. It will start with a ’.’ and then be followed by some digits, I’ll leave this one as an exercise, we will call it pDecimalPart for lack of a better name.

So now we a have a parser for the integer part, and a parser for the decimal part, we need some way to combine them. What we’re really looking to do is concatenate the results of both parsers. I think the best way to do this is to lift ++ to an applicative context, but it’s almost exactly the same as what we did with cons, so I’ll leave this as an easy exercise.

What we end up with is,

  pSpan isDigit <++> pDecimalPart :: Parser String

However this is not exactly correct, we have lost the ability to parse numbers without a decimal. To fix this, we have say that if there is no decimal, we just return an empty string, the parser pure "" captures the idea of the empty string, we consume zero input and return "" . So we can write

  pSpan isDigit <++> (pDecimalPart <|> pure [])

The final part is to wrap this in our Num constructor, but we have a small problem, Num :: Float -> Expr , but we currently have a string, this is no problem, though we simply do

  read <$> pSpan isDigit <++> (pDecimalPart <|> pure [])

Now we can do

  pNum :: Parser Expr 
  pNum = Num <$> read <$> pSpan isDigit <++> (pDecimalPart <|> pure [])

or slightly better,

  pNum = Num . read <$> pSpan isDigit <++> (pDecimalPart <|> pure [])

We can test this and see that (if you implemented pDecimalPart correctly) that we have another successful parser!

sepBy1

To motivate this a bit, let’s think about the expression g(1, 5 + 4, x * 3) . For this, we need to repeatedly parse expressions seperated by commas. The concept of parsing a specific piece of the grammar seperated by another piece of the grammar is a very common scenerio, so let’s abstract it.

sepBy1 :: Parser a -> Parser b -> Parser [a] is the parser that parses tokens from px seperated by tokens from py - note that it assumes that there is at least one token from px - eg

  > runParser (sepBy1 pInt pComma) "1,23,453" 
  [1, 23, 453] 
  > runParser (sepBy1 pInt pComma) "1" 
  [1]
  > runParser (sepBy1 pInt pComma) ""
  Nothing

So how should we do this ? Well, generally if we have a grammar like this in EBNF form, it would be

  Int {',' Int}

where the curly brackets represent zero or more repititions. We can use this to motivate how we construct the parser

  sepBy1 :: Parser a -> Parser b -> Parser [a]
  sepBy1 px py = px <:> many (py *> px)

I’ve introduced a new symbol, *> :: Applicative f => f a -> f b -> f b , what this essentially does is sequentially peforms py then px , but throws away the result from the first and just returns the result from the second. So using the example of sepBy1 pInt pComma , we parse at least one integer, and then we try to repeatedly parse: a comma, followed by an integer. In the first example, we could group it like

  1    ,23    ,543

Hopefully from this it’s more obivous why sepBy1 works.

sepBy

The problem with using this for our function arguments is that we could have a function with zero arguments. This is precisly sepBy , it’s very easy to define.

  sepBy :: Parser a -> Parser b -> Parser [a]
  sepBy px py = sepBy1 px py <|> pure []

This is basically the same logic as sepBy1 , but if it fails to parse at least one token for px , it just returns an empty list (or more pedantically, it returns the parser that consumes no input and returns the empty list).

            > runParser (sepBy1 pInt pComma) "1,23,453" 
            [1, 23, 453] 
            > runParser (sepBy1 pInt pComma) "1" 
            [1]
            > runParser (sepBy1 pInt pComma) ""
            []

Compare this with the results from sepBy1 . This is also the logic behind how some and many are defined in the applicative class.

pFunApp

Now we can easily parse our function applications. Let’s assume that the rules for a function name follow the rules of the variable names. For this reason, let’s abstract out that part of the logic from pVar , so we get

  pIdent :: Parser String 
  pIdent = pAlpha <:> many (pAlphanumeric <|> pUnderscore)
  
  pVar :: Parser Expr 
  pVar = Var <$> pIdent

So first we parse the function name, which is exactly pIndent , then we expect a series of expressions separated by commas surrounded by brackets. Let’s think about this in pieces. First, let’s think about the comma seperated expressions. One’s first instinct might be to write

  sepBy pExpr (pChar ',')

But the problem is that this will fail on 1, 2, 3 because of the whitespace. So what we really want to do is parse whitespace where necessary between the commas. We can create a really simple whitespace parser \ ws , but I’ll leave this as an exercise. So we can do

  sepBy pExpr (ws *> pChar ',' <* ws)

I’ve introduced another new operator <* , this is like *> - it sequentially executes the two actions, but it will throw away the second one instead of the first. So now, "1, 2, 3" will successfully parse as we discard the whitespace (note however that " 1, 2, 3 " doesn’t, but we will fix that soon.) and now we can surround it with brackets using

  '(' *> sepBy pExpr (ws *> pChar ',' <* ws) <* ')'

So we can now parse "(1, 2, 3)" , but the whitespace strikes again because "( 1, 2, 3)" will fail, so we fix this with

  '(' *> ws *> sepBy pExpr (ws *> pChar ',' <* ws) <* ws <* ')'

Now, we are ready to parse the entire function call

  pFuncApp :: Parser Expr 
  pFuncApp = FuncApp <$> pIdent <*> (pChar '(') *> ws *> args <* ws <* ws <* (pChar ')')
    where 
      args = sepBy pExpr (ws *> pChar ',' <* ws)

Hopefully its clear how we’ve combined all of these, as a hint FuncApp :: String -> [Expr] -> Expr , so FuncApp <$> pIdent :: Parser ([Expr] -> Expr) which should make it a bit easier to understand. This isnt the most attractive function ever, but we will introduce another parser which will improve this.

Let’s try it out a bit - we cant actually try it as it is, because we haven’t defined pExpr yet, but I’ll replace that with pNum for testing:

  > runParser pFuncApp "f(1, 2, 3)"
  Just (FuncApp "f" [Num 1.0,Num 2.0,Num 3.0],"")
  > runParser pFuncApp "f()"
  Just (FuncApp "f" [],"")

pFuncDef

Fortunately, this one shares a lot of similarities with pFuncApp . We already have the mechanisms to do this. So first we want to parse an identifier, but we already have pIdent , then a list of identifiers surrounded by brackets and seperated by commas - but thats extremely similar to what we did in the last parser, and then we expect to parse an equals sign (that may be surrounded by whitespace), but this is not needed to construct our Haskell data type, so we throw away the result, and finally, we parse an expression on the right hand side of the equals sign.

If we apply this sequential reasoning, and then chain it together with FuncDef :: String -> [String] -> Expr -> Expr , it should be quite intuitive

  pFuncDef :: Parser FuncDef
  pFuncDef = FuncDef 
               <$> pIdent               
               <*> betweenBrackets args 
               <*  ws 
               <*  pChar '='
               <*  ws 
               <*> pExpr 
  where 
    args = sepBy pIdent (ws *> pChar ',' <* ws)

Hopefully at this point you’re starting to get the hang of how to apply functions in an applicative context. As you can see, this function was actually not too difficult to implement, the work in building the smaller parsers and combining them together.

Finishing notes

I think we’ve covered quite a lot in this, in the next tutorial, we will finish our grammar and introduce slightly more complex parsers. Feel free to let me know how you found this tutorial, I am always looking to improve the quality of this website.

As a little summary, here are the important functions that we’ve covered so far that are useful when parsing an language. In the next tutorial there will be many more.

        satisfy :: (Char -> Bool) -> Parser Char
        pSpan   :: (Char -> Bool) -> Parser String
        (<:>)   :: Applicative f => f a -> f [a] -> f [a]
        sepBy1  :: Parser a -> Parser b -> Parser [a]
        sepBy   :: Parser a -> Parser b -> Parser [a]
        between :: Parser a -> Parser b -> Parser c -> Parser c

Exercises

Define satisfy using >>=
Using satisfy , define pAny :: Parser Char , which successfully parses any character, but fails if there is no input.
Write a parser to parse a ”.” followed by some digits. It should work on “.123”, “.01”, but fail of ”.” and “.123a”
Write a function <++> :: Applicate f => f [a] -> f [a] -> f [a] that lifts concatentation to an applicative context.