Combinating Parser Combinators
Introduction
We have everything we need to start build all the parsers we could possibly want. This is where is gets exciting so I wont spend any time giving an introduction today and we can just jump straight into it. The things covered in this chapter were mostly led by looking at the Parsec Hackage docsand choosing the parsers that I thought were important.
For motivation, I thought it would be fun if we tried to construct parsers for an actual grammar. I decided to go with a grammar that defines a calculator. In a sort of adapted EBNF, here it is
SimpleExp := Num | '(' Exp ')'
Exp := SimpleExp | UnOpExpr | BinOpExpr | FuncApp
UnOpExpr := '-' SimpleExp
| 'sin' SimpleExp
| 'cos' SimpleExp
| 'tan' SimpleExp
BinOpExpr := Exp + Exp -- lowest precedence
| Exp - Exp
| Exp * Exp
| Exp / Exp
| Exp ^ Exp -- highest precedence
FuncDef := Ident '(' [Var {', ' Var}] ')' '=' Exp
Var := Ident
FuncApp := Ident '(' [Exp {, Exp}] ')'
Ident := String
So a few examples of things we would be able to write are:
f(x, y) = x + y
5 + Sin (5 * 3) + 3 * 7
g(x) = x
1 + (2 + 3) + g(4) + f(2, 3)
The final thing we need is a the data types to encode this structure in Haskell
data Expr = Num Float
| Var String
| UnOpExpr UnOp Expr
| BinOpExpr BinOp Expr Expr
| FuncApp String [Expr]
data UnOp = Negate | Sin | Cos | Tan
data BinOp = Add | Sub | Mul | Div | Exp
data FuncDef = FuncDef String [String] Expr
Now we are ready to start, I’m going to try and explain the theory behind these parsers, but I will try and motivate it by specific examples in our calculator language.
satisfy
To begin, we will implement what could arguably be described as one of the most fundemental parsers. It is satisfy :: (Char -> Bool) -> Parser Char which takes a function and successfully parsers the first character of the input only if the function returns True when applied to that function. From the first chapter, we should already be able to do this.
satisfy f = Parser $ \inp ->
case inp of
c:cs -> if f c then Just (c, cs) else Nothing
_ -> Nothing
If your memory is good, you might realise that this looks very similar to pChar , and in fact we can now redefine it!
pChar (again)
Now that we have satisfy , it means that pChar is trivial. This is the fundemental idea of parser combinators: we gradually add complexity to small parsers to create incredibly complex parsers.
pChar :: Char -> Parser Char
pChar c = satisfy (c==)
pString (again)
This one is going to be a bit more complex, but I want to use it as an oppotunity to show off all of the work that we did last chapter. We’re going to follow the same sort of structure as before, where we parsed each character of the string sequentially. So let’s consider the base case pString "" , this is just a parser that parses no input and returns "" . Fortunately though, that sounds suspiciously like pure "" , so we’re done with the base case!
Next, we need to work out what pString (c:cs) should do. The idea now is that we’re going to create a parser for c and then combine that with the parser for cs . I think for this one it’s best if I write it out first and then explain afterwards, so
pString :: String -> Parser String
pString "" = pure ""
pString (c:cs) = (:) <$> pChar c <*> pString cs
The base case should already be clear. For the recursive case, let’s assume that pString cs does exactly what we expect, which is return a parser that parses exactly the string cs , then we also have pChar c . All we need now is some way to combine the two parsers… The logic here is that if we try and parse some input string “combinator” to pString "combin" , then pChar "c" will return Just ('c', "ombinator") and then pString "ombin" will return Just ("ombin", "ator") . So combining these two parsers should correspond to appending “c” to “ombin”, ie. (:) .
Still it may not be clear, so first let’s consider what (:) <$> pChar c does. By our definition of fmap , we apply the function to the parsed result of the parser, so in this case, we will now have the result Just ('c' :, "ombinator") . By looking at the types of each of these functions, make sure your confident with the fact that
(:) <$> pChar c :: Parser (String -> String)
pString cs :: Parser String
Now, it should be quite easy to see just from the types that we need to use <*> . This is a design pattern that you will get used to. If you have a function of two parameters that you need to apply to the values contained inside two applicatives, then this pattern of f <$> app1 <*> app2 is generally what you’d want to use - we’re going to use this a lot more so we will have plenty more practise. I think this parser is much more elegant than our original one, and we didn’t have to write any low level logic, all we have to do now is consider what our parsers should parser and how we have to combine the results.
Sidenote: <:>
A small sidenote is that the (:) <$> px <*> py pattern can be made slightly more elegant, we do this by defining a new function.
(<:>) :: Parser a -> Parser [a] -> Parser a
px <:> py = (:) <$> px <*> py
But you should be able to see that this is not a function that is specific to parsers, and we should always aim to make function types as general as possible, all we really need here is an applicative function, so we can rewrite this as
(<:>) :: Applicative f => f a -> f [a] -> f[a]
px <:> py = (:) <$> px <*> py
Hence, the inductive case of our pString parser becomes
pString (c:cs) = pChar c <:> pString cs
It’s cleaner than what we had before, and its an additional tool that we can use in the future - what we’ve done is called “lifting” - we’ve lifted (:) to an applicative context.
pSpan
Another really useful small parser that we will need is pSpan :: (Char -> Bool) -> Parser String It works in a similar way to the normal span :: (a -> Bool) -> [a] -> ([a], [a]) . It will continue to parse a string as long as the predicate is satisfied, for example
runParser (pSpan isDigit) "123abc45" == Just ("123", "abc45")
runParser (pSpan isSpace) " abc" == Just (" ", "abc")
runParser (pSpan isSpace) "abc" == Nothing
You can see that what this parser does is repeatedly check if the predicate is satisfied after parsing each character, which is like repeatedly applying a satisfy parser until it fails. Finally we get to use our Alternative instance that we defined previously, one of the functions the Alternative class provides is some , which attemps to run the Alternative at least once (which in this case corresponds to successfully parsing a character), so we get
pSpan :: (Char -> Bool) -> Parser String
pSpan f = some (satisfy f)
-- or pSpan = some . satisfy
pVar
Finally we are ready to parse are first part of the grammar! We are going to parse strings, but only strings that contain alphanumeric characters and underscores - however, a variable name cannot start with a number. So the way we can think about this is first we have to parse a letter or an underscore, then we can parse any alphanumeric character or an underscore.
First let’s construct our small parsers
pUnderscore :: Parser Char
pUnderscore = pChar '_'
pAlpha :: Parser Char
pAlpha = satisfy isAlpha
pAlphanumeric :: Parser Char
pAlphanumeric = satisfy isAlphaNum
First let’s think how to parse any alphanumeric character or an underscore. It turns out that the answer is exactly in the question, using <|> from Alternative, we can very easily express this as
pAlphanumeric <|> pUnderscore :: Parser Char
After our first character, there may be more than one alphanumeric/”_” (there may also be zero), so to allow for parsing some arbitrary number of characters, we can use many , which is related to some from before, except it also succeeds if the supplied parser succeeds 0 times.
many (pAlphanumeric <|> pUnderscore) :: Parser [Char]
However, we need to guarantee that the first character is a alphanumeric character, and then we just combine that with the rest of the parsed string, looking at the types of pAlpha and many (pAlphanumeric <|> pUnderscore) , we can see that we simply need to cons the result of the first parser to the second.
The final step is to wrap this all in a Var . The type constructor is a function, so we can simply fmap it into our result to turn Parser String into Parser Expr
pVar :: Parser Expr
pVar = Var <$> pAlpha <:> many (pAlphanumeric <|> pUnderscore)
So finally, we can parse our first part of the grammar, let’s try it !
> runParser pVar "num1 = 5"
Just (Var "num1"," = 5")
> runParser pVar "num1_add_5 + 5"
Just (Var "num1_add_5"," + 5")
> runParser pVar "1num"
Nothing
This is our first big success ! A lot of the other parsers will follow a similar process, we essential build up the grammar from small parsers and some intelligent combining.
pNum
This has been quite a long tutorial, so I think we will do one more parser thats similar to pVar to finish off. So what is a float, well we can think of it as an integer that might have a decimal point and some numbers after the decimal point. Let’s start with the easiest bit and do the integer part. What we want to do is just parse a contiguous sequence of numbers, which is exactly the job of pSpan , we do this with
span isDigit
Let’s see how it works:
> runParser (pSpan isDigit) "12345"
Just ("12345","")
> runParser (pSpan isDigit) "12345.6789"
Just ("12345",".6789")
So it works for integers, but as soon as the number has a decimal place, it doesnt do what we want. So let’s construct a parser for the non-integer part. It will start with a ’.’ and then be followed by some digits, I’ll leave this one as an exercise, we will call it pDecimalPart for lack of a better name.
So now we a have a parser for the integer part, and a parser for the decimal part, we need some way to combine them. What we’re really looking to do is concatenate the results of both parsers. I think the best way to do this is to lift ++ to an applicative context, but it’s almost exactly the same as what we did with cons, so I’ll leave this as an easy exercise.
What we end up with is,
pSpan isDigit <++> pDecimalPart :: Parser String
However this is not exactly correct, we have lost the ability to parse numbers without a decimal. To fix this, we have say that if there is no decimal, we just return an empty string, the parser pure "" captures the idea of the empty string, we consume zero input and return "" . So we can write
pSpan isDigit <++> (pDecimalPart <|> pure [])
The final part is to wrap this in our Num constructor, but we have a small problem, Num :: Float -> Expr , but we currently have a string, this is no problem, though we simply do
read <$> pSpan isDigit <++> (pDecimalPart <|> pure [])
Now we can do
pNum :: Parser Expr
pNum = Num <$> read <$> pSpan isDigit <++> (pDecimalPart <|> pure [])
or slightly better,
pNum = Num . read <$> pSpan isDigit <++> (pDecimalPart <|> pure [])
We can test this and see that (if you implemented pDecimalPart correctly) that we have another successful parser!
sepBy1
To motivate this a bit, let’s think about the expression g(1, 5 + 4, x * 3) . For this, we need to repeatedly parse expressions seperated by commas. The concept of parsing a specific piece of the grammar seperated by another piece of the grammar is a very common scenerio, so let’s abstract it.
sepBy1 :: Parser a -> Parser b -> Parser [a] is the parser that parses tokens from px seperated by tokens from py - note that it assumes that there is at least one token from px - eg
> runParser (sepBy1 pInt pComma) "1,23,453"
[1, 23, 453]
> runParser (sepBy1 pInt pComma) "1"
[1]
> runParser (sepBy1 pInt pComma) ""
Nothing
So how should we do this ? Well, generally if we have a grammar like this in EBNF form, it would be
Int {',' Int}
where the curly brackets represent zero or more repititions. We can use this to motivate how we construct the parser
sepBy1 :: Parser a -> Parser b -> Parser [a]
sepBy1 px py = px <:> many (py *> px)
I’ve introduced a new symbol, *> :: Applicative f => f a -> f b -> f b , what this essentially does is sequentially peforms py then px , but throws away the result from the first and just returns the result from the second. So using the example of sepBy1 pInt pComma , we parse at least one integer, and then we try to repeatedly parse: a comma, followed by an integer. In the first example, we could group it like
1 ,23 ,543
Hopefully from this it’s more obivous why sepBy1 works.
sepBy
The problem with using this for our function arguments is that we could have a function with zero arguments. This is precisly sepBy , it’s very easy to define.
sepBy :: Parser a -> Parser b -> Parser [a]
sepBy px py = sepBy1 px py <|> pure []
This is basically the same logic as sepBy1 , but if it fails to parse at least one token for px , it just returns an empty list (or more pedantically, it returns the parser that consumes no input and returns the empty list).
> runParser (sepBy1 pInt pComma) "1,23,453"
[1, 23, 453]
> runParser (sepBy1 pInt pComma) "1"
[1]
> runParser (sepBy1 pInt pComma) ""
[]
Compare this with the results from sepBy1 . This is also the logic behind how some and many are defined in the applicative class.
pFunApp
Now we can easily parse our function applications. Let’s assume that the rules for a function name follow the rules of the variable names. For this reason, let’s abstract out that part of the logic from pVar , so we get
pIdent :: Parser String
pIdent = pAlpha <:> many (pAlphanumeric <|> pUnderscore)
pVar :: Parser Expr
pVar = Var <$> pIdent
So first we parse the function name, which is exactly pIndent , then we expect a series of expressions separated by commas surrounded by brackets. Let’s think about this in pieces. First, let’s think about the comma seperated expressions. One’s first instinct might be to write
sepBy pExpr (pChar ',')
But the problem is that this will fail on 1, 2, 3 because of the whitespace. So what we really want to do is parse whitespace where necessary between the commas. We can create a really simple whitespace parser \ ws , but I’ll leave this as an exercise. So we can do
sepBy pExpr (ws *> pChar ',' <* ws)
I’ve introduced another new operator <* , this is like *> - it sequentially executes the two actions, but it will throw away the second one instead of the first. So now, "1, 2, 3" will successfully parse as we discard the whitespace (note however that " 1, 2, 3 " doesn’t, but we will fix that soon.) and now we can surround it with brackets using
'(' *> sepBy pExpr (ws *> pChar ',' <* ws) <* ')'
So we can now parse "(1, 2, 3)" , but the whitespace strikes again because "( 1, 2, 3)" will fail, so we fix this with
'(' *> ws *> sepBy pExpr (ws *> pChar ',' <* ws) <* ws <* ')'
Now, we are ready to parse the entire function call
pFuncApp :: Parser Expr
pFuncApp = FuncApp <$> pIdent <*> (pChar '(') *> ws *> args <* ws <* ws <* (pChar ')')
where
args = sepBy pExpr (ws *> pChar ',' <* ws)
Hopefully its clear how we’ve combined all of these, as a hint FuncApp :: String -> [Expr] -> Expr , so FuncApp <$> pIdent :: Parser ([Expr] -> Expr) which should make it a bit easier to understand. This isnt the most attractive function ever, but we will introduce another parser which will improve this.
Let’s try it out a bit - we cant actually try it as it is, because we haven’t defined pExpr yet, but I’ll replace that with pNum for testing:
> runParser pFuncApp "f(1, 2, 3)"
Just (FuncApp "f" [Num 1.0,Num 2.0,Num 3.0],"")
> runParser pFuncApp "f()"
Just (FuncApp "f" [],"")
pFuncDef
Fortunately, this one shares a lot of similarities with pFuncApp . We already have the mechanisms to do this. So first we want to parse an identifier, but we already have pIdent , then a list of identifiers surrounded by brackets and seperated by commas - but thats extremely similar to what we did in the last parser, and then we expect to parse an equals sign (that may be surrounded by whitespace), but this is not needed to construct our Haskell data type, so we throw away the result, and finally, we parse an expression on the right hand side of the equals sign.
If we apply this sequential reasoning, and then chain it together with FuncDef :: String -> [String] -> Expr -> Expr , it should be quite intuitive
pFuncDef :: Parser FuncDef
pFuncDef = FuncDef
<$> pIdent
<*> betweenBrackets args
<* ws
<* pChar '='
<* ws
<*> pExpr
where
args = sepBy pIdent (ws *> pChar ',' <* ws)
Hopefully at this point you’re starting to get the hang of how to apply functions in an applicative context. As you can see, this function was actually not too difficult to implement, the work in building the smaller parsers and combining them together.
Finishing notes
I think we’ve covered quite a lot in this, in the next tutorial, we will finish our grammar and introduce slightly more complex parsers. Feel free to let me know how you found this tutorial, I am always looking to improve the quality of this website.
As a little summary, here are the important functions that we’ve covered so far that are useful when parsing an language. In the next tutorial there will be many more.
satisfy :: (Char -> Bool) -> Parser Char
pSpan :: (Char -> Bool) -> Parser String
(<:>) :: Applicative f => f a -> f [a] -> f [a]
sepBy1 :: Parser a -> Parser b -> Parser [a]
sepBy :: Parser a -> Parser b -> Parser [a]
between :: Parser a -> Parser b -> Parser c -> Parser c
Exercises
-
Define
satisfyusing>>= -
Using
satisfy, definepAny :: Parser Char, which successfully parses any character, but fails if there is no input. -
Write a parser to parse a ”.” followed by some digits. It should work on “.123”, “.01”, but fail of ”.” and “.123a”
-
Write a function
<++> :: Applicate f => f [a] -> f [a] -> f [a]that lifts concatentation to an applicative context.