scala-parser-combinators/Getting_Started.md at 1.2.x · scala/scala-parser-combinators

Getting Started

Scala parser combinators are a powerful way to build parsers that can be used in everyday programs. But it's hard to understand the plumbing pieces and how to get started. After you get the first couple of samples to compile and work, the plumbing starts to make sense. But until then it can be daunting, and the standard documentation isn't much help (some readers may remember the original "Scala By Example" chapter on parser combinators, and how that chapter disappeared from subsequent revisions of the book). So what are the components of a parser? How do those components fit together? What methods do I call? What patterns can be matched? Until those pieces are understood, you can’t begin to work on your grammar or build and process abstract syntax trees. So to minimize complexity, I wanted to start here with the simplest possible language: a lowercase word. Let’s build a parser for that language. We can describe the grammar in a single production rule:

word -> [a-z]+

Here’s what the parser looks like:

import scala.util.parsing.combinator._
class SimpleParser extends RegexParsers {
 def word: Parser[String] = """[a-z]+""".r ^^ { _.toString }
}

The package scala.util.parsing.combinator contains all of the interesting stuff. Our parser extends RegexParsers because we do some lexical analysis. """[a-z]+""".r is the regular expression. ^^ is documented to be "a parser combinator for function application". Basically, if the parsing on the left of the ^^ succeeds, the function on the right is executed. If you've done yacc parsing, the left hand side of the ^^ corresponds to the grammar rule and the right hand side corresponds to the code generated by the rule. Since the method "word" returns a Parser of type String, the function on the right of ^^ needs to return a String.

So how do we use this parser? Well, if we want to extract a word from string, we can call

SimpleParser.parse(SimpleParser.word(myString))

Here’s a little program to do this.

object TestSimpleParser extends SimpleParser {
 def main(args: Array[String]) = println(parse(word, "johnny come lately"))
}

Two things to notice here:

The object extends SimpleParser. That gets us around having to prefix everything with "SimpleParser".
When we run this, we don’t get back the "word" we parsed, we get a ParserResult[String] back. The "String" type parameter is needed because the method named "word" returns a result of type Parser[String], and the type parameter carries through to the ParseResult.

When we run the program, we get the following at the console:

[1.7] parsed: johnny

That says that the first character of the input that matched the parser is position 1, and the first character remaining to be matched is in position 7. This is a good start, but all of this should suggest that we are missing something because we have a ParseResult, but not the thing we want, which is the word. We need to handle the ParserResult better. We could call the "get" method on the ParseResult. That would give us the result, but that would be making the optimistic assumption that everything worked and that parsing was successful. We can't plan on that because we probably can't control the input enough to know that it is valid. The input is given to us and we have to make the best of it. That means detecting and handling errors, which sounds like a job for pattern matching, right? In Scala we use pattern matching to trap exceptions, we use pattern matching (Options) to branch for success and failure, so you would expect to use pattern matching to deal with parsing as well. And in fact you can pattern match on the ParseResult for the various termination states. Here’s a rewrite of the little program that does a better job:

object TestSimpleParser extends SimpleParser {
 def main(args: Array[String]) = { 
 parse(word, "johnny come lately") match {
 case Success(matched,_) => println(matched)
 case Failure(msg,_) => println("FAILURE: " + msg)
 case Error(msg,_) => println("ERROR: " + msg)
 }
 }
}

In comparison to Option, which has two primary cases (Some and None), the ParseResult basically has three cases: 1) Success, 2) Failure, and 3) Error. Each case is matched by a pattern of two items. In the Success case, the first item is the object produced by the parser (a String for us since "word" returns a Parser[String]), and in the Failure and Error cases, the first item is an error message. In all cases, the second item in the match is the remaining unmatched input, which we don’t care about here. But if we were doing fancy error handling or subsequent parsing, we would pay close attention to. The difference between Failure and Error is that on a Failure, parsing will backtrack when parsing continues (this rule didn't work but maybe there is some other grammar rule that will), whereas the Error case is fatal and there will be no backtracking (you have a syntax error, there is no way to match the expression you have provided with the grammar for this language, edit the expression and try again).

This tiny example shows a lot of the necessary parser combinator plumbing. Now let’s look at a slightly more complex (and admittedly contrived) example to bring forward some of the remaining plumbing. Say that what we are really after is a word followed by a number. Pretend that this is data about the frequency count of words in a long document. Of course, there are ways to do this by simple regular expression matching, but let’s take a slightly more abstract approach to show some more combinator plumbing. In addition to words we will also have to match numbers, and we will have to match words and numbers together. So first, let’s add a new type to gather words and counts. Here is a simple case class for that: