Monday 16 March 2015

Overthinking Scala Parsers

Let me start by telling how much I love Scala's Parsing Combinators framework. I had a nice experience some time ago, working with tools based on very similar ideas on Haskell, and I felt glad to realize that the Scala guys where able to fully maintain the power and simplicity of functional parser combinators while, at the same time, integrate it so gracefully to the object oriented world and Scala's complex type system.

Anyway, I have no intention to write about the framework itself (It's very well documented and it's not hard to find lots of tutorials and examples), but rather focus on sharing our recent experience dealing with some of it's rough edges. There where mainly two particular subjects that, though subtle, I found interesting. One, once you got your parser to work, what can you do to make it nicer? And two, how do you test it?

Imagine you have some code like this one:

object AwesomeParser extends RegexParsers {
  val foo: Parser[Foo] = ... // Some foo parser
  val bar: Parser[Bar] = ... // Some bar parser
  val yetMoreParsers = ...
}

For the sake of the post, let's assume that this is a complete, working example. We could also give for granted that the code is very clean and we attend all the common functional good practices such as avoiding side effects and defining larger parsers as the combination of very small and simple ones. Ok, so the logic is working, and the code is clear; what could possibly be wrong?

Well... Nothing is "wrong" I guess... But there are a couple of things here that I felt kind of unpleasant. For example, let's say that the day comes that I have to add a new parsers defined as one or more foo and one or more bar.

object AwesomeParser extends RegexParsers {
  val manyFoos = foo.+
  val foo: Parser[Foo] = ... // Some foo parser
  val bar: Parser[Bar] = ... // Some bar parser
  val manyBars = bar.+
  val yetMoreParsers = ...
}

This may still look ok, it will even compile just fine, but truth is the manyFoos implementation is broken. Why? Because it references a value not yet initialized. At this point you may be thinking "Oh, God! Just put the manyFoos val at the bottom, you obsessive freak!" and, sure, that would fix this scenario, but once you have a good fifty something parsers with combinations and recursions finding the right spot to place your parsers can be tricky. Besides I don't want to worry about declaring stuff before using it (What am I, a caveman?).
We could replace all those vals with defs, but that would also introduce some unnecessary overhead, since parsers would have to be combined every single time they are referenced... I find that the most satisfying alternative (in my experience, of course) is to define parsers as lazy initialized values. That way the parsers would get combined one single time, but you can pretty much place them wherever you want to.

object AwesomeParser extends RegexParsers {
  lazy val manyFoos = foo.+
  lazy val foo: Parser[Foo] = ... // Some foo parser
  lazy val bar: Parser[Bar] = ... // Some bar parser
  lazy val manyBars = bar.+
  lazy val yetMoreParsers = ...
}

Another annoying thing about this implementation is how dirty the interface is. Sure, the code may look clean and easy now, but wait until you get that AwesomeParser object and press ctrl+space...
As you may notice, my AwesomeParser object extends the RegexParsers trait (though it may have been any other member of the  Parsers subtrait family). These traits pretty much define little parsers ecosystems, along with the types and functions needed to manipulate them. Problem is, the amount of methods and types defined is HUGE (and all the parsers we added don't really help the problem). Also, there is no standar entry point. A RegexParser may be executed using the parse or parseAll methods, while a Json parser provides parseFull and parseRaw.
Adding all up, it is quite bothersome for the parser user to navigate this large interfaces looking for the way to execute the parser and ignoring amost everything else. And even when they find the right parse method, most of it's versions require the user to pass along the parser that should be used to parse. For example, our AwesomeParser may be called like this:

AwesomeParser.parse(AwesomeParser.manyFoos,"foofoofoofoo")

Now, lets face it. It is great to have the possibility to chose the parse approach, but most of the time you don't need that many options. On those cases, a simple way to get a cleaner interface for your parser may be to "hide" it's definition in an inner declaration and just expose the messages you want.

object AwesomeParser {

  def apply(input: String) = Definition.parse(Definition.manyFoos, input)

  protected object Definition extends RegexParsers {
    lazy val manyFoos = foo.+
    lazy val foo: Parser[Foo] = ... // Some foo parser
    lazy val bar: Parser[Bar] = ... // Some bar parser
    lazy val manyBars = bar.+
    lazy val yetMoreParsers = ...
  }
}

One last thing to object about this implementation is that it is pretty much impossible to extend. A singleton object may seem like a great implementation for a parser; it grants global access and probably leads to a healthy stateless implementation, but what if somebody wants to extend it or make a subtle variation on the grammar? Well, why can't we have booth? It is as simple as offering a trait implementation of your parser and make a default singleton instance to extend it. That way you keep all the perks of a global, well-known parser without sacrificing extensibility.

object AwesomeParser {

  def apply(input: String) = Definition.parse(Definition.manyFoos, input)

  protected object Definition extends AwesomeParserDefinition
  trait AwesomeParserDefinition extends RegexParsers {
    lazy val manyFoos = foo.+
    lazy val foo: Parser[Foo] = ... // Some foo parser
    lazy val bar: Parser[Bar] = ... // Some bar parser
    lazy val manyBars = bar.+
    lazy val yetMoreParsers = ...
  }
}

Now lets talk testing. I can't even imagine developing/maintaining a parser without a full set of tests. The thing is that, in my experience, parser tests tend to be much MUCH larger than the parsers themselves and require lots of attention and care. The simpler and more expresive they are, the easier it will be to implement changes to the parsed grammar. That's why we packed a couple of simple tools we developed to improve the parser testing and published them here for anyone to use.

And that's all I have to say about that.


N.