I really like [funcparserlib](http://code.google.com/p/funcparserlib/). It is by far my favorite parser library, and I have used it in numerous projects over the years. But it took me forever to "get". The author wrote two tutorials, the [Official Tutorial](http://spb-archlinux.ru/2009/funcparserlib/Tutorial) and the [Bracket Tutorial](http://spb-archlinux.ru/2009/funcparserlib/Brackets). Both of these are solid, but here is my take on the library.

First up: do you like [EBNF](http://en.wikipedia.org/wiki/Ebnf)? Are you aware of the limitations of [regex](http://en.wikipedia.org/wiki/Chomsky_hierarchy)? Do you secretly desire to write out formal specifications? Good. Otherwise you'll wonder what all the fuss is about.

The official tutorial spends a lot of time using Python's built-in lexer. Good for speed, but a needless complication. It should be funcparserlib all the way down! Let's start with the basics:

```python
import funcparserlib.parser as p
```

I'll be using `p.` to access funcparserlib in the code samples. This library is **tiny**. I can not stress this enough. There are only thirteen methods. Typically a tutorial would start describing simple examples now, but with a small number of very powerful words, a mile-high overview is more helpful.

# Major terms

## match and parser

> These are not commands, but I'll be saying them a lot.

These parsers are stream based. Big parsers are made from small parsers. They examine a stream of characters, and each character gets compared against a filtering/parsing function. If there are no errors and the parser manages to bite off enough characters to make it happy, then a match has been found. If something unexpected comes up, the parser aborts the match. The almost-matched characters are returned, and the next parser tries to match them. (See [LL](http://en.wikipedia.org/wiki/LL_parser) for more depth.) The exciting thing is that these parsers have a hierarchy, and can recurse into each other.
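The whole scheme can be modeled in a few lines of throwaway Python. This is a toy sketch of the idea, not funcparserlib's actual internals, and every name in it is mine: a parser is a function that takes the text and a position, and either returns a `(value, new_position)` pair or `None` to abort the match.

```python
# Toy model of a parser combinator -- NOT funcparserlib's real internals.
# A parser: (text, pos) -> (value, new_pos), or None on a failed match.

def char(c):
    """Smallest parser: match one expected character."""
    def parse(text, pos):
        if pos < len(text) and text[pos] == c:
            return c, pos + 1
        return None
    return parse

def seq(p1, p2):
    """Bigger parser from two smaller ones: p1, then p2."""
    def parse(text, pos):
        r1 = p1(text, pos)
        if r1 is None:
            return None
        v1, pos1 = r1
        r2 = p2(text, pos1)
        if r2 is None:
            return None  # something unexpected: abort the whole match
        v2, pos2 = r2
        return (v1, v2), pos2
    return parse

ab = seq(char('a'), char('b'))
print(ab('abc', 0))  # (('a', 'b'), 2)
print(ab('axc', 0))  # None -- match aborted
```

A real combinator library adds error reporting and operator sugar on top, but the shape is the same: small parsers go in, bigger parsers come out.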
And just in case your mind is not blown enough, all of these parsers take smaller parsers as input and return bigger parsers as output, to create your full-blown parser.

## `finished`

> The simplest parser. It matches the end of the stream.

## `a`

> The second simplest parser. It takes a single character and will match it.

## `some`

> Builds a parser from scratch. Like the `filter` command, it takes a boolean test function as input.

## `many`

> Similar to `itertools.takewhile`, it takes a single parser and tries to match with it for as long as possible.

## `oneplus`

> Very similar to `many`, but requires at least one match.

## `maybe`

> Instead of aborting the match if a parser fails, insert `None` into the result.

## `skip`

> Match something, but don't echo the match to the result.

## `pure`

> For inserting stuff into the result. Used by `maybe`, but not commonly needed.

## `forward_decl`

> Used as a placeholder for making a self-referential parser.

## `with_forward_decls`

> This is a [decorator](http://www.artima.com/weblogs/viewpost.jsp?thread=240808). Used to create mutually recursive parsers.

Now for the last three. They are very important, more so than any previous method. Maybe these are why funcparserlib is so much fun, or why it produces such tight code, free of parentheses and dotted calls. Some common operators are redefined for parsers:

## `+`

> Concatenates parsers together. Matches the first, then the second.

## `|`

> Like its logical OR brother, allows matching either of two parsers.

## `>>`

> Sends the match to a function. Used to change, clean up and process the results.

On to the code: a real parser for a miniature domain-specific language. The case study behind this writeup is the project I was working on when everything clicked: the [closing credits](http://www.baphl.org/2/video/BAPHL%20credits.avi) animation for [BAPHL 2](http://www.baphl.org/2/). In a nutshell, we made our own knockoff of Portal's credits. I got to do all the programming and graphics, while Nathan Curtis did musical magic to create the catchy tune.

The tricky part was flexibility. It had to be easy to sync the animation to the words. It had to be easy to swap fresh ASCII art in. The credits were going to change that morning depending on who showed up. And it had to be extensible, as feature creep was almost a given. Funcparserlib made this flexibility easy to obtain. The resulting bit of Python could read a couple of text files (one per "window") and spit out several thousand PNG frames, which were glued together with the soundtrack in `mencoder`. Fun fact: `ffmpeg` (my usual AV swiss army knife of choice) breaks if you try to make a video from ≈1000 still frames.

# Grammar time

Those text files would contain a bunch of plain text with tags interspersed. Anything that was not a tag (i.e., normal text) would be rendered to the screen. All text was preceded and followed by timestamp tags, and the letters would be interpolated between these times.

```
[4.5]     floats are absolute timestamps
[+0.5]    relative timestamp
[++10.0]  delay (insert dead air w/o changing all absolute timestamps)
          single ints are predefined palette colors
[6,7]     double ints are x,y cursor position
[1,2,3]   triple ints are rgb colors
[n]       newline (newlines in the file are ignored)
[c]       clear the screen
```

As far as grammars go, this is about as simple as it gets. There is no nesting or branching. The square bracket characters can not even be escaped! Time to build a parser.

# Gotcha Number One

Be very careful when playing with funcparserlib in the Python interpreter. Because of how it uses closures, changes to functions don't propagate through.

```
parser_a = ....
parser_b = ....(parser_a, ....)
# whoops, made a typo
parser_a = ....
```

`parser_b` will still be using the old definition of `parser_a`! This screwed me up for so long. My advice for dealing with this is to keep your parser rules in a separate file.
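This gotcha is ordinary Python closure behavior, not anything exotic in funcparserlib. A minimal stdlib sketch (the function names here are mine) shows the trap:

```python
def shout(s):
    return s.upper()

def wrap(f):
    # the wrapper captures the function *object*, not the name
    return lambda s: f(s)

big = wrap(shout)

# "fix a typo" by redefining shout...
def shout(s):
    return s.upper() + '!'

print(big('hi'))  # 'HI' -- big still holds the old shout object
```

Rebinding the name `shout` does not touch the object that `big` already captured, and composed parsers capture each other in exactly the same way.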
Import the file to the REPL, play with the parsers, edit the file, and `reload()` the import to play with the changes.

# Basics

```python
digits = p.oneplus(p.some(lambda c: c.isdigit()))
digits.parse('123') == ['1', '2', '3']

to_int = lambda ds: sum(int(d)*10**m for m,d in enumerate(reversed(ds)))
to_int(['1', '2', '3']) == 123

integer = digits >> to_int
integer.parse('123') == 123

to_dec = lambda ds: sum(int(d)*10**(-m-1) for m,d in enumerate(ds))
to_dec(['1', '2', '3']) == 0.12300000000000001  # close enough

decimal = integer + p.skip(p.a('.')) + (digits >> to_dec) >> sum
decimal.parse('123.456') == 123.456

numbers = p.many((decimal | integer) + p.skip(p.maybe(p.a(' '))))
numbers.parse('1 2 3 4.5 6 7.5') == [1, 2, 3, 4.5, 6, 7.5]
```

There will be heavy use of [lambda functions](http://www.secnetix.de/olli/Python/lambda_functions.hawk) when building simple parsers. The `.parse()` method is the primary interface to the parsers themselves. Normally it will be called once, on the largest and most complete parser.

`digits` chomps into the stream character by character, collecting every character for which `.isdigit()` returns `True`. `to_int` is not a parser, but a simple function which converts a list of digit characters into a whole integer. `integer` combines these two, creating a (marginally) useful parser. `to_dec` is based heavily on `to_int`, but designed for numbers on the right side of the decimal point. `decimal` takes a leading integer, a decimal point and a trailing decimal term. These are summed together to make a float. `numbers` is almost a useful parser, capable of dealing with larger structures.

But not that useful yet. Parsers should be able to at least handle malformed data.

```python
numbers.parse('1 2.5 3 Q 4.5') == [1, 2.5, 3]  # something is missing

numbers2 = numbers + p.skip(p.finished)
numbers2.parse('1 2.5 3 Q 4.5')  # raises NoParseError: should have reached <EOF>: Q
```

By default, bad matches just kind of fall off the end. The parser gets derailed and returns partial results. Adding `finished` to the end forces it to acknowledge the error. Hitting the 'Q' breaks `numbers` out of its `many` loop. The next parser in the chain is looking for an end-of-file marker, and instead it sees the 'Q'.

Not the best error report, though. Anyone who has used a compiler in the past 20 years expects to get the line and column where the parse error occurred. Funcparserlib won't do this on its own, and it is probably why the original author is so fond of using Python's `token` module as the first stage of parsing. `token` will report more useful (but cryptic) information about where the error occurred. Worry about this later. Besides, `numbers` and `numbers2` are a diversion. Moving on.

```python
stamp = p.skip(p.a('[')) + decimal + p.skip(p.a(']')) >> (lambda x: ('time', x))

# p.skip(p.a()) is annoying
char = lambda c: p.skip(p.a(c))
stamp = char('[') + decimal + char(']') >> (lambda x: ('time', x))
stamp.parse('[4.5]') == ('time', 4.5)

relstamp = char('[') + char('+') + decimal + char(']') >> (lambda x: ('reltime', x))

# char() + char() is annoying
literal = lambda s: reduce(lambda x,y: x+y, map(char, s))
relstamp = literal('[+') + decimal + char(']') >> (lambda x: ('reltime', x))
relstamp.parse('[+0.5]') == ('reltime', 0.5)

# those lambdas are getting annoying
tag = lambda t: lambda x: (t, x)
delay   = literal('[++') + decimal + char(']') >> tag('delay')
color   = char('[') + integer + char(']') >> tag('color')
jump    = char('[') + integer + char(',') + integer + char(']') >> tag('jump')
newline = literal('[n]') >> tag('newline')
clear   = literal('[c]') >> tag('clear')
rgb     = ...  # very similar to jump, but longer

clear.parse('[c]') == ('clear', _Ignored(()))
```

`literal` is a bit confusing. Note that `x+y` is chaining `char` parsers, not adding numbers.

# Gotcha Number Two

Yes, all those parentheses around the lambdas are necessary.
The order of operations with `>>` is a little odd, but refactoring the lambdas out removes the need for the parens.

The larger structure of the parser's output starts to appear. The raw text is converted into a series of tagged tuples, a crude abstract syntax tree. For the most part, the AST is just tuples, numbers and strings. With `clear` and `newline` a small bit of funcparserlib leaks out, but this AST will never be serialized, so it is simpler to ignore the `_Ignored(())`.

# Putting it all together

```python
letter = p.some(lambda x: True) >> tag('text')
anything = stamp | relstamp | color | jump | rgb | newline | clear | letter
everything = p.many(anything) + p.skip(p.finished)
everything.parse(text_to_be_parsed)

everything.parse('[1.0]word[+0.5]') == [('time', 1.0), ('text', 'w'), ('text', 'o'),
                                        ('text', 'r'), ('text', 'd'), ('reltime', 0.5)]
```

Usually you'll want to glue individual letters together into words, just like the digits were assembled into numbers. In this application, every character was an individual animation event anyway; assembling words only to later break them up again would have been silly. `letter` uses `some` with a predicate that always returns `True` to match any character. Since it is a catch-all, it must come last in `anything`.

# Conclusion

Parsers can be a bit of a pain to set up. Five long lines just to duplicate the functionality of `float()`! After a slow start, the capability of the parser grows exponentially with each line, all while individual lines remain short and simple. But this is true of all parsers. The three big take-home lessons specific to funcparserlib are:

* The REPL will ignore your tweaks.
* `>>` will eat `lambda`; use extra parentheses.
* Use `>>` to clean up your abstract syntax tree.

Now, go read the [Brackets Tutorial](http://spb-archlinux.ru/2009/funcparserlib/Brackets) and learn about recursive parsers.