An Introduction

Original An Introduction Editable
version 3 of 3

I really like funcparserlib. By far my favorite parser and have used it in numerous projects over the years. But it took me forever to "get". The author wrote two tutorials, the Official Tutorial and the Bracket Tutorial. Both of these are solid, but here is my take on the library.

First up, do you like EBNF? Are you aware of the limitations of regex? Do you secretly desire to write out formal specifications? Good. Otherwise you'll wonder what all the fuss is about.

The official tutorial spends a lot of time using Python's built in lexer. Good for speed, but a needless complication. It should be funcparserlib all the way down! Let's start with the basics:

import funcparserlib.parser as p

I'll be using p. to access funcparserlib in the code samples.

This library is tiny. Can not stress this enough. There are only thirteen methods. Typically a tutorial would start describing simple examples now. But with a small number of very powerful words, a mile high overview is more helpful.

Major terms

match and parser

These are not commands. But I'll be saying them a lot. These parsers are stream based. Big parsers are made from small parsers. They examine a stream of characters, and each character gets compared against a filtering/parsing function. If there are no errors and the parser manages to bite off enough characters to make it happy, then a match has been found. If something unexpected comes up, the parser aborts the match. The almost-matched characters are returned, and the next parser tries to match them. (See LL for more depth.) The exciting thing is that these parsers have a hierarchy, and can recurse into each other. And just in case your mind is not blown enough, all of these parsers take smaller parsers as input and return bigger parsers as output, to create your full blown parser.

finished

The simplest parser. It matches the end of the stream.

a

The second simplest parser. It takes a single character and will match it.

some

Builds a parser from scratch. Like the filter command, it takes a boolean test function as input.

many

Similar to itertools.takewhile, it takes a single parser and tries to match with it for as long as possible.

kk (2019-12-01-10-16-03-686)

Use this sparingly, and with care when you do. It will always successfully match anything, and so prevents the parser from backtracking if something fails down the line.



oneplus

Very similar to many, but requires at least one match.

maybe

Instead of aborting the match if a parser fails, insert None into the result.

skip

Match something, but don't echo the match to the result.

pure

For inserting stuff into the result. Used by maybe, but not commonly needed.

forward_decl

Used as a placeholder for making a self-referential parser.

Kyle (2012-02-02-14-52-31-270)

Cut forward_decls and with_forward. Recursive parsers are beyond this intro.



with_forward_decls

This is a decorator. Used to create mutually recursive parsers.

Now for the last three. They are very important, more so than any previous method. Maybe these are why funcparserlib is so much fun. Or why it produces such tight code, free of parentheses and dotted calls. Some common operators are redefined for parsers:

+

Concatenates parsers together. Matches the first, then the second.

|

Like its logical OR brother, allows matching either of two parsers.

Kyle (2012-02-02-14-51-14-928)

An important detail, it short circuits to the first match and does not check further.



>>

Send match to a function. Used to change, clean up and process the results.

On to the code: a real parser for a miniature domain specific language. The case study behind this writeup is the project I was working on when everything clicked: the closing credits animation for BAPHL 2. In a nutshell, we made our own knockoff of Portal's credits. I got to do all the programming and graphics, while Nathan Curtis did musical magic to create the catchy tune.

The tricky part was flexibility. It had to be easy to sync the animation to the words. It had to be easy to swap fresh ascii art in. The credits were going to change that morning depending on who showed up. And it had to be extendible, as feature creep was almost a given. Funcparserlib made this flexibility easy to obtain.

The resulting bit of python could read a couple of text files (one per "window") and spit out several thousand PNG frames, which were glued together with the soundtrack in mencoder. Fun fact, ffmpeg (my usual AV swiss army knife of choice) breaks if you try to make a video from ≈1000 still frames.

Grammar time

Those text files would contain a bunch of plain text with tags interspersed. Anything that was not a tag (ie, normal text) would be rendered to the screen. All text was preceded and followed by timestamp tags, and the letters would be interpolated between these times.

[4.5]    floats are absolute timestamps
[+0.5]   relative timestamp
[++10.0] delay (insert dead air w/o changing all absolute timestamps)
[2]      single ints are predefined palette colors
[6,7]    double ints are x,y cursor position
[1,2,3]  triple ints are rgb colors
[n]      newline (newlines in the file are ignored)
[c]      clear the screen

As far as grammars go, this is about as simple as it gets. There is no nesting or branching. The square bracket characters can not even be escaped! Time to build a parser.

Gotcha Number One

Be very careful when playing with funcparserlib in the Python interpreter. Because of how it uses closures, changes to functions don't propagate through.

> parser_a = ....
> parser_b = ....(parser_a, ....)
> # whoops, made a typo
> parser_a = ....

Parser_b will still be using the old definition of parser_a! This screwed me up for so long. My advice for dealing with this is to keep your parser rules in a separate file. Import the file to the REPL, play with the parsers, edit the file, reload() the import to play with the changes.

Basics

digits = p.oneplus(p.some(lambda c: c.isdigit()))
digits.parse('123') == ['1', '2', '3']

to_int = lambda ds: sum(int(d)*10**m for m,d in enumerate(reversed(ds)))
to_int(['1', '2', '3']) == 123

integer = digits >> to_int
integer.parse('123') == 123

to_dec = lambda ds: sum(int(d)*10**(-m-1) for m,d in enumerate(ds))
to_dec(['1', '2', '3']) == 0.12300000000000001  # close enough

decimal = integer + p.skip(p.a('.')) + (digits >> to_dec) >> sum
decimal.parse('123.456') == 123.456

numbers = p.many((decimal | integer) + p.skip(p.maybe(p.a(' '))))
numbers.parse('1 2 3 4.5 6 7.5') == [1, 2, 3, 4.5, 6, 7.5]

There will be a heavy use of lambda functions when building simple parsers. The .parse() method is the primary interface to the parsers themselves. Normally it will be called once on the largest and most complete parser.
digits chomps into the stream character by character, collecting that returns .isdigit()==True.
to_int is not a parser, but a simple function which converts a list of number-characters into a whole integer.
integer combines these two, creating a (marginally) useful parser.
to_dec is based heavily on to_int, but designed for numbers on the right side of the decimal point.
decimal takes a leading integer, decimal point and trailing decimal term. These are summed together to make a float.
numbers is almost a useful parser, capable of dealing with larger structures. But not that useful. Parsers should be able to at least handle malformed data.

numbers.parse('1 2.5 3 Q 4.5') == [1, 2.5, 3]  # something is missing

numbers2 = numbers + p.skip(p.finished)
numbers2.parse('1 2.5 3 Q 4.5') == NoParseError: should have reached <EOF>: Q

By default, bad matches just kind of fall off the end. The parser gets derailed and returns partial results. Adding finished to the end forces it to acknowledge the error. Hitting the 'Q' breaks numbers out of its many loop. The next parser in the chain is looking for an end-of-file marker, and instead it sees the 'Q'. Not the best error report though. Anyone who has used a compiler in the past 20 years generally expects to get a line and column of where the parse error occurred. Funcparserlib won't do this on its own, and it is probably why the original author is so fond of using Python's token module as the first stage of parsing.
token will report more useful (but cryptic) information about where the error occurred. Worry about this later.

Besides, numbers and numbers2 are a diversion. Moving on.

stamp = p.skip(p.a('[')) + decimal + p.skip(p.a(']')) >> (lambda x:('time', x))
# p.skip(p.a()) is annoying
char = lambda c: p.skip(p.a(c))
stamp = char('[') + decimal + char(']') >> (lambda x:('time', x))
stamp.parse('[4.5]') == ('time', 4.5)

relstamp = char('[') + char('+') + decimal + char(']') >> (lambda x:('reltime', x))
# char() + char() is annoying
literal = lambda s: reduce(lambda x,y: x+y, map(char, s))
relstamp = literal('[+') + decimal + char(']') >> (lambda x:('reltime', x))
relstamp.parse('[+0.5]') == ('reltime', 0.5)

# those lambdas are getting annoying
tag = lambda t: lambda x:(t, x)

delay = literal('[++') + decimal + char(']') >> tag('delay')
color = char('[') + integer + char(']') >> tag('color')
jump = char('[') + integer + char(',') + number + char(']') >> tag('jump')
newline = literal('[n]') >> tag('newline')
clear = literal('[c]') >> tag('clear')
rgb = ...  # very similar to jump, but longer
clear.parse('[c]') == ('clear', _Ignored(()))

literal is a bit confusing. Note x+y is chaining char parsers, not adding numbers.

Gotcha Number Two

Yes, all those parentheses around the lambdas are necessary. The order of operations with >> is a little odd. But refactoring the lambdas out removes the need for the parens.

The larger structure of the parser's output starts to appear. The raw text is converted into a series of tagged tuples, a crude abstract syntax tree. For the most part, the AST is just tuples, numbers and strings. With clear and newline a small bit of funcparserlib leaks out, but this AST will never be serialized so it is simpler to ignore the _Ignored(()).

Putting it all together

letter = p.some(lambda x: True) >> tag('text')
anything = stamp | relstamp | color | jump | rgb | newline | clear | letter
everything = p.many(anything) + p.skip(p.finished)
return everything.parse(text_to_be_parsed)

everything.parse('[1.0]word[+0.5]') = 
    [('time', 1.0),
     ('text', 'w'),
     ('text', 'o'),
     ('text', 'r'),
     ('text', 'd'),
     ('reltime', 0.5)]

Usually you'll want to glue individual letters together into words just like digits were assembled to numbers. In this application, every character was an individual animation event anyway. Assembling words only to later break them up again was silly.

letter uses some(True) to match any character. Since it is a catchall, it must come last in anything.

Conclusion

Parsers can be a bit of a pain to set up. Five long lines just to duplicate the functionality of float()! After a slow start, the capability of the parser grows exponentially with each line. All while individual lines remain short and simple. But this is true of all parsers.

The three big take home lessons specific to Funcparserlib are

  • The REPL will ignore your tweaks.
  • >> will eat lambda, use extra parentheses.
  • Use >> to clean up your abstract syntax tree.

Now, go read the Brackets Tutorial and learn about recursive parsers.