I really like funcparserlib. It is by far my favorite parser library, and I have used it in numerous projects over the years. But it took me forever to "get". The author wrote two tutorials, the Official Tutorial and the Brackets Tutorial. Both are solid, but here is my take on the library.
The official tutorial spends a lot of time using Python's built-in lexer. Good for speed, but a needless complication. It should be funcparserlib all the way down! Let's start with the basics:
This library is tiny. I cannot stress this enough: there are only thirteen methods. Typically a tutorial would start with simple examples at this point, but with such a small number of very powerful words, a mile-high overview is more helpful.
These are not commands, but I'll be saying them a lot. These parsers are stream based. Big parsers are made from small parsers. They examine a stream of characters, and each character gets compared against a filtering/parsing function. If there are no errors and the parser manages to bite off enough characters to make it happy, then a match has been found. If something unexpected comes up, the parser aborts the match. The almost-matched characters are returned, and the next parser tries to match them. (See LL parsing for more depth.) The exciting thing is that these parsers form a hierarchy and can recurse into each other. And just in case your mind is not blown enough, all of these parsers take smaller parsers as input and return bigger parsers as output, building up to your full-blown parser.
many: Similar to itertools.takewhile, it takes a single parser and tries to match with it for as long as possible.
forward_decl: Used as a placeholder for making a self-referential parser.
with_forward_decls: This is a decorator, used to create mutually recursive parsers.
Now for the last three. They are more important than any of the previous methods; maybe they are why funcparserlib is so much fun, and why it produces such tight code, free of parentheses and dotted calls. Some common operators are redefined for parsers:
+ : Sequences two parsers; the first must match, then the second, and their results are combined.
| : Like its logical OR brother, allows matching either of two parsers.
>> : Pipes a parser's result through a plain Python function; this is how raw matches get massaged into useful values.
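Here is a minimal sketch of all three in action (a throwaway example, not part of the credits parser):

import funcparserlib.parser as p

ab = p.a('a') + p.a('b')                      # + sequences two parsers
digit = p.some(lambda c: c.isdigit()) >> int  # >> transforms the matched character
either = (ab + digit) | digit                 # | tries alternatives in order

either.parse('ab7') == ('a', 'b', 7)
either.parse('9') == 9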
On to the code: a real parser for a miniature domain-specific language. The case study behind this writeup is the project I was working on when everything clicked: the closing credits animation for BAPHL 2. In a nutshell, we made our own knockoff of Portal's credits. I got to do all the programming and graphics, while Nathan Curtis did musical magic to create the catchy tune.
The tricky part was flexibility. It had to be easy to sync the animation to the words. It had to be easy to swap in fresh ASCII art. The credits were going to change that morning depending on who showed up. And it had to be extensible, as feature creep was almost a given. Funcparserlib made this flexibility easy to obtain.
The resulting bit of Python could read a couple of text files (one per "window") and spit out several thousand PNG frames, which were glued together with the soundtrack in mencoder. Fun fact: ffmpeg (my usual AV Swiss Army knife of choice) breaks if you try to make a video from ≈1000 still frames.
Those text files would contain a bunch of plain text with tags interspersed. Anything that was not a tag (i.e., normal text) would be rendered to the screen. All text was preceded and followed by timestamp tags, and the letters would be interpolated between those times.
[4.5] floats are absolute timestamps
[+0.5] relative timestamp
[++10.0] delay (insert dead air w/o changing all absolute timestamps)
[2] single ints are predefined palette colors
[6,7] double ints are x,y cursor position
[1,2,3] triple ints are rgb colors
[n] newline (newlines in the file are ignored)
[c] clear the screen
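Put together, an input file might look something like this (a made-up excerpt, not the real credits):

[c][2][1.0]BAPHL 2[n][+0.5]presents[++2.0]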
As far as grammars go, this is about as simple as it gets. There is no nesting or branching; the square bracket characters cannot even be escaped! Time to build a parser.
Be very careful when playing with funcparserlib in the Python interpreter. Because of how it uses closures, changes to functions don't propagate through.
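A minimal sketch of the trap (parser_a and parser_b are throwaway examples):

import funcparserlib.parser as p

parser_a = p.a('x')
parser_b = parser_a + p.a('y')  # parser_b captures the parser_a object itself

parser_a = p.a('z')  # rebinding the name does not touch the object inside parser_b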
Parser_b will still be using the old definition of parser_a! This screwed me up for so long. My advice for dealing with this is to keep your parser rules in a separate file: import the file into the REPL, play with the parsers, edit the file, and reload() the import to play with the changes.
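The loop looks something like this (credits_parser is a hypothetical module holding the rules):

>>> import credits_parser
>>> credits_parser.integer.parse('123')
123
>>> # ...edit credits_parser.py...
>>> reload(credits_parser)  # Python 2; use importlib.reload in Python 3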
import funcparserlib.parser as p

digits = p.oneplus(p.some(lambda c: c.isdigit()))
digits.parse('123') == ['1', '2', '3']
to_int = lambda ds: sum(int(d)*10**m for m,d in enumerate(reversed(ds)))
to_int(['1', '2', '3']) == 123
integer = digits >> to_int
integer.parse('123') == 123
to_dec = lambda ds: sum(int(d)*10**(-m-1) for m,d in enumerate(ds))
to_dec(['1', '2', '3']) == 0.12300000000000001 # close enough
decimal = integer + p.skip(p.a('.')) + (digits >> to_dec) >> sum
decimal.parse('123.456') == 123.456
numbers = p.many((decimal | integer) + p.skip(p.maybe(p.a(' '))))
numbers.parse('1 2 3 4.5 6 7.5') == [1, 2, 3, 4.5, 6, 7.5]
There will be heavy use of lambda functions when building simple parsers. The .parse() method is the primary interface to the parsers themselves; normally it will be called once, on the largest and most complete parser.
digits chomps through the stream character by character, collecting each character that returns .isdigit() == True.
to_int is not a parser, but a simple function which converts a list of digit characters into a whole integer.
integer combines these two, creating a (marginally) useful parser.
to_dec is based heavily on to_int, but designed for the digits on the right side of the decimal point.
decimal takes a leading integer, a decimal point, and a trailing decimal term. These are summed together to make a float.
numbers is almost a useful parser, capable of dealing with larger structures. But not that useful: a parser should at least be able to reject malformed data.
numbers.parse('1 2.5 3 Q 4.5') == [1, 2.5, 3] # something is missing
numbers2 = numbers + p.skip(p.finished)
numbers2.parse('1 2.5 3 Q 4.5')  # raises NoParseError: should have reached <EOF>: Q
By default, bad matches just kind of fall off the end; the parser gets derailed and returns partial results. Adding finished to the end forces it to acknowledge the error. Hitting the 'Q' breaks numbers out of its many loop. The next parser in the chain is looking for an end-of-file marker, and instead it sees the 'Q'. Not the best error report though. Anyone who has used a compiler in the past 20 years expects a line and column for where the parse error occurred. Funcparserlib won't do this on its own, which is probably why the original author is so fond of using Python's tokenize module as the first stage of parsing.
tokenize will report more useful (but cryptic) information about where the error occurred. Worry about this later.
stamp = p.skip(p.a('[')) + decimal + p.skip(p.a(']')) >> (lambda x:('time', x))
# p.skip(p.a()) is annoying
char = lambda c: p.skip(p.a(c))
stamp = char('[') + decimal + char(']') >> (lambda x:('time', x))
stamp.parse('[4.5]') == ('time', 4.5)
relstamp = char('[') + char('+') + decimal + char(']') >> (lambda x:('reltime', x))
# char() + char() is annoying
literal = lambda s: reduce(lambda x,y: x+y, map(char, s))  # Python 3 needs functools.reduce
relstamp = literal('[+') + decimal + char(']') >> (lambda x:('reltime', x))
relstamp.parse('[+0.5]') == ('reltime', 0.5)
# those lambdas are getting annoying
tag = lambda t: lambda x:(t, x)
delay = literal('[++') + decimal + char(']') >> tag('delay')
color = char('[') + integer + char(']') >> tag('color')
jump = char('[') + integer + char(',') + integer + char(']') >> tag('jump')
newline = literal('[n]') >> tag('newline')
clear = literal('[c]') >> tag('clear')
rgb = ... # very similar to jump, but longer
clear.parse('[c]') == ('clear', _Ignored(()))
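A few quick spot checks (these follow from the definitions above):

delay.parse('[++10.0]') == ('delay', 10.0)
color.parse('[2]') == ('color', 2)
jump.parse('[6,7]') == ('jump', (6, 7))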
Yes, all those parentheses around the lambdas are necessary. The order of operations with >> is a little odd. But refactoring the lambdas out removes the need for the parens.
The larger structure of the parser's output starts to appear. The raw text is converted into a series of tagged tuples, a crude abstract syntax tree. For the most part, the AST is just tuples, numbers and strings. With clear and newline a small bit of funcparserlib leaks out, but this AST will never be serialized, so it is simpler to ignore the _Ignored(()).
letter = p.some(lambda x: True) >> tag('text')
anything = stamp | relstamp | delay | color | jump | rgb | newline | clear | letter
everything = p.many(anything) + p.skip(p.finished)
return everything.parse(text_to_be_parsed)  # this line lives inside the program's parse function
everything.parse('[1.0]word[+0.5]') ==
[('time', 1.0),
('text', 'w'),
('text', 'o'),
('text', 'r'),
('text', 'd'),
('reltime', 0.5)]
Usually you'll want to glue individual letters together into words, just like the digits were assembled into numbers. In this application every character was an individual animation event anyway; assembling words only to break them up again later would have been silly.
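If words were needed, a sketch along the same lines as digits would do it:

word = p.oneplus(p.some(lambda c: c.isalpha())) >> ''.join
word.parse('word') == 'word'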
letter uses some(lambda x: True) to match any character. Since it is a catchall, it must come last in anything.
Parsers can be a bit of a pain to set up; five long lines just to duplicate the functionality of float()! After the slow start, the capability of the parser grows exponentially with each line, all while the individual lines remain short and simple. But this is true of all parsers. To recap:
- The REPL will ignore your tweaks.
- >> will eat lambda; use extra parentheses.
- Use >> to clean up your abstract syntax tree.
Now, go read the Brackets Tutorial and learn about recursive parsers.
One last warning about the some(lambda x: True) catchall: use it sparingly, and with care when you do. It will always successfully match anything, and so it prevents the parser from backtracking if something fails down the line.
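For example, with letter in the mix, a malformed tag never raises an error; it silently turns into plain text:

everything.parse('[4.5') == [('text', '['), ('text', '4'), ('text', '.'), ('text', '5')]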