Generators and iterators form the backbone of how Python deals with infinite lists of data. Computer science folk might find this a little dull, but the infinite list is the data structure that matters most to the non-theoretical branches of science and engineering. Why an infinite list? These fields collect data through experiments. The data takes the form of a list, one measurement after another, extended until the experiment is shut down. Sometimes the experiment is never shut down, and results must be calculated while new data is still streaming in. From the smallest digital signal processor in an MP3 player to the largest force sensor embedded in a bridge's roadbed, infinite lists are the norm.
Generators and iterators are Python's means of handling infinite sets of data. As of Python 3, they are the default: many built-in operations return lazy objects rather than lists.
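A quick illustration of that default, in Python 3, where range() and zip() both hand back lazy objects:

```python
# In Python 3, these builtins return lazy objects, not lists.
r = range(10 ** 12)        # built instantly; no trillion-element list in memory
z = zip("abc", [1, 2, 3])  # an iterator: pairs appear only on demand

print(type(r).__name__)    # range
print(next(z))             # ('a', 1)
```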
An excellent feature of Python's generators is constant memory usage. A short generator and an infinite generator occupy the same small amount of memory, because a generator stores none of the values it produces. Once a value is produced, it is thrown away: you cannot look into a generator's past or future. This article is about turning the limitations of generators to your advantage.
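You can check the constant-size claim directly. A sketch in Python 3, comparing two generator expressions and the equivalent list:

```python
import sys

short = (i for i in range(10))
endless = (i for i in range(10 ** 6))

# Both generator objects are the same small size: neither stores its values.
print(sys.getsizeof(short) == sys.getsizeof(endless))  # True
# The equivalent list really does pay for every element it holds.
print(sys.getsizeof(list(range(10 ** 6))) > 10 ** 6)   # True
```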
Let's start practical. You've got a list of data points, and you need to find the two-point moving average. As groundwork, first loop through the data and print the average of each pair of neighboring points.
>>> data_list = range(10)
>>> for i in range(len(data_list)):
...     print (data_list[i] + data_list[i-1]) * 0.5,
4.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
If you want to be clever, you might try list slicing. This works by making two copies of the original list. One copy is missing the first element, the other copy lacks the last element.
zip() lines up the two lists and marches through them simultaneously.
>>> data_list = range(10)
>>> old = data_list[:-1]
>>> now = data_list[1:]
>>> for n, o in zip(now, old):
...     print (n + o) * 0.5,
0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
Amusing bugs caused by negative list indices aside, neither version is acceptable. Both require the entire list in memory, so they will not work on streams of data produced in real time, and they will choke on several gigabytes of historical data. A better method involves caching:
>>> data_list = xrange(10)
>>> previous = 0
>>> for datum in data_list:
...     print (datum + previous) * 0.5,
...     previous = datum
0.0 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
This is ugly, but it gets the job done.
xrange() works as expected. Notice how I've been printing data instead of building the results into a new list? We are going to make this example into a generator by simply replacing the print with a yield.
def moving_average(data_list):
    previous = 0
    for datum in data_list:
        yield (datum + previous) * 0.5
        previous = datum

>>> list(moving_average(xrange(10)))
[0.0, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]
This is not the typical use case, however. Wrapping the generator in
list() removes any benefits. A more practical case would apply the moving average to cut through sensor noise.
for average in moving_average(noisy_sensor):
    if average > safe_value:
        alert("Sensor reads dangerous values.")
There are a few problems. The name data_list is now misleading: lists, tuples, generators, and iterators can all be taken as input. But naming conventions are the least of our problems. This code is brittle. What do you do when you need to support moving averages of three or more points?
If you already know and love the itertools module, feel free to skip this section. If you do not, read on.
Itertools provides high-level operations for infinite lists. There are also a bunch of awesome mathy parts, such as itertools.permutations(). We'll be using count(), islice(), and tee().
count() is pretty simple. It acts like
xrange(0, infinity). You can specify any starting point, but it will always keep on counting up.
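A quick sketch of count() in action (trimmed with islice(), covered next, since printing an infinite list takes a while):

```python
from itertools import count, islice

# count(5) yields 5, 6, 7, ... forever; islice() takes just the first four.
print(list(islice(count(5), 4)))  # [5, 6, 7, 8]
```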
islice() is a bit more tricky. It has similar semantics to slice(), which underlies Python's list[start:stop:interval] notation. All of these snippets do the same thing: produce the even numbers between 4 and 16.
range(20)[4:16:2]
range(20)[slice(4, 16, 2)]

from itertools import islice, count
islice(count(), 4, 16, 2)
What makes islice() really different is that the ending bound is optional: just set it to None. To produce every even number from four up, use islice(count(), 4, None, 2).
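Such an open-ended slice is still lazy, so you can peel off exactly as many values as you need. A sketch:

```python
from itertools import count, islice

evens = islice(count(), 4, None, 2)     # 4, 6, 8, ... forever
print([next(evens) for _ in range(5)])  # [4, 6, 8, 10, 12]
```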
tee() is more complicated. It spawns iterators which are linked back to the original iterator. Basically, it lets you loop over the same iterator multiple times, at the same time. For example, say you need both the moving average and the moving deviation of noisy_sensor. This is wrong:
from itertools import izip

averages = moving_average(noisy_sensor)
deviations = moving_deviation(noisy_sensor)
for ave, dev in izip(averages, deviations):
    if ave > safe_value:
        alert("Sensor reads dangerous values.")
    if dev < expected_deviation:
        alert("Sensor may be disconnected.")
The problem here is that moving_average() and moving_deviation() are both consuming values from the generator noisy_sensor. Each will take roughly half the values, and none will be shared.
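The effect is easy to see with a bare iterator: anything that pulls from the same iterator twice splits the values rather than sharing them. A sketch in Python 3:

```python
it = iter(range(6))

# zip() calls next(it) twice per pair, so the "two" streams interleave:
print(list(zip(it, it)))  # [(0, 1), (2, 3), (4, 5)]
```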
tee() fixes this.
from itertools import tee

copy1, copy2 = tee(noisy_sensor, 2)
averages = moving_average(copy1)
deviations = moving_deviation(copy2)
...
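A minimal check (Python 3 spelling) that tee() really hands out independent copies:

```python
from itertools import tee

a, b = tee(iter(range(5)), 2)
print(list(a))  # [0, 1, 2, 3, 4]
print(list(b))  # [0, 1, 2, 3, 4] -- b is unaffected by draining a
```

One caveat: tee() buffers every value until the slowest copy has consumed it, so letting one copy race far ahead of the other costs memory.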
Okay, so now that the basics are out of the way, here is a code snippet I use frequently. It provides a moving window over a stream of data.
from itertools import tee, izip, islice, count

def moving(iterator, length, step=1):
    ms = tee(iterator, length)
    return izip(*[islice(m, i, None, step) for m, i in zip(ms, count())])

>>> for i in moving(xrange(10), 3):
...     print i
(0, 1, 2)
(1, 2, 3)
(2, 3, 4)
(3, 4, 5)
(4, 5, 6)
(5, 6, 7)
(6, 7, 8)
(7, 8, 9)

# step makes it very flexible
>>> for i in moving(xrange(9), 3, 3):
...     print i
(0, 1, 2)
(3, 4, 5)
(6, 7, 8)
From there it is easy (although inefficient) to apply the average or deviation to each chunk. It works just like the earlier example where the list was copied, sliced, and zipped together, only with iterator-friendly operations.
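As a sketch in Python 3 spelling (where zip is already lazy, so izip is unnecessary), here is the same window function with the average applied to each chunk:

```python
from itertools import tee, islice, count

def moving(iterator, length, step=1):
    # One tee'd copy per window position, each shifted by its index.
    ms = tee(iterator, length)
    return zip(*[islice(m, i, None, step) for m, i in zip(ms, count())])

# Average each three-point window of 0..9:
print([sum(w) / len(w) for w in moving(iter(range(10)), 3)])
# [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```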
Regarding efficiency, simple operations such as moving averages can be improved tremendously using double-ended queues, or deques. Oddly enough, it is pronounced "deck". Deques are another friend of the engineer, as they make a simple FIFO buffer. Lists can be used as a crude FIFO, but they do not scale well: adding new elements with .append() is O(1), but using .pop(0) to remove the oldest element is O(n). For deques, both operations are O(1). Here's how to compute a moving average using a deque.
from collections import deque
from itertools import islice

def moving_average(iterator, length):
    d = deque(islice(iterator, 0, length))
    s = float(sum(d))
    yield s / length
    for i in iterator:
        s -= d.popleft()
        s += i
        d.append(i)
        yield s / length
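To double-check the deque approach, here is a self-contained Python 3 run on a small hypothetical input standing in for a sensor stream:

```python
from collections import deque
from itertools import islice

def moving_average(iterator, length):
    # Prime the window, then slide it: drop the oldest value, add the newest.
    d = deque(islice(iterator, 0, length))
    s = float(sum(d))
    yield s / length
    for i in iterator:
        s -= d.popleft()
        s += i
        d.append(i)
        yield s / length

# Two-point averages of 0..9; note there is no bogus first value this time.
print(list(moving_average(iter(range(10)), 2)))
# [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]
```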
This is O(length) to start up, and O(1) for each subsequent average. Feel free to get in touch with me if you need something more exotic.