Word count (Python, functional)

From LiteratePrograms
Jump to: navigation, search
Other implementations: Assembly Intel x86 Linux | C | C++ | Haskell | J | Lua | OCaml | Perl | Python, functional | Rexx

This is not an implementation of the UNIX wc tool, in Python, for which see Word count (Python).

Instead, it is a response to a paper of Gibbons, as discussed on Lambda the Ultimate, which makes the claim that three different wordcount programs might all have arisen from the same high-level design, namely the composition length \circ words. This being a "Word count" program, we favor renaming the composition to count \circ words.

<<wc>>=
wc=lambda s: count(words(s))

[edit] theory

The solutions given by Gibbons follow the normal Algol style (as represented here by the C, Haskell, Perl, and Lua examples) of iterating over an input stream while incrementing a running count. Even the Python example, although it treats whole lines as units, has a for lurking in the comprehension defining the word count for the entire input.

In this case, we make yet another design decision: by placing an ordering on character classes (nonblank < blank), we can replace Algol-style folds with APL-style whole-array operations.

[edit] implementation

Within a word, the classification increases monotonically, so the crux of the program is to flag all the spots where the classification decreases — where a nonblank character follows a blank. We avoid iteration by comparing the entire boolean array with a shifted version of itself: Wcpythonfunctional.png

<<defns>>=
# Gibbons' unlengths are nondecreasing runs, eg 00001111
drops=lambda bs: bs[1:] < bs[:-1]

Blanks are easily found — the expression blank = space \or tab \or linefeed translates straightforwardly:

<<defns>>=
blank=lambda cs: (cs==' ')+(cs=='\t')+(cs=='\n')

Now we have a straight-line definition for words: drops \circ blank:

(Question: how do the spaces prepended to the argument help avoid conditional branching? at which arguments do they make a difference?)

<<defns>>=
chars=lambda cs: array(tuple(cs))
words=lambda cs: drops(blank(chars('  '+cs)))

[edit] wrapping up

We still need a definition for count, but numpy.sum provides a suitable implementation:

<<defns>>=
count=sum

Finally, we include the boilerplate:

<<wcfa.py>>=
from numpy import array, sum
defns
wc
if __name__=="__main__":
        import sys
        print wc(sys.stdin.read())
Download code
hijacker
hijacker
hijacker
hijacker