Lossless compression (Python)

From LiteratePrograms
Jump to: navigation, search

Lossless (de)compression.


[edit] theory

We implement LZ78, an example of lossless compression via a dictionary coder. Modulo the instantiation of the dictionary, accomplished here via side effect, both encoding and decoding are anamorphic reductions.

[edit] implementation

In the compressed domain, the data is represented as a sequence of tokens, consisting of a word already in the dictionary followed by a novel character.

[edit] decoding

The decoder is straightforward: for each token, it produces the concatenation of the dictionary entry for the word and the subsequent character.

[dodict(d,d[w]+c) for (w,c) in ts]

En passant, it enters the token produced into the dictionary.

dodict = lambda d,v: d.__setitem__(len(d),v) or v

We need only initialize the dictionary and arrange to join the resulting list of decoded substrings.

def decode(ts):
	d,j = {0:''}, ''.join
	return j(decompress)

Question: in what monad are we computing?

[edit] encoding

The encoder is more complex. For each character, it checks the token pairing it with the word coding for the characters already seen. If this token is already in the dictionary, the information would be redundant for the decoder, so the token is only produced when it is novel.

[tok for c in cs for tok in [(w,c)]
     for w in [dodict(d,tok)] if not w] + [(w,'')]

Question: Why do we include the final (w,") term? Exercise: modify the entire expression so last term of the final output uses the final character of the input instead of potentially being (0,")

In checking a token, we either return its codeword, or reset the codeword to 0, and, en passant, enter the token into the dictionary.

dodict = lambda d,k: d.get(k) or d.__setitem__(k,len(d)) or 0

Of course, we must initialize the dictionary and arrange to start in the reset state.

def encode(cs):
	d,w = {0:''}, 0
	return compress

[edit] wrapping up

Finally, we include the boilerplate for a simple test case:

if __name__ == '__main__':
	test = "wow! how now brown plow cow"
	print encode(test)
	print decode(encode(test))

The output should be similar to:

[(0, 'w'), (0, 'o'), (1, '!'), (0, ' '), (0, 'h'), (2, 'w'), (4, 'n'), (6, ' '), (0, 'b'), (0, 'r'), (6, 'n'), (4, 'p'), (0, 'l'), (8, 'c'), (6, '')]
wow! how now brown plow cow
Download code