UUencode (C)

From LiteratePrograms


Overview

This is an implementation of UUencode, an encoding used to move binary files through text-only channels such as email or netnews. It originated on Usenet and is named after "uucp" (Unix-to-Unix copy), the program used on Unix systems that underlay the early Usenet implementations.

UUencoding was once the lingua franca of the Net, but it has weaknesses and has largely been replaced by other forms, most commonly Base64 encoding. The largest problem with uuencoded data is that the set of "printable" characters it uses includes some that do not pass reliably through character set translations or are otherwise prone to "mangling", including space, double quote ("), dollar sign ($), percent sign (%), ampersand (&), apostrophe ('), opening bracket ([), closing bracket (]), circumflex accent (^) and grave accent (`).

Encoding

The goal of uuencode encoding is to turn an 8-bit data stream into a sequence of standard ASCII characters which will not be altered in an undesirable way by newsreaders. It accomplishes this by breaking each sequence of three 8-bit bytes in the input into four 6-bit fields. The space character (ASCII 32) is then added to each field to produce the ASCII value of the final character:

Original characters       C               a               t
Original ASCII, decimal   67              97              116
ASCII, binary             0 1 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0
New decimal values        16          54          5           52
+32                       48          86          37          84
Uuencoded characters      0           V           %           T

Assuming the input characters are stored in an array b and the output in an array d, the corresponding code in C is:

<<encode variable definitions>>=
unsigned char b[3];
char *d;
int i;

<<extract uuencoded characters>>=
/* Split three 8-bit bytes into four 6-bit values. */
d[0] = (b[0] >> 2) & 0x3f;
d[1] = ((b[0] << 4) | ((b[1] >> 4) & 0x0f)) & 0x3f;
d[2] = ((b[1] << 2) | ((b[2] >> 6) & 0x03)) & 0x3f;
d[3] = b[2] & 0x3f;

/* Shift each value into the printable ASCII range. */
for (i=0; i<4; i++) {
    d[i] += ' ';
    <<other character processing>>
}

Since newsreaders sometimes strip whitespace characters such as spaces, we prefer to encode the value 0 as a grave accent ('`', ASCII 96) rather than as a space (' ', ASCII 32). The two characters differ by exactly 64, so once the decoder subtracts 32 and keeps the low-order 6 bits, both yield the same value:

<<other character processing>>=
if (d[i] == ' ')
    d[i] = '`';
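
As a cross-check of the table and chunks above, the same steps can be gathered into one stand-alone function. This is only a sketch: the name uu_encode_triplet and its signature are hypothetical and are not part of the program being built here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the program above): encode up to three
   input bytes from `in` into four printable characters in `out`.
   Missing bytes (n < 3) are treated as zero. */
void uu_encode_triplet(const unsigned char *in, size_t n, char out[4])
{
    unsigned char b[3] = {0, 0, 0};
    size_t i;

    assert(in != NULL && n <= 3);
    for (i = 0; i < n; i++)
        b[i] = in[i];

    /* Four 6-bit fields taken from the 24 input bits. */
    out[0] = (b[0] >> 2) & 0x3f;
    out[1] = ((b[0] << 4) | (b[1] >> 4)) & 0x3f;
    out[2] = ((b[1] << 2) | (b[2] >> 6)) & 0x3f;
    out[3] = b[2] & 0x3f;

    for (i = 0; i < 4; i++) {
        out[i] += ' ';            /* shift into the printable range */
        if (out[i] == ' ')
            out[i] = '`';         /* avoid emitting a literal space */
    }
}
```

Called with the three bytes of "Cat", this fills out with the characters 0, V, % and T, matching the table.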

By convention, each line encodes no more than 45 input bytes (using 60 characters), and only the last line may be shorter. Each line begins with a length character whose value is the number of input bytes encoded by that line plus 32 (the space character). All but the last line therefore begin with "M" (ASCII 77), because 45 + 32 = 77.
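
The arithmetic behind the length character and the line size can be written out as two small functions. Both names are hypothetical, and mapping a length of zero to a grave accent is an assumption that follows the space-avoidance convention described earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length character for a line carrying n input bytes.
   Assumption: a zero length is written as '`' rather than as a space. */
char uu_length_char(size_t n)
{
    assert(n <= 45);              /* conventional per-line limit */
    return n == 0 ? '`' : (char)(n + ' ');
}

/* Hypothetical helper: number of encoded characters that follow the
   length character on a line carrying n input bytes. */
size_t uu_encoded_chars(size_t n)
{
    return 4 * ((n + 2) / 3);     /* each group of up to 3 bytes becomes 4 characters */
}
```

For a full line, uu_length_char(45) gives 'M' and uu_encoded_chars(45) gives 60; a 3-byte line starts with '#' (3 + 32 = 35).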


Since the line-length appears at the beginning of the line, and as we can't predict when the end of input will occur, we need to use a buffer to save up each output line prior to writing it:

<<encode variable definitions>>=
size_t num_read;
size_t num_bytes_in_line;
char outbuf[1 + MAX_OUTPUT_CHARS_PER_LINE + 1];
int outbufpos;

Since the C library's fread() routine is typically implemented with a buffer behind it, we take the liberty of reading the input only 3 bytes at a time, encoding them to 4 characters, and accumulating up to 60 characters into a single output line:

<<output lines>>=
/* Use maximum length character for all lines except the last line */
outbufpos = 1;
num_bytes_in_line = 0;
while((num_read = fread(b, sizeof(unsigned char), 3, stdin)) > 0) {
    num_bytes_in_line += num_read;
    /* Pad a short final group with zero bytes. */
    for(i=num_read; i<3; i++)
        b[i] = '\0';

    d = &outbuf[outbufpos];
    <<extract uuencoded characters>>
    outbufpos += 4;
    outbuf[outbufpos] = '\0';

    if(outbufpos == 1 + MAX_OUTPUT_CHARS_PER_LINE) {
        /* Set length character for a full line and write it out. */
        outbuf[0] = ((char)num_bytes_in_line) + ' ';    /* == MAX_INPUT_BYTES_PER_LINE + ' ' == 45 + 32 == 77 == 'M' */
        puts(outbuf);
        /* Start the next output line. */
        outbufpos = 1;
        num_bytes_in_line = 0;
    }
}
/* Set length character for final line and write it out. */
if(num_bytes_in_line > 0) {
    outbuf[0] = ((char)num_bytes_in_line) + ' ';
    puts(outbuf);
}
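
For experimenting with the line format outside the read loop, a whole line can also be built from a buffer that is already in memory. The following is a sketch under stated assumptions: uu_encode_line and UU_MAX_IN are hypothetical names, and the function is an alternative to the streaming code above, not a piece of it.

```c
#include <assert.h>
#include <stddef.h>

#define UU_MAX_IN 45   /* assumed per-line limit, matching the convention above */

/* Hypothetical helper: encode n (<= 45) bytes as one NUL-terminated line,
   without the trailing newline. `out` needs room for 1 + 60 + 1 characters.
   Returns the number of characters written, excluding the NUL. */
size_t uu_encode_line(const unsigned char *in, size_t n, char *out)
{
    size_t i, pos = 0;
    int k;

    assert(in != NULL && out != NULL && n <= UU_MAX_IN);
    out[pos++] = n ? (char)(n + ' ') : '`';      /* length character */

    for (i = 0; i < n; i += 3) {
        /* Pad a short final group with zero bytes. */
        unsigned b0 = in[i];
        unsigned b1 = (i + 1 < n) ? in[i + 1] : 0;
        unsigned b2 = (i + 2 < n) ? in[i + 2] : 0;
        unsigned v[4];

        v[0] = b0 >> 2;
        v[1] = ((b0 << 4) | (b1 >> 4)) & 0x3f;
        v[2] = ((b1 << 2) | (b2 >> 6)) & 0x3f;
        v[3] = b2 & 0x3f;

        /* Add 32, writing the value 0 as '`' instead of a space. */
        for (k = 0; k < 4; k++)
            out[pos++] = v[k] ? (char)(v[k] + ' ') : '`';
    }
    out[pos] = '\0';
    return pos;
}
```

Encoding the three bytes of "Cat" this way produces the line #0V%T: the length character '#' (3 + 32) followed by the four characters from the table.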

Although it is often seen as a data encoding algorithm, UUencode was originally intended for encoding entire files. The format therefore includes both start-of-file and end-of-file sequences, which make partial files easy to detect. Because it began life on Unix systems, the start-of-file sequence includes both the file name and the Unix permissions of the file (expressed in the traditional numeric form), allowing it to be restored to disk intact. This is especially important for executable programs, which depend on specific bits in the permissions to indicate that they can be run.
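
To make the start-of-file sequence concrete, the sketch below formats it into a caller-supplied buffer. The function name uu_format_begin and its error convention are hypothetical; only the "begin", octal mode, file name layout comes from the format itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format the start-of-file line into buf.
   Returns 0 on success, -1 if buf is too small. */
int uu_format_begin(char *buf, size_t size, unsigned mode, const char *name)
{
    assert(buf != NULL && name != NULL);
    /* "begin " (6) + up to 4 octal digits + " " + name + "\n" + NUL */
    if (strlen(name) + 6 + 4 + 1 + 1 + 1 > size)
        return -1;
    /* Keep only the permission bits so the mode fits in 4 octal digits. */
    sprintf(buf, "begin %o %s\n", mode & 07777u, name);
    return 0;
}
```

With mode 0644 and the name cat.txt this yields "begin 644 cat.txt". A conventional uuencoded file holding the three bytes "Cat" then reads as follows, where the line containing only a grave accent is a zero-length data line that precedes "end":

    begin 644 cat.txt
    #0V%T
    `
    end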

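<<encode variable definitions>>=
int mode = 0;
char *name;

<<write encoded file>>=
printf("begin %o %s\n", mode, name);
<<output lines>>
/* A zero-length line ("`") followed by "end" marks the end of the file. */
printf("`\nend\n");

Wrapping it all up, we have a traditional C main() routine that extracts the permissions and filename from its arguments and encodes the input data from standard input. If not specified, the permissions default to 0644 (i.e., owner read and write, group read, and world read) and the filename defaults to "uufile".

<<uuencode.c>>=
#include <stdio.h>
#include <stdlib.h>

#define MAX_INPUT_BYTES_PER_LINE 45
#define MAX_OUTPUT_CHARS_PER_LINE 60

int main(int argc, char *argv[])
{
    <<encode variable definitions>>

    /* First argument, if any: permissions in octal. */
    if((++argv)[0]) {
        mode=strtol((argv++)[0], NULL, 8);
    } else mode=0644;
    /* Next argument, if any: file name to record in the "begin" line. */
    name = argv[0] ? argv[0] : "uufile";

    <<write encoded file>>

    return EXIT_SUCCESS;
}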

Decoding

The goal of UUencode decoding (sometimes called "uudecoding") is to recover an 8-bit data stream from a sequence of standard ASCII characters produced by a uuencode encoding operation. It accomplishes this by breaking the input up into groups of four characters, subtracting the space character (ASCII 32) from each and keeping the low-order 6 bits. The resulting 24 bits are then concatenated and split into three 8-bit bytes. This set of operations is repeated for each block of four characters until the entire input is consumed.

Uuencoded characters      0           V           %           T
Decimal values            48          86          37          84
-32                       16          54          5           52
Binary                    0 1 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0
Original ASCII, decimal   67              97              116
Original characters       C               a               t

Assuming the input characters are stored in an array pointed to by pointer p and the output bytes are to be written into an array bytes, the corresponding code in C is as follows. This code takes advantage of the fact that space (ASCII 32, binary 00100000) and grave accent (ASCII 96, binary 01100000) both leave 000000 in the low-order 6 bits once 32 is subtracted, and are therefore interchangeable. Programs that create uuencoded files (including the implementation above) often replace spaces with grave accents because many text-based communication channels do not preserve spaces well.

<<decode variable definitions>>=
    unsigned char bytes[3];
    char *p;

<<extract uudecoded bytes>>=
    /* Rebuild three 8-bit bytes from the low 6 bits of four characters. */
    bytes[0]=(((p[0]-' ')&0x3f)<<2) + (((p[1]-' ')&0x30)>>4);
    bytes[1]=(((p[1]-' ')&0x0f)<<4) + (((p[2]-' ')&0x3c)>>2);
    bytes[2]=(((p[2]-' ')&0x03)<<6) + ((p[3]-' ')&0x3f);
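
The reverse mapping can be checked against the table with a stand-alone function. As before, this is a sketch: uu_decode_quad is a hypothetical name, and it masks every character to 6 bits before combining them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode four uuencoded characters into three bytes. */
void uu_decode_quad(const char *p, unsigned char out[3])
{
    unsigned c0, c1, c2, c3;

    assert(p != NULL && out != NULL);
    /* Subtract the space character and keep the low-order 6 bits,
       so that ' ' and '`' both decode to 0. */
    c0 = (unsigned)(p[0] - ' ') & 0x3f;
    c1 = (unsigned)(p[1] - ' ') & 0x3f;
    c2 = (unsigned)(p[2] - ' ') & 0x3f;
    c3 = (unsigned)(p[3] - ' ') & 0x3f;

    out[0] = (unsigned char)((c0 << 2) | (c1 >> 4));
    out[1] = (unsigned char)(((c1 & 0x0f) << 4) | (c2 >> 2));
    out[2] = (unsigned char)(((c2 & 0x03) << 6) | c3);
}
```

Decoding the four characters 0, V, % and T returns the bytes 67, 97 and 116, that is, "Cat".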

Each line begins with a single character that encodes the 6-bit length of the decoded data. The length is coded by adding the space character (ASCII 32) to it, so a line can carry at most 63 bytes of decoded data, although by convention the limit is 45. The encoded data immediately follows the length character, and any remaining characters on the line are ignored. Some encoding programs take advantage of this and include a trailing grave accent (ASCII 96) character to make the right-hand edge of the lines visually obvious to a human reader.
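
The line layout just described can be sketched as a function that decodes one line into memory. The name uu_decode_line is hypothetical, and the check that the line holds enough characters is an added assumption: it is stricter than the format requires and would reject lines from which an intermediate program has stripped trailing spaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: decode one data line into out (room for 63 bytes).
   Returns the number of bytes decoded, or -1 if the line is too short
   for the length it claims. Extra trailing characters are ignored. */
int uu_decode_line(const char *line, unsigned char *out)
{
    size_t n, i, have;
    const char *p;

    assert(line != NULL && out != NULL);
    if (line[0] == '\0')
        return -1;
    n = (size_t)(line[0] - ' ') & 0x3f;      /* claimed length, 0..63 */
    p = line + 1;
    have = strcspn(p, "\r\n");               /* characters before the line ending */
    if (have < 4 * ((n + 2) / 3))
        return -1;

    for (i = 0; i < n; i += 3, p += 4) {
        unsigned c0 = (unsigned)(p[0] - ' ') & 0x3f;
        unsigned c1 = (unsigned)(p[1] - ' ') & 0x3f;
        unsigned c2 = (unsigned)(p[2] - ' ') & 0x3f;
        unsigned c3 = (unsigned)(p[3] - ' ') & 0x3f;
        unsigned char b[3];
        size_t k;

        b[0] = (unsigned char)((c0 << 2) | (c1 >> 4));
        b[1] = (unsigned char)(((c1 & 0x0f) << 4) | (c2 >> 2));
        b[2] = (unsigned char)(((c2 & 0x03) << 6) | c3);
        /* The last group may carry fewer than three real bytes. */
        for (k = 0; k < 3 && i + k < n; k++)
            out[i + k] = b[k];
    }
    return (int)n;
}
```

For the line #0V%T the function returns 3 and stores "Cat"; for a line holding only a grave accent it returns 0.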

<<decode variable definitions>>=
    char line[128];
    size_t len;
    FILE *fp=stdout;
    size_t i;
    size_t wlen;

<<decode line>>=
    len=(line[0]-' ')&0x3f;
    p=line+1;    /* encoded data starts after the length character */

    for(i=0; i<len; i+=3, p+=4) {
        <<extract uudecoded bytes>>
        /* The last group may carry fewer than three real bytes. */
        if(len-i<3) wlen=len-i;
        else wlen=3;
        if(fwrite(bytes, sizeof(char), wlen, fp)!=wlen) {
            fprintf(stderr, "ERROR: Write error\n");
            return EXIT_FAILURE;
        }
    }

The data lines are bracketed by start-of-file and end-of-file sequences, so we read and decode the "begin" line, then process all the lines we get until we read the "end" line:

<<decode file>>=
    if(!fgets(line, 128, stdin)) {
        fprintf(stderr, "ERROR: Reading input stream\n");
        return EXIT_FAILURE;
    }
    <<parse begin line>>
    while((fgets(line, 128, stdin))) {
        if(!strncmp(line, "end", 3)) break;
        <<decode line>>
    }

The start-of-file marker includes the Unix permissions of the file (expressed in the traditional numeric form) and the file name, so we need to parse them out. A more complete implementation would apply the permissions to the output file; here the mode is parsed and then ignored. If a file name is present, the decoded bytes are written to that file; otherwise they go to standard output.

<<decode variable definitions>>=
    int mode;
    char *token;

<<parse begin line>>=
    if(!(token=strtok(line, " \t\r\n")) || strcmp(token, "begin")) {
        fprintf(stderr, "ERROR: Bad format - missing \"begin\"\n");
        return EXIT_FAILURE;
    }

    if(!(token=strtok(NULL, " \t\r\n")) || !isdigit((unsigned char)token[0])) {
        fprintf(stderr, "ERROR: Bad format - missing mode specifier\n");
        return EXIT_FAILURE;
    }
    mode=strtol(token, NULL, 8);
    (void)mode;    /* parsed for validation only; not applied to the output file */

    if((token=strtok(NULL, " \t\r\n"))) {
        if(!(fp=fopen(token, "wb"))) {
            fprintf(stderr, "ERROR: Cannot open file %s\n", token);
            return EXIT_FAILURE;
        }
    }

Wrapping it all up, we have a traditional C main() routine that reads the encoded character stream from standard input and writes the decoded bytes to the file named in the "begin" line, or to standard output if no name is given:

<<uudecode.c>>=
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

int main(int argc, char *argv[])
{
    <<decode variable definitions>>

    <<decode file>>

    if(fp!=stdout) fclose(fp);

    return EXIT_SUCCESS;
}
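
The "begin" line handled in this section could also be parsed in a single call to sscanf(). The sketch below is an alternative, not the method used by the program: uu_parse_begin is a hypothetical name, and unlike the strtok()-based code it requires a file name to be present and accepts names that contain spaces.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse "begin <octal mode> <name>".
   `name` must have room for 256 characters.
   Returns 0 on success, -1 if the line does not match. */
int uu_parse_begin(const char *line, unsigned *mode, char *name)
{
    assert(line != NULL && mode != NULL && name != NULL);
    /* %o reads the octal mode; the scanset reads the rest of the line
       up to, but not including, the line ending. */
    return sscanf(line, "begin %o %255[^\r\n]", mode, name) == 2 ? 0 : -1;
}
```

Given the line "begin 644 cat.txt", the function stores 0644 in mode and "cat.txt" in name; a line that starts with anything other than "begin" is rejected.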

Building

The following make control file will build both the uuencode and uudecode programs by default, or can build either one independently on request.

all: uuencode uudecode

uuencode: uuencode.o
	cc -o uuencode -Wall -pedantic -ansi uuencode.o

uuencode.o: uuencode.c
	cc -o uuencode.o -Wall -pedantic -ansi -c uuencode.c

uudecode: uudecode.o
	cc -o uudecode -Wall -pedantic -ansi uudecode.o

uudecode.o: uudecode.c
	cc -o uudecode.o -Wall -pedantic -ansi -c uudecode.c