Hash function comparison (C, sh)

From LiteratePrograms
Jump to: navigation, search

This program is under development.
Please help to debug it. When debugging
is complete, remove the {{develop}} tag.


In this article, we compare how some simple hash functions perform. More specifically, we compare the number of collisions for different hash table sizes and the speed.

Contents

[edit] The program

This program will read a list of words, one for each line, from stdin and print the corresponding hash values. The hash function and table size are specified as command line arguments.

<<hashcmp.c>>=

#include<stdio.h>
#include<string.h>
#include<stdlib.h>

typedef unsigned long hash_t;

hashfuncs

struct hashfunc_s {
	char *name;
	hash_t (*func)(const unsigned char *key);
} funcs[]={
	{"add", add_hash},
	{"xor", xor_hash},
	{"sax", sax_hash},
	{"sdbm", sdbm_hash},
	{"bernstein", bernstein_hash}
};

int main(int argc, char *argv[])
{
	char line[256];
	char *p;
	hash_t (*hash)(const unsigned char *key)=NULL;
	hash_t hsize;
	int i;

	if(argc>1) {
		for(i=0; i<sizeof funcs/sizeof(struct hashfunc_s); ++i) {
			if(!strcmp(argv[1], funcs[i].name)) {
				hash=funcs[i].func;
				break;
			}
		}
		if(!funcs) {
			fprintf(stderr, "ERROR: Could not find the %s hash function\n", argv[1]);
			exit(EXIT_FAILURE);
		}
		
	} else hash=funcs[0].func;

	if(argc>2) hsize=atol(argv[2]);
	else hsize=32768;

	while(fgets(line, 256, stdin)) {
		if((p=strchr(line, '\r'))) *p='\0';
		if((p=strchr(line, '\n'))) *p='\0';

		printf("%lu\n", hash((unsigned char *)line)%hsize);
	}
	
	return EXIT_SUCCESS;
}
<<Makefile>>=
all: hashcmp

hashcmp: hashcmp.c
	cc -o hashcmp -Wall -ansi -pedantic hashcmp.c

[edit] Functions

The hash functions take one argument, a '\0'-terminated string, and return the calculated hash value. This value is later redused to the hash table size before it is printed.

[edit] Additive

This function simply sums up the values of each character in the string.

<<add_hash>>=
hash_t add_hash(const unsigned char *key)
{
	hash_t h=0;
	while(*key) h+=*key++;
	return h;
}

[edit] XOR

This function applies bitwise exclusive-or on all the characters. Note that the generated hash value will never be larger than UCHAR_MAX (normally 255) with this method. If key only consists of ASCII values, it will be maximum 127. So this function is a bad choice for large hash tables.

<<xor_hash>>=
hash_t xor_hash(const unsigned char *key)
{
	hash_t h=0;
	while(*key) h^=*key++;
	return h;
}

[edit] Shift-add-XOR

This function combines the first two functions (add and XOR) with a shift operation.

<<sax_hash>>=
hash_t sax_hash(const unsigned char *key)
{
	hash_t h=0;
	while(*key) h^=(h<<5) + (h>>2) + *key++;
	return h;
}

[edit] SDBM

This is an implementation of the [sdbm hash function].

<<sdbm_hash>>=
hash_t sdbm_hash(const unsigned char *key)
{
	hash_t h=0;
	while(*key) h=*key++ + (h<<6) + (h<<16) - h;
	return h;
}

[edit] Bernstein

This is an efficient hash function created by Daniel J. Bernstein.

<<bernstein_hash>>=
hash_t bernstein_hash(const unsigned char *key)
{
	hash_t h=0;
	while(*key) h=33*h + *key++;
	return h;
}


<<hashfuncs>>=
add_hash
xor_hash
sax_hash
sdbm_hash
bernstein_hash

[edit] Collisions

To check for collisions, we use the following shell script. The file words is created by running shuffle -f /usr/share/dict/words >words on a NetBSD system. The generated file can be downloaded here.

If the parameter max is specified, the output is the size of the "biggest" collision (most keys giving the same hash value). With the avg parameter, the average collission size is calculated. Without a parameter, the output is the total number of keys involved in collsions.

<<collcmp.sh>>=
#!/bin/sh


case "$1" in
	"max")
		AWKCMD='{if($1>max) max=$1} END {print max?max:0}'
		FMT=" %7s"
	;;
	"avg")
		AWKCMD='{sum+=$1} END {print sum/NR}'
		FMT=" % 7.1f"
	;;
	*)
		AWKCMD='{if($1>1) ncoll+=$1} END {print ncoll?ncoll:0}'
		FMT=" %7s"
esac
	
echo "      func:size  10      50     100     500    1000    5000   10000   50000  100000  200000"
for hashfunc in add xor sax sdbm bernstein ; do
	printf "% 10s:" "$hashfunc"
	for size in 10 50 100 500 1000 5000 10000 50000 100000 200000; do
		head -n $size <words |
			./hashcmp $hashfunc $size |
			sort -n | 
			uniq -c | 
			awk "$AWKCMD" |
			xargs printf "$FMT"
	done
	echo
done

[edit] Output

./collcmp.sh
      func:size  10      50     100     500    1000    5000   10000   50000  100000  200000
       add:       7      27      64     317     700    4639    9719   49799   99823  199810
       xor:       7      34      84     478     980    5000   10000   50000  100000  200000
       sax:       4      28      71     311     623    3175    6321   31651   63180  126568
      sdbm:       4      30      56     317     627    3218    6239   31537   63242  126320
 bernstein:       6      31      67     309     608    3164    6312   31835   63093  126034
./collcmp.sh avg
      func:size  10      50     100     500    1000    5000   10000   50000  100000  200000
       add:     2.0     1.5     1.6     1.6     1.8     4.2     7.1    27.8    51.5    95.6
       xor:     1.7     1.8     2.4     5.3     8.8    39.1    78.1   390.6   781.2  1562.5
       sax:     1.2     1.5     1.8     1.6     1.6     1.6     1.6     1.6     1.6     1.6
      sdbm:     1.2     1.5     1.5     1.6     1.6     1.6     1.6     1.6     1.6     1.6
 bernstein:     1.4     1.6     1.6     1.6     1.6     1.6     1.6     1.6     1.6     1.6
./collcmp.sh max
      func:size  10      50     100     500    1000    5000   10000   50000  100000  200000
       add:       4       4       5       6       8      22      34     148     277     543
       xor:       3       4       7      16      24      89     167     768    1488    2915
       sax:       2       3       4       5       5       7       8       7       8       8
      sdbm:       2       4       3       4       6       6       6       8       7       7
 bernstein:       2       4       4       6       6       6       6       6       8      10

[edit] Speed

The following script compares the time usage for the different hash functions.

<<timecmp.sh>>=
#!/bin/sh

cat words words words words words words words words words words >words10
for hashfunc in add xor sax sdbm bernstein ; do
	printf "% 10s: " "$hashfunc"
	time ./hashcmp $hashfunc <words10 >/dev/null
done


[edit] Output

       add:         5.04 real         4.88 user         0.05 sys
       xor:         4.06 real         3.93 user         0.05 sys
       sax:         6.16 real         5.90 user         0.08 sys
      sdbm:         6.08 real         5.87 user         0.09 sys
 bernstein:         6.14 real         5.92 user         0.09 sys

[edit] External links

Download code
hijacker
hijacker
hijacker
hijacker