# Factorials with prime factorization (C)

Other implementations: C | Python

The nth factorial, n!, can be computed straightforwardly by multiplying together the range of integers 1, 2, ... n from the bottom up. However, if n is large, this becomes inefficient. The k:th subproduct has roughly k log k digits, so multiplying by k+1 (assumed to fit in a single machine word) takes k log k time. This leads to a total time of O(n2 log n) for calculating n!.

Many algorithms exist for splitting a product of many factors into smaller pieces to balance the size of each subproduct. However, there is a trick to factorials: we can find the prime factorization of n! quickly, much more quickly than we can compute n! itself. And we can do this without stepping through the list 1, 2, ..., n and factorizing each of these numbers. The crucial observation is that n! must necessarily contain each prime number up to n, and that the power of the prime p in n! is given by

$\sum_{i=1}^{\infty} \, \lfloor n/p^i \rfloor$.

Notice that this is actually a finite sum, since the floor is zero whenever pi > n. This implementation is based on the following references:

We begin by computing the sum above to find the exponent (power) of a prime number in the factorization of n!. The numbers are all small here, so ordinary machine words will work fine.

<<compute multiplicity of a prime in factorial n>>=
/* Return the power of the prime number p in the
factorization of n! */
int multiplicity(int n, int p) {
int q = n, m = 0;
if (p > n) return 0;
if (p > n/2) return 1;
while (q >= p) {
q /= p;
m += q;
}
return m;
}


We now need a way to obtain the prime numbers up to n, for which purpose we implement a light-weight sieve of Eratosthenes. Emphasizing efficiency and simplicity over space, we use a byte array instead of a bit array (keep in mind that the end result will be much larger than this byte array).

<<sieve of Eratosthenes>>=
unsigned char* prime_table(int n) {
int i, j;
unsigned char* sieve = (unsigned char*)calloc(n+1, sizeof(unsigned char));
sieve[0] = sieve[1] = 1;
for (i=2; i*i <= n; i++)
if (sieve[i] == 0)
for (j=i*i; j <= n; j+=i)
sieve[j] = 1;
return sieve;
}


Now it's time to take the prime factorization of n! and actually compute the value of n!. In C, standard 32-bit machine words can only hold factorials up to 12!, and 64-bit words only up to 20!, far below the range of numbers for which this algorithm is beneficial. For this reason, we will produce arbitrary-precision integers instead. Since C doesn't supply arbitrary-precision integers, we supply our own basic implementation from Arbitrary-precision integer arithmetic (C):

<<integer.h>>=
Arbitrary-precision_integer_arithmetic_(C)#6222#integer.h

<<integer.c>>=
Arbitrary-precision_integer_arithmetic_(C)#6222#integer.c


The straightforward approach would be to calculate pa for each prime p by multiplying p by itself a times. But if we do this we derive no benefit over the obvious method of computing factorials. Instead, we will use exponentiation by squaring, which uses only at most 2loga multiplies:

<<exponentiation function>>=
void exponent(unsigned int base, unsigned int power, integer result) {
int bit;
integer temp = create_integer(result.num_components*2);
set_zero_integer(result);
result.c[0] = 1;

bit=sizeof(power)*CHAR_BIT - 1;
while ((power & (1 << bit)) == 0) bit--;
for( ; bit>=0; bit--) {
multiply_integer(result, result, temp);
copy_integer(temp, result);
if ((power & (1 << bit)) != 0) {
multiply_small_integer(result, base, result);
}
}

free_integer(temp);
}


To avoid performing extra multiplications for small powers, we skip any leading zero bits.

We're now prepared to write an efficient factorial function. It retrieves the list of primes, then for each one computes that prime raised to the appropriate exponent, multiplying the result into a running product. To determine the space needed to hold the factorial up front, we simply use $\lceil{n\log_2 n}+1\rceil$ bits, which Stirling's approximation tells us is an upper bound for log2n!.

<<factorial function>>=
integer factorial(int n) {
unsigned char* primes = prime_table(n);
int p;
integer result =
create_integer((int)ceil((n*log((double)n)/log(2) + 1)/COMPONENT_BITS));
integer p_raised = create_integer(result.num_components);
integer temp     = create_integer(result.num_components*2);
set_zero_integer(result);
result.c[0] = 1;

for(p = 2; p <= n; p++) {
if (primes[p] == 1) continue;
exponent(p, multiplicity(n,p), p_raised);
multiply_integer(result, p_raised, temp);
copy_integer(temp, result);
}

free_integer(temp);
free_integer(p_raised);
free(primes);
return result;
}

#include <math.h>  /* for log() */


Finally, we write a simple command-line driver that takes a number and outputs its factorial:

<<factorial.c>>=
#include <stdlib.h>
#include <stdio.h>
#include "integer.h"

compute multiplicity of a prime in factorial n
sieve of Eratosthenes
exponentiation function
factorial function

int main(int argc, char* argv[]) {
int n = atoi(argv[1]);
puts(integer_to_string(factorial(n)));
return 0;
}


Here's some sample output (all values are independently verified as correct):

$./factorial 10 3628800$ ./factorial 50
30414093201713378043612608166064768844377641568960512000000000000
\$ ./factorial 500
12201368259911100687012387854230469262535743428031928421924135883\
85845373153881997605496447502203281863013616477148203584163378722\
07817720048078520515932928547790757193933060377296085908627042917\
45478824249127263443056701732707694610628023104526442188787894657\
54777149863494367781037644274033827365397471386477878495438489595\
53753799042324106127132698432774571554630997720278101456108118837\
37095310163563244329870295638966289116589747695720879269288712817\
80070265174507768410719624390394322536422605234945850129918571501\
24870696156814162535905669342381300885624924689156412677565448188\
65065938479517753608940057452389403357984763639449053130623237490\
66445048824665075946735862074637925184200459369692981022263971952\
59719094521782333175693458150855233282076282002340262690789834245\
17120062077146409794561161276291459512372299133401695523638509428\
85592018727433795173014586357570828355780158735432768888680120399\
88238470215146760544540766353598417443048012893831389688163948746\
96588175045069263653381750554781286400000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000\
000000000000000000000000000000


The last value took about 0.02 seconds to generate on a Pentium 2.8 GhZ machine, compiled by gcc with the -O3 flag under Linux (not including generating the output string). The obvious algorithm takes about the same amount of time. For factorials in the range of thousands, the naive approach tends to do better, because this algorithm, although superior asymptotically, performs too many multiplications of two large numbers, which are expensive. We could achieve greater benefit by using a better multiplication algorithm, but first there is a way of rearranging our operations to drastically cut the number of multiplications of two large numbers we need to perform.

##  Parallel exponentiation

The concept of this optimization is to reduce the number of calls to the expensive multiply_integer. We perform these calls in two places in the simple algorithm above: whenever squaring a number during exponentiation, and whenever multiplying a prime power into our total product. Imagine we have a matrix where the columns each correspond to the bits of a single exponent, most significant at the top. For example, this is the matrix for 20!, with each column labelled with its corresponding prime:

20! = 218 · 38 · 54 · 72 · 11 · 13 · 17 · 19

$\left[\begin{matrix} \mathbf{2} & \mathbf{3} & \mathbf{5} & \mathbf{7} & \mathbf{11} & \mathbf{13} & \mathbf{17} & \mathbf{19} \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{matrix}\right]$

In the above approach, we compute the contribution of each column separately and multiply them together. Alternatively, we can go by row:

1. Compute the entire prime factorization beforehand. This is fast.
2. Starting with the top row, multiply in all of the factors occurring on the current row. These are all multiplications by small numbers.
3. Square the running product to get to the following row. Repeat until all rows have been visited.

This combines the two stages of exponentiation and multiplication of components, and we need only perform large multiplications when squaring. The code looks like this:

<<parallel exponentiation factorial>>=
integer factorial(int n) {
unsigned char* primes = prime_table(n);
int num_primes;
int* p_list = malloc(sizeof(int)*n);
int* exp_list = malloc(sizeof(int)*n);
int p, i, bit;
integer result =
create_integer((int)ceil((n*log((double)n)/log(2) + 1)/COMPONENT_BITS));
integer temp     = create_integer(result.num_components*2);
set_zero_integer(result);
result.c[0] = 1;

num_primes = 0;
for(p = 2; p <= n; p++) {
if (primes[p] == 1) continue;
p_list[num_primes] = p;
exp_list[num_primes] = multiplicity(n,p);
num_primes++;
}

for(bit=sizeof(exp_list[0])*CHAR_BIT - 1; bit>=0; bit--) {
for(i=0; i<num_primes; i++) {
if ((exp_list[i] & (1 << bit)) != 0) {
break;
}
}
if (i < num_primes) {
break;
}
}

for( ; bit>=0; bit--) {
multiply_integer(result, result, temp);
copy_integer(temp, result);
for(i=0; i<num_primes; i++) {
if ((exp_list[i] & (1 << bit)) != 0) {
multiply_small_integer(result, p_list[i], result);
}
}
}

free_integer(temp);
free(primes);
return result;
}


Here's a test main for it:

<<pefactorial.c>>=
#include <stdlib.h>
#include <stdio.h>
#include "integer.h"

compute multiplicity of a prime in factorial n
sieve of Eratosthenes
parallel exponentiation factorial

int main(int argc, char* argv[]) {
int n = atoi(argv[1]);
puts(integer_to_string(factorial(n)));
return 0;
}


The times for this method are considerably better. It outperforms the naive method for factorials as low as 110!, and outperforms it by about 45% for larger factorials, as this chart shows:

 n 10 100 500 1,000 5,000 10,000 50,000 Naive (ms) 0.0007 0.017 0.5 2.22 67.8 295 8960 pefactorial (ms) 0.0019 0.021 0.29 1.12 33.9 153 4960

Times were taken on a 2.8 GhZ Pentium 4 CPU under Linux, compiled by gcc with "-O3". Time to generate output string not included. For small arguments the main factorial call was placed in a long loop to facilitate accurate measurement. Times could be improved further by using a better multiplication algorithm, or one specialized to squaring.

hijacker
hijacker
hijacker
hijacker