Last updated: March 31st, 2018
I wrote this text simply because I was so glad when I understood what this is all about. I will not take any responsibility for any trouble caused by the errors that this document might contain.
Basically, a number theoretic transform is a Fourier transform. Now, suppose you have a normal discrete Fourier transform. You do it in matrix form by multiplicating your data with a Fourier matrix (for example n=4):
[ 1 1 1 1 ] [ 1 w w^2 w^3 ] [ 1 w^2 w^4 w^6 ] [ 1 w^3 w^6 w^9 ]The unit w is exp(2pi i / n). Now everything a number theoretic transform is all about is that w^n=1. But instead of using a complex number w you do everything in some other number field where w^n=1. Now integers modulo a prime p form a number field. From elementary number theory we know, that for all integers a, when p is prime (and a is positive and less than p)
a^(p-1)=1 (mod p)(From now on we just might suppose that the modulus p is prime). Then there exists a primitive root r, that is, when x goes from 1 to p-1, then r^x (mod p) goes through all the numbers 1...(p-1) in some order.
Now, we want w^n=1 (mod p) (here n is our transform length). So we find (not too difficult with, say, Mathematica) a prime of the form p=k*n+1. So p-1=k*n. Now find the primitive root r (with Mathematica). To know more about primitive roots go to my primitive roots page. We calculate (when p=k*n+1)
w=r^k (mod p)That's it. Now w^n=r^(k*n)=r^(p-1)=1 (mod p). And w^n=1 but w^m != 1 if m < n. So it works.
The matrix can be factored to do a fast number theoretic transform just like a Fourier matrix can be factored. Just take any fft code and change the complex numbers to integers modulo p, and use the w found above.
Unlike the "normal" Fourier transform, a number theoretic transform does not transform the data to any "physically meaningful" space, like a Fourier transform transforms the "time" data to "frequency" space. So one is mostly interested in doing convolutions efficiently. This can be used to multiply very big numbers efficiently, which originally was the author's goal.
Number theoretic transforms are actually very simple. I wrote a program in Mathematica to demonstrate multiple precision arithmetic with ntts: numberth.m
In order to use the above program with Mathematica you need to load first
PrimitiveRoot. Then just load
init (default transform length n=8). Then call
mul[1234, 5678] or whatever. The n is transform length and it's
done so that a number to be multiplied must have at most n/2 digits. So for
n=8 the maximum is
mul[9999, 9999]. If you want to try
multiplicating larger numbers, set bigger n (for example n=64) and then call
init again. Note that n must be a power of two when using the
mulf to use the "slow" O(n^2) transform. This uses simple
matrix multiplications. You have to first call
When using the slow transforms, n doesn't have to be a power of two.
Remember to call both
The code is translated directly from a C fast Fourier transform program (or C++ after I first converted the C code to use C++'s complex numbers). The idea seems good, even the scrambling has been omitted (convolution only!). For big transforms the code could be yet optimized (when written in a serious programming language) by noting that the upper half of the vector to be transformed is zeros. A "four-step" implementation could also improve performance on cache-based systems.
Assembler optimizing might be quite effective. If the modulo p has almost all bits the same (like p = 2^32-2^20+1), the modulo taking can be done just with a few double-length shifts and additions/subtractions (this can't be done in C but it's fast in assembler). For arbitrary moduli, assembler is still the only way to do double-width multiplication and division.
A C++ implementation with an arbitrary precision package plus plenty of other things is available at the apfloat home page.
One good thing about number theoretic transforms is that if you do a long transform (so long, that the 32-bit mantissa overflows and the convolution fails), you can do another transform modulo a different prime (there should be enough of the form k*n+1) and then use the Chinese Remainder Theorem. So instead of having to use double-precision arithmetic in the transform, you do two transforms (or even more) modulo different primes and then apply the Chinese Remainder Theorem to the result. So double-precision arithmetic needs to be used only in a single pass through the data.
When multiplying huge precision numbers, one could use, say base 10^9 for 32-bit integers. Then just do the transform modulo three different primes slightly less than 2^32. And use the CRT with 96-bit arithmetic. This way you could fit 4.5 times more decimals in the memory than if you use base 10^4 for 64-bit doubles. The final result is correct as long as the result from the CRT has a wide enough mantissa (= the product of the different primes used). Now this should always be the case, since the biggest number that can be produced in the convolution is n*base^2 (now obviously base and the transform length, n are all less than 2^32, so the product is less than 2^(3*32)). Note that we don't need any "safety" bits like we do when we calculate with complex numbers, since there is no round-off error. The CRT can be implemented efficiently using only single-precision arithmetic (with double-width multiplication and division, of course), although the result is a multi-precision number.
When using 32-bit integers in the transform, the maximum transform length seems to be 2^27. There are only three primes less than 2^32 of the form k*n+1. For n=2^28 there are only two, so the scheme mentioned above can't work. This becomes a problem only if you have more than 1GB of memory, though.
Mr. Bailey mentions, though, in the article in Math. Comp. that number theoretic transforms are a lot slower than any normal ffts. But he also mentions that the space requirements are much less. So if you want to multiply large numbers with many decimals but you have limited memory, this could be the way to go.
One could also course use a "schoolboy" type multiplication after memory runs out.