Optimizing CRC32 for small payload sizes on x86

Published 6th June 2020

This entry also serves as documentation for the ISO CRC32 implementation in Dynarmic and Ryujinx.

Recently during a set of night shifts, my tired mind wanted to understand how one should go about implementing CRC32 in terms of a carry-less multiply operation. This seemed to be motivated by several thoughts:

  1. x86 supports the pclmulqdq instruction which implements a 64-bit × 64-bit → 127-bit carry-less product operation
  2. While x86 has the SSE4.2 crc32c instruction for the Castagnoli polynomial 0x1EDC6F41, it does not have similar instructions for other polynomials
  3. AArch64 systems provide a CRC32 instruction which uses the ISO/ANSI/gzip/PNG polynomial 0x04C11DB7
  4. Current AArch64 emulators fall back to a slow software implementation when encountering the above-mentioned instructions

I had a vague understanding of CRC32 so I knew it was possible, but my understanding at the time wasn't sufficient for the implementation to be obvious. In retrospect the solution is obvious but I am documenting it here anyway for posterity.

Implementation requirements

The primary motivation is to re-implement the crc32 instructions of AArch64, so let's first take a look at those. There are several instructions with small payload sizes:

CRC32B <Wd>, <Wn>, <Wm>
CRC32H <Wd>, <Wn>, <Wm>
CRC32W <Wd>, <Wn>, <Wm>
CRC32X <Wd>, <Wn>, <Xm>

For all the instructions, the n register is the accumulated crc32 value, and m is the current input value. One can have a look at the ASL implementation for these instructions in the AArch64 manual:

bits(32)   acc  = X[n];    // accumulator
bits(size) val  = X[m];    // input value
bits(32)   poly = 0x04C11DB7;

bits(32+size) tempacc = BitReverse(acc):Zeros(size);
bits(size+32) tempval = BitReverse(val):Zeros(32);

// Poly32Mod2 on a bitstring does a polynomial Modulus over {0,1} operation
X[d] = BitReverse(Poly32Mod2(tempacc EOR tempval, poly));

From the above we can see that we need to implement crc32 for 8, 16, 32, and 64-bit payload sizes.
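
The ASL above can be mirrored with a bit-at-a-time C sketch (crc32_bitwise is my own reference helper, not code from the manual; it processes the input LSB-first against 0xEDB88320, the bit-reflected form of the polynomial, and is useful later for checking optimized versions):

```c
#include <stdint.h>

// Bit-at-a-time reference for the AArch64 CRC32B/H/W/X semantics above.
// 0xEDB88320 is the ISO polynomial 0x04C11DB7, bit-reflected.
uint32_t crc32_bitwise(uint32_t acc, uint64_t val, int bits) {
    for (int i = 0; i < bits; i++) {
        uint32_t feedback = (acc ^ (uint32_t)(val >> i)) & 1;
        acc >>= 1;
        if (feedback)
            acc ^= 0xEDB88320;
    }
    return acc;
}
```

Note there is no initial or final inversion here: the AArch64 instructions operate on the raw accumulator, and conventions like gzip's apply their inversions around this primitive.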

Cyclic redundancy check (CRC) mathematics

$$ crc(m(x)) = x^{degree(p(x))} \cdot m(x) \mod p(x) $$

As the above suggests, the CRC of a message is the message with zeros appended, represented as a polynomial modulo the generator polynomial. The number of zeros appended to the message is the degree of the generator polynomial.

The coefficients of the polynomial are elements of the Galois field GF(2). Thus, addition of coefficients is equivalent to xor, and the multiplication of two polynomials is equivalent to a carry-less multiply.
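
To make this concrete, carry-less multiplication is easy to sketch in portable C (clmul32 is my own helper name; pclmulqdq performs the 64-bit version of this in hardware):

```c
#include <stdint.h>

// Carry-less multiply: polynomial multiplication over GF(2).
// Each set bit of b adds (xors) a shifted copy of a -- no carries propagate.
uint64_t clmul32(uint32_t a, uint32_t b) {
    uint64_t product = 0;
    for (int i = 0; i < 32; i++) {
        if ((b >> i) & 1)
            product ^= (uint64_t)a << i;
    }
    return product;
}
```

For example, \((x^2 + 1)(x + 1) = x^3 + x^2 + x + 1\), so clmul32(0b101, 0b11) == 0b1111; and clmul32(3, 3) == 5 rather than 9, because \(1 + 1 = 0\) in GF(2).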

For ISO CRC32, degree(p(x)) = 32, and the generator polynomial is:

$$p(x) = x^{32} + x^{26} + x^{23} + x^{22} + x^{16} + x^{12} + x^{11} + x^{10} + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1$$

This corresponds to the binary string 1 0000 0100 1100 0001 0001 1101 1011 0111, i.e. 0x104C11DB7 in hexadecimal.

CRC accumulation

One might ask: why can one use a hardware-accelerated CRC32 primitive with a maximum payload size of 64 bits to calculate a CRC32 for a larger message? How does the maths work out?

If one takes the original message as \(m_1\), the message to append as \(m_2\), and \(l\) as the size of \(m_2\), this would be:

$$ \begin{aligned} crc32(m_1 || m_2) &= x^{32} \cdot ( m_1 || m_2 ) \mod p \\ &= x^{32} \cdot ( m_1 \cdot x^l + m_2 ) \mod p \\ &= (x^{32 + l} \cdot m_1) + (x^{32} \cdot m_2) \mod p \\ &= (x^{l} \cdot crc32(m_1)) + (x^{32} \cdot m_2) \mod p \end{aligned} $$

In other words, you can shift-xor the accumulated crc value into the message and perform the CRC32 calculation as usual. This demonstrates how a CRC32 can be accumulated incrementally for a long message, and also why it is necessary for a hardware CRC32 implementation to have an accumulator parameter.
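
This identity is easy to exercise with a bit-at-a-time sketch (crc32_bitwise and crc32_chunked are my own helper names, with no inversions, matching the instruction semantics): feeding a 64-bit value in two 32-bit steps, low half first, must give the same result as one 64-bit step.

```c
#include <stdint.h>

// Bit-at-a-time CRC32 (ISO polynomial, bit-reflected form 0xEDB88320).
uint32_t crc32_bitwise(uint32_t acc, uint64_t val, int bits) {
    for (int i = 0; i < bits; i++) {
        uint32_t feedback = (acc ^ (uint32_t)(val >> i)) & 1;
        acc >>= 1;
        if (feedback)
            acc ^= 0xEDB88320;
    }
    return acc;
}

// Accumulate a 64-bit message in two 32-bit chunks, low half first.
uint32_t crc32_chunked(uint32_t acc, uint64_t val) {
    acc = crc32_bitwise(acc, (uint32_t)val, 32);
    acc = crc32_bitwise(acc, val >> 32, 32);
    return acc;
}
```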

Implementing modulo p

We intend to implement CRC32 on x86, but all we have is a carry-less multiply primitive. We do not have a divide or a modulo primitive to work with. Not to worry! We can implement modulo with Barrett reduction!

Barrett reduction comes from the observation that:

$$a \textrm{ mod } p = a - \lfloor s a \rfloor p$$

where

$$s = \frac{1}{p}$$

In practice we can approximate the multiplication by \(s\) with:

$$a \textrm{ mod } p = a - \left\lfloor \left( a \cdot \left\lfloor\frac{2^{96}}{p}\right\rfloor \right) \gg 96 \right\rfloor p$$

This is a sufficiently accurate approximation to produce correct and exact results for 96-bit values of \(a\). In the notation that follows, we shall use \(\mu_{i}\) to represent the value \( \left\lfloor\frac{2^{i}}{p}\right\rfloor \).

Note: We choose to use \(\mu_{96}\). A lot of implementations use \(\mu_{64}\), which unfortunately only provides enough precision for 64-bit values of \(a\). That leaves performance on the table, because it results in more multiplies than necessary!

Implementation practicalities

We now have a sufficient understanding of the mathematics involved to implement CRC32. There are however a few remaining practicalities to consider.

Bit-reflected data

As you might have noticed from the ASL pseudo-code for the AArch64 CRC32 instructions, like a lot of CRC implementations they work on bit-reflected data. Fortunately, carry-less multiplication is bit-order agnostic (no carries!). However, one does have to take care, because the result of the multiplication is stored in the least significant 127 bits of the 128-bit result, which leaves the bit-reflected product misaligned by one bit position.

One can account for this by shifting the result left by 1 bit. Alternatively, since our constant multiplicands are always less than 64 bits, we can pre-shift each constant multiplicand left by 1 to obtain an appropriately aligned result.

Similarly, multiplication by a 33-bit value results in correct alignment. Another neat trick is multiplication by a 65-bit value whose LSB happens to be zero (we exploit this!).
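
A toy 8-bit example makes the alignment story concrete (clmul16 and reflect are my own helpers, not code from the post): multiplying bit-reflected operands gives the bit-reflected product one position short of the top of the register, and pre-shifting one operand by 1 restores full alignment.

```c
#include <stdint.h>

// Carry-less multiply of two (up to 16-bit) values.
uint32_t clmul16(uint16_t a, uint16_t b) {
    uint32_t product = 0;
    for (int i = 0; i < 16; i++)
        if ((b >> i) & 1)
            product ^= (uint32_t)a << i;
    return product;
}

// Reflect the low `bits` bits of x (bit 0 swaps with bit bits-1, and so on).
uint32_t reflect(uint32_t x, int bits) {
    uint32_t r = 0;
    for (int i = 0; i < bits; i++)
        if ((x >> i) & 1)
            r |= 1u << (bits - 1 - i);
    return r;
}
```

Two 8-bit operands have a 15-bit product, so multiplying their reflections yields the product reflected across 15 bits, one short of the 16-bit register; pre-shifting one operand left by 1 gives the product reflected across the full 16 bits.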

Finding \(\mu_i\)

Since \(\mu_i\) is \( \left\lfloor\frac{2^{i}}{p}\right\rfloor \), this can be found straightforwardly by polynomial division:

// Hastily written polynomial division function.
// uint256_t is assumed to be a 256-bit unsigned integer type with a
// str(base) method (e.g. a custom wide-integer class or a multiprecision
// library type).
void find_mu(int i) {
    uint256_t dividend = uint256_t{1} << i;
    const uint256_t divisor = 0x104C11DB7;
    const int bits_in_divisor = 33;

    uint256_t result = 0;
    for (int bit = 255; bit >= 0; bit--) {
        if ((dividend & (uint256_t{1} << bit)) != 0) {
            int shift = bit - bits_in_divisor + 1;
            if (shift >= 0) {
                dividend ^= divisor << shift;
                result ^= uint256_t{1} << shift;
            } else {
                dividend ^= divisor >> -shift;
            }
        }
    }

    printf("%s\n", result.str(16).c_str());
}

Remember to bit-reflect the results if your specification requires it.
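
As a quick sanity check (bit_reflect is my own helper), reflecting the 33-bit ISO polynomial 0x104C11DB7 yields 0x1DB710641, which is exactly the constant that appears in the implementations that follow:

```c
#include <stdint.h>

// Reflect the low `bits` bits of x (bit 0 swaps with bit bits-1, and so on).
uint64_t bit_reflect(uint64_t x, int bits) {
    uint64_t r = 0;
    for (int i = 0; i < bits; i++)
        if ((x >> i) & 1)
            r |= 1ull << (bits - 1 - i);
    return r;
}
```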

Easy-to-understand implementation

Let us start with a simple-to-understand implementation. Remember that we are handling bit-reflected data, so be careful.

#include <stdint.h>
#include <smmintrin.h>  // SSE4.1 (_mm_extract_epi32)
#include <wmmintrin.h>  // PCLMULQDQ (_mm_clmulepi64_si128)

// This implementation only works for 8-, 16-, and 32-bit payload sizes
uint32_t crc32(uint32_t value, uint32_t accumulator, int bits) {
    __m128i orig = _mm_set_epi64x(0, (uint64_t)(value ^ accumulator) << (32 - bits));
    __m128i tmp = orig;

    // Multiply by mu_{64}
    tmp = _mm_clmulepi64_si128(tmp, _mm_set_epi64x(0, 0x1F7011641), 0x00);
    // Divide by 2^64 (mask away the unnecessary bits)
    tmp = _mm_and_si128(tmp, _mm_set_epi64x(0, 0xFFFFFFFF));
    // Multiply by p (shifted left by 1 for alignment reasons)
    tmp = _mm_clmulepi64_si128(tmp, _mm_set_epi64x(0, 0x1DB710641), 0x00);
    // Subtract original from result
    tmp = _mm_xor_si128(tmp, orig);

    // Extract the 'lower' (in bit-reflected sense) 32 bits
    return (uint32_t)_mm_extract_epi32(tmp, 1);
}

// For 64-bit values
// Just accumulate using two 32-bit operations
uint32_t crc32(uint64_t value, uint32_t accumulator) {
    accumulator = crc32((uint32_t)value, accumulator, 32);
    accumulator = crc32((uint32_t)(value >> 32), accumulator, 32);
    return accumulator;
}

The above is what you'd likely come up with after reading Intel's white paper on using PCLMULQDQ to compute CRC; it is essentially an exact implementation of figure 12 on page 21 of that white paper.
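
If you'd like to check this logic on a machine without PCLMULQDQ (or just in plain C), the same figure-12 flow can be mirrored with a software carry-less multiply. This is my own portable sketch, not code from the white paper; truncating the first product to 64 bits is harmless, because the mask keeps only its low 32 bits.

```c
#include <stdint.h>

// Software carry-less multiply, truncated to the low 64 bits of the product.
static uint64_t clmul64(uint64_t a, uint64_t b) {
    uint64_t product = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            product ^= a << i;
    return product;
}

// Bit-at-a-time reference (no inversions), for cross-checking.
static uint32_t crc32_bitwise(uint32_t acc, uint64_t val, int bits) {
    for (int i = 0; i < bits; i++) {
        uint32_t feedback = (acc ^ (uint32_t)(val >> i)) & 1;
        acc >>= 1;
        if (feedback)
            acc ^= 0xEDB88320;
    }
    return acc;
}

// Scalar mirror of the SIMD routine above (8-, 16-, and 32-bit payloads).
uint32_t crc32_scalar(uint32_t value, uint32_t accumulator, int bits) {
    uint64_t orig = (uint64_t)(value ^ accumulator) << (32 - bits);
    uint64_t tmp = orig;

    // Multiply by mu_{64} (bit-reflected, pre-shifted by 1)
    tmp = clmul64(tmp, 0x1F7011641);
    // Divide by 2^64 (mask away the unnecessary bits)
    tmp &= 0xFFFFFFFF;
    // Multiply by p (bit-reflected, pre-shifted by 1)
    tmp = clmul64(tmp, 0x1DB710641);
    // Subtract (xor) the original from the result
    tmp ^= orig;

    // Extract the 'lower' (in the bit-reflected sense) 32 bits
    return (uint32_t)(tmp >> 32);
}
```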

There are several improvements possible:

  1. The constants can share a register
  2. The masking is unnecessary if you align the input appropriately
  3. It is unnecessary to accumulate for the 64-bit version if you pick a better value for \(\mu\)
  4. The final xor is completely unnecessary in the 32- and 64-bit cases

Final implementation

uint32_t crc32_32(uint32_t value, uint32_t accumulator) {
    __m128i xmm_const = _mm_set_epi64x(0x00000001DB710641, 0xB4E5B025F7011641);
    __m128i xmm_value = _mm_set_epi64x(0, (uint64_t)(value ^ accumulator) << 32);

    xmm_value = _mm_clmulepi64_si128(xmm_value, xmm_const, 0x00);
    xmm_value = _mm_clmulepi64_si128(xmm_value, xmm_const, 0x10);
    return (uint32_t)_mm_extract_epi32(xmm_value, 2);
}

uint32_t crc32_64(uint64_t value, uint32_t accumulator) {
    __m128i xmm_const = _mm_set_epi64x(0x00000001DB710641, 0xB4E5B025F7011641);
    __m128i xmm_value = _mm_set_epi64x(0, value ^ accumulator);

    xmm_value = _mm_clmulepi64_si128(xmm_value, xmm_const, 0x00);
    xmm_value = _mm_clmulepi64_si128(xmm_value, xmm_const, 0x10);
    return (uint32_t)_mm_extract_epi32(xmm_value, 2);
}

That's it! By shifting the input left by 32 bits in the register (to the right, in the bit-reflected sense), we avoid a mask. By doing so, we also can implement 64-bit CRC32 in two multiplies instead of four. The alignment also happens to work out so that we get the divide by \(2^{96}\) without having to do anything.

We take advantage of the fact that the 0th bit of \(\mu_{92}\) is zero, and the fact that the lowest 32 bits of the input are padded with zeros, to do a 92-bit × 65-bit → 156-bit multiply. We do not mind the overflow, because we discard the lowest 96 bits of the result. Happily, the 64 bits we do want align perfectly with the input to the next multiply. We then do a 64-bit × 33-bit multiply, and extract the required byte-aligned 32 bits with a `pextrd` instruction.

The result is a 64-bit CRC32 implementation in two multiplies.