At TopK, performance and efficiency are the core principles behind our hybrid retrieval engine (see our benchmarks: it's fast and it scales). In our journey to optimize dense vector retrieval, we turned our focus to one critical component: the Hamming distance function. This metric, which measures bitwise similarity between binary vectors, plays a foundational role in tasks like approximate nearest neighbor (ANN) search. When you operate at scale, serving thousands of queries per second over millions of documents, even small improvements in this function translate into massive gains. This post outlines our exploration and results from leveraging ARM NEON instructions to optimize the Hamming distance kernel.
Baseline implementation
Our starting point was a straightforward Rust implementation. The function iterates through two byte slices, computes the XOR byte-by-byte, and accumulates the number of differing bits using count_ones().
pub fn hamming_distance(x: &[u8], y: &[u8]) -> u32 {
    assert_eq!(x.len(), y.len());
    let mut total = 0;
    for i in 0..x.len() {
        total += (x[i] ^ y[i]).count_ones();
    }
    total
}
While simple and portable, this approach left a lot on the table in terms of vectorization and memory throughput.
Threads | Samples/sec | Throughput (GB/s) |
---|---|---|
1 | 194,791,802 | 23.22 |
10 | 1,743,320,290 | 207.82 |
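For reference, here is roughly how numbers like these can be produced. This is a simplified sketch, not our actual benchmark harness: the pool size, the filler data, and the 128-byte (1024-bit) vector width are assumptions, and throughput is derived as bytes of vector data per distance computation.

use std::time::Instant;

// A rough, hypothetical harness (not the one used for the numbers above).
// Assumes 1024-bit (128-byte) vectors; throughput counts 128 bytes per sample.
fn bench(name: &str, f: impl Fn(&[u8], &[u8]) -> u32) {
    const DIM: usize = 128; // bytes per vector
    const VECTORS: usize = 1_000_000;
    const ROUNDS: usize = 100;

    // Deterministic filler data; the bit pattern doesn't matter for throughput.
    let xs: Vec<u8> = (0..DIM * VECTORS).map(|i| i.wrapping_mul(31) as u8).collect();
    let ys: Vec<u8> = (0..DIM * VECTORS).map(|i| i.wrapping_mul(17) as u8).collect();

    let mut sink = 0u64; // keeps the compiler from discarding the work
    let start = Instant::now();
    for _ in 0..ROUNDS {
        for i in 0..VECTORS {
            let off = i * DIM;
            sink += f(&xs[off..off + DIM], &ys[off..off + DIM]) as u64;
        }
    }
    let secs = start.elapsed().as_secs_f64();

    let samples = (ROUNDS * VECTORS) as f64;
    let samples_per_sec = samples / secs;
    let gb_per_sec = samples_per_sec * DIM as f64 / (1u64 << 30) as f64;
    println!("{name}: {samples_per_sec:.0} samples/s, {gb_per_sec:.2} GB/s (checksum: {sink})");
}

fn main() {
    bench("scalar", hamming_distance);
}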
SIMD 101: Vanilla NEON
Let's take a look at our first NEON-based implementation. It loads 16 bytes (128 bits) from x and y, computes POPCOUNT(XOR(x, y)), sums the register horizontally, and adds the partial result to the accumulator. Here are the instructions we'll need:

- veorq_u8: bitwise XOR of 16 bytes
- vcntq_u8: per-byte popcount across 16 lanes
- vaddvq_u8: horizontal add of all lanes into a single value
use std::arch::aarch64::*;

fn hamming_distance_neon(x: &[u8], y: &[u8]) -> u32 {
    assert_eq!(x.len(), y.len());
    let x_ptr = x.as_ptr();
    let y_ptr = y.as_ptr();
    // Accumulator
    let mut acc = 0_u32;
    // Operate on 128 bits (16 x u8) at a time
    let n = x.len() / 16;
    for i in 0..n {
        unsafe {
            let x = vld1q_u8(x_ptr.add(i * 16));
            let y = vld1q_u8(y_ptr.add(i * 16));
            // popcount(XOR(x, y))
            // Each of the 16xu8 lanes has value <= 8
            let xor = vcntq_u8(veorq_u8(x, y));
            // Horizontal add of the 16 lanes (sum <= 128, fits in u8), accumulated as u32
            acc += vaddvq_u8(xor) as u32;
        }
    }
    // Handle the remaining bytes
    for i in (n * 16)..x.len() {
        acc += (x[i] ^ y[i]).count_ones();
    }
    acc
}
Threads | Samples/sec | Throughput (GB/s) |
---|---|---|
1 | 424,912,063 | 50.65 |
10 | 2,566,529,278 | 305.95 |
Even the most basic NEON-based implementation gave us a 2x improvement in single-threaded throughput and a ~1.5x improvement in multi-threaded throughput. Noice!
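As a small aside: if you want to keep the scalar version around as a portable fallback, a thin dispatch wrapper is all it takes. The sketch below is ours, not code from our engine (the wrapper name is made up); on AArch64 the feature check always passes since NEON is mandatory there, so it mostly matters for builds that also target other architectures.

// Hypothetical wrapper: use the NEON kernel when the target supports it,
// otherwise fall back to the portable scalar implementation.
pub fn hamming_distance_auto(x: &[u8], y: &[u8]) -> u32 {
    #[cfg(target_arch = "aarch64")]
    {
        // NEON (ASIMD) is mandatory on AArch64, so this branch is effectively always taken.
        if std::arch::is_aarch64_feature_detected!("neon") {
            return hamming_distance_neon(x, y);
        }
    }
    hamming_distance(x, y)
}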
Exploiting Instruction Level Parallelism (ILP)
The above implementation accumulates the result into a single scalar register, which creates a dependency chain between each iteration's loads and the accumulation. To leverage the instruction-level parallelism available in modern CPUs, we changed the implementation to use two separate accumulators, which effectively creates two independent pipelines, each with its own load-accumulate dependency chain.
fn hamming_distance_neon_ilp(x: &[u8], y: &[u8]) -> u32 {
    assert_eq!(x.len(), y.len());
    let x_ptr = x.as_ptr();
    let y_ptr = y.as_ptr();
    // Two independent vector accumulators
    let mut acc = unsafe { [vdupq_n_u8(0); 2] };
    // Operate on 256 bits (4 x u64, 32 x u8) at a time
    let n = x.len() / 32;
    for i in 0..n {
        let i = i * 32;
        // unroll! expands the inner loop at compile time (e.g. via the crunchy crate)
        unroll! {
            for j in 0..2 {
                unsafe {
                    let x = vld1q_u8(x_ptr.add(i + j * 16));
                    let y = vld1q_u8(y_ptr.add(i + j * 16));
                    // popcount(XOR(x, y))
                    // Each of the 16xu8 lanes has value <= 8
                    let xor_popcnt = vcntq_u8(veorq_u8(x, y));
                    // Accumulate per-lane as u8 (lanes stay far below 255 for the vector sizes we target)
                    acc[j] = vaddq_u8(acc[j], xor_popcnt);
                }
            }
        }
    }
    let mut res = unsafe {
        // Horizontal add: pairwise-widen u8 lanes to u16, then sum across lanes
        vaddvq_u16(vpaddlq_u8(vaddq_u8(acc[0], acc[1]))) as u32
    };
    // Handle the remaining bytes
    for i in (n * 32)..x.len() {
        res += (x[i] ^ y[i]).count_ones();
    }
    res
}
Threads | Samples/sec | Throughput (GB/s) |
---|---|---|
1 | 431,638,800 | 51.46 |
10 | 2,697,048,892 | 321.51 |
This version marginally outperformed the single-accumulator NEON implementation, particularly under high concurrency. It benefited from better utilization of the NEON execution units, reducing stall cycles and improving memory-level parallelism.
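Since every variant must agree with the scalar baseline, we also like to keep a quick sanity check around. The test below is a sketch of ours, not from the post's codebase; it cross-checks the NEON kernels against the scalar implementation and exercises the scalar tail loops with a length that isn't a multiple of 32.

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn neon_kernels_match_scalar_baseline() {
        // Deterministic pseudo-random 128-byte (1024-bit) vectors.
        let x: Vec<u8> = (0u32..128).map(|i| (i.wrapping_mul(2654435761) >> 24) as u8).collect();
        let y: Vec<u8> = (0u32..128).map(|i| (i.wrapping_mul(2246822519) >> 16) as u8).collect();

        let expected = hamming_distance(&x, &y);
        assert_eq!(hamming_distance_neon(&x, &y), expected);
        assert_eq!(hamming_distance_neon_ilp(&x, &y), expected);

        // A length that is not a multiple of 32 exercises the scalar tail loops.
        let expected_tail = hamming_distance(&x[..100], &y[..100]);
        assert_eq!(hamming_distance_neon(&x[..100], &y[..100]), expected_tail);
        assert_eq!(hamming_distance_neon_ilp(&x[..100], &y[..100]), expected_tail);
    }
}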
Fewer Instructions, More Speed
Dense vector retrieval operates on vectors with a fixed dimension, usually between 384 and 1536. This allows us to optimize our compute kernels even further by fully unrolling the loops and minimizing control-flow overhead. Fully unrolled loops also let the compiler do a better job of register allocation, which further improves our effective throughput.
fn hamming_distance_neon_1024b(x: &[u8], y: &[u8]) -> u32 {
    // Specialized for 1024-bit (128-byte) vectors; the loop below is fully unrolled.
    assert_eq!(x.len(), y.len());
    assert_eq!(x.len(), 128);
    let x_ptr = x.as_ptr();
    let y_ptr = y.as_ptr();
    unsafe {
        let mut acc = [vdupq_n_u8(0); 2];
        unroll! {
            for i in 0..4 {
                let x = vld1q_u8(x_ptr.add(i * 32));
                let y = vld1q_u8(y_ptr.add(i * 32));
                let xp1 = vcntq_u8(veorq_u8(x, y));
                acc[0] = vaddq_u8(acc[0], xp1);
                let x = vld1q_u8(x_ptr.add(i * 32 + 16));
                let y = vld1q_u8(y_ptr.add(i * 32 + 16));
                let xp2 = vcntq_u8(veorq_u8(x, y));
                acc[1] = vaddq_u8(acc[1], xp2);
            }
        }
        // Horizontal add
        vaddvq_u16(vpaddlq_u8(vaddq_u8(acc[0], acc[1]))) as u32
    }
}
Threads | Samples/sec | Throughput (GB/s) |
---|---|---|
1 | 525,551,619 | 62.65 |
10 | 2,934,356,752 | 349.80 |
This final variant delivered the best throughput for both single-threaded and multi-threaded workloads. With minimal branching and fully sequential, predictable memory access, it approaches the architectural limits of the CPU.
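A natural way to put a fixed-size kernel like this to work is behind a dispatch on vector width. The wrapper below is purely illustrative (the name and the length cases are our assumptions, not our internal API), with the generic ILP kernel as the fallback for other sizes.

// Hypothetical dispatch on vector width in bytes: 128 bytes (1024 bits) hits
// the fully unrolled kernel, anything else falls back to the generic ILP kernel.
pub fn hamming_distance_dispatch(x: &[u8], y: &[u8]) -> u32 {
    debug_assert_eq!(x.len(), y.len());
    match x.len() {
        128 => hamming_distance_neon_1024b(x, y),
        _ => hamming_distance_neon_ilp(x, y),
    }
}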
Conclusion
This journey underscores how performance engineering at the instruction level can help us get the most out of the underlying hardware. Leveraging ILP and problem-specific assumptions significantly improved both the single-threaded and multi-threaded throughput of our Hamming distance kernel. The 1024-bit optimized NEON kernel now sustains nearly 350 GB/s, which lowers the latency of our billion-scale retrieval and improves the overall cost efficiency of our offering. Check out our benchmarks to see how optimizations like this allow us to offer the highest performance on the market.
If you are passionate about squeezing performance at both low-level and system-level, shoot me an email at marek@topk.io. We’re hiring!