At TopK, performance and efficiency are the core principles behind our hybrid retrieval engine (see our benchmarks: it's fast and it scales). In our journey to optimize dense vector retrieval, we turned our focus to one critical component: the Hamming distance function. This metric, which measures bitwise similarity between binary vectors, plays a foundational role in tasks like approximate nearest neighbor (ANN) search. When you operate at scale, serving thousands of queries per second over millions of documents, even small improvements in this function translate into massive gains. This post outlines our exploration and results from leveraging ARM NEON instructions to optimize the Hamming distance kernel.
Baseline implementation
Our starting point was a straightforward Rust implementation. The function iterates through two byte slices, computes the XOR byte-by-byte, and accumulates the number of differing bits using count_ones().
pub fn hamming_distance(x: &[u8], y: &[u8]) -> u32 {
    assert_eq!(x.len(), y.len());
    let mut total = 0;
    for i in 0..x.len() {
        total += (x[i] ^ y[i]).count_ones();
    }
    total
}
While simple and portable, this approach left a lot on the table in terms of vectorization and memory throughput.
Threads | Samples/sec | Throughput (GB/s) |
---|---|---|
1 | 194,791,802 | 23.22 |
10 | 1,743,320,290 | 207.82 |
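For reference, here is roughly how numbers like these can be produced. This is a simplified sketch, not our actual benchmark harness: the pool size, the filler data, and the 128-byte (1024-bit) vector width are assumptions, and throughput is derived as bytes of vector data per distance computation.

use std::time::Instant;

// A rough, hypothetical harness (not the one used for the numbers above).
// Assumes 1024-bit (128-byte) vectors; throughput counts 128 bytes per sample.
fn bench(name: &str, f: impl Fn(&[u8], &[u8]) -> u32) {
    const DIM: usize = 128; // bytes per vector
    const VECTORS: usize = 1_000_000;
    const ROUNDS: usize = 100;

    // Deterministic filler data; the bit pattern doesn't matter for throughput.
    let xs: Vec<u8> = (0..DIM * VECTORS).map(|i| i.wrapping_mul(31) as u8).collect();
    let ys: Vec<u8> = (0..DIM * VECTORS).map(|i| i.wrapping_mul(17) as u8).collect();

    let mut sink = 0u64; // keeps the compiler from discarding the work
    let start = Instant::now();
    for _ in 0..ROUNDS {
        for i in 0..VECTORS {
            let off = i * DIM;
            sink += f(&xs[off..off + DIM], &ys[off..off + DIM]) as u64;
        }
    }
    let secs = start.elapsed().as_secs_f64();

    let samples = (ROUNDS * VECTORS) as f64;
    let samples_per_sec = samples / secs;
    let gb_per_sec = samples_per_sec * DIM as f64 / (1u64 << 30) as f64;
    println!("{name}: {samples_per_sec:.0} samples/s, {gb_per_sec:.2} GB/s (checksum: {sink})");
}

fn main() {
    bench("scalar", hamming_distance);
}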
SIMD 101: Vanilla NEON
Let's take a look at our first NEON-based implementation. It loads 16 bytes (128 bits) from x and y, computes POPCOUNT(XOR(x, y)), sums the register horizontally, and adds the partial result to the accumulator. Here are the instructions we'll need:

- veorq_u8: bitwise XOR of 16 bytes
- vcntq_u8: per-byte popcount across 16 lanes
- vaddvq_u8: horizontal add of all lanes into a single value
use std::arch::aarch64::*;

fn hamming_distance_neon(x: &[u8], y: &[u8]) -> u32 {
    assert_eq!(x.len(), y.len());
    let x_ptr = x.as_ptr();
    let y_ptr = y.as_ptr();
    // Accumulator
    let mut acc = 0_u32;
    // Operate on 128 bits (16 x u8) at a time
    let n = x.len() / 16;
    for i in 0..n {
        unsafe {
            let x = vld1q_u8(x_ptr.add(i * 16));
            let y = vld1q_u8(y_ptr.add(i * 16));
            // popcount(XOR(x, y))
            // Each of the 16xu8 lanes has value <= 8
            let xor = vcntq_u8(veorq_u8(x, y));
            // Horizontal add of the 16 lanes (sum <= 128, fits in u8), accumulated as u32
            acc += vaddvq_u8(xor) as u32;
        }
    }
    // Handle the remaining bytes
    for i in (n * 16)..x.len() {
        acc += (x[i] ^ y[i]).count_ones();
    }
    acc
}
Threads | Samples/sec | Throughput (GB/s) |
---|---|---|
1 | 424,912,063 | 50.65 |
10 | 2,566,529,278 | 305.95 |
Even the most basic NEON-based implementation gave us a 2x improvement in single-threaded throughput and a ~1.5x improvement in multi-threaded throughput. Noice!
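As a small aside: if you want to keep the scalar version around as a portable fallback, a thin dispatch wrapper is all it takes. The sketch below is ours, not code from our engine (the wrapper name is made up); on AArch64 the feature check always passes since NEON is mandatory there, so it mostly matters for builds that also target other architectures.

// Hypothetical wrapper: use the NEON kernel when the target supports it,
// otherwise fall back to the portable scalar implementation.
pub fn hamming_distance_auto(x: &[u8], y: &[u8]) -> u32 {
    #[cfg(target_arch = "aarch64")]
    {
        // NEON (ASIMD) is mandatory on AArch64, so this branch is effectively always taken.
        if std::arch::is_aarch64_feature_detected!("neon") {
            return hamming_distance_neon(x, y);
        }
    }
    hamming_distance(x, y)
}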
Exploiting Instruction Level Parallelism (ILP)
The above implementation accumulates the result into a single scalar register, which creates a dependency chain between each iteration's loads and the accumulation. To leverage the instruction-level parallelism available in modern CPUs, we changed the implementation to use two separate accumulators, which effectively creates two independent pipelines, each with its own load-accumulate dependency chain.
fn hamming_distance_neon_ilp(x: &[u8], y: &[u8]) -> u32 {
    assert_eq!(x.len(), y.len());
    let x_ptr = x.as_ptr();
    let y_ptr = y.as_ptr();
    // Two independent vector accumulators
    let mut acc = unsafe { [vdupq_n_u8(0); 2] };
    // Operate on 256 bits (4 x u64, 32 x u8) at a time
    let n = x.len() / 32;
    for i in 0..n {
        let i = i * 32;
        // unroll! expands the inner loop at compile time (e.g. via the crunchy crate)
        unroll! {
            for j in 0..2 {
                unsafe {
                    let x = vld1q_u8(x_ptr.add(i + j * 16));
                    let y = vld1q_u8(y_ptr.add(i + j * 16));
                    // popcount(XOR(x, y))
                    // Each of the 16xu8 lanes has value <= 8
                    let xor_popcnt = vcntq_u8(veorq_u8(x, y));
                    // Accumulate per-lane as u8 (lanes stay far below 255 for the vector sizes we target)
                    acc[j] = vaddq_u8(acc[j], xor_popcnt);
                }
            }
        }
    }
    let mut res = unsafe {
        // Horizontal add: pairwise-widen u8 lanes to u16, then sum across lanes
        vaddvq_u16(vpaddlq_u8(vaddq_u8(acc[0], acc[1]))) as u32
    };
    // Handle the remaining bytes
    for i in (n * 32)..x.len() {
        res += (x[i] ^ y[i]).count_ones();
    }
    res
}
Threads | Samples/sec | Throughput (GB/s) |
---|---|---|
1 | 431,638,800 | 51.46 |
10 | 2,697,048,892 | 321.51 |
This version marginally outperformed the single-accumulator NEON implementation, particularly under high concurrency. It benefited from better utilization of the NEON execution units, reducing stall cycles and improving memory-level parallelism.
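Since every variant must agree with the scalar baseline, we also like to keep a quick sanity check around. The test below is a sketch of ours, not from the post's codebase; it cross-checks the NEON kernels against the scalar implementation and exercises the scalar tail loops with a length that isn't a multiple of 32.

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn neon_kernels_match_scalar_baseline() {
        // Deterministic pseudo-random 128-byte (1024-bit) vectors.
        let x: Vec<u8> = (0u32..128).map(|i| (i.wrapping_mul(2654435761) >> 24) as u8).collect();
        let y: Vec<u8> = (0u32..128).map(|i| (i.wrapping_mul(2246822519) >> 16) as u8).collect();

        let expected = hamming_distance(&x, &y);
        assert_eq!(hamming_distance_neon(&x, &y), expected);
        assert_eq!(hamming_distance_neon_ilp(&x, &y), expected);

        // A length that is not a multiple of 32 exercises the scalar tail loops.
        let expected_tail = hamming_distance(&x[..100], &y[..100]);
        assert_eq!(hamming_distance_neon(&x[..100], &y[..100]), expected_tail);
        assert_eq!(hamming_distance_neon_ilp(&x[..100], &y[..100]), expected_tail);
    }
}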
Fewer Instructions, More Speed
Dense vector retrieval operates on vectors with a fixed dimension, usually between 384 and 1536. This allows us to optimize our compute kernels even further by fully unrolling the loops and minimizing control-flow overhead. Fully unrolled loops also let the compiler do a better job of register allocation, which further improves our effective throughput.
fn hamming_distance_neon_1024b(x: &[u8], y: &[u8]) -> u32 {
    // Specialized for 1024-bit (128-byte) vectors; the loop below is fully unrolled.
    assert_eq!(x.len(), y.len());
    assert_eq!(x.len(), 128);
    let x_ptr = x.as_ptr();
    let y_ptr = y.as_ptr();
    unsafe {
        let mut acc = [vdupq_n_u8(0); 2];
        unroll! {
            for i in 0..4 {
                let x = vld1q_u8(x_ptr.add(i * 32));
                let y = vld1q_u8(y_ptr.add(i * 32));
                let xp1 = vcntq_u8(veorq_u8(x, y));
                acc[0] = vaddq_u8(acc[0], xp1);
                let x = vld1q_u8(x_ptr.add(i * 32 + 16));
                let y = vld1q_u8(y_ptr.add(i * 32 + 16));
                let xp2 = vcntq_u8(veorq_u8(x, y));
                acc[1] = vaddq_u8(acc[1], xp2);
            }
        }
        // Horizontal add
        vaddvq_u16(vpaddlq_u8(vaddq_u8(acc[0], acc[1]))) as u32
    }
}
Threads | Samples/sec | Throughput (GB/s) |
---|---|---|
1 | 525,551,619 | 62.65 |
10 | 2,934,356,752 | 349.80 |
This final variant delivered the best throughput for both single-threaded and multi-threaded workloads. With minimal branching and fully sequential, predictable memory access, it approaches the architectural limits of the CPU.
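A natural way to put a fixed-size kernel like this to work is behind a dispatch on vector width. The wrapper below is purely illustrative (the name and the length cases are our assumptions, not our internal API), with the generic ILP kernel as the fallback for other sizes.

// Hypothetical dispatch on vector width in bytes: 128 bytes (1024 bits) hits
// the fully unrolled kernel, anything else falls back to the generic ILP kernel.
pub fn hamming_distance_dispatch(x: &[u8], y: &[u8]) -> u32 {
    debug_assert_eq!(x.len(), y.len());
    match x.len() {
        128 => hamming_distance_neon_1024b(x, y),
        _ => hamming_distance_neon_ilp(x, y),
    }
}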
Conclusion
This journey underscores how performance engineering at the instruction level can help us get the most out of the underlying hardware. Leveraging ILP and problem-specific assumptions significantly improved both the single-threaded and multi-threaded throughput of our Hamming distance kernel. The 1024-bit optimized NEON kernel now sustains nearly 350 GB/s, which lowers the latency of our billion-scale retrieval and improves the overall cost efficiency of our offering. Check out our benchmarks to see how optimizations like this allow us to offer the highest performance on the market.
If you are passionate about squeezing performance at both low-level and system-level, shoot me an email at marek@topk.io. We’re hiring!