Vector databases are often evaluated on isolated metrics like query latency or recall, but production workloads depend on more than that. Databases need to be able to ingest data continuously, scale under concurrency, handle filters efficiently, and maintain recall across dataset sizes.
In this benchmark, we evaluate how several of the most widely used managed vector databases (both serverless and instance-based) perform under simulated production-like workloads. We run five core benchmarks across multiple dataset sizes (100k, 1M, and 10M vectors):
- Ingest: Measures total ingestion time and throughput of the write path (100k → 10M vectors), along with freshness—the delay from write acknowledgement to data being available in query results.
- Concurrency: Assesses how latency and QPS change as the number of concurrent client workers increases (1, 2, 4, 8).
- Filtering: Examines how metadata and keyword filters affect latency and QPS at different selectivity levels (100%, 10%, 1%).
- Recall: Reports how well systems maintain recall as datasets scale and filters are applied.
- Read-write: Evaluates how query performance degrades, if at all, while the database handles concurrent writes.
Scope
This benchmark evaluates dense retrieval performance across a set of popular managed vector databases, including both serverless and instance-based providers. All providers are tested with the same datasets, identical request patterns, and the same evaluation logic. Only TopK's absolute numbers are made public; other providers are anonymized to keep the focus on behavior rather than direct comparison while still reflecting how real, widely used services behave.
1. Dataset
The benchmark is built from MS MARCO passages and queries, using 768-dimensional embeddings generated with nomic-ai/modernbert-embed-base. We provide three corpus sizes—100k, 1M, and 10M vectors—each bundled with 1,000 evaluation queries and precomputed ground-truth nearest neighbors so recall can be measured consistently across systems.
All datasets are hosted on S3 under s3://topk-bench as docs-{100k,1m,10m}.parquet and queries-{100k,1m,10m}.parquet. The topk-io/bench repository contains the benchmarking tool and further details on the dataset format.
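If you want to poke at the data before running anything, the Parquet files can be loaded directly from S3. The sketch below assumes pandas with pyarrow and s3fs installed and that the bucket is publicly readable; the exact column layout is documented in the topk-io/bench repository.

```python
import pandas as pd  # assumes pyarrow and s3fs are installed

# Load the 100k corpus and its evaluation queries straight from S3.
docs = pd.read_parquet("s3://topk-bench/docs-100k.parquet")
queries = pd.read_parquet("s3://topk-bench/queries-100k.parquet")

# Inspect the schema; the exact column names are described in topk-io/bench.
print(docs.dtypes)
print(queries.dtypes)
print(f"{len(docs):,} documents, {len(queries):,} evaluation queries")
```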
Each document includes the original passage text, its dense embedding, and two synthetic filter fields designed for controlled selectivity experiments:
- int_filter: An integer field whose values are sampled uniformly from [0, 10_000]. Predicates like int_filter <= 10_000, int_filter <= 1_000, and int_filter <= 100 are then constructed to match ~100%, ~10%, and ~1% of documents.
- keyword_filter: A categorical field populated with tokens chosen so that different keyword predicates produce similar 100% / 10% / 1% selectivity levels. Queries like text_match(keyword_filter, "10000"), text_match(keyword_filter, "01000"), and text_match(keyword_filter, "00100") are then constructed to match ~100%, ~10%, and ~1% of documents.
These fields let you precisely control how many documents match a query, making it possible to evaluate how different systems behave as filter selectivity changes.
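To make the selectivity levels concrete, here is a minimal sketch of how a field like int_filter can be generated and verified. The actual generation code lives in topk-io/bench, so treat the RNG seed and sizes here as illustrative assumptions; keyword_filter follows the same idea with tokens instead of integers.

```python
import numpy as np

# Illustrative only: the real dataset generation lives in topk-io/bench.
rng = np.random.default_rng(42)
n_docs = 100_000

# int_filter: uniform integers in [0, 10_000], so "int_filter <= T"
# matches roughly T / 10_000 of the corpus.
int_filter = rng.integers(0, 10_001, size=n_docs)

for threshold, expected in [(10_000, 1.00), (1_000, 0.10), (100, 0.01)]:
    actual = (int_filter <= threshold).mean()
    print(f"int_filter <= {threshold}: expected ~{expected:.0%}, matched {actual:.1%}")
```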
2. Methodology
All benchmarks run in AWS containers in the same region as each provider. Most providers are serverless; for non-serverless providers, we use minimal viable configurations based on provider recommendations to ensure fair comparison.
To reduce noise from transient issues like network hiccups, tail latencies, and noisy neighbors in serverless environments, each test configuration runs 5 times. We drop the worst run and report the mean of the remaining 4 runs, where "worst" is defined as:
- Latency: highest p99 latency
- QPS: lowest QPS
- Recall: lowest recall
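To make the aggregation rule concrete, here is a small sketch of drop-worst-then-average (illustrative only; the actual logic lives in topk_bench):

```python
def aggregate_runs(runs, metric, higher_is_worse):
    """Drop the single worst of the 5 runs and average the remaining 4.

    `runs` is a list of per-run metric dicts, e.g.
    [{"p99_ms": 21.3, "qps": 410.2, "recall": 0.993}, ...].
    """
    key = lambda r: r[metric]
    worst = max(runs, key=key) if higher_is_worse else min(runs, key=key)
    kept = [r for r in runs if r is not worst]
    return sum(r[metric] for r in kept) / len(kept)

# p99 latency: drop the run with the highest p99; QPS and recall: drop the lowest.
# p99 = aggregate_runs(runs, "p99_ms", higher_is_worse=True)
# qps = aggregate_runs(runs, "qps", higher_is_worse=False)
# recall = aggregate_runs(runs, "recall", higher_is_worse=False)
```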
Before each measurement, we perform a warmup run to ensure systems are in a steady state. Query benchmarks warm up with concurrency=1 and a 60-second timeout (2x the measurement timeout). Filter benchmarks warm up with both filters enabled to exercise all code paths.
To measure latency and QPS, we send queries continuously over a 30-second window. For recall, we run a set of 1,000 queries and measure recall@10.
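In pseudocode terms, the latency/QPS measurement for a single worker boils down to sending queries back to back for 30 seconds and deriving QPS and percentiles from the recorded per-query latencies. The sketch below uses a placeholder run_query function and is not the actual topk_bench implementation:

```python
import time
import numpy as np

def measure_window(run_query, queries, window_s=30.0):
    """Send queries for `window_s` seconds; return QPS and latency percentiles (ms)."""
    latencies = []
    deadline = time.monotonic() + window_s
    i = 0
    while time.monotonic() < deadline:
        q = queries[i % len(queries)]
        start = time.monotonic()
        run_query(q)  # placeholder: one search request against the provider
        latencies.append((time.monotonic() - start) * 1000)
        i += 1
    qps = len(latencies) / window_s
    p50, p90, p99 = np.percentile(latencies, [50, 90, 99])
    return {"qps": qps, "p50_ms": p50, "p90_ms": p90, "p99_ms": p99}
```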
3. The topk_bench library
Designing benchmarks that look like real production workloads is non-trivial, so we encoded these workloads into topk_bench, an open-source, reproducible benchmarking tool. The tool includes the datasets, query sets, and evaluation logic used in this benchmark, making it easy to re-run experiments or benchmark your own deployments.
```python
import topk_bench as tb

# Ingest documents
tb.ingest(
    provider=tb.TopKProvider(),
    config=tb.IngestConfig(
        input="s3://topk-bench/docs-1m.parquet",
        # ...
    ),
)

# Run queries
tb.query(
    provider=tb.TopKProvider(),
    config=tb.QueryConfig(
        queries="s3://topk-bench/queries-1m.parquet",
        concurrency=4,
        # ...
    ),
)

# Write metrics (locally or to s3://)
tb.write_metrics("bench-1m.parquet")
```
You can find the full benchmarking suite at topk-io/bench, which is open source and ready to use. The repository also includes a comprehensive Jupyter notebook for in-depth result analysis and visualization.
4. Ingest Performance
Ingest performance evaluates how efficiently systems accept new data. We report ingestion time for each dataset size, and then examine how throughput behaves during long ingests. Lower ingestion time is better; higher throughput (MB/s) is better.
For each provider, we performed a grid search to find the optimal combination of batch size and concurrency that yields the highest throughput.
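The grid search itself is straightforward. The sketch below assumes a placeholder run_ingest function that performs one ingest run and returns throughput in MB/s; the candidate batch sizes and concurrencies shown are illustrative, not the exact grid we used.

```python
import itertools

def tune_ingest(run_ingest, batch_sizes=(100, 200, 500, 1000), concurrencies=(4, 8, 16, 32)):
    """Try every (batch_size, concurrency) pair and keep the highest-throughput one."""
    best = None
    for batch_size, concurrency in itertools.product(batch_sizes, concurrencies):
        mbps = run_ingest(batch_size=batch_size, concurrency=concurrency)
        if best is None or mbps > best[0]:
            best = (mbps, batch_size, concurrency)
    return best  # (throughput_mbps, batch_size, concurrency)
```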
4.1 Ingest Time
We measure the total time required for systems to accept all data for each dataset size (100k, 1M, and 10M vectors). This is the time from the first write request to the final acknowledgment that all data has been ingested. Lower ingestion time is better.
4.2 Ingest Throughput
Throughput measures the rate at which systems can accept data during ingestion, reported in MB per second. Higher throughput is better, indicating a more efficient write path and better utilization of available resources.
4.3 Freshness
Freshness measures the time between write acknowledgment and the time when the written document becomes visible to queries. We report p50, p90, and p99 percentiles. Lower freshness (time to visibility) is better.
Some providers offer strong consistency guarantees, which we disable in this benchmark since they would hide freshness characteristics at the cost of higher latency. We test with the default (eventual) consistency setup to measure true freshness behavior.
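Conceptually, freshness is measured by writing a document, recording when the write is acknowledged, and polling queries until the document becomes visible. A simplified sketch, with placeholder write_doc and is_visible functions:

```python
import time
import numpy as np

def measure_freshness(write_doc, is_visible, doc_ids, poll_interval_s=0.01):
    """For each doc: write, wait for query visibility, record the delay."""
    delays = []
    for doc_id in doc_ids:
        write_doc(doc_id)              # returns once the write is acknowledged
        acked = time.monotonic()
        while not is_visible(doc_id):  # e.g. query for the document until it appears
            time.sleep(poll_interval_s)
        delays.append(time.monotonic() - acked)
    p50, p90, p99 = np.percentile(delays, [50, 90, 99])
    return {"p50_s": p50, "p90_s": p90, "p99_s": p99}
```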
5. Query Throughput & Concurrency Scaling
Here we look at how query throughput and latency change as client-side concurrency increases. For each provider and dataset size (100k, 1M, 10M), we run the same fixed query set with 1, 2, 4, and 8 concurrent clients and observe the overall throughput (QPS) and P99 latency.
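Using the topk_bench API from section 3, the concurrency sweep is just the same query configuration run at each concurrency level (config fields beyond those shown earlier are elided, and the output filename is arbitrary):

```python
import topk_bench as tb

# Sweep client-side concurrency for the 1M dataset; 100k and 10M follow the same pattern.
for concurrency in (1, 2, 4, 8):
    tb.query(
        provider=tb.TopKProvider(),
        config=tb.QueryConfig(
            queries="s3://topk-bench/queries-1m.parquet",
            concurrency=concurrency,
            # ...
        ),
    )

tb.write_metrics("bench-concurrency-1m.parquet")
```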
5.1 Latency
We measure p99 latency as client-side concurrency increases. Systems that scale well should maintain stable or only slightly increasing latency, while systems that hit bottlenecks will show significant latency degradation at higher concurrency. Lower latency is better.
Note: Provider C shows extreme latency outliers at 8 concurrency (436ms at 100k, 439ms at 1M), 20–40x worse than other providers. At concurrency 4→8, latency increases dramatically (159ms→436.5ms at 100k, 71.5ms→439ms at 1M) while QPS remains flat. This strongly suggests a query execution strategy that degrades sharply beyond a certain concurrency threshold, which makes it a risky choice for workloads that need predictable tail latencies as they scale out.
5.2 QPS
We measure query throughput (queries per second) as client-side concurrency increases. Systems that scale well should show increasing QPS with higher concurrency, while systems that hit bottlenecks will plateau or degrade. Higher QPS is better.
6. Filtering Performance
Many practical workloads combine vector search with metadata and keyword predicates, which modify the candidate set and affect both query performance and recall. We evaluate how systems handle filters at different selectivities (100%, 10%, 1%). As selectivity decreases, systems that efficiently prune candidates should show improved performance without degrading in result quality. Lower latency is better; higher QPS is better.
Each provider runs the same filtered query sets across all dataset sizes using a single concurrent client.
6.1 No Filter vs 100% Filter
A 100% filter selects the entire dataset but still exercises the filtering path. This isolates the overhead of the filtering mechanism itself.
6.1.1 Latency
6.1.2 QPS
6.2 Metadata Filters (100% / 10% / 1% Selectivity)
We exercise metadata filtering using integer predicates at three selectivities. Systems that efficiently filter on metadata should show decreasing latency and increasing QPS as selectivity decreases from 100% to 1%, since fewer documents need to be processed. Lower latency is better; higher QPS is better.
6.2.1 Latency
6.2.2 QPS
6.3 Keyword Filters (100% / 10% / 1% Selectivity)
We test keyword filtering using text predicates at three selectivities. Keyword filters exercise a different execution path than metadata filters. Systems that efficiently handle keyword predicates should show decreasing latency and increasing QPS as selectivity decreases from 100% to 1%. Lower latency is better; higher QPS is better.
6.3.1 Latency
6.3.2 QPS
7. Recall
We measure recall at top_k=10 using pre-computed ground truth for each dataset size and filter type. Ground truth was computed using exact search, ensuring we have the true nearest neighbors for each query. The recall tests use the same query sets as the previous benchmarks. We examine whether systems maintain high recall accuracy across dataset sizes and filter selectivities. Higher recall is better.
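Recall@10 itself is simple to compute from the precomputed ground truth: the fraction of the 10 true nearest neighbors that appear in a system's top-10 results, averaged over the 1,000 evaluation queries. A minimal sketch:

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k=10):
    """Fraction of the true top-k neighbors present in the retrieved top-k."""
    return len(set(retrieved_ids[:k]) & set(ground_truth_ids[:k])) / k

def mean_recall(results, ground_truth, k=10):
    """Average recall@k over all evaluation queries."""
    return sum(
        recall_at_k(r, gt, k) for r, gt in zip(results, ground_truth)
    ) / len(results)
```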
7.1 Metadata Filters Recall
Note: Provider D shows lower recall (0.872) for the int 1% filter at the 10M dataset compared to other providers (~0.98-1.0). This suggests it may be using post-filtering, which can hurt recall when filters are highly selective. For applications that depend on high recall under narrow metadata filters, this kind of behavior would be a clear red flag. Perfect recall (1.0) under highly selective filters on small datasets (100k) may indicate that implementations switch to exact search for small filtered candidate sets.
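To see why post-filtering can hurt recall at 1% selectivity, consider a system that retrieves top-k candidates by vector similarity first and only applies the predicate afterwards: under a narrow filter, most candidates are discarded and too few valid results remain unless the candidate pool is heavily over-fetched. A purely conceptual sketch (not any provider's actual implementation):

```python
def post_filtered_search(index, query, predicate, k=10):
    """Search first, filter second: with a 1% filter, most of the
    top-k candidates fail the predicate, so recall suffers unless
    the candidate pool is heavily over-fetched."""
    candidates = index.search(query, k=k)  # ANN search ignores the filter
    return [doc for doc in candidates if predicate(doc)][:k]

def pre_filtered_search(index, query, predicate, k=10):
    """Filter-aware search: the index only considers documents that
    match the predicate, so all k slots hold valid results."""
    return index.search(query, k=k, filter=predicate)
```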
7.2 Keyword Filters Recall
Note: Perfect recall (1.0) under highly selective filters on small datasets (100k) may be due to exact matching behavior.
8. Read-Write Performance
To see how systems behave under mixed workloads, we run the same query tests while a background writer continuously updates documents from the dataset, then compare read-only against read-write runs. The writer updates unrelated metadata fields (vectors and filter fields remain unchanged), and we query the same documents being updated to exercise this path. Lower latency degradation is better; higher QPS maintenance is better.
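The mixed workload can be approximated with a background thread that keeps issuing metadata updates while the foreground query measurement runs unchanged. The sketch below reuses the measure_window idea from section 2 and a placeholder update_metadata function:

```python
import threading

def run_read_write(measure_window, run_query, queries, update_metadata, doc_ids):
    """Run the standard query measurement while a writer updates metadata in the background."""
    stop = threading.Event()

    def writer():
        i = 0
        while not stop.is_set():
            update_metadata(doc_ids[i % len(doc_ids)])  # vectors and filter fields unchanged
            i += 1

    t = threading.Thread(target=writer, daemon=True)
    t.start()
    try:
        return measure_window(run_query, queries)  # same 30s latency/QPS loop as read-only
    finally:
        stop.set()
        t.join()
```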
8.1 Latency
8.2 QPS
9. Cost Analysis (Simulated Workload)
To get a sense of operating costs, we approximate a production-like workload and apply each provider’s public on‑demand pricing (late 2025). We assume a 10M‑item collection (768‑dimensional vectors with ~1 KB of metadata), 10M writes and 50M queries over the course of a month.
We then run this workload through each pricing model (compute/scan units, read–write units, and storage) to get an order‑of‑magnitude cost comparison across TopK and other managed services.
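The cost model itself is simple arithmetic over each provider's published unit prices. The sketch below parameterizes it with generic unit prices (storage per GB-month, writes per million, queries per million) rather than any provider's actual numbers:

```python
def monthly_cost(
    storage_gb,
    writes_millions,
    queries_millions,
    price_per_gb_month,
    price_per_million_writes,
    price_per_million_queries,
):
    """Order-of-magnitude monthly cost for the simulated workload:
    10M items (~768-dim vectors + ~1 KB metadata), 10M writes, 50M queries."""
    return (
        storage_gb * price_per_gb_month
        + writes_millions * price_per_million_writes
        + queries_millions * price_per_million_queries
    )
```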
Under this model, TopK costs roughly $29/month, while other providers' costs span roughly $120–$650/month. The goal is to show how total cost compares for a fixed, production-like workload.
10. Conclusion
These benchmarks show how managed vector databases differ on ingestion speed, query concurrency, filtering, recall, mixed read-write workload handling, and operating cost. They reveal key tradeoffs—write speed vs. freshness, latency vs. concurrency, and search quality vs. filter complexity.
Why TopK behaves this way. TopK’s results in this benchmark are a consequence of its design from first principles, not any single tuning choice. A few key architectural decisions matter in practice:
- Write path tuned for throughput: a log‑structured write path with scalable compaction and indexing lets TopK accept writes quickly and make data promptly available for querying.
- Separation of read and write paths: queries read optimized data files via our vectorized query engine (reactor), which gives us predictable tail latencies under load.
- Separation of storage and compute: TopK uses object storage for durability and elastic compute, making it easy to scale and cost-effective for different workload shapes.
Together, these choices are what enable the “just works” behavior these benchmarks surface: systems that remain fast, predictable, and cost‑efficient even as scale, filters, and hybrid scoring are added. You can explore more benchmarks and engineering deep dives in the TopK blog and benchmarks pages, starting from topk.io/benchmarks.
For more on how TopK is built, see Why Vector DBs Are the Wrong Abstraction and Billion-Scale Hybrid Search.
When you’re ready, you can sign up and start ingesting and querying your data at console.topk.io.