SQLite FTS5 vs rapidfuzz: Fuzzy Search Showdown
The Fuzzy Search Problem
You have 500,000 product names in a database. A user types choclate milk — how do you find Chocolate Milk 2%?
This is the fuzzy search problem, and it’s everywhere: autocomplete, deduplication, record linkage, product catalogs, address matching. The input is imperfect — misspelled, abbreviated, reordered — and you need to find the right match anyway.
Two tools dominate the Python fuzzy search landscape for small-to-medium datasets: SQLite FTS5 (the database’s built-in full-text search engine) and rapidfuzz (a blazing-fast string similarity library). They both live in the “find things that kinda match” space, but they approach the problem from completely different angles.
FTS5 builds an inverted index and answers: “which documents contain these tokens?” Rapidfuzz computes edit distances and answers: “how similar are these two strings?”
In this post, I’ll benchmark both on a 500,000-row product dataset across five real search scenarios, measure speed, compare result quality, and show you a hybrid approach that gets the best of both worlds. Then I’ll test something most benchmarks ignore: batch performance — what happens when you need to run 10, 100, or 1,000 queries against the same dataset? Concrete numbers, runnable code, clear recommendations.
Meet the Contenders
SQLite FTS5
- 🗃 Inverted index on disk
- ⚡ Sub-millisecond token lookups
- 📊 BM25 relevance ranking
- 🔍 Boolean queries, phrases, prefixes
- 💾 Persistent — index survives restarts
rapidfuzz
- 🧬 Edit distance algorithms (C++ backend)
- 🎯 0–100 similarity scores
- ⌨ Typo-tolerant by nature
- 🔄 Token reordering & partial matching
- 💨 Zero setup — just pass a list
SQLite FTS5 in 30 Seconds
FTS5 is a virtual table extension that ships with SQLite. You create a special table, insert text, and query it with MATCH. Under the hood, it builds an inverted index — a mapping from every token (word) to the rows that contain it. This makes token lookups near-instant regardless of table size.
```sql
-- Create an FTS5 virtual table
CREATE VIRTUAL TABLE products_fts USING fts5(name);

-- Insert some products
INSERT INTO products_fts(name) VALUES ('Chocolate Milk 2%');
INSERT INTO products_fts(name) VALUES ('Dark Chocolate Bar 70%');
INSERT INTO products_fts(name) VALUES ('Whole Milk Organic');

-- Search with MATCH
SELECT name, rank FROM products_fts
WHERE products_fts MATCH 'chocolate'
ORDER BY rank;
-- Returns: Chocolate Milk 2%, Dark Chocolate Bar 70%
```
FTS5 supports four tokenizers. The default unicode61 splits on word boundaries. The porter tokenizer adds English stemming (so “running” matches “run”). And the trigram tokenizer breaks text into overlapping 3-character chunks, enabling substring matching — FTS5’s closest feature to fuzzy search.
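Assuming an SQLite build new enough to include the trigram tokenizer (3.34+), here is a minimal sketch of porter stemming and trigram substring matching side by side; the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# porter tokenizer: stems English words, so "running" is indexed as "run"
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='porter')")
conn.execute("INSERT INTO docs(body) VALUES ('running shoes')")
stemmed = conn.execute("SELECT body FROM docs WHERE docs MATCH 'run'").fetchall()
# → [('running shoes',)]

# trigram tokenizer: indexes overlapping 3-char chunks, so MATCH can find
# a mid-word fragment
conn.execute("CREATE VIRTUAL TABLE names USING fts5(name, tokenize='trigram')")
conn.execute("INSERT INTO names(name) VALUES ('Chocolate Milk 2%')")
sub = conn.execute("SELECT name FROM names WHERE names MATCH 'ocolat'").fetchall()
# → [('Chocolate Milk 2%',)]
```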
rapidfuzz in 30 Seconds
rapidfuzz is a Python library (with a C++ core) that computes string similarity scores. It’s the MIT-licensed successor to fuzzywuzzy, typically 10–100x faster. You give it two strings, it tells you how similar they are on a 0–100 scale.
```python
from rapidfuzz import fuzz, process

# Pairwise similarity
fuzz.ratio("chocolate milk", "choclate milk")
# 96.3 — very similar (one missing 'o')

fuzz.token_sort_ratio("milk chocolate", "chocolate milk")
# 100.0 — same words, different order

# Search a list for the best match
products = ["Chocolate Milk 2%", "Dark Chocolate Bar 70%", "Whole Milk Organic"]
process.extractOne("choclate milk", products)
# ('Chocolate Milk 2%', 73.3, 0)
```
The key algorithmic difference: `fuzz.ratio` uses the Indel distance (insertions and deletions), while `fuzz.token_sort_ratio` sorts words alphabetically first. `process.extractOne` uses `WRatio` by default — a weighted combination of multiple metrics that handles most real-world cases well.
The fundamental difference: FTS5 is a search engine (index → lookup). rapidfuzz is a similarity calculator (compare every pair). This distinction drives every performance and quality tradeoff that follows.
Building the Benchmark
Talking about performance in the abstract is useless. Let’s build something we can measure. Here’s the plan: generate 500,000 realistic product names, set up both search approaches, and run five test scenarios that represent real fuzzy search needs.
The Dataset
We need a dataset that’s realistic enough to stress-test both tools. I’ll generate 500,000 product names by combining categories, brands, modifiers, and sizes — the kind of thing you’d see in a large grocery or retail database. To hit half a million unique names, we need a large enough combinatorial space.
```python
import random
import sqlite3
from rapidfuzz import fuzz, process

# Product name generator — 500K realistic names
categories = [
    "Milk", "Bread", "Cheese", "Yogurt", "Butter", "Juice", "Cereal",
    "Pasta", "Rice", "Chicken", "Beef", "Salmon", "Shrimp", "Tofu",
    "Apple", "Banana", "Orange", "Grape", "Mango", "Tomato", "Onion",
    "Garlic", "Pepper", "Lettuce", "Spinach", "Broccoli", "Carrot",
    "Potato", "Chips", "Crackers", "Cookies", "Granola", "Oatmeal",
    "Coffee", "Tea", "Soda", "Water", "Chocolate", "Candy", "Gum",
    "Soap", "Shampoo", "Detergent", "Napkins", "Towels", "Batteries",
    "Almonds", "Walnuts", "Peanuts", "Cashews", "Pecans", "Pistachios",
    "Hummus", "Salsa", "Guacamole", "Sour Cream", "Cream Cheese",
    "Bagels", "Muffins", "Croissants", "Tortillas", "Pita",
    "Ketchup", "Mustard", "Mayo", "Hot Sauce", "Soy Sauce",
    "Ice Cream", "Frozen Pizza", "Frozen Waffles", "Fish Sticks",
    "Olive Oil", "Coconut Oil", "Avocado Oil", "Vinegar",
    "Honey", "Maple Syrup", "Jam", "Peanut Butter", "Nutella",
]

brands = [
    "Organic Valley", "Great Value", "Kirkland", "Trader Joe's",
    "Nature's Best", "Green Harvest", "Farm Fresh", "Blue Diamond",
    "Simply", "Horizon", "Stonyfield", "Annie's", "Bob's Red Mill",
    "Kettle Brand", "Clif", "KIND", "Burt's Bees", "Method",
    "Pacific", "Amy's", "Newman's Own", "Seventh Generation",
    "Whole Foods 365", "Good & Gather", "O Organics", "Simple Truth",
    "Wild Harvest", "Market Pantry", "Up & Up", "Signature Select",
    "Happy Belly", "Primal Kitchen", "Sir Kensington's", "Rao's",
    "Dave's Killer Bread", "Justin's", "Rxbar", "Purely Elizabeth",
]

modifiers = [
    "Organic", "Low-Fat", "Whole Grain", "Gluten-Free", "Sugar-Free",
    "2%", "1%", "Fat-Free", "Extra Virgin", "Cold-Pressed",
    "Unsalted", "Lightly Salted", "Honey Roasted", "Dark",
    "Original", "Family Size", "Single Serve", "Variety Pack",
    "Wild Caught", "Free Range", "Grass Fed", "Non-GMO",
    "Reduced Sodium", "No Added Sugar", "High Protein", "Keto",
    "Vegan", "Dairy-Free", "Plant-Based", "Fair Trade",
    "Smoked", "Roasted", "Raw", "Seasoned", "Marinated",
    "Unsweetened", "Lightly Sweetened", "Double Chocolate",
]

sizes = [
    "8 oz", "12 oz", "16 oz", "32 oz", "1 gal", "64 oz",
    "6 pack", "12 pack", "24 pack", "5 lb", "10 lb", "1 lb",
    "100 ct", "200 ct", "500 ml", "1 L", "2 L",
    "3 oz", "4 oz", "20 oz", "28 oz", "48 oz", "3 lb",
    "8 ct", "10 ct", "16 ct", "36 pack", "2 lb",
]

random.seed(42)
products = set()
while len(products) < 500_000:
    parts = [random.choice(brands)]
    if random.random() > 0.3:
        parts.append(random.choice(modifiers))
    parts.append(random.choice(categories))
    if random.random() > 0.4:
        parts.append(random.choice(sizes))
    products.add(" ".join(parts))
products = list(products)

print(f"Generated {len(products)} unique product names")
print(f"Examples: {products[:3]}")
```
Setting Up FTS5
We’ll create two FTS5 tables — one with the default unicode61 tokenizer for standard full-text search, and one with the trigram tokenizer for substring matching.
```python
# Create in-memory SQLite database with FTS5
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Standard tokenizer (word-boundary splitting)
cur.execute("CREATE VIRTUAL TABLE fts_standard USING fts5(name)")

# Trigram tokenizer (3-character overlapping chunks)
cur.execute("CREATE VIRTUAL TABLE fts_trigram USING fts5(name, tokenize='trigram')")

# Insert all products into both tables
rows = [(name,) for name in products]
cur.executemany("INSERT INTO fts_standard(name) VALUES (?)", rows)
cur.executemany("INSERT INTO fts_trigram(name) VALUES (?)", rows)
conn.commit()
print("FTS5 tables built and indexed")
```
The Five Test Scenarios
Each scenario tests a different kind of fuzzy search need. Together, they cover the spectrum from exact token matching to pure typo tolerance.
| # | Scenario | Query | What We're Testing |
|---|---|---|---|
| 1 | Exact token | `chocolate` | Find all products containing "chocolate" |
| 2 | Typo tolerance | `choclate` | Misspelled query — missing the 'o' |
| 3 | Token reorder | `milk chocolate organic` | Words in wrong order vs stored names |
| 4 | Prefix | `choc` | Partial input (autocomplete scenario) |
| 5 | Substring | `ocolat` | Mid-word fragment matching |
Speed Benchmark: The Numbers
Here’s the benchmark code. Each scenario runs 100 iterations (5 for rapidfuzz on full scans — it’s much slower at 500K) and we take the median time.
```python
import timeit

def bench(label, fn, runs=100):
    times = timeit.repeat(fn, number=1, repeat=runs)
    median = sorted(times)[len(times) // 2]
    return median * 1000  # convert to ms

# Scenario 1: Exact token search — "chocolate"
results = {}
results["S1_fts_std"] = bench("FTS5 standard", lambda:
    cur.execute("SELECT name FROM fts_standard WHERE fts_standard MATCH 'chocolate' ORDER BY rank LIMIT 10").fetchall()
)
results["S1_fts_tri"] = bench("FTS5 trigram", lambda:
    cur.execute("SELECT name FROM fts_trigram WHERE fts_trigram MATCH 'chocolate' ORDER BY rank LIMIT 10").fetchall()
)
results["S1_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("chocolate", products, scorer=fuzz.WRatio, limit=10),
    runs=5
)

# Scenario 2: Typo tolerance — "choclate"
results["S2_fts_std"] = bench("FTS5 standard", lambda:
    cur.execute("SELECT name FROM fts_standard WHERE fts_standard MATCH 'choclate' ORDER BY rank LIMIT 10").fetchall()
)
results["S2_fts_tri"] = bench("FTS5 trigram", lambda:
    cur.execute("SELECT name FROM fts_trigram WHERE fts_trigram MATCH 'choclate' ORDER BY rank LIMIT 10").fetchall()
)  # Trigram matches substrings — "choclate" shares some trigrams with "chocolate"
results["S2_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("choclate", products, scorer=fuzz.WRatio, limit=10),
    runs=5
)

# Scenario 3: Token reorder — "milk chocolate organic"
results["S3_fts_std"] = bench("FTS5 standard", lambda:
    cur.execute("SELECT name FROM fts_standard WHERE fts_standard MATCH 'milk AND chocolate AND organic' ORDER BY rank LIMIT 10").fetchall()
)
results["S3_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("milk chocolate organic", products, scorer=fuzz.token_sort_ratio, limit=10),
    runs=5
)

# Scenario 4: Prefix — "choc"
results["S4_fts_std"] = bench("FTS5 standard", lambda:
    cur.execute("SELECT name FROM fts_standard WHERE fts_standard MATCH 'choc*' ORDER BY rank LIMIT 10").fetchall()
)
results["S4_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("choc", products, scorer=fuzz.partial_ratio, limit=10),
    runs=5
)

# Scenario 5: Substring — "ocolat"
results["S5_fts_tri"] = bench("FTS5 trigram", lambda:
    cur.execute("SELECT name FROM fts_trigram WHERE fts_trigram MATCH 'ocolat' ORDER BY rank LIMIT 10").fetchall()
)
results["S5_rfuzz"] = bench("rapidfuzz", lambda:
    process.extract("ocolat", products, scorer=fuzz.partial_ratio, limit=10),
    runs=5
)
```
And here are the actual results on a standard VPS (6 vCPU, 8 GB RAM), Python 3.13, SQLite 3.46, rapidfuzz 3.14:
| Scenario | FTS5 Standard | FTS5 Trigram | rapidfuzz |
|---|---|---|---|
| 1. Exact token — "chocolate" | 48 ms | 67 ms | 2,597 ms |
| 2. Typo — "choclate" | 0 results ❌ | 0.09 ms ** | 2,558 ms |
| 3. Token reorder — "milk chocolate organic" | 3.6 ms | — | 644 ms |
| 4. Prefix — "choc" | 38 ms | — | 882 ms |
| 5. Substring — "ocolat" | 0 results ❌ | 57 ms | 1,020 ms |
** FTS5 trigram on “choclate” actually finds 0 candidates because the trigram “ocl” (from the character deletion) doesn’t exist in “chocolate.” Trigram matching handles substring queries well but struggles with character-level edits like deletions and transpositions. The 0.09 ms time reflects returning an empty result set, not a successful match.
The pattern is clear — and at 500K records, FTS5 is 10–100x faster for any query it can answer. The gap is wide, but not as extreme as raw index-lookup theory suggests: ORDER BY rank requires BM25-scoring all matches, which costs 3–67 ms depending on result count (a query like “chocolate” matches 17,000+ rows that all need scoring). Without ranking, FTS5 achieves sub-millisecond times — but ranked results are what you actually want in production.
For Scenario 2 (typos), FTS5 with the standard tokenizer still returns zero results. It doesn’t know that “choclate” is close to “chocolate.” And critically, the trigram tokenizer also fails here — character deletions break trigram overlap in ways that substring queries don’t.
rapidfuzz sits in the 644–2,597 ms range — from well over half a second to nearly three seconds per query — because it always does the same thing: scan all 500,000 strings and compute a similarity score for each one. That cost scales linearly with dataset size, exactly as you’d expect from an O(n) algorithm. It doesn’t care whether the query is a perfect word or a mangled typo — the algorithm is the same.
Try It: Benchmark Visualization
Query time in milliseconds (lower is better). FTS5 bars are barely visible at this scale.
Result Quality: Who Finds What?
Speed is half the story. The other half is: do you find the right results? Let’s look at the top results each method returns for our tricky scenarios.
Scenario 2: Typo — “choclate”
| Rank | FTS5 Standard | rapidfuzz (WRatio) |
|---|---|---|
| 1 | (no results) | Kirkland Chocolate 12 oz — 90 |
| 2 |  | Simply Dark Chocolate — 90 |
| 3 |  | KIND Chocolate 16 oz — 90 |
FTS5 with the standard tokenizer completely fails here. The word “choclate” isn’t in the index, so the query returns nothing. No partial credit, no “did you mean” — just silence. Rapidfuzz handles it gracefully, recognizing that “choclate” is one edit away from “chocolate” and returning relevant products with scores around 90.
Scenario 3: Token Reorder — “milk chocolate organic”
| Rank | FTS5 Standard (AND query) | rapidfuzz (token_sort_ratio) |
|---|---|---|
| 1 | Organic Valley Organic Chocolate Milk 32 oz | Organic Valley Organic Chocolate Milk 32 oz — 73 |
| 2 | Horizon Organic Chocolate Milk 1 gal | Horizon Organic Chocolate Milk 1 gal — 71 |
| 3 | Stonyfield Organic Chocolate Milk 64 oz | Stonyfield Organic Chocolate Milk 64 oz — 66 |
Both methods handle token reordering well — but for different reasons. FTS5’s Boolean AND query finds all rows containing every token regardless of order. rapidfuzz’s token_sort_ratio alphabetically sorts both strings before comparing. Both arrive at the same results, but FTS5 does it in 3.6 ms vs 644 ms for rapidfuzz.
Scenario 5: Substring — “ocolat”
| Rank | FTS5 Trigram | rapidfuzz (partial_ratio) |
|---|---|---|
| 1 | Annie's Dark Chocolate 8 oz | Annie's Dark Chocolate 8 oz — 100 |
| 2 | Blue Diamond Chocolate 16 oz | Blue Diamond Chocolate 16 oz — 100 |
| 3 | Clif Chocolate 12 pack | Clif Chocolate 12 pack — 100 |
The trigram tokenizer shines here. “ocolat” is a substring of “chocolate”, and FTS5’s trigram index finds that match in 57 ms (including BM25 ranking of all matches). rapidfuzz’s partial_ratio also finds it (it slides the shorter string along the longer one looking for the best alignment) but takes 1,020 ms to scan all 500K rows.
The quality verdict: if the user types a correct word (even in wrong order), FTS5 matches rapidfuzz’s results at 10–100x the speed. If the user makes a typo, both FTS5 standard and FTS5 trigram can fail — trigram matching breaks down on character deletions, not just missing tokens — while rapidfuzz saves the day, though at 500K records that rescue costs 1–3 seconds per query.
Memory, Storage, and Setup Complexity
Beyond speed and quality, practical concerns matter: how much disk/RAM does each approach consume, and how hard is it to set up?
| Metric | FTS5 Standard | FTS5 Trigram | rapidfuzz |
|---|---|---|---|
| Storage overhead | ~1.5x data size | ~2.5x data size | None (in-memory list) |
| Persistence | On disk — survives restarts | On disk — survives restarts | None — reload each time |
| Index build time (500K rows) | ~3.3 s | ~8.0 s | N/A |
| Incremental updates | INSERT/DELETE auto-update index | INSERT/DELETE auto-update index | Append to list (no index to update) |
| Dependencies | Python stdlib (`sqlite3`) | Python stdlib (`sqlite3`) | `pip install rapidfuzz` |
| Setup lines of code | ~5 (CREATE TABLE + INSERT) | ~5 | ~2 (import + call) |
For a 500,000-row dataset of product names (averaging ~30 characters each), the raw data is about 15 MB. FTS5 standard adds roughly 23 MB of index. The trigram index is hungrier at about 38 MB — each string generates many more tokens. At this scale, the trigram index is starting to get chunky, but it’s still well within reason for any modern system.
The meaningful differences are architectural:
- FTS5 persists its index to disk. Restart your app and the index is still there, ready for queries. This is a huge win for web apps and services.
- rapidfuzz requires loading all candidates into memory every time your process starts. For 500K strings, that’s ~15–20 MB and ~100 ms to load. Still manageable, but at 5 million strings you’re looking at 150+ MB of memory and noticeable startup lag.
- FTS5 handles incremental updates natively. INSERT a new product and the index updates automatically. With rapidfuzz, you just append to your list — but if you want sorted or deduplicated results, you manage that yourself.
Zero-dependency wins: FTS5 comes free with Python’s built-in sqlite3 module. No pip install, no C compiler issues, no version conflicts. That alone makes it the default choice for simple full-text search needs.
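One caveat to the zero-dependency claim: FTS5 is a compile-time SQLite option, and the occasional Python build ships a SQLite without it. A small probe you can run at startup (the function name is mine):

```python
import sqlite3

def fts5_available() -> bool:
    """Return True if this Python's bundled SQLite was compiled with FTS5."""
    try:
        sqlite3.connect(":memory:").execute(
            "CREATE VIRTUAL TABLE _probe USING fts5(x)")
        return True
    except sqlite3.OperationalError:
        return False

print(fts5_available())
```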
Batch Search: When You Have More Than One Query
Every benchmark above tests a single query. But real-world search workloads are rarely one-and-done. Consider these scenarios:
- 10 queries — an autocomplete session. The user types, pauses, types more. Each pause fires a search.
- 100 queries — a page of imported records. A user uploads a CSV of product names that need fuzzy matching against your catalog.
- 1,000 queries — a bulk data reconciliation job. Two systems with different product name conventions need to be aligned overnight.
The hypothesis going in: rapidfuzz should dominate batch scenarios because it can vectorize fuzzy matching across queries in-memory, while FTS5 must execute individual SQL queries.
Let’s test that.
Benchmark Setup
We’ll generate batches of queries by randomly sampling from our product list and introducing typos — simulating real user input:
```python
def make_batch_queries(products, n, typo_rate=0.3):
    """Generate n search queries — some exact, some with typos."""
    queries = []
    for _ in range(n):
        base = random.choice(products).split()
        # Pick 1-2 words from the product name
        q_words = random.sample(base, min(2, len(base)))
        query = " ".join(q_words)
        # Introduce a typo with some probability
        if random.random() < typo_rate and len(query) > 4:
            pos = random.randint(1, len(query) - 2)
            query = query[:pos] + query[pos+1:]  # delete a character
        queries.append(query)
    return queries

batch_10 = make_batch_queries(products, 10)
batch_100 = make_batch_queries(products, 100)
batch_1000 = make_batch_queries(products, 1000)

def bench_batch(label, fn, runs=5):
    """Benchmark a batch operation — fewer runs since each is longer."""
    times = timeit.repeat(fn, number=1, repeat=runs)
    return sorted(times)[len(times) // 2] * 1000  # median ms
```
FTS5 Batch: Individual SQL Queries
FTS5 has no built-in “batch search” API. We just run each query through a MATCH statement in a loop. The question is whether per-query overhead adds up.
```python
def fts5_batch(queries, table="fts_trigram"):
    """Run a batch of FTS5 queries sequentially."""
    results = []
    for q in queries:
        try:
            rows = cur.execute(
                f"SELECT name FROM {table} WHERE {table} MATCH ? LIMIT 10",
                (q,)
            ).fetchall()
            results.append([r[0] for r in rows])
        except sqlite3.OperationalError:
            results.append([])  # handle invalid MATCH syntax gracefully
    return results

# Benchmark FTS5 trigram on each batch size
fts_10 = bench_batch("FTS5 tri ×10", lambda: fts5_batch(batch_10))
fts_100 = bench_batch("FTS5 tri ×100", lambda: fts5_batch(batch_100))
fts_1000 = bench_batch("FTS5 tri ×1000", lambda: fts5_batch(batch_1000))
```
rapidfuzz Batch: Sequential and Parallel
For rapidfuzz, we have two options: run queries sequentially (simple loop), or try to parallelize across CPU cores.
```python
from concurrent.futures import ProcessPoolExecutor

def rfuzz_batch_sequential(queries):
    """Run rapidfuzz extract for each query — sequential."""
    return [process.extract(q, products, scorer=fuzz.WRatio, limit=10)
            for q in queries]

def rfuzz_single_query(q):
    """Single rapidfuzz query — for use with the parallel executor."""
    return process.extract(q, products, scorer=fuzz.WRatio, limit=10)

def rfuzz_batch_parallel(queries, workers=6):
    """Run rapidfuzz queries in parallel using a process pool."""
    with ProcessPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(rfuzz_single_query, queries))

# Sequential benchmarks
rf_seq_10 = bench_batch("rfuzz seq ×10", lambda: rfuzz_batch_sequential(batch_10))
rf_seq_100 = bench_batch("rfuzz seq ×100", lambda: rfuzz_batch_sequential(batch_100))
rf_seq_1000 = bench_batch("rfuzz seq ×1000",
                          lambda: rfuzz_batch_sequential(batch_1000), runs=3)

# Parallel benchmarks (6 workers — one per vCPU on the test machine)
rf_par_100 = bench_batch("rfuzz par ×100", lambda: rfuzz_batch_parallel(batch_100))
rf_par_1000 = bench_batch("rfuzz par ×1000",
                          lambda: rfuzz_batch_parallel(batch_1000), runs=3)
```
Hybrid Batch
The hybrid approach (FTS5 trigram for retrieval, rapidfuzz for re-ranking) in a simple loop, using the `hybrid_search` function defined in the next section:

```python
def hybrid_batch(queries):
    """Run hybrid search for each query."""
    return [hybrid_search(q, conn, limit=10) for q in queries]

hyb_10 = bench_batch("hybrid ×10", lambda: hybrid_batch(batch_10))
hyb_100 = bench_batch("hybrid ×100", lambda: hybrid_batch(batch_100))
hyb_1000 = bench_batch("hybrid ×1000", lambda: hybrid_batch(batch_1000))
```
The Results: Hypothesis Busted
| Batch Size | FTS5 Trigram | rapidfuzz (seq) | rapidfuzz (parallel) | Hybrid * |
|---|---|---|---|---|
| 10 queries | 3.1 ms | 26.7 s | 5.3 s | 16.6 s |
| 100 queries | 40 ms | 283 s | 56 s | 90 s |
| 1,000 queries | 357 ms | ~2,828 s (47 min) | ~542 s (9.0 min) | ~904 s * |
* Hybrid batch times are dominated by fallback queries. With 30% typo rate in the query generator, ~38% of queries produce zero FTS5 candidates and trigger a full rapidfuzz scan (~2.6 s each). The remaining ~62% of queries complete in 1–6 ms via the FTS5 + re-rank path. Hybrid only outperforms standalone rapidfuzz when the fallback rate is low.
The hypothesis was dead wrong. FTS5 doesn’t just win batch scenarios — it dominates them even more than single-query scenarios. Let’s look at throughput to see why:
| Method | Throughput (queries/sec) | vs FTS5 |
|---|---|---|
| FTS5 Trigram | ~2,800 queries/sec | 1x (baseline) |
| Hybrid (no-fallback queries only) | ~170–830 queries/sec | 0.06–0.3x |
| rapidfuzz (6 cores) | ~1.8 queries/sec | 0.0006x |
| rapidfuzz (sequential) | ~0.35 queries/sec | 0.0001x |
FTS5 processes ~2,800 queries per second. rapidfuzz manages ~0.35 sequentially. That’s an 8,000x throughput difference for pure index-based queries. Even with 6 CPU cores, rapidfuzz only reaches ~1.8 queries per second — still ~1,500x slower than FTS5. The hybrid approach sits in between, but its effective throughput depends entirely on how many queries trigger a fallback to full rapidfuzz scan.
Why “Vectorization” Doesn’t Save rapidfuzz
The intuition that rapidfuzz should benefit from batch processing comes from a misunderstanding of where its time goes. rapidfuzz’s C++ backend is already highly optimized with SIMD vectorization — but that optimization operates within a single query (comparing one query against 500K candidates using vectorized string distance computations). Each additional query pays the full O(n) scan cost again.
FTS5, by contrast, pays roughly O(log n) per query: each lookup walks the index’s B-tree instead of scanning the data, so the per-query cost is independent of batch size. One thousand index lookups is still just a thousand index lookups — no linear scan of the data required.
The batch takeaway: if you have more than ~10 queries to run against a dataset of 100K+ records, standalone rapidfuzz is a non-starter. FTS5 is the clear winner for token-based matching. The hybrid approach works well for typo tolerance when most queries produce FTS5 candidates — but be aware that queries with character deletions or heavy misspellings may trigger full-scan fallbacks that negate the speed gains.
Try It: Batch Throughput Comparison
Queries per second at 1,000-query batch size (log scale). Higher is better.
The Hybrid Approach: Best of Both Worlds
Here’s the insight most comparison articles miss: you don’t have to choose. The best approach for typo-tolerant search over a medium-to-large dataset is a two-stage pipeline:
- **Stage 1 — FTS5 retrieval:** Use the trigram tokenizer to quickly pull candidate matches from the database. This narrows 500,000 rows to maybe 50–200 candidates in 1–6 ms (without BM25 ranking, using `LIMIT` directly).
- **Stage 2 — rapidfuzz re-ranking:** Score those candidates with `fuzz.WRatio` for typo-tolerant similarity. Re-rank by score. This takes under 1 ms on a small candidate set.
- **Fallback:** If FTS5 returns zero candidates (common with character-level typos like deletions), fall back to a full rapidfuzz scan. This is the worst case and costs 1–3 seconds.
This is the same pattern used in production search engines: a fast, coarse retrieval stage followed by a precise scoring stage. When FTS5 finds candidates, the hybrid completes in 1–6 ms. The critical caveat: if the query is too mangled for trigram matching (e.g., character deletions that break trigram overlap), the hybrid falls back to a full rapidfuzz scan.
```python
from rapidfuzz import fuzz, process

def hybrid_search(query, conn, limit=10, candidates=200):
    """
    Two-stage fuzzy search:
      1. FTS5 trigram for fast candidate retrieval
      2. rapidfuzz for typo-tolerant re-ranking
    """
    cur = conn.cursor()

    # Stage 1: Pull candidates using FTS5 trigram.
    # For short queries (< 3 chars), fall back to a prefix query
    # on the standard index.
    if len(query) >= 3:
        rows = cur.execute(
            "SELECT name FROM fts_trigram WHERE fts_trigram MATCH ? LIMIT ?",
            (query, candidates)
        ).fetchall()
    else:
        rows = cur.execute(
            "SELECT name FROM fts_standard WHERE fts_standard MATCH ? LIMIT ?",
            (query + "*", candidates)
        ).fetchall()

    candidate_names = [r[0] for r in rows]

    if not candidate_names:
        # FTS5 found nothing — fall back to a full rapidfuzz scan
        # (slower, but handles total mismatches)
        results = process.extract(query, products, scorer=fuzz.WRatio, limit=limit)
        return [(name, score) for name, score, _ in results]

    # Stage 2: Re-rank candidates with rapidfuzz
    scored = []
    for name in candidate_names:
        score = fuzz.WRatio(query, name)
        scored.append((name, score))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:limit]
```
Let’s benchmark the hybrid approach against both standalone methods:
```python
# Benchmark: hybrid vs standalone on typo query "choclate"
hybrid_time = bench("Hybrid", lambda:
    hybrid_search("choclate", conn, limit=10),
    runs=100
)

# Results on "choclate" (worst case — FTS5 trigram finds 0 candidates):
#   FTS5 standard:  0.05 ms — but 0 results (useless)
#   FTS5 trigram:   0.09 ms — 0 results ("ocl" doesn't exist in "chocolate")
#   rapidfuzz:      2,558 ms — correct results, but slow
#   Hybrid:         2,768 ms — falls back to full rapidfuzz scan!
#
# Results on "chocolat" (best case — FTS5 trigram finds candidates):
#   Hybrid:         1.9 ms — correct results AND fast
```
The results reveal a critical nuance. “choclate” (missing ‘o’) is a worst-case query for the hybrid approach because FTS5 trigram finds zero candidates — the deleted character breaks trigram overlap:
| Method | Time (“choclate”) | Time (“chocolat”) | Correct Results? |
|---|---|---|---|
| FTS5 standard | 0.05 ms (0 results) | 0.05 ms (0 results) | ❌ |
| FTS5 trigram | 0.09 ms (0 results) | 57 ms | Depends on typo type |
| rapidfuzz (full scan) | 2,558 ms | 2,597 ms | ✅ Yes |
| Hybrid (FTS5 + rapidfuzz) | 2,768 ms (fallback!) | 1.9 ms | ✅ Yes |
When FTS5 trigram finds candidates (e.g., “chocolat”, “choco”, “ocolat”), the hybrid approach completes in 1–6 ms — genuinely fast. But when the query breaks trigram overlap (character deletions, severe misspellings), the hybrid falls back to a full rapidfuzz scan and takes just as long as standalone rapidfuzz.
The fallback to full rapidfuzz scan on zero FTS5 results is essential — without it, these queries would return nothing. But it means the hybrid’s speed guarantee is conditional: it’s fast when FTS5 can narrow the field, and slow when it can’t. In our batch benchmarks with 30% typo rate, ~38% of queries triggered this fallback.
When to Use What: A Decision Framework
After benchmarking, measuring, and combining both tools, here’s my practical decision framework:
**Do you need typo tolerance?**

- **No →** FTS5. It’s faster, persistent, zero-dependency, and handles exact tokens, prefixes, phrases, and boolean queries natively.
- **Yes ↓**

**How large is your dataset?**

- **Under 10K rows →** rapidfuzz. A full scan at this size takes under 15 ms; the simplicity of just passing a list wins.
- **10K–1M rows →** Hybrid. FTS5 trigram for retrieval plus rapidfuzz for re-ranking. Fast and accurate, even in batch.
- **Over 1M rows →** FTS5 + spellfix1, or a dedicated search engine (Meilisearch, Typesense). A rapidfuzz full scan becomes too slow even as a fallback.

**Do you need to run batch queries (10+)?**

- **Yes →** FTS5 or Hybrid. At 500K records, FTS5 handles ~2,800 queries/sec vs rapidfuzz’s ~0.35. The gap only widens with scale.

**Is this a one-off task (deduplication, record linkage)?**

- rapidfuzz. Use `process.cdist()` with `workers=-1` for parallel pairwise comparison. No need for persistent indexes.

**Do you need boolean logic (AND/OR/NOT), phrase matching, or BM25 ranking?**

- FTS5. rapidfuzz has no concept of these; FTS5 has them built in.
Quick Reference
| Use Case | Best Tool | Why |
|---|---|---|
| Search box in a web app | Hybrid | Need speed + typo tolerance |
| Autocomplete / type-ahead | FTS5 (prefix queries) | Sub-ms response needed; users type correct prefixes |
| Deduplicating a CSV | rapidfuzz | One-off pairwise comparison; cdist parallelizes well |
| Product catalog search | Hybrid | Users misspell product names constantly |
| Bulk import matching (100+ queries) | FTS5 or Hybrid | ~2,800 vs ~0.35 queries/sec — rapidfuzz can’t keep up |
| Log search / grep-like queries | FTS5 trigram | Substring matching at index speed |
| Matching addresses across databases | rapidfuzz | Addresses vary wildly in format; need token_set_ratio |
| Full-text search with ranking | FTS5 | BM25 ranking, snippet extraction, column weighting |
Conclusion
FTS5 and rapidfuzz aren’t competitors — they’re complementary tools that solve different slices of the fuzzy search problem. FTS5 is a search engine that excels at indexed, token-based lookups with BM25 ranking. rapidfuzz is a similarity calculator that excels at comparing imperfect strings character by character.
If you only remember four things from this post:
- FTS5 is 10–100x faster at 500K records for any query it can answer (with ranked results). The gap is driven by index lookups (O(log n)) vs linear scans (O(n)), but BM25 ranking on high-cardinality matches adds measurable cost.
- rapidfuzz handles typos that both FTS5 standard and trigram are blind to, because edit distance doesn’t require exact token or trigram matches.
- The hybrid approach (FTS5 trigram for retrieval → rapidfuzz for re-ranking) gives you 1–6 ms typo-tolerant search on a 500K dataset when FTS5 finds candidates. For queries that break trigram overlap (character deletions, heavy misspellings), it falls back to a full scan. Design your system to handle both cases.
- For batch workloads, FTS5 dominates. FTS5 handles ~2,800 queries/second vs rapidfuzz’s ~0.35. If you need to match a CSV of 1,000 records, FTS5 finishes in under 400 ms while rapidfuzz takes nearly 47 minutes.
The right tool depends on what “fuzzy” means for your use case. Now you have the numbers — at real scale — to decide.
Verified Code
All code in this post has been tested and verified. View the runnable code on GitHub →
References & Further Reading
- SQLite — FTS5 Full-Text Search — Official documentation covering tokenizers, query syntax, ranking functions, and all configuration options
- Max Bachmann — rapidfuzz GitHub — The MIT-licensed successor to fuzzywuzzy, featuring C++ backed string similarity algorithms
- rapidfuzz Documentation — Full API reference for fuzz, process, and distance modules
- Andrew Mara — Faster SQLite LIKE Queries Using FTS5 Trigram Indexes — Benchmarks on 18.2 million rows showing 50–125x speedup with trigram indexing
- David Muraya — SQLite FTS5 Trigram Name Matching — 200K name search benchmark demonstrating 28ms query times
- SQLite — The Spellfix1 Virtual Table — Edit-distance extension for SQLite that bridges the gap between FTS5 and true fuzzy matching