Fuzzy Retriever
✦₊⁺ Overview
Fuzzy retrieval finds documents that are similar to a query, not just exact matches. It tolerates typos, abbreviations, transpositions, and partial strings — making it ideal for user-facing search, entity lookup, and deduplication scenarios where exact lexical matching is too strict.
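As a rough illustration of what "similar, not just exact" means, the sketch below compares a misspelled query against a stored name using Python's standard-library `difflib`. This is a stand-in chosen so the snippet runs without extra dependencies. It is not RapidFuzz, and its scores will not necessarily match RapidFuzz's. The `similarity` helper is a hypothetical name.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (difflib stand-in, not RapidFuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


query = "Allice Jonson"      # contains two typos
candidate = "Alice Johnson"  # the stored document

exact_match = query == candidate          # strict equality fails on any typo
fuzzy_score = similarity(query, candidate)

print(exact_match)
print(round(fuzzy_score, 1))
```

Strict equality rejects the misspelled query, while the similarity score stays above 90 on a 0-100 scale, which is what lets a fuzzy retriever still rank the intended name first.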
msgFlux ships one fuzzy provider, powered by RapidFuzz, a fast approximate string matching library implemented largely in C++.
| Provider | Class | Dependency |
|---|---|---|
| rapidfuzz | RapidFuzzFuzzyRetriever | rapidfuzz |
Install: pip install rapidfuzz
1. Quick Start
Example
import msgflux as mf

retriever = mf.Retriever.fuzzy("rapidfuzz")
retriever.add([
    "Alice Johnson",
    "Bob Smith",
    "Carlos Mendoza",
    "Diana Prince",
])

response = retriever("Allice Jonson", top_k=2, return_score=True)
for result in response.data[0]:
    print(f"[{result.score:.1f}] {result.data}")
# [93.3] Alice Johnson
# [51.4] Carlos Mendoza
2. Parameters
| Parameter | Default | Description |
|---|---|---|
| scorer | "WRatio" | Scoring function (see Scorers) |
| top_k | 5 | Maximum number of results returned per query |
| threshold | 0.0 | Minimum similarity score (0–100) a result needs to be included |
| return_score | False | Include the similarity score in each result |
3. Scorers
The scorer controls how similarity is measured between the query and each document:
| Scorer | Best For |
|---|---|
| "WRatio" | General purpose; combines multiple strategies (default) |
| "ratio" | Full-string character similarity |
| "partial_ratio" | Query appears as a substring of the document |
| "token_sort_ratio" | Word-order-insensitive matching |
| "token_set_ratio" | Handles extra or missing words between strings |
import msgflux as mf
# Best for partial name lookup
retriever = mf.Retriever.fuzzy("rapidfuzz", scorer="partial_ratio")
retriever.add(["São Paulo", "Rio de Janeiro", "Belo Horizonte"])
response = retriever("Paulo", top_k=1, return_score=True)
print(response.data[0][0].data) # São Paulo
print(response.data[0][0].score) # 100.0
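The word-order-insensitive idea behind "token_sort_ratio" can be sketched with the standard library alone: split each string into tokens, sort them, rejoin, and only then compare. The snippet uses `difflib.SequenceMatcher` as a stand-in similarity measure, so the numbers are illustrative and are not RapidFuzz's. `char_ratio` and `token_sort` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def char_ratio(a: str, b: str) -> float:
    """Plain character similarity on a 0-100 scale (difflib stand-in)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_sort(a: str, b: str) -> float:
    """Sort whitespace-separated tokens before comparing, so order is ignored."""
    normalized_a = " ".join(sorted(a.split()))
    normalized_b = " ".join(sorted(b.split()))
    return char_ratio(normalized_a, normalized_b)


plain = char_ratio("Silva João", "João Silva")
order_free = token_sort("Silva João", "João Silva")

print(f"plain comparison:  {plain:.1f}")
print(f"tokens sorted:     {order_free:.1f}")
```

After sorting, both strings normalize to the same token sequence, so the order-free score is a perfect 100 while the plain comparison is penalized for the swapped words.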
4. Adding Documents
Call .add() with a list of strings before searching. Documents accumulate across calls, so you can add them in several batches:
import msgflux as mf
retriever = mf.Retriever.fuzzy("rapidfuzz")
# First batch
retriever.add([
"Aspirin 500mg",
"Ibuprofen 400mg",
])
# Add more later
retriever.add([
"Paracetamol 750mg",
"Amoxicillin 250mg",
])
response = retriever("paracetamol 750", top_k=1)
print(response.data[0][0].data) # Paracetamol 750mg
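The accumulating behaviour described above can be pictured as an index that extends an internal list on every call instead of replacing it. The class below is a hypothetical stand-in written with the standard library only. It says nothing about how msgFlux stores documents internally, and it scores with `difflib` rather than RapidFuzz.

```python
from difflib import SequenceMatcher


class TinyFuzzyIndex:
    """Hypothetical in-memory index; a stand-in, not msgFlux's retriever."""

    def __init__(self) -> None:
        self.documents: list[str] = []

    def add(self, documents: list[str]) -> None:
        # Extend rather than replace, so earlier batches stay searchable.
        self.documents.extend(documents)

    def best_match(self, query: str) -> str:
        # Case-insensitive character similarity via difflib (not RapidFuzz).
        return max(
            self.documents,
            key=lambda doc: SequenceMatcher(None, query.lower(), doc.lower()).ratio(),
        )


index = TinyFuzzyIndex()
index.add(["Aspirin 500mg", "Ibuprofen 400mg"])        # first batch
index.add(["Paracetamol 750mg", "Amoxicillin 250mg"])  # second batch

print(len(index.documents))
print(index.best_match("paracetamol 750"))
```

Both batches remain searchable after the second call, so a query can match a document from either one.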
5. Threshold Filtering
threshold accepts a value between 0 and 100. Results with similarity below the threshold are excluded:
Filtering weak matches
import msgflux as mf
retriever = mf.Retriever.fuzzy("rapidfuzz")
retriever.add([
"João da Silva",
"Maria Oliveira",
"Carlos Souza",
])
# Only return results with a similarity score of at least 70
response = retriever("Joao Silva", threshold=70.0, return_score=True)
for result in response.data[0]:
    print(f"[{result.score:.1f}] {result.data}")
# [91.4] João da Silva
# (others excluded: score below threshold)
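To show how threshold and top_k could interact, here is a minimal standard-library sketch. It assumes the threshold is applied first and the surviving results are then ranked and truncated to top_k; that ordering is an assumption for illustration, not something confirmed for msgFlux. The `score` and `search` helpers are hypothetical, and `difflib` stands in for RapidFuzz, so the printed score differs from the one above.

```python
from difflib import SequenceMatcher


def score(query: str, document: str) -> float:
    """0-100 similarity from difflib; a stand-in for the real scorer."""
    return 100.0 * SequenceMatcher(None, query.lower(), document.lower()).ratio()


def search(query, documents, top_k=5, threshold=0.0):
    """Hypothetical search: filter by threshold, rank, then keep top_k."""
    scored = [(score(query, doc), doc) for doc in documents]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # best score first
    return kept[:top_k]


names = ["João da Silva", "Maria Oliveira", "Carlos Souza"]
hits = search("Joao Silva", names, threshold=70.0)
for value, name in hits:
    print(f"[{value:.1f}] {name}")
```

With the threshold at 70 only the close match survives; lowering it to 0.0 would return every document, ranked by score.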
6. Batch Queries
Pass a list of strings to run several queries in a single call. response.data then holds one result list per query, in the same order as the input:
import msgflux as mf
retriever = mf.Retriever.fuzzy("rapidfuzz")
retriever.add([
"Python",
"JavaScript",
"TypeScript",
"Rust",
"Go",
])
queries = ["pyton", "javasript"]
response = retriever(queries, top_k=1, return_score=True)
for i, query in enumerate(queries):
    result = response.data[i][0]
    print(f"{query!r} → [{result.score:.1f}] {result.data}")
# 'pyton' → [90.9] Python
# 'javasript' → [94.1] JavaScript
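The "one result list per query, in input order" shape can be sketched without any dependencies. The helpers `top_matches` and `batch_search` are hypothetical, and `difflib` is used as a stand-in scorer, so this only illustrates the shape of the output, not msgFlux's implementation.

```python
from difflib import SequenceMatcher


def top_matches(query, documents, top_k=1):
    """Rank documents by difflib similarity (stand-in scorer), best first."""
    ranked = sorted(
        documents,
        key=lambda doc: SequenceMatcher(None, query.lower(), doc.lower()).ratio(),
        reverse=True,
    )
    return ranked[:top_k]


def batch_search(queries, documents, top_k=1):
    """Return one result list per query, preserving the order of `queries`."""
    return [top_matches(query, documents, top_k) for query in queries]


languages = ["Python", "JavaScript", "TypeScript", "Rust", "Go"]
queries = ["pyton", "javasript"]
results = batch_search(queries, languages)

for query, matches in zip(queries, results):
    print(f"{query!r} -> {matches[0]}")
```

Because the outer list follows the query order, `zip(queries, results)` is enough to pair each query with its matches.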
7. Async Support
import asyncio

import msgflux as mf

retriever = mf.Retriever.fuzzy("rapidfuzz")
retriever.add([
    "invoice_2024_01.pdf",
    "invoice_2024_02.pdf",
    "contract_draft_v3.docx",
])

async def main():
    # acall must be awaited inside a coroutine
    response = await retriever.acall("invioce 2024", top_k=2, return_score=True)
    for result in response.data[0]:
        print(f"[{result.score:.1f}] {result.data}")

# asyncio.run drives the coroutine from synchronous code
asyncio.run(main())
8. When to Use Fuzzy vs Lexical
| Scenario | Recommended |
|---|---|
| User typed a query with a typo | Fuzzy |
| Searching acronyms or abbreviations | Fuzzy (partial_ratio) |
| Word-order varies ("Silva João" vs "João Silva") | Fuzzy (token_sort_ratio) |
| Exact keyword matching over large corpora | Lexical (BM25) |
| Relevance ranking with term frequency weighting | Lexical (BM25) |
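The first row of the table can be demonstrated with a small contrast between exact lookup and approximate lookup. The sketch below uses only the standard library: a dictionary lookup stands in for exact matching (it is not BM25), and `difflib.get_close_matches` stands in for a fuzzy scorer (it is not RapidFuzz). `exact_lookup` and `fuzzy_lookup` are hypothetical helper names, and the 0.6 cutoff is an arbitrary illustrative value.

```python
from difflib import get_close_matches
from typing import Optional

documents = ["Python", "JavaScript", "TypeScript", "Rust", "Go"]
by_key = {doc.lower(): doc for doc in documents}  # lowercase key -> original


def exact_lookup(query: str) -> Optional[str]:
    """Exact, case-insensitive match only; any typo means no result."""
    return by_key.get(query.lower())


def fuzzy_lookup(query: str, cutoff: float = 0.6) -> Optional[str]:
    """Closest key above the cutoff, using difflib as a stand-in scorer."""
    close = get_close_matches(query.lower(), list(by_key), n=1, cutoff=cutoff)
    return by_key[close[0]] if close else None


print(exact_lookup("rust"))    # exact match succeeds
print(exact_lookup("pyton"))   # typo: exact lookup finds nothing
print(fuzzy_lookup("pyton"))   # typo: approximate lookup recovers the entry
```

Exact lookup is cheap and precise when the query is spelled correctly, which is why it suits keyword search over large corpora; the approximate lookup trades some of that precision for tolerance to typos.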