Fuzzy Retriever

✦₊⁺ Overview

Fuzzy retrieval finds documents that are similar to a query, not just exact matches. It tolerates typos, abbreviations, transpositions, and partial strings — making it ideal for user-facing search, entity lookup, and deduplication scenarios where exact lexical matching is too strict.

msgFlux ships one fuzzy provider powered by RapidFuzz, a fast C-extension library for approximate string matching.

| Provider | Class | Dependency |
|---|---|---|
| rapidfuzz | RapidFuzzFuzzyRetriever | rapidfuzz |

Install: pip install rapidfuzz


1. Quick Start

Example
import msgflux as mf

retriever = mf.Retriever.fuzzy("rapidfuzz")

retriever.add([
    "Alice Johnson",
    "Bob Smith",
    "Carlos Mendoza",
    "Diana Prince",
])

response = retriever("Allice Jonson", top_k=2, return_score=True)

for result in response.data[0]:
    print(f"[{result.score:.1f}] {result.data}")
# [93.3] Alice Johnson
# [51.4] Carlos Mendoza

2. Parameters

| Parameter | Default | Description |
|---|---|---|
| scorer | "WRatio" | Scoring function (see Scorers) |
| top_k | 5 | Max results returned per query |
| threshold | 0.0 | Minimum similarity score (0–100) to include a result |
| return_score | False | Include the similarity score in each result |

3. Scorers

The scorer controls how similarity is measured between the query and each document:

| Scorer | Best For |
|---|---|
| "WRatio" | General purpose; combines multiple strategies (default) |
| "ratio" | Full-string character similarity |
| "partial_ratio" | Query appears as a substring of the document |
| "token_sort_ratio" | Word-order-insensitive matching |
| "token_set_ratio" | Handles extra or missing words between strings |

import msgflux as mf

# Best for partial name lookup
retriever = mf.Retriever.fuzzy("rapidfuzz", scorer="partial_ratio")
retriever.add(["São Paulo", "Rio de Janeiro", "Belo Horizonte"])

response = retriever("Paulo", top_k=1, return_score=True)
print(response.data[0][0].data)   # São Paulo
print(response.data[0][0].score)  # 100.0
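
To make the differences between scorers more concrete, here is a minimal pure-Python sketch of three of them. It assumes "ratio" is the normalized insert/delete (LCS-based) similarity scaled to 0–100. The helpers `lcs_length`, `ratio`, `token_sort_ratio`, and `partial_ratio` are illustrative re-implementations written for this page, not msgFlux or RapidFuzz code. The `partial_ratio` here is deliberately simplified (a fixed-size sliding window), so real RapidFuzz scores may differ on harder inputs.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ratio(a: str, b: str) -> float:
    """Full-string similarity: 100 * 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


def token_sort_ratio(a: str, b: str) -> float:
    """Sort whitespace-separated tokens first, so word order is ignored."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def partial_ratio(query: str, doc: str) -> float:
    """Simplified: best ratio of the shorter string against any
    same-length window of the longer string."""
    short, long_ = sorted((query, doc), key=len)
    if not short:
        return 0.0
    n = len(short)
    return max(ratio(short, long_[i:i + n]) for i in range(len(long_) - n + 1))


print(round(ratio("pyton", "python"), 1))             # 90.9
print(ratio("Silva João", "João Silva"))              # 50.0
print(token_sort_ratio("Silva João", "João Silva"))   # 100.0
print(partial_ratio("Paulo", "São Paulo"))            # 100.0
```

The swapped-name pair shows why the scorer choice matters: plain character similarity penalizes reordered words heavily, while sorting tokens first makes the two strings identical.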

4. Adding Documents

Call .add() with a list of strings before searching. Documents accumulate across calls, so you can add them in batches:

import msgflux as mf

retriever = mf.Retriever.fuzzy("rapidfuzz")

# First batch
retriever.add([
    "Aspirin 500mg",
    "Ibuprofen 400mg",
])

# Add more later
retriever.add([
    "Paracetamol 750mg",
    "Amoxicillin 250mg",
])

response = retriever("paracetamol 750", top_k=1)
print(response.data[0][0].data)  # Paracetamol 750mg

5. Threshold Filtering

threshold accepts a value between 0 and 100. Results with similarity below the threshold are excluded:

Filtering weak matches
import msgflux as mf

retriever = mf.Retriever.fuzzy("rapidfuzz")
retriever.add([
    "João da Silva",
    "Maria Oliveira",
    "Carlos Souza",
])

# Only return results with >= 70% similarity
response = retriever("Joao Silva", threshold=70.0, return_score=True)

for result in response.data[0]:
    print(f"[{result.score:.1f}] {result.data}")
# [91.4] João da Silva
# (others excluded — score below threshold)
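
Conceptually, threshold and top_k act as two filters applied after scoring. The sketch below illustrates that pipeline with the standard library's `difflib` as a stand-in scorer. It is not msgFlux's implementation: `similarity` and `fuzzy_search` are hypothetical helpers, the lowercasing step is an assumption, and the scores will not match RapidFuzz's "WRatio" values.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_search(query, documents, top_k=5, threshold=0.0):
    """Score every document, drop weak matches, return the best top_k."""
    scored = [(doc, similarity(query, doc)) for doc in documents]
    # Keep only matches at or above the threshold
    kept = [pair for pair in scored if pair[1] >= threshold]
    # Highest score first, then truncate to top_k
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


docs = ["João da Silva", "Maria Oliveira", "Carlos Souza"]

for doc, score in fuzzy_search("Joao Silva", docs, threshold=70.0):
    print(f"[{score:.1f}] {doc}")
```

With a threshold of 70, only "João da Silva" survives in this sketch. Raising the threshold trades recall for precision, and a threshold of 0.0 simply returns the top_k best-scoring documents.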

6. Batch Queries

Pass a list to search multiple queries in a single call. response.data holds one result list per query, in the same order as the input:

import msgflux as mf

retriever = mf.Retriever.fuzzy("rapidfuzz")
retriever.add([
    "Python",
    "JavaScript",
    "TypeScript",
    "Rust",
    "Go",
])

queries = ["pyton", "javasript"]
response = retriever(queries, top_k=1, return_score=True)

for i, query in enumerate(queries):
    result = response.data[i][0]
    print(f"{query!r} → [{result.score:.1f}] {result.data}")
# 'pyton' → [90.9] Python
# 'javasript' → [94.1] JavaScript

7. Async Support

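Use .acall() to run a query from async code. The example below passes the same arguments as the synchronous call:

import asyncio

import msgflux as mf

retriever = mf.Retriever.fuzzy("rapidfuzz")
retriever.add([
    "invoice_2024_01.pdf",
    "invoice_2024_02.pdf",
    "contract_draft_v3.docx",
])


async def main():
    # Await the retriever instead of calling it directly
    response = await retriever.acall("invioce 2024", top_k=2, return_score=True)

    for result in response.data[0]:
        print(f"[{result.score:.1f}] {result.data}")


# Outside a notebook, await needs a running event loop
asyncio.run(main())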

8. When to use Fuzzy vs Lexical

| Scenario | Recommended |
|---|---|
| User typed a query with a typo | Fuzzy |
| Searching acronyms or abbreviations | Fuzzy (partial_ratio) |
| Word-order varies ("Silva João" vs "João Silva") | Fuzzy (token_sort_ratio) |
| Exact keyword matching over large corpora | Lexical (BM25) |
| Relevance ranking with term frequency weighting | Lexical (BM25) |
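
The first row of the table can be illustrated with the standard library alone. In this sketch, an exact membership test stands in for keyword-style matching (it is far simpler than BM25) and `difflib.get_close_matches` stands in for a fuzzy retriever. Neither is what msgFlux runs internally, and the lowercase word list is an assumption made for the example.

```python
from difflib import get_close_matches

languages = ["python", "javascript", "typescript", "rust", "go"]
query = "pyton"  # typo for "python"

# Exact keyword lookup: the misspelled term matches nothing
exact_hit = query in languages

# Fuzzy lookup: the closest string above the cutoff is still found
fuzzy_hits = get_close_matches(query, languages, n=1, cutoff=0.6)

print(exact_hit)   # False
print(fuzzy_hits)  # ['python']
```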