# Blaze

A high-performance, full-text search engine in Go featuring an inverted index with skip lists, roaring bitmaps, BM25 ranking, and a type-safe query builder.
## Overview

Blaze is a Go search engine that provides fast full-text search through an inverted index. It's designed for applications that need to search text documents efficiently without relying on external search engines.
- **Inverted Index**: maps terms to document positions for instant lookups
- **Skip Lists**: probabilistic data structure providing O(log n) operations
- **Query Builder**: type-safe, fluent API for boolean queries backed by roaring bitmaps
- **BM25 Ranking**: industry-standard relevance scoring with IDF and length normalization
- **Text Analysis**: tokenization, stemming, stopword filtering, case normalization
- **Hybrid Storage**: roaring bitmaps for speed, skip lists for precision
## Installation

```shell
go get github.com/wizenheimer/blaze
```
## Quick Start

```go
package main

import (
	"fmt"

	"github.com/wizenheimer/blaze"
)

func main() {
	// Create a new inverted index
	idx := blaze.NewInvertedIndex()

	// Index some documents
	idx.Index(1, "The quick brown fox jumps over the lazy dog")
	idx.Index(2, "A quick brown dog runs fast")
	idx.Index(3, "The lazy cat sleeps all day")

	// Search with BM25 ranking
	matches := idx.RankBM25("quick brown", 10)

	// Print results
	for _, match := range matches {
		fmt.Printf("Document %d (score: %.2f)\n", match.DocID, match.Score)
	}
}
```
## Query Builder API

The Query Builder provides a type-safe, fluent API for constructing complex boolean queries over roaring bitmaps. No string parsing, no syntax errors: just clean, composable code.
### Basic Queries

#### Single Term

```go
// Find all documents containing "machine"
results := blaze.NewQueryBuilder(idx).
	Term("machine").
	Execute()
```

#### AND Query

```go
// Find documents with BOTH "machine" AND "learning"
results := blaze.NewQueryBuilder(idx).
	Term("machine").
	And().
	Term("learning").
	Execute()
```

#### OR Query

```go
// Find documents with "python" OR "javascript"
results := blaze.NewQueryBuilder(idx).
	Term("python").
	Or().
	Term("javascript").
	Execute()
```

#### NOT Query

```go
// Find documents with "python" but NOT "snake"
results := blaze.NewQueryBuilder(idx).
	Term("python").
	And().Not().
	Term("snake").
	Execute()
```

### Complex Grouped Query

```go
// (machine OR deep) AND learning
results := blaze.NewQueryBuilder(idx).
	Group(func(q *blaze.QueryBuilder) {
		q.Term("machine").Or().Term("deep")
	}).
	And().
	Term("learning").
	Execute()
```

### BM25 Ranked Results

```go
// Get top 10 results ranked by relevance
matches := blaze.NewQueryBuilder(idx).
	Term("machine").
	And().
	Term("learning").
	ExecuteWithBM25(10)

for _, match := range matches {
	fmt.Printf("Doc %d: score=%.2f\n", match.DocID, match.Score)
}
```

### Convenience Functions

```go
// AllOf: documents containing ALL terms (AND)
results := blaze.AllOf(idx, "machine", "learning", "python")

// AnyOf: documents containing ANY term (OR)
results := blaze.AnyOf(idx, "cat", "dog", "bird")

// TermExcluding: term with exclusion (AND NOT)
results := blaze.TermExcluding(idx, "python", "snake")
```
## Architecture

### Query Processor Architecture

The query processor uses a hybrid storage approach: roaring bitmaps for document-level operations and skip lists for position-level operations.
```
User Query: "machine AND learning"
                 │
                 ▼
  ┌─────────────────────────────┐
  │        Text Analyzer        │
  │   (tokenize, stem, etc.)    │
  └──────────────┬──────────────┘
                 │ ["machine", "learning"]
                 ▼
  ┌─────────────────────────────┐
  │        Query Builder        │
  │   (constructs query tree)   │
  └──────────────┬──────────────┘
                 │ Query Tree: AND(machine, learning)
      ┌──────────┼──────────────┐
      ▼          ▼              ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Bitmap Ops  │ │  Skip List  │ │ BM25 Scorer │
│(fast AND/OR)│ │ (positions) │ │  (ranking)  │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
       └───────────────┼───────────────┘
                       ▼
               ┌───────────────┐
               │    Results    │
               │ (ranked docs) │
               └───────────────┘
```
### Hybrid Storage Model

Blaze uses a hybrid storage model that combines the speed of roaring bitmaps with the precision of skip lists.

```
For each term "machine":

DOCUMENT LEVEL (Roaring Bitmap)
───────────────────────────────
DocBitmaps["machine"] = {1, 2, 4, 5, 100, 500, 1000, ...}

Compressed representation of ALL documents containing "machine"
Use:  fast boolean operations (AND, OR, NOT)
Size: ~60 KB for 500k documents (400x compression!)
        │
        │ links to
        ▼
POSITION LEVEL (Skip List)
──────────────────────────
PostingsList["machine"] = SkipList:

Level 2: [Doc1:Pos5] ─────────────────────────────────> [Doc100:Pos12]
Level 1: [Doc1:Pos5] ──> [Doc2:Pos3] ─────────────────> [Doc100:Pos12]
Level 0: [Doc1:Pos5] ──> [Doc2:Pos3] ──> [Doc4:Pos1] ──> [Doc5:Pos7]
         ──> [Doc100:Pos12] ──> [Doc500:Pos2] ──> ...

Detailed position information for EVERY occurrence
Use:  phrase search, proximity ranking, snippets
Size: ~24 MB for 500k positions
```

Why hybrid?

1. Bitmaps: lightning-fast document filtering (AND, OR, NOT in microseconds)
2. Skip lists: precise position tracking for phrases and proximity
3. Best of both worlds: speed + precision
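The two levels can be sketched with plain Go maps standing in for the real structures. This is a toy model only: `termEntry`, `add`, and the field names are illustrative, not Blaze's API, and a map is a stand-in for both the roaring bitmap and the skip list.

```go
package main

import "fmt"

// termEntry is a toy model of per-term hybrid storage: a
// document-level set for fast boolean filtering, plus a
// position-level map for phrase/proximity checks.
type termEntry struct {
	Docs      map[uint32]struct{} // document level (bitmap stand-in)
	Positions map[uint32][]int    // position level (skip list stand-in)
}

func newTermEntry() *termEntry {
	return &termEntry{
		Docs:      make(map[uint32]struct{}),
		Positions: make(map[uint32][]int),
	}
}

// add records one occurrence of the term at (doc, pos),
// updating both levels in lockstep.
func (t *termEntry) add(doc uint32, pos int) {
	t.Docs[doc] = struct{}{}
	t.Positions[doc] = append(t.Positions[doc], pos)
}

func main() {
	machine := newTermEntry()
	machine.add(1, 5)
	machine.add(2, 3)
	machine.add(100, 12)

	// Document-level question: does doc 2 contain the term?
	_, inDoc2 := machine.Docs[2]
	fmt.Println("doc 2 contains term:", inDoc2)

	// Position-level question: where does it occur in doc 100?
	fmt.Println("positions in doc 100:", machine.Positions[100])
}
```

Boolean queries only ever touch `Docs`; phrase and proximity checks descend into `Positions` for the surviving candidates, which is the point of the split.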
### Query Execution Flow

Here's how a complex query executes step by step, showcasing the power of the hybrid storage approach.
```
QUERY: (machine OR deep) AND learning AND NOT neural

STEP 1: BITMAP PHASE (fast document filtering)
──────────────────────────────────────────────
Term lookups (O(1) hash map):
  DocBitmaps["machine"]  = {1, 2, 4, 5, 7, 8, 9, 10}
  DocBitmaps["deep"]     = {2, 3, 5, 6, 8, 9}
  DocBitmaps["learning"] = {1, 2, 4, 5, 6, 7, 8, 9, 10}
  DocBitmaps["neural"]   = {3, 6, 8, 9}

Boolean operations (O(1) per chunk):
  Step 1: machine OR deep
    {1, 2, 4, 5, 7, 8, 9, 10} ∪ {2, 3, 5, 6, 8, 9}
      = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
  Step 2: (machine OR deep) AND learning
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} ∩ {1, 2, 4, 5, 6, 7, 8, 9, 10}
      = {1, 2, 4, 5, 6, 7, 8, 9, 10}
  Step 3: result AND NOT neural
    {1, 2, 4, 5, 6, 7, 8, 9, 10} \ {3, 6, 8, 9}
      = {1, 2, 4, 5, 7, 10}   ← CANDIDATE DOCUMENTS

Time: ~10 microseconds for 1M documents!

STEP 2: POSITION PHASE (optional: for phrases/proximity)
────────────────────────────────────────────────────────
If phrase search is needed, for each candidate doc {1, 2, 4, 5, 7, 10}:
  - use skip lists to verify exact positions
  - check consecutive positions for phrases
  - extract position data for snippets
Time: O(log n) per position lookup

STEP 3: RANKING PHASE (BM25 scoring)
────────────────────────────────────
For each candidate document:
  1. Calculate IDF (inverse document frequency):
     - uses bitmap cardinality for instant document counts
     - IDF("machine") = log((N - df + 0.5) / (df + 0.5))
     - df = DocBitmaps["machine"].GetCardinality()
  2. Calculate TF (term frequency):
     - retrieved from pre-computed DocStats
     - TF("machine", Doc1) = termFreqs["machine"]
  3. Apply the BM25 formula:
     - combines IDF, TF, and length normalization
     - score = IDF × (TF × (k1 + 1)) / (TF + k1 × length_norm)
  4. Sum scores over all query terms

Results sorted by score:
  Doc 5: 8.45
  Doc 2: 7.23
  Doc 1: 6.91
  ...
Time: O(candidates × terms)
```
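The bitmap phase is ordinary set algebra. A minimal sketch using plain Go map-based sets (standing in for roaring bitmaps; `docSet`, `union`, `intersect`, and `difference` are illustrative helpers, not Blaze's API) reproduces the candidate set from the walkthrough:

```go
package main

import (
	"fmt"
	"sort"
)

type docSet map[int]struct{}

func set(ids ...int) docSet {
	s := make(docSet)
	for _, id := range ids {
		s[id] = struct{}{}
	}
	return s
}

// union implements OR.
func union(a, b docSet) docSet {
	out := make(docSet)
	for id := range a {
		out[id] = struct{}{}
	}
	for id := range b {
		out[id] = struct{}{}
	}
	return out
}

// intersect implements AND.
func intersect(a, b docSet) docSet {
	out := make(docSet)
	for id := range a {
		if _, ok := b[id]; ok {
			out[id] = struct{}{}
		}
	}
	return out
}

// difference implements AND NOT.
func difference(a, b docSet) docSet {
	out := make(docSet)
	for id := range a {
		if _, ok := b[id]; !ok {
			out[id] = struct{}{}
		}
	}
	return out
}

func sorted(s docSet) []int {
	ids := make([]int, 0, len(s))
	for id := range s {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids
}

func main() {
	machine := set(1, 2, 4, 5, 7, 8, 9, 10)
	deep := set(2, 3, 5, 6, 8, 9)
	learning := set(1, 2, 4, 5, 6, 7, 8, 9, 10)
	neural := set(3, 6, 8, 9)

	// (machine OR deep) AND learning AND NOT neural
	candidates := difference(intersect(union(machine, deep), learning), neural)
	fmt.Println(sorted(candidates)) // [1 2 4 5 7 10]
}
```

Roaring bitmaps perform the same algebra, but chunk-by-chunk over compressed containers rather than element-by-element.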
### Roaring Bitmap Internals

Roaring bitmaps provide exceptional compression and performance through adaptive container selection.
```
ROARING BITMAP STRUCTURE

Document IDs: {1, 2, 3, 100, 101, 102, 500000, 500001, 999999}

Traditional bitmap (naive):
  [1,1,1,0,0...0,1,1,1,0...0,1,1,0...0,1]
  Size: 1,000,000 bits = 125 KB (wasteful for sparse data)

Roaring bitmap (smart):
  Split into 65,536-entry chunks (high 16 bits = chunk ID):
    Chunk 0  (docs 0-65535):        [1, 2, 3, 100, 101, 102]
    Chunk 7  (docs 458752-524287):  [500000, 500001]
    Chunk 15 (docs 983040-1048575): [999999]

  Storage per chunk (adaptive):
    If cardinality < 4096:
      → Array container: sorted uint16 values, 2 bytes × cardinality
    If cardinality > 4096:
      → Bitmap container: 65,536-bit bitmap, 8 KB fixed
    If cardinality = 65536 (all docs):
      → Run container: store [0-65535] as a run, 4 bytes

  Total size: ~60 bytes (vs 125 KB!)

Operations:
  AND: container-by-container intersection
       skip non-matching chunks (O(1)), intersect matching chunks (O(min(n,m)))
  OR:  container-by-container union, merge all chunks (O(n+m))
  NOT: complement within the document space, flip bits in each chunk
```
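The adaptive container choice boils down to a size estimate per chunk. A sketch of that decision rule, using the thresholds stated above (`containerFor` and `containerBytes` are illustrative helpers, not the roaring library's API):

```go
package main

import "fmt"

// containerFor picks a container type for one 65,536-entry chunk
// based on its cardinality, mirroring the rules above.
func containerFor(cardinality int) string {
	switch {
	case cardinality == 65536:
		return "run" // a single run [0-65535]
	case cardinality < 4096:
		return "array" // sorted uint16 values
	default:
		return "bitmap" // fixed 65,536-bit bitmap
	}
}

// containerBytes estimates the storage cost of that choice.
func containerBytes(cardinality int) int {
	switch containerFor(cardinality) {
	case "run":
		return 4 // one [start, length] run
	case "array":
		return 2 * cardinality // 2 bytes per value
	default:
		return 8192 // 65,536 bits
	}
}

func main() {
	for _, c := range []int{6, 5000, 65536} {
		fmt.Printf("cardinality %5d → %-6s (%d bytes)\n",
			c, containerFor(c), containerBytes(c))
	}
}
```

A sparse chunk of 6 documents costs 12 bytes instead of 8 KB, which is where the document-level compression ratio comes from.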
### Memory Layout

Understanding the memory layout helps optimize for large-scale deployments.
```
INVERTED INDEX STRUCTURE

InvertedIndex {
  DocBitmaps: map[string]*roaring.Bitmap
    "machine"  → [compressed bitmap: 512 bytes]
    "learning" → [compressed bitmap: 448 bytes]
    "deep"     → [compressed bitmap: 256 bytes]
    ...
    Memory: ~100 bytes per term (compressed)

  PostingsList: map[string]SkipList            // parallel storage
    "machine"  → skip list with 10,000 position nodes
    "learning" → skip list with 8,000 position nodes
    "deep"     → skip list with 5,000 position nodes
    ...
    Memory: ~48 bytes per position (node overhead)

  DocStats: map[int]DocumentStats              // statistics
    Doc1 → {Length: 150, TermFreqs: {"machine": 3, ...}}
    Doc2 → {Length: 200, TermFreqs: {"learning": 5, ...}}
    ...
    Memory: ~16 bytes per term per document

  Mutex for thread safety (sync.RWMutex)
}

MEMORY BREAKDOWN (1M documents, 10M unique positions):
──────────────────────────────────────────────────────
DocBitmaps:   ~10 MB   (compressed bitmaps)
PostingsList: ~480 MB  (skip list nodes)
DocStats:     ~500 MB  (per-doc statistics)
Overhead:     ~10 MB   (maps, pointers, etc.)
──────────────────────────────────────────────────────
TOTAL:        ~1 GB
```
## Performance

### Benchmarks

| Operation               | Ops/sec | Time/op   | Allocs/op |
|-------------------------|---------|-----------|-----------|
| Query Builder (Simple)  | 440,616 | 2,511 ns  | 39        |
| Query Builder (Complex) | 222,024 | 5,333 ns  | 98        |
| Query Builder (BM25)    | 411,124 | 2,955 ns  | 46        |
| Term Search             | 300,000 | 4,123 ns  | 3         |
| Phrase Search           | 100,000 | 12,456 ns | 12        |
### Operation Costs

| Operation       | Complexity     | Why It's Fast               |
|-----------------|----------------|-----------------------------|
| Bitmap AND      | O(1) per chunk | Roaring bitmap intersection |
| Bitmap OR       | O(1) per chunk | Roaring bitmap union        |
| Bitmap NOT      | O(1) per chunk | Roaring bitmap difference   |
| Term Lookup     | O(1)           | Direct hash map access      |
| Position Lookup | O(log n)       | Skip list binary search     |
## Search Operations

### Phrase Search

Find exact sequences of words. Blaze uses skip lists to efficiently verify consecutive positions.
```
SEARCHING FOR PHRASE: "brown fox"

Document:  "the quick brown dog jumped over the brown fox"
Positions:   0    1     2    3     4     5    6    7    8

Phase 1: Find END (last word "fox")
  Find "brown"          → Doc:Pos2
  Find "fox" after Pos2 → Doc:Pos8   ← END position

Phase 2: Walk BACKWARDS from END to find START
  From Pos9, find previous "brown" → Doc:Pos7   ← START

Phase 3: Validate
  Start: Pos7, End: Pos8
  Distance: 8 - 7 = 1
  Expected: 2 words - 1 = 1   → MATCH!

  "brown" "fox"
     ▲      ▲
   Pos7   Pos8   (consecutive positions)
```
```go
// Find exact phrase "quick brown fox"
matches := idx.FindAllPhrases("quick brown fox", blaze.BOFDocument)

for _, match := range matches {
	start, end := match[0], match[1]
	fmt.Printf("Found in Doc %d from Pos %d to %d\n",
		int(start.DocumentID), int(start.Offset), int(end.Offset))
}
```
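At its core, the validation phase is a consecutive-position check. A self-contained sketch over plain position lists (not Blaze's skip-list-backed implementation; `positionsOf` and `findPhrase` are illustrative names) shows the idea:

```go
package main

import (
	"fmt"
	"strings"
)

// positionsOf builds a word → positions map for one document.
func positionsOf(doc string) map[string][]int {
	pos := make(map[string][]int)
	for i, w := range strings.Fields(doc) {
		pos[w] = append(pos[w], i)
	}
	return pos
}

// findPhrase returns [start, end] position pairs where the words
// of the phrase occur at strictly consecutive positions.
func findPhrase(doc string, phrase ...string) [][2]int {
	pos := positionsOf(doc)
	var matches [][2]int
	for _, start := range pos[phrase[0]] {
		ok := true
		for i := 1; i < len(phrase); i++ {
			found := false
			for _, p := range pos[phrase[i]] {
				if p == start+i { // exactly one past the previous word
					found = true
					break
				}
			}
			if !found {
				ok = false
				break
			}
		}
		if ok {
			matches = append(matches, [2]int{start, start + len(phrase) - 1})
		}
	}
	return matches
}

func main() {
	doc := "the quick brown dog jumped over the brown fox"
	fmt.Println(findPhrase(doc, "brown", "fox")) // [[7 8]]
}
```

Note the "brown" at Pos2 is rejected because no "fox" follows it at Pos3; only the Pos7 occurrence validates, matching the walkthrough above.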
### Proximity Search

Find documents containing all terms (not necessarily consecutive), and score each document by how close together the terms appear.
```
SEARCHING FOR: ["quick", "fox"]  (any order, not necessarily consecutive)

Document:  "the quick brown dog jumped over the lazy fox"
Positions:   0    1     2    3     4     5    6   7    8

Phase 1: Find COVER END (furthest term)
  Find "quick" after BOF → Doc:Pos1
  Find "fox" after BOF   → Doc:Pos8   ← FURTHEST (cover end)

Phase 2: Find COVER START (earliest term before end)
  Find "quick" before Pos9 → Doc:Pos1   ← EARLIEST (cover start)
  Find "fox" before Pos9   → Doc:Pos8

Phase 3: Calculate proximity score
  Cover: [Pos1, Pos8]
  Same document?      Yes
  All terms present?  Yes

  "quick" ... ... ... ... ... ... ... "fox"
     ▲                                  ▲
   Pos1                               Pos8
     └────────── cover range ───────────┘

  Proximity score: 1 / (8 - 1 + 1) = 1/8 = 0.125

PROXIMITY SCORING EXAMPLES:
───────────────────────────
Doc 1: "machine learning is machine learning"
  Pos:      0        1     2     3        4
  Cover 1: [Pos 0-1] → score += 1/(1-0+1) = 1/2 = 0.500
  Cover 2: [Pos 3-4] → score += 1/(4-3+1) = 1/2 = 0.500
  Total score: 1.000

Doc 2: "learning about machine and learning"
  Pos:       0      1      2      3     4
  Cover 1: [Pos 0-2] → score += 1/(2-0+1) = 1/3 = 0.333
  Cover 2: [Pos 2-4] → score += 1/(4-2+1) = 1/3 = 0.333
  Total score: 0.666

Doc 3: "machine ... ... ... ... learning"
  Pos:      0    1   2   3   4      5
  Cover 1: [Pos 0-5] → score += 1/(5-0+1) = 1/6 = 0.167
  Total score: 0.167

Ranking: Doc 1 (1.000) > Doc 2 (0.666) > Doc 3 (0.167)
         terms closest   terms medium    terms far apart
```
```go
// Rank documents by proximity
matches := idx.RankProximity("quick brown", 5)

for _, match := range matches {
	fmt.Printf("Doc %d: Score %.2f\n",
		int(match.Offsets[0].DocumentID),
		match.Score)
}
```
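The cover scores in the examples above can be reproduced with a few lines of arithmetic. This sketches only the scoring rule (each cover `[start, end]` contributes `1 / (end - start + 1)`), not Blaze's cover-finding algorithm; `coverScore` and `proximityScore` are illustrative names:

```go
package main

import "fmt"

// coverScore implements the rule from the walkthrough: a cover
// spanning [start, end] contributes 1 / (end - start + 1).
func coverScore(start, end int) float64 {
	return 1.0 / float64(end-start+1)
}

// proximityScore sums the contributions of all covers in a document.
func proximityScore(covers [][2]int) float64 {
	total := 0.0
	for _, c := range covers {
		total += coverScore(c[0], c[1])
	}
	return total
}

func main() {
	// Doc 1: "machine learning is machine learning" → covers [0,1], [3,4]
	fmt.Printf("Doc 1: %.3f\n", proximityScore([][2]int{{0, 1}, {3, 4}}))
	// Doc 2: "learning about machine and learning"  → covers [0,2], [2,4]
	fmt.Printf("Doc 2: %.3f\n", proximityScore([][2]int{{0, 2}, {2, 4}}))
	// Doc 3: widely separated terms                 → cover  [0,5]
	fmt.Printf("Doc 3: %.3f\n", proximityScore([][2]int{{0, 5}}))
}
```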
## Skip Lists Explained

Skip lists are the core data structure for position tracking. They provide O(log n) operations while being simpler than balanced trees.
```
SKIP LIST STRUCTURE: multiple levels act like express lanes

Level 3: HEAD ────────────────────────────────> [30] ──────> NULL
Level 2: HEAD ─────────────> [15] ────────────> [30] ──────> NULL
Level 1: HEAD ─────> [10] ──> [15] ──> [20] ──> [30] ──────> NULL
Level 0: HEAD ─> [5] ─> [10] ─> [15] ─> [20] ─> [25] ─> [30] ─> [35] ─> NULL
         (ALL nodes appear at level 0)

Each node has a "tower" of forward pointers.
Example: node [15]
  Lvl 2 ──> [30]   (skip ahead)
  Lvl 1 ──> [20]   (skip a little)
  Lvl 0 ──> [20]   (next node)

SEARCHING FOR KEY = 20:
───────────────────────
Level 3: HEAD ──────────────────> [30]   (30 > 20, drop down)
Level 2: HEAD ──────> [15]               (15 < 20, advance)
Level 2: [15] ──────> [30]               (30 > 20, drop down)
Level 1: [15] ──────> [20]               (20 = 20, FOUND!)

Journey recorded (used for insertions/deletions):
  Level 3: HEAD
  Level 2: [15]
  Level 1: [15]
  Level 0: [15]

Time complexity: O(log n) on average

HEIGHT DISTRIBUTION (probabilistic), for 1000 nodes:
  Level 0: ~1000 nodes (all)  ████████████████████████████████████████
  Level 1:  ~500 nodes        ████████████████████
  Level 2:  ~250 nodes        ██████████
  Level 3:  ~125 nodes        █████
  Level 4:   ~62 nodes        ██
```
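A compact skip list with probabilistic tower heights and the top-down search described above might look like this. It's a generic toy over ints, not Blaze's position-keyed implementation, and the fixed RNG seed is just for reproducibility:

```go
package main

import (
	"fmt"
	"math/rand"
)

const maxLevel = 8

type node struct {
	key  int
	next [maxLevel]*node // the "tower" of forward pointers
}

type skipList struct {
	head  *node
	level int // highest level currently in use
	rng   *rand.Rand
}

func newSkipList() *skipList {
	return &skipList{head: &node{}, level: 1, rng: rand.New(rand.NewSource(1))}
}

// randomLevel flips a fair coin: each extra level is half as likely
// as the one below, giving the express-lane height distribution.
func (s *skipList) randomLevel() int {
	lvl := 1
	for lvl < maxLevel && s.rng.Intn(2) == 0 {
		lvl++
	}
	return lvl
}

func (s *skipList) insert(key int) {
	var journey [maxLevel]*node // rightmost node visited per level
	n := s.head
	for lvl := s.level - 1; lvl >= 0; lvl-- {
		for n.next[lvl] != nil && n.next[lvl].key < key {
			n = n.next[lvl] // advance along this level
		}
		journey[lvl] = n // drop down, recording the turn
	}
	lvl := s.randomLevel()
	if lvl > s.level {
		for i := s.level; i < lvl; i++ {
			journey[i] = s.head
		}
		s.level = lvl
	}
	nn := &node{key: key}
	for i := 0; i < lvl; i++ { // splice into each level of the tower
		nn.next[i] = journey[i].next[i]
		journey[i].next[i] = nn
	}
}

// contains runs the top-down search: advance while the next key is
// smaller, drop a level when it would overshoot.
func (s *skipList) contains(key int) bool {
	n := s.head
	for lvl := s.level - 1; lvl >= 0; lvl-- {
		for n.next[lvl] != nil && n.next[lvl].key < key {
			n = n.next[lvl]
		}
	}
	n = n.next[0]
	return n != nil && n.key == key
}

func main() {
	sl := newSkipList()
	for _, k := range []int{5, 10, 15, 20, 25, 30, 35} {
		sl.insert(k)
	}
	fmt.Println(sl.contains(20), sl.contains(22)) // true false
}
```

The `journey` array is exactly the "journey recorded" table above: it remembers where the search turned downward at each level, which is what insertion and deletion splice against.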
## Text Analysis Pipeline

Blaze transforms raw text into searchable tokens through a multi-stage pipeline.
```
TEXT ANALYSIS PIPELINE

1. Tokenization        split on non-alphanumeric characters
2. Lowercasing         normalize case ("Quick" → "quick")
3. Stopword Filtering  remove common words (the, a, is)
4. Length Filtering    remove tokens < 2 chars
5. Stemming            Snowball/Porter2; reduce to root ("running" → "run")

EXAMPLE TRANSFORMATION:
───────────────────────
Input: "The Quick Brown Fox Jumps!"
  Step 1: Tokenization       → ["The", "Quick", "Brown", "Fox", "Jumps"]
  Step 2: Lowercasing        → ["the", "quick", "brown", "fox", "jumps"]
  Step 3: Stopword Filtering → ["quick", "brown", "fox", "jumps"]   (remove "the")
  Step 4: Length Filtering   → ["quick", "brown", "fox", "jumps"]   (all pass ≥ 2 chars)
  Step 5: Stemming           → ["quick", "brown", "fox", "jump"]    ("jumps" → "jump")
```
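The five stages can be sketched end to end in a few lines. This toy version substitutes a naive plural-stripping step for Blaze's Snowball stemmer and uses a three-word stopword set; `analyze` is an illustrative name, not Blaze's API:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

var stopwords = map[string]bool{"the": true, "a": true, "is": true}

// analyze runs the five pipeline stages on raw text.
func analyze(text string) []string {
	// 1. Tokenization: split on non-alphanumeric characters.
	tokens := strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})

	var out []string
	for _, tok := range tokens {
		// 2. Lowercasing
		tok = strings.ToLower(tok)
		// 3. Stopword filtering
		if stopwords[tok] {
			continue
		}
		// 4. Length filtering: drop tokens shorter than 2 chars
		if len(tok) < 2 {
			continue
		}
		// 5. "Stemming": naive plural strip, standing in for Porter2
		if len(tok) > 3 && strings.HasSuffix(tok, "s") {
			tok = strings.TrimSuffix(tok, "s")
		}
		out = append(out, tok)
	}
	return out
}

func main() {
	fmt.Println(analyze("The Quick Brown Fox Jumps!"))
	// → [quick brown fox jump]
}
```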
## BM25 Algorithm

BM25 (Best Matching 25) is the industry-standard ranking function used by Elasticsearch, Solr, and Lucene. It considers term frequency, document length, and term rarity.
```
BM25 FORMULA:
─────────────
                       IDF(q_i) × TF(q_i, D) × (k1 + 1)
BM25(D, Q) =   Σ    ─────────────────────────────────────────
             q_i∈Q    TF(q_i, D) + k1 × (1 - b + b × |D|/avgdl)

Where:
  D     = document being scored
  Q     = query (set of terms q_1, q_2, ..., q_n)
  q_i   = individual query term
  IDF   = inverse document frequency (term rarity)
  TF    = term frequency (occurrences in the document)
  k1    = 1.5  (term frequency saturation parameter)
  b     = 0.75 (length normalization parameter)
  |D|   = document length
  avgdl = average document length

WHAT BM25 CONSIDERS:
────────────────────
1. Term frequency:  more occurrences = higher relevance
2. TF saturation:   diminishing returns; 3→10 occurrences matters
                    less than 0→3
3. Document length: normalized by document size, so long docs
                    don't dominate results
4. Term rarity:     rare terms are more important
                    ("quantum" outweighs "the")

TERM FREQUENCY SATURATION (k1 = 1.5, b = 0.75):
  The score contribution rises steeply for the first few
  occurrences, then flattens: going from 0→3 occurrences adds
  more to the score than going from 7→10 (diminishing returns).
```
```go
// Search and rank using BM25
results := idx.RankBM25("machine learning", 10)

for _, match := range results {
	fmt.Printf("Doc %d: Score %.2f\n",
		match.DocID,
		match.Score)
}
```
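The formula and its saturation behaviour can be checked numerically. This is a sketch of the single-term score with the parameters stated above (k1 = 1.5, b = 0.75), not Blaze's internal scorer; the corpus numbers in `main` are made up for illustration:

```go
package main

import (
	"fmt"
	"math"
)

const (
	k1 = 1.5  // term frequency saturation
	b  = 0.75 // length normalization strength
)

// idf follows the formula used above: log((N - df + 0.5) / (df + 0.5)),
// where N is the corpus size and df the term's document frequency.
func idf(n, df float64) float64 {
	return math.Log((n - df + 0.5) / (df + 0.5))
}

// bm25Term scores a single query term against a single document.
func bm25Term(tf, docLen, avgDocLen, n, df float64) float64 {
	norm := 1 - b + b*docLen/avgDocLen
	return idf(n, df) * (tf * (k1 + 1)) / (tf + k1*norm)
}

func main() {
	// 1000 docs, term appears in 10 of them, average length 100 words.
	n, df, avg := 1000.0, 10.0, 100.0

	// Saturation: early occurrences contribute more than later ones.
	gainEarly := bm25Term(3, 100, avg, n, df) - bm25Term(0, 100, avg, n, df)
	gainLate := bm25Term(10, 100, avg, n, df) - bm25Term(7, 100, avg, n, df)
	fmt.Printf("0→3 adds %.3f, 7→10 adds %.3f\n", gainEarly, gainLate)

	// Length normalization: the same TF scores lower in a longer doc.
	fmt.Printf("len 100: %.3f, len 400: %.3f\n",
		bm25Term(3, 100, avg, n, df), bm25Term(3, 400, avg, n, df))
}
```

Running it shows both effects from the diagram: the 0→3 gain dwarfs the 7→10 gain, and a four-times-longer document scores noticeably lower for the same term frequency.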
## Real-World Examples

### E-commerce Search

```go
func SearchProducts(idx *blaze.InvertedIndex, query string,
	category string, excludeOutOfStock bool) []blaze.Match {

	qb := blaze.NewQueryBuilder(idx).Term(query)

	// Add category filter
	if category != "" {
		qb.And().Term(category)
	}

	// Exclude out-of-stock items
	if excludeOutOfStock {
		qb.And().Not().Term("outofstock")
	}

	return qb.ExecuteWithBM25(20)
}
```
### Multi-Category Search

```go
func SearchInCategories(idx *blaze.InvertedIndex,
	query string, categories []string) []blaze.Match {

	qb := blaze.NewQueryBuilder(idx).Term(query)

	if len(categories) > 0 {
		qb.And().Group(func(q *blaze.QueryBuilder) {
			q.Term(categories[0])
			for i := 1; i < len(categories); i++ {
				q.Or().Term(categories[i])
			}
		})
	}

	return qb.ExecuteWithBM25(50)
}
```
Advanced Boolean Search
// Find documents about (machine learning OR deep learning)
// AND (python OR tensorflow)
results := blaze.NewQueryBuilder(idx).
Group(func(q *blaze.QueryBuilder) {
q.Phrase("machine learning").Or().Phrase("deep learning")
}).
And().
Group(func(q *blaze.QueryBuilder) {
q.Term("python").Or().Term("tensorflow")
}).
ExecuteWithBM25(20)
## Use Cases

- **Document Search Systems**: build full-text search for documentation, articles, or knowledge bases
- **Log Analysis**: search through application logs with boolean queries and ranking
- **Code Search**: index and search source code repositories
- **E-commerce**: product catalog search with filters and facets
- **Email Search**: fast, local email search without external services
- **Content Management**: search articles, posts, and media metadata
## Key Features

### Search Capabilities

- **Term Search**: find documents containing specific terms
- **Phrase Search**: exact multi-word matching
- **Boolean Queries**: type-safe AND, OR, NOT operations
- **BM25 Ranking**: industry-standard relevance scoring
- **Proximity Ranking**: score by term proximity
- **Position Tracking**: track exact word positions

### Text Processing

- **Tokenization**: Unicode-aware text splitting
- **Stemming**: Snowball (Porter2) stemmer
- **Stopword Filtering**: remove common words
- **Case Normalization**: case-insensitive search
- **Configurable Pipeline**: customize analysis behavior

### Data Structures

- **Roaring Bitmaps**: 400x compression for document sets
- **Skip Lists**: O(log n) position lookups
- **Inverted Index**: efficient term-to-position mapping
- **Binary Serialization**: compact storage format