Full-Text Search

Full-text search uses BM25 ranking with the Tantivy search engine to find documents by keyword relevance.

text_search() UDTF

The text_search() User-Defined Table Function performs full-text search:

text_search(
  table_name,
  query,
  [column_name],
  [limit],
  [include_score]
)

Parameters

table_name (required) - Name of the table with indexed text
query (required) - Search query string
column_name (optional) - Specific column to search
limit (optional) - Maximum results to return (default: 1000)
include_score (optional) - Include _score column (default: true)

Basic Usage

-- Simple text search
SELECT * FROM text_search(documents, 'database query');

-- Limit results
SELECT id, title, _score
FROM text_search(articles, 'machine learning', limit => 10);

-- Search specific column
SELECT question, answer, _score
FROM text_search(faqs, 'password reset', question, 5);

BM25 Ranking

BM25 (Best Matching 25) is a probabilistic ranking function that scores documents based on:

Term Frequency (TF) - How often query terms appear in the document
Inverse Document Frequency (IDF) - Rarity of terms across all documents
Document Length - Normalized by average document length

Score interpretation: Higher scores = better matches. Scores are not normalized to [0,1].

Query Syntax

Text search supports simple, space-delimited queries:

-- Multiple terms (OR by default)
SELECT * FROM text_search(docs, 'rust programming');
-- Matches documents containing "rust" OR "programming"

-- Phrase matching
SELECT * FROM text_search(docs, 'machine learning');
-- Matches documents with "machine" OR "learning"

-- Special characters are tokenized
SELECT * FROM text_search(docs, 'C++ developer');
-- Searches for "c", "developer"

Note: The query is tokenized and lowercased. Complex boolean operators (AND, OR, NOT) are treated as regular terms.

Configuration

Enable full-text search in your spicepod.yaml:

datasets:
  - name: articles
    from: postgres:articles
    acceleration:
      enabled: true
    search:
      full_text:
        enabled: true
        columns:
          - title
          - content

Multi-Column Indexing

Index multiple columns for comprehensive search:

datasets:
  - name: support_tickets
    search:
      full_text:
        enabled: true
        columns:
          - subject
          - description
          - resolution_notes

Search specific indexed columns:

-- Search in subject field
SELECT * FROM text_search(support_tickets, 'billing issue', subject);

-- Search in description
SELECT * FROM text_search(support_tickets, 'payment failed', description);

Integration with SQL

Filtering Results

SELECT 
  id,
  title,
  category,
  _score
FROM text_search(articles, 'kubernetes deployment')
WHERE category = 'DevOps'
  AND publish_date > '2024-01-01'
  AND _score > 10.0
ORDER BY _score DESC;

Joining with Tables

SELECT 
  t.question_id,
  t.title,
  t._score,
  u.username,
  u.reputation
FROM text_search(questions, 'gitignore untracked') t
JOIN users u ON t.user_id = u.id
WHERE t._score > 5.0
ORDER BY t._score DESC
LIMIT 10;

Aggregations

SELECT 
  category,
  COUNT(*) as match_count,
  AVG(_score) as avg_score
FROM text_search(documents, 'artificial intelligence')
GROUP BY category
ORDER BY avg_score DESC;

Tantivy Indexing

Spice uses Tantivy, a full-text search library written in Rust, providing:

Fast indexing and searching
Low memory footprint
BM25 ranking algorithm
Simple tokenization and lowercasing

Index Storage

Full-text indexes are stored alongside accelerated data:

datasets:
  - name: documents
    acceleration:
      enabled: true
      engine: duckdb  # or arrow, sqlite, postgres
    search:
      full_text:
        enabled: true
        columns: [content]

The Tantivy index is created and maintained automatically as data is loaded and updated.

Performance

Index Creation

Indexes are built during dataset acceleration:

# Monitor index creation
2024-01-20T10:30:15.123Z  INFO Loading data for dataset documents
2024-01-20T10:30:20.456Z  INFO Creating full-text index for column: content
2024-01-20T10:30:25.789Z  INFO Loaded 100,000 rows for dataset documents

Query Performance

Full-text search is optimized for:

Small to medium datasets (< 10M documents)
Keyword-based queries (not semantic search)
Fast exact matching (microseconds for indexed terms)

For large datasets or semantic search, consider Vector Search.

Optimization Tips

Limit results: Use the limit parameter in the UDTF
Pre-filter: Apply WHERE clauses to reduce result set
Accelerate data: Enable acceleration for the dataset
Selective columns: Only index columns you need to search

Special Columns

_score

BM25 relevance score (higher = more relevant):

SELECT title, _score
FROM text_search(docs, 'rust programming')
WHERE _score > 8.0;

_value

The matched content from the search column:

SELECT id, _value as matched_text, _score
FROM text_search(articles, 'database')
LIMIT 5;

Examples

Document Search

-- Find technical documentation
SELECT 
  doc_id,
  title,
  section,
  _score
FROM text_search(documentation, 'authentication oauth token')
WHERE section IN ('API', 'Security')
ORDER BY _score DESC
LIMIT 20;

Log Analysis

-- Search error logs
SELECT 
  timestamp,
  service_name,
  _value as log_message,
  _score
FROM text_search(logs, 'connection timeout error')
WHERE timestamp > NOW() - INTERVAL '1 hour'
ORDER BY _score DESC, timestamp DESC
LIMIT 50;

E-commerce Product Search

-- Find products by description
SELECT 
  product_id,
  name,
  description,
  price,
  _score
FROM text_search(products, 'wireless bluetooth headphones', description)
WHERE price < 100.00
  AND in_stock = true
ORDER BY _score DESC
LIMIT 10;

FAQ Search

-- Search support articles
SELECT 
  article_id,
  question,
  answer,
  category,
  _score
FROM text_search(faqs, 'forgot password reset', question, 10)
ORDER BY _score DESC;

Limitations

No boolean operators: AND, OR, NOT are treated as regular terms
No phrase search: Multi-word queries are tokenized into OR terms
No wildcards: Pattern matching not supported
No filters in UDTF: Use SQL WHERE clauses instead
Single-column queries: Search one column per query

Comparison: Text Search vs Vector Search

Feature	Full-Text Search (BM25)	Vector Search
Best for	Exact keyword matching	Semantic similarity
Query type	Keywords, terms	Natural language
Speed	Very fast	Fast (depends on index)
Setup	Automatic indexing	Requires embeddings
Storage	Tantivy index	Vector store
Scoring	BM25 (TF-IDF based)	Distance metrics
Relevance	Keyword frequency	Semantic meaning

Recommendation: Use Hybrid Search to combine both methods for optimal results.

Troubleshooting

No results found

Verify column is indexed: Check search.full_text.columns in spicepod.yaml
Check query terms: Try individual words
Verify data is loaded: SELECT COUNT(*) FROM table

Unexpected results

Remember queries are tokenized and lowercased
Special characters are removed during tokenization
BM25 scores are not normalized

Index not created

Ensure acceleration.enabled: true
Check logs for indexing errors
Verify column data types (text/string columns only)

Get Started

Core Concepts

Data Connectors

Data Accelerators

Search

AI & ML

Deployment

text_search() UDTF

Parameters

Basic Usage

BM25 Ranking

Query Syntax

Configuration

Multi-Column Indexing

Integration with SQL

Filtering Results

Joining with Tables

Aggregations

Tantivy Indexing

Index Storage

Performance

Index Creation

Query Performance

Optimization Tips

Special Columns

_score

_value

Examples

Document Search

Log Analysis

E-commerce Product Search

FAQ Search

Limitations

Comparison: Text Search vs Vector Search

Troubleshooting

No results found

Unexpected results

Index not created

See Also

Get Started

Core Concepts

Data Connectors

Data Accelerators

Search

AI & ML

Deployment

Documentation Index

​text_search() UDTF

​Parameters

​Basic Usage

​BM25 Ranking

​Query Syntax

​Configuration

​Multi-Column Indexing

​Integration with SQL

​Filtering Results

​Joining with Tables

​Aggregations

​Tantivy Indexing

​Index Storage

​Performance

​Index Creation

​Query Performance

​Optimization Tips

​Special Columns

​_score

​_value

​Examples

​Document Search

​Log Analysis

​E-commerce Product Search

​FAQ Search

​Limitations

​Comparison: Text Search vs Vector Search

​Troubleshooting

​No results found

​Unexpected results

​Index not created

​See Also

text_search() UDTF

Parameters

Basic Usage

BM25 Ranking

Query Syntax

Configuration

Multi-Column Indexing

Integration with SQL

Filtering Results

Joining with Tables

Aggregations

Tantivy Indexing

Index Storage

Performance

Index Creation

Query Performance

Optimization Tips

Special Columns

_score

_value

Examples

Document Search

Log Analysis

E-commerce Product Search

FAQ Search

Limitations

Comparison: Text Search vs Vector Search

Troubleshooting

No results found

Unexpected results

Index not created

See Also