Vector Similarity Search

Vector similarity search finds semantically similar content by comparing vector embeddings using distance metrics.

vector_search() UDTF

The vector_search() User-Defined Table Function (UDTF) performs vector similarity search over embedded columns:
vector_search(
  table_name,
  query,
  [column_name],
  [limit],
  [include_score]
)

Parameters

  • table_name (required) - Name of the table with embedded data
  • query (required) - Search query string (automatically embedded)
  • column_name (optional) - Specific column to search (for multi-column tables)
  • limit (optional) - Maximum results to return (default: 1000)
  • include_score (optional) - Include _score column (default: true)

Basic Usage

-- Simple vector search
SELECT * FROM vector_search(documents, 'artificial intelligence');

-- Limit results
SELECT id, title, _score
FROM vector_search(articles, 'machine learning', limit => 5);

-- Search specific column
SELECT product_name, description, _score
FROM vector_search(products, 'wireless earbuds', description, 10);

Integration with SQL

The vector_search() UDTF behaves as a regular table expression, so its results compose with standard SQL clauses such as WHERE, ORDER BY, JOIN, and GROUP BY:
-- Filter results
SELECT title, content, _score
FROM vector_search(articles, 'climate change')
WHERE publish_date > '2024-01-01'
  AND category = 'science'
ORDER BY _score DESC
LIMIT 10;

-- Join with other tables
SELECT 
  v.title,
  v._score,
  a.author_name
FROM vector_search(articles, 'quantum computing') v
JOIN authors a ON v.author_id = a.id
WHERE v._score > 0.7;

-- Aggregations
SELECT 
  category,
  COUNT(*) as matches,
  AVG(_score) as avg_relevance
FROM vector_search(documents, 'neural networks')
GROUP BY category;

Distance Metrics

Spice supports three distance metrics for vector similarity:

Cosine Similarity (Default)

Measures the cosine of the angle between vectors, producing a score in [-1, 1]. Best for text embeddings.
datasets:
  - name: documents
    embeddings:
      - column: content
        distance_metric: cosine  # default
Score interpretation: 1.0 = identical, 0.0 = orthogonal, -1.0 = opposite

Euclidean Distance (L2)

Measures straight-line distance between points. Good for spatial data.
datasets:
  - name: images
    embeddings:
      - column: image_features
        distance_metric: euclidean
Score interpretation: Lower is more similar (converted to 1/(1+distance))

Dot Product

Raw inner product of vectors. Efficient but sensitive to magnitude.
datasets:
  - name: products
    embeddings:
      - column: description
        distance_metric: dot_product
Score interpretation: Higher is more similar
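The three metrics and their score conversions can be illustrated with a small, stdlib-only Python sketch (function names are illustrative, not Spice APIs; the conversions mirror the interpretations above):

```python
import math

def cosine_score(a, b):
    # Cosine similarity: cosine of the angle between vectors, in [-1, 1].
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def euclidean_score(a, b):
    # L2 distance converted to a similarity score via 1 / (1 + distance),
    # so higher remains more similar.
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1 / (1 + dist)

def dot_product_score(a, b):
    # Raw inner product: efficient, but sensitive to vector magnitude.
    return sum(x * y for x, y in zip(a, b))

a, b = [1.0, 0.0], [0.0, 1.0]      # orthogonal vectors
print(cosine_score(a, b))          # 0.0 (orthogonal)
print(dot_product_score([1.0, 2.0], [3.0, 4.0]))  # 11.0
```

Note how a dot product of 11.0 says nothing on its own: scaling either vector doubles the score, which is why cosine is the safer default for text embeddings of varying length.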

Embedding Configuration

Configure automatic embedding generation in your spicepod.yaml:

AWS Bedrock

datasets:
  - name: knowledge_base
    embeddings:
      - column: content
        model:
          from: bedrock
          name: amazon.titan-embed-text-v2:0
        params:
          aws_region: us-east-1

HuggingFace

datasets:
  - name: documents
    embeddings:
      - column: text
        model:
          from: huggingface
          name: sentence-transformers/all-MiniLM-L6-v2

Model2Vec (Fast Static Embeddings)

datasets:
  - name: articles
    embeddings:
      - column: body
        model:
          from: model2vec
          name: minishlab/M2V_base_output

OpenAI

datasets:
  - name: faqs
    embeddings:
      - column: question
        model:
          from: openai
          name: text-embedding-3-small
        params:
          openai_api_key: ${secrets:openai_key}

Multi-Column Embeddings

Embed and search multiple columns independently:
datasets:
  - name: products
    embeddings:
      - column: title
        model:
          from: openai
          name: text-embedding-3-small
      - column: description
        model:
          from: openai
          name: text-embedding-3-small
Query specific columns:
-- Search titles
SELECT * FROM vector_search(products, 'laptop', title) LIMIT 5;

-- Search descriptions
SELECT * FROM vector_search(products, 'portable computer', description) LIMIT 5;

-- Default searches first embedded column
SELECT * FROM vector_search(products, 'notebook') LIMIT 5;

Chunking

For large text fields, enable chunking to search within document segments:
datasets:
  - name: documents
    embeddings:
      - column: content
        chunking:
          enabled: true
          target_chunk_size: 512
          overlap: 50
Chunked searches return a special _match column with the matched substring:
SELECT 
  id,
  title,
  _match,  -- The specific chunk that matched
  _score
FROM vector_search(documents, 'machine learning algorithms')
ORDER BY _score DESC
LIMIT 5;
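
The chunking parameters can be sketched with a simplified character-based splitter in Python (Spice's actual chunker may split on token or sentence boundaries; this only illustrates what target_chunk_size and overlap control):

```python
def chunk_text(text, target_chunk_size=512, overlap=50):
    # Split text into windows of up to target_chunk_size characters that
    # overlap by `overlap` characters, so a phrase falling on a boundary
    # still appears whole in at least one chunk.
    step = target_chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + target_chunk_size])
        if start + target_chunk_size >= len(text):
            break
    return chunks

text = "".join(chr(65 + i % 26) for i in range(1000))  # ABC... repeated
chunks = chunk_text(text)
print(len(chunks))                        # 3
print(chunks[0][-50:] == chunks[1][:50])  # True: overlap preserved
```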

Performance Considerations

Pre-filtering

Filter before search for better performance:
-- Good: Filter pushdown to base table
SELECT * FROM vector_search(docs, 'query')
WHERE category = 'tech';  -- Pushed down if possible

-- Better: Use view with pre-filter
CREATE VIEW tech_docs AS SELECT * FROM documents WHERE category = 'tech';
SELECT * FROM vector_search(tech_docs, 'query');

Limit Early

Use the limit parameter in the UDTF rather than SQL LIMIT:
-- Good: Limit in UDTF
SELECT * FROM vector_search(docs, 'query', limit => 100)
WHERE condition;

-- Less efficient: SQL LIMIT after large search
SELECT * FROM vector_search(docs, 'query')
LIMIT 100;

Dimension Optimization

Smaller embedding dimensions = faster search:
  • 384 dims: Good balance (MiniLM models)
  • 768 dims: Better accuracy (BERT base)
  • 1536 dims: OpenAI text-embedding-3-small
  • 3072 dims: Highest accuracy, slower (text-embedding-3-large)

Examples

FAQ Search

SELECT 
  question,
  answer,
  category,
  _score
FROM vector_search(faqs, 'how do I reset my password')
WHERE _score > 0.6
ORDER BY _score DESC
LIMIT 3;

Product Recommendations

-- Find similar products
WITH current_product AS (
  SELECT description FROM products WHERE id = 12345
)
SELECT 
  p.id,
  p.name,
  p.price,
  v._score
FROM current_product cp
CROSS JOIN LATERAL (
  SELECT * FROM vector_search(products, cp.description, description)
) v
JOIN products p ON v.id = p.id
WHERE v.id != 12345
LIMIT 5;

Context Retrieval for RAG

-- Retrieve context for LLM prompt
SELECT 
  chunk_text,
  source_document,
  _score
FROM vector_search(knowledge_base, 'explain gradient descent')
ORDER BY _score DESC
LIMIT 3;
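
The retrieved rows are typically concatenated into the LLM prompt. A minimal sketch of that assembly step, assuming rows shaped like the query above (nothing here is a Spice API):

```python
def build_rag_prompt(question, rows):
    # rows: (chunk_text, source_document, score) tuples, ordered by relevance.
    context = "\n\n".join(
        f"[{source}] (score {score:.2f})\n{chunk}"
        for chunk, source, score in rows
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )

rows = [("Gradient descent iteratively minimizes a loss function.",
         "ml-notes.md", 0.91)]
print(build_rag_prompt("explain gradient descent", rows))
```

Including the source document and score in the prompt lets the model cite its context and lets you audit which chunks drove the answer.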

Troubleshooting

No results returned

  • Verify embeddings are generated: SELECT COUNT(*) FROM table WHERE embedding_column IS NOT NULL
  • Check search column exists and is embedded
  • Try increasing the limit parameter

Low relevance scores

  • Ensure embedding model matches your domain (code, general text, etc.)
  • Consider different distance metrics
  • Verify embeddings are normalized if using cosine similarity
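
Normalization (the last point) can be checked or applied with a short stdlib sketch; for unit-length vectors, dot product and cosine similarity coincide (illustrative code, not a Spice API):

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; on unit vectors, the dot product
    # equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

v = l2_normalize([3.0, 4.0])
print(v)  # [0.6, 0.8]
print(sum(x * x for x in v))  # ~1.0: unit length
```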

Slow performance

  • Use Amazon S3 Vectors for large datasets
  • Reduce embedding dimensions
  • Add pre-filters to reduce search space
  • Consider partitioning large tables

See Also