## What is a Spicepod?

A Spicepod is a YAML configuration file (`spicepod.yaml`) that defines your Spice application. It describes:

- **Datasets**: Data sources to query or accelerate
- **Models**: LLM and ML models for inference
- **Embeddings**: Embedding models for vector search
- **Views**: SQL views over datasets
- **Catalogs**: External catalog connections
- **Runtime**: Runtime configuration (caching, parameters, distributed)
- **Secrets**: Secret store configurations
- **Tools**: MCP tool integrations
- **Evals**: Model and data evaluations

Think of a Spicepod as a declarative manifest for your data and AI infrastructure.
## Basic Structure

```yaml
version: v2
kind: Spicepod
name: my_app

runtime:
  # Runtime configuration

datasets:
  # Data sources

models:
  # LLM/ML models

embeddings:
  # Embedding models

views:
  # SQL views
```
## Complete Example

Here's the real `spicepod.yaml` from the Spice repository:
```yaml
version: v2
kind: Spicepod
name: spiceai

runtime:
  caching:
    sql_results:
      enabled: true
      item_ttl: 5s
  params:
    github_max_concurrent_connections: 5
    dataset_load_parallelism: 1

datasets:
  # GitHub stargazers with append-only refresh
  - from: github:github.com/spiceai/spiceai/stargazers
    name: stargazers
    description: github.com/spiceai/spiceai GitHub Stargazers
    time_column: starred_at
    time_format: timestamp
    params: &github_params
      github_client_id: ${secrets:GITHUB_CLIENT_ID}
      github_private_key: ${secrets:GITHUB_PRIVATE_KEY}
      github_installation_id: ${secrets:GITHUB_INSTALLATION_ID}
    acceleration: &github_acceleration
      enabled: true
      engine: duckdb
      refresh_mode: append
      refresh_append_overlap: 5m
      refresh_check_interval: 1h
      refresh_jitter_enabled: true
      refresh_jitter_max: 5m

  # CSV file from GitHub
  - from: https://raw.githubusercontent.com/spiceai/spiceai/refs/heads/trunk/docs/release_notes/qa_analytics.csv
    name: qa_analytics
    acceleration:
      enabled: true
      engine: duckdb
      refresh_check_interval: 1d

  # GitHub issues
  - from: github:github.com/spiceai/spiceai/issues
    name: issues
    description: github.com/spiceai/spiceai GitHub Issues
    params: *github_params
    time_column: updated_at
    time_format: timestamp
    acceleration: *github_acceleration

  # GitHub pull requests with comments
  - from: github:github.com/spiceai/spiceai/pulls
    name: pulls
    params:
      github_client_id: ${secrets:GITHUB_CLIENT_ID}
      github_private_key: ${secrets:GITHUB_PRIVATE_KEY}
      github_installation_id: ${secrets:GITHUB_INSTALLATION_ID}
      github_include_comments: all
    time_column: updated_at
    time_format: timestamp
    acceleration: *github_acceleration
```
Key features demonstrated:

- YAML anchors (`&github_params`, `*github_params`) for reusing config
- Secret references (`${secrets:GITHUB_CLIENT_ID}`)
- Time-series configuration (`time_column`, `time_format`)
- Append-mode acceleration with overlap and jitter
- Multiple dataset types (GitHub API, CSV from URL)
## Spicepod Sections

```yaml
version: v2     # Spicepod schema version (v1 or v2)
kind: Spicepod  # Always "Spicepod"
name: my_app    # Application name
metadata:       # Optional custom metadata
  author: "Your Name"
  description: "My Spice application"
```
### Runtime Configuration

```yaml
runtime:
  caching:
    sql_results:
      enabled: true
      item_ttl: 5s      # Cache query results for 5 seconds
      max_size: 128MiB  # Max cache size
  params:
    # Global runtime parameters
    github_max_concurrent_connections: 5
    dataset_load_parallelism: 4  # Load datasets in parallel
  distributed:
    enabled: true
    workers: 4  # Multi-node query execution
```

See Architecture for runtime details.
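Durations such as `5s`, `5m`, `1h`, and `1d` above follow a simple value-plus-suffix convention. As an illustration only (this is not the runtime's actual parser, and Spice may accept additional formats), a minimal sketch of turning these strings into seconds:

```python
import re

# Suffix multipliers, in seconds. Illustrative subset only.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_duration(text: str) -> int:
    """Parse a duration string like '5s', '5m', '1h', or '1d' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]

print(parse_duration("5s"))  # 5
print(parse_duration("1h"))  # 3600
```
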
### Datasets

```yaml
datasets:
  - from: postgres:public.orders
    name: orders
    description: Order transaction data

    # Connection parameters
    params:
      pg_host: localhost
      pg_port: 5432
      pg_db: ecommerce
      pg_user: ${secrets:pg_user}
      pg_pass: ${secrets:pg_pass}

    # Time-series configuration
    time_column: order_date
    time_format: timestamp

    # Acceleration
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: append
      refresh_check_interval: 5m
      indexes:
        customer_id: enabled
      primary_key: order_id

    # Embeddings for search
    columns:
      - name: description
        embeddings:
          - from: openai
            model: text-embedding-3-small
            row_ids:
              - order_id

    # Vector store
    vectors:
      store: s3_vectors
      params:
        s3_vectors_bucket: order-vectors
```
Key fields:

- `from`: Data source (format: `connector:path`)
- `name`: Table name in Spice
- `params`: Connector-specific parameters
- `acceleration`: Local materialization config
- `columns`: Column-specific config (embeddings, search)
- `vectors`: Vector storage for search
See Data Federation and Data Acceleration.
### Models

```yaml
models:
  # Hosted model
  - from: openai
    name: gpt-4o
    params:
      openai_api_key: ${secrets:openai_key}

  # Local model with GPU acceleration
  - from: file
    name: llama-3.1-8b-instruct
    files:
      - path: /models/llama-3.1-8b-instruct-q4.gguf
    params:
      llm_context_length: 8192
      llm_n_gpu_layers: 35

  # HuggingFace model
  - from: huggingface
    name: meta-llama/Llama-3.1-8B-Instruct
    params:
      huggingface_token: ${secrets:hf_token}
```

See AI Inference.
### Embeddings

```yaml
embeddings:
  - from: openai
    name: text-embedding-3-small
    params:
      openai_api_key: ${secrets:openai_key}
  - from: bedrock
    name: amazon.titan-embed-text-v1
    params:
      aws_region: us-east-1
      aws_access_key_id: ${secrets:aws_key}
      aws_secret_access_key: ${secrets:aws_secret}
  - from: model2vec
    name: minishlab/M2V_base_output
```
See Search.
### Views

Define SQL views over datasets:

```yaml
views:
  - name: recent_orders
    sql: |
      SELECT
        o.order_id,
        o.customer_id,
        c.customer_name,
        o.total,
        o.order_date
      FROM orders o
      JOIN customers c ON o.customer_id = c.id
      WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
    dependsOn:
      - orders
      - customers
```

Query it like a table:

```sql
SELECT * FROM recent_orders WHERE total > 100;
```
### Catalogs

Connect to external catalogs:

```yaml
catalogs:
  - from: unity_catalog
    name: databricks_catalog
    params:
      unity_catalog_url: https://my-workspace.cloud.databricks.com
      databricks_token: ${secrets:databricks_token}
  - from: iceberg
    name: iceberg_catalog
    params:
      iceberg_catalog_uri: http://iceberg-rest:8181
```

Query catalog tables:

```sql
SHOW TABLES FROM databricks_catalog.production;
SELECT * FROM databricks_catalog.production.sales;
```
See Data Federation.
### Secrets

Configure secret stores:

```yaml
secrets:
  # Environment variables (default)
  - from: env
    name: env

  # AWS Secrets Manager
  - from: aws_secrets_manager
    name: aws
    params:
      aws_region: us-east-1

  # Kubernetes secrets
  - from: kubernetes
    name: k8s
```

Use secrets:

```yaml
params:
  pg_user: ${secrets:POSTGRES_USER}    # From env
  api_key: ${secrets:aws:prod/api_key} # From AWS Secrets Manager
```
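A secret reference takes either one segment (a key resolved against the default store) or two (`store:key`). Purely to illustrate the syntax, and not Spice's actual resolution code, a small sketch of splitting such references apart (the `"env"` default-store assumption is ours):

```python
import re

SECRET_REF = re.compile(r"\$\{secrets:([^}]+)\}")

def split_secret_ref(param_value: str):
    """Return (store, key) for a ${secrets:...} reference, or None.

    One segment means the default store; 'store:key' names a store
    explicitly, e.g. ${secrets:aws:prod/api_key}.
    """
    match = SECRET_REF.fullmatch(param_value)
    if match is None:
        return None  # plain literal value, not a secret reference
    ref = match.group(1)
    store, sep, key = ref.partition(":")
    if not sep:
        return ("env", ref)  # assume 'env' is the default store
    return (store, key)

print(split_secret_ref("${secrets:POSTGRES_USER}"))     # ('env', 'POSTGRES_USER')
print(split_secret_ref("${secrets:aws:prod/api_key}"))  # ('aws', 'prod/api_key')
```
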
### Snapshots

Configure acceleration snapshots:

```yaml
snapshots:
  from: s3://my-bucket/snapshots
  params:
    aws_region: us-east-1
    aws_access_key_id: ${secrets:aws_key}
    aws_secret_access_key: ${secrets:aws_secret}
```

Enable per dataset:

```yaml
datasets:
  - name: large_dataset
    acceleration:
      enabled: true
      snapshots: enabled  # Bootstrap from/create snapshots
```
See Data Acceleration.
### Dependencies

Include other Spicepods:

```yaml
dependencies:
  - spiceai/quickstart
  - github.com/myorg/shared-datasets
```

Dependencies are resolved from:

- The Spicerack registry (https://spicerack.org)
- GitHub repositories
- Local file paths

Example:

```bash
spice add spiceai/quickstart
```

This adds the `spiceai/quickstart` Spicepod as a dependency.
### Tools

Model Context Protocol (MCP) integrations:

```yaml
tools:
  - from: mcp
    name: weather_api
    params:
      mcp_endpoint: http://weather-service:8080/mcp
  - from: mcp
    name: database_tools
    params:
      mcp_endpoint: http://db-tools:8080/mcp
```

See AI Inference.
### Evaluations

```yaml
evals:
  - name: rag_accuracy
    from: dataset:qa_test_set
    type: llm_graded
    params:
      judge_model: gpt-4o
      criteria: accuracy
  - name: data_quality
    from: dataset:prod_data
    type: data_quality
    params:
      rules:
        - column: email
          type: email_format
```
## Advanced Features

### YAML Anchors for Reuse

Avoid repetition with YAML anchors:

```yaml
datasets:
  - from: postgres:public.orders
    name: orders
    params: &pg_params
      pg_host: localhost
      pg_port: 5432
      pg_user: ${secrets:pg_user}
      pg_pass: ${secrets:pg_pass}
    acceleration: &pg_acceleration
      enabled: true
      engine: duckdb
      refresh_check_interval: 5m

  - from: postgres:public.customers
    name: customers
    params: *pg_params             # Reuse params
    acceleration: *pg_acceleration # Reuse acceleration config
```
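Anchors are resolved by the YAML parser before the document reaches Spice, so a `customers` entry that aliases `*pg_params` and `*pg_acceleration` is equivalent to writing the configuration out in full:

```yaml
- from: postgres:public.customers
  name: customers
  params:
    pg_host: localhost
    pg_port: 5432
    pg_user: ${secrets:pg_user}
    pg_pass: ${secrets:pg_pass}
  acceleration:
    enabled: true
    engine: duckdb
    refresh_check_interval: 5m
```

One caveat: an alias reuses the anchored value as-is, so any per-dataset override (such as a different `refresh_check_interval`) needs its own explicit block.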
### Access Modes

Control dataset access:

```yaml
datasets:
  - name: readonly_data
    access: ro  # Read-only (default)
  - name: writable_data
    access: rw  # Read-write (enables INSERT, UPDATE, DELETE)
```
### Unsupported Type Handling

Handle unsupported data types:

```yaml
datasets:
  - name: mixed_types
    unsupported_type_action: string  # Convert to string
    # Options: error, warn, ignore, string
```
### Ready State

Control when a dataset becomes queryable:

```yaml
datasets:
  - name: critical_data
    ready_state: on_load          # Wait for initial load (default)
  - name: optional_data
    ready_state: on_registration  # Available immediately, falls back to source
```
### Partitioning

Partition accelerated data:

```yaml
acceleration:
  enabled: true
  partition_by:
    - year(order_date)
    - region
```
Enables partition pruning for better performance.
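To see why pruning helps, consider a toy model of partitioned storage: each partition is labeled with the values it was partitioned by, so a filter on those columns lets the engine skip whole partitions without reading a single row from them. A simplified illustration only, not DuckDB's or Spice's actual pruning logic:

```python
# Each partition is labeled with the values it was partitioned by.
partitions = [
    {"year": 2023, "region": "us"},
    {"year": 2023, "region": "eu"},
    {"year": 2024, "region": "us"},
    {"year": 2024, "region": "eu"},
]

def prune(partitions, predicate):
    """Keep only the partitions whose labels satisfy the predicate;
    the query engine never needs to scan the rest."""
    return [p for p in partitions if predicate(p)]

# WHERE year(order_date) = 2024 AND region = 'us'
survivors = prune(partitions, lambda p: p["year"] == 2024 and p["region"] == "us")
print(survivors)  # [{'year': 2024, 'region': 'us'}]
```

With four partitions the saving is trivial, but the same pruning over thousands of date/region partitions can eliminate most of the scan up front.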
### Replication

Enable dataset replication:

```yaml
datasets:
  - name: replicated_data
    replication:
      enabled: true
```
## Spicepod Lifecycle

### 1. Initialize

```bash
spice init my_app
cd my_app
```

Creates a blank `spicepod.yaml`:

```yaml
version: v2
kind: Spicepod
name: my_app
```

### 2. Configure

Edit `spicepod.yaml` to add datasets, models, and other components.

### 3. Run

The Spice runtime:

- Parses `spicepod.yaml`
- Registers datasets
- Loads models
- Starts acceleration refreshes
- Serves APIs (HTTP, Flight, ODBC, etc.)

### 4. Query

Use the interactive SQL REPL to query your datasets.
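Queries can also be issued programmatically over HTTP. A minimal sketch using only the Python standard library, assuming the runtime's default local HTTP endpoint (`http://localhost:8090`) and its `/v1/sql` route; verify both against your deployment before relying on them:

```python
import urllib.request

def build_sql_request(sql: str, base_url: str = "http://localhost:8090"):
    """Build a POST request carrying a SQL query to the Spice HTTP API."""
    return urllib.request.Request(
        url=f"{base_url}/v1/sql",
        data=sql.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

req = build_sql_request("SELECT * FROM recent_orders WHERE total > 100;")
print(req.full_url)  # http://localhost:8090/v1/sql

# With a runtime listening locally, send the request and read the JSON rows:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```
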
## Configuration Best Practices

- **Use secrets**: Never hardcode credentials
- **YAML anchors**: Reduce duplication
- **Descriptive names**: Use clear dataset/model names
- **Comments**: Document complex configurations
- **Version control**: Track `spicepod.yaml` in git (exclude `.env`)
- **Modular**: Use dependencies for shared configs
- **Start simple**: Add complexity incrementally
- **Test locally**: Validate before deploying
## Spicepod vs. Traditional Config

| Traditional Config | Spicepod |
|---|---|
| Database connection strings | Declarative dataset definitions |
| Manual schema management | Automatic schema inference |
| Separate model serving | Unified data + AI config |
| Code-based pipelines | YAML-based orchestration |
| Scattered configs | Single source of truth |
## CLI Commands

```bash
# Initialize a new Spicepod
spice init my_app

# Add a dependency
spice add spiceai/quickstart

# Configure a dataset interactively
spice dataset configure

# Run the runtime
spice run

# Open the SQL REPL
spice sql

# Validate the Spicepod
spice validate

# Log in to Spice.ai Cloud
spice login
```
## Schema Validation

Spicepods have a JSON schema for validation:

```bash
# Validate your spicepod.yaml
spice validate
```

IDEs with YAML LSP support can provide autocomplete and validation.
## Example: Full-Stack Application

```yaml
version: v2
kind: Spicepod
name: ecommerce_app

runtime:
  caching:
    sql_results:
      enabled: true
      item_ttl: 10s
  distributed:
    enabled: true
    workers: 2

snapshots:
  from: s3://my-snapshots/ecommerce
  params:
    aws_region: us-east-1

secrets:
  - from: env
    name: env
  - from: aws_secrets_manager
    name: aws
    params:
      aws_region: us-east-1

datasets:
  # Operational database
  - from: postgres:public.orders
    name: orders
    params:
      pg_host: ${secrets:aws:prod/pg_host}
      pg_db: ecommerce
      pg_user: ${secrets:aws:prod/pg_user}
      pg_pass: ${secrets:aws:prod/pg_pass}
    time_column: order_date
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: append
      refresh_check_interval: 1m
      snapshots: enabled
      primary_key: order_id

  # Data warehouse
  - from: snowflake:analytics.customer_ltv
    name: customer_ltv
    params:
      snowflake_account: ${secrets:aws:prod/sf_account}
      snowflake_warehouse: analytics_wh
      snowflake_username: ${secrets:aws:prod/sf_user}
      snowflake_password: ${secrets:aws:prod/sf_pass}
    acceleration:
      enabled: true
      engine: duckdb
      refresh_check_interval: 1h

  # Product catalog with search
  - from: postgres:public.products
    name: products
    acceleration:
      enabled: true
      engine: sqlite
      indexes:
        category: enabled
        brand: enabled
    columns:
      - name: description
        embeddings:
          - from: openai
            model: text-embedding-3-small
            row_ids:
              - product_id
        full_text_search:
          enabled: true
          row_ids:
            - product_id
    vectors:
      store: s3_vectors
      params:
        s3_vectors_bucket: product-vectors

views:
  - name: order_summary
    sql: |
      SELECT
        o.order_id,
        o.customer_id,
        o.total,
        c.ltv,
        o.order_date
      FROM orders o
      LEFT JOIN customer_ltv c ON o.customer_id = c.customer_id
    dependsOn:
      - orders
      - customer_ltv

models:
  - from: openai
    name: gpt-4o-mini
    params:
      openai_api_key: ${secrets:OPENAI_API_KEY}
  - from: file
    name: llama-3.1-8b-instruct
    files:
      - path: /models/llama-3.1-8b.gguf
    params:
      llm_n_gpu_layers: 35

embeddings:
  - from: openai
    name: text-embedding-3-small
    params:
      openai_api_key: ${secrets:OPENAI_API_KEY}
```
This Spicepod provides:
- Federated query across Postgres + Snowflake
- Local acceleration with snapshots
- Hybrid search on products (vector + keyword)
- LLM inference (OpenAI + local Llama)
- SQL views for analytics
- Distributed query execution
- Results caching
## Next Steps

- **Architecture**: Understand how Spice works
- **Data Federation**: Query multiple data sources
- **Data Acceleration**: Materialize data locally
- **Spicepod Reference**: Complete YAML schema reference