DuckDB Data Connector
The DuckDB connector enables Spice to query DuckDB databases (both file-based and in-memory) with full SQL push-down support and bidirectional data operations.
Status
Stable - Production-ready with comprehensive testing
Supported Features
- Full SQL query push-down
- Embedded DuckDB engine
- File-based and in-memory databases
- Read and write operations
- DELETE operations with transaction support
- Data acceleration
- Native Arrow integration
- Extensions support
Configuration
File-Based Database
version: v1
kind: Spicepod
name: duckdb-demo
datasets:
  - from: duckdb:my_table
    name: my_table
    params:
      duckdb_open: ./data/my_database.db
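Once the Spicepod is loaded, the table is queryable through Spice's SQL interface:
SELECT * FROM my_table LIMIT 10;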
In-Memory Database
datasets:
  - from: duckdb:analytics
    name: analytics
    params:
      duckdb_open: :memory:
Multiple Tables
datasets:
  - from: duckdb:catalog_sales
    name: catalog_sales
    params: &duckdb_params
      duckdb_open: ./data/tpcds.db
  - from: duckdb:customer
    name: customer
    params: *duckdb_params
  - from: duckdb:store_sales
    name: store_sales
    params: *duckdb_params
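With all three tables registered, a single query can join across them. A sketch assuming the standard TPC-DS column names:
SELECT
  c.c_last_name,
  COUNT(*) as order_count,
  SUM(cs.cs_net_paid) as total_paid
FROM catalog_sales cs
JOIN customer c ON cs.cs_bill_customer_sk = c.c_customer_sk
GROUP BY c.c_last_name
ORDER BY total_paid DESC
LIMIT 10;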
With Write Support
datasets:
  - from: duckdb:output_table
    name: output_table
    params:
      duckdb_open: ./data/output.db
    mode: readwrite # Enable writes
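With readwrite mode enabled, DML statements work against the dataset. A minimal sketch (the column names are hypothetical):
-- Hypothetical columns, for illustration only
INSERT INTO output_table (run_id, metric, value)
VALUES (42, 'latency_ms', 12.5);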
As an Accelerator
DuckDB can also be used as an acceleration engine:
datasets:
  - from: postgres:public.large_table
    name: large_table
    params:
      pg_host: postgres.internal
      pg_db: warehouse
      pg_user: readonly
      pg_pass: ${secrets:pg_password}
    acceleration:
      enabled: true
      engine: duckdb
      mode: file # or memory
      refresh_interval: 5m
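Queries against large_table are then served from the local DuckDB copy rather than the remote PostgreSQL instance:
-- Runs against the DuckDB acceleration, refreshed every 5 minutes
SELECT COUNT(*) FROM large_table;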
Parameters
- duckdb_open: Path to the DuckDB database file, or :memory: for an in-memory database
- mode: Access mode; defaults to read-only, set to readwrite to enable write operations
- duckdb_threads: Number of threads for query execution (default: number of CPU cores)
- duckdb_memory_limit: Memory limit for DuckDB (e.g., 4GB, 512MB)
- duckdb_extensions: List of DuckDB extensions to load (e.g., parquet, json, httpfs)
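A sketch combining these parameters (the dataset name and values are illustrative):
datasets:
  - from: duckdb:events
    name: events
    params:
      duckdb_open: ./data/events.db
      duckdb_threads: 4 # cap query parallelism
      duckdb_memory_limit: 4GB # avoid OOM on large aggregations
      duckdb_extensions:
        - parquet
        - json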
Use Cases
Local Analytics Database
datasets:
  - from: duckdb:sales_data
    name: sales
    params:
      duckdb_open: ./analytics/sales.duckdb
Query local DuckDB:
SELECT
  product_category,
  SUM(revenue) as total_revenue,
  AVG(profit_margin) as avg_margin
FROM sales
WHERE sale_date >= '2024-01-01'
GROUP BY product_category
ORDER BY total_revenue DESC;
Data Processing Pipeline
datasets:
  # Read from S3
  - from: s3://raw-data/events/*.parquet
    name: raw_events
    params:
      file_format: parquet
  # Write to DuckDB for processing
  - from: duckdb:processed_events
    name: processed_events
    params:
      duckdb_open: ./processing/events.db
    mode: readwrite
Process and store:
-- Insert processed data into DuckDB
INSERT INTO processed_events
SELECT
  event_id,
  user_id,
  event_type,
  DATE_TRUNC('hour', timestamp) as event_hour,
  properties
FROM raw_events
WHERE event_type IN ('click', 'purchase');
Temporary In-Memory Analysis
datasets:
  - from: postgres:public.transactions
    name: transactions
    params:
      pg_host: prod-db.internal
      pg_db: finance
      pg_user: analyst
      pg_pass: ${secrets:pg_password}
    acceleration:
      enabled: true
      engine: duckdb
      mode: memory # Fast in-memory processing
      refresh_interval: 1m
Fast analytical queries:
-- Complex aggregation runs on accelerated DuckDB
SELECT
  DATE_TRUNC('day', transaction_date) as day,
  merchant_category,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) as median_amount,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) as p95_amount,
  COUNT(*) as transaction_count
FROM transactions
WHERE transaction_date >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY day, merchant_category
ORDER BY day DESC, transaction_count DESC;
Multi-Source Federation with DuckDB
datasets:
  # DuckDB local cache
  - from: duckdb:customer_cache
    name: customers
    params:
      duckdb_open: ./cache/customers.db
  # PostgreSQL orders
  - from: postgres:public.orders
    name: orders
    params:
      pg_host: postgres.internal
      pg_db: orders
      pg_user: app
      pg_pass: ${secrets:pg_password}
  # S3 product catalog
  - from: s3://catalog/products.parquet
    name: products
    params:
      file_format: parquet
Join across sources:
SELECT
  c.name as customer_name,
  p.name as product_name,
  o.order_date,
  o.quantity,
  o.total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '7' DAY
ORDER BY o.order_date DESC;
Delete Operations
datasets:
  - from: duckdb:audit_log
    name: audit_log
    params:
      duckdb_open: ./logs/audit.db
    mode: readwrite
Delete old records:
-- Delete records older than 90 days
DELETE FROM audit_log
WHERE log_date < CURRENT_DATE - INTERVAL '90' DAY;
Best Practices
- File vs Memory Mode:
  - Use file mode for large datasets that exceed memory
  - Use memory mode for the fastest queries on smaller datasets
- Threading: DuckDB automatically uses all CPU cores; adjust with duckdb_threads if needed
- Memory Management: Set duckdb_memory_limit to prevent OOM on large queries
- Acceleration: DuckDB is an excellent acceleration engine for other connectors
- Extensions: Load only required extensions to minimize overhead
DuckDB as Accelerator
DuckDB provides high-performance acceleration with two modes:
Memory Mode
acceleration:
  enabled: true
  engine: duckdb
  mode: memory
  refresh_interval: 5m
- Fastest queries (pure in-memory)
- Limited by available RAM
- Best for frequently accessed smaller datasets
File Mode
acceleration:
  enabled: true
  engine: duckdb
  mode: file
  refresh_interval: 1h
- Larger-than-memory datasets
- Persistent across restarts
- Excellent compression
- Best for large analytical datasets
Extensions
DuckDB supports many extensions:
params:
  duckdb_open: ./data/extended.db
  duckdb_extensions:
    - parquet
    - json
    - httpfs
    - spatial
Common extensions:
- parquet: Parquet file support
- json: JSON processing
- httpfs: HTTP(S) file access
- spatial: Geospatial functions
- fts: Full-text search
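With the json extension loaded, DuckDB's JSON functions can be used in queries that push down to the engine. A sketch, assuming a hypothetical events table with a JSON properties column:
SELECT
  json_extract_string(properties, '$.country') as country, -- DuckDB JSON function
  COUNT(*) as event_count
FROM events
GROUP BY 1;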
Data Type Support
DuckDB has excellent Arrow integration with native support for:
- All numeric types
- Strings (UTF-8)
- Dates and timestamps
- Lists and structs
- NULL handling
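Nested types can be queried directly. A sketch against a hypothetical table with a struct column (profile) and a list column (tags):
SELECT
  profile.id as user_id, -- struct field access via dot notation
  tags[1] as first_tag,  -- DuckDB lists are 1-indexed
  len(tags) as tag_count -- list length
FROM events_nested;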
Limitations
- File-based databases support concurrent reads but only a single writer
- Very large datasets (>100GB) may require careful memory management
- Some PostgreSQL-specific features are not supported