Overview
Spice.ai supports multi-node distributed query execution through its Apache Ballista integration. This scales query processing horizontally for large datasets and compute-intensive workloads.
Architecture
Distributed query execution in Spice uses a scheduler-executor model:
- Scheduler Node: Coordinates query planning and task distribution
- Executor Nodes: Execute query tasks on partitioned data
The scheduler breaks each submitted query into tasks, distributes those tasks to executors, and aggregates the results.
┌─────────────┐
│   Client    │
└──────┬──────┘
       │ Submit query
       ▼
┌─────────────┐
│  Scheduler  │ (Query planning & coordination)
└──────┬──────┘
       │ Distribute tasks
       ├──────────────┬──────────────┐
       ▼              ▼              ▼
  ┌──────────┐   ┌──────────┐   ┌──────────┐
  │Executor 1│   │Executor 2│   │Executor 3│
  └────┬─────┘   └────┬─────┘   └────┬─────┘
       │              │              │
       └──────────────┴──────────────┘
                      │
              Aggregate results
Deployment
Scheduler Node
Deploy a scheduler node using the --role scheduler flag:
spiced \
--http 0.0.0.0:8090 \
--flight 0.0.0.0:50051 \
--metrics 0.0.0.0:9090 \
--role scheduler \
--node-bind-address 0.0.0.0:50052
Key arguments:
--role scheduler: Sets node role to scheduler
--node-bind-address: Internal gRPC address for cluster communication (default: 0.0.0.0:50052)
Executor Nodes
Deploy executor nodes that connect to the scheduler:
spiced \
--http 0.0.0.0:8090 \
--flight 0.0.0.0:50051 \
--metrics 0.0.0.0:9090 \
--role executor \
--scheduler-address https://scheduler.example.com:50052 \
--node-bind-address 0.0.0.0:50052 \
--node-advertise-address executor-1.example.com
Key arguments:
--role executor: Sets node role to executor
--scheduler-address: URL of the scheduler’s internal gRPC service
--node-advertise-address: Hostname/IP that this executor advertises to the scheduler
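Because every executor advertises its own address, the launch command differs per host only in that one value. The helper below is a minimal sketch that prints (but does not run) the command for a given executor. The function name `executor_cmd` and the `executor-<N>.example.com` naming pattern are assumptions for illustration; the flags are the ones listed above.

```shell
#!/usr/bin/env bash
# Print (without running) the spiced command for executor number N.
# Assumption: executors are reachable as executor-<N>.example.com.
executor_cmd() {
  local n="$1"
  local scheduler="${2:-https://scheduler.example.com:50052}"
  printf 'spiced --http 0.0.0.0:8090 --flight 0.0.0.0:50051 --metrics 0.0.0.0:9090 --role executor --scheduler-address %s --node-bind-address 0.0.0.0:50052 --node-advertise-address executor-%s.example.com\n' \
    "$scheduler" "$n"
}

# Dry run: show the commands for three executors, one per host.
for n in 1 2 3; do
  executor_cmd "$n"
done
```

Printing the command first makes it easy to review the advertise address before starting a node on each host.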
Security: mTLS Configuration
Cluster communication should use mutual TLS (mTLS) in production:
Generate Certificates
# Generate CA
openssl req -x509 -newkey rsa:4096 -days 365 -nodes \
-keyout ca-key.pem -out ca-cert.pem \
-subj "/CN=Spice Cluster CA"
# Generate scheduler certificate
openssl req -newkey rsa:4096 -nodes \
-keyout scheduler-key.pem -out scheduler-req.pem \
-subj "/CN=scheduler.example.com"
openssl x509 -req -in scheduler-req.pem -CA ca-cert.pem \
-CAkey ca-key.pem -CAcreateserial -out scheduler-cert.pem -days 365
# Generate executor certificate (repeat for each executor)
openssl req -newkey rsa:4096 -nodes \
-keyout executor-key.pem -out executor-req.pem \
-subj "/CN=executor-1.example.com"
openssl x509 -req -in executor-req.pem -CA ca-cert.pem \
-CAkey ca-key.pem -CAcreateserial -out executor-cert.pem -days 365
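Many TLS libraries match host names against the certificate's subjectAltName (SAN) extension and ignore the CN. Whether the TLS stack in a given Spice release enforces this is an assumption to verify. If it does, the signing step needs an `-extfile` that carries the SAN. The sketch below wraps the commands above in a hypothetical helper, `issue_cert`, which expects `ca-cert.pem` and `ca-key.pem` in the current directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Issue a CA-signed certificate whose subjectAltName matches the host name.
# Usage: issue_cert <file-prefix> <dns-name>
# Assumption: ca-cert.pem and ca-key.pem already exist in the current directory.
issue_cert() {
  local name="$1" host="$2" ext
  ext="$(mktemp)"
  printf 'subjectAltName=DNS:%s\n' "$host" > "$ext"

  # Key and signing request, as in the commands above.
  openssl req -newkey rsa:4096 -nodes \
    -keyout "${name}-key.pem" -out "${name}-req.pem" \
    -subj "/CN=${host}"

  # Sign with the cluster CA and attach the SAN extension.
  openssl x509 -req -in "${name}-req.pem" -CA ca-cert.pem \
    -CAkey ca-key.pem -CAcreateserial -out "${name}-cert.pem" -days 365 \
    -extfile "$ext"
  rm -f "$ext"

  # Confirm the new certificate chains back to the CA.
  openssl verify -CAfile ca-cert.pem "${name}-cert.pem"
}

# Example usage (not run here):
#   issue_cert scheduler scheduler.example.com
#   issue_cert executor  executor-1.example.com
```

The file prefixes in the usage comment produce the same file names that the mTLS commands in this guide expect.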
Scheduler with mTLS
spiced \
--http 0.0.0.0:8090 \
--flight 0.0.0.0:50051 \
--role scheduler \
--node-bind-address 0.0.0.0:50052 \
--node-mtls-ca-certificate-file ca-cert.pem \
--node-mtls-certificate-file scheduler-cert.pem \
--node-mtls-key-file scheduler-key.pem
Executor with mTLS
spiced \
--http 0.0.0.0:8090 \
--flight 0.0.0.0:50051 \
--role executor \
--scheduler-address https://scheduler.example.com:50052 \
--node-bind-address 0.0.0.0:50052 \
--node-advertise-address executor-1.example.com \
--node-mtls-ca-certificate-file ca-cert.pem \
--node-mtls-certificate-file executor-cert.pem \
--node-mtls-key-file executor-key.pem
Kubernetes Deployment
Deploy a distributed cluster on Kubernetes:
Scheduler Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spice-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spice-scheduler
  template:
    metadata:
      labels:
        app: spice-scheduler
    spec:
      containers:
        - name: spiceai
          image: spiceai/spiceai:latest
          command:
            - /usr/local/bin/spiced
            - --http
            - 0.0.0.0:8090
            - --flight
            - 0.0.0.0:50051
            - --metrics
            - 0.0.0.0:9090
            - --role
            - scheduler
            - --node-bind-address
            - 0.0.0.0:50052
            - --node-advertise-address
            - spice-scheduler.default.svc.cluster.local
            - --node-mtls-ca-certificate-file
            - /certs/ca-cert.pem
            - --node-mtls-certificate-file
            - /certs/scheduler-cert.pem
            - --node-mtls-key-file
            - /certs/scheduler-key.pem
          ports:
            - containerPort: 8090
              name: http
            - containerPort: 50051
              name: flight
            - containerPort: 9090
              name: metrics
            - containerPort: 50052
              name: cluster
          volumeMounts:
            - name: certs
              mountPath: /certs
              readOnly: true
            - name: spicepod
              mountPath: /app/spicepod.yaml
              subPath: spicepod.yaml
      volumes:
        - name: certs
          secret:
            secretName: cluster-certs
        - name: spicepod
          configMap:
            name: spice-config
---
apiVersion: v1
kind: Service
metadata:
  name: spice-scheduler
spec:
  selector:
    app: spice-scheduler
  ports:
    - port: 8090
      name: http
    - port: 50051
      name: flight
    - port: 9090
      name: metrics
    - port: 50052
      name: cluster
Executor StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: spice-executor
spec:
  serviceName: spice-executor
  replicas: 3
  selector:
    matchLabels:
      app: spice-executor
  template:
    metadata:
      labels:
        app: spice-executor
    spec:
      containers:
        - name: spiceai
          image: spiceai/spiceai:latest
          command:
            - /usr/local/bin/spiced
            - --http
            - 0.0.0.0:8090
            - --flight
            - 0.0.0.0:50051
            - --metrics
            - 0.0.0.0:9090
            - --role
            - executor
            - --scheduler-address
            - https://spice-scheduler.default.svc.cluster.local:50052
            - --node-bind-address
            - 0.0.0.0:50052
            # Kubernetes expands $(NODE_ADVERTISE_ADDRESS) from the env section below
            - --node-advertise-address
            - $(NODE_ADVERTISE_ADDRESS)
            - --node-mtls-ca-certificate-file
            - /certs/ca-cert.pem
            - --node-mtls-certificate-file
            - /certs/executor-cert.pem
            - --node-mtls-key-file
            - /certs/executor-key.pem
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NODE_ADVERTISE_ADDRESS
              value: "$(POD_NAME).spice-executor.default.svc.cluster.local"
          ports:
            - containerPort: 8090
              name: http
            - containerPort: 50051
              name: flight
            - containerPort: 9090
              name: metrics
            - containerPort: 50052
              name: cluster
          volumeMounts:
            - name: certs
              mountPath: /certs
              readOnly: true
            - name: spicepod
              mountPath: /app/spicepod.yaml
              subPath: spicepod.yaml
      volumes:
        - name: certs
          secret:
            secretName: cluster-certs
        - name: spicepod
          configMap:
            name: spice-config
---
apiVersion: v1
kind: Service
metadata:
  name: spice-executor
spec:
  clusterIP: None
  selector:
    app: spice-executor
  ports:
    - port: 50052
      name: cluster
Create Certificates Secret
kubectl create secret generic cluster-certs \
--from-file=ca-cert.pem=ca-cert.pem \
--from-file=scheduler-cert.pem=scheduler-cert.pem \
--from-file=scheduler-key.pem=scheduler-key.pem \
--from-file=executor-cert.pem=executor-cert.pem \
--from-file=executor-key.pem=executor-key.pem
Query Execution
Once the cluster is running, queries submitted to the scheduler are automatically distributed:
# Query via HTTP
curl -X POST http://scheduler-host:8090/v1/sql \
-H "Content-Type: application/json" \
-d '{"sql": "SELECT COUNT(*) FROM large_dataset"}'
# Query via Flight SQL
spice sql --repl
The scheduler:
- Parses and optimizes the query plan
- Partitions the plan into executable tasks
- Distributes tasks to available executors
- Aggregates results from executors
- Returns the final result to the client
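The split-then-merge flow in the steps above can be illustrated with a toy model of `SELECT COUNT(*)`. This is only a sketch of the idea, not how Ballista plans or schedules work: each argument stands in for one executor's partition, and the function name `distributed_count` is hypothetical.

```shell
#!/usr/bin/env bash
# Toy model of a distributed COUNT(*): each argument is one executor's
# partition, given as a space-separated list of rows.
distributed_count() {
  local total=0 partial partition rows
  for partition in "$@"; do
    read -r -a rows <<< "$partition"   # split the partition into rows
    partial=${#rows[@]}                # executor: count local rows
    total=$((total + partial))         # scheduler: merge partial counts
  done
  echo "$total"
}

distributed_count "r1 r2 r3" "r4 r5" "r6 r7 r8 r9"   # prints 9
```

Only the small partial counts travel back to the scheduler, which is why aggregations over large datasets tend to benefit from distribution.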
Monitoring Distributed Queries
Monitor cluster health and query execution:
# Check cluster metrics
curl "http://scheduler-host:9090/metrics?scope=cluster"
# View active executors
curl http://scheduler-host:8090/v1/status
Metrics include:
- Active executor count
- Task distribution across executors
- Query execution time per executor
- Data shuffle statistics
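To turn the metrics output into a single number, for example total tasks across executors, the text can be piped through a small filter. The sketch below assumes the endpoint returns Prometheus text format without timestamps. The metric names in the sample (`spice_executors_active`, `spice_tasks_total`) are hypothetical placeholders, not names confirmed for Spice.

```shell
#!/usr/bin/env bash
# Sum every sample of one metric from Prometheus text format on stdin.
# Assumption: samples carry no trailing timestamp, so the value is the last field.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)         # drop the {label="..."} part
      if (metric == name) { sum += $NF; found = 1 }
    }
    END { if (found) print sum; else exit 1 }
  '
}

# Sample input with hypothetical metric names; prints 24.
metric_sum spice_tasks_total <<'EOF'
# HELP spice_tasks_total Hypothetical per-executor task counter
spice_executors_active 3
spice_tasks_total{executor="executor-1"} 10
spice_tasks_total{executor="executor-2"} 14
EOF
```

Against a live cluster, the same function could read from the metrics command above by piping its output into `metric_sum` with a real metric name. It exits non-zero when the metric is absent, which is convenient in alerting scripts.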
Data Partitioning
For best distributed query performance, partition datasets on columns that queries commonly filter by, such as year and month:
datasets:
  - from: s3://large-bucket/data/
    name: large_dataset
    acceleration:
      enabled: true
      engine: arrow
    params:
      file_format: parquet
      # Ensure data is pre-partitioned in S3
      partition_cols:
        - year
        - month
Executors can read partitioned data in parallel, improving query performance.
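As an illustration of what pre-partitioned data might look like, the sketch below builds an object prefix for one partition. It assumes a Hive-style `key=value` directory layout, which this guide does not specify, and the helper name `partition_prefix` is hypothetical.

```shell
#!/usr/bin/env bash
# Build the object prefix for one (year, month) partition.
# Assumption: Hive-style key=value directories under the dataset path.
partition_prefix() {
  local base="$1" year="$2" month="$3"
  # 10# forces base 10 so that inputs such as 08 are not read as octal.
  printf '%s/year=%04d/month=%02d/\n' "${base%/}" "$((10#$year))" "$((10#$month))"
}

partition_prefix s3://large-bucket/data/ 2024 3
# prints s3://large-bucket/data/year=2024/month=03/
```

With a layout like this, each prefix holds an independent slice of the data, so different executors can scan different prefixes at the same time.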
Best Practices
- Use mTLS in production: Always secure cluster communication with mutual TLS
- Scale executors horizontally: Add more executors for increased query throughput
- Partition large datasets: Pre-partition data for parallel processing
- Monitor resource usage: Track CPU, memory, and network metrics per executor
- Co-locate with data: Deploy executors close to data sources to minimize network latency
- Use persistent storage: Mount volumes for file-based accelerators (DuckDB, Cayenne)
Limitations
- Single scheduler (high availability requires external orchestration)
- Executors must have network connectivity to scheduler and each other
- Data shuffle requires inter-executor communication
- Not all queries benefit from distribution (for example, OLTP-style workloads on small datasets)
Development Mode
For testing without mTLS:
# Scheduler (insecure - development only)
spiced --role scheduler --allow-insecure-connections
# Executor (insecure - development only)
spiced --role executor \
--scheduler-address http://localhost:50052 \
--allow-insecure-connections
WARNING: --allow-insecure-connections disables authentication. Never use it in production.
Next Steps