Overview
Spice.ai provides comprehensive observability through Prometheus metrics, health endpoints, and OpenTelemetry integration. Monitor runtime performance, query execution, data acceleration, and cluster health.
Health Endpoints
Liveness Probe: /health
Returns "ok" when the runtime process is alive:

```bash
curl http://localhost:8090/health
# ok
```
- Use for: Kubernetes liveness probes, load balancer health checks
- Always returns 200 OK when the process is running
- Fast response (< 1ms typical)
Readiness Probe: /v1/ready
Returns ready status when all datasets are loaded and the runtime is ready to serve queries:
```bash
curl http://localhost:8090/v1/ready
```
Response when ready:

```json
{
  "ready": true
}
```

Response when not ready:

```json
{
  "ready": false,
  "pending_components": ["dataset:large_table"]
}
```
- Use for: Kubernetes readiness probes, ensuring traffic only reaches ready instances
- Returns 503 Service Unavailable when datasets are still loading
- Returns 200 OK when all datasets are loaded
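The readiness gate is also useful outside Kubernetes, e.g. in a deployment script that waits for the runtime before routing traffic. A minimal sketch using only the standard library (host, port, and timeout values are illustrative):

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(base_url: str, timeout_s: float = 120.0, poll_s: float = 2.0) -> bool:
    """Poll /v1/ready until it returns 200 OK or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/ready") as resp:
                if resp.status == 200:
                    return True
        except urllib.error.URLError:
            pass  # 503 while datasets load, or the runtime is not up yet
        time.sleep(poll_s)
    return False

# Example: wait_until_ready("http://localhost:8090") before routing traffic
```

`HTTPError` (raised for 503 responses) is a subclass of `URLError`, so a single except clause covers both "not yet ready" and "not yet listening".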
Kubernetes Configuration
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spiceai
spec:
  containers:
    - name: spiceai
      image: spiceai/spiceai:latest
      ports:
        - containerPort: 8090
          name: http
      livenessProbe:
        httpGet:
          path: /health
          port: 8090
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 3
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /v1/ready
          port: 8090
        initialDelaySeconds: 5
        periodSeconds: 5
        timeoutSeconds: 3
        successThreshold: 1
        failureThreshold: 3
      startupProbe:
        httpGet:
          path: /health
          port: 8090
        initialDelaySeconds: 0
        periodSeconds: 10
        timeoutSeconds: 3
        failureThreshold: 30
```
Prometheus Metrics
Metrics Endpoint
Expose Prometheus metrics on port 9090:

```bash
spiced --metrics 0.0.0.0:9090
```

Scrape metrics:

```bash
curl http://localhost:9090/metrics
```
Key Runtime Metrics
Query Execution
```promql
# Query latency histogram
spice_runtime_query_duration_seconds{status="success"}

# Active queries
spice_runtime_active_queries

# Query throughput
rate(spice_runtime_query_total[5m])

# Query errors
rate(spice_runtime_query_errors_total[5m])
```
Dataset Acceleration
```promql
# Acceleration refresh duration
dataset_refresh_duration_seconds{dataset="my_dataset"}

# Rows loaded per dataset
dataset_rows_loaded{dataset="my_dataset"}

# Memory used by accelerated datasets
dataset_memory_bytes{dataset="my_dataset"}

# Refresh errors
rate(dataset_refresh_errors_total{dataset="my_dataset"}[5m])
```
Caching
```promql
# Cache hit rate
rate(spice_cache_hits_total[5m]) / rate(spice_cache_requests_total[5m])

# Cache size
spice_cache_size_bytes{cache_type="sql_results"}

# Cache evictions
rate(spice_cache_evictions_total[5m])
```
Resource Usage
```promql
# Memory usage
process_resident_memory_bytes

# CPU usage
rate(process_cpu_seconds_total[5m])

# Open file descriptors
process_open_fds

# Thread count
process_threads
```
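The metrics above are served in the Prometheus text exposition format, so they can be consumed ad hoc without a full monitoring stack. The sketch below (a simplified parser that keeps one sample per metric name and ignores labels; in production use a Prometheus client library instead) scrapes the endpoint and derives the cache hit rate from the counters listed above:

```python
import re
import urllib.request

# One sample line: metric name, optional {labels}, then a numeric value.
METRIC_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+([0-9.eE+-]+)')

def scrape(url: str) -> dict:
    """Parse the Prometheus text format into {metric_name: value}.

    Simplified: keeps the first sample per metric name and drops labels."""
    samples = {}
    with urllib.request.urlopen(url) as resp:
        for raw in resp.read().decode().splitlines():
            if raw.startswith("#"):  # skip HELP/TYPE comments
                continue
            m = METRIC_LINE.match(raw)
            if m and m.group(1) not in samples:
                samples[m.group(1)] = float(m.group(3))
    return samples

def cache_hit_rate(samples: dict) -> float:
    """Hit rate from the cache counters (0.0 when no requests observed)."""
    hits = samples.get("spice_cache_hits_total", 0.0)
    requests = samples.get("spice_cache_requests_total", 0.0)
    return hits / requests if requests else 0.0

# Example: cache_hit_rate(scrape("http://localhost:9090/metrics"))
```

Note this computes a lifetime ratio from raw counters; the PromQL `rate(...)` form shown above gives the hit rate over a sliding window instead.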
Cluster Metrics (Distributed Mode)
When running in cluster mode, access cluster-wide metrics:
```bash
curl http://scheduler:9090/metrics?scope=cluster
```

Cluster-specific metrics:

```promql
# Active executors
cluster_active_executors

# Tasks distributed per executor
cluster_executor_tasks_total{executor_id="executor-1"}

# Executor resource utilization
cluster_executor_cpu_utilization{executor_id="executor-1"}
cluster_executor_memory_bytes{executor_id="executor-1"}

# Inter-executor data shuffle
cluster_shuffle_bytes_total
```
Prometheus Configuration
Scrape configuration for prometheus.yml:
```yaml
scrape_configs:
  - job_name: 'spiceai'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
    scrape_timeout: 10s
```
Kubernetes service discovery:
```yaml
scrape_configs:
  - job_name: 'spiceai-k8s'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: spiceai
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        action: keep
        regex: metrics
```
Prometheus Operator Integration
For Kubernetes deployments using Prometheus Operator:
PodMonitor Resource
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: spiceai-podmonitor
  labels:
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: spiceai
  podMetricsEndpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
Enable in Helm chart:
```yaml
monitoring:
  podMonitor:
    enabled: true
    additionalLabels:
      prometheus: kube-prometheus
```
Grafana Dashboards
Import Dashboard
Use the official Spice.ai Grafana dashboard:
- Download dashboard JSON from GitHub
- Import in Grafana: Dashboards → Import → paste JSON
- Select your Prometheus datasource
Key Panels
- Query Performance: Latency percentiles (p50, p95, p99), throughput
- Dataset Status: Refresh times, row counts, memory usage
- Cache Efficiency: Hit rates, eviction rates, size trends
- Resource Usage: CPU, memory, network I/O
- Error Rates: Query errors, refresh failures, connection errors
Sample PromQL Queries
Query latency 95th percentile:

```promql
histogram_quantile(0.95,
  rate(spice_runtime_query_duration_seconds_bucket[5m])
)
```

Dataset refresh success rate:

```promql
sum(rate(dataset_refresh_total{status="success"}[5m]))
/
sum(rate(dataset_refresh_total[5m]))
```

Memory usage per dataset:

```promql
topk(10, dataset_memory_bytes)
```
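It helps to know how `histogram_quantile` arrives at its estimate: it finds the first cumulative bucket whose count reaches the target rank and interpolates linearly inside it. A rough sketch of that calculation (bucket bounds and counts here are made up, not real Spice.ai data):

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """Approximate Prometheus' histogram_quantile over cumulative buckets.

    `buckets` is a sorted list of (upper_bound, cumulative_count) pairs,
    with the last bound being +Inf, as in a Prometheus histogram."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Prometheus falls back to the highest finite bucket bound
                return prev_bound
            # linear interpolation between the bucket's lower and upper bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical latency buckets: 90 of 100 queries under 1s, all under 5s.
buckets = [(0.1, 40.0), (0.5, 70.0), (1.0, 90.0), (5.0, 100.0), (float("inf"), 100.0)]
p95 = histogram_quantile(0.95, buckets)  # falls in the (1.0, 5.0] bucket
```

One practical consequence of this interpolation: the reported quantile is only as precise as the bucket boundaries, so latencies near an SLO threshold need a bucket bound at that threshold.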
OpenTelemetry Integration
OTLP Metrics Export
Export runtime metrics to OpenTelemetry collectors:
```yaml
runtime:
  otel_exporter:
    endpoint: http://otel-collector:4317
    push_interval: 60s
    metrics:
      - spice_runtime_*
      - dataset_*
      - spice_cache_*
```
Supported endpoints:
- gRPC: http://host:4317 or https://host:4317
- HTTP: http://host:4318/v1/metrics
OpenTelemetry Collector Configuration
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, otlp]
```
Task History Export
Export query execution traces to OpenTelemetry:
```yaml
runtime:
  task_history:
    enabled: true
    otel_endpoint: http://otel-collector:4317
```
Traces include:
- Query parsing and planning time
- Execution time per stage
- Data source fetch latency
- Acceleration lookup time
Alerting
Prometheus Alerting Rules
```yaml
groups:
  - name: spiceai
    interval: 30s
    rules:
      - alert: SpiceAIHighQueryLatency
        expr: |
          histogram_quantile(0.95,
            rate(spice_runtime_query_duration_seconds_bucket[5m])
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High query latency detected"
          description: "95th percentile query latency is {{ $value }}s"

      - alert: SpiceAIDatasetRefreshFailing
        expr: |
          rate(dataset_refresh_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Dataset refresh failures"
          description: "Dataset {{ $labels.dataset }} is failing to refresh"

      - alert: SpiceAILowCacheHitRate
        expr: |
          rate(spice_cache_hits_total[5m])
          /
          rate(spice_cache_requests_total[5m]) < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }}"

      - alert: SpiceAIHighMemoryUsage
        expr: |
          process_resident_memory_bytes
          /
          node_memory_MemTotal_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value | humanizePercentage }} of available"

      - alert: SpiceAIClusterExecutorDown
        expr: |
          cluster_active_executors < 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cluster executor unavailable"
          description: "Only {{ $value }} executor(s) active in cluster"
```
Logging
Log Levels
Control log verbosity:
```bash
# Environment variable
export RUST_LOG=info

# Detailed component logging
export RUST_LOG=runtime=debug,datafusion=info,spice=trace
```
Levels: error, warn, info, debug, trace
Structured Logging
JSON-formatted logs for ingestion:
```bash
export RUST_LOG_FORMAT=json
spiced
```

Output:

```json
{"timestamp":"2025-03-03T10:15:30.123Z","level":"info","message":"Dataset loaded","dataset":"orders","rows":100000}
```
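Because each record is one JSON object per line, the logs are easy to post-process without an aggregator. A small sketch (the sample lines below are illustrative, shaped like the output above) that filters a stream down to error-level records:

```python
import json

def errors_only(lines):
    """Yield parsed records whose level is 'error', skipping malformed lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON lines mixed into the stream
        if record.get("level") == "error":
            yield record

# Illustrative records shaped like the JSON log output shown above.
logs = [
    '{"timestamp":"2025-03-03T10:15:30.123Z","level":"info","message":"Dataset loaded"}',
    '{"timestamp":"2025-03-03T10:15:31.001Z","level":"error","message":"Refresh failed","dataset":"orders"}',
]
failures = list(errors_only(logs))  # keeps only the error record
```

The same generator works unchanged over a file handle (`errors_only(open("/var/log/spiced.log"))`), since it only needs an iterable of lines.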
Log Aggregation
Ship logs to centralized logging:
```
# Fluentd configuration
<source>
  @type tail
  path /var/log/spiced.log
  pos_file /var/log/spiced.log.pos
  tag spiceai
  <parse>
    @type json
    time_key timestamp
    time_format %Y-%m-%dT%H:%M:%S.%LZ
  </parse>
</source>

<match spiceai>
  @type elasticsearch
  host elasticsearch.default.svc.cluster.local
  port 9200
  logstash_format true
  logstash_prefix spiceai
</match>
```
Status API
Query runtime status programmatically:
```bash
curl http://localhost:8090/v1/status
```

Response:

```json
{
  "version": "1.11.0",
  "datasets": [
    {
      "name": "orders",
      "status": "ready",
      "rows": 1000000,
      "last_refresh": "2025-03-03T10:00:00Z"
    }
  ],
  "cluster": {
    "role": "executor",
    "scheduler": "https://scheduler:50052",
    "executors": 3
  }
}
```
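A monitoring script can consume this endpoint directly, for example to report per-dataset readiness. A minimal sketch (function names are illustrative; the response shape follows the sample above):

```python
import json
import urllib.request

def summarize(status: dict) -> dict:
    """Map each dataset name to its status string from a /v1/status payload."""
    return {d["name"]: d["status"] for d in status.get("datasets", [])}

def dataset_statuses(base_url: str) -> dict:
    """Fetch /v1/status and summarize dataset readiness."""
    with urllib.request.urlopen(f"{base_url}/v1/status") as resp:
        return summarize(json.load(resp))

# Example: dataset_statuses("http://localhost:8090") -> {"orders": "ready", ...}
```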
Best Practices
- Monitor query latency: Track p50, p95, p99 latencies
- Watch cache hit rates: Optimize cache sizing and TTLs
- Track dataset refresh times: Ensure refresh completes within intervals
- Monitor memory usage: Prevent OOM with proper limits
- Set up alerts: Proactive detection of issues
- Use distributed tracing: Track query execution across cluster
- Aggregate logs centrally: Simplify troubleshooting
- Benchmark regularly: Detect performance regressions
Next Steps