Overview

Spice.ai provides comprehensive observability through Prometheus metrics, health endpoints, and OpenTelemetry integration. Monitor runtime performance, query execution, data acceleration, and cluster health.

Health Endpoints

Liveness Probe: /health

Returns "ok" when the runtime process is alive:
curl http://localhost:8090/health
# ok
  • Use for: Kubernetes liveness probes, load balancer health checks
  • Always returns 200 OK when the process is running
  • Fast response (< 1ms typical)

Readiness Probe: /v1/ready

Returns ready status when all datasets are loaded and the runtime is ready to serve queries:
curl http://localhost:8090/v1/ready
Response when ready:
{
  "ready": true
}
Response when not ready:
{
  "ready": false,
  "pending_components": ["dataset:large_table"]
}
  • Use for: Kubernetes readiness probes, ensuring traffic only reaches ready instances
  • Returns 503 Service Unavailable when datasets are still loading
  • Returns 200 OK when all datasets are loaded
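
A deploy script can block on readiness with a simple poll loop. A minimal shell sketch, assuming the default host and port from the examples above:
#!/usr/bin/env bash
# Poll /v1/ready until the runtime reports ready (HTTP 200) or the timeout expires;
# curl -f exits non-zero on the 503 returned while datasets are still loading
for _ in $(seq 1 60); do
  if curl -sf http://localhost:8090/v1/ready > /dev/null; then
    echo "runtime ready"
    exit 0
  fi
  sleep 2
done
echo "timed out waiting for readiness" >&2
exit 1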

Kubernetes Configuration

apiVersion: v1
kind: Pod
metadata:
  name: spiceai
spec:
  containers:
    - name: spiceai
      image: spiceai/spiceai:latest
      ports:
        - containerPort: 8090
          name: http
      livenessProbe:
        httpGet:
          path: /health
          port: 8090
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 3
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /v1/ready
          port: 8090
        initialDelaySeconds: 5
        periodSeconds: 5
        timeoutSeconds: 3
        successThreshold: 1
        failureThreshold: 3
      startupProbe:
        httpGet:
          path: /health
          port: 8090
        initialDelaySeconds: 0
        periodSeconds: 10
        timeoutSeconds: 3
        failureThreshold: 30

Prometheus Metrics

Metrics Endpoint

Expose Prometheus metrics on port 9090:
spiced --metrics 0.0.0.0:9090
Scrape metrics:
curl http://localhost:9090/metrics
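
To spot-check a single metric family from the shell, grep the scrape output (a quick sketch; the prefix is one of the metric names listed below):
# Show current query metric samples; # HELP and # TYPE comment lines are excluded by the anchor
curl -s http://localhost:9090/metrics | grep '^spice_runtime_query'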

Key Runtime Metrics

Query Execution

# Query latency histogram
spice_runtime_query_duration_seconds{status="success"}

# Active queries
spice_runtime_active_queries

# Query throughput
rate(spice_runtime_query_total[5m])

# Query errors
rate(spice_runtime_query_errors_total[5m])

Dataset Acceleration

# Acceleration refresh duration
dataset_refresh_duration_seconds{dataset="my_dataset"}

# Rows loaded per dataset
dataset_rows_loaded{dataset="my_dataset"}

# Memory used by accelerated datasets
dataset_memory_bytes{dataset="my_dataset"}

# Refresh errors
rate(dataset_refresh_errors_total{dataset="my_dataset"}[5m])

Cache Performance

# Cache hit rate
rate(spice_cache_hits_total[5m]) / rate(spice_cache_requests_total[5m])

# Cache size
spice_cache_size_bytes{cache_type="sql_results"}

# Cache evictions
rate(spice_cache_evictions_total[5m])

Resource Usage

# Memory usage
process_resident_memory_bytes

# CPU usage
rate(process_cpu_seconds_total[5m])

# Open file descriptors
process_open_fds

# Thread count
process_threads

Cluster Metrics (Distributed Mode)

When running in cluster mode, access cluster-wide metrics:
curl 'http://scheduler:9090/metrics?scope=cluster'
Cluster-specific metrics:
# Active executors
cluster_active_executors

# Tasks distributed per executor
cluster_executor_tasks_total{executor_id="executor-1"}

# Executor resource utilization
cluster_executor_cpu_utilization{executor_id="executor-1"}
cluster_executor_memory_bytes{executor_id="executor-1"}

# Inter-executor data shuffle
cluster_shuffle_bytes_total

Prometheus Configuration

Scrape configuration for prometheus.yml:
scrape_configs:
  - job_name: 'spiceai'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
    scrape_timeout: 10s
Kubernetes service discovery:
scrape_configs:
  - job_name: 'spiceai-k8s'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: spiceai
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        action: keep
        regex: metrics
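
Validate either configuration before reloading Prometheus (assumes promtool from the standard Prometheus distribution):
# Lint the scrape configuration for syntax and reference errors
promtool check config prometheus.yml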

Prometheus Operator Integration

For Kubernetes deployments using Prometheus Operator:

PodMonitor Resource

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: spiceai-podmonitor
  labels:
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: spiceai
  podMetricsEndpoints:
    - port: metrics
      interval: 30s
      path: /metrics
Enable in Helm chart:
monitoring:
  podMonitor:
    enabled: true
    additionalLabels:
      prometheus: kube-prometheus

Grafana Dashboards

Import Dashboard

Use the official Spice.ai Grafana dashboard:
  1. Download dashboard JSON from GitHub
  2. Import in Grafana: Dashboards → Import → paste JSON
  3. Select your Prometheus datasource
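
The import can also be scripted against the Grafana HTTP API. A hedged sketch, assuming a hypothetical spiceai-dashboard.json download, a Grafana instance at grafana:3000, and a service-account token in GRAFANA_TOKEN; /api/dashboards/db expects the dashboard wrapped in an import envelope:
# Wrap the dashboard JSON in the import envelope and POST it
jq '{dashboard: ., overwrite: true}' spiceai-dashboard.json \
  | curl -s -X POST http://grafana:3000/api/dashboards/db \
      -H "Authorization: Bearer $GRAFANA_TOKEN" \
      -H "Content-Type: application/json" \
      -d @-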

Key Panels

  • Query Performance: Latency percentiles (p50, p95, p99), throughput
  • Dataset Status: Refresh times, row counts, memory usage
  • Cache Efficiency: Hit rates, eviction rates, size trends
  • Resource Usage: CPU, memory, network I/O
  • Error Rates: Query errors, refresh failures, connection errors

Sample PromQL Queries

Query latency 95th percentile:
histogram_quantile(0.95, 
  rate(spice_runtime_query_duration_seconds_bucket[5m])
)
Dataset refresh success rate:
sum(rate(dataset_refresh_total{status="success"}[5m])) 
/ 
sum(rate(dataset_refresh_total[5m]))
Memory usage per dataset:
topk(10, dataset_memory_bytes)
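
These queries can also be evaluated ad hoc against the Prometheus HTTP API (assuming a Prometheus server reachable at prometheus:9090):
# Evaluate the p95 latency query without opening Grafana
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(spice_runtime_query_duration_seconds_bucket[5m]))'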

OpenTelemetry Integration

OTLP Metrics Export

Export runtime metrics to OpenTelemetry collectors:
runtime:
  otel_exporter:
    endpoint: http://otel-collector:4317
    push_interval: 60s
    metrics:
      - spice_runtime_*
      - dataset_*
      - spice_cache_*
Supported endpoints:
  • gRPC: http://host:4317 or https://host:4317
  • HTTP: http://host:4318/v1/metrics
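
A quick reachability check for the HTTP endpoint (a sketch; an empty export request is valid OTLP, so a healthy collector answers 200):
# Expect HTTP 200 from a reachable collector
curl -si -X POST http://otel-collector:4318/v1/metrics \
  -H 'Content-Type: application/json' -d '{}'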

OpenTelemetry Collector Configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, otlp]

Task History Export

Export query execution traces to OpenTelemetry:
runtime:
  task_history:
    enabled: true
    otel_endpoint: http://otel-collector:4317
Traces include:
  • Query parsing and planning time
  • Execution time per stage
  • Data source fetch latency
  • Acceleration lookup time

Alerting

Prometheus Alerting Rules

groups:
  - name: spiceai
    interval: 30s
    rules:
      - alert: SpiceAIHighQueryLatency
        expr: |
          histogram_quantile(0.95, 
            rate(spice_runtime_query_duration_seconds_bucket[5m])
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High query latency detected"
          description: "95th percentile query latency is {{ $value }}s"
      
      - alert: SpiceAIDatasetRefreshFailing
        expr: |
          rate(dataset_refresh_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Dataset refresh failures"
          description: "Dataset {{ $labels.dataset }} is failing to refresh"
      
      - alert: SpiceAILowCacheHitRate
        expr: |
          rate(spice_cache_hits_total[5m]) 
          / 
          rate(spice_cache_requests_total[5m]) < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }}"
      
      - alert: SpiceAIHighMemoryUsage
        expr: |
          process_resident_memory_bytes 
          / 
          node_memory_MemTotal_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value | humanizePercentage }} of available"
      
      - alert: SpiceAIClusterExecutorDown
        expr: |
          cluster_active_executors < 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cluster executor unavailable"
          description: "Only {{ $value }} executor(s) active in cluster"

Logging

Log Levels

Control log verbosity:
# Environment variable
export RUST_LOG=info

# Detailed component logging
export RUST_LOG=runtime=debug,datafusion=info,spice=trace
Levels: error, warn, info, debug, trace

Structured Logging

JSON-formatted logs for ingestion:
export RUST_LOG_FORMAT=json

spiced
Output:
{"timestamp":"2025-03-03T10:15:30.123Z","level":"info","message":"Dataset loaded","dataset":"orders","rows":100000}

Log Aggregation

Ship logs to a centralized logging backend:
# Fluentd configuration
<source>
  @type tail
  path /var/log/spiced.log
  pos_file /var/log/spiced.log.pos
  tag spiceai
  <parse>
    @type json
    time_key timestamp
    time_format %Y-%m-%dT%H:%M:%S.%LZ
  </parse>
</source>

<match spiceai>
  @type elasticsearch
  host elasticsearch.default.svc.cluster.local
  port 9200
  logstash_format true
  logstash_prefix spiceai
</match>
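
Fluentd can verify this configuration without starting the pipeline (a sketch; the config path is an assumption):
# Parse and validate the configuration, then exit
fluentd --dry-run -c /etc/fluent/fluent.conf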

Status API

Query runtime status programmatically:
curl http://localhost:8090/v1/status
Response:
{
  "version": "1.11.0",
  "datasets": [
    {
      "name": "orders",
      "status": "ready",
      "rows": 1000000,
      "last_refresh": "2025-03-03T10:00:00Z"
    }
  ],
  "cluster": {
    "role": "executor",
    "scheduler": "https://scheduler:50052",
    "executors": 3
  }
}
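
The response is convenient for scripted gates. A sketch using jq, built on the response shape above:
# Exit non-zero if any dataset is not ready
curl -s http://localhost:8090/v1/status \
  | jq -e '[.datasets[] | select(.status != "ready")] | length == 0' > /dev/null \
  && echo "all datasets ready"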

Performance Monitoring Best Practices

  1. Monitor query latency: Track p50, p95, p99 latencies
  2. Watch cache hit rates: Optimize cache sizing and TTLs
  3. Track dataset refresh times: Ensure refresh completes within intervals
  4. Monitor memory usage: Prevent OOM with proper limits
  5. Set up alerts: Proactive detection of issues
  6. Use distributed tracing: Track query execution across cluster
  7. Aggregate logs centrally: Simplify troubleshooting
  8. Benchmark regularly: Detect performance regressions (see the canary sketch below)
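
For item 8, a minimal canary benchmark, assuming the HTTP SQL endpoint at /v1/sql, GNU date, and a representative query and latency budget (both hypothetical):
# Time one canary query and compare against a latency budget
start=$(date +%s%N)
curl -s -X POST http://localhost:8090/v1/sql -d 'SELECT COUNT(*) FROM orders' > /dev/null
elapsed_ms=$(( ( $(date +%s%N) - start ) / 1000000 ))
echo "canary query took ${elapsed_ms} ms"
[ "$elapsed_ms" -lt 500 ] || echo "WARN: above 500 ms budget" >&2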

Next Steps