Distributed Query Execution

Overview

Spice.ai supports multi-node distributed query execution through its Apache Ballista integration. This lets query processing scale horizontally across nodes for large datasets and compute-intensive workloads.

Architecture

Distributed query execution in Spice uses a scheduler-executor model:

Scheduler Node: Coordinates query planning and task distribution
Executor Nodes: Execute query tasks on partitioned data

The scheduler breaks each submitted query into tasks, distributes the tasks to executors, and aggregates the results before returning them to the client.

┌─────────────┐
│   Client    │
└──────┬──────┘
       │ Submit query
       ▼
┌─────────────┐
│  Scheduler  │  (Query planning & coordination)
└──────┬──────┘
       │ Distribute tasks
       ├──────────────┬──────────────┐
       ▼              ▼              ▼
┌──────────┐   ┌──────────┐   ┌──────────┐
│Executor 1│   │Executor 2│   │Executor 3│
└────┬─────┘   └────┬─────┘   └────┬─────┘
     │              │              │
     └──────────────┴──────────────┘
                    │
             Aggregate results

Deployment

Scheduler Node

Deploy a scheduler node using the --role scheduler flag:

spiced \
  --http 0.0.0.0:8090 \
  --flight 0.0.0.0:50051 \
  --metrics 0.0.0.0:9090 \
  --role scheduler \
  --node-bind-address 0.0.0.0:50052

Key arguments:

--role scheduler: Sets node role to scheduler
--node-bind-address: Internal gRPC address for cluster communication (default: 0.0.0.0:50052)

Executor Nodes

Deploy executor nodes that connect to the scheduler:

spiced \
  --http 0.0.0.0:8090 \
  --flight 0.0.0.0:50051 \
  --metrics 0.0.0.0:9090 \
  --role executor \
  --scheduler-address https://scheduler.example.com:50052 \
  --node-bind-address 0.0.0.0:50052 \
  --node-advertise-address executor-1.example.com

Key arguments:

--role executor: Sets node role to executor
--scheduler-address: URL of the scheduler’s internal gRPC service
--node-advertise-address: Hostname/IP that this executor advertises to the scheduler

Security: mTLS Configuration

In production, secure cluster communication with mutual TLS (mTLS).

Generate Certificates

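# Generate CA
openssl req -x509 -newkey rsa:4096 -days 365 -nodes \
  -keyout ca-key.pem -out ca-cert.pem \
  -subj "/CN=Spice Cluster CA"

# Generate scheduler key and signing request
openssl req -newkey rsa:4096 -nodes \
  -keyout scheduler-key.pem -out scheduler-req.pem \
  -subj "/CN=scheduler.example.com"

# Sign the scheduler certificate. Many TLS libraries match hostnames against
# subjectAltName rather than CN, so the SAN is added through an extensions file.
printf 'subjectAltName=DNS:scheduler.example.com\n' > scheduler-ext.cnf
openssl x509 -req -in scheduler-req.pem -CA ca-cert.pem \
  -CAkey ca-key.pem -CAcreateserial -out scheduler-cert.pem -days 365 \
  -extfile scheduler-ext.cnf

# Generate executor key and signing request (repeat for each executor)
openssl req -newkey rsa:4096 -nodes \
  -keyout executor-key.pem -out executor-req.pem \
  -subj "/CN=executor-1.example.com"

# Sign the executor certificate with a SAN matching its advertised hostname
printf 'subjectAltName=DNS:executor-1.example.com\n' > executor-ext.cnf
openssl x509 -req -in executor-req.pem -CA ca-cert.pem \
  -CAkey ca-key.pem -CAcreateserial -out executor-cert.pem -days 365 \
  -extfile executor-ext.cnf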

Scheduler with mTLS

spiced \
  --http 0.0.0.0:8090 \
  --flight 0.0.0.0:50051 \
  --role scheduler \
  --node-bind-address 0.0.0.0:50052 \
  --node-mtls-ca-certificate-file ca-cert.pem \
  --node-mtls-certificate-file scheduler-cert.pem \
  --node-mtls-key-file scheduler-key.pem

Executor with mTLS

spiced \
  --http 0.0.0.0:8090 \
  --flight 0.0.0.0:50051 \
  --role executor \
  --scheduler-address https://scheduler.example.com:50052 \
  --node-bind-address 0.0.0.0:50052 \
  --node-advertise-address executor-1.example.com \
  --node-mtls-ca-certificate-file ca-cert.pem \
  --node-mtls-certificate-file executor-cert.pem \
  --node-mtls-key-file executor-key.pem

Kubernetes Deployment

Deploy a distributed cluster on Kubernetes with a scheduler Deployment, an executor StatefulSet, and a shared Secret holding the certificates.

Scheduler Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spice-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spice-scheduler
  template:
    metadata:
      labels:
        app: spice-scheduler
    spec:
      containers:
        - name: spiceai
          image: spiceai/spiceai:latest
          command:
            - /usr/local/bin/spiced
            - --http
            - 0.0.0.0:8090
            - --flight
            - 0.0.0.0:50051
            - --metrics
            - 0.0.0.0:9090
            - --role
            - scheduler
            - --node-bind-address
            - 0.0.0.0:50052
            - --node-advertise-address
            - spice-scheduler.default.svc.cluster.local
            - --node-mtls-ca-certificate-file
            - /certs/ca-cert.pem
            - --node-mtls-certificate-file
            - /certs/scheduler-cert.pem
            - --node-mtls-key-file
            - /certs/scheduler-key.pem
          ports:
            - containerPort: 8090
              name: http
            - containerPort: 50051
              name: flight
            - containerPort: 9090
              name: metrics
            - containerPort: 50052
              name: cluster
          volumeMounts:
            - name: certs
              mountPath: /certs
              readOnly: true
            - name: spicepod
              mountPath: /app/spicepod.yaml
              subPath: spicepod.yaml
      volumes:
        - name: certs
          secret:
            secretName: cluster-certs
        - name: spicepod
          configMap:
            name: spice-config
---
apiVersion: v1
kind: Service
metadata:
  name: spice-scheduler
spec:
  selector:
    app: spice-scheduler
  ports:
    - port: 8090
      name: http
    - port: 50051
      name: flight
    - port: 9090
      name: metrics
    - port: 50052
      name: cluster

Executor StatefulSet

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: spice-executor
spec:
  serviceName: spice-executor
  replicas: 3
  selector:
    matchLabels:
      app: spice-executor
  template:
    metadata:
      labels:
        app: spice-executor
    spec:
      containers:
        - name: spiceai
          image: spiceai/spiceai:latest
          command:
            - /usr/local/bin/spiced
            - --http
            - 0.0.0.0:8090
            - --flight
            - 0.0.0.0:50051
            - --metrics
            - 0.0.0.0:9090
            - --role
            - executor
            - --scheduler-address
            - https://spice-scheduler.default.svc.cluster.local:50052
            - --node-bind-address
            - 0.0.0.0:50052
            - --node-mtls-ca-certificate-file
            - /certs/ca-cert.pem
            - --node-mtls-certificate-file
            - /certs/executor-cert.pem
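            - --node-mtls-key-file
            - /certs/executor-key.pem
            # Kubernetes expands $(NODE_ADVERTISE_ADDRESS) from the env section below
            - --node-advertise-address
            - $(NODE_ADVERTISE_ADDRESS)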
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NODE_ADVERTISE_ADDRESS
              value: "$(POD_NAME).spice-executor.default.svc.cluster.local"
          ports:
            - containerPort: 8090
              name: http
            - containerPort: 50051
              name: flight
            - containerPort: 9090
              name: metrics
            - containerPort: 50052
              name: cluster
          volumeMounts:
            - name: certs
              mountPath: /certs
              readOnly: true
            - name: spicepod
              mountPath: /app/spicepod.yaml
              subPath: spicepod.yaml
      volumes:
        - name: certs
          secret:
            secretName: cluster-certs
        - name: spicepod
          configMap:
            name: spice-config
---
apiVersion: v1
kind: Service
metadata:
  name: spice-executor
spec:
  clusterIP: None
  selector:
    app: spice-executor
  ports:
    - port: 50052
      name: cluster
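The `NODE_ADVERTISE_ADDRESS` value in the StatefulSet above relies on the stable per-pod DNS names that Kubernetes gives to StatefulSet pods behind a headless Service (`clusterIP: None`). As a rough illustration, the Python sketch below lists the addresses the three replicas would advertise. The helper name is hypothetical, and the `default` namespace and `cluster.local` domain are taken from the manifest rather than detected from a cluster.

```python
def executor_addresses(statefulset: str, service: str, replicas: int,
                       namespace: str = "default",
                       cluster_domain: str = "cluster.local") -> list[str]:
    """Return the per-pod DNS names Kubernetes assigns to StatefulSet pods.

    Pods are named <statefulset>-<ordinal>, and a headless Service gives each
    one the record <pod>.<service>.<namespace>.svc.<cluster-domain>.
    """
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


if __name__ == "__main__":
    # One line per executor replica, starting at ordinal 0.
    for address in executor_addresses("spice-executor", "spice-executor", 3):
        print(address)
```

If executors verify each other's certificates by hostname, these are the names the executor certificates would need to cover.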

Create Certificates Secret

kubectl create secret generic cluster-certs \
  --from-file=ca-cert.pem=ca-cert.pem \
  --from-file=scheduler-cert.pem=scheduler-cert.pem \
  --from-file=scheduler-key.pem=scheduler-key.pem \
  --from-file=executor-cert.pem=executor-cert.pem \
  --from-file=executor-key.pem=executor-key.pem

Query Execution

Once the cluster is running, queries submitted to the scheduler are automatically distributed:

# Query via HTTP
curl -X POST http://scheduler-host:8090/v1/sql \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT COUNT(*) FROM large_dataset"}'

# Query via Flight SQL
spice sql --repl
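The curl call above can also be issued from application code. The following is a minimal Python sketch using only the standard library. It mirrors the request shown in the curl example (path, header, and JSON body); the function names are hypothetical, and decoding the response as JSON is an assumption about the response format.

```python
import json
import urllib.request


def build_sql_request(base_url: str, sql: str) -> urllib.request.Request:
    """Build the same POST /v1/sql request as the curl example."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(base_url: str, sql: str, timeout: float = 30.0):
    """Send the query to the scheduler and decode the body as JSON (assumed)."""
    request = build_sql_request(base_url, sql)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `run_query("http://scheduler-host:8090", "SELECT COUNT(*) FROM large_dataset")` sends the same query as the curl example, where `scheduler-host` is the placeholder hostname used there.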

The scheduler:

Parses and optimizes the query plan
Partitions the plan into executable tasks
Distributes tasks to available executors
Aggregates results from executors
Returns the final result to the client
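The scheduler steps listed above follow a scatter-gather pattern. The toy Python sketch below shows that pattern for a `COUNT(*)` query. It is not how Spice or Ballista is implemented: threads stand in for executor nodes, an in-memory list stands in for the dataset, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_tasks(rows: list[int], num_tasks: int) -> list[list[int]]:
    """Planning: split the input into roughly equal partitions, one per task."""
    return [rows[i::num_tasks] for i in range(num_tasks)]


def execute_task(partition: list[int]) -> int:
    """Runs on an executor: compute a partial COUNT(*) for one partition."""
    return len(partition)


def run_distributed_count(rows: list[int], num_executors: int = 3) -> int:
    tasks = plan_tasks(rows, num_executors)
    # Distribution: hand each task to a worker (threads stand in for executors).
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_counts = list(pool.map(execute_task, tasks))
    # Aggregation: combine the partial results into the final answer.
    return sum(partial_counts)


if __name__ == "__main__":
    print(run_distributed_count(list(range(1000))))
```

The key property is that each partial result can be computed independently, so adding workers shortens the execution phase while the aggregation step stays cheap.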

Monitoring Distributed Queries

Monitor cluster health and query execution:

# Check cluster metrics
curl "http://scheduler-host:9090/metrics?scope=cluster"

# View active executors
curl http://scheduler-host:8090/v1/status

Metrics include:

Active executor count
Task distribution across executors
Query execution time per executor
Data shuffle statistics
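To act on these metrics programmatically, the endpoint output has to be parsed. The Python sketch below assumes the metrics endpoint returns the Prometheus text exposition format and uses a deliberately simplified parser (no timestamps, no escaping). The metric names in the sample are invented for illustration and are not actual Spice metric names.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text-format samples into {series: value}.

    Simplified: skips comment lines and assumes no trailing timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical sample; these metric names are invented for illustration.
SAMPLE = """\
# HELP example_active_executors Number of registered executors
# TYPE example_active_executors gauge
example_active_executors 3
example_tasks_total{executor="executor-1"} 42
example_tasks_total{executor="executor-2"} 40
"""

if __name__ == "__main__":
    metrics = parse_metrics(SAMPLE)
    print(metrics["example_active_executors"])
```

For anything beyond a quick check, a Prometheus server or an existing client library is a better fit than hand-rolled parsing.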

Data Partitioning

Distributed queries perform best when the underlying data is already partitioned in storage. Declare the partition columns in the dataset definition:

datasets:
  - from: s3://large-bucket/data/
    name: large_dataset
    acceleration:
      enabled: true
      engine: arrow
    params:
      file_format: parquet
      # Ensure data is pre-partitioned in S3
      partition_cols:
        - year
        - month

Executors can then read separate partitions in parallel, which improves query performance.
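To make the benefit of pre-partitioned data concrete, the Python sketch below groups files by their `year` and `month` partition values and spreads the partitions across executors. It assumes a Hive-style path layout (`year=2024/month=01/...`) and a simple round-robin assignment; both are illustrative assumptions, not a description of how Spice schedules work, and the function names are hypothetical.

```python
from collections import defaultdict


def partition_key(path: str, partition_cols: list[str]) -> tuple[str, ...]:
    """Extract Hive-style partition values (col=value segments) from a path."""
    values = dict(
        segment.split("=", 1) for segment in path.split("/") if "=" in segment
    )
    return tuple(values[col] for col in partition_cols)


def assign_partitions(paths: list[str], partition_cols: list[str],
                      executors: list[str]) -> dict[str, list[str]]:
    """Group files by partition, then spread partitions round-robin."""
    by_partition = defaultdict(list)
    for path in paths:
        by_partition[partition_key(path, partition_cols)].append(path)

    assignment = {executor: [] for executor in executors}
    for index, key in enumerate(sorted(by_partition)):
        assignment[executors[index % len(executors)]].extend(by_partition[key])
    return assignment


if __name__ == "__main__":
    files = [
        "data/year=2024/month=01/a.parquet",
        "data/year=2024/month=02/b.parquet",
        "data/year=2023/month=12/c.parquet",
    ]
    print(assign_partitions(files, ["year", "month"], ["executor-1", "executor-2"]))
```

Because every file in a partition lands on the same executor, a filter such as `year = 2024` only touches the executors that hold matching partitions.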

Best Practices

Use mTLS in production: Always secure cluster communication with mutual TLS
Scale executors horizontally: Add more executors for increased query throughput
Partition large datasets: Pre-partition data for parallel processing
Monitor resource usage: Track CPU, memory, and network metrics per executor
Co-locate with data: Deploy executors close to data sources to minimize network latency
Use persistent storage: Mount volumes for file-based accelerators (DuckDB, Cayenne)

Limitations

Single scheduler (high availability requires external orchestration)
Executors must have network connectivity to the scheduler and to each other
Data shuffle requires inter-executor communication
Not all queries benefit from distribution (for example, OLTP-style workloads on small datasets)

Development Mode

For testing without mTLS:

# Scheduler (insecure - development only)
spiced --role scheduler --allow-insecure-connections

# Executor (insecure - development only)
spiced --role executor \
  --scheduler-address http://localhost:50052 \
  --allow-insecure-connections

WARNING: --allow-insecure-connections disables authentication. Never use it in production.

Next Steps

Configuration - Configure runtime behavior
Monitoring - Set up observability for distributed clusters
Kubernetes Deployment - Deploy to Kubernetes
