

Overview

Data partitioning divides large datasets into smaller, manageable pieces called partitions. Partitioning improves query performance by:
  • Reducing data scanned (partition pruning)
  • Enabling parallel processing
  • Improving cache locality
  • Reducing memory usage

Partitioning Strategies

Time-Based Partitioning

Most common for time-series data:
datasets:
  - name: events
    from: s3://bucket/events/
    params:
      # Hive-style partitioning: year=2024/month=01/day=15/
      hive_partitioning_enabled: 'true'
Query with partition pruning:
-- Only scans January 2024 partitions
SELECT * FROM events 
WHERE year = 2024 
  AND month = 1 
  AND event_type = 'click';
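The pruning step can be sketched in plain Python: with Hive-style paths, a filter on `year` and `month` lets the engine skip every file outside January 2024 without reading it. The paths and helper below are illustrative only, not Spice runtime internals.

```python
# Illustrative sketch of Hive-style partition pruning (not Spice internals).
def parse_hive_path(path: str) -> dict:
    """Extract key=value partition segments from an object path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts

def prune(paths, year: int, month: int):
    """Keep only files whose partition values match the filter."""
    return [
        p for p in paths
        if parse_hive_path(p).get("year") == str(year)
        and parse_hive_path(p).get("month") == f"{month:02d}"
    ]

files = [
    "events/year=2023/month=12/day=31/part-0.parquet",
    "events/year=2024/month=01/day=15/part-0.parquet",
    "events/year=2024/month=02/day=01/part-0.parquet",
]
print(prune(files, 2024, 1))  # only the January 2024 file survives
```

Note that pruning happens purely on path metadata; the non-matching files are never opened.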

Range Partitioning

Partition by value ranges:
datasets:
  - name: orders
    from: postgres:orders
    acceleration:
      enabled: true
      engine: duckdb
-- Partition by order amount ranges
CREATE TABLE orders_partitioned AS
SELECT *,
  CASE 
    WHEN amount < 100 THEN 'small'
    WHEN amount < 1000 THEN 'medium'
    ELSE 'large'
  END as size_partition
FROM orders;
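The `CASE` expression above maps amounts into named range buckets. As a minimal sketch, the same bucketing logic in Python:

```python
# Mirror of the SQL CASE expression above (illustrative, same thresholds).
def size_partition(amount: float) -> str:
    if amount < 100:
        return "small"
    if amount < 1000:
        return "medium"
    return "large"

print([size_partition(a) for a in (25, 250, 2500)])  # ['small', 'medium', 'large']
```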

Hash Partitioning

Distribute data evenly using a hash function:
datasets:
  - name: users
    from: postgres:users
    acceleration:
      enabled: true
      engine: duckdb
-- Distribute users across partitions by hash
SELECT *, 
  user_id % 16 as partition_id  -- 16 partitions
FROM users;
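The modulo expression above is the simplest hash-partitioning scheme. A quick sketch showing that it spreads sequential keys evenly across 16 partitions (the counts here assume sequential ids, an illustrative choice):

```python
# Illustrative: assign rows to 16 partitions by hashing the key column.
from collections import Counter

NUM_PARTITIONS = 16

def partition_id(user_id: int) -> int:
    return user_id % NUM_PARTITIONS  # same expression as the SQL above

counts = Counter(partition_id(uid) for uid in range(10_000))
# Sequential ids distribute perfectly: 625 rows in each of the 16 partitions.
print(min(counts.values()), max(counts.values()))  # 625 625
```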

List Partitioning

Partition by discrete values:
datasets:
  - name: sales
    from: postgres:sales
-- Partition by region
SELECT * FROM sales WHERE region IN ('US', 'CA', 'MX');  -- North America partition
SELECT * FROM sales WHERE region IN ('UK', 'DE', 'FR');  -- Europe partition
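List partitioning is just a lookup from a discrete value to its partition. A minimal sketch, with the partition names chosen here for illustration:

```python
# Illustrative mapping of discrete region values to named partitions.
REGION_PARTITIONS = {
    "north_america": {"US", "CA", "MX"},
    "europe": {"UK", "DE", "FR"},
}

def partition_for(region: str):
    """Return the partition containing this region, or None if unmapped."""
    for name, members in REGION_PARTITIONS.items():
        if region in members:
            return name
    return None

print(partition_for("DE"))  # europe
```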

Partition Configuration

S3 and File Partitioning

Hive Partitioning:
datasets:
  - name: logs
    from: s3://bucket/logs/
    params:
      hive_partitioning_enabled: 'true'
      # Expects structure: year=2024/month=01/day=15/
Directory-Based Partitioning:
datasets:
  - name: events
    from: s3://bucket/events/
    params:
      # Path: 2024/01/15/events.parquet
      partition_columns:
        - year
        - month  
        - day

DuckDB Partitioned Tables

Use DuckDB’s native partitioning:
datasets:
  - name: metrics
    from: postgres:metrics
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
-- DuckDB automatically partitions large tables
SELECT * FROM metrics WHERE timestamp >= '2024-01-01';

Cayenne (Vortex) Partitioning

Cayenne automatically partitions data into multiple files:
datasets:
  - name: large_dataset
    from: s3://bucket/data/
    acceleration:
      enabled: true
      engine: cayenne  # Multi-file columnar format
Partition layout:
.spice/data/large_dataset/
├── partition_0.vortex
├── partition_1.vortex
├── partition_2.vortex
└── metadata.db (SQLite)

Partition Pruning

Partition pruning skips partitions that cannot match a query's filters, so only relevant data is scanned:

Time-Based Pruning

-- Only scans partitions for Q1 2024
SELECT COUNT(*) FROM events
WHERE timestamp BETWEEN '2024-01-01' AND '2024-03-31';

Multi-Column Pruning

-- Prunes by year, month, and region
SELECT * FROM sales
WHERE year = 2024 
  AND month IN (1, 2, 3)
  AND region = 'US';

Dynamic Partition Pruning

Prune based on join results:
-- Partitions pruned based on active_regions
SELECT s.* 
FROM sales s
JOIN active_regions r ON s.region = r.region
WHERE s.year = 2024;

Refresh Data Window with Partitioning

Limit refreshes to specific time ranges:
datasets:
  - name: events
    from: postgres:events
    acceleration:
      enabled: true
      engine: duckdb
      refresh_mode: append
      refresh_check_interval: 10m
      time_column: created_at
      time_format: timestamptz
      refresh_data_window: 7d  # Only refresh last 7 days
How it works:
-- Runtime automatically adds filter
SELECT * FROM events 
WHERE created_at >= NOW() - INTERVAL '7 days';
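A `refresh_data_window` of `7d` effectively resolves to a single cutoff timestamp, as sketched below (the function name is illustrative, not a Spice API):

```python
# Illustrative: the cutoff the injected filter evaluates to for a 7-day window.
from datetime import datetime, timedelta, timezone

def refresh_cutoff(window: timedelta, now: datetime) -> datetime:
    """Rows with time_column >= this value are refreshed."""
    return now - window

now = datetime(2024, 1, 15, tzinfo=timezone.utc)
print(refresh_cutoff(timedelta(days=7), now))  # 2024-01-08 00:00:00+00:00
```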

Partition Management

Partition Statistics

-- View partition information
SELECT 
  partition_id,
  row_count,
  size_bytes,
  min_timestamp,
  max_timestamp
FROM dataset_partitions
WHERE dataset_name = 'events';

Manual Partition Refresh

datasets:
  - name: logs
    from: s3://bucket/logs/
    acceleration:
      enabled: true
      refresh_mode: full
      refresh_sql: |
        SELECT * FROM logs 
        WHERE year = 2024 AND month = 1  -- Specific partition

Parallel Query Execution

Target Partitions

Configure parallelism:
runtime:
  query:
    target_partitions: 16  # Number of parallel partitions
Rule of thumb:
target_partitions = CPU cores × (2 to 4)
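The rule of thumb can be computed directly from the host's core count. A small sketch (the function name is an assumption, not part of any Spice tooling):

```python
# Sketch of the rule of thumb: target_partitions = CPU cores x (2 to 4).
import os

def suggested_target_partitions(factor: int = 2) -> int:
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return cores * factor

print(suggested_target_partitions())   # e.g. 16 on an 8-core machine
print(suggested_target_partitions(4))  # upper end of the range
```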

Partition Size Guidelines

| Dataset Size | Partition Count | Partition Size |
|--------------|-----------------|----------------|
| < 1GB        | 4-8             | 128-256MB      |
| 1-10GB       | 8-16            | 128-512MB      |
| 10-100GB     | 16-32           | 512MB-2GB      |
| 100GB+       | 32-64           | 1-4GB          |
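The guideline table can be expressed as a simple lookup. A sketch that returns the suggested partition-count range for a dataset size (the function is illustrative, with the table's thresholds in GB):

```python
# Illustrative lookup over the guideline table above (thresholds in GB).
def suggested_partition_count(size_gb: float) -> range:
    if size_gb < 1:
        return range(4, 9)    # < 1GB    -> 4-8 partitions
    if size_gb <= 10:
        return range(8, 17)   # 1-10GB   -> 8-16 partitions
    if size_gb <= 100:
        return range(16, 33)  # 10-100GB -> 16-32 partitions
    return range(32, 65)      # 100GB+   -> 32-64 partitions

r = suggested_partition_count(50)
print(r.start, r.stop - 1)  # 16 32
```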

Advanced Partitioning

Multi-Level Partitioning

datasets:
  - name: events
    from: s3://bucket/events/
    params:
      # year=2024/month=01/day=15/hour=10/
      hive_partitioning_enabled: 'true'
-- Prunes by year, month, day, and hour
SELECT * FROM events
WHERE year = 2024 
  AND month = 1 
  AND day = 15 
  AND hour BETWEEN 10 AND 12;

Composite Partitioning

Combine partitioning strategies:
datasets:
  - name: transactions
    from: postgres:transactions
    acceleration:
      enabled: true
      engine: duckdb
-- Partition by date and region
CREATE TABLE transactions_partitioned AS
SELECT *,
  DATE_TRUNC('day', created_at) as date_partition,
  region as region_partition
FROM transactions;

Partition Expressions

Define partitions using SQL expressions:
datasets:
  - name: events
    from: postgres:events
    vector:
      enabled: true
      partition_by:
        - name: year
          expression: YEAR(created_at)
        - name: month
          expression: MONTH(created_at)

Partition Optimization

1. Choose Appropriate Partition Column

Good partition columns:
  • Frequently filtered in queries
  • Low cardinality (date, region, category)
  • Evenly distributed data
Poor partition columns:
  • High cardinality (user_id, transaction_id)
  • Rarely used in filters
  • Skewed distribution
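One quick way to screen a candidate column is its distinct-value ratio: close to 0 indicates low cardinality (good), close to 1 means nearly unique per row (poor). The heuristic below is an illustrative sketch, not a Spice feature:

```python
# Illustrative heuristic: low distinct-value ratio suggests a good partition column.
def distinct_ratio(values) -> float:
    values = list(values)
    return len(set(values)) / len(values)

regions = ["US", "CA", "US", "MX", "US", "CA"]  # low cardinality: good candidate
user_ids = list(range(1000))                    # unique per row: poor candidate

print(distinct_ratio(regions))   # 0.5
print(distinct_ratio(user_ids))  # 1.0
```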

2. Avoid Over-Partitioning

# BAD - Too many small partitions
datasets:
  - name: events
    from: s3://bucket/events/
    params:
      # year/month/day/hour/minute - Too granular!
      hive_partitioning_enabled: 'true'

# GOOD - Balanced partitions
datasets:
  - name: events
    from: s3://bucket/events/
    params:
      # year/month/day - Reasonable granularity
      hive_partitioning_enabled: 'true'

3. Align Partitions with Query Patterns

-- If queries filter by region and date
SELECT * FROM sales 
WHERE region = 'US' 
  AND sale_date >= '2024-01-01';

-- Partition by both region and date
partition_by:
  - region
  - year
  - month

4. Monitor Partition Skew

-- Check partition size distribution
SELECT 
  partition_id,
  row_count,
  size_bytes,
  size_bytes::FLOAT / SUM(size_bytes) OVER () AS pct_of_total
FROM dataset_partitions
WHERE dataset_name = 'events'
ORDER BY size_bytes DESC;
Ideally, every partition stays within 2-3x of the mean partition size.
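The same skew check can be done on partition sizes directly: divide the largest partition by the mean and compare against the 2-3x guideline. A minimal sketch with made-up size values:

```python
# Illustrative skew check matching the 2-3x-of-mean guideline above.
def max_skew(sizes) -> float:
    """Ratio of the largest partition to the mean partition size."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean

balanced = [100, 120, 90, 110]   # all close to the mean: healthy
skewed = [100, 100, 100, 900]    # one hot partition: needs repartitioning

print(round(max_skew(balanced), 2))  # 1.14
print(round(max_skew(skewed), 2))    # 3.0
```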

Partitioning by Engine

Arrow

  • In-memory, single partition by default
  • Use target_partitions for parallel execution
  • No persistent partitioning
datasets:
  - name: events
    acceleration:
      engine: arrow  # In-memory, parallel execution

DuckDB

  • File-based partitioning for large datasets
  • Automatic partition management
  • Supports partition pruning
datasets:
  - name: events
    acceleration:
      engine: duckdb
      mode: file  # Enables partitioning

Cayenne (Vortex)

  • Multi-file columnar storage
  • Automatic partitioning across files
  • Optimized for large datasets
datasets:
  - name: events
    acceleration:
      engine: cayenne  # Multi-file partitioning

SQLite

  • Single-file database
  • No native partitioning support
  • Use separate databases for partitioning
datasets:
  - name: events_2024
    acceleration:
      engine: sqlite
      # Manual partitioning via separate datasets

PostgreSQL

  • Native table partitioning
  • Declarative partitioning syntax
  • Supports all partition types
datasets:
  - name: events
    from: postgres:events  # PostgreSQL handles partitioning
    acceleration:
      engine: postgres

Best Practices

1. Start with Time-Based Partitioning

For time-series data, always partition by time:
datasets:
  - name: metrics
    from: s3://bucket/metrics/
    params:
      hive_partitioning_enabled: 'true'
      # year=2024/month=01/day=15/

2. Use Partition Pruning in Queries

-- GOOD - Enables partition pruning
SELECT * FROM events 
WHERE year = 2024 AND month = 1;

-- BAD - Cannot prune partitions
SELECT * FROM events 
WHERE DATE_TRUNC('month', timestamp) = '2024-01-01';

3. Keep Partition Size Reasonable

Target 256MB - 2GB per partition for optimal performance.

4. Monitor Query Performance

EXPLAIN SELECT * FROM events WHERE year = 2024 AND month = 1;
Verify partition pruning is active in the query plan.