
Spice supports multiple model providers for both LLM inference and embeddings, including hosted APIs and local model serving with hardware acceleration.

Supported Providers

LLM Providers

  • OpenAI - GPT-4, GPT-4o, GPT-3.5-turbo models
  • Anthropic - Claude 3 Opus, Sonnet, Haiku models
  • xAI - Grok models
  • AWS Bedrock - Amazon Nova models
  • Azure OpenAI - Azure-hosted OpenAI models
  • Databricks - Models via Databricks serving endpoints
  • Google - Gemini models
  • Perplexity - Perplexity API models
  • File - Local GGUF/GGML/SafeTensor models
  • HuggingFace - Models from HuggingFace Hub
  • Spice.ai - Models from Spice.ai Cloud Platform

Embedding Providers

  • OpenAI - text-embedding-3-small, text-embedding-3-large
  • AWS Bedrock - Amazon Titan, Cohere, Nova embeddings
  • Azure OpenAI - Azure-hosted embedding models
  • Google - Gemini embedding models
  • File - Local ONNX models
  • HuggingFace - ONNX-compatible models from HuggingFace
  • Model2Vec - Static embeddings (up to ~500x faster than transformer models)

Configuration Format

Models are configured in spicepod.yaml:
version: v1
kind: Spicepod
name: my-app

models:
  - from: <provider>:<model_id>
    name: <local_name>
    params:
      <provider_params>

embeddings:
  - from: <provider>:<model_id>
    name: <local_name>
    params:
      <provider_params>

OpenAI

Configuration

models:
  - from: openai:gpt-4o-mini
    name: chat-model
    params:
      openai_api_key: ${secrets:openai_key}
      openai_org_id: org-xxx  # Optional
      openai_project_id: proj-xxx  # Optional
      openai_usage_tier: tier3  # Optional: free, tier1-5

embeddings:
  - from: openai:text-embedding-3-small
    name: text-embedding
    params:
      openai_api_key: ${secrets:openai_key}
      openai_usage_tier: tier3

Available Models

Chat Models:
  • gpt-4o - GPT-4 Omni, flagship multimodal model
  • gpt-4o-mini - Efficient GPT-4 variant
  • gpt-4-turbo - GPT-4 Turbo
  • gpt-3.5-turbo - Fast, cost-effective
Embedding Models:
  • text-embedding-3-small - 1536 dimensions
  • text-embedding-3-large - 3072 dimensions
  • text-embedding-ada-002 - Legacy model
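Embedding vectors from these models are typically compared with cosine similarity. A small self-contained sketch of that computation (the vectors below are toy 3-dimensional stand-ins; real text-embedding-3-small output has 1536 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embedding output
v1 = [1.0, 0.0, 0.0]
v2 = [1.0, 0.0, 0.0]
v3 = [0.0, 1.0, 0.0]

print(cosine_similarity(v1, v2))  # → 1.0 (identical direction)
print(cosine_similarity(v1, v3))  # → 0.0 (orthogonal)
```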

Parameters

Parameter          Description                      Required
-----------------  -------------------------------  --------
openai_api_key     OpenAI API key                   Yes
openai_api_base    Custom API endpoint              No
openai_org_id      Organization ID                  No
openai_project_id  Project ID                       No
openai_usage_tier  Rate limit tier (free, tier1-5)  No

Anthropic

Configuration

models:
  - from: anthropic:claude-3-5-sonnet-20241022
    name: claude
    params:
      anthropic_api_key: ${secrets:anthropic_key}

Available Models

  • claude-3-5-sonnet-20241022 - Most capable Claude 3.5 model
  • claude-3-5-haiku-20241022 - Fast and efficient
  • claude-3-opus-20240229 - Most capable Claude 3 model
  • claude-3-sonnet-20240229 - Balanced performance
  • claude-3-haiku-20240307 - Fastest responses

Parameters

Parameter           Description          Required
------------------  -------------------  --------
anthropic_api_key   Anthropic API key    Yes
anthropic_api_base  Custom API endpoint  No

xAI

Configuration

models:
  - from: xai:grok-2-1212
    name: grok
    params:
      xai_api_key: ${secrets:xai_key}

Available Models

  • grok-2-1212 - Latest Grok model
  • grok-vision-beta - Vision capabilities

AWS Bedrock

Configuration

models:
  - from: bedrock:amazon.nova-pro-v1:0
    name: nova-pro
    params:
      aws_region: us-east-1
      aws_access_key_id: ${secrets:aws_access_key}
      aws_secret_access_key: ${secrets:aws_secret}

embeddings:
  - from: bedrock:amazon.titan-embed-text-v2:0
    name: titan-embed
    params:
      aws_region: us-east-1
      aws_access_key_id: ${secrets:aws_access_key}
      aws_secret_access_key: ${secrets:aws_secret}
      normalize: true
      dimensions: 512

Available Models

Chat Models:
  • amazon.nova-pro-v1:0 - Amazon Nova Pro
  • amazon.nova-lite-v1:0 - Amazon Nova Lite
  • amazon.nova-micro-v1:0 - Amazon Nova Micro
  • anthropic.claude-3-5-sonnet-20241022-v2:0 - Claude via Bedrock
Embedding Models:
  • amazon.titan-embed-text-v2:0 - Titan Text Embeddings v2
  • cohere.embed-english-v3 - Cohere embeddings
  • cohere.embed-multilingual-v3 - Multilingual embeddings
  • amazon.nova-multimodal-embed-v2:0 - Nova multimodal embeddings

Parameters

Common:

Parameter              Description     Required
---------------------  --------------  --------
aws_region             AWS region      Yes
aws_access_key_id      AWS access key  Yes
aws_secret_access_key  AWS secret key  Yes

Titan Embeddings:

Parameter   Description           Default
----------  --------------------  -------
normalize   Normalize embeddings  false
dimensions  Output dimensions     512

Cohere Embeddings:

Parameter       Description                   Default
--------------  ----------------------------  ---------------
truncate        Truncation mode               NONE
input_type      Input type                    SEARCH_DOCUMENT
embedding_type  Embedding type (float, int8)  FLOAT

Nova Embeddings:

Parameter          Description               Default
-----------------  ------------------------  -------
dimensions         Output dimensions         1024
embedding_purpose  Purpose (query, storage)  STORAGE
truncation_mode    Truncation mode           NONE

Azure OpenAI

Configuration

models:
  - from: azure:gpt-4o-mini
    name: azure-chat
    params:
      azure_api_key: ${secrets:azure_key}
      azure_api_base: https://your-resource.openai.azure.com
      azure_api_version: 2024-02-15-preview
      azure_deployment_name: my-gpt4-deployment

embeddings:
  - from: azure:text-embedding-3-small
    name: azure-embed
    params:
      azure_api_key: ${secrets:azure_key}
      azure_api_base: https://your-resource.openai.azure.com
      azure_api_version: 2024-02-15-preview
      azure_deployment_name: my-embedding-deployment

Parameters

Parameter              Description                        Required
---------------------  ---------------------------------  --------
azure_api_key          Azure OpenAI API key               Yes
azure_api_base         Azure endpoint URL                 Yes
azure_api_version      API version                        Yes
azure_deployment_name  Deployment name                    Yes
azure_entra_token      Azure AD token (alternative auth)  No

Local Models (File)

Configuration

models:
  - from: file:models/Llama-3.2-1B-Instruct-Q4_K_M.gguf
    name: local-llm

embeddings:
  - from: file:models/all-MiniLM-L6-v2/
    name: local-embed

Supported Formats

LLM Formats:
  • GGUF - Quantized llama.cpp format (recommended)
  • GGML - Legacy llama.cpp format
  • SafeTensor - Hugging Face SafeTensor format
Embedding Formats:
  • ONNX - Optimized neural network exchange format

Hardware Acceleration

Spice automatically detects and utilizes available hardware:
  • NVIDIA GPUs - CUDA acceleration for GGUF/GGML models
  • Apple Silicon - Metal acceleration on M1/M2/M3 chips
  • CPU - Optimized CPU inference with SIMD

Example: Local Llama Model

# Download a GGUF model
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -P models/

models:
  - from: file:./models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
    name: llama-local

# Start Spice
spice run

HuggingFace

Configuration

models:
  - from: huggingface:Qwen/Qwen2.5-0.5B-Instruct
    name: qwen
    params:
      huggingface_token: ${secrets:hf_token}  # Optional

embeddings:
  - from: huggingface:sentence-transformers/all-MiniLM-L6-v2
    name: minilm
    params:
      huggingface_token: ${secrets:hf_token}  # Optional
Models are automatically downloaded from HuggingFace Hub on first use.

Parameters

Parameter          Description   Required
-----------------  ------------  ----------------------
huggingface_token  HF API token  No (for public models)

Model2Vec

Model2Vec provides static embeddings with inference up to ~500x faster than transformer-based models.

Configuration

embeddings:
  - from: model2vec:minishlab/potion-base-8M
    name: fast-embed
    params:
      huggingface_token: ${secrets:hf_token}  # Optional
      normalize: true
      parallelism: 4
      embed_max_token_length: 512
      embed_batch_size: 1024

Available Models

  • minishlab/potion-base-8M - 256 dimensions
  • minishlab/potion-multilingual-128M - Multilingual support

Parameters

Parameter               Description           Default
----------------------  --------------------  ---------
normalize               Normalize embeddings  true
parallelism             Number of threads     CPU cores
embed_max_token_length  Max token length      512
embed_batch_size        Batch size            1024
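embed_batch_size controls how many inputs are embedded together per batch. Conceptually, batching a workload looks like this (a generic client-side sketch, not Spice's internals):

```python
def batched(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Ten documents split into batches of four
texts = [f"doc-{i}" for i in range(10)]
batches = list(batched(texts, 4))
print([len(b) for b in batches])  # → [4, 4, 2]
```

Larger batches generally improve throughput at the cost of memory, which is why the default of 1024 suits high-volume pipelines.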

Performance

Model2Vec is ideal for:
  • High-throughput embedding pipelines
  • Real-time search applications
  • Resource-constrained environments
  • CPU-only deployments

Google (Gemini)

Configuration

models:
  - from: google:gemini-1.5-pro
    name: gemini
    params:
      google_api_key: ${secrets:google_key}

embeddings:
  - from: google:text-embedding-004
    name: gemini-embed
    params:
      google_api_key: ${secrets:google_key}

Databricks

Configuration

models:
  - from: databricks:databricks-meta-llama-3-1-70b-instruct
    name: llama-databricks
    params:
      databricks_host: https://your-workspace.databricks.com
      databricks_token: ${secrets:databricks_token}

Rate Limiting

Configure rate limits to avoid throttling:
models:
  - from: openai:gpt-4o-mini
    name: rate-limited
    params:
      openai_api_key: ${secrets:key}
      openai_usage_tier: tier3  # Automatic rate limiting
Built-in rate controllers manage:
  • Requests per minute
  • Concurrent requests
  • Exponential backoff on errors
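The exponential-backoff behavior can be illustrated with a small sketch. This is not Spice's internal implementation, just the general technique: double the wait after each failure, up to a cap:

```python
def backoff_delays(base=1.0, factor=2.0, cap=60.0, retries=5):
    """Yield exponentially growing wait times, capped at `cap` seconds."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= factor

# Delays a client would sleep after successive 429/throttling errors
delays = list(backoff_delays())
print(delays)  # → [1.0, 2.0, 4.0, 8.0, 16.0]
```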

Caching

Enable response caching for improved performance:
models:
  - from: openai:gpt-4o-mini
    name: cached-model
    params:
      openai_api_key: ${secrets:key}
    caching:
      enabled: true
      max_size: 128MiB
      ttl: 1h

Health Checks

Spice performs automatic health checks on model initialization:
  • Tests model connectivity
  • Validates credentials
  • Ensures model availability
Health check logs:
2026-03-03T10:15:30Z INFO Model 'gpt-4o-mini' health check passed
2026-03-03T10:15:31Z ERROR Model 'invalid-model' health check failed: model not found

Model Discovery

Spice can list available models from providers:
# OpenAI models
curl http://localhost:8090/v1/models
Response:
{
  "object": "list",
  "data": [
    {
      "id": "gpt-4o-mini",
      "object": "model",
      "owned_by": "openai"
    }
  ]
}
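The response follows the OpenAI list format, so it can be consumed with standard JSON tooling. For example, extracting the model ids in Python:

```python
import json

# The JSON body returned by GET /v1/models
response_body = '''
{
  "object": "list",
  "data": [
    {"id": "gpt-4o-mini", "object": "model", "owned_by": "openai"}
  ]
}
'''

models = [m["id"] for m in json.loads(response_body)["data"]]
print(models)  # → ['gpt-4o-mini']
```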

Best Practices

Credential Management

Use Spice’s secret management:
models:
  - from: openai:gpt-4o-mini
    name: secure-model
    params:
      openai_api_key: ${secrets:SPICE_SECRET_OPENAI_API_KEY}

Provider Selection

  • OpenAI - Best for general-purpose tasks, structured outputs
  • Anthropic - Longer context windows, strong reasoning
  • Local Models - Privacy, offline operation, cost control
  • Model2Vec - High-throughput embeddings, CPU efficiency
  • AWS Bedrock - Enterprise compliance, AWS integration

Performance Optimization

  1. Use appropriate model sizes - Smaller models for simple tasks
  2. Enable caching - Reduce redundant API calls
  3. Configure rate limits - Avoid throttling
  4. Local models for high volume - CUDA/Metal acceleration
  5. Model2Vec for embeddings - 500x faster than transformers
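Several of these practices can be combined in a single model definition. A sketch (the model choice, cache size, and TTL are illustrative, not prescriptive):

```yaml
models:
  - from: openai:gpt-4o-mini        # smaller model for routine tasks
    name: fast-chat
    params:
      openai_api_key: ${secrets:openai_key}
      openai_usage_tier: tier3      # built-in rate limiting
    caching:
      enabled: true                 # avoid redundant API calls
      max_size: 128MiB
      ttl: 1h
```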

Troubleshooting

Model Not Found

# Ensure model name matches spicepod.yaml
models:
  - from: openai:gpt-4o-mini
    name: chat-model  # Use this name in API calls
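API calls reference the local `name`, never the `from` string. A sketch of a chat-completion request body for Spice's OpenAI-compatible endpoint (the port and path below are assumptions; adjust to your deployment):

```python
import json

# "chat-model" is the `name` from spicepod.yaml, NOT "openai:gpt-4o-mini"
payload = {
    "model": "chat-model",
    "messages": [{"role": "user", "content": "Hello"}],
}

body = json.dumps(payload)
print(body)
# POST this body to e.g. http://localhost:8090/v1/chat/completions
```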

Authentication Errors

# Check secret configuration
spice secrets list

# Set secret
spice secrets set SPICE_SECRET_OPENAI_API_KEY your-key

Rate Limiting

# Configure usage tier
params:
  openai_usage_tier: tier3  # Adjust based on your OpenAI tier

Next Steps

OpenAI Compatibility

Use models with OpenAI SDK

Embeddings

Generate embeddings at scale

RAG

Build RAG workflows

MCP Integration

Add function calling