Spice supports multiple model providers for both LLM inference and embeddings, including hosted APIs and local model serving with hardware acceleration.
## Supported Providers

### LLM Providers

- **OpenAI** - GPT-4, GPT-4o, and GPT-3.5-turbo models
- **Anthropic** - Claude 3 Opus, Sonnet, and Haiku models
- **xAI** - Grok models
- **AWS Bedrock** - Amazon Nova models
- **Azure OpenAI** - Azure-hosted OpenAI models
- **Databricks** - Models via Databricks serving endpoints
- **Google** - Gemini models
- **Perplexity** - Perplexity API models
- **File** - Local GGUF/GGML/SafeTensor models
- **HuggingFace** - Models from the HuggingFace Hub
- **Spice.ai** - Models from the Spice.ai Cloud Platform
### Embedding Providers

- **OpenAI** - text-embedding-3-small, text-embedding-3-large
- **AWS Bedrock** - Amazon Titan, Cohere, and Nova embeddings
- **Azure OpenAI** - Azure-hosted embedding models
- **Google** - Gemini embedding models
- **File** - Local ONNX models
- **HuggingFace** - ONNX-compatible models from the HuggingFace Hub
- **Model2Vec** - Static embeddings (500x faster)
Models are configured in `spicepod.yaml`:

```yaml
version: v1
kind: Spicepod
name: my-app

models:
  - from: <provider>:<model_id>
    name: <local_name>
    params:
      <provider_params>

embeddings:
  - from: <provider>:<model_id>
    name: <local_name>
    params:
      <provider_params>
```
## OpenAI

### Configuration

```yaml
models:
  - from: openai:gpt-4o-mini
    name: chat-model
    params:
      openai_api_key: ${secrets:openai_key}
      openai_org_id: org-xxx # Optional
      openai_project_id: proj-xxx # Optional
      openai_usage_tier: tier3 # Optional: free, tier1-5

embeddings:
  - from: openai:text-embedding-3-small
    name: text-embedding
    params:
      openai_api_key: ${secrets:openai_key}
      openai_usage_tier: tier3
```
### Available Models

**Chat Models:**

- `gpt-4o` - Latest GPT-4 Optimized
- `gpt-4o-mini` - Efficient GPT-4o variant
- `gpt-4-turbo` - GPT-4 Turbo
- `gpt-3.5-turbo` - Fast and cost-effective

**Embedding Models:**

- `text-embedding-3-small` - 1536 dimensions
- `text-embedding-3-large` - 3072 dimensions
- `text-embedding-ada-002` - Legacy model
### Parameters

| Parameter | Description | Required |
|---|---|---|
| `openai_api_key` | OpenAI API key | Yes |
| `openai_api_base` | Custom API endpoint | No |
| `openai_org_id` | Organization ID | No |
| `openai_project_id` | Project ID | No |
| `openai_usage_tier` | Rate limit tier (free, tier1-5) | No |
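Once a model is registered, it can be queried through Spice's OpenAI-compatible HTTP API (the Model Discovery section below uses `http://localhost:8090`). A minimal sketch, assuming that endpoint and the `chat-model` name from the configuration above; the `build_chat_request` helper is illustrative, not part of Spice:

```python
import json
import urllib.request

# Illustrative helper (not part of Spice): build an OpenAI-style chat
# completion request body for the model registered in spicepod.yaml.
def build_chat_request(model: str, prompt: str) -> dict:
    return {
        # Use the spicepod `name`, not the provider model id.
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("chat-model", "Summarize the quarterly numbers.")

# Sending it to a running Spice instance (not executed here):
# req = urllib.request.Request(
#     "http://localhost:8090/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```

Because the API is OpenAI-compatible, the same request can also be made through the official OpenAI SDK by pointing its `base_url` at the Spice endpoint.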
## Anthropic

### Configuration

```yaml
models:
  - from: anthropic:claude-3-5-sonnet-20241022
    name: claude
    params:
      anthropic_api_key: ${secrets:anthropic_key}
```
### Available Models

- `claude-3-5-sonnet-20241022` - Most capable Claude model
- `claude-3-5-haiku-20241022` - Fast and efficient
- `claude-3-opus-20240229` - Most capable Claude 3 model
- `claude-3-sonnet-20240229` - Balanced performance
- `claude-3-haiku-20240307` - Fastest responses
### Parameters

| Parameter | Description | Required |
|---|---|---|
| `anthropic_api_key` | Anthropic API key | Yes |
| `anthropic_api_base` | Custom API endpoint | No |
## xAI

### Configuration

```yaml
models:
  - from: xai:grok-2-1212
    name: grok
    params:
      xai_api_key: ${secrets:xai_key}
```

### Available Models

- `grok-2-1212` - Latest Grok model
- `grok-vision-beta` - Vision capabilities
## AWS Bedrock

### Configuration

```yaml
models:
  - from: bedrock:amazon.nova-pro-v1:0
    name: nova-pro
    params:
      aws_region: us-east-1
      aws_access_key_id: ${secrets:aws_access_key}
      aws_secret_access_key: ${secrets:aws_secret}

embeddings:
  - from: bedrock:amazon.titan-embed-text-v2:0
    name: titan-embed
    params:
      aws_region: us-east-1
      aws_access_key_id: ${secrets:aws_access_key}
      aws_secret_access_key: ${secrets:aws_secret}
      normalize: true
      dimensions: 512
```
### Available Models

**Chat Models:**

- `amazon.nova-pro-v1:0` - Amazon Nova Pro
- `amazon.nova-lite-v1:0` - Amazon Nova Lite
- `amazon.nova-micro-v1:0` - Amazon Nova Micro
- `anthropic.claude-3-5-sonnet-20241022-v2:0` - Claude via Bedrock

**Embedding Models:**

- `amazon.titan-embed-text-v2:0` - Titan Text Embeddings v2
- `cohere.embed-english-v3` - Cohere English embeddings
- `cohere.embed-multilingual-v3` - Cohere multilingual embeddings
- `amazon.nova-multimodal-embed-v2:0` - Nova multimodal embeddings
### Parameters

**Common:**

| Parameter | Description | Required |
|---|---|---|
| `aws_region` | AWS region | Yes |
| `aws_access_key_id` | AWS access key | Yes |
| `aws_secret_access_key` | AWS secret key | Yes |

**Titan Embeddings:**

| Parameter | Description | Default |
|---|---|---|
| `normalize` | Normalize embeddings | false |
| `dimensions` | Output dimensions | 512 |

**Cohere Embeddings:**

| Parameter | Description | Default |
|---|---|---|
| `truncate` | Truncation mode | NONE |
| `input_type` | Input type | SEARCH_DOCUMENT |
| `embedding_type` | Embedding type (float, int8) | FLOAT |

**Nova Embeddings:**

| Parameter | Description | Default |
|---|---|---|
| `dimensions` | Output dimensions | 1024 |
| `embedding_purpose` | Purpose (query, storage) | STORAGE |
| `truncation_mode` | Truncation mode | NONE |
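The `normalize` parameter scales each embedding vector to unit length (L2 norm of 1), so that the dot product of two vectors equals their cosine similarity. A minimal sketch of the math:

```python
import math

def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm, as `normalize: true` requests
    for Titan embeddings."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# A 2D toy vector; real Titan embeddings have 512+ dimensions.
v = normalize([3.0, 4.0])
# → [0.6, 0.8], whose L2 norm is 1
```

With normalized vectors, similarity search reduces to a plain dot product, which most vector stores evaluate fastest.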
## Azure OpenAI

### Configuration

```yaml
models:
  - from: azure:gpt-4o-mini
    name: azure-chat
    params:
      azure_api_key: ${secrets:azure_key}
      azure_api_base: https://your-resource.openai.azure.com
      azure_api_version: 2024-02-15-preview
      azure_deployment_name: my-gpt4-deployment

embeddings:
  - from: azure:text-embedding-3-small
    name: azure-embed
    params:
      azure_api_key: ${secrets:azure_key}
      azure_api_base: https://your-resource.openai.azure.com
      azure_api_version: 2024-02-15-preview
      azure_deployment_name: my-embedding-deployment
```
### Parameters

| Parameter | Description | Required |
|---|---|---|
| `azure_api_key` | Azure OpenAI API key | Yes |
| `azure_api_base` | Azure endpoint URL | Yes |
| `azure_api_version` | API version | Yes |
| `azure_deployment_name` | Deployment name | Yes |
| `azure_entra_token` | Azure AD token (alternative auth) | No |
## Local Models (File)

### Configuration

```yaml
models:
  - from: file:models/Llama-3.2-1B-Instruct-Q4_K_M.gguf
    name: local-llm

embeddings:
  - from: file:models/all-MiniLM-L6-v2/
    name: local-embed
```

**LLM Formats:**

- **GGUF** - Quantized llama.cpp format (recommended)
- **GGML** - Legacy llama.cpp format
- **SafeTensor** - Hugging Face SafeTensor format

**Embedding Formats:**

- **ONNX** - Open Neural Network Exchange format
### Hardware Acceleration

Spice automatically detects and utilizes available hardware:

- **NVIDIA GPUs** - CUDA acceleration for GGUF/GGML models
- **Apple Silicon** - Metal acceleration on M1/M2/M3 chips
- **CPU** - Optimized CPU inference with SIMD
### Example: Local Llama Model

```yaml
models:
  - from: file:./models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
    name: llama-local
```

```bash
# Download a GGUF model
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -P models/

# Start Spice
spice run
```
## HuggingFace

### Configuration

```yaml
models:
  - from: huggingface:Qwen/Qwen2.5-0.5B-Instruct
    name: qwen
    params:
      huggingface_token: ${secrets:hf_token} # Optional

embeddings:
  - from: huggingface:sentence-transformers/all-MiniLM-L6-v2
    name: minilm
    params:
      huggingface_token: ${secrets:hf_token} # Optional
```

Models are automatically downloaded from the HuggingFace Hub on first use.

### Parameters

| Parameter | Description | Required |
|---|---|---|
| `huggingface_token` | HuggingFace API token | No (for public models) |
## Model2Vec

Model2Vec provides static embeddings that are 500x faster than transformer models:

### Configuration

```yaml
embeddings:
  - from: model2vec:minishlab/potion-base-8M
    name: fast-embed
    params:
      huggingface_token: ${secrets:hf_token} # Optional
      normalize: true
      parallelism: 4
      embed_max_token_length: 512
      embed_batch_size: 1024
```

### Available Models

- `minishlab/potion-base-8M` - 256 dimensions
- `minishlab/potion-multilingual-128M` - Multilingual support

### Parameters

| Parameter | Description | Default |
|---|---|---|
| `normalize` | Normalize embeddings | true |
| `parallelism` | Number of threads | CPU cores |
| `embed_max_token_length` | Max token length | 512 |
| `embed_batch_size` | Batch size | 1024 |
Model2Vec is ideal for:

- High-throughput embedding pipelines
- Real-time search applications
- Resource-constrained environments
- CPU-only deployments
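Model2Vec's speed comes from replacing the transformer forward pass with a static lookup: each token maps to a precomputed vector, and the text embedding is the mean of its token vectors. A toy sketch of the idea (the vectors below are made up, not the real potion-base-8M weights):

```python
# Toy static-embedding table illustrating the Model2Vec approach:
# no neural network at inference time, just lookups and mean pooling.
TOKEN_VECTORS = {
    "fast": [1.0, 0.0],
    "search": [0.0, 1.0],
    "engine": [1.0, 1.0],
}

def embed(text: str) -> list[float]:
    """Mean-pool the static vectors of the known tokens in `text`."""
    vecs = [TOKEN_VECTORS[t] for t in text.split() if t in TOKEN_VECTORS]
    dims = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dims)]

e = embed("fast search engine")
# Mean of [1,0], [0,1], [1,1] → [2/3, 2/3]
```

Because every input reduces to table lookups and an average, throughput is bounded by memory bandwidth rather than compute, which is why CPU-only deployments work well.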
## Google (Gemini)

### Configuration

```yaml
models:
  - from: google:gemini-1.5-pro
    name: gemini
    params:
      google_api_key: ${secrets:google_key}

embeddings:
  - from: google:text-embedding-004
    name: gemini-embed
    params:
      google_api_key: ${secrets:google_key}
```
## Databricks

### Configuration

```yaml
models:
  - from: databricks:databricks-meta-llama-3-1-70b-instruct
    name: llama-databricks
    params:
      databricks_host: https://your-workspace.databricks.com
      databricks_token: ${secrets:databricks_token}
```
## Rate Limiting

Configure rate limits to avoid throttling:

```yaml
models:
  - from: openai:gpt-4o-mini
    name: rate-limited
    params:
      openai_api_key: ${secrets:key}
      openai_usage_tier: tier3 # Automatic rate limiting
```

Built-in rate controllers manage:

- Requests per minute
- Concurrent requests
- Exponential backoff on errors
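Exponential backoff doubles the wait after each consecutive failure, usually up to a cap. A sketch of the retry-delay schedule (illustrative, not Spice's internal controller; the base, factor, and cap values are assumptions):

```python
def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   max_delay: float = 30.0, attempts: int = 6) -> list[float]:
    """Delay (seconds) before retry n: base * factor**n, capped at max_delay."""
    return [min(base * factor ** n, max_delay) for n in range(attempts)]

delays = backoff_delays()
# → [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Production retry loops typically add random jitter to each delay so that many clients hitting the same rate limit do not retry in lockstep.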
## Caching

Enable response caching for improved performance:

```yaml
models:
  - from: openai:gpt-4o-mini
    name: cached-model
    params:
      openai_api_key: ${secrets:key}
    caching:
      enabled: true
      max_size: 128MiB
      ttl: 1h
```
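The `ttl` setting bounds how long a cached response may be served before it is fetched fresh. The eviction idea can be sketched as (an illustrative TTL cache, not Spice's implementation):

```python
import time

class TtlCache:
    """Minimal TTL cache illustrating the `ttl` setting above."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        # key -> (expiry timestamp, cached value)
        self.store: dict[str, tuple[float, str]] = {}

    def put(self, key: str, value: str) -> None:
        self.store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key: str):
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self.store[key]  # expired: evict and report a miss
            return None
        return value

cache = TtlCache(ttl_seconds=3600)  # ttl: 1h
cache.put("prompt-hash", "cached completion")
```

A real response cache would also enforce the `max_size` bound, typically by evicting least-recently-used entries once the limit is reached.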
## Health Checks

Spice performs automatic health checks on model initialization:

- Tests model connectivity
- Validates credentials
- Ensures model availability

Health check logs:

```
2026-03-03T10:15:30Z INFO Model 'gpt-4o-mini' health check passed
2026-03-03T10:15:31Z ERROR Model 'invalid-model' health check failed: model not found
```
## Model Discovery

Spice can list available models from providers:

```bash
# OpenAI models
curl http://localhost:8090/v1/models
```

Response:

```json
{
  "object": "list",
  "data": [
    {
      "id": "gpt-4o-mini",
      "object": "model",
      "owned_by": "openai"
    }
  ]
}
```
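The response follows OpenAI's model-list schema, so extracting the model ids is a one-liner (shown here against the sample response above):

```python
import json

# The sample /v1/models response from above.
response_body = """
{
  "object": "list",
  "data": [
    {"id": "gpt-4o-mini", "object": "model", "owned_by": "openai"}
  ]
}
"""

# Collect the id of every model the runtime reports.
models = [m["id"] for m in json.loads(response_body)["data"]]
# → ["gpt-4o-mini"]
```

Checking whether a spicepod model name appears in this list is a quick way to confirm it registered correctly.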
## Best Practices

### Credential Management

Use Spice's secret management:

```yaml
models:
  - from: openai:gpt-4o-mini
    name: secure-model
    params:
      openai_api_key: ${secrets:SPICE_SECRET_OPENAI_API_KEY}
```
### Provider Selection

- **OpenAI** - Best for general-purpose tasks and structured outputs
- **Anthropic** - Longer context windows, strong reasoning
- **Local Models** - Privacy, offline operation, cost control
- **Model2Vec** - High-throughput embeddings, CPU efficiency
- **AWS Bedrock** - Enterprise compliance, AWS integration

### Performance

- **Use appropriate model sizes** - Smaller models for simple tasks
- **Enable caching** - Reduce redundant API calls
- **Configure rate limits** - Avoid throttling
- **Local models for high volume** - CUDA/Metal acceleration
- **Model2Vec for embeddings** - 500x faster than transformers
## Troubleshooting

### Model Not Found

```yaml
# Ensure model name matches spicepod.yaml
models:
  - from: openai:gpt-4o-mini
    name: chat-model # Use this name in API calls
```

### Authentication Errors

```bash
# Check secret configuration
spice secrets list

# Set secret
spice secrets set SPICE_SECRET_OPENAI_API_KEY your-key
```

### Rate Limiting

```yaml
# Configure usage tier
params:
  openai_usage_tier: tier3 # Adjust based on your OpenAI tier
```
## Next Steps

- **OpenAI Compatibility** - Use models with the OpenAI SDK
- **Embeddings** - Generate embeddings at scale
- **MCP Integration** - Add function calling