Getting Started
LlamaGate provides an OpenAI-compatible API for accessing open-source language models. You can use the official OpenAI SDK or any HTTP client to make requests.
> **Tip:** Configure your client with LlamaGate's base URL; all endpoints below are relative to it.
Install the SDK
Install the OpenAI SDK for your preferred language, for example `pip install openai` for Python or `npm install openai` for Node.js.
Authentication
All API requests require authentication using a Bearer token. You can create API keys from your dashboard.
> **Warning:** API keys start with the `llg_sk_` prefix and must be kept secret. Create your key in the API Keys dashboard, and never expose it in client-side code.
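Every request sends the key in an `Authorization` header. A minimal sketch (the key and base URL below are placeholders, not real values):

```python
# Placeholders only; substitute your own key and deployment's base URL.
API_KEY = "llg_sk_your_key_here"
BASE_URL = "https://your-llamagate-host/v1"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```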
Chat Completions
The chat completions endpoint is the primary way to interact with language models. Send a list of messages and receive a model-generated response.
Try this example to see a chat completion response
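A minimal sketch using only the standard library; the base URL, key, and model ID are placeholders (list the available models to find real IDs), and the network call is left commented so the snippet can run offline:

```python
import json
from urllib import request

BASE_URL = "https://your-llamagate-host/v1"  # placeholder base URL
API_KEY = "llg_sk_your_key_here"             # placeholder key

payload = {
    "model": "llama-3.1-8b-instruct",  # hypothetical model ID
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one sentence."},
    ],
    "temperature": 0.7,
    "max_tokens": 128,
}

req = request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
)

# Uncomment to send the request against a live deployment:
# with request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

The reply text lives at `choices[0].message.content` in the response body, following the OpenAI response shape.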
Parameters
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Model ID to use (required) |
| `messages` | array | List of messages in the conversation (required) |
| `temperature` | number | Sampling temperature (0-2, default: 1) |
| `max_tokens` | integer | Maximum tokens to generate |
| `stream` | boolean | Enable streaming responses |
| `top_p` | number | Nucleus sampling parameter (0-1) |
Streaming Responses
For a better user experience, you can stream responses token by token. This is especially useful for chat interfaces.
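With `"stream": true` set in the request, the response arrives as server-sent events, each `data:` line carrying a JSON chunk whose `delta` holds newly generated tokens. A client-side parsing sketch, assuming the OpenAI streaming chunk format that an OpenAI-compatible API mirrors:

```python
import json

def parse_sse_line(line: str):
    """Return the decoded JSON chunk for one 'data: ...' line, or None."""
    line = line.strip()
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):]
    if data == "[DONE]":  # sentinel marking the end of the stream
        return None
    return json.loads(data)

# Example chunk in the OpenAI streaming format:
chunk = parse_sse_line('data: {"choices": [{"delta": {"content": "Hel"}}]}')
print(chunk["choices"][0]["delta"]["content"])  # prints: Hel
```

Concatenating each chunk's `delta.content` as it arrives reproduces the full reply.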
Tool Calling (Function Calling)
Tool calling allows the model to request external function calls. This is useful for building agents, retrieving real-time data, or performing actions.
Defining Tools
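Each tool is declared with a name, a description, and a JSON Schema for its parameters. A sketch using a hypothetical `get_weather` function, following the OpenAI tools format:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical function, for illustration only
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]
# Pass `tools` alongside `model` and `messages` in the chat completion request.
```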
Handling Tool Calls
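When the model decides to use a tool, the assistant message contains a `tool_calls` array instead of text; your code runs the function and sends the result back as a `tool` message. A local sketch (the `get_weather` stand-in and the message shape follow the OpenAI tool-call format):

```python
import json

def get_weather(city: str) -> dict:
    """Stand-in implementation for the hypothetical tool."""
    return {"city": city, "temp_c": 21}

# An assistant message shaped like an OpenAI-format tool-call response:
assistant_message = {
    "role": "assistant",
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
        }
    ],
}

tool_messages = []
for call in assistant_message.get("tool_calls", []):
    args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
    result = get_weather(**args)
    tool_messages.append(
        {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
    )
# Append `tool_messages` to the conversation and call the API again for the final answer.
```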
JSON Mode & Structured Outputs
JSON mode ensures the model outputs valid JSON. For even more control, use structured outputs to define an exact JSON schema the response must follow.
Basic JSON Mode
Use {"type": "json_object"} to ensure valid JSON output:
Structured Outputs (JSON Schema)
For guaranteed response structure, provide a JSON schema. The model will strictly follow the schema, ensuring type safety and required fields.
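A request sketch following the OpenAI `json_schema` response format (model ID and schema contents are illustrative):

```python
payload = {
    "model": "llama-3.1-8b-instruct",  # hypothetical model ID
    "messages": [{"role": "user", "content": "Extract the event details."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "event",
            "strict": True,  # enforce exact schema compliance
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "date": {"type": "string"},
                },
                "required": ["title", "date"],
                "additionalProperties": False,
            },
        },
    },
}
```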
> **Tip:** Use `strict: true` for guaranteed schema compliance in production.

Vision (Image Input)
Vision-capable models can analyze images. Pass images as base64-encoded data or URLs.
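A sketch of a multimodal user message with a base64-encoded image, following the OpenAI content-parts format (the image bytes here are a stand-in; in practice you would read a real file):

```python
import base64

# In practice, read and encode a real image file:
# with open("photo.jpg", "rb") as f:
#     b64 = base64.b64encode(f.read()).decode("ascii")
b64 = base64.b64encode(b"\xff\xd8\xff").decode("ascii")  # stand-in bytes for the sketch

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
    ],
}
```

A plain `https://` image URL can be passed in the `image_url` field instead of a data URL.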
Embeddings
Generate vector embeddings for text. Useful for semantic search, clustering, and RAG applications.
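A request sketch for the embeddings endpoint (the model ID is a placeholder):

```python
payload = {
    "model": "nomic-embed-text",  # hypothetical embedding model ID
    "input": "The quick brown fox jumps over the lazy dog.",
}
# POST this to /embeddings; the vector is at response["data"][0]["embedding"].
```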
Batch Embeddings
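To embed several texts in one request, pass `input` as an array; the response's `data` array then contains one embedding per input, matched by its `index` field. A sketch:

```python
payload = {
    "model": "nomic-embed-text",  # hypothetical embedding model ID
    "input": [
        "First document.",
        "Second document.",
        "Third document.",
    ],
}
# response["data"] holds one embedding per input, in the same order.
```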
Available Models
List all available models via the API or view them on the pricing page.
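A sketch of the listing call (base URL and key are placeholders; the network call is commented so the snippet runs offline):

```python
from urllib import request

BASE_URL = "https://your-llamagate-host/v1"  # placeholder base URL
API_KEY = "llg_sk_your_key_here"             # placeholder key

req = request.Request(
    f"{BASE_URL}/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
)

# Uncomment against a live deployment:
# import json
# with request.urlopen(req) as resp:
#     for m in json.load(resp)["data"]:
#         print(m["id"])
```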
Model Categories
| Category | Examples | Best For |
|---|---|---|
| General Purpose | Llama 3.1, Qwen, Mistral | Everyday tasks, chat, writing |
| Code | CodeGemma, DeepSeek Coder | Programming, code review |
| Reasoning | DeepSeek R1, OpenThinker | Complex problem-solving |
| Vision | LLaVA, Qwen VL | Image understanding |
| Embeddings | Nomic, Qwen Embedding | Vector search, RAG |
Error Handling
The API returns standard HTTP status codes and JSON error responses.
Error Codes
| Status | Description |
|---|---|
| 400 | Bad request - check your parameters |
| 401 | Unauthorized - invalid or missing API key |
| 402 | Payment required - insufficient credits |
| 404 | Not found - model does not exist |
| 429 | Rate limit exceeded |
| 500 | Internal server error |
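A sketch of turning a status code and error body into a readable message; the `{"error": {"message": ...}}` body shape is an assumption based on the OpenAI error format, not confirmed above:

```python
import json

def parse_error(status: int, body: str) -> str:
    """Format an HTTP status and JSON error body as a readable message (sketch;
    the {"error": {"message": ...}} shape is assumed, not documented here)."""
    try:
        message = json.loads(body).get("error", {}).get("message", body)
    except json.JSONDecodeError:
        message = body
    retryable = status == 429 or status >= 500  # transient errors worth retrying
    return f"{status}: {message}" + (" (retryable)" if retryable else "")

print(parse_error(429, '{"error": {"message": "Rate limit exceeded"}}'))
# prints: 429: Rate limit exceeded (retryable)
```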
Rate Limits
Rate limits ensure fair usage and service stability. Limits are applied per API key.
| Limit Type | Value |
|---|---|
| Requests per minute | 60 RPM |
| Tokens per minute | 100,000 TPM |
| Concurrent requests | 10 |
Rate limit headers are included in API responses:
| Header | Description |
|---|---|
| `X-RateLimit-Limit` | Maximum requests allowed |
| `X-RateLimit-Remaining` | Remaining requests |
| `X-RateLimit-Reset` | Time when limit resets |
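These headers can drive client-side backoff. A sketch, assuming `X-RateLimit-Reset` is a Unix timestamp (check your deployment's actual format):

```python
import time

def wait_if_exhausted(headers: dict) -> float:
    """Return seconds to sleep before the next request, based on rate-limit
    headers. Assumes X-RateLimit-Reset is a Unix timestamp (an assumption)."""
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    reset = float(headers.get("X-RateLimit-Reset", time.time()))
    return max(0.0, reset - time.time())

# With requests still remaining, no wait is needed:
print(wait_if_exhausted({"X-RateLimit-Remaining": "5"}))  # prints: 0.0
```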