Model clients provide the interface between AutoGen agents and large language models. AutoGen supports multiple LLM providers through the autogen-ext package.

Installation

Install the autogen-ext extra for your chosen provider. For example, for OpenAI:
pip install "autogen-ext[openai]"

OpenAI

The OpenAIChatCompletionClient supports GPT-4, GPT-3.5, o1, and o3 models.

Basic Usage

from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_agentchat.agents import AssistantAgent

# Create OpenAI client
model_client = OpenAIChatCompletionClient(
    model="gpt-4o",
    api_key="sk-...",  # Or set OPENAI_API_KEY environment variable
)

# Use with an agent
agent = AssistantAgent(
    name="assistant",
    model_client=model_client,
    system_message="You are a helpful assistant."
)

Configuration Options

model (string, required): The model name (e.g., gpt-4o, gpt-4-turbo, gpt-3.5-turbo).
api_key (string): OpenAI API key. If not provided, it is read from the OPENAI_API_KEY environment variable.
temperature (float, default: 1.0): Sampling temperature between 0 and 2.
top_p (float, default: 1.0): Nucleus sampling parameter.
max_tokens (int): Maximum tokens to generate.
timeout (float, default: 60.0): Request timeout in seconds.
base_url (string): Override the default OpenAI API endpoint.
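
The options above map directly onto keyword arguments of the client. As a minimal sketch, the hypothetical helper build_openai_kwargs below (not part of AutoGen) assembles those arguments, reads the API key from OPENAI_API_KEY, and rejects temperatures outside the documented 0 to 2 range before anything is sent to the API.

```python
import os
from typing import Any, Dict, Optional


def build_openai_kwargs(
    model: str,
    temperature: float = 1.0,
    max_tokens: Optional[int] = None,
    timeout: float = 60.0,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for the client."""
    # The options list documents a temperature range of 0 to 2.
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature must be between 0 and 2, got {temperature}")

    kwargs: Dict[str, Any] = {
        "model": model,
        "temperature": temperature,
        "timeout": timeout,
    }
    # Only pass optional values that were actually set.
    if max_tokens is not None:
        kwargs["max_tokens"] = max_tokens
    api_key = os.environ.get("OPENAI_API_KEY")
    if api_key:
        kwargs["api_key"] = api_key
    return kwargs
```

The resulting dictionary is intended to be unpacked as OpenAIChatCompletionClient(**kwargs).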

Advanced Example

from autogen_ext.models.openai import OpenAIChatCompletionClient

client = OpenAIChatCompletionClient(
    model="gpt-4o",
    api_key="sk-...",
    temperature=0.7,
    top_p=0.9,
    max_tokens=4096,
    timeout=120.0,
    # Optional: override the endpoint for an OpenAI-compatible API
    base_url="https://custom-endpoint.openai.azure.com/",
)

Azure OpenAI

The AzureOpenAIChatCompletionClient connects to Azure OpenAI Service.

Basic Usage

from autogen_ext.models.openai import AzureOpenAIChatCompletionClient

client = AzureOpenAIChatCompletionClient(
    model="gpt-4o",
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE-NAME.openai.azure.com",
    api_key="...",  # Or use Azure AD authentication
    azure_deployment="gpt-4o-deployment",  # Your deployment name
)

Configuration Options

azure_endpoint (string, required): The Azure OpenAI endpoint URL.
api_version (string, required): Azure OpenAI API version (e.g., 2024-02-01).
azure_deployment (string, required): Your deployment name in Azure.
api_key (string): Azure OpenAI API key.
azure_ad_token (string): Azure Active Directory token for authentication.

Azure AD Authentication

from azure.identity import DefaultAzureCredential
from autogen_ext.models.openai import AzureOpenAIChatCompletionClient

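# Using Azure AD authentication. The token is fetched once here and expires,
# so long-running processes need to obtain a fresh one.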
credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")

client = AzureOpenAIChatCompletionClient(
    model="gpt-4o",
    api_version="2024-02-01",
    azure_endpoint="https://YOUR-RESOURCE-NAME.openai.azure.com",
    azure_ad_token=token.token,
    azure_deployment="gpt-4o-deployment",
)
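
Because an Azure AD token is short-lived, a long-running process needs some way to refresh it. The sketch below shows one possible approach using only the standard library: CachedToken is a hypothetical wrapper, and the AccessToken class here is a stand-in that mirrors the token and expires_on fields of the object returned by credential.get_token. How the refreshed value reaches the client is left open; the underlying OpenAI SDK accepts a callable through azure_ad_token_provider, but whether your AutoGen version forwards that option is an assumption to verify.

```python
import time
from typing import Callable, NamedTuple, Optional


class AccessToken(NamedTuple):
    """Stand-in with the same fields as azure.core.credentials.AccessToken."""

    token: str
    expires_on: int  # Unix timestamp in seconds


class CachedToken:
    """Hypothetical wrapper: re-fetch the token shortly before it expires."""

    def __init__(self, fetch: Callable[[], AccessToken], margin_seconds: int = 300) -> None:
        self._fetch = fetch
        self._margin = margin_seconds
        self._current: Optional[AccessToken] = None

    def get(self, now: Optional[float] = None) -> str:
        """Return a valid token string, fetching a new one when needed."""
        now = time.time() if now is None else now
        expired = self._current is None or now >= self._current.expires_on - self._margin
        if expired:
            self._current = self._fetch()
        return self._current.token
```

With the credential from the example above, the wrapper could be created as CachedToken(lambda: credential.get_token("https://cognitiveservices.azure.com/.default")).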

Anthropic

The AnthropicChatCompletionClient supports Claude models.

Basic Usage

from autogen_ext.models.anthropic import AnthropicChatCompletionClient

client = AnthropicChatCompletionClient(
    model="claude-3-5-sonnet-20241022",
    api_key="sk-ant-...",  # Or set ANTHROPIC_API_KEY
    max_tokens=4096,
)

Configuration Options

model (string, required): Claude model name:
  • claude-3-5-sonnet-20241022 - Most capable
  • claude-3-opus-20240229 - Previous flagship
  • claude-3-sonnet-20240229 - Balanced
  • claude-3-haiku-20240307 - Fast and compact
api_key (string): Anthropic API key. Falls back to the ANTHROPIC_API_KEY environment variable.
max_tokens (int, required): Maximum tokens to generate. Required for Anthropic models.
temperature (float, default: 1.0): Sampling temperature between 0 and 1.
top_p (float): Nucleus sampling parameter.
top_k (int): Only sample from the top K options.

Extended Thinking (Claude 3.7 Sonnet and later)

Claude 3.7 Sonnet and later models support extended thinking mode:
from autogen_ext.models.anthropic import AnthropicChatCompletionClient

client = AnthropicChatCompletionClient(
    model="claude-3-7-sonnet-20250219",
    api_key="sk-ant-...",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # Tokens for thinking
    },
)
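
The thinking budget is spent out of max_tokens, so the two values have to be consistent. The sketch below checks them before the client is built. thinking_config is a hypothetical helper, and the two rules it enforces (a minimum budget of 1024 tokens, and a budget strictly below max_tokens) reflect Anthropic's published limits at the time of writing, so treat them as assumptions to re-check.

```python
from typing import Dict, Union

MIN_THINKING_BUDGET = 1024  # assumed minimum, per Anthropic's documentation


def thinking_config(max_tokens: int, budget_tokens: int) -> Dict[str, Union[str, int]]:
    """Hypothetical helper: validate and build the `thinking` argument."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be at least {MIN_THINKING_BUDGET}")
    # The budget must leave room for the visible answer.
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {"type": "enabled", "budget_tokens": budget_tokens}


# Same values as the example above: 10000 thinking tokens out of 16000.
print(thinking_config(max_tokens=16000, budget_tokens=10000))
```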

AWS Bedrock

Use Claude models through AWS Bedrock:
from autogen_ext.models.anthropic import AnthropicBedrockChatCompletionClient

client = AnthropicBedrockChatCompletionClient(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    max_tokens=4096,
    # AWS credentials from environment or ~/.aws/credentials
    aws_region="us-west-2",
)

Configuration Options

aws_region (string): AWS region (e.g., us-west-2, us-east-1).
aws_access_key (string): AWS access key ID.
aws_secret_key (string): AWS secret access key.
aws_session_token (string): AWS session token for temporary credentials.

Ollama

The OllamaChatCompletionClient connects to local Ollama instances.

Basic Usage

from autogen_ext.models.ollama import OllamaChatCompletionClient

client = OllamaChatCompletionClient(
    model="llama3.2",
    host="http://localhost:11434",
)

Configuration Options

model (string, required): Ollama model name (e.g., llama3.2, mistral, qwen2.5).
host (string, default: http://localhost:11434): Ollama server URL.
temperature (float): Sampling temperature.
top_p (float): Nucleus sampling parameter.
top_k (int): Top-K sampling parameter.
num_ctx (int): Context window size.
num_predict (int): Maximum tokens to generate.

Advanced Configuration

from autogen_ext.models.ollama import OllamaChatCompletionClient

client = OllamaChatCompletionClient(
    model="llama3.2",
    host="http://localhost:11434",
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    num_ctx=8192,  # Context window
    num_predict=2048,  # Max generation
    repeat_penalty=1.1,
    seed=42,  # For reproducibility
)

Llama.cpp

Run GGUF models locally with llama.cpp:

Installation

pip install "autogen-ext[llama-cpp]"

Basic Usage

from autogen_ext.models.llama_cpp import LlamaCppChatCompletionClient

client = LlamaCppChatCompletionClient(
    model_path="./models/llama-3.2-3b-instruct-q8_0.gguf",
    n_ctx=8192,  # Context window
    n_gpu_layers=35,  # Offload layers to GPU
)

Configuration Options

model_path (string, required): Path to the GGUF model file.
n_ctx (int, default: 2048): Context window size.
n_gpu_layers (int, default: 0): Number of layers to offload to GPU.
temperature (float, default: 0.8): Sampling temperature.
top_p (float, default: 0.95): Nucleus sampling.
top_k (int, default: 40): Top-K sampling.
max_tokens (int, default: 512): Maximum tokens to generate.

Azure AI

Connect to Azure AI model deployments:
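from azure.core.credentials import AzureKeyCredential
from autogen_ext.models.azure import AzureAIChatCompletionClient

client = AzureAIChatCompletionClient(
    endpoint="https://YOUR-ENDPOINT.inference.ai.azure.com",
    # The key must be wrapped in a credential object, not passed as a bare string
    credential=AzureKeyCredential("YOUR-API-KEY"),
    model="gpt-4o",
    # Describe what the deployed model supports; adjust to your deployment
    model_info={
        "json_output": True, "function_calling": True, "vision": True,
        "family": "unknown", "structured_output": True,
    },
)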

Streaming Responses

All model clients support streaming:
from autogen_core import CancellationToken
from autogen_core.models import UserMessage

async def stream_example(client):
    messages = [UserMessage(content="Tell me a story", source="user")]

    # create_stream yields text chunks (str) followed by a final CreateResult
    async for chunk in client.create_stream(messages, cancellation_token=CancellationToken()):
        if isinstance(chunk, str):
            print(chunk, end="", flush=True)
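
The streaming loop can be tried without calling a real provider. In the sketch below, FakeStreamingClient and FakeResult are stand-ins invented for illustration. It assumes the stream yields partial text as str values and ends with a single result object, which is how create_stream is understood to behave, and it collects only the text chunks.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List, Union


@dataclass
class FakeResult:
    """Stand-in for the final result object that ends a stream."""

    content: str


class FakeStreamingClient:
    """Hypothetical client that streams a fixed list of text chunks."""

    def __init__(self, chunks: List[str]) -> None:
        self._chunks = chunks

    async def create_stream(
        self, messages: List[str]
    ) -> AsyncGenerator[Union[str, FakeResult], None]:
        for chunk in self._chunks:
            await asyncio.sleep(0)  # yield control, as a network read would
            yield chunk
        # The last item carries the complete response.
        yield FakeResult(content="".join(self._chunks))


async def collect_text(client: FakeStreamingClient, messages: List[str]) -> str:
    """Join the partial text chunks and ignore the final result object."""
    parts: List[str] = []
    async for item in client.create_stream(messages):
        if isinstance(item, str):
            parts.append(item)
    return "".join(parts)


text = asyncio.run(
    collect_text(FakeStreamingClient(["Once ", "upon ", "a time"]), ["Tell me a story"])
)
print(text)
```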

Model Capabilities

Query model capabilities:
model_info = client.model_info  # A dictionary describing the model

print(f"Vision: {model_info['vision']}")
print(f"Function calling: {model_info['function_calling']}")
print(f"JSON output: {model_info['json_output']}")
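
Capability flags are most useful as a guard before a feature is used. The sketch below assumes the capability information is available as a mapping with the three fields printed above; supports and example_info are hypothetical names used only for illustration.

```python
from typing import Mapping


def supports(model_info: Mapping[str, object], capability: str) -> bool:
    """Return True only if the capability is present and truthy."""
    return bool(model_info.get(capability, False))


# Hypothetical capability record, shaped like the fields printed above.
example_info = {"vision": False, "function_calling": True, "json_output": True}

if not supports(example_info, "vision"):
    print("Skipping image input: the model does not report vision support")
```

Treating a missing key as unsupported keeps the check conservative for models whose capabilities are unknown.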

Token Counting

Count tokens before sending requests:
from autogen_core.models import UserMessage

messages = [UserMessage(content="Hello, world!", source="user")]
token_count = client.count_tokens(messages)
print(f"Message uses {token_count} tokens")
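
A token count becomes actionable once it is compared against a budget. As a minimal sketch, the hypothetical helper trim_to_budget drops the oldest messages until the rest fit. The count argument stands in for a call such as client.count_tokens, and word_count is a crude whitespace-based placeholder, not a real tokenizer.

```python
from typing import Callable, List, Sequence


def trim_to_budget(
    messages: Sequence[str],
    count: Callable[[Sequence[str]], int],
    budget: int,
) -> List[str]:
    """Drop the oldest messages until the remaining ones fit the budget.

    The most recent message is always kept, even if it alone is too long.
    """
    kept = list(messages)
    while len(kept) > 1 and count(kept) > budget:
        kept.pop(0)
    return kept


def word_count(messages: Sequence[str]) -> int:
    """Placeholder counter: whitespace-separated words, not real tokens."""
    return sum(len(message.split()) for message in messages)


history = ["first long message here", "second message", "latest"]
print(trim_to_budget(history, word_count, budget=3))
```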

Usage Tracking

Track token usage from responses:
from autogen_core import CancellationToken
from autogen_core.models import UserMessage

messages = [UserMessage(content="Explain quantum computing", source="user")]
result = await client.create(messages, cancellation_token=CancellationToken())

print(f"Prompt tokens: {result.usage.prompt_tokens}")
print(f"Completion tokens: {result.usage.completion_tokens}")

Error Handling

Handle common errors:
import asyncio

from autogen_core import CancellationToken
from openai import APIError, RateLimitError  # Raised by OpenAI-backed clients

async def create_with_retry(client, messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await client.create(messages, cancellation_token=CancellationToken())
        except RateLimitError:
            # Retry rate-limit errors; give up after the last attempt
            if attempt < max_retries - 1:
                await asyncio.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
        except APIError as e:
            # Other API errors are not retried
            print(f"API error: {e}")
            raise
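
The wait of 2 ** attempt seconds doubles on every retry, which gets long quickly when max_retries is raised. A common variation caps the delay and adds random jitter so that many clients do not retry in lockstep. The sketch below only computes the delay schedule; backoff_delays and its parameters are hypothetical and not part of AutoGen.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_retries: int,
    base: float = 1.0,
    cap: float = 30.0,
    jitter: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the sleep time before each retry (none after the final attempt)."""
    rng = rng or random.Random()
    delays: List[float] = []
    for attempt in range(max_retries - 1):
        delay = min(cap, base * 2 ** attempt)  # exponential growth, capped
        if jitter:
            delay += rng.uniform(0, jitter)  # spread retries across clients
        delays.append(delay)
    return delays


print(backoff_delays(max_retries=3))
print(backoff_delays(max_retries=8, cap=30.0))
```

Each value can be passed to asyncio.sleep in place of 2 ** attempt.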

Environment Variables

Model clients respect standard environment variables:
# OpenAI
export OPENAI_API_KEY="sk-..."
export OPENAI_ORG_ID="org-..."

# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."

# Azure OpenAI
export AZURE_OPENAI_ENDPOINT="https://..."
export AZURE_OPENAI_API_KEY="..."

# AWS (for Bedrock)
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-west-2"
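
A missing variable usually surfaces only when the first request fails, so it can help to check up front. The sketch below is one way to do that: EXPECTED_VARS and missing_vars are hypothetical names, and the grouping simply mirrors the list above. AWS credentials can also come from ~/.aws/credentials, so an empty environment is not always an error for Bedrock.

```python
import os
from typing import Dict, List, Mapping, Optional

# Hypothetical grouping that mirrors the variables listed above.
EXPECTED_VARS: Dict[str, List[str]] = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "azure_openai": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
    "bedrock": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"],
}


def missing_vars(provider: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the expected variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS[provider] if not env.get(name)]


for provider in EXPECTED_VARS:
    missing = missing_vars(provider)
    if missing:
        print(f"{provider}: missing {', '.join(missing)}")
```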

Best Practices

Use Environment Variables

Store API keys in environment variables instead of hardcoding:
import os
from autogen_ext.models.openai import OpenAIChatCompletionClient

# Good: reads from environment
client = OpenAIChatCompletionClient(
    model="gpt-4o",
    api_key=os.getenv("OPENAI_API_KEY"),
)

# Better: automatic from environment
client = OpenAIChatCompletionClient(model="gpt-4o")

Set Timeouts

Always configure appropriate timeouts:
client = OpenAIChatCompletionClient(
    model="gpt-4o",
    timeout=120.0,  # 2 minute timeout
)

Monitor Usage

Track token usage to manage costs:
total_prompt_tokens = 0
total_completion_tokens = 0

result = await client.create(messages, cancellation_token=CancellationToken())
total_prompt_tokens += result.usage.prompt_tokens
total_completion_tokens += result.usage.completion_tokens

print(f"Total usage: {total_prompt_tokens + total_completion_tokens} tokens")
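
When usage is tracked across many calls, a small accumulator is easier to manage than loose counters. The sketch below is one possible shape: UsageTracker and FakeUsage are hypothetical, and the only assumption about real responses is that result.usage exposes prompt_tokens and completion_tokens as in the example above.

```python
from dataclasses import dataclass


@dataclass
class FakeUsage:
    """Stand-in for result.usage, with the two fields used above."""

    prompt_tokens: int
    completion_tokens: int


@dataclass
class UsageTracker:
    """Hypothetical accumulator for token usage across several calls."""

    prompt_tokens: int = 0
    completion_tokens: int = 0
    calls: int = 0

    def add(self, usage) -> None:
        """Record the usage of one response."""
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens
        self.calls += 1

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


tracker = UsageTracker()
tracker.add(FakeUsage(prompt_tokens=12, completion_tokens=30))
tracker.add(FakeUsage(prompt_tokens=8, completion_tokens=50))
print(f"{tracker.calls} calls, {tracker.total_tokens} tokens in total")
```

With a real client, tracker.add(result.usage) would be called after each request.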

Next Steps

Code Executors

Set up code execution environments

Tools

Add tools and capabilities to agents