Guide

Kimi K2.6

Moonshot's frontier open-weight agent model. 262K context, native tool calling, vision, and extended reasoning. Served through AIgateway with an OpenAI-compatible API — drop-in compatible with the OpenAI SDK, Cursor, Cline, LangChain, Vercel AI SDK, and anything else that speaks OpenAI.

Looking for the current frontier pick? The newer Kimi K2.7 Code is now the headline Moonshot model and the one in the $5 signup credit shortlist. K2.6 stays in the catalog at the same pricing — grab a key and call either by slug.

Quickstart (60 seconds)

Every OpenAI SDK works — just change base_url. Here's the Python version:

# pip install aigateway-py openai
# aigateway-py adds sub-accounts, evals, replays, jobs, and webhook verification.
# The openai SDK handles chat, embeddings, images, and audio, and is all this quickstart uses.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.aigateway.sh/v1",
    api_key="sk-aig-...",
)

r = client.chat.completions.create(
    model="moonshot/kimi-k2.6",
    messages=[{"role": "user", "content": "Write a Python one-liner to reverse a string."}],
)
print(r.choices[0].message.content)

Model card

Slug: moonshot/kimi-k2.6
Provider: Moonshot (served on the edge via AIgateway)
Context window: 262,144 tokens (~700 pages of text, most mid-sized repos fit whole)
Max output: 16,384 tokens
Modality: Text + vision
Capabilities: Streaming, tool calling, JSON mode, extended reasoning, vision
Pricing: $0.95 / 1M input tokens, $4.00 / 1M output tokens. Provider prompt-cached input tokens bill at 50% of the input rate ($0.475 / 1M). Pass-through — our 5% fee is added to the provider cost on every call.

Request

Full OpenAI chat.completions body. Everything is optional except model and messages:

{
  "model": "moonshot/kimi-k2.6",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user",   "content": "Hello!" }
  ],
  "temperature": 0.7,
  "top_p": 0.95,
  "max_tokens": 4096,
  "stream": false,

  "tools": [ /* OpenAI function spec — see below */ ],
  "tool_choice": "auto",
  "parallel_tool_calls": true,

  "response_format": { "type": "json_object" }
}

Response (non-streaming)

Two fields are non-obvious on Kimi: reasoning_content (chain of thought) and tool_calls (when the model wants to call a function).

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1776947082,
  "model": "moonshot/kimi-k2.6",
  "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": null,
        "reasoning_content": "The user asked about Tokyo. I should call the tool...",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"city\":\"Tokyo\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 53,
    "completion_tokens": 79,
    "total_tokens": 132
  }
}

If you don't want to show the chain of thought, just read content — reasoning_content is additive, not required. Cursor, Cline, and the OpenAI SDK all ignore it by default.
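A minimal sketch of separating the answer from the chain of thought, working on the plain-dict form of the response shown above. The helper name split_reply is hypothetical; only the field names come from the response example. With the SDK's object form, getattr(message, "reasoning_content", None) is the likely equivalent, though that is an assumption about how the SDK surfaces extra fields.

```python
from typing import Optional, Tuple


def split_reply(response: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (content, reasoning_content) from a chat.completion dict.

    reasoning_content is treated as optional: .get() yields None when absent.
    """
    message = response["choices"][0]["message"]
    return message.get("content"), message.get("reasoning_content")


# Sample trimmed to the fields this helper reads.
sample = {
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "Hello!",
            "reasoning_content": "The user greeted me. I should greet back.",
        },
        "finish_reason": "stop",
    }]
}

answer, thinking = split_reply(sample)
print(answer)
```

Showing only answer to end users and logging thinking separately keeps the UI clean without discarding the reasoning trace.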

Streaming (SSE)

Set "stream": true and read text/event-stream. Kimi emits chunks in this order:

// 1. Role
data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

// 2. Reasoning (Kimi thinks first — stream of reasoning_content)
data: {"choices":[{"index":0,"delta":{"reasoning_content":"The "},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"reasoning_content":"user "},"finish_reason":null}]}
// ... many chunks ...

// 3. Content (the actual answer)
data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

// 4. Finish
data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
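The chunk order above can be folded into two strings with a small parser. This is a sketch under the assumption that each event arrives as one complete data: line; collect_stream is a hypothetical helper name, and the sample events are copied from the sequence above.

```python
import json
from typing import Iterable, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, str, str]:
    """Fold SSE lines into (reasoning, content, finish_reason)."""
    reasoning, content, finish = [], [], ""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        choice = json.loads(payload)["choices"][0]
        delta = choice.get("delta", {})
        reasoning.append(delta.get("reasoning_content") or "")
        content.append(delta.get("content") or "")
        finish = choice.get("finish_reason") or finish
    return "".join(reasoning), "".join(content), finish


events = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"reasoning_content":"The "},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"reasoning_content":"user "},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]
reasoning, content, finish = collect_stream(events)
print(content)
```

Keeping reasoning and content in separate buffers lets a UI render the thinking phase differently, or drop it entirely.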

Tool calling

Kimi speaks the standard OpenAI tool-call protocol. Define tools, Kimi decides when to call them, you execute the call and feed the result back as a role: "tool" message.

import json  # needed for json.dumps when sending the tool result back

# Define a tool
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Turn 1 — Kimi asks to call the tool
r = client.chat.completions.create(
    model="moonshot/kimi-k2.6",
    messages=[{"role": "user", "content": "Weather in Tokyo?"}],
    tools=tools,
)
call = r.choices[0].message.tool_calls[0]
# call.function.name == "get_weather"
# call.function.arguments == '{"city":"Tokyo"}'

# You execute the tool
result = {"temperature_c": 22, "conditions": "sunny"}

# Turn 2 — feed the result back, Kimi writes the final answer
r2 = client.chat.completions.create(
    model="moonshot/kimi-k2.6",
    messages=[
        {"role": "user", "content": "Weather in Tokyo?"},
        r.choices[0].message,                         # assistant turn with tool_calls
        {"role": "tool", "tool_call_id": call.id,
         "content": json.dumps(result)},
    ],
    tools=tools,
)
print(r2.choices[0].message.content)
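The "you execute the tool" step is hard-coded above. One way to generalize it is a small name-to-function registry, sketched below. TOOLS, run_tool_call, and the canned get_weather body are all hypothetical; only the role: "tool" message shape comes from the example above.

```python
import json
from typing import Callable, Dict


def get_weather(city: str) -> dict:
    # Stand-in implementation with canned data; a real tool would call a weather API.
    return {"city": city, "temperature_c": 22, "conditions": "sunny"}


# Hypothetical registry: tool name -> local Python function.
TOOLS: Dict[str, Callable[..., dict]] = {"get_weather": get_weather}


def run_tool_call(call_id: str, name: str, arguments: str) -> dict:
    """Execute one tool call and return the role: "tool" message to send back."""
    try:
        result = TOOLS[name](**json.loads(arguments))
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        # Report the failure to the model instead of crashing the agent loop.
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


msg = run_tool_call("call_abc123", "get_weather", '{"city":"Tokyo"}')
print(msg["content"])
```

In the two-turn flow above, this would be called as run_tool_call(call.id, call.function.name, call.function.arguments), once per entry in tool_calls. Returning errors as tool output is a design choice: it gives the model a chance to correct its arguments on the next turn.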

Streaming tool calls

When streaming with tools, the arguments string arrives as fragments. Concatenate by index until finish_reason === "tool_calls":

data: {"choices":[{"index":0,"delta":{"tool_calls":[
  {"index":0,"id":"call_abc","type":"function",
   "function":{"name":"get_weather","arguments":""}}]},"finish_reason":null}]}

data: {"choices":[{"index":0,"delta":{"tool_calls":[
  {"index":0,"function":{"arguments":"{\"city\":"}}]},"finish_reason":null}]}

data: {"choices":[{"index":0,"delta":{"tool_calls":[
  {"index":0,"function":{"arguments":"\"Tokyo\"}"}}]},"finish_reason":null}]}

data: {"choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}
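A sketch of the concatenate-by-index step, operating on chunks that have already been parsed from JSON. merge_tool_call_deltas is a hypothetical helper name; the sample chunks mirror the events above.

```python
import json
from typing import Dict, Iterable, List


def merge_tool_call_deltas(chunks: Iterable[dict]) -> List[dict]:
    """Merge streamed tool_calls fragments, keyed by each call's index."""
    calls: Dict[int, dict] = {}
    for chunk in chunks:
        choice = chunk["choices"][0]
        for frag in choice["delta"].get("tool_calls") or []:
            call = calls.setdefault(
                frag["index"], {"id": None, "name": None, "arguments": ""}
            )
            call["id"] = frag.get("id") or call["id"]
            fn = frag.get("function") or {}
            call["name"] = fn.get("name") or call["name"]
            call["arguments"] += fn.get("arguments") or ""
        if choice.get("finish_reason") == "tool_calls":
            break  # all argument fragments have arrived
    return [calls[i] for i in sorted(calls)]


chunks = [
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "id": "call_abc", "type": "function",
         "function": {"name": "get_weather", "arguments": ""}}]},
        "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '{"city":'}}]},
        "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '"Tokyo"}'}}]},
        "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "tool_calls"}]},
]

merged = merge_tool_call_deltas(chunks)
args = json.loads(merged[0]["arguments"])  # parse only once the call is complete
print(merged[0]["name"], args)
```

The arguments string is only valid JSON after the final fragment, so parsing is deferred until the merge finishes. Keying by index also keeps parallel tool calls from interleaving.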

Use Kimi in Cursor (free agent mode)

Cursor's agent mode speaks full OpenAI-compat, so Kimi slots in:

Get a key at aigateway.sh
In Cursor: Settings → Models → "Override OpenAI Base URL"
Base URL: https://api.aigateway.sh/v1
Add model: moonshot/kimi-k2.6
Code.

Agent mode, tool calls, multi-turn conversations — all work. Tab autocomplete and Cmd-K still use Cursor's own backend (hardwired on their side), but the chat + agent panel is fully yours.

Use Kimi in Cline

# Cline: Settings → API Provider → "OpenAI Compatible"
Base URL:  https://api.aigateway.sh/v1
API key:   sk-aig-...
Model ID:  moonshot/kimi-k2.6

Use Kimi in LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="moonshot/kimi-k2.6",
    base_url="https://api.aigateway.sh/v1",
    api_key="sk-aig-...",
)
print(llm.invoke("Hello!").content)

Use Kimi in Vercel AI SDK

import { createOpenAI } from "@ai-sdk/openai";
import { streamText } from "ai";

const aigateway = createOpenAI({
  baseURL: "https://api.aigateway.sh/v1",
  apiKey: process.env.AIG_KEY,
});

const result = await streamText({
  model: aigateway("moonshot/kimi-k2.6"),
  prompt: "Hello!",
});
for await (const chunk of result.textStream) process.stdout.write(chunk);

Vision input

Kimi sees images. Pass them as OpenAI-style content parts:

{
  "model": "moonshot/kimi-k2.6",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is in this image?" },
        { "type": "image_url",
          "image_url": { "url": "https://example.com/cat.jpg" } }
      ]
    }
  ]
}
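A small sketch for building that message in Python. vision_message and data_url are hypothetical helper names. Passing a base64 data: URL for a local image follows the OpenAI convention; that the gateway accepts it for this model is an assumption, not something stated above.

```python
import base64


def vision_message(text: str, image_url: str) -> dict:
    """Build a user message with one text part and one image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


def data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL (assumed to be accepted)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Remote image, exactly as in the JSON body above.
remote = vision_message("What is in this image?", "https://example.com/cat.jpg")

# Local bytes (here just a JPEG magic-number stub) inlined as a data URL.
inline = vision_message("What is in this image?", data_url(b"\xff\xd8\xff"))
print(inline["content"][1]["image_url"]["url"])
```

Either message can be passed in the messages list of client.chat.completions.create.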

Pricing worked example

A typical agent turn: ~500 input tokens (conversation + tool defs), ~300 output tokens.

Input: 500 × $0.95 / 1,000,000 = $0.000475
Output: 300 × $4.00 / 1,000,000 = $0.0012
Provider cost: ~$0.0017 per turn. With our 5% platform fee on top, ~$0.00176 per turn, or about 570 agent turns per dollar.

With provider prompt caching (distinct from gateway-level response caching), cached input tokens bill at 50% of the input rate ($0.95 / 1M × 50% = $0.475 / 1M) while output tokens remain at the full rate. A prefix-cached agent loop (system prompt + tool definitions unchanged turn-to-turn) cuts input cost by up to half. Output dominates the bill, though, so even with all 500 input tokens cache-served the turn costs ~$0.00151 including the fee, or about 660 turns per dollar. Check the X-Cached-Input-Units response header to see how many input tokens were cache-served on each call.
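For checking these figures against your own token counts, here is a small calculator sketch. The rates and the 5% fee are taken from the model card; turn_cost and the constant names are hypothetical.

```python
INPUT_PER_M = 0.95      # USD per 1M input tokens (from the model card)
OUTPUT_PER_M = 4.00     # USD per 1M output tokens
CACHED_DISCOUNT = 0.50  # cached input bills at 50% of the input rate
PLATFORM_FEE = 0.05     # 5% added on top of the provider cost


def turn_cost(input_tokens: int, output_tokens: int,
              cached_input_tokens: int = 0) -> float:
    """Estimated USD cost of one call, platform fee included."""
    fresh = input_tokens - cached_input_tokens
    provider = (
        fresh * INPUT_PER_M
        + cached_input_tokens * INPUT_PER_M * CACHED_DISCOUNT
        + output_tokens * OUTPUT_PER_M
    ) / 1_000_000
    return provider * (1 + PLATFORM_FEE)


uncached = turn_cost(500, 300)
fully_cached = turn_cost(500, 300, cached_input_tokens=500)
print(f"{uncached:.6f} USD/turn, {1 / uncached:.0f} turns per dollar")
print(f"{fully_cached:.6f} USD/turn, {1 / fully_cached:.0f} turns per dollar")
```

Because output tokens cost roughly four times as much as input tokens, trimming max_tokens usually moves the total more than caching does.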

Limits

K2.6 is a full-catalog model — it opens once you top up any amount (the $5 signup credit covers the curated shortlist, where Kimi K2.7 Code is the Moonshot pick).

Paid (any topup): 600 requests / minute. Auto-promoted on first topup. See rate limits.
Enterprise: 30,000+ RPM under contract.

Benchmarks

Moonshot's published numbers against the common suite:

MMLU: ~86% — close to Claude Sonnet 4.5
HumanEval: ~93% — on par with GPT-5.4 on straightforward code
SWE-Bench: ~58% — meaningfully behind Opus 4.7 on hairy refactors, ahead of Haiku 4.5
Tool-use: strong on multi-step agent loops, clean argument JSON

In our own eyeballed coding eval across ~200 prompts, Kimi K2.6 is roughly Sonnet-tier on day-to-day agent work and noticeably behind Opus on multi-file architectural refactors.

Common errors

429 rate_limit_error — over the paid-tier 600 req/min. See rate limits or contact support for a higher ceiling.
402 insufficient_credits — no topup balance. K2.6 needs a paid balance (it's outside the $5 signup shortlist). Add credits.
503 service_unavailable — upstream saturation on Kimi. Retryable; default retry-after is 2 seconds.
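A sketch of client-side handling for these codes: wait and retry on 429 and 503, fail fast on everything else (including 402). GatewayError and call_with_retry are hypothetical names for illustration, and treating 429 as retryable after a wait is an assumption. The OpenAI Python SDK also has its own max_retries client option, which may be enough on its own.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

RETRYABLE_STATUS = {429, 503}


class GatewayError(Exception):
    """Hypothetical error type carrying an HTTP status and optional Retry-After."""

    def __init__(self, status: int, retry_after: Optional[float] = None):
        super().__init__(f"HTTP {status}")
        self.status = status
        self.retry_after = retry_after


def call_with_retry(fn: Callable[[], T], max_attempts: int = 4,
                    default_wait: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on 429/503 and re-raising anything else (e.g. 402)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except GatewayError as err:
            if err.status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            # Honor the server's hint when present, else fall back to 2 seconds.
            sleep(err.retry_after if err.retry_after is not None else default_wait)
    raise RuntimeError("unreachable")


# Demo with a fake call that fails twice with 503, then succeeds.
attempts = []
waits = []


def flaky() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise GatewayError(503)
    return "ok"


print(call_with_retry(flaky, sleep=waits.append))  # sleep is stubbed to record waits
```

Injecting sleep keeps the helper testable without real delays. In production code the default time.sleep applies.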
