What is a token in the context of AI APIs?
A token is the unit of text a model processes, roughly 4 characters or 0.75 words in English. Common words are usually one token; rarer or longer words may be two or three. Code, non-Latin scripts, and whitespace tokenise differently, so the rule of thumb is less reliable for them. Most API documentation links to a tokeniser tool so you can check exact counts for your inputs.
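As a rough illustration of the 4-characters-per-token rule of thumb, the sketch below estimates a token count from character length. The function name `estimate_tokens` is hypothetical and the ratio is only an approximation for English prose; a provider's tokeniser gives the exact count.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (assumes English prose)."""
    if not text:
        return 0
    # Round up so short fragments still count as at least one token.
    return math.ceil(len(text) / chars_per_token)

print(estimate_tokens("How much will this prompt cost to send?"))
```

For code or non-Latin text, pass a smaller `chars_per_token` or, better, count with the real tokeniser.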
Why are output tokens more expensive than input tokens?
Generating output requires the model to run a full forward pass for every token it produces, one token at a time. Reading input is cheaper because the whole prompt can be processed in a single parallel pass. Because output generation is more compute-intensive and slower, providers typically charge more per output token than per input token, often two to four times more, though the ratio varies by provider and model.
How do I find the per-token price for a specific model?
Check the pricing page of the API provider you are using. Prices change as providers optimise their infrastructure. For budgeting, use the current listed price and add a 10–20% buffer to account for price updates and model version changes over the budget period.
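The buffer arithmetic can be sketched as below. The function name `budget_with_buffer` and the 15% default are illustrative assumptions chosen from the middle of the 10–20% range.

```python
def budget_with_buffer(estimated_cost: float, buffer: float = 0.15) -> float:
    """Pad an estimate by a fractional buffer (0.15 means 15%)."""
    if buffer < 0:
        raise ValueError("buffer must be non-negative")
    return estimated_cost * (1 + buffer)

# Example: pad a $1,000 monthly estimate at both ends of the suggested range.
low = budget_with_buffer(1000.0, 0.10)
high = budget_with_buffer(1000.0, 0.20)
print(f"Budget range: ${low:.2f} to ${high:.2f}")
```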
How much does 1 million tokens cost?
It depends entirely on the model and provider. Most models price input and output separately, so the total for 1 million prompt tokens can be very different from the total for 1 million completion tokens. The safest approach is to take the current rates from the provider's pricing page, convert them to the unit the calculator expects if needed, and let the calculator scale the total for your request volume.
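To show why the answer differs for prompt and completion tokens, here is a minimal sketch. The two rates are placeholders invented for illustration, not any provider's real prices, and `token_cost` is a hypothetical helper.

```python
def token_cost(tokens: int, price_per_1m: float) -> float:
    """Cost in dollars for `tokens` at a price quoted per 1 million tokens."""
    return tokens / 1_000_000 * price_per_1m

# Placeholder rates, NOT real prices: replace with the provider's rate card.
INPUT_PRICE_PER_1M = 1.00
OUTPUT_PRICE_PER_1M = 4.00

prompt_cost = token_cost(1_000_000, INPUT_PRICE_PER_1M)
completion_cost = token_cost(1_000_000, OUTPUT_PRICE_PER_1M)
print(f"1M prompt tokens:     ${prompt_cost:.2f}")
print(f"1M completion tokens: ${completion_cost:.2f}")
```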
Should I enter prices per 1K tokens or per 1M tokens?
Use the calculator's per-1K inputs. If a provider publishes prices per 1 million tokens, divide that price by 1,000 before entering it. The page also shows the per-million equivalent beside the inputs so you can check that the converted rate still matches the provider's rate card.
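The conversion is a single division, sketched here with hypothetical helper names. The $3.00 figure is a made-up example rate, not a real price.

```python
def per_1m_to_per_1k(price_per_1m: float) -> float:
    """Convert a per-1M-token price to the per-1K price the calculator expects."""
    return price_per_1m / 1000

def per_1k_to_per_1m(price_per_1k: float) -> float:
    """Inverse conversion, useful for checking against the rate card."""
    return price_per_1k * 1000

# Example with a made-up rate of $3.00 per 1M tokens.
per_1k = per_1m_to_per_1k(3.00)
print(f"Enter ${per_1k:.4f} per 1K tokens")
```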
How many words are in 1,000 tokens?
A common rough estimate is about 750 English words, but the exact ratio varies with punctuation, code, short words, and the model's tokeniser. Non-English text can compress differently. Treat any tokens-to-words conversion as a planning shortcut, not a precise count.
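As a planning shortcut only, the 0.75 words-per-token estimate can be applied in either direction. The helper names below are hypothetical, and the ratio is an assumption that holds loosely for English prose.

```python
import math

WORDS_PER_TOKEN = 0.75  # rough English-prose estimate, not an exact figure

def tokens_to_words(tokens: int) -> float:
    """Approximate word count for a given token count."""
    return tokens * WORDS_PER_TOKEN

def words_to_tokens(words: int) -> int:
    """Approximate token count for a given word count, rounded up."""
    return math.ceil(words / WORDS_PER_TOKEN)

print(tokens_to_words(1000))   # about 750 words
print(words_to_tokens(1500))   # about 2,000 tokens
```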
What is prompt or context caching?
Prompt caching lets a provider reuse repeated prompt prefixes instead of reprocessing the same long instructions every time. That can lower latency and cost when the same system prompt, examples, or documents are sent repeatedly. If caching applies to your workflow, enter the share of input tokens you expect to be cached and the cached-token discount for that provider.
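One way to model the two caching inputs is a blended input cost, sketched below. It assumes the discount is expressed as a fraction off the normal input price (0.9 means cached tokens cost 10% of the usual rate) and ignores any extra charge a provider may apply for writing to the cache. The function name and example numbers are illustrative.

```python
def input_cost_with_caching(
    input_tokens: float,
    price_per_1k: float,
    cached_share: float,
    cached_discount: float,
) -> float:
    """Input cost when a share of tokens is billed at a discounted cached rate."""
    if not 0 <= cached_share <= 1 or not 0 <= cached_discount <= 1:
        raise ValueError("cached_share and cached_discount must be in [0, 1]")
    cached_tokens = input_tokens * cached_share
    uncached_tokens = input_tokens - cached_tokens
    cached_price = price_per_1k * (1 - cached_discount)
    return (uncached_tokens * price_per_1k + cached_tokens * cached_price) / 1000

# Example: 10,000 input tokens, half served from cache at a 90% discount.
full = input_cost_with_caching(10_000, 0.01, 0.0, 0.0)
blended = input_cost_with_caching(10_000, 0.01, 0.5, 0.9)
print(f"Without caching: ${full:.4f}, with caching: ${blended:.4f}")
```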
How do retries and tool calls affect AI API cost?
Retries, failed validation runs, tool-call loops, and agentic workflows can increase billable requests beyond the neat usage number in a product plan. A 10% retry or tool overhead means the calculator budgets for 10% more input and output tokens than the base workload. That makes the estimate more realistic for production LLM applications.
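The overhead adjustment described above amounts to scaling both token counts by the same factor. A minimal sketch, with a hypothetical function name and example workload:

```python
def apply_overhead(
    input_tokens: float, output_tokens: float, overhead: float = 0.10
) -> tuple[float, float]:
    """Scale base token counts by a retry/tool overhead (0.10 means 10% extra)."""
    if overhead < 0:
        raise ValueError("overhead must be non-negative")
    factor = 1 + overhead
    return input_tokens * factor, output_tokens * factor

# Example: a base workload of 1,000 input and 500 output tokens per request.
budgeted_in, budgeted_out = apply_overhead(1000, 500, overhead=0.10)
print(f"Budget for {budgeted_in:.0f} input and {budgeted_out:.0f} output tokens")
```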
How do I set a monthly budget for AI API usage?
Start with the maximum monthly amount you are willing to spend on inference for this feature, then compare the calculator's monthly spend, cost per request, and users supported within budget. If the workload exceeds the guardrail, test shorter outputs, better prompt caching, lower retry overhead, lower-cost model routing, or a lower request allowance before shipping the feature.
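The budget comparison can be sketched as follows. The function and field names are hypothetical, and the sketch assumes every user makes the same number of requests per month, which real traffic rarely does.

```python
import math

def budget_summary(
    monthly_budget: float,
    cost_per_request: float,
    requests_per_user: int,
    users: int,
) -> dict:
    """Compare projected monthly spend with a budget guardrail."""
    cost_per_user = cost_per_request * requests_per_user
    monthly_spend = cost_per_user * users
    # How many users the budget covers at this per-user cost.
    users_supported = (
        math.floor(monthly_budget / cost_per_user) if cost_per_user > 0 else None
    )
    return {
        "monthly_spend": monthly_spend,
        "cost_per_user": cost_per_user,
        "users_supported": users_supported,
        "within_budget": monthly_spend <= monthly_budget,
    }

# Example with made-up numbers: $500 budget, $0.25 per request, 4 requests per user.
print(budget_summary(500.0, 0.25, 4, 600))
```

If `within_budget` is false, rerun the summary with a lower `cost_per_request` or `requests_per_user` to see which lever brings the workload back under the guardrail.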
How do I compare two LLM models fairly?
Keep the workload assumptions identical, then change only the input price, output price, token lengths, and caching assumptions that differ by model. Compare total monthly spend, cost per request, cost per user, and output share of spend. If one model needs longer prompts or more retries to reach the same quality, include those extra tokens before deciding it is cheaper.
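The sketch below illustrates the last point: a model with half the per-token price can still cost more per request once longer prompts and extra retries are included. All names and numbers are invented for illustration, and caching is left out to keep the example short.

```python
from dataclasses import dataclass

@dataclass
class ModelAssumptions:
    name: str
    input_price_per_1k: float   # dollars per 1K input tokens (placeholder)
    output_price_per_1k: float  # dollars per 1K output tokens (placeholder)
    input_tokens: int           # prompt length needed to reach target quality
    output_tokens: int
    retry_overhead: float       # 0.5 means 50% extra tokens from retries

def cost_per_request(m: ModelAssumptions) -> float:
    """Input plus output cost for one request, scaled by retry overhead."""
    base = (
        m.input_tokens / 1000 * m.input_price_per_1k
        + m.output_tokens / 1000 * m.output_price_per_1k
    )
    return base * (1 + m.retry_overhead)

def output_share(m: ModelAssumptions) -> float:
    """Fraction of spend that comes from output tokens."""
    out = m.output_tokens / 1000 * m.output_price_per_1k
    inp = m.input_tokens / 1000 * m.input_price_per_1k
    return out / (inp + out)

model_a = ModelAssumptions("model-a", 0.004, 0.016, 1000, 500, 0.0)
# Half the price, but needs a 3x longer prompt and 50% more retries.
model_b = ModelAssumptions("model-b", 0.002, 0.008, 3000, 500, 0.5)

for m in (model_a, model_b):
    print(f"{m.name}: ${cost_per_request(m):.4f} per request, "
          f"output share {output_share(m):.0%}")
```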
Why might my real bill differ from the estimate?
Real bills change with retries, tool calls, cached tokens, long-context premiums, batch discounts, provider-side updates, and model routing choices. The calculator is accurate for the assumptions you enter, but it does not see your provider logs or account-specific discounts. Use it as a planning estimate, then reconcile it against actual usage reports.
Does this include vector DB or hosting costs?
No. This calculator focuses on token spend only. If your product also uses embeddings, vector search, file storage, queueing, hosting, or observability tools, add those outside the token estimate so you can see the full unit economics of the system.
When should I use a smaller model instead?
Use a smaller model when the task does not need frontier-level reasoning, tool use, or long-context capability. Simple drafting, classification, extraction, and routing tasks often work well on cheaper models. A smaller model can lower both input and output spend, which matters most when request volume is high.
How often do LLM prices change?
Providers update prices periodically, and the cheapest model today may not stay the cheapest model next month. For budgeting, check the official pricing page regularly and refresh your assumptions whenever you switch models or notice an announcement about a new rate card.