
Amazon Bedrock Pricing | Complete 2026 Guide (Every Model, Every Mode + Cost Optimization)

Gartner estimates global GenAI spending will hit $644 billion in 2025 — up 76.4% year over year. For teams building on Amazon Bedrock, that growth shows up directly on the AWS bill. The challenge isn’t accessing capable models. It’s understanding why Claude 3.5 Sonnet can cost $15 per 1 million output tokens while Nova Micro costs $0.14 — a 107× price difference — and knowing which one is right for your use case.

Amazon Bedrock pricing is complex because:

  • 30+ foundation models each have different pricing tiers
  • 5 billing modes with vastly different economics (on-demand, batch, provisioned throughput, prompt caching, fine-tuning)
  • Optional services like Knowledge Bases, Agents, and Guardrails add significant costs that are often overlooked
  • Prompt caching can reduce costs by up to 90% — but few teams use it effectively

This comprehensive guide covers all 5 Amazon Bedrock pricing modes, a complete 2026 model pricing table with every major foundation model, detailed breakdowns of Knowledge Bases costs, Bedrock Agents pricing, Guardrails expenses, and 5 proven cost optimization strategies. Whether you’re evaluating AWS Bedrock pricing against alternatives or optimizing your existing Bedrock cost structure, this guide provides the precise token counts, dollar figures, and real-world examples you need.

How Amazon Bedrock Pricing Works — The 4 Core Cost Drivers

Understanding Amazon Bedrock pricing requires grasping four fundamental cost drivers that determine your monthly AWS bill:

Driver 1 — Tokens: The Fundamental Billing Unit
For text models, tokens are the atomic unit of pricing. Approximately 1,000 tokens equals 750 words of English text. Amazon Bedrock bills input tokens (your prompt) and output tokens (the model’s response) separately. Output tokens typically cost 3-5× more than input tokens because generating text requires significantly more computation than processing it.
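The math behind separate input/output billing can be sketched as a small helper. The function name and scenario are illustrative (not an AWS API); the rates are the us-east-1 on-demand figures quoted later in this guide.

```python
def invocation_cost(input_tokens: int, output_tokens: int,
                    input_rate_per_1k: float, output_rate_per_1k: float) -> float:
    """On-demand USD cost of one invocation; rates are per 1,000 tokens."""
    return (input_tokens / 1000) * input_rate_per_1k \
         + (output_tokens / 1000) * output_rate_per_1k

# Claude 3.5 Haiku ($0.0008/1K input, $0.004/1K output):
# a ~750-word prompt (1,000 tokens) producing a 300-token answer
print(round(invocation_cost(1000, 300, 0.0008, 0.004), 4))  # → 0.002
```

Note how the 300 output tokens cost more than the 1,000 input tokens — the 3-5× output premium dominates for chat-style workloads.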

Driver 2 — Model Choice: The 100× Variable
Bedrock model pricing varies by more than 100× across different foundation models. Claude 3 Opus costs $0.075 per 1,000 output tokens; Nova Micro costs $0.00014 — a 535× difference. Choosing the right model for each specific task represents the single biggest cost optimization lever available.

Driver 3 — Pricing Mode: Five Distinct Billing Models
Bedrock offers five pricing modes: on-demand (most flexible, pay-per-token), batch inference (50% discount, asynchronous processing), provisioned throughput (reserved capacity with hourly charges), prompt caching (up to 90% savings on repeated inputs), and model customization (fine-tuning with ongoing storage and inference costs). Each mode has distinct economics.

Driver 4 — Additional Services: The Hidden Multipliers
Knowledge Bases, Bedrock Agents, Guardrails, and model evaluation each add their own charges on top of base inference costs. For RAG applications, OpenSearch Serverless storage alone can cost $345/month minimum — often exceeding the inference costs themselves.

Mode 1 — On-Demand Pricing (Per-Token, Pay-As-You-Go)

On-demand pricing is the most flexible Amazon Bedrock pricing model with no capacity commitments required. You’re charged separately per 1,000 input tokens AND per 1,000 output tokens processed. This AWS Bedrock pricing mode works best for variable workloads, development environments, and prototypes where usage patterns are unpredictable.

Cross-Region Inference: Amazon Bedrock supports routing traffic globally to avoid regional capacity constraints. The source region’s pricing applies with no additional charge for cross-region routing, making it a valuable availability feature without cost penalties.

Complete 2026 Model Pricing Table

The following table shows us-east-1 on-demand pricing as of February 2026. Always verify current rates at aws.amazon.com/bedrock/pricing as prices may vary by region and change over time.

| Provider | Model | Input $/1K tokens | Output $/1K tokens |
|---|---|---|---|
| Anthropic | Claude 3.7 Sonnet | $0.003 | $0.015 |
| Anthropic | Claude 3.5 Sonnet v2 | $0.003 | $0.015 |
| Anthropic | Claude 3.5 Haiku | $0.0008 | $0.004 |
| Anthropic | Claude 3 Opus | $0.015 | $0.075 |
| Anthropic | Claude 3 Haiku | $0.00025 | $0.00125 |
| Anthropic | Claude Instant | $0.0008 | $0.0024 |
| Amazon | Nova Premier | $0.0025 | $0.0125 |
| Amazon | Nova Pro | $0.0008 | $0.0032 |
| Amazon | Nova Lite | $0.00006 | $0.00024 |
| Amazon | Nova Micro | $0.000035 | $0.00014 |
| Amazon | Titan Text Express | $0.0008 | $0.0016 |
| Amazon | Titan Embeddings V2 | $0.00011 | — |
| Meta | Llama 3.3 70B Instruct | $0.00072 | $0.00072 |
| Meta | Llama 3.2 90B Vision | $0.002 | $0.002 |
| Meta | Llama 3.2 11B Vision | $0.00035 | $0.00035 |
| Meta | Llama 3.2 3B Instruct | $0.00015 | $0.00015 |
| Meta | Llama 3.2 1B Instruct | $0.0001 | $0.0001 |
| Mistral | Mistral Large 3 | $0.0050 | $0.0150 |
| Mistral | Magistral Small 1.2 | $0.0005 | $0.0015 |
| Mistral | Ministral 8B 3.0 | $0.00015 | $0.00015 |
| Mistral | Ministral 3B 3.0 | $0.0001 | $0.0001 |
| Cohere | Command R+ | $0.003 | $0.015 |
| Cohere | Command R | $0.0005 | $0.0015 |
| Cohere | Embed English v3 | $0.0001 | — |
| AI21 Labs | Jamba 1.5 Large | $0.002 | $0.008 |
| AI21 Labs | Jamba 1.5 Mini | $0.0002 | $0.0004 |
| Stability AI | Stable Image Ultra | — | $0.08/image |
| Stability AI | SDXL 1.0 | — | $0.04/image |

The most important observation from this table: Amazon Nova Micro costs $0.000035 per 1,000 input tokens. Claude 3 Opus costs $0.015 per 1,000 — 428× more expensive. For simple classification tasks, routing decisions, or structured data extraction, Nova Micro delivers comparable results at a fraction of the Bedrock cost. Many teams overspend by defaulting to premium models for tasks that smaller, cheaper foundation models handle perfectly well.

Mode 2 — Batch Inference (50% Off On-Demand)

Batch inference is Amazon Bedrock’s asynchronous processing mode that delivers 50% savings versus on-demand rates for the same foundation model. Instead of real-time API calls, you submit a JSONL file containing all prompts to Amazon S3, Bedrock processes them asynchronously, and returns results to your S3 bucket within 24 hours.

Supported Providers: Anthropic (Claude models), Amazon (Titan and Nova series), Cohere, Meta Llama, Mistral AI, and AI21 Labs all support batch inference pricing.

Cost Example:

  • Scenario: Monthly content processing workload requiring 10 million output tokens using Claude 3.5 Haiku
  • On-Demand Cost: 10,000,000 tokens × ($0.004/1,000 tokens) = $40.00/month
  • Batch Mode Cost: 10,000,000 tokens × ($0.002/1,000 tokens) = $20.00/month
  • Annual Savings: $240/year from a single workload

Best Use Cases for Batch Inference:

  • Bulk content generation (marketing copy, product descriptions)
  • Large-scale document summarization
  • Data enrichment pipelines (adding AI-generated metadata)
  • Sentiment analysis at scale across customer feedback
  • Batch embedding generation for retrieval systems
  • Offline model evaluation and benchmarking

Batch mode is not suitable for real-time user-facing applications, chatbots, or any workflow requiring immediate responses. The 24-hour turnaround window makes it exclusively valuable for asynchronous, scheduled workloads where AWS Bedrock pricing optimization matters more than latency.
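Batch jobs read a JSONL file where each line pairs a record ID with the model's native request body. A sketch of building that file for an Anthropic model follows; the field names match the batch-inference input format documented by AWS, but verify the `modelInput` schema for your specific model before submitting.

```python
import json

def batch_record(record_id: str, prompt: str, max_tokens: int = 400) -> str:
    """One line of the batch input file: recordId plus the model's native body."""
    return json.dumps({
        "recordId": record_id,
        "modelInput": {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        },
    })

docs = ["Summarize document 1 ...", "Summarize document 2 ..."]
jsonl = "\n".join(batch_record(f"doc-{i}", d) for i, d in enumerate(docs))

# Upload `jsonl` to S3, then start the job with the bedrock control-plane
# client's create_model_invocation_job(), pointing inputDataConfig at it.
```

Results land back in your S3 bucket as a matching JSONL file keyed by `recordId`, which makes joining outputs to source documents straightforward.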

Mode 3 — Provisioned Throughput (Reserved Capacity)

Provisioned Throughput in Amazon Bedrock works like EC2 Reserved Instances: you purchase model units (MUs) that guarantee a specific throughput level, billed hourly whether you use the capacity or not. This Bedrock pricing mode requires commitment terms of either 1-month or 6-month periods, with 6-month commitments offering lower hourly rates.

When Provisioned Throughput is Required:

  • Custom or fine-tuned model inference (cannot use on-demand for customized models)
  • Applications requiring guaranteed performance SLAs
  • High-volume production workloads with consistent, predictable traffic

Representative Provisioned Throughput Pricing:

| Provider | Model | 1-month $/hr/MU | 6-month $/hr/MU |
|---|---|---|---|
| Anthropic | Claude 3.5 Haiku | Contact AWS | Contact AWS |
| Meta | Llama 3.1 70B | $21.18 | ~$18.00 |
| Stability AI | SDXL 1.0 | $49.86 | $46.18 |

Note: Most text foundation models’ Provisioned Throughput pricing is available via AWS Console. Contact your AWS account team or check the Bedrock console for current rates specific to your model and region.

Break-Even Calculation:

Provisioned Throughput makes economic sense when: Your on-demand cost at 70% utilization > (Hourly MU price × 24 hours)

Example: If your on-demand spend would be $30/day at 70% capacity utilization, and the MU costs $1.00/hour, provisioned throughput costs $24/day — a 20% savings. However, if your utilization drops below 60%, you’re likely paying more for unused capacity than you’d save.

The AWS Bedrock pricing break-even point typically occurs around 80-85% sustained capacity utilization. Below that threshold, on-demand or batch modes usually offer better cost efficiency.
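The break-even rule above can be expressed as a one-line check. This is a simplification of the article's rule of thumb, not an AWS calculator: it compares projected daily on-demand spend against 24 hours of reserved model-unit cost.

```python
def provisioned_is_cheaper(daily_on_demand_usd: float,
                           mu_hourly_usd: float, model_units: int = 1) -> bool:
    """True when reserved model units undercut projected daily on-demand spend."""
    return mu_hourly_usd * 24 * model_units < daily_on_demand_usd

# The example above: $30/day projected on-demand vs. a $1.00/hour model unit
print(provisioned_is_cheaper(30.0, 1.00))  # $24/day reserved → True
print(provisioned_is_cheaper(20.0, 1.00))  # $24/day reserved → False
```

Run the check against realistic utilization forecasts, not peak traffic: overestimating sustained load is the most common way teams end up paying for idle model units.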

Mode 4 — Prompt Caching (Up to 90% Input Cost Reduction)

Prompt caching is the most underused and most powerful cost optimization feature in Amazon Bedrock pricing. Most teams never enable it. Teams that do can reduce input token costs by up to 90% for repeated context.

How Prompt Caching Works

  • Cache Point Creation: You designate a portion of your prompt as “cacheable” — typically system prompts, RAG-retrieved documents, few-shot examples, or any content that repeats across multiple requests.
  • First Request (Cache Write): The cacheable prompt portion is processed normally with a small write premium (~25% above standard input token cost) and a cache point is created.
  • Subsequent Requests (Cache Read): Cached content is retrieved instead of reprocessed, costing 90% less than standard input tokens.
  • Cache TTL: Cache entries remain valid for 5 minutes, with the TTL refreshed on each cache hit.

Prompt Caching Pricing

  • Cache Write (first use): ~$0.00375 per 1,000 tokens (25% premium vs. on-demand for supported models)
  • Cache Read (subsequent uses): ~$0.0003 per 1,000 tokens (90% discount vs. on-demand)
  • Supported Models: Claude 3.7 Sonnet, Claude 3.5 Sonnet v2, Claude 3.5 Haiku, and select other Anthropic models

Cost Comparison Example: Customer Service Chatbot

Scenario Parameters:

  • System prompt: 2,000 tokens (identical for every conversation)
  • User message: 100 tokens (varies per request)
  • Monthly volume: 100,000 customer conversations
  • Model: Claude 3.5 Haiku ($0.0008/1K input tokens)

WITHOUT Prompt Caching:

  • Total input tokens: 100,000 conversations × 2,100 tokens = 210,000,000 tokens
  • Cost: 210,000,000 tokens × ($0.0008/1,000 tokens) = $168.00/month

WITH Prompt Caching (95% cache hit rate):

  • Cache writes (5% miss rate): 5,000 writes × 2,000 tokens = 10M tokens × $0.001/1K = $10.00
  • Cache reads (95% hit rate): 95,000 reads × 2,000 tokens = 190M tokens × $0.00008/1K = $15.20
  • Fresh tokens (user messages): 100,000 × 100 tokens = 10M tokens × $0.0008/1K = $8.00
  • Total with caching: $33.20/month

SAVINGS: $168.00 – $33.20 = $134.80/month (80% reduction)
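The caching arithmetic above generalizes to a small cost model. It assumes the 25% write premium and 90% read discount quoted earlier; the function itself is illustrative, not an AWS API.

```python
def caching_monthly_cost(requests: int, cached_tokens: int, fresh_tokens: int,
                         hit_rate: float, input_rate_per_1k: float) -> float:
    """Monthly input-token cost with prompt caching enabled.
    Assumes a 25% cache-write premium and a 90% cache-read discount."""
    write_rate = input_rate_per_1k * 1.25   # cache miss: write premium
    read_rate = input_rate_per_1k * 0.10    # cache hit: 90% discount
    writes = requests * (1 - hit_rate) * cached_tokens / 1000 * write_rate
    reads = requests * hit_rate * cached_tokens / 1000 * read_rate
    fresh = requests * fresh_tokens / 1000 * input_rate_per_1k
    return writes + reads + fresh

# The chatbot example above: 100K conversations, a 2,000-token cached system
# prompt, 100-token user messages, 95% hit rate, Claude 3.5 Haiku input rate
print(round(caching_monthly_cost(100_000, 2000, 100, 0.95, 0.0008), 2))  # → 33.2
```

Varying `hit_rate` shows how sensitive the savings are to cache expiry: with the 5-minute TTL, bursty traffic caches far better than sparse traffic.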

Best Use Cases for Prompt Caching

  • RAG Applications: When retrieved context documents remain consistent across queries
  • Customer Service Bots: Fixed system prompts defining behavior and guidelines
  • Multi-Turn Conversations: Long conversation history passed with each turn
  • Code Review Assistants: Large codebase context loaded for every review
  • Legal/Compliance Applications: Regulatory document context referenced repeatedly
  • Few-Shot Learning: Examples and instruction templates used consistently

Prompt caching transforms the economics of context-heavy applications. For any workload where you pass the same content repeatedly, enabling prompt caching should be your first Bedrock cost optimization action.
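In practice, caching is enabled by placing a cache point after the static prefix of your prompt. The sketch below shows the Converse API's system-block shape for supported Anthropic models; treat the exact block format as something to verify against current Bedrock documentation for your model ID.

```python
def cached_system(static_prompt: str) -> list:
    """System blocks with a cachePoint marking the preceding text as cacheable."""
    return [
        {"text": static_prompt},              # identical on every request
        {"cachePoint": {"type": "default"}},  # everything before this is cached
    ]

system = cached_system("You are a support agent for ExampleCo. Policies: ...")

# client.converse(modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
#                 system=system, messages=[...])
```

Keep variable content (user messages, per-request context) after the cache point; anything before it must be byte-identical across requests for cache hits to occur.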

Mode 5 — Model Customization (Fine-Tuning)

Amazon Bedrock supports model customization through two approaches: fine-tuning (training on labeled data to specialize for your task) and continued pretraining (training on unlabeled domain data for knowledge injection). Customization is currently supported for Amazon Titan, Cohere, and Meta Llama foundation models.

Model Customization Costs

  • Training: Charged per token processed during training (rates vary by base model — check AWS Console for current pricing)
  • Model Storage: $0.02–$0.10 per GB per month for custom model weights
  • Inference: Requires Provisioned Throughput (cannot use on-demand pricing for custom models)
  • Initial Evaluation Period: One model unit is provided without a term commitment during the initial testing phase

When to Fine-Tune vs. Prompt Engineer

Choose Fine-Tuning When:

  • The base foundation model consistently fails at your specific task format despite prompt engineering
  • Quality improvements justify the significantly higher cost and complexity
  • You have very high-volume consistent workloads (fine-tuned models on Provisioned Throughput become cost-effective at scale)
  • Your use case requires specialized domain knowledge not present in base models

Choose Prompt Engineering First:

  • Cheaper and faster to iterate (no training costs, immediate deployment)
  • Sufficient for 90% of use cases with modern large language models
  • Allows A/B testing different approaches without commitment
  • Easier to maintain and update as requirements change

For most teams, extensive prompt engineering, retrieval-augmented generation (RAG), and careful model selection deliver better ROI than fine-tuning. Reserve fine-tuning for truly specialized applications where base models demonstrably fail.

Additional Bedrock Service Costs

Beyond base foundation model inference, Amazon Bedrock offers integrated services that add significant costs often overlooked in initial budgeting. Understanding these additional charges is critical for accurate Bedrock pricing projections.

Knowledge Bases (RAG)

Bedrock Knowledge Bases enable retrieval-augmented generation by storing your documents in a searchable vector database. This service has three distinct cost components:

  1. Ingestion (Embedding Generation):
  • Amazon Titan Embeddings V2: $0.00011 per 1,000 input tokens
  • Cohere Embed English v3: $0.0001 per 1,000 input tokens
  • Example: 10,000 pages at ~500 words (~670 tokens) each = 6.7M tokens × $0.00011/1K = $0.74 (one-time ingestion cost)
  2. Vector Storage (Amazon OpenSearch Serverless):
  • OpenSearch Capacity Unit (OCU) pricing: $0.24/OCU/hour
  • Separate charges for indexing OCUs and search OCUs
  • Minimum: 2 OCUs required = $0.48/hour = $345.60/month base cost
  • Alternative: Consider Pinecone, Weaviate, or pgvector on Aurora Serverless for small-to-medium Knowledge Bases (often 70-80% cheaper)
  3. Query (Retrieval): Each RAG query incurs:
  • Embedding cost for the user question
  • OpenSearch search OCU capacity
  • Input token cost for retrieved context chunks
  • Output token cost for the generated response

Practical RAG Cost Example:

  • Volume: 10,000 queries/day to a Knowledge Base
  • Per Query: Embed question (200 tokens) + retrieve context (3,000 tokens) + respond (500 output tokens with Claude Haiku)
  • Daily Costs:
    • Question embedding: 10K × 200 tokens × $0.00011/1K = $0.22
    • Context input tokens: 10K × 3,000 × $0.0008/1K = $24.00
    • Output generation: 10K × 500 × $0.004/1K = $20.00
    • Daily Total: $44.22 (~$1,327/month, excluding OpenSearch storage)
  • Monthly with OpenSearch: $1,327 + $345.60 = $1,672.60/month
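The per-query arithmetic above can be wrapped into a reusable estimator. The function is a sketch using this article's figures; token rates are per 1K and the storage fee is a flat monthly charge.

```python
def rag_monthly_cost(queries_per_day: int, embed_tokens: int, ctx_tokens: int,
                     out_tokens: int, embed_rate: float, input_rate: float,
                     output_rate: float, storage_monthly: float,
                     days: int = 30) -> float:
    """Monthly RAG cost: query embedding + context input + generation + storage."""
    per_day = (queries_per_day * embed_tokens / 1000 * embed_rate
               + queries_per_day * ctx_tokens / 1000 * input_rate
               + queries_per_day * out_tokens / 1000 * output_rate)
    return per_day * days + storage_monthly

# The example above: 10K queries/day with Titan Embeddings V2 + Claude 3.5
# Haiku, plus the 2-OCU OpenSearch Serverless minimum of $345.60/month
print(round(rag_monthly_cost(10_000, 200, 3000, 500,
                             0.00011, 0.0008, 0.004, 345.60), 2))  # ≈ 1672.2
```

Swapping `storage_monthly` for a cheaper vector store (e.g. pgvector on Aurora Serverless) shows immediately how much of the bill is infrastructure rather than inference.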

Key Insight: OpenSearch Serverless represents your biggest RAG infrastructure cost at $345/month minimum. For small Knowledge Bases (<100,000 documents), evaluate pgvector on Aurora Serverless v2 ($0.06/ACU-hour) versus OpenSearch ($0.48/OCU-hour) — potential 87% storage cost reduction.

Bedrock Agents

Bedrock Agents orchestrate multi-step tasks by reasoning about which tools to invoke, calling external APIs, querying databases, performing calculations, and synthesizing final responses. Agents use foundation models for orchestration, so pricing follows token-based models.

Agent Cost Multiplier: Each agent “step” generates both input and output tokens for:

  • Orchestration reasoning (deciding the next action)
  • Tool call preparation (formatting API requests)
  • Response synthesis (writing the final answer to the user)

An agent completing a task with 5 tool calls consumes approximately 5× the tokens of a single direct model invocation. If a simple API call costs $0.01 in tokens, the same call through an agent orchestration layer might cost $0.05-$0.08.

Action Groups: Lambda function charges apply for each tool execution ($0.20 per 1M requests + $0.0000166667 per GB-second compute).

Bedrock Guardrails

Guardrails filter harmful content, detect personally identifiable information (PII), identify hallucinations, and block off-topic responses. Guardrails pricing is $0.15 per 1,000 text units processed (reduced from $0.75 in late 2024 — an 80% price cut).

Text Unit Definition: Approximately 1,000 characters of input or output passed through Guardrails filters.

Cost Example:

  • Volume: 100 million API calls per month
  • Guardrails Cost: 100,000,000 text units ÷ 1,000 × $0.15 = $15,000/month (assuming one text unit per call)

Important: While Guardrails pricing decreased significantly, enabling them on every API call for high-volume applications still adds substantial cost. Use Guardrails selectively:

  • ✅ User-facing chatbots (safety critical)
  • ✅ Content generation for public consumption
  • ❌ Internal classification tasks
  • ❌ Simple lookup queries with structured responses

Blocked Requests: No charge for requests blocked or denied by Guardrails (you only pay when content passes through filters).
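The selective-use recommendation is easy to quantify. The helper below uses the quoted $0.15 rate; the 30% user-facing share is an illustrative assumption, not a benchmark.

```python
def guardrails_cost(text_units: int, rate_per_1k_units: float = 0.15) -> float:
    """Monthly Guardrails charge; one text unit is roughly 1,000 characters."""
    return text_units / 1000 * rate_per_1k_units

all_traffic = guardrails_cost(100_000_000)   # filter every call
selective = guardrails_cost(30_000_000)      # filter the user-facing 30% only
print(round(all_traffic), round(selective))  # → 15000 4500
```

Restricting Guardrails to safety-critical paths preserves most of the protection while cutting the filtering bill by the share of traffic you exclude.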

Model Evaluation

Amazon Bedrock offers automatic and human-based model evaluation:

Automatic Evaluation:

  • No separate evaluation fee
  • Pay only for inference tokens during evaluation runs

Human-Based Evaluation:

  • $0.21 per completed task (one human worker reviewing one prompt/response pair)
  • Plus standard inference costs for generating responses
  • Billed under Amazon SageMaker line items in AWS Cost Explorer

Data Transfer

Within Same Region: Free (no data transfer charges)

Cross-Region: Standard AWS data transfer rates apply ($0.02/GB out to other AWS regions, $0.09/GB out to internet)

Best Practice: Keep your Bedrock endpoint in the same AWS region as your application to avoid all data transfer charges. Use VPC endpoints for private connectivity, which eliminates NAT Gateway charges ($0.045/hour + $0.045/GB processed).

Model Selection Decision Framework

Choosing the right foundation model for each use case is the single highest-impact AWS Bedrock pricing optimization. This framework helps match models to tasks based on quality requirements and cost constraints.

| Use Case | Recommended Model | Rationale | Approx $/1M output tokens |
|---|---|---|---|
| Simple classification / routing | Nova Micro | Cheapest capable model for structured tasks | $0.14 |
| Customer support chatbot | Claude 3.5 Haiku | Best quality-to-cost ratio for conversations | $4.00 |
| Content generation (bulk) | Llama 3.3 70B (batch) | Open weights, batch discount, good quality | $0.72 |
| Complex reasoning / coding | Claude 3.5 Sonnet | Best overall quality for difficult tasks | $15.00 |
| Vision / multimodal | Nova Pro or Claude 3.5 Haiku | Strong vision capabilities at reasonable cost | $3.20–$4.00 |
| Embedding generation | Titan Embeddings V2 | Cheapest embedding model on Bedrock | $0.11 |
| Long document analysis | Claude 3.7 Sonnet | 200K context window, best comprehension | $15.00 |
| Real-time streaming chat | Nova Lite | Ultra-fast, very low latency | $0.24 |
| ML/data processing (batch) | Nova Micro (batch) | 50% batch discount on already cheap model | $0.07 |

The ROI Test: Before choosing a premium foundation model, benchmark your specific task with Nova Micro or Llama 3.3 70B first. If the quality is sufficient, you’re potentially spending 10–100× more than necessary by defaulting to Claude Sonnet or other expensive models. Many teams discover that 60-70% of their use cases work perfectly well with budget models.
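The "start cheap, escalate only on failure" rule can be encoded as a trivial router. The tiers, model names, and price points below are this article's recommendations, not an AWS API.

```python
# Task tier → (recommended model, $ per 1M output tokens), cheapest first
ROUTES = {
    "classification": ("Nova Micro", 0.14),
    "chat": ("Claude 3.5 Haiku", 4.00),
    "reasoning": ("Claude 3.5 Sonnet", 15.00),
}

def pick_model(task: str) -> tuple:
    """Cheapest recommended model for a task tier; unknown tiers go premium."""
    return ROUTES.get(task, ROUTES["reasoning"])

model, per_million = pick_model("classification")
print(model, per_million)  # → Nova Micro 0.14
```

Even this crude routing — classifying each request once, then dispatching to a tier — captures most of the 10-100× spread between budget and premium models.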

3 Real-World Bedrock Cost Scenarios

Understanding Amazon Bedrock pricing theory is one thing; seeing it applied to actual workloads provides actionable context.

Scenario 1 — Customer Support Chatbot

Setup:

  • 100,000 customer conversations per month
  • Average 1,500 input tokens per turn (includes a 1,000-token system prompt with instructions and company context)
  • Average 300 output tokens per response
  • Model: Claude 3.5 Haiku ($0.0008/1K input, $0.004/1K output)
  • Prompt caching enabled on the 1,000-token system prompt portion

Without Prompt Caching:

  • Input cost: 100,000 × 1,500 tokens × $0.0008/1K = $120.00
  • Output cost: 100,000 × 300 tokens × $0.004/1K = $120.00
  • Total: $240.00/month

With Prompt Caching (90% cache hit rate on system prompt):

  • Cache reads (90% hit rate): 90,000 × 1,000 tokens = 90M tokens × $0.00008/1K = $7.20
  • Cache writes (10% miss rate): 10,000 × 1,000 tokens = 10M tokens × $0.001/1K = $10.00
  • Fresh portion: 100,000 × 500 tokens × $0.0008/1K = $40.00
  • Output cost: $120.00 (unchanged)
  • Total with caching: ~$177.00/month (26% savings)

Key Takeaway: Even with only 1,000 tokens cached (not the entire prompt), caching delivers a 26% cost reduction. With larger cached contexts (5K–10K tokens for RAG applications), savings approach 70-80%.

Scenario 2 — Bulk Document Summarization (Batch Mode)

Setup:

  • 500,000 documents summarized monthly
  • Average 3,000 input tokens, 400 output tokens per document
  • Model: Llama 3.3 70B Instruct ($0.00072/1K input and output)

On-Demand Pricing:

  • Input cost: 500,000 × 3,000 × $0.00072/1K = $1,080.00
  • Output cost: 500,000 × 400 × $0.00072/1K = $144.00
  • Total: $1,224.00/month

Batch Mode (50% discount):

  • Total: $612.00/month
  • Annual savings vs. on-demand: $7,344

Key Takeaway: For non-real-time, scheduled workloads, batch inference immediately cuts costs in half with zero quality trade-offs. The only cost is development time to implement asynchronous S3-based workflows.

Scenario 3 — RAG Application (Knowledge Base + Chat)

Setup:

  • Knowledge Base: 5,000 documents, re-indexed weekly (full refresh)
  • 50,000 RAG queries per month
  • Per query: 200 tokens (embed question) + 3,000 tokens (retrieved context) + 500 output tokens
  • Model: Claude 3.5 Haiku for generation, Titan Embeddings V2 for embeddings

Monthly Cost Breakdown:

  • Embedding (weekly re-index): 5,000 docs × 500 tokens × 4 weeks = 10M tokens × $0.00011/1K = $1.10
  • Query embedding: 50,000 × 200 tokens × $0.00011/1K = $1.10
  • RAG input tokens: 50,000 × 3,000 × $0.0008/1K = $120.00
  • Output tokens: 50,000 × 500 × $0.004/1K = $100.00
  • OpenSearch Serverless (minimum 2 OCUs): $345.60
  • Total: ~$568.00/month

Key Insight: OpenSearch Serverless is your biggest RAG infrastructure cost at $345/month minimum — significantly exceeding the $222 in actual inference costs. For Knowledge Bases under 100,000 documents, evaluate vector database alternatives like pgvector on Aurora Serverless v2 (costs $0.06/ACU-hour vs. $0.48/OCU-hour for OpenSearch), which could reduce storage costs from $345/month to $50–$80/month for small-to-medium datasets.

5 Bedrock Cost Optimization Strategies

Implementing these five strategies can reduce Amazon Bedrock pricing costs by 50-70% for typical production workloads without sacrificing quality.

1. Enable Prompt Caching for ALL Production Workloads

Any repeated content — system prompts, RAG context, few-shot examples, conversation history — should be cached. The 90% discount on cached input tokens pays back immediately with zero quality impact. For customer service bots, documentation assistants, and RAG applications, this single change often delivers 30-50% cost reduction.

2. Start with the Cheapest Capable Model

Always benchmark Nova Micro ($0.14/1M output tokens) first. If quality is insufficient, try Nova Lite ($0.24/1M). Then Claude 3.5 Haiku ($4.00/1M). Only escalate to premium foundation models (Claude 3.5 Sonnet at $15/1M, Nova Premier at $12.50/1M) when cheaper options demonstrably fail your quality requirements. Most teams discover that 50-60% of use cases work perfectly well with budget models.

3. Use Batch Mode for Non-Real-Time Workflows

Batch inference delivers 50% savings versus on-demand rates. Schedule batch jobs for content generation, data enrichment, document summarization, evaluation pipelines, and any asynchronous workload. The 24-hour turnaround is acceptable for the vast majority of internal processing tasks.

4. Set MaxTokens to Cap Output Length

Output tokens cost 3-5× more than input tokens. Set explicit MaxTokens parameters in every API call to prevent runaway generation. Even if only 1-2% of requests generate excessive outputs, they can inflate monthly costs by 10-20%. For fixed-format responses (JSON, classifications), set tight token limits.
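A quick comparison shows what a cap is worth. The rates and token counts are illustrative; the `inferenceConfig` field name follows the Converse API request shape shown commented out.

```python
def output_cost(tokens: int, rate_per_1k: float) -> float:
    """USD cost of generated tokens at a per-1K output rate."""
    return tokens / 1000 * rate_per_1k

# Claude 3.5 Haiku output at $0.004/1K: a runaway 4,096-token response vs.
# a 100-token cap on a fixed-format classification reply
runaway = output_cost(4096, 0.004)
capped = output_cost(100, 0.004)
print(round(runaway, 4), round(capped, 4))  # → 0.0164 0.0004

# Enforce the cap per request via the Converse API:
# client.converse(modelId=..., messages=messages,
#                 inferenceConfig={"maxTokens": 100})
```

A 40× per-request difference like this is why a handful of uncapped requests can dominate an otherwise well-tuned bill.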

5. Optimize Prompt Length

Every unnecessary word in your prompt costs money on every invocation. Compress system prompts, eliminate redundant context, use abbreviations where appropriate, and employ role-based prompt compression techniques. Reducing a 3,000-token system prompt to 1,500 tokens cuts input costs by 50% before any other optimization.

Amazon Bedrock vs Azure OpenAI vs Google Vertex AI

For teams evaluating cloud providers for generative AI workloads, pricing comparisons across AWS Bedrock, Azure OpenAI Service, and Google Vertex AI inform platform decisions.

Pricing Comparison (High-End Models, Feb 2026)

| Service | Provider | Comparable Model | Input $/1M | Output $/1M |
|---|---|---|---|---|
| Amazon Bedrock | Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 |
| Azure OpenAI | OpenAI | GPT-4o | $2.50 | $10.00 |
| Google Vertex AI | Google | Gemini 1.5 Pro | $1.25 | $5.00 |
| Amazon Bedrock | Amazon | Nova Pro | $0.80 | $3.20 |

 

Frequently Asked Questions

Q1: What is Amazon Bedrock pricing based on?

Answer: It’s based on tokens, model choice, billing mode, and optional services like Knowledge Bases, Agents, and Guardrails.

Q2: Which Bedrock model is cheapest?

Answer: Nova Micro is the most cost-effective for simple classification and structured tasks.

Q3: How can I save costs on repeated prompts?

Answer: Enable prompt caching to reduce input token costs by up to 90%.

Q4: When should I use batch inference?

Answer: Use it for non-real-time workloads like bulk content generation or document summarization to save 50% versus on-demand.

Q5: Do additional services like Knowledge Bases add cost?

Answer: Yes — OpenSearch Serverless storage, Bedrock Agents, and Guardrails can significantly increase your monthly bill.

Conclusion

Amazon Bedrock pricing in 2026 presents opportunities for both cost efficiency and performance optimization. By carefully selecting foundation models, leveraging on-demand vs batch inference, and applying prompt caching or model customization strategically, organizations can control expenses while maximizing AI capabilities. Teams seeking a holistic cloud cost perspective should also consider AWS API Gateway Pricing to manage API and model costs together for a clearer view of overall architecture expenses. At GoCloud, we guide organizations in implementing these strategies to ensure optimized, predictable, and efficient cloud deployments.
