How to Build a Private ChatGPT on AWS |The Complete Production Architecture Guide

Ahmad
June 4, 2026

18 min read

Share this post

Ahmad
June 4, 2026

18 min read

Share this post

How to Build a Private ChatGPT on AWS is becoming a priority for many organizations because ChatGPT Enterprise costs $30–60 per user per month. For a 500-person company, that translates to $180,000–$360,000 annually—before considering data privacy requirements, limited control over model behavior, and the challenges of grounding responses in proprietary business knowledge. A private AI solution on AWS offers greater control, security, and long-term cost efficiency.

The more serious problem is data. When employees use ChatGPT for work, they paste meeting notes, customer data, source code, legal documents, and financial projections into a third-party system. OpenAI’s enterprise tier promises data privacy, but your data still leaves your infrastructure. For regulated industries — healthcare, finance, legal, government — this is a compliance failure waiting to happen.

Building a private ChatGPT on AWS with Amazon Bedrock solves all of these simultaneously. Your data stays within your AWS VPC. Model queries are never used for training. User access is controlled through your existing IAM and Cognito setup. Costs scale with usage rather than headcount. And the AI can actually know your business — your internal documents, your policies, your codebases — through a properly built RAG pipeline.

What You Will Have After This Guide

A fully private AI chat interface accessible only to authenticated users in your organization
Claude (or Llama, Titan, or your model of choice) running through Amazon Bedrock — no model infrastructure to manage
A RAG pipeline grounding responses in your internal knowledge base stored in Amazon OpenSearch
IAM roles with least-privilege access so no component has broader permissions than required
Bedrock Guardrails preventing harmful outputs and enforcing content policies
CloudWatch monitoring for latency, token usage, cost, and error rates
A production-ready architecture that passes security review for HIPAA, GDPR, and SOC 2 requirements

Architecture Overview: How a Private ChatGPT on AWS Actually Works

Before writing a single line of code, understand the complete architecture. Building a private AI chat system on AWS involves five distinct layers that must be designed cohesively getting one wrong undermines the others.

Layer 1: Foundation Model Access Amazon Bedrock

Amazon Bedrock is the managed foundation model service that replaces the need to host, scale, or maintain any AI model infrastructure. It provides API access to frontier models — Anthropic Claude (Opus, Sonnet, Haiku), Meta Llama, Amazon Titan, Cohere, and others — through a single unified API endpoint within your AWS account.

Key architectural properties of Bedrock that matter for a private system:

Data processed through Bedrock is not used to train any underlying model by default
All API calls stay within your AWS VPC when using VPC endpoints — data never traverses the public internet
Bedrock supports HIPAA, SOC 2, ISO 27001, and is FedRAMP authorized for government workloads
Model access is controlled through IAM the same identity system governing your entire AWS account

Layer 2: Knowledge Base — RAG with Amazon OpenSearch

A private ChatGPT that only knows what the foundation model was trained on is only marginally better than public ChatGPT. The architectural differentiator is Retrieval Augmented Generation (RAG): your internal documents, wikis, policies, and codebases are indexed and searched at query time, with relevant content injected into the model context before generating a response.

Amazon OpenSearch Serverless is the recommended vector store for most organizations. OpenSearch indexes your document embeddings, performs semantic similarity search at query time, and returns the most relevant document chunks to include in the Bedrock context window. Alternatives include Aurora PostgreSQL with pgvector for teams with existing RDS infrastructure, or Amazon Kendra for document-heavy knowledge bases that require enterprise search features.

Layer 3: Orchestration AWS Lambda and Bedrock Agents

Lambda functions serve as the orchestration layer between your frontend, your knowledge base, and Bedrock. A typical request flow:

User sends a message through the frontend
API Gateway routes the request to an orchestration Lambda function
Lambda generates an embedding of the user’s query using Bedrock’s embedding model
Lambda queries OpenSearch for the most semantically similar document chunks
Lambda constructs the full prompt: system context + retrieved documents + conversation history + user message
Lambda calls the Bedrock InvokeModel API with the constructed prompt
Bedrock returns the model response through the Lambda, which streams it back to the frontend

Layer 4: Security and Identity — IAM, Cognito, VPC

Security is not a feature you add after the system works — it is an architectural constraint that shapes every component decision. Three AWS services form the security backbone:

Amazon Cognito: User authentication and authorization. Cognito User Pools manage your employee identities, and Cognito Identity Pools exchange authenticated tokens for temporary IAM credentials with scoped permissions
IAM Roles: Each component (Lambda, EC2, ECS task) has its own IAM role with only the permissions it needs. The Lambda orchestrator has bedrock:InvokeModel and opensearch:ESHttpPost. Nothing has broader access
VPC and Private Endpoints: Bedrock and OpenSearch are accessed through VPC interface endpoints, keeping all traffic on AWS’s private network. No traffic exits to the public internet

Layer 5: Frontend and API — Amplify, API Gateway, CloudFront

The frontend layer serves your chat interface to authenticated users. AWS Amplify provides the simplest deployment path for a React or Next.js chat UI with built-in Cognito authentication integration. API Gateway manages the REST or WebSocket API that connects the frontend to your Lambda orchestrator. CloudFront sits in front for caching static assets and enforcing HTTPS.

How to Build a Private ChatGPT on AWS: Step-by-Step Implementation

Step 1: Enable Amazon Bedrock and Request Model Access

Bedrock does not grant access to all models automatically. You must explicitly request access for each model in each AWS region you plan to use.

Open the Amazon Bedrock console in your target AWS region (us-east-1 and us-west-2 have the broadest model availability in 2026)
Navigate to Model Access in the left navigation
Request access to: Anthropic Claude Sonnet (for production chat — optimal cost/quality balance), Claude Haiku (for intent classification and routing), and Amazon Titan Embeddings V2 (for generating document and query embeddings)
Access approval is typically instant for Titan and within minutes for Claude — no human review required for standard tiers

Step 2: Configure IAM Roles with Least Privilege

Create dedicated IAM roles for each component before writing any application code. Retrofitting IAM permissions after deployment is error-prone and creates security gaps.

Required IAM roles and their permissions:

IAM Role: private-chatgpt-lambda-role

# Permissions required for the orchestration Lambda

{

“Version”: “2012-10-17”,

“Statement”: [

{

“Effect”: “Allow”,

“Action”: [

“bedrock:InvokeModel”,

“bedrock:InvokeModelWithResponseStream”

“Resource”: “arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-*”

{

“Effect”: “Allow”,

“Action”: [“es:ESHttpPost”, “es:ESHttpGet”],

“Resource”: “arn:aws:es:us-east-1:ACCOUNT:domain/private-chatgpt-kb/*”

{

“Effect”: “Allow”,

“Action”: [“logs:CreateLogGroup”, “logs:CreateLogStream”, “logs:PutLogEvents”],

“Resource”: “arn:aws:logs:*:*:*”

}

]

}

Step 3: Build the Document Ingestion Pipeline

Your knowledge base is only as good as your document ingestion pipeline. This pipeline processes raw documents — PDFs, Word files, Confluence pages, Notion exports, Markdown files, code repositories — into searchable vector embeddings stored in OpenSearch.

Document ingestion architecture:

Source documents stored in an S3 bucket with versioning enabled
S3 event notifications trigger an ingestion Lambda on new document uploads
The ingestion Lambda extracts text from each document format (using Textract for PDFs, direct parsing for Markdown/HTML)
Text is chunked into overlapping segments of 512-1024 tokens (overlap of 10-20% improves retrieval quality at chunk boundaries)
Each chunk is embedded using Bedrock’s Titan Embeddings V2 model to produce a 1536-dimensional vector
Chunk text, metadata (source document, section, timestamp, author), and embedding vector are indexed in OpenSearch

Python — Document Chunking and Embedding

import boto3, json

from opensearchpy import OpenSearch, RequestsHttpConnection

bedrock = boto3.client(‘bedrock-runtime’, region_name=’us-east-1′)

def embed_text(text: str) -> list[float]:

response = bedrock.invoke_model(

modelId=’amazon.titan-embed-text-v2:0′,

body=json.dumps({‘inputText’: text})

)

return json.loads(response[‘body’].read())[’embedding’]

def chunk_text(text: str, chunk_size=800, overlap=100) -> list[str]:

words = text.split()

chunks = []

for i in range(0, len(words), chunk_size – overlap):

chunks.append(‘ ‘.join(words[i:i+chunk_size]))

return chunks

def index_document(doc_text: str, metadata: dict, os_client):

for i, chunk in enumerate(chunk_text(doc_text)):

os_client.index(

index=’private-chatgpt-kb’,

body={

‘text’: chunk,

’embedding’: embed_text(chunk),

‘metadata’: metadata,

‘chunk_id’: i

}

)

Step 4: Build the Query Orchestration Lambda

The orchestration Lambda is the brain of your private ChatGPT. It receives user messages, retrieves relevant context, constructs the prompt, calls Bedrock, and returns responses. This is where most of the engineering effort concentrates.

Key design decisions in the orchestration Lambda:

Conversation history management: Store conversation history in DynamoDB keyed by session ID. Include the last 5-10 turns in every Bedrock call — more than 10 turns risks exceeding context limits and increases cost without proportional quality improvement
Context window budgeting: Reserve approximately 30% of the context window for the model’s response. Allocate the remainder across: system prompt (10%), conversation history (30%), and retrieved documents (30%). Adjust based on average document chunk size
Prompt construction: System prompt defines the assistant’s identity and behavioral constraints. Retrieved document chunks are injected with their source metadata so the model can cite sources in responses
Streaming responses: Use InvokeModelWithResponseStream for chat interfaces — users see text appear progressively rather than waiting for the full response, dramatically improving perceived performance

Python — Orchestration Lambda Core Logic

import boto3, json

from opensearchpy import OpenSearch

def handler(event, context):

user_message = event[‘message’]

session_id = event[‘session_id’]

user_id = event[‘user_id’]

# 1. Retrieve conversation history from DynamoDB

history = get_conversation_history(session_id)

# 2. Embed the user query

query_embedding = embed_text(user_message)

# 3. Retrieve relevant documents from OpenSearch

docs = semantic_search(query_embedding, top_k=5)

# 4. Build the prompt

prompt = build_prompt(

system=SYSTEM_PROMPT,

history=history,

context_docs=docs,

user_message=user_message

)

# 5. Call Bedrock with streaming

response = invoke_bedrock_stream(

model_id=’anthropic.claude-sonnet-4-6′,

prompt=prompt

)

# 6. Store conversation turn in DynamoDB

save_conversation_turn(session_id, user_message, response)

return {‘response’: response, ‘sources’: [d[‘source’] for d in docs]}

Step 5: Set Up Bedrock Guardrails

Bedrock Guardrails are a managed safety layer that intercepts model inputs and outputs before and after inference. They are not optional for production deployments they protect against prompt injection attacks, enforce content policies, and prevent data leakage.

Guardrails to configure for a private ChatGPT:

Denied topics: Define topics the model should never discuss. For an internal tool, this might include competitor analysis, personal advice, or topics outside the system’s intended scope
Content filters: Enable AWS Managed Content Filters for hate speech, violence, and sexual content at the highest sensitivity threshold for internal tools
PII detection: Enable PII redaction for SSN, credit card numbers, and other sensitive data types — both in user inputs and model outputs. This prevents accidental data leakage through the chat interface
Prompt attack detection: Guardrails detect common prompt injection patterns designed to override your system prompt or extract model instructions
Grounding check: Bedrock can verify that model responses are grounded in the provided source documents, flagging hallucinated responses that cite non-existent sources

Security Hardening: What Competitors Miss

Most guides for building a private ChatGPT on AWS cover the happy path — getting it working. Security hardening is what separates a prototype from a production system that passes your security team’s review.

VPC Endpoint Configuration

By default, Bedrock API calls route through the public internet, even from within a VPC. Configure VPC interface endpoints for Bedrock, OpenSearch, and DynamoDB to keep all traffic on AWS’s private network:

Create a VPC interface endpoint for com.amazonaws.{region}.bedrock-runtime
Create a VPC interface endpoint for com.amazonaws.{region}.bedrock
Attach endpoint policies that restrict access to only your Lambda execution role ARNs
Disable public access on the endpoint — interface endpoints with public access enabled undermine the isolation

Secrets Management

Never hardcode API keys, database connection strings, or other secrets in Lambda environment variables or application code. Use AWS Secrets Manager for all credentials, with automatic rotation enabled:

OpenSearch master user credentials in Secrets Manager
Any external service API keys (Slack webhooks for notifications, etc.) in Secrets Manager
Lambda environment variables should contain only non-sensitive configuration (region names, resource ARNs, model IDs)

Audit Logging and Compliance Evidence

For HIPAA, SOC 2, and GDPR compliance, you need complete audit trails for all AI interactions. Configure:

AWS CloudTrail logging for all Bedrock API calls — who called which model with what parameters, when
DynamoDB conversation storage with user ID, timestamp, message hash, and session ID — enabling reconstruction of any conversation for audit
CloudWatch Logs for all Lambda executions with structured JSON logging including user_id, session_id, model_id, token_count, and latency
Bedrock model invocation logging — Bedrock can log all inference requests and responses to S3 for compliance archive, with configurable retention

Token and Rate Limiting

Without token limits, a single user can generate hundreds of thousands of Bedrock tokens per hour — driving unexpected costs and potentially consuming your Bedrock service quota and impacting other users.

Implement per-user token budgets tracked in DynamoDB — increment on each Bedrock call and reject requests when the daily budget is exceeded
Configure API Gateway usage plans with rate limits (requests per second) and quotas (requests per day) per Cognito user pool group
Set Bedrock service quotas via AWS Service Quotas to cap maximum tokens per minute at the account level

Making Your Private ChatGPT Actually Useful: RAG Quality Optimization

A private ChatGPT with a poorly implemented RAG pipeline produces confident, fluent, wrong answers — worse than no AI at all because users trust it. RAG quality optimization is the difference between a tool your team adopts and a tool that erodes trust in AI.

Chunking Strategy

Fixed-size chunking (splitting documents every N tokens) is the fastest to implement and the worst for quality. Use semantic chunking instead: split at natural boundaries — paragraph breaks, section headers, list endings — and preserve structural context in each chunk’s metadata.

Preserve section hierarchy in metadata: chapter, section, subsection. Include this in the OpenSearch document so it can be returned with search results and cited in model responses
Include 1-2 sentences of context before each chunk (the sentences that precede the chunk in the original document). This ‘context window’ dramatically improves retrieval relevance for queries that reference concepts introduced before the retrieved section
Store both a dense vector embedding (for semantic search) and a BM25 sparse embedding (for keyword search) for each chunk. Hybrid search combining both consistently outperforms pure vector search for knowledge base retrieval

Re-ranking Retrieved Results

OpenSearch returns the top-K most similar chunks by embedding distance. But embedding similarity is not the same as relevance for a specific query. Implement a re-ranking step:

Retrieve top-20 candidates from OpenSearch
Pass the query and all 20 candidates to a re-ranking model (Amazon Bedrock Reranking with Amazon Rerank, or a lightweight cross-encoder model hosted on Lambda)
Select the top-5 candidates after re-ranking for inclusion in the Bedrock context window

This two-stage retrieval approach (retrieve many, re-rank, select few) consistently improves answer quality at marginal additional cost.

Evaluating RAG Quality

You cannot improve what you do not measure. Build an evaluation pipeline before deploying to users:

Create a golden dataset: 50-100 question-answer pairs based on your actual knowledge base content, with known correct answers
Run your RAG pipeline against the golden dataset and measure: retrieval recall (did the correct document appear in the top-5?), answer accuracy (did the model produce the correct answer?), and citation accuracy (did cited sources actually support the answer?)
Re-evaluate monthly or after any significant knowledge base update

Cost Architecture: Running a Private ChatGPT on AWS Without Surprises

The cost of a private ChatGPT on AWS is real and variable — it scales with usage. Understanding the cost model before deployment prevents the budget shock that kills internal AI initiatives.

Cost Components and Estimates

Component	AWS Service	Pricing Model	Estimate (100 users, 50 msgs/day)
Foundation Model Inference	Amazon Bedrock (Claude Sonnet)	Per input/output token	~$150-400/month depending on message length
Document Embeddings	Bedrock Titan Embeddings V2	Per token (one-time per doc)	~$5-20/month for ongoing ingestion
Vector Search	OpenSearch Serverless	OCU-hours	~$200-400/month (2 OCU minimum)
Conversation Storage	DynamoDB On-Demand	Per read/write unit	~$5-15/month
Orchestration Compute	AWS Lambda	Per GB-second	~$1-5/month (Lambda is near-free at this scale)
API Management	API Gateway	Per API call	~$5-15/month
Authentication	Amazon Cognito	Per MAU above free tier	~$0-50/month (free tier: 50K MAU)
Monitoring	CloudWatch	Per metric/log GB	~$10-30/month
Secrets Storage	Secrets Manager	Per secret per month	~$1-5/month
Total Estimate			~$380-940/month for 100 active users

For comparison: ChatGPT Enterprise at $30/user/month for 100 users = $3,000/month. The AWS build is 4-8x cheaper at 100 users, with no per-user pricing ceiling as the organization scales.

Cost Optimization Strategies

Model selection: Use Claude Haiku for simple, high-frequency queries (FAQ-style responses) and Claude Sonnet only for complex queries requiring deeper reasoning. A routing Lambda that classifies query complexity before model selection can reduce Bedrock costs by 40-60%
Context window management: Every token in the Bedrock context window costs money. Implement aggressive conversation summarization — after 10 turns, summarize the conversation history to 200-300 tokens rather than passing all 10 turns verbatim
OpenSearch sizing: OpenSearch Serverless has a minimum of 2 OCUs ($0.24/OCU-hour = ~$350/month base). For small deployments, Aurora PostgreSQL with pgvector on a t3.medium instance ($30-50/month) may be significantly cheaper if your knowledge base fits in a relational database
Caching repeated queries: Common queries (‘What is our vacation policy?’, ‘How do I submit an expense?’) generate identical or near-identical Bedrock calls. Implement semantic caching in ElastiCache or DynamoDB — cache responses for 24-48 hours for queries whose embedding distance to a cached query is below a similarity threshold

Production Readiness: The Checklist Before Going Live

Most internal AI tools fail not because the AI is bad, but because the surrounding infrastructure is not production-ready. Work through this checklist before enabling access for your organization.

Reliability and Availability

Multi-AZ Lambda deployment (default — Lambda runs across multiple AZs automatically)
OpenSearch Serverless with multi-AZ configuration enabled
DynamoDB global tables if you require cross-region read access for conversation history
API Gateway with throttling configured at both the account and per-user plan level
Dead letter queues on all Lambda functions to capture and alert on failed invocations

Observability Stack

CloudWatch dashboard with: Bedrock token usage per model per day, Lambda p50/p95/p99 latency, API Gateway 4xx/5xx error rates, OpenSearch query latency, and DynamoDB consumed capacity
CloudWatch alarms on: Bedrock token usage approaching daily budget, Lambda error rate above 1%, and API Gateway 5xx rate above 0.1%
Structured logging in all Lambda functions with consistent fields: user_id, session_id, request_id, model_id, input_tokens, output_tokens, latency_ms, retrieved_docs_count

User Onboarding and Adoption

SAML or OIDC federation with your existing identity provider (Okta, Azure AD, Google Workspace) through Cognito — employees use their existing SSO credentials
Role-based access control: different Cognito user groups get access to different knowledge bases (e.g., Engineering group accesses code documentation, HR group accesses policies and benefits)
System prompt customization per user group — Engineering users get a system prompt optimized for technical queries, HR users get one optimized for policy questions

Private ChatGPT on AWS: Comparing Your Build Approaches

Not all private ChatGPT builds on AWS are equal. The approach you choose depends on your engineering capability, timeline, and customization requirements.

Approach	Components	Time to Deploy	Customization	Best For
Amazon Q Business (Managed)	Fully managed by AWS — no custom code	1-2 weeks config	Limited — AWS console only	Teams needing fast deployment with minimal engineering
Bedrock + Lambda (Custom)	Bedrock, Lambda, OpenSearch, API GW, Cognito	4-8 weeks build	Full — every component configurable	Teams needing full control and custom workflows
Bedrock Knowledge Bases (Hybrid)	Managed RAG via Bedrock KB + custom Lambda	2-4 weeks	Partial — RAG managed, prompting custom	Teams wanting managed RAG with custom application logic
Open Source + Bedrock (AnythingLLM, Flowise)	Self-hosted UI + Bedrock as model backend	1-2 weeks setup	High — open source UI customizable	Teams wanting open-source UX with AWS model backend

For most organizations building a production-grade private ChatGPT on AWS, the Bedrock Knowledge Bases hybrid approach offers the best balance: managed RAG infrastructure reduces engineering complexity, while custom Lambda orchestration maintains the flexibility to implement custom routing, authentication, cost controls, and business logic.

Conclusion: Your Private ChatGPT on AWS Roadmap

Building a private ChatGPT on AWS is one of the highest-ROI internal infrastructure investments an organization can make in 2026. The combination of Amazon Bedrock for managed model access, OpenSearch for knowledge retrieval, Lambda for orchestration, and Cognito for authentication creates a system that matches ChatGPT Enterprise on user experience while surpassing it on data privacy, customization, and long-term cost.

Key implementation takeaways:

Start with Bedrock Knowledge Bases for the RAG layer if you want faster time-to-value — switch to custom OpenSearch only when you need retrieval customization that managed RAG cannot provide
IAM least privilege and VPC endpoints are not optional — they are required for any organization with a security review process
Bedrock Guardrails should be configured before the first user request — retroactive safety configuration is harder than proactive configuration
Cost controls — per-user token budgets, query routing by complexity, and response caching — need to be designed into the architecture from day one, not added when your AWS bill surprises you
RAG quality evaluation with a golden dataset is the only reliable way to know whether your private ChatGPT is actually helping users or producing confident wrong answers

The organizations that build private AI on AWS successfully are not those with the largest ML teams. They are the ones that design the system for production from the start: secure by default, observable, cost-controlled, and grounded in real organizational knowledge. Build it right once and it compounds in value as your knowledge base grows and your users develop AI-native workflows.

Scale your startups with AWS free credits

Blogs

AWS

Contact

All Services

Location Based Service

How to Build a Private ChatGPT on AWS |The Complete Production Architecture Guide

Share this post

Share this post

What You Will Have After This Guide

Architecture Overview: How a Private ChatGPT on AWS Actually Works

Layer 1: Foundation Model Access Amazon Bedrock

Layer 2: Knowledge Base — RAG with Amazon OpenSearch

Layer 3: Orchestration AWS Lambda and Bedrock Agents

Layer 4: Security and Identity — IAM, Cognito, VPC

Layer 5: Frontend and API — Amplify, API Gateway, CloudFront

How to Build a Private ChatGPT on AWS: Step-by-Step Implementation

Step 1: Enable Amazon Bedrock and Request Model Access

Step 4: Build the Query Orchestration Lambda

Security Hardening: What Competitors Miss

VPC Endpoint Configuration

Secrets Management

Audit Logging and Compliance Evidence

Making Your Private ChatGPT Actually Useful: RAG Quality Optimization

Chunking Strategy

Re-ranking Retrieved Results

Evaluating RAG Quality

Cost Architecture: Running a Private ChatGPT on AWS Without Surprises

Cost Components and Estimates

Cost Optimization Strategies

Production Readiness: The Checklist Before Going Live

Reliability and Availability

Observability Stack

User Onboarding and Adoption

Private ChatGPT on AWS: Comparing Your Build Approaches

Conclusion: Your Private ChatGPT on AWS Roadmap

Get the latest articles and news about AWS

Enabling digital Future with GoCloud!!