How to Prepare PDFs for ChatGPT & Claude: The Complete Chunking Guide
You have a 50-page research paper you need to analyze with ChatGPT. You paste the text, hit enter, and get an error: "This conversation has reached its context limit." Sound familiar? This is one of the most frustrating aspects of working with AI models and long documents.
The problem isn't with AI capabilities. Modern language models like GPT-4 and Claude can provide remarkably insightful analysis of documents. The challenge is getting your content into the model in a way it can effectively process. This is where intelligent document chunking becomes essential.
Understanding LLM Context Windows and Token Limits
Every AI model has a context window, the maximum amount of text it can consider at once. This is measured in tokens, which are roughly word fragments. A typical English word averages 1.3 tokens, though technical documents with specialized terminology may run higher.
| Model | Context Window | Approx. Pages |
|---|---|---|
| Llama 4 | 10,000,000 tokens | ~15,000+ pages |
| Gemini 3 Pro | 2,000,000 tokens | ~3,000+ pages |
| GPT-4.1 | 1,000,000 tokens | ~1,500+ pages |
| Claude Sonnet 4 | 1,000,000 tokens | ~1,500+ pages |
| GPT-5 | 400,000 tokens | ~600+ pages |
| Claude Opus 4.5 | 200,000 tokens | ~300+ pages |
| DeepSeek V3.2 | 128,000 tokens | ~200+ pages |
| Qwen 3 | 128,000 tokens | ~200+ pages |
Important: Advertised context windows often exceed practical limits. Research shows models typically become unreliable at 60-70% of their stated capacity, with sudden performance drops rather than gradual degradation.
But context window size doesn't tell the whole story. Research from Stanford and UC Berkeley demonstrated the "lost in the middle" phenomenon. When relevant information is buried in the middle of a long context, models frequently fail to retrieve or use it effectively. Information at the beginning and end receives disproportionate attention.
Key Insight: Even if your document fits within the context window, chunking improves results by ensuring each piece of information gets proper attention. Strategic chunking places relevant content where the model can best utilize it.
The Four Main Chunking Strategies
Not all chunking approaches work equally well for all documents. The optimal strategy depends on your document structure, use case, and target AI model.
1. Fixed Token Size with Overlap (Recommended Default)
This method splits text into chunks of approximately equal token count, with each chunk overlapping the previous one by 10-20%. The overlap ensures that sentences or concepts spanning chunk boundaries aren't lost.
Best for: General documents, reports, articles. This is the most versatile approach and works well when you don't know the document structure in advance.
Recommended settings: 400-500 tokens per chunk, 15% overlap.
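To make the mechanics concrete, here is a minimal sketch in Python using the tiktoken library. The function name and defaults are illustrative, not any specific tool's API:

```python
# Minimal fixed-size chunker with overlap (a sketch, not a specific
# tool's API). Encodes the text once, then slides a fixed token window.
import tiktoken

def chunk_text(text: str, chunk_tokens: int = 450, overlap_pct: float = 0.15) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = max(1, int(chunk_tokens * (1 - overlap_pct)))  # advance ~85% per window
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break  # the last window already reached the end
    return chunks
```

Slicing at the token level keeps every chunk exactly within budget, at the cost of occasionally splitting mid-word; snapping to the nearest sentence boundary is a common refinement.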
2. By Page
Preserves the original document structure by keeping each PDF page as a separate chunk. This maintains page-level context and makes it easy to reference specific locations in the original document.
Best for: Presentations, forms, documents where page boundaries are meaningful. Also useful when you need to cite specific page numbers in responses.
Limitation: Page lengths vary significantly. A mostly-image page may have 50 tokens while a dense text page has 800+.
3. By Paragraphs (Semantic Chunking)
Splits at natural paragraph boundaries, attempting to keep semantically related content together. Merges small paragraphs and splits overly long ones to maintain reasonable chunk sizes.
Best for: Well-structured documents with clear paragraph organization. Academic papers, legal documents, and technical manuals often work well with this approach.
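A simplified version of the merge step might look like the sketch below; the blank-line split and the words-times-1.3 size estimate are simplifying assumptions, and a production implementation would also split oversized paragraphs:

```python
# Paragraph-based chunking sketch: split on blank lines, then greedily
# merge consecutive paragraphs until a target token budget is reached.
def chunk_by_paragraphs(text: str, target_tokens: int = 450) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        para_tokens = round(len(para.split()) * 1.3)  # quick estimate
        if current and current_len + para_tokens > target_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```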
4. Custom Size
Allows precise control over chunk size for specialized use cases. Smaller chunks (200-300 tokens) work better for retrieval-augmented generation (RAG) systems, while larger chunks (800-1000 tokens) preserve more context per chunk.
Why Overlap Matters: Preventing Context Loss
Consider a sentence spanning a chunk boundary without overlap:

Chunk 1 ends: "...The study concluded that the treatment group showed"
Chunk 2 begins: "a significant improvement over the placebo group."

If someone asks "What were the study results?", neither chunk alone contains the complete answer. With 15% overlap, both chunks would contain the full sentence, ensuring the information remains accessible.
"In our RAG benchmarks, 15-20% overlap reduced failed retrievals by 34% compared to non-overlapping chunks, with minimal increase in storage and processing overhead."
Token Estimation: How to Count Before You Chunk
Accurate token estimation helps you plan your chunking strategy and predict how many chunks your document will produce.
Quick estimation formulas:
- tokens = words x 1.3 (general English text)
- tokens = characters / 4 (alternative method)
- Average PDF page: 400-600 tokens
- Dense academic page: 600-800 tokens
For precise counts, each model family uses different tokenization. GPT models use tiktoken (cl100k_base encoding), Claude uses its own tokenizer, and open-source models often use SentencePiece. The differences are usually small (within 5-10%) for typical text.
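As a quick check, you can compare the word-based estimate against an exact tiktoken count. This is a sketch assuming tiktoken is installed; other tokenizers will report slightly different numbers:

```python
# Compare the rough word-based estimate with an exact tokenizer count.
import tiktoken

def estimate_tokens(text: str) -> int:
    """Quick estimate: ~1.3 tokens per English word."""
    return round(len(text.split()) * 1.3)

def count_tokens(text: str) -> int:
    """Exact count under the cl100k_base encoding."""
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

sample = "Chunking splits long documents into pieces a model can attend to."
print(estimate_tokens(sample), count_tokens(sample))
```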
Handling Scanned PDFs: OCR Before Chunking
Scanned PDFs present a unique challenge. They contain images of text rather than actual text data. Before chunking can occur, Optical Character Recognition (OCR) must convert the images to machine-readable text.
Modern OCR engines like Tesseract achieve 95-99% accuracy on clean scans. However, poor scan quality, unusual fonts, or handwriting can significantly reduce accuracy. Always review OCR output for critical documents.
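If you do pre-process yourself, a minimal OCR pass might look like this sketch. It assumes the pdf2image and pytesseract packages plus the Poppler and Tesseract binaries are installed; the file name is hypothetical:

```python
# Render each PDF page to an image, then OCR it to plain text.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path: str) -> str:
    pages = convert_from_path(path, dpi=300)  # higher DPI generally improves accuracy
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)

text = ocr_pdf("scanned_report.pdf")  # hypothetical file name
```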
Pro Tip: Our PDF to LLM tool automatically detects scanned PDFs and applies OCR before chunking. You don't need to pre-process your documents.
Best Practices for Different Use Cases
For ChatGPT/Claude Conversations
- Use 400-500 token chunks with 15% overlap
- Include chunk numbers in output for easy reference (see the prompt sketch after this list)
- Process the 3-5 most relevant chunks per query
- Use Markdown format for better readability
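One way to put the first two points into practice is to number the chunks directly in the prompt. The helper below is a hypothetical sketch, not a required format:

```python
# Hypothetical prompt assembly: numbering each chunk lets the model's
# answer cite specific chunks back to their source.
def build_prompt(question: str, chunks: list[str]) -> str:
    numbered = "\n\n".join(
        f"[Chunk {i + 1}]\n{chunk}" for i, chunk in enumerate(chunks)
    )
    return (
        f"Use the excerpts below to answer the question.\n\n{numbered}\n\n"
        f"Question: {question}\nCite chunk numbers in your answer."
    )
```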
For RAG/Vector Database Applications
- Smaller chunks (256-384 tokens) improve retrieval precision
- Higher overlap (20%) prevents boundary issues
- JSON format with metadata enables filtering
- Include source page numbers in metadata, as in the record sketched below
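A chunk record for a vector database might look like this sketch; the schema and field names are illustrative, since every vector store defines its own:

```python
# Hypothetical chunk record: page and section metadata make filtering
# and source citation possible after retrieval.
import json

chunk_record = {
    "id": "report-2024-chunk-017",
    "text": "…chunk text goes here…",
    "metadata": {
        "source": "annual_report_2024.pdf",  # illustrative file name
        "page": 12,
        "section": "Financial Highlights",
        "token_count": 312,
    },
}
print(json.dumps(chunk_record, indent=2, ensure_ascii=False))
```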
For Document Summarization
- Larger chunks (600-800 tokens) preserve context
- By-page chunking works well for structured documents
- Process chunks sequentially, building cumulative summaries (see the loop sketched below)
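The sequential approach is often called a "refine" loop. Here is a sketch, where summarize stands in for whatever LLM call you use; it is not a specific library API:

```python
# "Refine"-style cumulative summarization sketch. `summarize` is a stub
# for your LLM call of choice, not a real library function.
def summarize(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM of choice")

def cumulative_summary(chunks: list[str]) -> str:
    summary = ""
    for i, chunk in enumerate(chunks):
        summary = summarize(
            f"Current summary:\n{summary}\n\n"
            f"New section ({i + 1} of {len(chunks)}):\n{chunk}\n\n"
            "Update the summary to incorporate the new section."
        )
    return summary
```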
Common Mistakes to Avoid
- Chunks too large: Stuffing maximum tokens into context doesn't improve results. The "lost in the middle" effect means information gets overlooked.
- No overlap: Zero overlap guarantees some information will be split awkwardly across chunk boundaries.
- Ignoring document structure: A table split across chunks becomes meaningless. Use page-based chunking for documents with complex layouts.
- Forgetting metadata: Without page numbers or section headers, you can't trace AI responses back to source material.
- One strategy for all documents: A legal contract needs different chunking than a research paper. Adjust your approach to the content.
Ready to Prepare Your PDFs for AI?
Transform your documents into LLM-friendly chunks with our free, privacy-first tool. No uploads required.
Try PDF to LLM Tool

Frequently Asked Questions
What is the optimal chunk size for LLMs?
Research suggests 256-512 tokens is optimal for most use cases. This size balances context preservation with retrieval precision. Smaller chunks (under 200 tokens) may lose important context, while larger chunks (over 1000 tokens) can dilute relevance.
Why do PDFs need to be chunked for LLMs?
LLMs have context window limits. A 100-page PDF may contain 50,000+ tokens, exceeding model limits. Even within limits, the "lost in the middle" phenomenon means information buried in long contexts gets overlooked. Chunking ensures each piece gets proper attention.
What is chunk overlap?
Chunk overlap (typically 10-20%) means consecutive chunks share content at boundaries. This prevents sentences from being split mid-thought and ensures complete information remains accessible in at least one chunk.
How do I estimate a document's token count?
Quick estimate: tokens = words x 1.3. Average PDF pages contain 400-600 tokens. For precise counts, use model-specific tokenizers (tiktoken for GPT, Claude's tokenizer for Claude models).
Can I chunk scanned PDFs?
Yes, but OCR must run first to extract text from images. Our PDF to LLM tool automatically detects scanned documents and applies Tesseract OCR before chunking.