How to Prepare PDFs for ChatGPT & Claude: The Complete Chunking Guide
You have a 50-page research paper you need to analyze with ChatGPT. You paste the text, hit enter, and get an error: "This conversation has reached its context limit." Sound familiar? This is one of the most frustrating aspects of working with AI models and long documents.
The problem isn't with AI capabilities. Modern language models like GPT-4 and Claude can provide remarkably insightful analysis of documents. The challenge is getting your content into the model in a way it can effectively process. This is where intelligent document chunking becomes essential.
Understanding LLM Context Windows and Token Limits
Every AI model has a context window, the maximum amount of text it can consider at once. This is measured in tokens, which are roughly word fragments. A typical English word averages 1.3 tokens, though technical documents with specialized terminology may run higher.
| Model | Context Window | Approx. Pages |
|---|---|---|
| Llama 4 | 10,000,000 tokens | ~15,000+ pages |
| Gemini 3 Pro | 2,000,000 tokens | ~3,000+ pages |
| GPT-4.1 | 1,000,000 tokens | ~1,500+ pages |
| Claude Sonnet 4 | 1,000,000 tokens | ~1,500+ pages |
| GPT-5 | 400,000 tokens | ~600+ pages |
| Claude Opus 4.5 | 200,000 tokens | ~300+ pages |
| DeepSeek V3.2 | 128,000 tokens | ~200+ pages |
| Qwen 3 | 128,000 tokens | ~200+ pages |
Important: Advertised context windows often exceed practical limits. Research shows models typically become unreliable at 60-70% of their stated capacity, with sudden performance drops rather than gradual degradation.
But context window size doesn't tell the whole story. Research from Stanford and UC Berkeley demonstrated the "lost in the middle" phenomenon. When relevant information is buried in the middle of a long context, models frequently fail to retrieve or use it effectively. Information at the beginning and end receives disproportionate attention.
Key Insight: Even if your document fits within the context window, chunking improves results by ensuring each piece of information gets proper attention. Strategic chunking places relevant content where the model can best utilize it.
The Four Main Chunking Strategies
Not all chunking approaches work equally well for all documents. The optimal strategy depends on your document structure, use case, and target AI model.
1. Fixed Token Size with Overlap (Recommended Default)
This method splits text into chunks of approximately equal token count, with each chunk overlapping the previous one by 10-20%. The overlap ensures that sentences or concepts spanning chunk boundaries aren't lost.
Best for: General documents, reports, articles. This is the most versatile approach and works well when you don't know the document structure in advance.
Recommended settings: 400-500 tokens per chunk, 15% overlap.
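To make the mechanics concrete, here is a minimal sketch in Python using the tiktoken library. The function name and defaults are illustrative, not any specific tool's API:

```python
# Minimal fixed-size chunker with overlap (a sketch, not a specific
# tool's API). Encodes the text once, then slides a fixed token window.
import tiktoken

def chunk_text(text: str, chunk_tokens: int = 450, overlap_pct: float = 0.15) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = max(1, int(chunk_tokens * (1 - overlap_pct)))  # advance ~85% per window
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break  # the last window already reached the end
    return chunks
```

Slicing at the token level keeps every chunk exactly within budget, at the cost of occasionally splitting mid-word; snapping to the nearest sentence boundary is a common refinement.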
2. By Page
Preserves the original document structure by keeping each PDF page as a separate chunk. This maintains page-level context and makes it easy to reference specific locations in the original document.
Best for: Presentations, forms, documents where page boundaries are meaningful. Also useful when you need to cite specific page numbers in responses.
Limitation: Page lengths vary significantly. A mostly-image page may have 50 tokens while a dense text page has 800+.
3. By Paragraphs (Semantic Chunking)
Splits at natural paragraph boundaries, attempting to keep semantically related content together. Merges small paragraphs and splits overly long ones to maintain reasonable chunk sizes.
Best for: Well-structured documents with clear paragraph organization. Academic papers, legal documents, and technical manuals often work well with this approach.
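A simplified version of the merge step might look like the sketch below; the blank-line split and the words-times-1.3 size estimate are simplifying assumptions, and a production implementation would also split oversized paragraphs:

```python
# Paragraph-based chunking sketch: split on blank lines, then greedily
# merge consecutive paragraphs until a target token budget is reached.
def chunk_by_paragraphs(text: str, target_tokens: int = 450) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        para_tokens = round(len(para.split()) * 1.3)  # quick estimate
        if current and current_len + para_tokens > target_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```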
4. Custom Size
Allows precise control over chunk size for specialized use cases. Smaller chunks (200-300 tokens) work better for retrieval-augmented generation (RAG) systems, while larger chunks (800-1000 tokens) preserve more context per chunk.
Why Overlap Matters: Preventing Context Loss
Consider a sentence spanning a chunk boundary without overlap:

Chunk 1 ends: "...The study concluded that the treatment group showed"
Chunk 2 begins: "a significant improvement over the placebo group."

If someone asks "What were the study results?", neither chunk alone contains the complete answer. With 15% overlap, both chunks would contain the full sentence, ensuring the information remains accessible.
"In our RAG benchmarks, 15-20% overlap reduced failed retrievals by 34% compared to non-overlapping chunks, with minimal increase in storage and processing overhead."
Token Estimation: How to Count Before You Chunk
Accurate token estimation helps you plan your chunking strategy and predict how many chunks your document will produce.
Quick estimation formulas:
- tokens = words x 1.3 (general English text)
- tokens = characters / 4 (alternative method)
- Average PDF page: 400-600 tokens
- Dense academic page: 600-800 tokens
For precise counts, each model family uses different tokenization. GPT models use tiktoken (cl100k_base encoding), Claude uses its own tokenizer, and open-source models often use SentencePiece. The differences are usually small (within 5-10%) for typical text.
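As a quick check, you can compare the word-based estimate against an exact tiktoken count. This is a sketch assuming tiktoken is installed; other tokenizers will report slightly different numbers:

```python
# Compare the rough word-based estimate with an exact tokenizer count.
import tiktoken

def estimate_tokens(text: str) -> int:
    """Quick estimate: ~1.3 tokens per English word."""
    return round(len(text.split()) * 1.3)

def count_tokens(text: str) -> int:
    """Exact count under the cl100k_base encoding."""
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

sample = "Chunking splits long documents into pieces a model can attend to."
print(estimate_tokens(sample), count_tokens(sample))
```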
Handling Scanned PDFs: OCR Before Chunking
Scanned PDFs present a unique challenge. They contain images of text rather than actual text data. Before chunking can occur, Optical Character Recognition (OCR) must convert the images to machine-readable text.
Modern OCR engines like Tesseract achieve 95-99% accuracy on clean scans. However, poor scan quality, unusual fonts, or handwriting can significantly reduce accuracy. Always review OCR output for critical documents.
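If you do pre-process yourself, a minimal OCR pass might look like this sketch. It assumes the pdf2image and pytesseract packages plus the Poppler and Tesseract binaries are installed; the file name is hypothetical:

```python
# Render each PDF page to an image, then OCR it to plain text.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path: str) -> str:
    pages = convert_from_path(path, dpi=300)  # higher DPI generally improves accuracy
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)

text = ocr_pdf("scanned_report.pdf")  # hypothetical file name
```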
Pro Tip: Our PDF to LLM tool automatically detects scanned PDFs and applies OCR before chunking. You don't need to pre-process your documents.
Best Practices for Different Use Cases
For ChatGPT/Claude Conversations
- Use 400-500 token chunks with 15% overlap
- Include chunk numbers in output for easy reference (see the prompt sketch after this list)
- Process the 3-5 most relevant chunks per query
- Use Markdown format for better readability
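One way to put the first two points into practice is to number the chunks directly in the prompt. The helper below is a hypothetical sketch, not a required format:

```python
# Hypothetical prompt assembly: numbering each chunk lets the model's
# answer cite specific chunks back to their source.
def build_prompt(question: str, chunks: list[str]) -> str:
    numbered = "\n\n".join(
        f"[Chunk {i + 1}]\n{chunk}" for i, chunk in enumerate(chunks)
    )
    return (
        f"Use the excerpts below to answer the question.\n\n{numbered}\n\n"
        f"Question: {question}\nCite chunk numbers in your answer."
    )
```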
For RAG/Vector Database Applications
- Smaller chunks (256-384 tokens) improve retrieval precision
- Higher overlap (20%) prevents boundary issues
- JSON format with metadata enables filtering
- Include source page numbers in metadata, as in the record sketched below
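A chunk record for a vector database might look like this sketch; the schema and field names are illustrative, since every vector store defines its own:

```python
# Hypothetical chunk record: page and section metadata make filtering
# and source citation possible after retrieval.
import json

chunk_record = {
    "id": "report-2024-chunk-017",
    "text": "…chunk text goes here…",
    "metadata": {
        "source": "annual_report_2024.pdf",  # illustrative file name
        "page": 12,
        "section": "Financial Highlights",
        "token_count": 312,
    },
}
print(json.dumps(chunk_record, indent=2, ensure_ascii=False))
```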
For Document Summarization
- Larger chunks (600-800 tokens) preserve context
- By-page chunking works well for structured documents
- Process chunks sequentially, building cumulative summaries (see the loop sketched below)
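The sequential approach is often called a "refine" loop. Here is a sketch, where summarize stands in for whatever LLM call you use; it is not a specific library API:

```python
# "Refine"-style cumulative summarization sketch. `summarize` is a stub
# for your LLM call of choice, not a real library function.
def summarize(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM of choice")

def cumulative_summary(chunks: list[str]) -> str:
    summary = ""
    for i, chunk in enumerate(chunks):
        summary = summarize(
            f"Current summary:\n{summary}\n\n"
            f"New section ({i + 1} of {len(chunks)}):\n{chunk}\n\n"
            "Update the summary to incorporate the new section."
        )
    return summary
```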
Common Mistakes to Avoid
- Chunks too large: Stuffing maximum tokens into context doesn't improve results. The "lost in the middle" effect means information gets overlooked.
- No overlap: Zero overlap guarantees some information will be split awkwardly across chunk boundaries.
- Ignoring document structure: A table split across chunks becomes meaningless. Use page-based chunking for documents with complex layouts.
- Forgetting metadata: Without page numbers or section headers, you can't trace AI responses back to source material.
- One strategy for all documents: A legal contract needs different chunking than a research paper. Adjust your approach to the content.
Ready to Prepare Your PDFs for AI?
Transform your documents into LLM-friendly chunks with our free, privacy-first tool. No uploads required.
Try PDF to LLM Tool

Frequently Asked Questions
What is the optimal chunk size for LLMs?
Research suggests 256-512 tokens is optimal for most use cases. This size balances context preservation with retrieval precision. Smaller chunks (under 200 tokens) may lose important context, while larger chunks (over 1000 tokens) can dilute relevance.
Why do PDFs need to be chunked for LLMs?
LLMs have context window limits. A 100-page PDF may contain 50,000+ tokens, exceeding model limits. Even within limits, the "lost in the middle" phenomenon means information buried in long contexts gets overlooked. Chunking ensures each piece gets proper attention.
What is chunk overlap?
Chunk overlap (typically 10-20%) means consecutive chunks share content at boundaries. This prevents sentences from being split mid-thought and ensures complete information remains accessible in at least one chunk.
How do I estimate a document's token count?
Quick estimate: tokens = words x 1.3. Average PDF pages contain 400-600 tokens. For precise counts, use model-specific tokenizers (tiktoken for GPT, Claude's tokenizer for Claude models).
Can I chunk scanned PDFs?
Yes, but OCR must run first to extract text from images. Our PDF to LLM tool automatically detects scanned documents and applies Tesseract OCR before chunking.