How PDF Files Actually Work: The Format That Changed Document Sharing Forever

You open a PDF on Windows, a Mac, an Android phone, and a Linux server, and it looks exactly the same on all four. That reproducibility was once a remarkable engineering achievement, and it is not accidental. The Portable Document Format was designed from the ground up to make documents look identical on any device, anywhere. But what is a PDF, really? This article pulls back the curtain on the technical internals: the object database, the content streams, the cross-reference table, the reasons some PDFs are unsearchable, and why merging two PDFs can sometimes double the file size.

A Brief History: From Camelot to ISO Standard

In 1991, John Warnock, co-founder of Adobe Systems, wrote an internal memo called "The Camelot Project." His goal was ambitious: create a universal file format that would let anyone send any document to any computer and have it print exactly as intended, regardless of what software or fonts were installed on the recipient's machine.

The first public version of PDF appeared in 1993 alongside Adobe Acrobat 1.0. Early adoption was slow because Acrobat Reader was not yet free and the format required significant processing power. Adobe made Reader free in 1994, and adoption began accelerating.

For its first 15 years, PDF was a proprietary Adobe format. That changed on July 1, 2008, when PDF 1.7 was published as ISO 32000-1, an open international standard. A second edition, ISO 32000-2 (PDF 2.0), followed in 2017 with new features including better accessibility metadata, improved encryption, and cleaner deprecation of legacy constructs. Today, any developer or software vendor can implement full PDF support without paying Adobe a cent.

What a PDF Actually Is: A Hierarchical Object Database

Most people think of a PDF as a document or a fancy image. Neither is quite right. A PDF file is a hierarchical object database stored in a flat text (or binary) file. The file contains a tree of numbered objects, and the document is assembled by following the references between those objects.

Object types inside a PDF include:

Object Type	Purpose
Dictionary	Key-value pairs, the building block of most structures
Array	Ordered list of objects
Stream	Binary or compressed data block (images, fonts, content)
String	Text data, either literal or hex-encoded
Number	Integer or real values for coordinates, sizes
Boolean	True/false flags
Name	Symbolic identifiers like `/Font` or `/Page`
Null	Placeholder for absent values

Every page, font, image, annotation, and form field in a PDF is one or more of these objects. When a PDF viewer opens a file, it does not read it sequentially from top to bottom. Instead, it jumps to the end of the file, reads the trailer, locates the cross-reference table, and uses that table as an index to find exactly the objects it needs.
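
To make the idea of "following references" concrete, here is a minimal Python sketch. The Ref class, the hand-built objects table, and the resolve and page_sizes helpers are all hypothetical names chosen for illustration; a real parser would build the table from the file and would also handle nested page trees.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ref:
    """An indirect reference such as `3 0 R` (object number, generation)."""
    num: int
    gen: int = 0

# Hand-built stand-in for a parsed PDF body: object number -> Python value.
# Dictionaries become dict, arrays become list, names become "/..." strings.
objects = {
    1: {"/Type": "/Catalog", "/Pages": Ref(2)},
    2: {"/Type": "/Pages", "/Kids": [Ref(3)], "/Count": 1},
    3: {"/Type": "/Page", "/Parent": Ref(2), "/MediaBox": [0, 0, 612, 792]},
}

def resolve(value):
    """Follow an indirect reference to the object it names."""
    return objects[value.num] if isinstance(value, Ref) else value

def page_sizes(root: Ref):
    """Walk catalog -> page tree -> pages (one level only) and collect MediaBoxes."""
    catalog = resolve(root)
    pages = resolve(catalog["/Pages"])
    return [resolve(kid)["/MediaBox"] for kid in pages["/Kids"]]

print(page_sizes(Ref(1)))
```

Nothing here is read in file order: the walk starts at the catalog and touches only the objects it is pointed to.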

The Four Sections of Every PDF File

Every valid PDF file has four sections, in this order:

%PDF-1.7              ← Header
1 0 obj ... endobj    ← Body (many objects)
xref                  ← Cross-reference table
trailer               ← Trailer
startxref             ← Byte offset of the xref table (part of the trailer)
%%EOF                 ← End of file marker

Header: The first line of any PDF is %PDF-x.y, where x.y is the version number (such as 1.7 or 2.0). The second line often contains four bytes with values above 127, which signals to file transfer programs that the file contains binary data.
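
A header check along these lines can be sketched in a few lines of Python. The function name parse_header is an assumption for this example, and the sketch assumes the first line ends with a newline character, which is common but not guaranteed.

```python
import re

def parse_header(data: bytes):
    """Return (version, has_binary_marker) for the start of a PDF file."""
    match = re.match(rb"%PDF-(\d\.\d)", data)
    if match is None:
        raise ValueError("not a PDF header")
    version = match.group(1).decode("ascii")
    lines = data.split(b"\n", 2)
    second = lines[1] if len(lines) > 1 else b""
    # A comment line holding at least four bytes >= 128 marks the file as binary.
    has_marker = second.startswith(b"%") and sum(b >= 128 for b in second) >= 4
    return version, has_marker

print(parse_header(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj"))
```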

Body: The body is a sequence of numbered objects. Each object starts with N G obj (where N is the object number and G is the generation number) and ends with endobj. A page object, for instance, is a dictionary that lists the page's size, its resources (fonts and images), and a reference to its content stream.

Cross-reference table (xref): The xref table is a byte-offset index of the objects in the file. Each in-use entry records the byte offset at which an object begins, and every entry is exactly 20 bytes long. Because the entries are fixed-width, a PDF viewer can compute where any entry sits and seek directly to the object it describes without scanning the rest of the file, even in a 500 MB file. This is one reason large PDFs open quickly.
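
As a rough illustration of that lookup, the sketch below decodes classic xref entries. The helper names parse_xref_entry and entry_for are hypothetical, the sample table is made up for the example, and the sketch assumes a single subsection that starts at object 0.

```python
def parse_xref_entry(entry: bytes):
    """Decode one classic entry: 10-digit offset, 5-digit generation, n or f, EOL."""
    if len(entry) != 20:
        raise ValueError("classic xref entries are exactly 20 bytes")
    offset = int(entry[0:10])
    generation = int(entry[11:16])
    in_use = entry[17:18] == b"n"  # f marks a free entry
    return offset, generation, in_use

def entry_for(xref_entries: bytes, index: int):
    """Fixed width means entry i starts at byte 20 * i of the subsection."""
    return parse_xref_entry(xref_entries[20 * index : 20 * index + 20])

# Made-up table: object 0 is the free-list head, objects 1 and 2 are in use.
table = (b"0000000000 65535 f \n"
         b"0000000015 00000 n \n"
         b"0000000074 00000 n \n")
print(entry_for(table, 2))
```

The multiplication in entry_for is the whole trick: no searching is needed to find the entry for a given object number.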

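Trailer: The trailer dictionary names the document catalog object (/Root), which is the root of the object tree. It also stores the number of entries in the xref table (/Size) and an optional encryption dictionary (/Encrypt). After the dictionary, the startxref keyword gives the byte offset of the xref table, and %%EOF closes the file.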
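
The way a reader bootstraps itself from the end of the file can be sketched as follows. The functions locate_xref and trailer_root are hypothetical helpers, and the sample bytes are a deliberately tiny illustration rather than a complete, valid PDF. Regular expressions are used only for brevity; a real parser would tokenize the trailer properly.

```python
import re

def locate_xref(data: bytes) -> int:
    """Return the byte offset written after the last startxref keyword."""
    tail = data[-1024:]  # the keyword sits near the end, so only scan the tail
    match = None
    for match in re.finditer(rb"startxref\s+(\d+)", tail):
        pass  # keep the last match
    if match is None:
        raise ValueError("startxref not found")
    return int(match.group(1))

def trailer_root(data: bytes):
    """Return the (object, generation) pair named by /Root in the last trailer."""
    pos = data.rfind(b"trailer")
    if pos == -1:
        raise ValueError("trailer not found")
    match = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data[pos:])
    if match is None:
        raise ValueError("/Root not found")
    return int(match.group(1)), int(match.group(2))

# Build a tiny example whose offsets are computed rather than typed by hand.
header = b"%PDF-1.7\n"
catalog = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = len(header) + len(catalog)
sample = (
    header + catalog
    + b"xref\n0 2\n0000000000 65535 f \n"
    + b"%010d 00000 n \n" % len(header)
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n%d\n%%%%EOF\n" % xref_offset
)

offset = locate_xref(sample)
print(offset, sample[offset:offset + 4], trailer_root(sample))
```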

Why PDFs Look the Same Everywhere: PostScript Heritage and Embedded Fonts

PDF is a direct descendant of PostScript, the page description language Adobe released in 1984. PostScript describes pages as programs: sequences of instructions that draw lines, fill shapes, and place text at precise coordinates. PDF inherits this coordinate-based, device-independent model.

When a PDF viewer renders text, it does not rely on fonts installed on your computer. Instead, the PDF file itself contains a font descriptor and, usually, a full or subset-embedded copy of the font data. This is why a document set in a custom typeface looks identical on a machine that has never installed that font.

Text in a PDF content stream is not stored the way you would type it. A typical text-drawing sequence looks like this:

BT
  /F1 12 Tf
  100 700 Td
  (Hello, world) Tj
ET

BT begins a text object. /F1 12 Tf selects font F1 at 12 points. 100 700 Td moves the text position to coordinates (100, 700) in user space. (Hello, world) Tj draws the string. ET ends the text object. By default, coordinates are in points (1/72 of an inch) and are normally measured from the bottom-left corner of the page.
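
The sequence above is easy to generate programmatically. The sketch below is one way to do it in Python; text_block and escape_pdf_string are hypothetical helper names, and the font resource /F1 is assumed to be defined elsewhere in the page's resources.

```python
POINTS_PER_INCH = 72

def escape_pdf_string(text: str) -> str:
    """Backslash-escape the characters that are special inside (...) strings."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

def text_block(text, x_in, y_in, font="/F1", size=12):
    """Build a BT ... ET sequence that draws `text` at a position given in inches."""
    x = x_in * POINTS_PER_INCH
    y = y_in * POINTS_PER_INCH
    return (f"BT\n  {font} {size} Tf\n  {x:g} {y:g} Td\n"
            f"  ({escape_pdf_string(text)}) Tj\nET")

print(text_block("Hello (world)", 1, 10))
```

One inch from the left and ten inches from the bottom becomes 72 720 Td, and the parentheses inside the string are escaped so they do not end the string early.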

Why Some PDFs Are Unsearchable: Scanned Documents and OCR

A scanner does not produce text. It produces an image of a page. When that image is wrapped in a PDF container, the resulting file is a PDF that looks like a document but contains zero text data. Every "word" you see is just a collection of dark pixels in a raster image.

This is why you cannot select, copy, or search text in a scanned PDF. The PDF structure exists (header, body, xref, trailer), but each page's content stream does nothing more than paint an image stream object; it contains no text operators.
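
A crude way to tell the two kinds of page apart is to look for text-showing operators in a page's content stream. The Python sketch below is a naive heuristic under stated assumptions: the stream has already been decompressed, only the Tj and TJ operators are checked, and a literal string that happens to contain "Tj" would fool it. The function name has_text_operators is made up for this example.

```python
import re

# Tj and TJ are the common text-showing operators in a content stream.
TEXT_SHOW = re.compile(rb"(?<![A-Za-z])T[jJ](?![A-Za-z])")

def has_text_operators(content: bytes) -> bool:
    """Return True if a decoded content stream appears to draw any text."""
    return TEXT_SHOW.search(content) is not None

typed_page = b"BT /F1 12 Tf 100 700 Td (Hello) Tj ET"
scanned_page = b"q 612 0 0 792 0 0 cm /Im0 Do Q"  # scale, then paint image Im0
print(has_text_operators(typed_page), has_text_operators(scanned_page))
```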

To make a scanned PDF searchable, you need Optical Character Recognition (OCR). OCR software analyzes the pixel patterns in the image, infers character shapes, and produces a hidden text layer that is placed behind the visible image. The result is a "searchable PDF": visually it looks like the scan, but the text layer allows selection and search. The quality of OCR depends on scan resolution (300 DPI is a common minimum), font clarity, and whether the page was skewed when scanned.

PDF Versions and Key Feature Milestones

The PDF specification evolved significantly across versions:

Version	Year	Key Addition
PDF 1.0	1993	Initial release with Acrobat 1.0
PDF 1.2	1996	Interactive forms (AcroForms)
PDF 1.4	2001	Transparency and alpha channel support
PDF 1.5	2003	Object streams (better compression), cross-reference streams
PDF 1.6	2004	3D content, AES encryption
PDF 1.7	2006	Became ISO 32000-1 in 2008
PDF 2.0	2017	ISO 32000-2: improved accessibility, new encryption, deprecated legacy features

PDF 1.4 introduced the transparency model, allowing objects to be rendered with opacity and blending modes. Before PDF 1.4, everything was opaque. PDF 1.5 introduced object streams, which bundle multiple small objects into a single compressed stream. The savings depend on the document: files made up of many small dictionaries can shrink substantially, while files dominated by already-compressed images barely change.
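
The effect of bundling small objects can be approximated with Python's zlib module, which implements the same Deflate compression that PDF's Flate filter uses. This is only a rough analogy: real object streams use a different internal layout (no obj/endobj keywords, plus an offset header), and the annotation-like objects below are synthetic, so the numbers say nothing about any particular file.

```python
import zlib

# Synthetic small dictionary objects, written the way a PDF 1.4 body stores them.
small_objects = [
    f"{n} 0 obj\n<< /Type /Annot /Subtype /Link /Rect [0 0 {n} {n}] >>\nendobj\n".encode()
    for n in range(1, 201)
]

plain = b"".join(small_objects)   # stored as-is, one after another
bundled = zlib.compress(plain)    # stored together in one compressed stream

print(len(plain), len(bundled))
```

Because the objects repeat the same keys over and over, the compressed bundle is much smaller than the plain concatenation.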

Linearized PDFs: Fast Web View

A viewer normally starts reading a PDF from the end, because that is where the cross-reference table lives. Over the web, this means a standard PDF must either be downloaded in full or fetched piecemeal with byte-range requests before any page can be displayed. Linearized PDFs (also called "Fast Web View" in Adobe Acrobat) solve this by restructuring the file so that all objects needed to display the first page appear at the very beginning.

A linearized PDF starts with a linearization parameter dictionary, followed by the cross-reference data for the first page and hint tables that tell the viewer where the objects for the other pages are. A web server can begin streaming the file, and the browser can render page 1 before the rest of the file has arrived. This matters enormously for long PDFs served over the web: a 200-page report can show page 1 almost as soon as the first part of the file arrives, while the remaining 199 pages download in the background.

You can check whether a PDF is linearized by looking for the /Linearized key in the first dictionary of the file. The specification requires that dictionary to sit within the first 1024 bytes.
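
That check is short enough to sketch directly. The function name looks_linearized is hypothetical, and the test is only a quick heuristic: it confirms that the key is present near the start of the file, not that the file is correctly linearized.

```python
def looks_linearized(data: bytes) -> bool:
    """Return True if the /Linearized key appears in the first 1024 bytes."""
    return b"/Linearized" in data[:1024]

def file_looks_linearized(path: str) -> bool:
    """Same check for a file on disk, reading only the first 1024 bytes."""
    with open(path, "rb") as handle:
        return looks_linearized(handle.read(1024))

linear = b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L 12345 >>\nendobj\n"
plain = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
print(looks_linearized(linear), looks_linearized(plain))
```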

Incremental Updates: How PDF Editing Works Without Rewriting the File

When you open a PDF, add a comment, and save it, a naive implementation would rewrite the entire file. PDFs instead use an incremental update model. New and modified objects are appended to the end of the file, followed by a new xref section and a new trailer pointing to those changes. The original file body is untouched.

This has two consequences. First, saving is fast: you only write the changed objects, not the entire document. Second, the file grows with each edit because older versions of objects are never deleted (unless you perform a "save as" full rewrite). A document that has been annotated and re-saved many times may contain dozens of superseded object versions. Some PDF optimization tools strip these stale versions, which can significantly reduce file size.
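
The append-only layout can be sketched in Python. The function incremental_update is a hypothetical helper that adds one new object, and the starting file is a tiny illustration rather than a complete, valid PDF. The point to notice is that the original bytes are never touched and that the new trailer uses /Prev to point back at the previous xref section.

```python
def incremental_update(data, obj_num, body, prev_xref, size):
    """Append one object plus a new xref section and trailer to existing bytes."""
    obj_offset = len(data)  # the new object starts where the old file ended
    obj = b"%d 0 obj\n%s\nendobj\n" % (obj_num, body)
    xref_offset = obj_offset + len(obj)
    xref = b"xref\n%d 1\n%010d 00000 n \n" % (obj_num, obj_offset)
    trailer = (b"trailer\n<< /Size %d /Root 1 0 R /Prev %d >>\n"
               b"startxref\n%d\n%%%%EOF\n" % (size, prev_xref, xref_offset))
    return data + obj + xref + trailer

# A tiny first revision with computed offsets.
header = b"%PDF-1.7\n"
obj1 = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_at = len(header) + len(obj1)
original = (
    header + obj1
    + b"xref\n0 2\n0000000000 65535 f \n"
    + b"%010d 00000 n \n" % len(header)
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n%d\n%%%%EOF\n" % xref_at
)

updated = incremental_update(original, 2, b"(a new comment)", xref_at, 3)
print(len(original), len(updated), updated.count(b"%%EOF"))
```

Counting %%EOF markers gives a rough idea of how many times a file has been saved this way, although linearized files contain an extra marker and will throw the count off.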

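Incremental updates are also what make digital signatures workable: the signature covers a specific byte range of the file, and any change to the bytes inside that range invalidates it. Later edits are appended as incremental updates, so the signed bytes stay intact and a viewer can report that the document was changed after it was signed.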
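
A signature's /ByteRange array names the spans of the file that were hashed, skipping the gap that holds the signature value itself. The Python sketch below recomputes such a hash. It is a simplified illustration: byte_range_digest is a hypothetical helper, SHA-256 is assumed as the digest, the file contents are placeholders, and a real verifier would also check the cryptographic signature over the digest.

```python
import hashlib

def byte_range_digest(data: bytes, byte_range) -> str:
    """Hash the two spans named by a [start1, len1, start2, len2] array."""
    start1, len1, start2, len2 = byte_range
    digest = hashlib.sha256()
    digest.update(data[start1:start1 + len1])
    digest.update(data[start2:start2 + len2])
    return digest.hexdigest()

before = b"%PDF-1.7 ...signed content... "
gap = b"<signature value goes here>"
after = b" ...more signed content... %%EOF\n"
signed_file = before + gap + after
byte_range = [0, len(before), len(before) + len(gap), len(after)]

original = byte_range_digest(signed_file, byte_range)
appended = byte_range_digest(signed_file + b"later incremental update", byte_range)
tampered = byte_range_digest(
    signed_file.replace(b"signed content", b"forged content", 1), byte_range
)
print(original == appended, original == tampered)
```

Bytes appended after the signed range leave the digest unchanged, while a change inside the range produces a different digest.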

Why Merging PDFs Can Change File Size Unexpectedly

When you merge two PDF files, you might expect the output to be roughly the sum of the two input sizes. In practice, the result can be larger or smaller.

Larger than expected is common when both source PDFs embed the same font. A 500 KB font embedded in file A and the same 500 KB font embedded in file B will both appear in the merged output if the merger does not deduplicate font resources. The output is then 500 KB heavier than necessary. A well-implemented merger detects duplicate font resources and keeps only one copy.

Smaller than expected can happen when both PDFs share large common resources (like a background image or a company logo) that can be deduplicated, or when the merger applies compression that the original files lacked.
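
One simple way to find duplicate resources is to hash each stream's bytes and keep the first copy of each hash. The Python sketch below shows the idea; deduplicate_streams is a hypothetical helper, the streams are placeholder bytes, and real mergers may use other strategies. Note that byte-level hashing only catches identical copies: two different subsets of the same font will not match.

```python
import hashlib

def deduplicate_streams(streams):
    """Return (kept, remap): unique streams, and each input's index into kept."""
    first_seen = {}
    kept = []
    remap = []
    for data in streams:
        key = hashlib.sha256(data).digest()
        if key not in first_seen:
            first_seen[key] = len(kept)
            kept.append(data)
        remap.append(first_seen[key])
    return kept, remap

font = b"\x00" * 500_000  # placeholder for a 500 KB embedded font
streams = [font, b"page A content", font, b"page B content"]
kept, remap = deduplicate_streams(streams)
saved = sum(len(s) for s in streams) - sum(len(s) for s in kept)
print(remap, saved)
```

The remap list is what a merger would use to rewrite references so that both documents point at the single surviving copy.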

Understanding this helps you diagnose unexpected file sizes after merging and know when to look for a smarter merging tool.

Frequently Asked Questions

Why are some PDFs so large?

PDF size depends on embedded fonts, image resolution, the number of pages, and whether object stream compression is used. A single high-resolution photograph embedded at 300 DPI can be 5 to 10 MB by itself. PDFs saved from Microsoft Word or Google Docs sometimes include large uncompressed preview images or redundant data. Running a PDF through an optimizer (or re-saving with "reduce file size" enabled) can often cut file size by 50 percent or more.
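
The arithmetic behind image size is worth seeing once. The sketch below computes the uncompressed size of a full-page RGB image; raw_image_bytes is a hypothetical helper, and the result is the size before JPEG or Flate compression, which usually reduces it by an amount that depends on the image.

```python
def raw_image_bytes(width_in, height_in, dpi, channels=3, bits=8):
    """Uncompressed size in bytes of an image covering the given area."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels * bits // 8

# US Letter page (8.5 x 11 inches) scanned in 24-bit color at 300 DPI.
size = raw_image_bytes(8.5, 11, 300)
print(size, round(size / 1_000_000, 1))
```

That is 2550 by 3300 pixels at 3 bytes each, or about 25 MB before compression, which is why image resolution dominates the size of most large PDFs.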

Why can't I copy text from some PDFs?

There are two reasons. First, the PDF may be a scanned document containing only raster images with no text layer. OCR is required to extract text. Second, the PDF author may have set a permissions password that restricts copying. In a permissions-restricted PDF, the file is encrypted, but any viewer can decrypt it without a password in order to display it; a set of permission flags in the encryption dictionary then asks the viewer not to allow copying. Enforcement is left to the viewer, so these restrictions are effectively advisory: some tools ignore them.
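
The permission flags live in the /P entry of the encryption dictionary as a signed integer, with one bit per permission. The Python sketch below tests a few of those bits. The constant and function names are made up for this example, and the two /P values are illustrative inputs rather than values taken from any particular file.

```python
# Bit positions are counted from 1 at the low-order end in the PDF specification.
PRINT = 1 << 2      # bit 3: print the document
MODIFY = 1 << 3     # bit 4: modify the contents
COPY = 1 << 4       # bit 5: copy or extract text and graphics
ANNOTATE = 1 << 5   # bit 6: add or modify annotations

def allowed(p: int, flag: int) -> bool:
    """Return True if the permission bit is set in a /P value."""
    return bool(p & flag)

def describe(p: int):
    names = {"print": PRINT, "modify": MODIFY, "copy": COPY, "annotate": ANNOTATE}
    return {name: allowed(p, flag) for name, flag in names.items()}

print(describe(-3904))
print(describe(-44))
```

With -3904 all four bits are clear, so a compliant viewer would refuse printing and copying; with -44 printing and copying are allowed but modifying and annotating are not.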

What is a PDF/A?

PDF/A is an ISO standard (ISO 19005) designed for long-term archiving. A PDF/A file is a strict subset of PDF: it must embed all fonts, must not reference external resources, must not use encryption, must not use JavaScript, and must include specific metadata. The goal is that a PDF/A file should be fully self-contained and renderable by software written decades in the future, with no external dependencies. Different sub-formats exist: PDF/A-1 is based on PDF 1.4, PDF/A-2 on PDF 1.7, and PDF/A-3 allows embedded files of any type.

Why does my PDF look different on different computers?

The most common cause is missing or substituted fonts. If a PDF does not embed its fonts (which is allowed but inadvisable), the viewer substitutes the nearest available font. Because each run of text is placed at explicit coordinates, line breaks and page positions stay where they are, but the substituted glyphs have different shapes and proportions, so text can look cramped, stretched, or overlapping, and the substitution chosen differs between operating systems. Fully embedding fonts eliminates this problem. A second cause is PDF version incompatibility: a very old viewer may not support transparency (PDF 1.4) or object streams (PDF 1.5), causing rendering differences or outright failure to open the file.

What is the difference between the xref table and cross-reference streams?

PDF 1.4 and earlier used a plain-text xref table. PDF 1.5 introduced cross-reference streams, which store the same index as binary fields inside a stream object that can be compressed. Cross-reference streams are usually smaller and, unlike the classic table, can point to objects stored inside object streams, so a file that uses object streams needs them. Older viewers that only understand PDF 1.4 cannot read cross-reference streams and will fail to open such files. Most modern viewers handle both formats transparently.
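
To show what the binary form looks like, the Python sketch below splits already-decompressed cross-reference stream data into entries using the field widths from the stream's /W array. The function decode_xref_stream is a hypothetical helper, the sample bytes are made up, and the sketch ignores the default values that apply when a width is zero.

```python
def decode_xref_stream(data: bytes, widths):
    """Split decoded xref-stream bytes into one tuple of integers per entry."""
    row = sum(widths)
    entries = []
    for start in range(0, len(data) - row + 1, row):
        pos = start
        fields = []
        for width in widths:
            fields.append(int.from_bytes(data[pos:pos + width], "big"))
            pos += width
        entries.append(tuple(fields))
    return entries

# /W [1 2 2]: a 1-byte type, then a 2-byte field, then another 2-byte field.
# Type 0 = free, type 1 = (byte offset, generation),
# type 2 = (object stream number, index within that stream).
data = bytes.fromhex("00 0000 ffff  01 000f 0000  02 0005 0003")
print(decode_xref_stream(data, [1, 2, 2]))
```

The type 2 entry is the one a classic table cannot express: it says the object lives at index 3 inside object stream 5 rather than at a byte offset.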

Now that you understand how PDF files are built, you can work with them more confidently. Whether you need to combine chapters into a single report or split a large file into smaller sections, the PDF Merge & Split tool on MoreFreeTools handles both operations cleanly, without unexpected font duplication or broken cross-references.