.PDF Portable Document Format

PDF became an open ISO standard (ISO 32000) in 2008 after 15 years under Adobe's control — and supports embedded JavaScript, making untrusted PDFs a real security risk. Convert, compress, redact, or merge PDFs in your browser with FileDex — no upload, no server.

Document structure: %PDF header (version) · body objects (pages) · xref cross-reference table · %%EOF trailer
Pages · Forms · Fonts · ISO 32000 · 1993
By FileDex

Your files never leave your device

Common questions

How do I convert a PDF to PNG images?

Drop your PDF into the FileDex converter and select PNG as the output format. Each page is rendered as a separate PNG at 150 DPI. Multi-page PDFs are packaged into a ZIP download. The entire conversion runs in your browser — no file is uploaded to any server.

Can I extract text from a scanned PDF?

Scanned PDFs contain images instead of text streams, so standard text extraction produces nothing. Use FileDex's built-in OCR tool (Tesseract WASM) to generate a searchable text layer from the scanned page images. The OCR runs entirely in your browser.

Why is my PDF so large when it only contains text?

Embedded high-resolution images, unsubsetted fonts (full font files instead of only used glyphs), and scanned pages stored as uncompressed bitmaps are the common causes. Drop the PDF into FileDex's compressor to reduce its size, or use any PDF optimization tool that resamples images and subsets fonts. The Ghostscript compression command in the Technical reference section below shows one way to do this.

Is it safe to open a PDF from an unknown source?

PDFs can contain JavaScript, launch actions, and embedded files that exploit vulnerable readers. Open untrusted PDFs in Chrome or Firefox (sandboxed PDFium and PDF.js, respectively) rather than Adobe Acrobat. Disable JavaScript in your PDF reader preferences as an added precaution.

What is the difference between PDF and PDF/A?

PDF/A (ISO 19005) is a restricted subset of PDF designed for long-term archival. It forbids JavaScript, encryption, and external content references while requiring embedded fonts and ICC color profiles. Standard PDF supports all these features but offers no archival guarantee.

What are the different PDF versions and which should I use?

PDF 1.4 added transparency and is the most compatible baseline. PDF 1.7 became ISO 32000-1 in 2008 and is the default output of most modern tools. PDF 2.0 (ISO 32000-2, published 2017) adds AES-256 encryption and deprecates XFA forms. Use 1.7 for general distribution, PDF/A-2b (based on 1.7) for archival.

Can I extract text from a PDF without Adobe Acrobat?

Yes. Select .txt as the output format in the FileDex PDF converter. The text extraction is powered by PDF.js (pdfjs-dist) running in your browser, which reads the embedded text stream objects from the PDF structure. This works on text-based PDFs. Scanned (image-only) PDFs have no text streams, so extraction returns empty or garbled output until OCR has been applied.

Why does my converted PDF image look blurry?

PDF-to-image conversion renders at a specific DPI (dots per inch). At 72 DPI (screen resolution), text appears soft. FileDex renders at 150 DPI by default. For sharper output — especially for print or OCR — look for a 300 DPI setting in your PDF-to-image tool. Higher DPI produces larger files but preserves fine text detail.
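The relationship between DPI and output size is simple arithmetic, sketched below. It assumes the default PDF unit of 1/72 inch and a US Letter page of 612 x 792 points; the function name is illustrative only.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def raster_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Return the pixel dimensions of a page rendered at the given DPI."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter (8.5 x 11 inches) at the three resolutions discussed above.
for dpi in (72, 150, 300):
    print(dpi, raster_size(612, 792, dpi))
```

Doubling the DPI quadruples the pixel count, which is why 300 DPI output is much larger than 150 DPI output.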

What makes .PDF special

Camelot Project
Adobe envisioned it in 1991
John Warnock wrote a paper called The Camelot Project describing a system to send documents with perfect visual fidelity. PDF 1.0 shipped in 1993.
ISO standard
Not controlled by Adobe since 2008
PDF 1.7 became ISO 32000-1 in 2008, making it an open international standard. PDF 2.0 (ISO 32000-2) added 256-bit AES encryption.
Random access
Opens page 400 without parsing 1-399
The cross-reference table maps every object to its byte offset. A 500-page PDF jumps directly to any page without sequential parsing.
Owner password myth
Copy/print restrictions are bypassed
Owner password restrictions are enforced only by viewer software. Any tool that ignores permission flags can print, copy, or edit the file freely.

Every PDF is a database of numbered objects connected by a cross-reference table. The table maps every object to its byte offset, which enables random-access reads: a 500-page PDF opens to page 400 without parsing pages 1-399. This design choice, made in 1993, is a major reason PDF remains the dominant fixed-layout document format three decades later.


Object-Based Structure

A PDF file has four sections: header, body, cross-reference table, and trailer. The header declares the PDF version (%PDF-1.7 or %PDF-2.0). The body contains numbered indirect objects — dictionaries, streams, arrays, strings, names, numbers, and booleans. The cross-reference (xref) table lists every object number with its byte offset from the file's start. The trailer points back to the xref table and identifies the root object (the document catalog).

Objects reference each other by number. The document catalog points to a page tree. The page tree contains page objects. Page objects reference content streams, font dictionaries, image XObjects, and other resources. This graph structure means a PDF reader can resolve any reference by looking up the object number in the xref table and seeking directly to that byte position.
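The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not a real parser: it builds a tiny hypothetical PDF in memory, and it assumes a classic plain-text xref table with a single subsection (no xref streams, no incremental updates). The helper names are invented for the example.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a tiny one-page PDF with a matching classic xref table."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" begins
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always free
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def read_xref(pdf: bytes) -> dict[int, int]:
    """Map object number -> byte offset from a single-subsection xref table."""
    start = int(re.search(rb"startxref\s+(\d+)\s+%%EOF", pdf).group(1))
    lines = pdf[start:].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("expected a classic xref table")
    first, count = map(int, lines[1].split())
    table = {}
    for i, entry in enumerate(lines[2:2 + count]):
        offset, _generation, kind = entry.split()
        if kind == b"n":  # 'n' = in use, 'f' = free
            table[first + i] = int(offset)
    return table


pdf = build_minimal_pdf()
xref = read_xref(pdf)
page_offset = xref[3]
print(pdf[page_offset:page_offset + 7])  # seek straight to object 3
```

A reader never scans objects 1 and 2 to reach object 3; it reads the table once and seeks to the recorded offset.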

Content Streams and the Graphics Model

PDF pages are rendered by executing a content stream — a sequence of operators that manipulate a graphics state machine. Text operators like BT, Tf (set font), Tm (set text matrix), and Tj (show string) place glyphs. Path operators like m (moveto), l (lineto), re (rectangle), and f (fill) draw shapes. The cm operator modifies the current transformation matrix for scaling, rotation, and translation.

This operator model means PDF is not reflowable. Each glyph has an absolute position. Extracting running text from a PDF requires reconstructing reading order from scattered coordinates — a task that fails silently when documents use complex layouts, right-to-left text, or decorative positioning.
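To make the reading-order problem concrete, here is a rough sketch that pulls positioned text out of a hypothetical content stream and sorts it by coordinates. It assumes literal strings without nested parentheses, text positioned only with Tm, a single-byte encoding readable as Latin-1, and a simple left-to-right, top-to-bottom layout. Real streams break all of these assumptions regularly.

```python
import re

# Hypothetical content stream: the second line of the page is drawn first.
CONTENT = b"""
BT
/F1 12 Tf
1 0 0 1 72 700 Tm
[(Wor) -20 (ld)] TJ
1 0 0 1 72 720 Tm
(Hello) Tj
ET
"""

NUM = rb"(-?\d+(?:\.\d+)?)"
STRING = rb"\(((?:[^()\\]|\\.)*)\)"  # literal string, no nested parentheses
TOKEN = re.compile(
    NUM + rb"\s+" + NUM + rb"\s+Tm"   # groups 1-2: x and y from the text matrix
    + rb"|" + STRING + rb"\s*Tj"      # group 3: string shown by Tj
    + rb"|\[(.*?)\]\s*TJ",            # group 4: array shown by TJ
    re.S,
)


def positioned_text(stream: bytes) -> list[tuple[float, float, str]]:
    """Return (x, y, text) for each text-showing operator, in stream order."""
    x = y = 0.0
    runs = []
    for m in TOKEN.finditer(stream):
        if m.group(1) is not None:
            x, y = float(m.group(1).decode()), float(m.group(2).decode())
        elif m.group(3) is not None:
            runs.append((x, y, m.group(3).decode("latin-1")))
        else:
            pieces = re.findall(STRING, m.group(4))  # drop kerning numbers
            runs.append((x, y, b"".join(pieces).decode("latin-1")))
    return runs


runs = positioned_text(CONTENT)
# Top of the page first (larger y), then left to right.
ordered = [text for x, y, text in sorted(runs, key=lambda r: (-r[1], r[0]))]
print(ordered)
```

Stream order here is "World" then "Hello"; only the coordinate sort restores the visual order, and that sort is exactly the step that fails on multi-column or right-to-left pages.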

Font Embedding

PDF supports three font embedding strategies. Full embedding includes every glyph in the font program. Subset embedding includes only the glyphs used in the document, prefixed with a six-letter tag (e.g., ABCDEF+Helvetica). CIDFont embedding handles CJK fonts with thousands of glyphs using a character identifier mapping.

Subsetting reduces file size dramatically — a 2 MB OpenType font becomes a 30 KB subset for a typical business letter. The tradeoff: subsetting breaks text extraction if the encoding mapping is incomplete or uses custom glyph names. Some PDF generators write ToUnicode CMaps to resolve this. Others do not, producing files where copied text becomes garbled.
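The six-letter subset prefix is easy to detect programmatically. The helper below is a small sketch (its name is made up for this example) that splits a font name into its subset tag and underlying name.

```python
import re

SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(base_font: str):
    """Return (tag, font_name) for a subset font, or (None, name) otherwise."""
    m = SUBSET_TAG.match(base_font)
    if m:
        return m.group(1), m.group(2)
    return None, base_font


print(split_subset_tag("ABCDEF+Helvetica"))
print(split_subset_tag("Helvetica"))
```

A tagged name only indicates that the font was subset; it says nothing about whether a ToUnicode CMap is present.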

Incremental Updates

PDF supports incremental saving. Instead of rewriting the entire file, a PDF editor appends new or modified objects, a new xref table, and a new trailer to the end of the file. The previous content remains intact. This makes saves fast — editing one annotation in a 200 MB file appends only a few kilobytes. Multiple save cycles stack: a file edited ten times contains ten appended revisions.

The cost is bloat. Each revision duplicates modified objects without removing the originals. A 5 MB file can grow to 15 MB after heavy editing. Removing this bloat requires a full rewrite, which some tools call "Save As" or "Optimize."
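One rough way to see stacked revisions is to look for the %%EOF marker that ends each one. The sketch below uses a hypothetical two-revision skeleton, not a valid PDF (the xref data is omitted and the startxref values are placeholders), and the marker search is a heuristic, since %%EOF can also appear inside stream data.

```python
import re


def revision_ends(pdf: bytes) -> list[int]:
    """Return the end offset of each revision, located by its %%EOF marker."""
    return [m.end() for m in re.finditer(rb"%%EOF", pdf)]


# Skeleton only: offsets are placeholders and no xref sections are included.
original = (
    b"%PDF-1.7\n1 0 obj\n(secret)\nendobj\n"
    b"trailer\n<< /Root 1 0 R >>\nstartxref\n0\n%%EOF\n"
)
update = (
    b"1 0 obj\n(public)\nendobj\n"
    b"trailer\n<< /Root 1 0 R /Prev 0 >>\nstartxref\n0\n%%EOF\n"
)
pdf = original + update

ends = revision_ends(pdf)
print(len(ends), "revisions")
first_revision = pdf[:ends[0]]
print(b"(secret)" in first_revision)  # the old object is still in the file
```

Truncating the file at an earlier %%EOF recovers that earlier revision, which is why an incremental save is not a safe way to remove content.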

Linearization

Linearized PDF (sometimes called "Fast Web View") rearranges objects so the first page's resources appear at the beginning of the file, followed by a hint table describing where subsequent pages live. A web browser can render page one while the rest of the file downloads. This is analogous to moving the moov atom to the front of an MP4 file. Without linearization, a PDF viewer over HTTP must download the trailer (at the end of the file), then the xref table, then seek backward for each page's objects.
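A quick heuristic for spotting a linearized file is to look for the linearization dictionary near the start, since it must appear within the first 1024 bytes. The sketch below uses hypothetical fragments rather than complete files and does not validate the hint tables; a tool such as qpdf with its --check-linearization option performs a full check.

```python
def looks_linearized(pdf: bytes) -> bool:
    """Heuristic: a linearized file carries /Linearized in its first 1024 bytes."""
    return b"/Linearized" in pdf[:1024]


# Hypothetical fragments for illustration only.
fast_web_view = b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L 1234 /N 1 >>\nendobj\n"
plain = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

print(looks_linearized(fast_web_view))
print(looks_linearized(plain))
```

An incremental update appended to a linearized file leaves this marker in place even though the file no longer benefits from it, so treat a positive result as a hint only.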

PDF Subtype Standards

PDF/A restricts features to ensure long-term archival: no JavaScript, no external font references, and no encryption. PDF/A-1 additionally forbids transparency. All fonts must be embedded. ICC color profiles are required. PDF/A-3 relaxes one constraint: it allows embedding arbitrary files as attachments, enabling hybrid documents with source data inside.

PDF/X constrains output for print production. It requires specific color spaces (CMYK or spot colors), embedded fonts, bleed box definitions, and trapped status declarations. Output intent profiles guarantee consistent color reproduction across print facilities.

PDF/UA mandates accessibility. Tagged structure, reading order, alt text for images, and proper heading hierarchy are required. Most PDF generators do not produce PDF/UA-compliant output without explicit configuration.

Encryption and Security

PDF encryption operates at two levels. A user password prevents opening the file: the content is encrypted with AES-256 (in modern implementations) and cannot be read without the password. An owner password restricts operations such as printing, copying text, editing, and annotating. The critical distinction is that owner-password restrictions are enforced only by PDF viewer software and are trivially bypassable. When a file has an owner password but no user password, its content is still encrypted, but the key is derived from the empty user password, so any tool can decrypt it. Software that ignores the permission flags can then print, copy, or edit the file without knowing the owner password.
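The permission flags themselves are just bits in the /P entry of the encryption dictionary, stored as a signed 32-bit integer. The sketch below decodes them using the bit positions from the ISO 32000-1 permissions table (bit 1 is the lowest-order bit). The function name and the sample value are illustrative.

```python
# Bit positions (1-based, low-order first) from the ISO 32000-1 permissions table.
PERMISSION_BITS = {
    3: "print",
    4: "modify contents",
    5: "copy or extract text and graphics",
    6: "add or modify annotations",
    9: "fill in form fields",
    10: "extract for accessibility",
    11: "assemble document",
    12: "print at high quality",
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Decode a /P value into a mapping of permission name -> allowed."""
    unsigned = p & 0xFFFFFFFF  # /P is signed; view it as 32 raw bits
    return {
        name: bool(unsigned & (1 << (bit - 1)))
        for bit, name in PERMISSION_BITS.items()
    }


# Sample value: printing and copying allowed, editing and annotating denied.
for name, allowed in decode_permissions(-44).items():
    print(f"{name}: {'allowed' if allowed else 'denied'}")
```

Nothing in the file format stops a program from reading these bits and proceeding anyway, which is the point made above.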

JavaScript Execution

PDF supports embedded JavaScript through JavaScript actions, with a scripting API that originates in Adobe Acrobat. Scripts can run on document open, page navigation, form field interaction, or print events. This enables interactive forms, calculations, and data validation. It also creates a significant attack surface. Malicious PDFs have exploited JavaScript execution to trigger buffer overflows, download malware, and execute arbitrary code. Most modern viewers disable JavaScript by default or sandbox its execution.
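A first-pass triage of an untrusted file can count the dictionary names associated with active content. The sketch below is a heuristic under stated assumptions: it scans raw bytes only, so it misses names hidden inside compressed object streams or obfuscated with #xx hex escapes, and a non-zero count is not proof of malice. The sample bytes and function name are hypothetical.

```python
import re

RISKY_NAMES = [
    b"/JavaScript", b"/JS", b"/OpenAction", b"/AA",
    b"/Launch", b"/EmbeddedFile", b"/URI",
]


def count_risky_names(pdf: bytes) -> dict[str, int]:
    """Count occurrences of names that commonly introduce active content."""
    counts = {}
    for name in RISKY_NAMES:
        # Require a non-alphanumeric follower so /JS does not match /JSON.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, pdf))
    return counts


sample = (
    b"1 0 obj\n<< /Type /Catalog /OpenAction "
    b"<< /S /JavaScript /JS (app.alert('hi');) >> >>\nendobj\n"
)
for name, n in count_risky_names(sample).items():
    print(name, n)
```

A file that combines /OpenAction with /JS runs script as soon as it is opened, which makes that pairing worth a closer look.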

Gotchas

  • Font subsetting without a complete ToUnicode CMap produces files that render correctly but yield nonsense when text is copied or searched.
  • Incremental updates can hide previous versions of content: a "redacted" document may still contain the original text in earlier revisions.
  • PDF forms using XFA (XML Forms Architecture) are only fully supported in Adobe Acrobat; other viewers render them partially or not at all.
  • Tagged PDF structure, needed for accessibility and reliable text extraction, is absent in most machine-generated PDFs.

.PDF compared to alternatives

Formats Criteria Winner
.PDF vs .DOCX
Rendering fidelity
PDF embeds all fonts, images, and layout instructions — the output is pixel-identical on any device. DOCX depends on the viewer's installed fonts and rendering engine, causing layout shifts across platforms.
PDF wins
.PDF vs .DOCX
Editability
DOCX uses structured XML with semantic paragraphs, styles, and sections that any word processor can reflow and edit. PDF stores content as positioned glyphs in content streams — editing requires reconstructing the document structure.
DOCX wins
.PDF vs .EPUB
Mobile readability
EPUB uses reflowable HTML/CSS that adapts to any screen size. PDF pages have fixed dimensions designed for print — on small screens, text requires zooming and horizontal scrolling.
EPUB wins
.PDF vs .PNG
Multi-page documents
A single PDF file can contain thousands of pages with selectable text, hyperlinks, and bookmarks. PNG produces one file per page with no text layer, requiring external packaging for multi-page content.
PDF wins

Technical reference

MIME Type: application/pdf
Magic Bytes: 25 50 44 46 (%PDF), the signature, followed by the version number
Developer: Adobe Systems / ISO
Year Introduced: 1993
Open Standard: Yes (ISO 32000)

Binary Structure

A PDF file has four sequential sections. The header starts with the magic bytes %PDF- followed by a version number (1.0 through 2.0) and a binary comment line containing high bytes to signal binary content to transfer agents. The body contains indirect objects numbered sequentially — each wrapped in 'N G obj ... endobj' delimiters — representing pages, fonts, images, content streams, and metadata. The cross-reference table (xref) maps each object number to its exact byte offset in the file, enabling random access without sequential parsing. The trailer dictionary at file end points to the document catalog (/Root) and info dictionary (/Info), plus the startxref value giving the byte offset of the xref table itself. PDF 1.5+ can replace the plain-text xref with compressed xref streams using DEFLATE, reducing overhead by 30-50% in object-heavy documents. Incremental updates append new objects, a new xref section, and a new trailer with /Prev pointing to the previous xref — this is how each Save operation grows the file without rewriting it.

Offset | Length | Field | Example | Description
0x00 | 5 bytes | Magic + dash | 25 50 44 46 2D (%PDF-) | PDF signature. The dash is part of the magic: validators must check all 5 bytes, not just 4.
0x05 | 3 bytes | Version | 31 2E 37 (1.7) | ASCII version string. Common values: 1.4, 1.5, 1.6, 1.7, 2.0.
0x08 | 1 byte | Line terminator | 0A (LF) | Line feed or carriage return ending the header line.
0x09 | 5+ bytes | Binary comment | 25 E2 E3 CF D3 (%....) | Comment with high bytes (>127) to signal binary content to FTP and mail transfer agents.
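The header layout in the table can be checked with a short routine. This is a strict sketch that expects the signature at offset 0 and a single-digit major and minor version; the function name is illustrative, and real-world readers are often more tolerant, for example of leading junk before the signature.

```python
import re

# Five magic bytes (%PDF-), a version such as 1.7, then a line terminator.
HEADER = re.compile(rb"%PDF-(\d\.\d)[\r\n]+")


def inspect_header(data: bytes):
    """Return (version, has_binary_comment), or None if the signature is missing."""
    m = HEADER.match(data)
    if m is None:
        return None
    version = m.group(1).decode("ascii")
    comment = data[m.end():m.end() + 5]
    has_binary_comment = (
        len(comment) == 5
        and comment[:1] == b"%"
        and all(byte > 127 for byte in comment[1:])
    )
    return version, has_binary_comment


sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print(inspect_header(sample))
print(inspect_header(b"%PDF 1.7\n"))  # dash missing, so the signature check fails
```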
1991: Adobe co-founder John Warnock publishes 'The Camelot Project' paper proposing a universal document format
1993: PDF 1.0 released alongside Acrobat 1.0; initially commercial, limited early adoption
1994: Acrobat Reader made free; PDF adoption begins to scale
2001: PDF/X-1a (ISO 15930-1) ratified for print production workflows
2005: PDF/A-1 (ISO 19005-1) ratified for long-term document archival
2008: PDF 1.7 published as ISO 32000-1; Adobe relinquishes proprietary control
2012: PDF/UA (ISO 14289-1) ratified for universal accessibility compliance
2017: PDF 2.0 published as ISO 32000-2 with AES-256 encryption and updated transparency model
2020: ISO 32000-2:2020, the second edition of PDF 2.0, published with corrections and clarifications
Render all PDF pages to PNG at 300 DPI (Ghostscript)
gs -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 -sOutputFile=page_%03d.png input.pdf

Ghostscript renders every page to a numbered PNG sequence. -sDEVICE=png16m selects 24-bit color output. -r300 sets 300 DPI resolution. %03d zero-pads page numbers to 3 digits.

Extract all text from a PDF preserving layout (Poppler pdftotext)
pdftotext -layout -enc UTF-8 input.pdf output.txt

Poppler's pdftotext preserves spatial layout using whitespace padding. -layout produces readable column-based output. -enc UTF-8 forces Unicode output encoding.

Compress a PDF for web delivery (Ghostscript)
gs -dNOPAUSE -dBATCH -dPDFSETTINGS=/ebook -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 -sOutputFile=compressed.pdf input.pdf

Ghostscript recompresses images to 150 DPI JPEG, subsets fonts, and removes redundant objects. The /ebook preset targets web delivery. -dCompatibilityLevel=1.5 sets the output to PDF 1.5, the first version that permits compressed xref streams.

Remove owner-password restrictions from a PDF (qpdf)
qpdf --decrypt --password='' input.pdf decrypted.pdf

Removes owner-password restrictions (print, copy, edit) without altering content. The empty --password value works only when the file has no user password; a file protected by a user password cannot be decrypted without it.

Merge multiple PDFs into one (qpdf)
qpdf --empty --pages input1.pdf input2.pdf input3.pdf -- merged.pdf

Creates an empty output PDF and appends all pages from each input file in order. The -- separator marks the end of the pages specification.

PDF to PNG (render, lossless): Each PDF page is rasterized to a Canvas element at a target DPI via pdfjs-dist, then exported as lossless PNG. Ideal for page thumbnails, diagram extraction from academic papers, and archival snapshots where every visual detail matters.
PDF to JPG (render, lossy): JPEG rendering of PDF pages yields 60-80% size reduction versus PNG at quality 85+, with imperceptible loss for photographic content. Preferred for email attachments, social media previews, and web thumbnails where size matters more than pixel accuracy.
PDF to TXT (export, variable): Text extraction walks PDF content stream operators (Tj, TJ) and resolves Unicode via ToUnicode CMaps. Used for full-text indexing, LLM ingestion pipelines, and accessibility conversion. Scanned PDFs require OCR first.
PDF to WEBP (render, lossy): WebP offers 25-35% smaller file sizes than JPEG at equivalent perceived quality, making it the best choice for web-optimized page images served via CDN or CMS platforms.
Security risk level: HIGH

Attack Vectors

  • Embedded JavaScript execution
  • Launch action and embedded file execution
  • URI action data exfiltration
  • Malformed stream length buffer overflow
  • Polyglot PDF/ZIP and PDF/HTML files

Mitigation: FileDex processes PDF files entirely in-browser using pdfjs-dist (the browser build of PDF.js). No file is uploaded to any server. JavaScript actions, launch triggers, and embedded files in the PDF are ignored by the renderer; only page content streams are processed for conversion and display.

Industry-standard PDF creation, editing, and form design
PDF.js (library): Mozilla's open-source JavaScript PDF renderer used in Firefox
pdfjs-dist (library): npm distribution of PDF.js for browser-based PDF rendering
PostScript/PDF interpreter for rasterization and compression
Open-source office suite with PDF export and import support
qpdf (tool): CLI tool for PDF linearization, decryption, and repair
pdftk (tool): PDF toolkit for merging, splitting, rotating, and stamping
pdf-lib (library): JavaScript library for creating and modifying PDFs in browser