Portable Document Format
PDF became an open ISO standard (ISO 32000) in 2008, after 15 years under Adobe's control. The format also supports embedded JavaScript, which makes untrusted PDFs a real security risk. Convert, compress, redact, or merge PDFs in your browser with FileDex: no upload, no server.
Your files never leave your device
Common questions
How do I convert a PDF to PNG images?
Drop your PDF into the FileDex converter and select PNG as the output format. Each page is rendered as a separate PNG at 150 DPI. Multi-page PDFs are packaged into a ZIP download. The entire conversion runs in your browser — no file is uploaded to any server.
Can I extract text from a scanned PDF?
Scanned PDFs contain images instead of text streams, so standard text extraction produces nothing. Use FileDex's built-in OCR tool (Tesseract WASM) to generate a searchable text layer from the scanned page images. The OCR runs entirely in your browser.
Why is my PDF so large when it only contains text?
Embedded high-resolution images, unsubsetted fonts (full font files instead of only used glyphs), and scanned pages stored as uncompressed bitmaps are the common causes. Drop the PDF into FileDex's compressor to reduce its size, or use any PDF optimization tool that resamples images and subsets fonts. See the CLI tab in Technical Reference below for the exact command.
Is it safe to open a PDF from an unknown source?
PDFs can contain JavaScript, launch actions, and embedded files that exploit vulnerable readers. Open untrusted PDFs in Chrome (sandboxed PDFium) or Firefox (sandboxed PDF.js) rather than Adobe Acrobat. Disable JavaScript in your PDF reader preferences as an added precaution.
What is the difference between PDF and PDF/A?
PDF/A (ISO 19005) is a restricted subset of PDF designed for long-term archival. It forbids JavaScript, encryption, and external content references while requiring embedded fonts and ICC color profiles. Standard PDF supports all these features but offers no archival guarantee.
What are the different PDF versions and which should I use?
PDF 1.4 added transparency and is the most compatible baseline. PDF 1.7 became ISO 32000-1 in 2008 and is the default output of most modern tools. PDF 2.0 (ISO 32000-2, first published in 2017 and revised in 2020) adds AES-256 encryption and deprecates XFA forms. Use 1.7 for general distribution and PDF/A-2b (based on 1.7) for archival.
Can I extract text from a PDF without Adobe Acrobat?
Yes. Select .txt as the output format in the FileDex PDF converter. The text extraction is powered by PDF.js (pdfjs-dist) running in your browser, which reads the embedded text stream objects from the PDF structure. This works on text-based PDFs. Scanned (image-only) PDFs produce empty or garbled output with this method and require OCR instead.
Why does my converted PDF image look blurry?
PDF-to-image conversion renders at a specific DPI (dots per inch). At 72 DPI (screen resolution), text appears soft. FileDex renders at 150 DPI by default. For sharper output — especially for print or OCR — look for a 300 DPI setting in your PDF-to-image tool. Higher DPI produces larger files but preserves fine text detail.
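The relationship between DPI and output size is simple arithmetic, because PDF page dimensions are expressed in points (1/72 inch). The sketch below is illustrative only: `page_pixels` is a hypothetical helper, not part of FileDex, and US Letter (612 x 792 points) is an assumed page size.

```python
def page_pixels(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Convert a page size in PDF points (1/72 inch) to pixel dimensions."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


US_LETTER = (612, 792)  # 8.5 x 11 inches, expressed in points

for dpi in (72, 150, 300):
    w, h = page_pixels(*US_LETTER, dpi)
    print(f"{dpi:>3} DPI -> {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Doubling the DPI quadruples the pixel count, which is why 300 DPI output is noticeably larger than 150 DPI output.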
What makes .PDF special
Every PDF is a database of numbered objects connected by a cross-reference table. The table maps every object to its byte offset, enabling random-access reads: a 500-page PDF opens to page 400 without parsing pages 1-399. This design choice, made in 1993, is a key reason PDF remains the dominant fixed-layout document format three decades later.
Object-Based Structure
A PDF file has four sections: header, body, cross-reference table, and trailer. The header declares the PDF version (%PDF-1.7 or %PDF-2.0). The body contains numbered indirect objects — dictionaries, streams, arrays, strings, names, numbers, and booleans. The cross-reference (xref) table lists every object number with its byte offset from the file's start. The trailer points back to the xref table and identifies the root object (the document catalog).
Objects reference each other by number. The document catalog points to a page tree. The page tree contains page objects. Page objects reference content streams, font dictionaries, image XObjects, and other resources. This graph structure means a PDF reader can resolve any reference by looking up the object number in the xref table and seeking directly to that byte position.
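The lookup path described above (trailer, then xref table, then a direct seek) can be sketched in a few lines. This is a minimal illustration, not a real parser: `build_minimal_pdf` and `read_object` are hypothetical helpers, and the code assumes a single classic xref table with one subsection. Real files may use xref streams, several subsections, or chained tables.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a tiny one-page PDF, recording each object's byte offset."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"        # object 0 heads the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off   # fixed-width 20-byte entries
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def read_object(pdf: bytes, obj_num: int) -> bytes:
    """Resolve one object the way a reader does: trailer -> xref -> seek."""
    # 1. startxref near the end of the file gives the xref table's offset.
    xref_pos = int(re.search(rb"startxref\s+(\d+)\s+%%EOF", pdf).group(1))
    # 2. The subsection header "first count" precedes the 20-byte entries.
    header = re.match(rb"xref\s+(\d+) (\d+)\s+", pdf[xref_pos:])
    first, count = int(header.group(1)), int(header.group(2))
    if not first <= obj_num < first + count:
        raise KeyError(obj_num)
    entry_start = xref_pos + header.end() + 20 * (obj_num - first)
    offset = int(pdf[entry_start:entry_start + 10])
    # 3. Seek straight to the object; nothing before it is parsed.
    end = pdf.index(b"endobj", offset)
    return pdf[offset:end].strip()


pdf = build_minimal_pdf()
print(read_object(pdf, 3).decode())
```

Because every xref entry is exactly 20 bytes, locating the entry for object N is a multiplication rather than a scan.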
Content Streams and the Graphics Model
PDF pages are rendered by executing a content stream — a sequence of operators that manipulate a graphics state machine. Text operators like BT, Tf (set font), Tm (set text matrix), and Tj (show string) place glyphs. Path operators like m (moveto), l (lineto), re (rectangle), and f (fill) draw shapes. The cm operator modifies the current transformation matrix for scaling, rotation, and translation.
This operator model means PDF is not reflowable. Each glyph has an absolute position. Extracting running text from a PDF requires reconstructing reading order from scattered coordinates — a task that fails silently when documents use complex layouts, right-to-left text, or decorative positioning.
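To make the reading-order problem concrete, here is a deliberately simplified sketch. It assumes an uncompressed content stream that positions text only with `Tm` and shows it only with `Tj`, with no escaped or nested parentheses. The tokenizer and the `positioned_text` helper are hypothetical, and real extractors must handle many more operators, font encodings, and the full graphics state.

```python
import re

NUMBER = r"[-+]?\d*\.?\d+"
# Order matters: literal strings, then names, then numbers, then operators.
TOKEN = re.compile(rf"\((?:\\.|[^\\()])*\)|/\w+|{NUMBER}|[A-Za-z'\"*]+")


def positioned_text(stream: bytes):
    """Yield (x, y, text) for each Tj, using the most recent Tm matrix."""
    operands, x, y = [], 0.0, 0.0
    for tok in TOKEN.findall(stream.decode("latin-1")):
        if tok[0] in "(/" or re.fullmatch(NUMBER, tok):
            operands.append(tok)          # operands precede their operator
            continue
        if tok == "Tm":                   # a b c d e f Tm: e, f translate
            x, y = float(operands[-2]), float(operands[-1])
        elif tok == "Tj":                 # (string) Tj
            yield x, y, operands[-1][1:-1]
        operands.clear()


# The stream emits "World" before "Hello": draw order is not reading order.
stream = b"""BT /F1 12 Tf
1 0 0 1 300 720 Tm (World) Tj
1 0 0 1 72 720 Tm (Hello) Tj
1 0 0 1 72 700 Tm (Second line) Tj
ET"""

runs = list(positioned_text(stream))
ordered = sorted(runs, key=lambda r: (-r[1], r[0]))  # top-down, left-right
print([text for _, _, text in ordered])
```

The sort key works for a single left-to-right column; multi-column pages, rotated text, and right-to-left scripts need heuristics that this sketch does not attempt.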
Font Embedding
PDF supports three font embedding strategies. Full embedding includes every glyph in the font program. Subset embedding includes only the glyphs used in the document, prefixed with a six-letter tag (e.g., ABCDEF+Helvetica). CIDFont embedding handles CJK fonts with thousands of glyphs using a character identifier mapping.
Subsetting reduces file size dramatically — a 2 MB OpenType font becomes a 30 KB subset for a typical business letter. The tradeoff: subsetting breaks text extraction if the encoding mapping is incomplete or uses custom glyph names. Some PDF generators write ToUnicode CMaps to resolve this. Others do not, producing files where copied text becomes garbled.
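The subset tag makes embedded subsets easy to spot in a font's name. A small sketch, assuming you already have the font name as a string (the `split_base_font` helper is hypothetical):

```python
import re
from typing import Optional, Tuple

# Six uppercase letters, a plus sign, then the real font name.
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_base_font(base_font: str) -> Tuple[Optional[str], str]:
    """Return (subset_tag, font_name); the tag is None for non-subset fonts."""
    m = SUBSET_TAG.match(base_font)
    return (m.group(1), m.group(2)) if m else (None, base_font)


print(split_base_font("ABCDEF+Helvetica"))
print(split_base_font("Helvetica"))
```

A tag only tells you the font is a subset. Whether copied text comes out readable still depends on the encoding and any ToUnicode CMap.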
Incremental Updates
PDF supports incremental saving. Instead of rewriting the entire file, a PDF editor appends new or modified objects, a new xref table, and a new trailer to the end of the file. The previous content remains intact. This makes saves fast — editing one annotation in a 200 MB file appends only a few kilobytes. Multiple save cycles stack: a file edited ten times contains ten appended revisions.
The cost is bloat. Each revision duplicates modified objects without removing the originals. A 5 MB file can grow to 15 MB after heavy editing. Removing this bloat requires a full rewrite, which some tools call "Save As" or "Optimize."
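Because each incremental save ends with its own `%%EOF` marker, counting those markers gives a rough revision count, and truncating at an earlier marker recovers an earlier revision. The sketch below uses schematic bytes rather than a parseable PDF, and both helpers are hypothetical. Treat the count as a heuristic: a linearized file contains two markers in a single revision, and the byte sequence can also occur inside stream data.

```python
import re


def revision_ends(pdf: bytes) -> list[int]:
    """Return the end offset of every %%EOF marker in the file."""
    return [m.end() for m in re.finditer(rb"%%EOF", pdf)]


def strip_last_revision(pdf: bytes) -> bytes:
    """Drop everything appended after the second-to-last %%EOF marker."""
    ends = revision_ends(pdf)
    return pdf[:ends[-2]] if len(ends) > 1 else pdf


# Schematic placeholders, not valid PDF syntax beyond the markers.
original = b"%PDF-1.7\n4 0 obj (salary: 90000) endobj\nstartxref\n0\n%%EOF\n"
update = b"4 0 obj (REDACTED) endobj\nstartxref\n0\n%%EOF\n"
saved = original + update

print(len(revision_ends(saved)))
print(b"salary" in strip_last_revision(saved))
```

This is why a full rewrite, not an incremental save, is needed before sharing a document whose earlier contents must stay private.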
Linearization
Linearized PDF (sometimes called "Fast Web View") rearranges objects so the first page's resources appear at the beginning of the file, followed by a hint table describing where subsequent pages live. A web browser can render page one while the rest of the file downloads. This is analogous to moving the moov atom to the front of an MP4 file. Without linearization, a PDF viewer over HTTP must download the trailer (at the end of the file), then the xref table, then seek backward for each page's objects.
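A linearized file announces itself with a linearization dictionary near the start of the file, which records the total file length under `/L`. The check below is a rough sketch under that assumption: `linearization_status` is a hypothetical helper, it inspects only the first 1024 bytes, and the sample bytes are schematic rather than a complete PDF.

```python
import re


def linearization_status(pdf: bytes) -> str:
    """Classify a file by its linearization dictionary, if one is present."""
    head = pdf[:1024]
    if b"/Linearized" not in head:
        return "not linearized"
    # /L records the file length at the time the file was linearized.
    m = re.search(rb"/L\s+(\d+)", head)
    if m and int(m.group(1)) != len(pdf):
        return "linearized, but stale"
    return "linearized"


# Schematic sample padded so its length matches the /L value of 60.
doc = b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L 60 >>\nendobj\n".ljust(60)

print(linearization_status(doc))
print(linearization_status(doc + b"appended incremental update"))
```

The second call illustrates why an incremental save defeats linearization: the appended bytes change the file length, so the hint data no longer describes the file.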
PDF Subtype Standards
PDF/A restricts features to ensure long-term archival. It forbids JavaScript, external font references, and encryption, and PDF/A-1 additionally forbids transparency. All fonts must be embedded. ICC color profiles are required. PDF/A-3 relaxes one constraint: it allows embedding arbitrary files as attachments, enabling hybrid documents with source data inside.
PDF/X constrains output for print production. It requires specific color spaces (CMYK or spot colors), embedded fonts, bleed box definitions, and trapped status declarations. Output intent profiles guarantee consistent color reproduction across print facilities.
PDF/UA mandates accessibility. Tagged structure, reading order, alt text for images, and proper heading hierarchy are required. Most PDF generators do not produce PDF/UA-compliant output without explicit configuration.
Encryption and Security
PDF encryption operates at two levels. A user password prevents opening the file entirely: the content is encrypted with AES-256 (in modern implementations) and cannot be read without the key. An owner password restricts operations such as printing, copying text, editing, and annotation. Here is the critical distinction: owner password restrictions are enforced only by PDF viewer software and are trivially bypassable. When a file has an owner password but no user password, the decryption key can be derived from the empty user password, so any tool that ignores the permission flags can print, copy, or edit the file without knowing the owner password. The owner password adds no cryptographic protection of its own.
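The permission flags themselves are just bits in a signed 32-bit integer, stored under `/P` in the encryption dictionary of the standard security handler. The sketch below decodes them; `decode_permissions` is a hypothetical helper, the flag names are paraphrased, and the two sample values are illustrative rather than taken from any particular file.

```python
# Bit positions are 1-based, following the numbering used in the PDF spec.
PERMISSION_BITS = {
    3: "print",
    4: "modify contents",
    5: "copy or extract text",
    6: "add or modify annotations",
    9: "fill form fields",
    10: "extract for accessibility",
    11: "assemble document",
    12: "print at high quality",
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Map a /P value to {permission name: allowed}."""
    flags = p & 0xFFFFFFFF  # reinterpret the signed value as 32 raw bits
    return {
        name: bool(flags & (1 << (bit - 1)))
        for bit, name in PERMISSION_BITS.items()
    }


for value in (-44, -3904):
    allowed = [name for name, ok in decode_permissions(value).items() if ok]
    print(value, allowed)
```

Nothing in this decoding involves a key. A viewer reads the bits and chooses to honor them, which is why tools that skip the check are not stopped by it.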
JavaScript Execution
PDF supports embedded JavaScript through JavaScript actions, with a scripting API that originated in Adobe Acrobat. Scripts can run on document open, page navigation, form field interaction, or print events. This enables interactive forms, calculations, and data validation. It also creates a significant attack surface. Malicious PDFs have exploited JavaScript execution to trigger buffer overflows, download malware, and execute arbitrary code. Most modern viewers disable JavaScript by default or sandbox its execution.
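A quick triage step before opening an untrusted file is to count the dictionary keys associated with active content. The scan below is a heuristic sketch, not a detector: `count_risky_names` is a hypothetical helper, the key list is an illustrative selection, and the scan misses names written with `#xx` hex escapes as well as anything stored inside compressed object streams.

```python
import re

RISKY_NAMES = (
    b"/JavaScript", b"/JS", b"/OpenAction", b"/AA",
    b"/Launch", b"/EmbeddedFile", b"/URI",
)


def count_risky_names(pdf: bytes) -> dict[str, int]:
    """Count raw occurrences of action-related names in the file bytes."""
    counts = {}
    for name in RISKY_NAMES:
        # Require a delimiter after the name so /JS does not match /JSFoo.
        pattern = re.escape(name) + rb"(?=[\s/<>\[\]()])"
        counts[name.decode()] = len(re.findall(pattern, pdf))
    return counts


sample = (
    b"1 0 obj << /Type /Catalog /OpenAction 2 0 R >> endobj\n"
    b"2 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj\n"
)
hits = {k: v for k, v in count_risky_names(sample).items() if v}
print(hits)
```

A zero count does not prove a file is safe, and a non-zero count does not prove it is malicious; form-heavy documents legitimately use scripts.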
Gotchas
- Font subsetting without a complete ToUnicode CMap produces files that render correctly but yield nonsense when text is copied or searched.
- Incremental updates can hide previous versions of content: a "redacted" document may still contain the original text in earlier revisions.
- PDF forms using XFA (XML Forms Architecture) are only fully supported in Adobe Acrobat; other viewers render them partially or not at all.
- Tagged PDF structure, needed for accessibility and reliable text extraction, is absent in most machine-generated PDFs.
.PDF compared to alternatives
| Formats | Criterion | Details | Winner |
|---|---|---|---|
| .PDF vs .DOCX | Rendering fidelity | PDF embeds all fonts, images, and layout instructions, so the output is pixel-identical on any device. DOCX depends on the viewer's installed fonts and rendering engine, causing layout shifts across platforms. | PDF wins |
| .PDF vs .DOCX | Editability | DOCX uses structured XML with semantic paragraphs, styles, and sections that any word processor can reflow and edit. PDF stores content as positioned glyphs in content streams, so editing requires reconstructing the document structure. | DOCX wins |
| .PDF vs .EPUB | Mobile readability | EPUB uses reflowable HTML/CSS that adapts to any screen size. PDF pages have fixed dimensions designed for print, so on small screens text requires zooming and horizontal scrolling. | EPUB wins |
| .PDF vs .PNG | Multi-page documents | A single PDF file can contain thousands of pages with selectable text, hyperlinks, and bookmarks. PNG produces one file per page with no text layer, requiring external packaging for multi-page content. | PDF wins |
Technical reference
- MIME Type: application/pdf
- Magic Bytes: 25 50 44 46 (%PDF signature followed by version number)
- Developer: Adobe Systems / ISO
- Year Introduced: 1993
- Open Standard: Yes
Binary Structure
A PDF file has four sequential sections. The header starts with the magic bytes %PDF- followed by a version number (1.0 through 2.0) and a binary comment line containing high bytes to signal binary content to transfer agents. The body contains indirect objects numbered sequentially — each wrapped in 'N G obj ... endobj' delimiters — representing pages, fonts, images, content streams, and metadata. The cross-reference table (xref) maps each object number to its exact byte offset in the file, enabling random access without sequential parsing. The trailer dictionary at file end points to the document catalog (/Root) and info dictionary (/Info), plus the startxref value giving the byte offset of the xref table itself. PDF 1.5+ can replace the plain-text xref with compressed xref streams using DEFLATE, reducing overhead by 30-50% in object-heavy documents. Incremental updates append new objects, a new xref section, and a new trailer with /Prev pointing to the previous xref — this is how each Save operation grows the file without rewriting it.
| Offset | Length | Field | Example | Description |
|---|---|---|---|---|
| 0x00 | 5 bytes | Magic + dash | 25 50 44 46 2D (%PDF-) | PDF signature. The dash is part of the magic, so validators must check all 5 bytes, not just 4. |
| 0x05 | 3 bytes | Version | 31 2E 37 (1.7) | ASCII version string. Common values: 1.4, 1.5, 1.6, 1.7, 2.0. |
| 0x08 | 1 byte | Line terminator | 0A (LF) | Line feed or carriage return ending the header line. |
| 0x09 | 5+ bytes | Binary comment | 25 E2 E3 CF D3 (%....) | Comment with high bytes (>127) to signal binary content to FTP and mail transfer agents. |
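The header layout in the table can be checked with a few lines of code. This is a sketch under simplifying assumptions: `inspect_header` is a hypothetical helper, it expects the signature at offset 0, and it treats four or more high bytes in the comment line as the binary marker.

```python
import re


def inspect_header(data: bytes) -> dict:
    """Check the 5-byte magic, read the version, and look for high bytes."""
    m = re.match(rb"%PDF-(\d\.\d)[\r\n]+", data)
    if not m:
        return {"is_pdf": False}
    # The optional binary comment is the line right after the header.
    line = data[m.end():m.end() + 16].split(b"\n")[0].split(b"\r")[0]
    high = [b for b in line[1:] if b > 127] if line.startswith(b"%") else []
    return {
        "is_pdf": True,
        "version": m.group(1).decode(),
        "binary_marker": len(high) >= 4,
    }


print(inspect_header(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj"))
print(inspect_header(b"%PDF1.7\n"))
```

Some readers tolerate junk bytes before the signature. A strict check at offset 0, as here, is the conservative choice for file-type validation.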
Attack Vectors
- Embedded JavaScript execution
- Launch action and embedded file execution
- URI action data exfiltration
- Malformed stream length buffer overflow
- Polyglot PDF/ZIP and PDF/HTML files
Mitigation: FileDex processes PDF files entirely in-browser using pdfjs-dist (WebAssembly). No file is uploaded to any server. JavaScript actions, launch triggers, and embedded files in the PDF are ignored by the renderer — only page content streams are processed for conversion and display.
- Specification ISO 32000-2:2020 — Document management, Portable document format, Part 2 (PDF 2.0)
- Specification PDF Reference 1.7 (Adobe, pre-ISO publication) — 756 pages, freely downloadable
- Registry PDF (Portable Document Format) Family — Library of Congress Format Description
- Registry application/pdf — IANA Media Types (registered by ISO 32000 Project Leaders)
- Registry Acrobat PDF 1.0–1.7 (fmt/14–fmt/1016) — The National Archives PRONOM Registry
- Industry PDF Association — ISO 32000, PDF/A, PDF/UA technical resources
- History PDF — Wikipedia