Microsoft Word Document (Open XML)
DOCX is Microsoft Word's default format since 2007 — every file is a ZIP archive of XML you can unzip and inspect right now. Defined by ISO/IEC 29500 and ECMA-376, DOCX replaced the binary .doc format that was a Microsoft trade secret for 25 years.
DOCX conversion is not yet available in FileDex. For now, use the CLI commands in the Developer Door to convert between document formats with LibreOffice or pandoc.
Common questions
What is a DOCX file?
Every DOCX is secretly a ZIP file — rename it to .zip and unzip it yourself. Inside you'll find XML files storing every paragraph, font choice, and image in your document. Microsoft made DOCX the default Word format in 2007, replacing the proprietary binary .doc format that had been a trade secret for 25 years. The open XML structure (ECMA-376, ISO/IEC 29500) means any text editor can read what's inside.
What is the difference between DOCX and DOC?
DOC is the legacy binary format from before 2007, with a file structure that was a Microsoft trade secret for 25 years. DOCX is open XML that anyone can inspect. DOCX also moves macros into the separate .docm extension, so a plain .docx cannot carry VBA macros, whereas a .doc file could contain macros with no change to its file name.
How do I convert a DOCX file to PDF?
For individual files, open in Microsoft Word, LibreOffice Writer, or Google Docs and export to PDF. Apple Pages on Mac also exports DOCX to PDF. For automated or batch conversion, LibreOffice offers a headless mode that runs without a graphical interface — see the Technical Reference below for the exact command.
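A headless batch conversion could be scripted along these lines. This is a minimal sketch, assuming LibreOffice is installed and its `soffice` binary is on the PATH; the helper names `build_convert_command` and `convert_to_pdf` are illustrative and not part of any library.

```python
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, soffice="soffice"):
    """Return the argument list for a headless LibreOffice DOCX-to-PDF run."""
    return [
        soffice,
        "--headless",            # run without a graphical interface
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_to_pdf(docx_path, out_dir):
    """Run the conversion; raises CalledProcessError if LibreOffice fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(docx_path, out_dir), check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")
```

Calling `convert_to_pdf("report.docx", "out")` would be expected to leave `out/report.pdf` behind; looping over a folder of files gives a simple batch job.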
How do I remove track changes from a Word document?
In Microsoft Word, go to the Review tab, click Accept All Changes, then save. This removes the revision markup from the document. However, other metadata such as the author name, editing time, and comments may still be embedded. Use File → Inspect Document → Document Inspector to find and remove all hidden data before sharing.
Is it safe to open a DOCX file from an unknown sender?
Standard .docx files cannot contain macros and are generally safe to view. However, watch for .docm files (macro-enabled) disguised with misleading names. Never enable macros on documents from untrusted sources. Microsoft's Protected View opens external documents in a restricted sandbox by default.
What makes .DOCX special
Rename any .docx to .zip and double-click it. Inside you'll find XML files describing every paragraph, font, and image in your document. The binary .doc format that came before kept that structure a Microsoft trade secret for 25 years; DOCX, the default since 2007, puts it in plain view.
Rename a .docx to .zip. Your File Explorer Unzips It. What's Inside?
Every DOCX follows the Open Packaging Conventions: a ZIP archive containing [Content_Types].xml as its manifest, a _rels/ directory for relationship mapping, and a word/ directory holding the actual document parts. The main text lives in word/document.xml. Styles cascade from word/styles.xml. Fonts are declared in word/fontTable.xml. Embedded images sit in word/media/. To distinguish a DOCX from an XLSX or PPTX (all three are ZIP archives with the same PK magic bytes), parsers check [Content_Types].xml for the WordprocessingML content type.
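The content-type check described above can be sketched with Python's standard `zipfile` module. This is an illustration under assumptions: the function name `is_wordprocessing_package` is made up here, the manifest is assumed to be UTF-8, and the check is a plain substring match for the main-document content type of an ordinary .docx (macro-enabled and template packages declare different types).

```python
import io
import zipfile

# Content type declared for word/document.xml in a standard .docx package.
WORD_MAIN = (
    "application/vnd.openxmlformats-officedocument."
    "wordprocessingml.document.main+xml"
)


def is_wordprocessing_package(data: bytes) -> bool:
    """True if the bytes are a ZIP whose manifest declares a WordprocessingML main part."""
    if not data.startswith(b"PK\x03\x04"):  # ZIP local file header signature
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        if "[Content_Types].xml" not in zf.namelist():
            return False
        manifest = zf.read("[Content_Types].xml").decode("utf-8")
    # Substring match keeps the sketch short; a stricter tool would parse the XML.
    return WORD_MAIN in manifest
```

Reading a file with `open(path, "rb").read()` and passing the bytes in is enough to tell a DOCX apart from an XLSX or PPTX that shares the same PK signature.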
This structure replaced a 25-year trade secret. The binary .doc format (1983-2008) was a proprietary format based on the Compound File Binary Format, and Microsoft deliberately left its internals undocumented to maintain competitive advantage. When ECMA-376 standardized DOCX in 2006 and ISO/IEC 29500 followed in 2008, document internals became readable by anyone with a text editor. Microsoft published the .doc binary specification under the Open Specification Promise in 2008, but by then DOCX was already the default. Two conformance levels exist today: Transitional (allows legacy VML graphics and Microsoft-specific extensions) and Strict (ISO-defined elements only). Most DOCX files in the wild are Transitional because Word defaults to it. The spec keeps documents small in theory. In practice, a single Word doc routinely balloons into megabytes.
A Simple Word Doc Can Balloon to 15 MB. Why?
Make one word bold and Word creates three XML elements. The document stores text at three levels: a Paragraph (<w:p>) holds block-level properties like alignment and spacing. Inside it, a Run (<w:r>) represents contiguous text sharing identical formatting. Each Run wraps one or more Text elements (<w:t>) containing the characters. The xml:space="preserve" attribute on <w:t> ensures whitespace is not stripped.
Every formatting change (bold, italic, font switch, color shift) forces a new Run. The sentence "Hello World" with only "World" in bold becomes two Runs: one plain, one bold. A paragraph mixing five fonts contains at least five Runs. This is why document.xml can be large for simple-looking documents.
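The Paragraph, Run, and Text nesting can be seen by parsing a fragment with the standard library. The sketch below assumes a hand-written `SAMPLE` fragment rather than a real document, and the helper `runs_in_paragraphs` is a name invented for illustration. It treats any `<w:b/>` element as bold and ignores a possible `w:val` attribute that can switch bold off.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written stand-in for word/document.xml: "Hello " plain, "World" bold.
SAMPLE = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>World</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
"""


def runs_in_paragraphs(document_xml: str):
    """Yield one list per paragraph of (text, is_bold) tuples, one tuple per run."""
    root = ET.fromstring(document_xml)
    for para in root.iterfind(".//w:p", NS):
        runs = []
        for run in para.iterfind("w:r", NS):
            text = "".join(t.text or "" for t in run.iterfind("w:t", NS))
            bold = run.find("w:rPr/w:b", NS) is not None
            runs.append((text, bold))
        yield runs


print(list(runs_in_paragraphs(SAMPLE)))
# [[('Hello ', False), ('World', True)]]
```

One paragraph, two runs: the formatting boundary, not the sentence, decides where a run ends.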
The styles system offsets some of this bloat. Four style types — paragraph, character, table, and numbering — cascade from document defaults through the style hierarchy to direct formatting. The "Normal" style anchors every paragraph. But direct formatting overrides still generate individual Runs for every deviation. Embedded images are stored once in word/media/ and referenced by relationship ID, so the same image appearing on ten pages adds no extra file size — a deduplication strategy that keeps image-heavy documents from ballooning. File size is one kind of hidden weight. What users don't see when they share the document is a different kind of leak.
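The style hierarchy mentioned above is expressed in word/styles.xml through `<w:basedOn>` links between styles. A rough sketch of walking that chain follows; the `SAMPLE_STYLES` fragment, the style ID `Title2`, and the helper `style_chain` are all hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical styles.xml fragment: Title2 -> Heading1 -> Normal.
SAMPLE_STYLES = f"""
<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Normal"><w:name w:val="Normal"/></w:style>
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/><w:basedOn w:val="Normal"/>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Title2">
    <w:name w:val="Title 2"/><w:basedOn w:val="Heading1"/>
  </w:style>
</w:styles>
"""


def style_chain(styles_xml: str, style_id: str):
    """Return the list of style IDs from style_id up to its root ancestor."""
    root = ET.fromstring(styles_xml)
    parents = {}
    for style in root.iterfind("w:style", NS):
        based_on = style.find("w:basedOn", NS)
        parent = based_on.get(f"{{{W}}}val") if based_on is not None else None
        parents[style.get(f"{{{W}}}styleId")] = parent
    chain = []
    current = style_id
    # Stop at the root style, and guard against a cyclic basedOn reference.
    while current is not None and current not in chain:
        chain.append(current)
        current = parents.get(current)
    return chain


print(style_chain(SAMPLE_STYLES, "Title2"))
# ['Title2', 'Heading1', 'Normal']
```

A renderer resolves formatting by applying these styles from the root down and then layering direct formatting on top.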
Word Documents Remember Everything You Tried to Delete. Why?
Track changes in Word are not just a UI feature: they are XML elements baked into the document file. Inserted text is wrapped in <w:ins>, deleted text in <w:del> with <w:delText> replacing <w:t>. Each of these elements records the author name and a timestamp, and Word separately stamps paragraphs and runs with revision save IDs (rsid attributes) that indicate which editing session touched them. Hiding the markup on screen leaves all of this in the file. Accepting all changes removes the <w:ins> and <w:del> wrappers, but the author name in the document properties, comments, and rsid values can remain; Document Inspector removes the properties and comments.
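To make the revision elements concrete, here is a small sketch that lists insertions and deletions from a document.xml string. The `SAMPLE` fragment, the author names, and the helper `tracked_changes` are invented for illustration; a real file would be read from word/document.xml inside the ZIP.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Invented fragment: "500" was deleted by Alice, "450" inserted by Bob.
SAMPLE = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">The price is </w:t></w:r>
      <w:del w:id="1" w:author="Alice" w:date="2024-01-05T10:00:00Z">
        <w:r><w:delText>500</w:delText></w:r>
      </w:del>
      <w:ins w:id="2" w:author="Bob" w:date="2024-01-06T09:30:00Z">
        <w:r><w:t>450</w:t></w:r>
      </w:ins>
    </w:p>
  </w:body>
</w:document>
"""


def tracked_changes(document_xml: str):
    """Return (kind, author, text) for every w:ins and w:del element."""
    root = ET.fromstring(document_xml)
    changes = []
    # Inserted text lives in w:t, deleted text in w:delText.
    for kind, text_tag in (("ins", "w:t"), ("del", "w:delText")):
        for el in root.iterfind(f".//w:{kind}", NS):
            text = "".join(t.text or "" for t in el.iterfind(f".//{text_tag}", NS))
            changes.append((kind, el.get(f"{{{W}}}author"), text))
    return changes


print(tracked_changes(SAMPLE))
# [('ins', 'Bob', '450'), ('del', 'Alice', '500')]
```

An empty result on a file about to be shared is a quick sanity check that no tracked edits are still in it.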
Legal filings have exposed confidential negotiation positions when opposing counsel unzipped the DOCX and read the revision XML. Government documents have leaked classified edits the same way. Beyond track changes, DOCX field codes once supported DDE (Dynamic Data Exchange): field commands like DDEAUTO could launch executables when a document opened, without any macro warning. Microsoft addressed this in December 2017 (advisory ADV170021) with an update that disabled DDE by default in supported versions of Word. The .docx extension itself is a security boundary: standard .docx files cannot contain VBA macros. Macro-enabled documents require the .docm extension, letting organizations block them at email gateways while allowing .docx through. Metadata leaks apply to any document. A different class of problems is specific to right-to-left scripts, where even correctly-authored content can render wrong.
Arabic Text in Word Needs Two Independent Direction Controls. Why?
Copy a paragraph of Arabic from one Word document into another and watch the text direction scramble. This happens because bidirectional text in DOCX operates at two independent levels: <w:bidi/> on paragraph properties sets the entire paragraph's base direction to right-to-left, while <w:rtl/> on individual run properties marks specific text runs as RTL. When the source and destination documents have different paragraph-level bidi settings, pasted text can flip direction mid-sentence. The fix: use Paste Special with "Unformatted Text" to strip inherited direction properties.
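The two direction controls can be inspected separately. The sketch below reports, for each paragraph, whether `<w:bidi/>` is set and which runs carry `<w:rtl/>`. The `SAMPLE` fragment and the helper `direction_flags` are illustrative assumptions, and the check looks only at element presence, not at a `w:val` attribute that could switch a flag off.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Illustrative RTL paragraph: one Arabic run marked rtl, one Latin run unmarked.
SAMPLE = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:pPr><w:bidi/></w:pPr>
      <w:r><w:rPr><w:rtl/></w:rPr><w:t xml:space="preserve">مرحبا </w:t></w:r>
      <w:r><w:t>Word</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
"""


def direction_flags(document_xml: str):
    """Return (paragraph_is_bidi, [run_is_rtl, ...]) for each paragraph."""
    root = ET.fromstring(document_xml)
    result = []
    for para in root.iterfind(".//w:p", NS):
        para_bidi = para.find("w:pPr/w:bidi", NS) is not None
        run_rtl = [
            run.find("w:rPr/w:rtl", NS) is not None
            for run in para.iterfind("w:r", NS)
        ]
        result.append((para_bidi, run_rtl))
    return result


print(direction_flags(SAMPLE))
# [(True, [True, False])]
```

Comparing this output for the source and destination documents shows whether a paste changed the paragraph-level flag, the run-level flags, or both.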
Mixed Arabic-English paragraphs trigger the Unicode Bidirectional Algorithm at the run level, with complex script fonts declared separately via <w:rFonts w:cs="...">. Section-level <w:bidi/> can set the default direction for an entire document section. Arabic diacritical marks (tashkeel: fatha, damma, kasra) are preserved within run text elements as Unicode combining characters inside <w:t>. Religious texts, educational materials, and formal government documents that require full tashkeel store these marks intact, and they survive format conversion as long as the target format and its fonts handle Unicode combining characters. Word's Arabic proofing tools also support tashkeel insertion.
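Because tashkeel marks are ordinary Unicode combining characters, a before-and-after count is one way to check that a conversion kept them. This is a minimal sketch using the standard `unicodedata` module; the helper name `split_tashkeel` is made up here, and it counts every combining mark, not only Arabic ones.

```python
import unicodedata


def split_tashkeel(text: str):
    """Return (text_without_combining_marks, number_of_combining_marks)."""
    base = "".join(ch for ch in text if not unicodedata.combining(ch))
    return base, len(text) - len(base)


# kaf + fatha, ta + fatha, ba + fatha
word = "\u0643\u064e\u062a\u064e\u0628\u064e"
base, marks = split_tashkeel(word)
print(marks)  # 3
```

Running the same count on the text extracted from the converted file and comparing the two numbers flags any conversion step that silently dropped the marks.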
.DOCX compared to alternatives
| Formats | Criteria | Winner |
|---|---|---|
| .DOCX vs .DOC | Inspectability: DOCX is a ZIP of XML files that you can unzip and read with any text editor. DOC was a proprietary binary format whose internals were a Microsoft trade secret for 25 years until 2008. | DOCX wins |
| .DOCX vs .DOC | Security: DOCX separates macros into the .docm extension, letting organizations block macro-enabled files at email gateways while allowing .docx through. DOC had no filename distinction between macro and non-macro files. | DOCX wins |
| .DOCX vs .PDF | Editability: DOCX is designed for editing, so text reflows, styles update, and track changes record revisions. PDF is a fixed-layout format designed for final output where the visual appearance must not change. | DOCX wins |
| .DOCX vs .PDF | Visual consistency: PDF renders identically on every device and operating system. DOCX appearance depends on available fonts, installed styles, and the rendering engine, so a document may look different in Word, LibreOffice, and Google Docs. | PDF wins |
Technical reference
- MIME Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
- Magic Bytes: 50 4B 03 04 (ZIP archive header; the archive contains [Content_Types].xml and a word/ directory)
- Developer: Microsoft / Ecma International
- Year Introduced: 2007
- Open Standard: Yes
Binary Structure
A DOCX file is a ZIP archive (magic bytes 50 4B 03 04, 'PK') following the Open Packaging Conventions. The ZIP contains a root [Content_Types].xml declaring MIME types for all parts, a _rels/.rels entry point, and a word/ directory with document.xml (main text content as Paragraph/Run/Text elements), styles.xml (paragraph, character, table, and numbering style definitions), fontTable.xml (font declarations and substitution mappings), settings.xml (document settings), numbering.xml (list definitions), and optionally a media/ subdirectory for embedded images. To distinguish DOCX from XLSX or PPTX, check [Content_Types].xml for the WordprocessingML content type.
| Offset | Length | Field | Example | Description |
|---|---|---|---|---|
| 0x00 | 4 bytes | ZIP Signature | 50 4B 03 04 | PK local file header, shared by all OOXML formats (XLSX, DOCX, PPTX) |
| 0x04 | 2 bytes | Version needed | 14 00 | Minimum ZIP version to extract (2.0) |
| 0x1A | 2 bytes | Filename length | 13 00 | Length of first entry name, typically '[Content_Types].xml' (19 bytes) |
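The offsets in the table can be read directly with Python's `struct` module. The sketch below unpacks the 30-byte ZIP local file header at the start of a file; the helper name `read_first_local_header` and the returned dictionary keys are illustrative choices, not a standard API.

```python
import struct

# ZIP local file header: signature, version needed, flags, compression method,
# mod time, mod date, CRC-32, compressed size, uncompressed size,
# filename length, extra field length. Little-endian, 30 bytes in total.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def read_first_local_header(data: bytes) -> dict:
    """Parse the first local file header and return selected fields."""
    (sig, version, flags, _method, _mtime, _mdate,
     _crc, _csize, _usize, name_len, _extra_len) = LOCAL_HEADER.unpack_from(data, 0)
    if sig != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    raw_name = data[LOCAL_HEADER.size:LOCAL_HEADER.size + name_len]
    # Bit 11 of the flags marks UTF-8 names; otherwise the legacy code page applies.
    encoding = "utf-8" if flags & 0x800 else "cp437"
    return {
        "version_needed": version,      # offset 0x04, e.g. 20 means ZIP 2.0
        "filename_length": name_len,    # offset 0x1A
        "filename": raw_name.decode(encoding),
    }
```

For a typical DOCX, `read_first_local_header(open(path, "rb").read(64))` would be expected to report `[Content_Types].xml` as the first entry, though the specification does not require that entry to come first.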
Attack Vectors
- DDE command execution
- Embedded OLE objects
- Macro injection via DOCM variant
- Track changes information leakage
- XML External Entity (XXE) injection
Mitigation: Open DOCX only in trusted word processors. Disable macros by default, and never enable them for documents from unknown senders. Treat the `.docm` extension as a red flag: macro-enabled documents can execute code once macros are allowed to run. Inspect the internal XML structure in a text editor before opening untrusted files. FileDex does not parse DOCX; this page is static and accepts no uploads.
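A first-pass inspection of an untrusted package can be automated without opening it in a word processor. The sketch below is a rough triage under assumptions: the function name `triage_package` and its warning labels are invented, the byte patterns are simple heuristics that an attacker can evade (for example by splitting a field instruction across runs), and a hardened tool would also cap how much data it reads from each part.

```python
import io
import zipfile


def triage_package(data: bytes) -> list:
    """Return a sorted list of warning labels found in a DOCX/DOCM package."""
    findings = set()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            lower = name.lower()
            if lower.endswith("vbaproject.bin"):
                findings.add("VBA macro project")
            if lower.startswith("word/embeddings/"):
                findings.add("embedded OLE object")
            if lower.endswith((".xml", ".rels")):
                # Raw byte search: the XML is never parsed, so no entity is expanded.
                raw = zf.read(name)
                if b"DDEAUTO" in raw:
                    findings.add("DDE field instruction")
                if b"<!ENTITY" in raw or b"<!DOCTYPE" in raw:
                    findings.add("DTD or entity declaration")
    return sorted(findings)
```

An empty list does not prove a file is safe; it only means none of these specific markers were found.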
- Specification: ECMA-376, Office Open XML File Formats, 5th Edition
- Specification: ISO/IEC 29500-1:2016, Office Open XML Fundamentals
- Registry: LOC FDD fdd000397, DOCX Transitional
- Registry: PRONOM fmt/412, Microsoft Word DOCX
- History: Microsoft Word (Wikipedia)