.DOCX Microsoft Word Document (Open XML)

DOCX is Microsoft Word's default format since 2007 — every file is a ZIP archive of XML you can unzip and inspect right now. Defined by ISO/IEC 29500 and ECMA-376, DOCX replaced the binary .doc format that was a Microsoft trade secret for 25 years.

Document structure
PK header
[Content_Types].xml
word/document.xml
styles.xml

DOCX conversion is not yet available in FileDex. For now, use the CLI commands in the Developer Door to convert between document formats with LibreOffice or pandoc.

Common questions

What is a DOCX file?

Every DOCX is secretly a ZIP file — rename it to .zip and unzip it yourself. Inside you'll find XML files storing every paragraph, font choice, and image in your document. Microsoft made DOCX the default Word format in 2007, replacing the proprietary binary .doc format that had been a trade secret for 25 years. The open XML structure (ECMA-376, ISO/IEC 29500) means any text editor can read what's inside.

What is the difference between DOCX and DOC?

DOC is the legacy binary format from before 2007, with a file structure that was a Microsoft trade secret for 25 years. DOCX is open XML that anyone can inspect. DOCX also separates macros into the .docm extension, so a macro-enabled file is identifiable by name; DOC had no filename distinction between macro and non-macro files.

How do I convert a DOCX file to PDF?

For individual files, open in Microsoft Word, LibreOffice Writer, or Google Docs and export to PDF. Apple Pages on Mac also exports DOCX to PDF. For automated or batch conversion, LibreOffice offers a headless mode that runs without a graphical interface — see the Technical Reference below for the exact command.

How do I remove track changes from a Word document?

In Microsoft Word, go to the Review tab, click Accept All Changes, then save. This removes revision markup from the visible document. However, other metadata like author name, editing time, and comments may still be embedded. Use File → Inspect Document → Document Inspector to find and remove all hidden data before sharing.

Is it safe to open a DOCX file from an unknown sender?

Standard .docx files cannot contain macros and are generally safe to view. However, watch for .docm files (macro-enabled) disguised with misleading names. Never enable macros on documents from untrusted sources. Microsoft's Protected View opens external documents in a restricted sandbox by default.

What makes .DOCX special

25-year secret
DOC was a binary trade secret until 2008
Microsoft kept .doc undocumented from 1983 to 2008. DOCX opened it up: unzip any DOCX and read its XML directly.

Forensic evidence
Track changes reveal deleted secrets
Tracked revisions stay in the XML until they are accepted or rejected, even when the markup is hidden from view. Legal filings have exposed confidential edits.

DDE attack vector
Documents that could run system commands
DDEAUTO field codes launched executables without macros. Microsoft disabled the behavior in 2017 with a security update (ADV170021).

Run-level granularity
Every bold word creates a new XML element
Each formatting change forces a new Run element. Five font changes in one sentence means at least five separate XML runs.

Rename any .docx to .zip and double-click it. Inside you'll find XML files describing every paragraph, font, and image in your document. The binary .doc format that came before it was a Microsoft trade secret for 25 years; that transparency is what makes DOCX different.

Rename a .docx to .zip. Your File Explorer Unzips It. What's Inside?

Every DOCX follows the Open Packaging Conventions: a ZIP archive containing [Content_Types].xml as its manifest, a _rels/ directory for relationship mapping, and a word/ directory holding the actual document parts. The main text lives in word/document.xml. Styles cascade from word/styles.xml. Fonts are declared in word/fontTable.xml. Embedded images sit in word/media/. To distinguish a DOCX from an XLSX or PPTX (all three are ZIP archives with the same PK magic bytes), parsers check [Content_Types].xml for the WordprocessingML content type.
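The content-type check described above can be sketched in a few lines of Python. This is a minimal illustration, not a full OPC parser: it only looks at Override entries in [Content_Types].xml, and the names detect_ooxml_kind and build_demo_package are hypothetical helpers introduced for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by [Content_Types].xml in every OPC package.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

# Main-part content types for the three common OOXML package kinds.
MAIN_PART_TYPES = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml": "docx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml": "xlsx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml": "pptx",
}


def detect_ooxml_kind(source):
    """Return 'docx', 'xlsx', 'pptx', or None for a path or file-like object."""
    with zipfile.ZipFile(source) as package:
        manifest = ET.fromstring(package.read("[Content_Types].xml"))
    for override in manifest.iter(CT_NS + "Override"):
        kind = MAIN_PART_TYPES.get(override.get("ContentType"))
        if kind:
            return kind
    return None


def build_demo_package(main_part_type):
    """Hypothetical helper: build a tiny in-memory package for demonstration."""
    manifest = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        f'<Override PartName="/word/document.xml" ContentType="{main_part_type}"/>'
        "</Types>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", manifest)
    buffer.seek(0)
    return buffer
```

Calling detect_ooxml_kind("report.docx") on a real file works the same way as on the in-memory demo package. A file that is not a ZIP archive raises zipfile.BadZipFile, and a ZIP without a manifest raises KeyError, so callers that scan arbitrary uploads would want to catch both.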

This structure replaced a 25-year trade secret. The binary .doc format (1983-2008) was a proprietary Compound File Binary Format whose internals Microsoft deliberately left undocumented to maintain competitive advantage. When ECMA-376 standardized DOCX in 2006 and ISO/IEC 29500 followed in 2008, document internals became readable by anyone with a text editor. Microsoft simultaneously published the .doc binary specification under the Open Specification Promise — but by then, DOCX was already the default. Two conformance levels exist today: Transitional (allows legacy VML graphics and Microsoft-specific extensions) and Strict (ISO-defined elements only). Most DOCX files in the wild are Transitional because Word defaults to it. The spec keeps documents small in theory. In practice, a single Word doc routinely balloons into megabytes.

A Simple Word Doc Can Balloon to 15 MB. Why?

Make one word bold and Word creates three XML elements. The document stores text at three levels: a Paragraph (<w:p>) holds block-level properties like alignment and spacing. Inside it, a Run (<w:r>) represents contiguous text sharing identical formatting. Each Run wraps one or more Text elements (<w:t>) containing the characters. The xml:space="preserve" attribute on <w:t> ensures whitespace is not stripped.

Every formatting change (bold, italic, font switch, color shift) forces a new Run. The sentence "Hello World" with only "World" in bold becomes two Runs: one plain, one bold. A paragraph mixing five fonts contains at least five Runs. This is why document.xml can be large for simple-looking documents.
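To make the Paragraph/Run/Text hierarchy concrete, here is a small sketch that parses a hand-written paragraph and lists its Runs. The SAMPLE markup is an illustrative fragment, not output copied from Word, and describe_runs is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Illustrative paragraph: "Hello " in plain text, "World" in bold.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>World</w:t></w:r>
</w:p>"""


def describe_runs(paragraph_xml):
    """Return a list of (text, is_bold) tuples, one per Run in the paragraph."""
    paragraph = ET.fromstring(paragraph_xml)
    runs = []
    for run in paragraph.findall("w:r", NS):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        # Simplification: any <w:b> counts as bold; real files may also
        # carry w:val="0" or "false" to switch bold off explicitly.
        bold = run.find("w:rPr/w:b", NS) is not None
        runs.append((text, bold))
    return runs


print(describe_runs(SAMPLE))
```

The two tuples returned correspond to the two Runs: the formatting boundary, not the sentence, decides where one Run ends and the next begins.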

The styles system offsets some of this bloat. Four style types — paragraph, character, table, and numbering — cascade from document defaults through the style hierarchy to direct formatting. The "Normal" style anchors every paragraph. But direct formatting overrides still generate individual Runs for every deviation. Embedded images are stored once in word/media/ and referenced by relationship ID, so the same image appearing on ten pages adds no extra file size — a deduplication strategy that keeps image-heavy documents from ballooning. File size is one kind of hidden weight. What users don't see when they share the document is a different kind of leak.

Word Documents Remember Everything You Tried to Delete. Why?

Track changes in Word are not just a UI feature: they are XML elements stored in the document file. Inserted text is wrapped in <w:ins>; deleted text is wrapped in <w:del>, with <w:delText> replacing <w:t>. Every revision records the author name and timestamp, and revision save identifiers (rsid attributes) mark which editing session touched each part of the text. Hiding the markup in Word's view settings removes none of this. Clicking "Accept All Changes" resolves the <w:ins> and <w:del> wrappers, but other metadata such as author name, editing time, and comments can remain in the file unless purged with Document Inspector.
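The revision markup is easy to read programmatically. The sketch below walks a hand-written fragment and lists each tracked insertion and deletion. The author names and dates in SAMPLE are invented for illustration, and list_revisions is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Illustrative fragment: "100" was deleted and "120" inserted in its place.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Price: </w:t></w:r>
  <w:del w:id="1" w:author="Alice" w:date="2024-01-05T10:00:00Z">
    <w:r><w:delText>100</w:delText></w:r>
  </w:del>
  <w:ins w:id="2" w:author="Bob" w:date="2024-01-06T09:30:00Z">
    <w:r><w:t>120</w:t></w:r>
  </w:ins>
</w:p>"""


def list_revisions(xml_text):
    """Return one dict per <w:ins> or <w:del> element, in document order."""
    root = ET.fromstring(xml_text)
    revisions = []
    for element in root.iter():
        if element.tag == f"{{{W}}}ins":
            kind, text_tag = "insert", "w:t"
        elif element.tag == f"{{{W}}}del":
            kind, text_tag = "delete", "w:delText"
        else:
            continue
        text = "".join(t.text or "" for t in element.findall(f".//{text_tag}", NS))
        revisions.append({
            "kind": kind,
            "author": element.get(f"{{{W}}}author"),
            "date": element.get(f"{{{W}}}date"),
            "text": text,
        })
    return revisions


for revision in list_revisions(SAMPLE):
    print(revision)
```

Run against word/document.xml extracted from a real file, the same function shows whether any tracked changes are still present before the document is shared.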

Legal filings have exposed confidential negotiation positions when opposing counsel unzipped the DOCX and read the revision XML. Government documents have leaked classified edits the same way.

Beyond track changes, DOCX field codes once supported DDE (Dynamic Data Exchange): commands like DDEAUTO that could launch executables when a document opened, without any macro warning. Microsoft patched this in 2017 (ADV170021) with a security update that disabled DDE auto-execution across all Word installations.

The .docx extension itself is a security boundary: standard .docx files cannot contain VBA macros. Macro-enabled documents require the .docm extension, letting organizations block them at email gateways while allowing .docx through.

Metadata leaks apply to any document. A different class of problems is specific to right-to-left scripts, where even correctly-authored content can render wrong.

Arabic Text in Word Needs Two Independent Direction Controls. Why?

Copy a paragraph of Arabic from one Word document into another and watch the text direction scramble. This happens because bidirectional text in DOCX operates at two independent levels: <w:bidi/> on paragraph properties sets the entire paragraph's base direction to right-to-left, while <w:rtl/> on individual run properties marks specific text runs as RTL. When the source and destination documents have different paragraph-level bidi settings, pasted text can flip direction mid-sentence. The fix: use Paste Special with "Unformatted Text" to strip inherited direction properties.

Mixed Arabic-English paragraphs trigger the Unicode Bidirectional Algorithm at the run level, with complex script fonts declared separately via <w:rFonts w:cs="...">. Section-level <w:bidi/> can set the default direction for an entire document section. Arabic diacritical marks (tashkeel, such as fatha, damma, and kasra) are stored as Unicode combining characters inside <w:t>. Religious texts, educational materials, and formal government documents that require full tashkeel keep these marks intact, and they survive format conversion as long as the target format preserves Unicode combining characters. Word's Arabic proofing tools also support tashkeel insertion.
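The two direction controls can be inspected separately. The following sketch reports, for each paragraph, whether the paragraph-level <w:bidi/> flag is set and how many Runs carry <w:rtl/>. SAMPLE is a hand-written fragment and direction_report is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Illustrative body: one RTL paragraph mixing Arabic and English, one LTR paragraph.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:pPr><w:bidi/></w:pPr>
    <w:r><w:rPr><w:rtl/></w:rPr><w:t xml:space="preserve">مرحبا </w:t></w:r>
    <w:r><w:t>Word</w:t></w:r>
  </w:p>
  <w:p>
    <w:r><w:t>Plain left-to-right paragraph</w:t></w:r>
  </w:p>
</w:body>"""


def direction_report(xml_text):
    """Return (paragraph_is_rtl, rtl_run_count, total_run_count) per paragraph."""
    root = ET.fromstring(xml_text)
    report = []
    for paragraph in root.iter(f"{{{W}}}p"):
        # Simplification: presence of the element counts as "on"; real files
        # may also switch a flag off with w:val="0" or "false".
        paragraph_is_rtl = paragraph.find("w:pPr/w:bidi", NS) is not None
        runs = paragraph.findall("w:r", NS)
        rtl_runs = sum(1 for run in runs if run.find("w:rPr/w:rtl", NS) is not None)
        report.append((paragraph_is_rtl, rtl_runs, len(runs)))
    return report


print(direction_report(SAMPLE))
```

A paragraph whose runs are marked RTL while the paragraph flag is missing (or the reverse) is the kind of mismatch that shows up after pasting between documents.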

.DOCX compared to alternatives

Formats: .DOCX vs .DOC
Criterion: Inspectability
DOCX is a ZIP of XML files: unzip and read with any text editor. DOC was a proprietary binary format whose internals were a Microsoft trade secret for 25 years until 2008.
Winner: DOCX

Formats: .DOCX vs .DOC
Criterion: Security
DOCX separates macros into the .docm extension, letting organizations block macro-enabled files at email gateways while allowing .docx through. DOC had no filename distinction between macro and non-macro files.
Winner: DOCX

Formats: .DOCX vs .PDF
Criterion: Editability
DOCX is designed for editing: text reflows, styles update, and track changes record revisions. PDF is a fixed-layout format designed for final output where the visual appearance must not change.
Winner: DOCX

Formats: .DOCX vs .PDF
Criterion: Visual consistency
PDF renders identically on every device and operating system. DOCX appearance depends on available fonts, installed styles, and the rendering engine, so a document may look different in Word, LibreOffice, and Google Docs.
Winner: PDF

Technical reference

MIME Type
application/vnd.openxmlformats-officedocument.wordprocessingml.document
Magic Bytes
50 4B 03 04 ZIP archive header. Contains [Content_Types].xml and word/ directory.
Developer
Microsoft / Ecma International
Year Introduced
2007
Open Standard
Yes
00000000  50 4B 03 04  PK..

Binary Structure

A DOCX file is a ZIP archive (magic bytes 50 4B 03 04, 'PK') following the Open Packaging Conventions. The ZIP contains a root [Content_Types].xml declaring MIME types for all parts, a _rels/.rels entry point, and a word/ directory with document.xml (main text content as Paragraph/Run/Text elements), styles.xml (paragraph, character, table, and numbering style definitions), fontTable.xml (font declarations and substitution mappings), settings.xml (document settings), numbering.xml (list definitions), and optionally a media/ subdirectory for embedded images. To distinguish DOCX from XLSX or PPTX, check [Content_Types].xml for the WordprocessingML content type.

Offset  Length   Field            Example      Description
0x00    4 bytes  ZIP Signature    50 4B 03 04  PK local file header, shared by all OOXML formats (XLSX, DOCX, PPTX)
0x04    2 bytes  Version needed   14 00        Minimum ZIP version to extract (2.0)
0x1A    2 bytes  Filename length  13 00        Length of first entry name, typically '[Content_Types].xml' (19 bytes)
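The offsets in the table can be read directly with Python's struct module. This is a sketch of the 30-byte ZIP local file header layout only; it does not validate the rest of the archive. read_first_local_header and demo_zip_bytes are hypothetical helper names.

```python
import io
import struct
import zipfile

# ZIP local file header: signature, version needed, flags, method, mod time,
# mod date, CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # little-endian, 30 bytes


def read_first_local_header(data):
    """Decode the local file header at offset 0 of a ZIP byte string."""
    (signature, version_needed, _flags, _method, _mod_time, _mod_date,
     _crc32, _compressed, _uncompressed,
     name_length, _extra_length) = LOCAL_HEADER.unpack_from(data, 0)
    if signature != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    # The entry name starts right after the fixed 30-byte header.
    name = data[30:30 + name_length].decode("utf-8", "replace")
    return {
        "version_needed": version_needed,
        "name_length": name_length,
        "name": name,
    }


def demo_zip_bytes():
    """Hypothetical helper: a one-entry archive built in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
    return buffer.getvalue()


print(read_first_local_header(demo_zip_bytes()))
```

For a real file, pass the first few dozen bytes read with open(path, "rb"). The first entry is usually, but not necessarily, [Content_Types].xml, so the name check is a hint rather than proof that the archive is an OOXML package.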
1983  Microsoft Word 1.0 for MS-DOS released (originally called 'Multi-Tool Word')
1985  Word for Macintosh released, preceding the Windows version by 4 years
1989  Word for Windows 1.0 released, beginning Word's dominance on the Windows platform
2006  ECMA-376 approved; WordprocessingML defined in Part 1 §17
2007  Word 2007 ships with DOCX as default format, ending the binary .doc era
2008  ISO/IEC 29500 approved; Microsoft publishes the .doc binary spec, ending 25 years of trade secrecy
2017  Microsoft disables DDE auto-execution in Word (ADV170021), closing a major attack vector

Convert DOCX to PDF via LibreOffice
libreoffice --headless --convert-to pdf document.docx

Headless LibreOffice renders the DOCX and exports to PDF. No GUI required — runs on servers. The most common DOCX conversion command.
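For batch jobs, the same command can be driven from a script. The sketch below is one possible wrapper, assuming the libreoffice binary is on PATH (on some systems it is named soffice). build_convert_command and convert_folder are hypothetical helper names; the --outdir option tells LibreOffice where to write the PDFs.

```python
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, binary="libreoffice"):
    """Return the argument list for one headless DOCX-to-PDF conversion."""
    return [
        binary, "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_folder(folder, out_dir, binary="libreoffice"):
    """Convert every .docx in a folder; return (filename, exit code) pairs."""
    results = []
    for docx_path in sorted(Path(folder).glob("*.docx")):
        completed = subprocess.run(
            build_convert_command(docx_path, out_dir, binary),
            capture_output=True,
            text=True,
        )
        results.append((docx_path.name, completed.returncode))
    return results
```

Converting files one at a time keeps a single failure from aborting the batch; a non-zero exit code in the result list marks a file that needs attention.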

Convert DOCX to Markdown with pandoc
pandoc document.docx -t markdown -o document.md

Pandoc reads the full DOCX structure including styles, images, tables, and footnotes, converting to clean Markdown suitable for documentation workflows.

Inspect DOCX XML structure
unzip -p document.docx word/document.xml | xmllint --format -

Extracts and pretty-prints the main document XML, showing the Paragraph/Run/Text hierarchy, formatting properties, and any track changes markup.

Security risk: HIGH

Attack Vectors

  • DDE command execution
  • Embedded OLE objects
  • Macro injection via DOCM variant
  • Track changes information leakage
  • XML External Entity (XXE) injection

Mitigation: Open DOCX only in trusted word processors. Disable macros by default, and never enable them for documents from unknown senders. Treat the .docm extension as a red flag: macro-enabled documents can execute code as soon as macros are allowed to run. Inspect the internal XML structure in a text editor before opening untrusted files. FileDex does not parse DOCX; this page is static, with no upload.
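Because a file name can be misleading, the macro check can also be made against the package contents. The sketch below flags an archive that contains a vbaProject.bin part or declares the macro-enabled main content type. It is a heuristic, not a malware scanner, and looks_macro_enabled and build_demo_package are hypothetical helper names.

```python
import io
import zipfile

# Content type that a macro-enabled Word package declares for its main part.
MACRO_MAIN_TYPE = "application/vnd.ms-word.document.macroEnabled.main+xml"


def looks_macro_enabled(source):
    """Return True if the package carries VBA regardless of its file extension."""
    with zipfile.ZipFile(source) as package:
        names = package.namelist()
        manifest = package.read("[Content_Types].xml").decode("utf-8", "replace")
    has_vba_part = any(name.lower().endswith("vbaproject.bin") for name in names)
    return has_vba_part or MACRO_MAIN_TYPE in manifest


def build_demo_package(extra_parts=()):
    """Hypothetical helper: build a tiny in-memory package for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        for name in extra_parts:
            package.writestr(name, b"")
    buffer.seek(0)
    return buffer
```

A file named report.docx for which this returns True deserves the same caution as a .docm attachment.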

Microsoft Word
The original DOCX creator, with full WordprocessingML support including track changes, macros, and DDE
Google Docs service
Browser-based DOCX editor with real-time collaboration and automatic format conversion
pandoc tool
Universal document converter supporting DOCX to Markdown, HTML, PDF, LaTeX, and 40+ formats
python-docx library
Python library for creating and modifying DOCX files without Word dependency
LibreOffice Writer
Free open-source word processor with strong DOCX compatibility
Apache POI library
Java library for OOXML document processing in enterprise applications
mammoth.js library
JavaScript library converting DOCX to clean semantic HTML for web publishing
docx4j library
Java library for full OOXML manipulation using JAXB binding