.DOC Microsoft Word Document

DOC is Microsoft Word's legacy binary format (Word 97-2003), built on the OLE2 Compound File Binary Format. It stores formatted text, images, tables, and VBA macros in a proprietary binary structure that was superseded by the XML-based DOCX in Office 2007.

OLE2 Binary, Compound Document, VBA Macros, Word 97-2003, 1997
By FileDex
Not convertible

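In-browser conversion is not offered: rendering the binary Word format faithfully requires a full word-processing engine such as Microsoft Word or LibreOffice, which FileDex does not run as browser WASM.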

Common questions

What is the difference between DOC and DOCX?

DOC is a proprietary binary format (OLE2 Compound File) used by Word 97-2003. DOCX is a ZIP archive of XML files introduced in Office 2007 as an open standard (ECMA-376). DOCX produces smaller files, is easier to parse programmatically, and is the current default for all modern Word versions.
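Because the two formats use different containers, they can usually be told apart from the first few bytes, whatever the file extension says. The following is a minimal sketch; the helper names `sniff_container` and `sniff_file` are hypothetical, and a match only identifies the container (OLE2 or ZIP), not that the file is a Word document.

```python
# Leading bytes of the two containers.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # CFBF: DOC, XLS, PPT, ...
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP: DOCX, XLSX, ODT, ...


def sniff_container(head: bytes) -> str:
    """Classify a file by its leading bytes (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"   # possibly a legacy DOC
    if head.startswith(ZIP_MAGIC):
        return "zip"    # possibly a DOCX
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_container(fh.read(8))
```

A file named `report.doc` that sniffs as `zip` is most likely a DOCX saved with the wrong extension.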

Is it safe to open DOC files from unknown sources?

No. DOC files can contain VBA macros that auto-execute malware, OLE embedded objects, and DDE fields that run system commands. Always open untrusted DOC files in Protected View or convert to PDF first via LibreOffice headless.

How do I convert DOC to DOCX without Microsoft Office?

Use LibreOffice from the command line: libreoffice --headless --convert-to docx input.doc. This works on Windows, macOS, and Linux without a GUI. Google Docs also converts DOC to DOCX on upload.

Can I recover text from a corrupted DOC file?

Try antiword — it reads the WordDocument stream directly and can extract text even when the Table stream (formatting data) is damaged. If antiword fails, opening the file in a hex editor and searching for ASCII text blocks can recover raw content.
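The hex-editor approach can be approximated in a few lines of Python. This is a rough sketch, not a DOC parser: it scans the raw bytes for runs of printable characters, both as single-byte ASCII and as UTF-16LE (Word can store text either way). The function names and the 8-character minimum are assumptions chosen for illustration.

```python
import re


def ascii_runs(data: bytes, min_len: int = 8) -> list[str]:
    """Return runs of printable single-byte ASCII of at least min_len characters."""
    pattern = rb"[\x20-\x7e\t\r\n]{%d,}" % min_len
    return [m.group().decode("ascii") for m in re.finditer(pattern, data)]


def utf16le_runs(data: bytes, min_len: int = 8) -> list[str]:
    """Return runs of ASCII-range characters stored as UTF-16LE (char + 0x00)."""
    pattern = rb"(?:[\x20-\x7e\t\r\n]\x00){%d,}" % min_len
    return [m.group().decode("utf-16-le") for m in re.finditer(pattern, data)]


def recover_text(path: str) -> list[str]:
    """Collect both kinds of runs from a damaged file."""
    with open(path, "rb") as fh:
        data = fh.read()
    return ascii_runs(data) + utf16le_runs(data)
```

The output will include noise such as font names and metadata strings, and text outside the ASCII range is skipped, so treat it as raw material for manual cleanup.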

What makes .DOC special

What is a DOC file?

DOC is the binary file format that Microsoft Word used by default from Word 97 through Word 2003. It stores formatted text, images, tables, and other document elements in a proprietary binary structure (OLE Compound Document). It was superseded by the XML-based DOCX format in Office 2007.


How to open DOC files

  • Microsoft Word (Windows, macOS) — Full editing support
  • Google Docs (Web) — Free, online editing
  • LibreOffice Writer (Windows, macOS, Linux) — Free, open-source
  • Apple Pages (macOS, iOS) — Free
  • WPS Office (Windows, macOS, Linux, Mobile) — Free

Technical specifications

Property | Value
Format | OLE2 Compound Document
Encoding | Binary
Macros | VBA macro support
Max Size | ~512 MB
Compatibility | Office 97–2003

Programs that open DOC files

  • Microsoft Word — Native editor
  • LibreOffice Writer — Free alternative
  • Google Docs — Online editing
  • WPS Office — Free office suite
  • Apple Pages — macOS/iOS word processor

Common use cases

  • Legacy documents: Old Word files from pre-2007
  • Compatibility: Sharing with users on older software
  • Macro documents: VBA-enabled templates

.DOC compared to alternatives

.DOC vs .DOCX
Criteria: File size and openness
Winner: DOCX
DOCX is a ZIP archive of XML files, producing files 30-75% smaller than equivalent DOC. DOCX is an open standard (ECMA-376); DOC is a proprietary binary format.

.DOC vs .PDF
Criteria: Document fidelity across platforms
Winner: PDF
PDF renders identically on every device. DOC rendering varies based on installed fonts, Word version, and operating system, causing layout shifts.

.DOC vs .RTF
Criteria: Cross-application compatibility
Winner: RTF
RTF is a text-based interchange format readable by virtually every word processor. DOC's binary structure requires a full OLE2 parser, limiting compatibility to Microsoft Office and a few open-source implementations.

.DOC vs .ODT
Criteria: Open standard compliance
Winner: ODT
ODT (ODF) is an ISO-standardized document format with full LibreOffice support. DOC is proprietary and even Microsoft has moved to DOCX as the default.

Technical reference

MIME Type: application/msword
Magic Bytes: D0 CF 11 E0 A1 B1 1A E1 (OLE2 Compound Binary File header)
Developer: Microsoft
Year Introduced: 1983 (Word 1.0 for DOS; the OLE2-based format described here dates from Word 97)
Open Standard: No

00000000  D0 CF 11 E0 A1 B1 1A E1  ........

Binary Structure

DOC files use the OLE2 Compound File Binary Format (CFBF), which is a file system within a file. The file begins with the 512-byte CFBF header, which contains the magic signature D0 CF 11 E0 A1 B1 1A E1, the sector size (512 or 4096 bytes), the FAT and DIFAT sector locations, and the location of the first directory sector.

The internal directory tree contains named streams and storages:

  • 'WordDocument': the main stream, with the FIB at offset 0
  • '1Table' or '0Table': formatting and property tables; a flag in the FIB (fWhichTblStm) selects which of the two is in use
  • 'Data': embedded objects and images
  • 'Macros' (optional): VBA project storage
  • '\x05SummaryInformation' (optional): document metadata

The FIB (File Information Block) in the WordDocument stream is the master index: it contains offsets and lengths for every data structure (character positions, paragraph formatting, section breaks, fonts, styles). Text is stored in the WordDocument stream, as 8-bit characters or UTF-16LE, and is located through the piece table in the Table stream. Formatting is applied via the Table stream's PLCs (plexes: arrays of character positions paired with formatting data).
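The directory tree described above can be listed with a small amount of code. The sketch below is a simplified reader, not a full CFBF implementation: it assumes the whole file is in memory, follows only the 109 FAT locations stored in the header (so it ignores the extra DIFAT sectors that larger files use), and does not reconstruct the parent/child hierarchy. The function name `list_directory` is hypothetical.

```python
import struct

ENDOFCHAIN = 0xFFFFFFFE
FREESECT = 0xFFFFFFFF
ENTRY_TYPES = {1: "storage", 2: "stream", 5: "root"}


def list_directory(data: bytes) -> list[tuple[str, str]]:
    """Return (name, type) for each directory entry in a CFBF image."""
    if data[:8] != bytes.fromhex("D0CF11E0A1B11AE1"):
        raise ValueError("not a CFBF file")
    sector_size = 1 << struct.unpack_from("<H", data, 0x1E)[0]
    first_dir = struct.unpack_from("<I", data, 0x30)[0]

    def sector(n: int) -> bytes:
        start = (n + 1) * sector_size  # sector 0 follows the header
        return data[start:start + sector_size]

    # The header holds the first 109 DIFAT entries: locations of FAT sectors.
    fat: list[int] = []
    for n in struct.unpack_from("<109I", data, 0x4C):
        if n != FREESECT:
            fat.extend(struct.unpack(f"<{sector_size // 4}I", sector(n)))

    entries = []
    seen = set()
    n = first_dir
    # Follow the directory sector chain through the FAT, guarding against loops.
    while n != ENDOFCHAIN and n < len(fat) and n not in seen:
        seen.add(n)
        chunk = sector(n)
        for off in range(0, sector_size, 128):  # 128-byte directory entries
            raw = chunk[off:off + 128]
            name_len, obj_type = struct.unpack_from("<HB", raw, 64)
            if obj_type not in ENTRY_TYPES:
                continue  # unused slot
            # name_len counts bytes including the UTF-16 null terminator.
            name = raw[:max(name_len - 2, 0)].decode("utf-16-le")
            entries.append((name, ENTRY_TYPES[obj_type]))
        n = fat[n]
    return entries
```

For a typical DOC file, the result would be expected to include a `WordDocument` stream and either `1Table` or `0Table`. For production use, an established parser such as Apache POI is a safer choice than a hand-rolled reader.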

Offset | Length | Field | Example | Description
0x00 | 8 bytes | CFBF Signature | D0 CF 11 E0 A1 B1 1A E1 | OLE2 magic bytes. Identifies the file as a Compound File Binary Format container. Shared by DOC, XLS, PPT, and other OLE2 formats.
0x08 | 16 bytes | CLSID | 00 00 00 00 ... (16x 00) | Class identifier. Usually all zeros for DOC files.
0x18 | 2 bytes | Minor version | 3E 00 | Minor version of the CFBF specification.
0x1A | 2 bytes | Major version | 03 00 | 3 = CFBF v3 (512-byte sectors). 4 = CFBF v4 (4096-byte sectors).
0x1C | 2 bytes | Byte order | FE FF | Always FE FF (little-endian). CFBF does not support big-endian.
0x1E | 2 bytes | Sector size power | 09 00 | Sector size as power of 2. 9 = 512 bytes (v3). 12 = 4096 bytes (v4).
0x2C | 4 bytes | FAT sectors count | 01 00 00 00 | Total number of FAT sectors. The FAT maps sector chains for all streams.
0x30 | 4 bytes | First directory sector | 00 00 00 00 | Location of the first directory sector containing the stream/storage entries.
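The header fields in the table above can be decoded with Python's `struct` module. This is a minimal sketch that reads only the fields listed in the table; the names `CfbfHeader` and `parse_cfbf_header` are hypothetical.

```python
import struct
from typing import NamedTuple

SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


class CfbfHeader(NamedTuple):
    minor_version: int
    major_version: int
    byte_order: int
    sector_size: int
    fat_sector_count: int
    first_directory_sector: int


def parse_cfbf_header(head: bytes) -> CfbfHeader:
    """Decode the table's fields from the first 52+ bytes of a file."""
    if len(head) < 0x34:
        raise ValueError("need at least 52 bytes of header")
    if head[:8] != SIGNATURE:
        raise ValueError("CFBF signature not found")
    # Four little-endian uint16 values starting at offset 0x18.
    minor, major, byte_order, shift = struct.unpack_from("<4H", head, 0x18)
    # Two little-endian uint32 values starting at offset 0x2C.
    fat_count, first_dir = struct.unpack_from("<2I", head, 0x2C)
    return CfbfHeader(minor, major, byte_order, 1 << shift, fat_count, first_dir)
```

Fed the example bytes from the table, the sector size power of 9 decodes to a 512-byte sector size and the byte-order bytes FE FF decode to the integer 0xFFFE.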
1983: Microsoft Word 1.0 released for DOS with its own binary format
1997: Word 97 introduces the OLE2-based DOC binary format (CFBF), used through Word 2003
2006: Ecma International approves ECMA-376 (Office Open XML)
2007: Office 2007 defaults to DOCX (Office Open XML), making DOC a legacy compatibility format
2008: Microsoft publishes the DOC binary format specification under the Microsoft Open Specification Promise; ISO/IEC 29500 standardizes OOXML, further marginalizing DOC
Convert DOC to PDF via LibreOffice
libreoffice --headless --convert-to pdf input.doc

LibreOffice runs in headless mode (no GUI), parsing the OLE2 binary and rendering to PDF. Preserves formatting, images, and tables. The output file is created in the current directory.

Convert DOC to DOCX via LibreOffice
libreoffice --headless --convert-to docx input.doc

Converts the binary DOC format to DOCX (Office Open XML). VBA macros do not carry over, because DOCX cannot contain them (macro-enabled documents use DOCM). Font substitution may occur if the original fonts are not installed.

Extract plain text from DOC
antiword input.doc > output.txt

Antiword reads the WordDocument stream directly and extracts text content with basic paragraph structure. Strips all formatting, images, and embedded objects.

Batch convert all DOC files to PDF
libreoffice --headless --convert-to pdf *.doc

Converts every DOC file in the current directory to PDF. Useful for migrating legacy document archives. Add --outdir /path/to/output to control the destination.
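For scripted migrations, the same command can be driven from Python. The sketch below only wraps the flags shown above (`--headless`, `--convert-to`, `--outdir`); the function names are hypothetical, and it assumes a `libreoffice` executable on the PATH (some installations expose it as `soffice` instead).

```python
import subprocess
from pathlib import Path


def build_convert_command(sources, target_format="pdf", outdir=None,
                          binary="libreoffice"):
    """Build the argument list for a headless LibreOffice conversion."""
    cmd = [binary, "--headless", "--convert-to", target_format]
    if outdir is not None:
        cmd += ["--outdir", str(outdir)]
    return cmd + [str(s) for s in sources]


def convert_directory(folder, target_format="pdf", outdir=None, timeout=600):
    """Convert every .doc file in folder; return the list of inputs."""
    sources = sorted(Path(folder).glob("*.doc"))
    if not sources:
        return []  # nothing to do, so LibreOffice is never started
    subprocess.run(
        build_convert_command(sources, target_format, outdir),
        check=True,        # raise if LibreOffice exits with an error
        timeout=timeout,   # avoid hanging forever on a malformed file
    )
    return sources
```

Passing an argument list instead of a shell string avoids quoting problems with file names that contain spaces.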

DOC to DOCX (transcode, near-lossless): DOCX uses ZIP-compressed XML, producing smaller files with open-standard interoperability. Converting DOC to DOCX enables editing in Google Docs, modern Office versions, and web-based editors without compatibility mode warnings.
DOC to PDF (render, near-lossless): PDF locks the document layout for universal viewing and printing. Converting DOC to PDF eliminates font substitution issues and formatting drift across different Word versions and operating systems.
DOC to TXT (export, lossy): Plain text extraction strips all formatting and embedded objects, producing a lightweight file for text indexing, search engines, and data pipelines that process document content without layout.
Security risk: HIGH

Attack Vectors

  • VBA macros: DOC files can contain auto-executing macros that download and run malware. Macro malware has long been one of the most common Office attack vectors.
  • OLE2 embedded objects: ActiveX controls and OLE objects inside a DOC file can execute code when opened.
  • Equation Editor exploits: CVE-2017-11882 targets the legacy Equation Editor component through equation objects embedded in DOC files, enabling remote code execution.
  • DDE (Dynamic Data Exchange) fields: can run arbitrary commands when the document is opened and the user accepts the update prompts, even without macros enabled.

Mitigation: FileDex does not open or parse DOC files in the browser. DOC is a reference-only page. Users should open untrusted DOC files in Protected View (Microsoft Word) or convert to PDF via LibreOffice headless before viewing.
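As a first triage step before opening a file, the raw bytes can be searched for strings associated with the attack vectors above. This is a crude heuristic sketch, not a security tool: the indicator strings (the `Macros` storage name, the `_VBA_PROJECT` stream name, the `Equation.3` ProgID, the `Equation Native` stream name, and the `DDEAUTO` field code) are assumptions about what such files commonly contain, and both false positives and false negatives are expected. Directory names in an OLE2 file are stored as UTF-16LE, which is why most needles are encoded that way.

```python
# Hypothetical indicator table: label -> byte strings to search for.
INDICATORS = {
    "VBA project storage": [
        "Macros".encode("utf-16-le"),
        "_VBA_PROJECT".encode("utf-16-le"),
    ],
    "Equation Editor object": [
        b"Equation.3",
        "Equation Native".encode("utf-16-le"),
    ],
    "DDE field": [
        b"DDEAUTO",
        "DDEAUTO".encode("utf-16-le"),
    ],
}


def scan_indicators(data: bytes) -> list[str]:
    """Return the labels whose byte strings occur anywhere in data."""
    return [
        label
        for label, needles in INDICATORS.items()
        if any(needle in data for needle in needles)
    ]


def scan_file(path: str) -> list[str]:
    """Read a file without opening it in an office suite and scan it."""
    with open(path, "rb") as fh:
        return scan_indicators(fh.read())
```

An empty result does not mean a file is safe. A document whose body text merely mentions the word "Macros" can also trigger a match, so hits are a prompt for closer inspection rather than a verdict.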

LibreOffice (tool): Open-source office suite with full DOC read/write and headless CLI conversion
antiword (tool): Lightweight CLI tool that extracts plain text from DOC files without dependencies
Apache POI (library): Java library for reading and writing OLE2 compound documents including DOC, XLS, PPT
python-docx (library): Python library for DOCX (not DOC). Pair with LibreOffice for a DOC-to-DOCX-to-Python pipeline.
Pandoc (tool): Universal document converter. It does not read binary DOC directly; convert to DOCX with LibreOffice first, then pass the result to Pandoc.