ZIP Archive
ZIP is the universal archive format supported natively by Windows, macOS, Linux, Android, and iOS. Invented by Phil Katz in 1989, the format uses DEFLATE compression, CRC-32 integrity checks, and optional AES-256 encryption — all within an open specification any tool can read.
Extraction not yet available. ZIP decompression requires streaming support for the DEFLATE algorithm and central-directory parsing — a feature planned for a future FileDex update.
Common questions
How do I open a ZIP file?
Windows, macOS, and Linux all support ZIP natively without extra software. On Windows, double-click the file or right-click and choose Extract All. On macOS, double-click to auto-extract via Archive Utility. On Linux, most file managers handle ZIP directly, or run unzip archive.zip from the terminal. Android and iOS also open ZIP files natively through their built-in Files apps.
What does PK mean in ZIP files?
PK stands for Phil Katz, the programmer who created the ZIP format in 1989. A standard ZIP file begins with the hex bytes 50 4B, which spell out his initials in ASCII (self-extracting archives are an exception, because executable code precedes the ZIP data). Katz developed PKZIP after a legal dispute over the ARC compression format, creating an open alternative that became the universal standard for file archiving.
Are DOCX and XLSX files really ZIP files?
Yes. Microsoft Office documents including DOCX, XLSX, and PPTX are ZIP archives containing XML files and media resources. EPUB e-books, Android APK packages, and Java JAR files also use ZIP as their container format. Renaming any of these to .zip and opening with an archive tool reveals the internal structure.
Is ZIP compression lossless?
Yes. ZIP uses DEFLATE compression, which is entirely lossless — extracted files are bit-for-bit identical to the originals. No data is discarded or approximated during compression. The algorithm reduces file size by finding and encoding repeated byte patterns, then reverses the process exactly during extraction.
Why is my ZIP file barely smaller than the original files?
Files that are already compressed — JPEG images, MP4 videos, MP3 audio — contain very little redundancy for DEFLATE to exploit. Zipping a folder of photos may reduce size by only one or two percent. ZIP works best on text, source code, and uncompressed data.
What is a ZIP bomb?
A ZIP bomb is a small archive that expands to an enormous size when extracted. The classic example, 42.zip, is 42 KB compressed but expands to 4.5 petabytes. Modern antivirus and archive tools detect these by checking the compression ratio before extraction.
How do I password-protect a ZIP file with strong encryption?
Use 7-Zip with AES-256 encryption. The legacy ZipCrypto scheme, which Windows can open and many tools still use by default, is cryptographically broken: when an attacker knows the contents of any file in the archive, a known-plaintext attack can recover the key in seconds. In 7-Zip, select ZIP format and choose AES-256 as the encryption method when setting a password.
What is the maximum file size a ZIP archive can hold?
Standard ZIP is limited to 4 GB per file and 4 GB total archive size due to 32-bit size fields in the original specification. ZIP64 extensions remove these limits by using 64-bit fields, supporting files and archives up to 16 exbibytes. Most modern tools create ZIP64 archives automatically when needed.
What makes .ZIP special
Behind every DOCX you email, every APK you install, and every EPUB you read sits an invisible ZIP archive — a container format so foundational that most people use it daily without knowing it exists. The story of ZIP begins not with a corporation or a standards body, but with a twenty-six-year-old programmer from Milwaukee named Phil Katz.
The Phil Katz Story
In 1988, Phillip Walter Katz faced a lawsuit from System Enhancement Associates (SEA), the company behind the popular ARC compression format. Katz had created PKARC, a faster and freely distributed tool compatible with SEA's format. SEA sued for trademark and copyright infringement. The settlement cost PKWARE $22,500 in royalties plus $40,000 in legal expenses. Rather than continue working within someone else's format, Katz did something remarkable: he created an entirely new archive format from scratch and published the specification openly, declaring it would always be free for competing software to implement.
On February 14, 1989, PKZIP 1.0 shipped. The magic bytes at the start of every ZIP file — 50 4B in hexadecimal, the ASCII letters "PK" — are Phil Katz's initials, permanently encoded into every ZIP archive ever created. Katz was found dead in a hotel room on April 14, 2000, at age 37. His creation outlived him by decades and now underpins billions of files across every operating system on Earth.
Reading ZIP Files Backwards: The EOCD-First Model
ZIP has an architectural quirk that surprises most developers: the file is designed to be read from the end, not the beginning. A ZIP archive has three major sections:
- Local file headers + compressed data — one pair per archived file, written sequentially
- Central directory — a repeated set of entries mirroring the local headers, with full metadata and byte offsets pointing back to each local header
- End of Central Directory Record (EOCD) — a single record at the very end that points to the central directory
A ZIP parser starts by seeking to the end of the file, scanning backward to locate the EOCD signature (50 4B 05 06), then reads the central directory offset from the EOCD, jumps to the central directory, and from there locates every file in the archive. This end-anchored design has three practical consequences. First, files can be appended to a ZIP without rewriting the existing entries: the tool adds new local entries, then writes a new central directory and EOCD after them. Second, a ZIP can be embedded inside another file (such as a self-extracting EXE) because the parser ignores everything before the ZIP data and reads from the end. Third, if the EOCD is lost to truncation or corruption, standard parsers can no longer read the archive even though all compressed file data may be intact; recovery then depends on scanning for the local file headers directly.
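The backward scan described above can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the helper names `find_eocd` and `build_sample` are made up for this example, the sample archive is generated in memory, and ZIP64 archives and signatures that happen to occur inside an archive comment are not handled.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"   # 50 4B 05 06
EOCD_MIN_SIZE = 22          # fixed part of the record, without a comment
MAX_COMMENT = 0xFFFF        # the comment length is a 16-bit field


def find_eocd(data: bytes) -> dict:
    """Scan backward for the EOCD record and return its key fields."""
    # The record can start at most 22 + 65535 bytes before the end of the file.
    start = max(0, len(data) - EOCD_MIN_SIZE - MAX_COMMENT)
    pos = data.rfind(EOCD_SIG, start)
    if pos < 0:
        raise ValueError("EOCD signature not found")
    (_sig, _disk, _cd_disk, _entries_here, total_entries,
     cd_size, cd_offset, _comment_len) = struct.unpack_from("<4sHHHHIIH", data, pos)
    return {"eocd_pos": pos, "entries": total_entries,
            "cd_size": cd_size, "cd_offset": cd_offset}


def build_sample() -> bytes:
    """Create a small two-file archive in memory (illustrative data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("a.txt", "hello " * 100)
        zf.writestr("b.txt", "world " * 100)
    return buf.getvalue()


sample = build_sample()
eocd = find_eocd(sample)
print(eocd)
# The central directory begins with its own signature, 50 4B 01 02.
print(sample[eocd["cd_offset"]:eocd["cd_offset"] + 4])
```

A production parser would also validate that the comment length stored in the record matches the number of bytes remaining after it, which rules out false matches.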
DOCX, XLSX, EPUB, APK — They Are All ZIP Files
Many common file formats are simply ZIP archives with specific internal directory structures, a fact that surprises many users. Rename a .docx file to .zip and unzip it: inside you will find XML files describing document content, styles, relationships, and embedded media. The same applies to:
- DOCX / XLSX / PPTX — Microsoft Office Open XML (OOXML), standardized as ECMA-376 and ISO/IEC 29500
- ODT / ODS / ODP — OpenDocument Format, standardized as ISO/IEC 26300
- EPUB — E-book container defined by the W3C
- JAR / WAR / EAR — Java archive formats
- APK — Android application packages
- XPI — Firefox browser extensions
- KMZ — Google Earth placemarks
- CBZ — Comic book archives
ISO/IEC 21320-1:2015 formalized this practice by defining a restricted ZIP profile for document containers. The standard permits only Store (method 0) and DEFLATE (method 8) compression, forbids encryption, forbids multi-disk spanning, and normatively references APPNOTE version 6.3.3. Any file format that claims ISO/IEC 21320-1 conformance must be a valid ZIP file meeting these restrictions.
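As a rough illustration, Python's standard `zipfile` module can open any of these containers directly. The sketch below builds a stand-in archive in memory (the two entries mimic a DOCX layout but do not form a valid Word document), lists its members, and checks two of the restrictions named above: compression method and encryption. The function names are hypothetical, and the check does not cover the full standard.

```python
import io
import zipfile


def build_fake_docx() -> bytes:
    """Build a tiny DOCX-like container; the contents are placeholders."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<w:document/>")
    return buf.getvalue()


def list_container(data: bytes) -> list:
    """Return member names if the bytes form a ZIP archive."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        raise ValueError("not a ZIP container")
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


def fits_restricted_profile(data: bytes) -> bool:
    """Partial check: only Store/DEFLATE members, none encrypted."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.compress_type not in (zipfile.ZIP_STORED, zipfile.ZIP_DEFLATED):
                return False
            if info.flag_bits & 0x1:  # general purpose bit 0: encrypted
                return False
    return True


container = build_fake_docx()
names = list_container(container)
print(names)
print(fits_restricted_profile(container))
```

To try this on a real document, read the file's bytes and pass them to `list_container` in place of the generated sample.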
DEFLATE: The Algorithm Inside
DEFLATE, the default and overwhelmingly dominant compression method in ZIP files, combines LZ77 sliding-window matching with Huffman coding. The compressor scans input data for repeated byte sequences within a 32 KB window, replacing matches with back-references (distance, length pairs), then encodes the result using variable-length Huffman codes. DEFLATE is also the algorithm inside gzip, PNG, and HTTP content encoding — making it arguably the most widely deployed compression algorithm in history.
DEFLATE achieves typical compression ratios of 2:1 to 5:1 on text, source code, and structured data. Already-compressed data like JPEG images, MP4 video, or MP3 audio compresses poorly or not at all — this is why zipping a folder of photos barely reduces its size. ZIP stores each file independently, so tools can choose to store incompressible files uncompressed (method 0) while compressing text files with DEFLATE.
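The contrast between compressible and incompressible input is easy to observe with Python's `zlib` module, which implements DEFLATE. In this sketch, `wbits=-15` selects a raw DEFLATE stream with a 32 KB window, the form stored inside ZIP entries. The sample text is artificially repetitive, so its ratio will be far higher than the typical figures quoted above; the random bytes stand in for already-compressed data.

```python
import os
import zlib


def deflate(data: bytes, level: int = 6) -> bytes:
    """Compress to a raw DEFLATE stream (no zlib or gzip wrapper)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


def inflate(packed: bytes) -> bytes:
    """Reverse deflate(); the output matches the input byte for byte."""
    return zlib.decompress(packed, -15)


def ratio(data: bytes) -> float:
    """Uncompressed size divided by compressed size."""
    return len(data) / len(deflate(data))


text = b"The quick brown fox jumps over the lazy dog.\n" * 2000
noise = os.urandom(len(text))  # stand-in for JPEG/MP4/MP3 payloads

print(f"repetitive text: {ratio(text):.1f}:1")
print(f"random bytes:    {ratio(noise):.3f}:1")
print("lossless round trip:", inflate(deflate(text)) == text)
```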
ZIP64: Breaking the 4 GB Barrier
The original ZIP specification stores file sizes and archive offsets in 32-bit fields, limiting individual files and total archive size to approximately 4 GB (2^32 bytes). ZIP64 extensions, introduced in APPNOTE version 4.5 in 2001, use extra fields with header ID 0x0001 to store 64-bit sizes. When any size field would overflow 32 bits, the local header writes 0xFFFFFFFF as a sentinel value and stores the actual 64-bit size in the ZIP64 extended information extra field.
All modern tools — 7-Zip, WinRAR, Windows Explorer (since Windows 8), macOS Archive Utility, Python's zipfile module (since Python 3.4) — handle ZIP64 transparently. Legacy tools from before 2001 silently truncate or refuse ZIP64 archives.
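The sentinel mechanism can be sketched as a small lookup over the extra field. The function below is a hypothetical helper, not part of any library: it walks the extra-field blocks (2-byte header ID, 2-byte length, then data), finds the block with ID 0x0001, and substitutes 64-bit values for any size field that holds 0xFFFFFFFF. It assumes the ordering used by the ZIP64 extended information field, where the uncompressed size comes before the compressed size.

```python
import struct

ZIP64_HEADER_ID = 0x0001
SENTINEL_32 = 0xFFFFFFFF


def resolve_sizes(compressed_size: int, uncompressed_size: int, extra: bytes):
    """Return (compressed, uncompressed), reading ZIP64 values where needed."""
    zip64 = None
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        if header_id == ZIP64_HEADER_ID:
            zip64 = extra[pos + 4:pos + 4 + size]
            break
        pos += 4 + size  # skip unrelated extra-field blocks

    needs_zip64 = SENTINEL_32 in (compressed_size, uncompressed_size)
    if zip64 is None:
        if needs_zip64:
            raise ValueError("sentinel present but no ZIP64 extra field")
        return compressed_size, uncompressed_size

    offset = 0
    if uncompressed_size == SENTINEL_32:
        (uncompressed_size,) = struct.unpack_from("<Q", zip64, offset)
        offset += 8
    if compressed_size == SENTINEL_32:
        (compressed_size,) = struct.unpack_from("<Q", zip64, offset)
    return compressed_size, uncompressed_size


# Hand-built example: an unrelated 1-byte block, then a ZIP64 block that
# declares a 20 GiB file compressed to 12 GiB.
extra_field = (struct.pack("<HH", 0x5455, 1) + b"\x00" +
               struct.pack("<HHQQ", ZIP64_HEADER_ID, 16, 20 * 2**30, 12 * 2**30))
print(resolve_sizes(SENTINEL_32, SENTINEL_32, extra_field))
```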
Security: ZIP Bombs, Zip Slip, and Weak Encryption
ZIP's ubiquity makes it a frequent attack vector. Three vulnerabilities stand out:
ZIP bomb (42.zip): A 42 KB file that decompresses to 4.5 petabytes through nested layers of 16 ZIP files each. Non-recursive variants use overlapping file references within the central directory to bypass depth-checking defenses, achieving massive expansion ratios in a single flat archive. Any extraction tool without decompressed-size limits is vulnerable.
Zip Slip (CVE-2018-1002200): A path traversal attack where archived filenames contain ../ sequences. During extraction, vulnerable libraries write files outside the intended directory — potentially overwriting executables, configuration files, or SSH keys. Discovered by Snyk in 2018, this vulnerability affected Java, .NET, Ruby, Go, and Python extraction libraries.
ZipCrypto weakness: The traditional PKWARE encryption scheme (ZipCrypto) is cryptographically broken. It uses a stream cipher seeded from a 96-bit internal state derived from the password, and is vulnerable to known-plaintext attacks. Tools like pkcrack can recover the encryption key in seconds if any file in the archive has known content. AES-256 encryption (added in APPNOTE version 5.2, implemented by WinZip and 7-Zip) is secure but not universally supported by all extraction tools.
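For the Zip Slip case, the core defense is to resolve each member's destination path and confirm it stays inside the extraction directory before writing anything. The following is a minimal sketch of that check; `safe_members` and `make_archive` are names invented for this example, and the directory name `extract_here` is a placeholder that does not need to exist.

```python
import io
import os
import zipfile


def safe_members(zf: zipfile.ZipFile, dest_dir: str) -> list:
    """Return the archive's members, or raise if any would escape dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    checked = []
    for info in zf.infolist():
        target = os.path.realpath(os.path.join(dest_root, info.filename))
        # commonpath equals dest_root only when target lies inside it.
        if os.path.commonpath([dest_root, target]) != dest_root:
            raise ValueError(f"blocked path traversal: {info.filename!r}")
        checked.append(info)
    return checked


def make_archive(names) -> bytes:
    """Build an in-memory archive with the given member names (test data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"data")
    return buf.getvalue()


good = make_archive(["docs/readme.txt"])
bad = make_archive(["docs/readme.txt", "../../evil.txt"])

with zipfile.ZipFile(io.BytesIO(good)) as zf:
    print([m.filename for m in safe_members(zf, "extract_here")])

with zipfile.ZipFile(io.BytesIO(bad)) as zf:
    try:
        safe_members(zf, "extract_here")
    except ValueError as exc:
        print(exc)
```

The check runs over the whole archive before any file is written, so a single malicious entry rejects the archive rather than leaving a partial extraction behind.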
Compression Comparison
ZIP compresses each file independently, which enables random access but limits compression ratio. Solid-archive formats like 7z and RAR compress all files as a single stream, exploiting cross-file redundancy for 30-70% better ratios on collections of similar files (such as source code repositories). The tradeoff: extracting a single file from a solid archive requires decompressing everything before it. ZIP wins on compatibility and random access; 7z wins on compression ratio; TAR.GZ wins on Unix metadata preservation.
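The effect of solid compression can be isolated with a small experiment. The sketch below uses DEFLATE (via `zlib`) for both cases, so it shows only the gain from sharing redundancy across files, not the additional gain from 7z's LZMA2 and its larger dictionary. The fifty generated "source files" are invented test data that share most of their content.

```python
import zlib


def per_file_size(files) -> int:
    """Total size when each file is compressed on its own, as ZIP does."""
    return sum(len(zlib.compress(f, 9)) for f in files)


def solid_size(files) -> int:
    """Size when all files are compressed as one continuous stream."""
    return len(zlib.compress(b"".join(files), 9))


# Fifty small files with a shared header and one unique line each.
header = (b"# Copyright Example Corp\n"
          b"import os\nimport sys\n\n"
          b"def main():\n    pass\n")
files = [header + f"VALUE_{i} = {i}\n".encode() for i in range(50)]

separate = per_file_size(files)
solid = solid_size(files)
print(f"raw: {sum(map(len, files))} bytes")
print(f"per-file: {separate} bytes, solid: {solid} bytes")
```

The gap narrows when the files have little in common or when individual files are much larger than the compressor's window.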
.ZIP compared to alternatives
| Formats | Criteria | Winner |
|---|---|---|
| .ZIP vs .7Z | Compression ratio vs compatibility: 7z's LZMA2 algorithm achieves 30-70% better compression than ZIP's DEFLATE on text and code, especially in solid mode where cross-file redundancy is exploited. ZIP wins on compatibility, since every major OS opens it natively. | 7Z wins |
| .ZIP vs .RAR | Platform support and recovery: ZIP is natively supported by every major OS without additional software. RAR requires third-party tools but offers recovery records that can repair partially corrupted archives, a feature ZIP entirely lacks. | Draw |
| .ZIP vs .TAR.GZ | Random access vs Unix metadata: ZIP stores each file independently, enabling extraction of any single file without processing the entire archive. TAR.GZ preserves Unix permissions, symlinks, and ownership that ZIP tools commonly discard, making it the standard for Linux source distribution. | Draw |
Technical reference
- MIME Type: application/zip
- Magic Bytes: 50 4B 03 04 (PK signature, Phil Katz's initials)
- Developer: Phil Katz / PKWARE
- Year Introduced: 1989
- Open Standard: Yes
Binary Structure
ZIP uses a three-section layout: local file entries, central directory, and End of Central Directory Record (EOCD). Each local file entry starts with signature 50 4B 03 04, followed by a 26-byte fixed header containing version, flags, compression method, timestamps, CRC-32, compressed and uncompressed sizes, and filename/extra field lengths. The compressed file data immediately follows the variable-length filename and extra field. The central directory at the end of the file mirrors each local entry with signature 50 4B 01 02, adding file attributes, comments, and the byte offset to each local header. The EOCD record (50 4B 05 06) stores the total entry count and the byte offset to the central directory start. Parsers locate the EOCD by scanning backward from the file end — this end-anchored design enables appending without rewriting but means truncation of even a few bytes at the file end destroys the entire archive's navigability.
| Offset | Length | Field | Example | Description |
|---|---|---|---|---|
| 0x00 | 4 bytes | Local File Header Signature | 50 4B 03 04 | "PK", Phil Katz's initials. Marks the start of each local file entry. |
| 0x04 | 2 bytes | Version Needed to Extract | 14 00 (v2.0) | Minimum ZIP spec version required. 0x0014 = 2.0 (DEFLATE). 0x002D = 4.5 (ZIP64). |
| 0x06 | 2 bytes | General Purpose Bit Flag | 00 00 | Bit 0: encrypted. Bit 3: data descriptor follows. Bit 11: UTF-8 filenames. |
| 0x08 | 2 bytes | Compression Method | 08 00 (DEFLATE) | 0 = Store (none). 8 = DEFLATE. 14 = LZMA. 93 = Zstandard. |
| 0x0E | 4 bytes | CRC-32 | 48 C9 11 48 | CRC-32 checksum of uncompressed file data. Verified on extraction to detect corruption. |
| 0x12 | 4 bytes | Compressed Size | 0C 00 00 00 | Compressed data size in bytes. 0xFFFFFFFF triggers ZIP64 extended field. |
| 0x16 | 4 bytes | Uncompressed Size | 0A 00 00 00 | Original file size before compression. 0xFFFFFFFF triggers ZIP64. |
| EOF-22 | 4 bytes | EOCD Signature | 50 4B 05 06 | End of Central Directory Record. Parsers scan backward from EOF to find this. Contains entry count and central directory offset. |
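The field layout in the table maps directly onto a fixed 30-byte structure (the 4-byte signature plus the 26-byte header). The sketch below decodes it with Python's `struct` module from an archive generated in memory; `parse_local_header` is a name chosen for this example, and the code assumes a plain entry with no data descriptor, encryption, or ZIP64 sizes.

```python
import io
import struct
import zipfile
import zlib

# signature, version, flags, method, mod time, mod date,
# CRC-32, compressed size, uncompressed size, name length, extra length
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # 30 bytes, little-endian


def parse_local_header(data: bytes, offset: int = 0) -> dict:
    """Decode the local file header that starts at the given offset."""
    (sig, version, flags, method, _mtime, _mdate, crc,
     csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(data, offset)
    if sig != b"PK\x03\x04":
        raise ValueError("not a local file header")
    name_start = offset + LOCAL_HEADER.size
    encoding = "utf-8" if flags & 0x0800 else "cp437"  # bit 11: UTF-8 names
    name = data[name_start:name_start + name_len].decode(encoding)
    return {"version": version, "flags": flags, "method": method,
            "crc32": crc, "compressed": csize, "uncompressed": usize,
            "name": name,
            "data_offset": name_start + name_len + extra_len}


payload = b"hello hello hello hello\n"
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("hello.txt", payload)
archive = buf.getvalue()

header = parse_local_header(archive)
print(header)

# The compressed bytes follow the filename and extra field as raw DEFLATE.
start = header["data_offset"]
raw = archive[start:start + header["compressed"]]
print(zlib.decompress(raw, -15) == payload)
print(zlib.crc32(payload) == header["crc32"])
```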
Attack Vectors
- ZIP Bomb (42.zip)
- Zip Slip (path traversal)
- ZipCrypto weak encryption
- Symlink attacks
- Malicious payload delivery
Mitigation: Extract to a new empty directory, never to system paths. Use AES-256 encryption (7-Zip -mem=AES256), never ZipCrypto. Set decompressed-size limits to prevent ZIP bomb attacks. Validate all filenames for path traversal before extraction. Scan contents with antivirus before opening extracted files.
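A decompressed-size limit can be approximated in two layers, sketched below with Python's `zipfile` module. The first function inspects the sizes declared in the central directory before extracting anything; because those fields can be forged, the second function also counts bytes while reading. The function names and the thresholds (100 MiB total, 100:1 ratio) are arbitrary values chosen for illustration, not recommendations from any specification.

```python
import io
import zipfile


def check_declared_sizes(zf, max_total=100 * 1024 * 1024, max_ratio=100.0) -> int:
    """Reject archives whose declared sizes exceed the limits."""
    total = 0
    for info in zf.infolist():
        total += info.file_size
        if total > max_total:
            raise ValueError("declared uncompressed size exceeds limit")
        if info.compress_size > 0 and info.file_size / info.compress_size > max_ratio:
            raise ValueError(f"suspicious compression ratio: {info.filename!r}")
    return total


def read_limited(zf, info, limit: int) -> bytes:
    """Read one member in chunks, aborting once more than limit bytes appear."""
    chunks = []
    remaining = limit
    with zf.open(info) as src:
        while True:
            chunk = src.read(64 * 1024)
            if not chunk:
                break
            remaining -= len(chunk)
            if remaining < 0:
                raise ValueError("member expanded past the byte limit")
            chunks.append(chunk)
    return b"".join(chunks)


def make_zip(name: str, payload: bytes) -> bytes:
    """Build a one-member DEFLATE archive in memory (test data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, payload)
    return buf.getvalue()


benign = make_zip("notes.txt", b"hello world\n" * 10)
inflated = make_zip("zeros.bin", bytes(1024 * 1024))  # 1 MiB of zero bytes

with zipfile.ZipFile(io.BytesIO(benign)) as zf:
    print(check_declared_sizes(zf))

with zipfile.ZipFile(io.BytesIO(inflated)) as zf:
    try:
        check_declared_sizes(zf)
    except ValueError as exc:
        print(exc)
```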
- Specification: PKWARE APPNOTE.TXT v6.3.9, .ZIP File Format Specification
- Specification: ISO/IEC 21320-1:2015, Document Container File (restricted ZIP profile)
- Registry: Library of Congress FDD, ZIP File Format (fdd000354)
- Registry: IANA Media Type, application/zip
- History: Wikipedia, Phil Katz biography and ZIP format history
- Industry: Snyk, Zip Slip Vulnerability (CVE-2018-1002200)