.ZIP ZIP Archive

ZIP is the universal archive format supported natively by Windows, macOS, Linux, Android, and iOS. Invented by Phil Katz in 1989, the format uses DEFLATE compression, CRC-32 integrity checks, and optional AES-256 encryption — all within an open specification any tool can read.

Archive structure
PK local headers
Files compressed data
EOCD central directory
Lossless · ISO 21320 · 1989
Not convertible

Extraction not yet available. ZIP decompression requires streaming support for the DEFLATE algorithm and central-directory parsing — a feature planned for a future FileDex update.

Common questions

How do I open a ZIP file?

Windows, macOS, and Linux all support ZIP natively without extra software. On Windows, double-click the file or right-click and choose Extract All. On macOS, double-click to auto-extract via Archive Utility. On Linux, most file managers handle ZIP directly, or run unzip archive.zip from the terminal. Android and iOS also open ZIP files natively through their built-in Files apps.

What does PK mean in ZIP files?

PK stands for Phil Katz, the programmer who created the ZIP format in 1989. An ordinary ZIP file begins with the hex bytes 50 4B, which spell out his initials in ASCII, and every record signature inside the archive starts with the same two bytes. Katz developed PKZIP after a legal dispute over the ARC compression format, creating an open alternative that became the universal standard for file archiving.

Are DOCX and XLSX files really ZIP files?

Yes. Microsoft Office documents including DOCX, XLSX, and PPTX are ZIP archives containing XML files and media resources. EPUB e-books, Android APK packages, and Java JAR files also use ZIP as their container format. Renaming any of these to .zip and opening with an archive tool reveals the internal structure.

Is ZIP compression lossless?

Yes. ZIP uses DEFLATE compression, which is entirely lossless — extracted files are bit-for-bit identical to the originals. No data is discarded or approximated during compression. The algorithm reduces file size by finding and encoding repeated byte patterns, then reverses the process exactly during extraction.

Why is my ZIP file barely smaller than the original files?

Files that are already compressed — JPEG images, MP4 videos, MP3 audio — contain very little redundancy for DEFLATE to exploit. Zipping a folder of photos may reduce size by only one or two percent. ZIP works best on text, source code, and uncompressed data.

What is a ZIP bomb?

A ZIP bomb is a small archive that expands to an enormous size when extracted. The classic example, 42.zip, is 42 KB compressed but expands to 4.5 petabytes. Modern antivirus and archive tools detect these by checking the compression ratio before extraction.

How do I password-protect a ZIP file with strong encryption?

Use 7-Zip with AES-256 encryption. The legacy ZipCrypto scheme that many tools still apply by default is cryptographically broken: an attacker who knows a few bytes of any encrypted file's contents can recover the keys without the password. In 7-Zip, select ZIP format and choose AES-256 as the encryption method when setting a password.

What is the maximum file size a ZIP archive can hold?

Standard ZIP is limited to 4 GB per file and 4 GB total archive size due to 32-bit size fields in the original specification. ZIP64 extensions remove these limits by using 64-bit fields, supporting files and archives up to 16 exbibytes. Most modern tools create ZIP64 archives automatically when needed.

What makes .ZIP special

PK's Initials
Every ZIP file carries Phil Katz's name
The magic bytes 50 4B (ASCII: PK) at the start of every ZIP file are the initials of Phil Katz, who created the format in 1989 after a lawsuit over his PKARC tool. He died in 2000 at age 37.
Hidden ZIP Files
DOCX, XLSX, EPUB, APK are all ZIP archives
Rename any .docx to .zip and extract it — inside are XML files. Microsoft Office, EPUB e-books, Android APK packages, and Java JAR files all use ZIP as their container format.
DEFLATE Everywhere
The same algorithm powers ZIP, gzip, and PNG
DEFLATE (LZ77 + Huffman coding) is the default ZIP compression method and also the algorithm inside gzip, PNG files, and HTTP content encoding — arguably the most deployed compression algorithm ever.
ZIP64
From 4 GB limit to 16 exbibytes
The original 32-bit size fields capped ZIP at 4 GB. ZIP64 extensions (2001) use 64-bit fields, supporting individual files up to 16 exbibytes (2^64 bytes). All modern tools handle ZIP64 transparently.

Behind every DOCX you email, every APK you install, and every EPUB you read sits an invisible ZIP archive — a container format so foundational that most people use it daily without knowing it exists. The story of ZIP begins not with a corporation or a standards body, but with a twenty-six-year-old programmer from Milwaukee named Phil Katz.

The Phil Katz Story

In 1988, Phillip Walter Katz faced a lawsuit from System Enhancement Associates (SEA), the company behind the popular ARC compression format. Katz had created PKARC, a faster and freely distributed tool compatible with SEA's format. SEA sued for copyright and trademark infringement. The settlement cost PKWARE $22,500 in royalties plus $40,000 in legal expenses. Rather than continue working within someone else's format, Katz did something remarkable: he created an entirely new archive format from scratch and published the specification openly, declaring it would always be free for competing software to implement.

On February 14, 1989, PKZIP 1.0 shipped. The magic bytes at the start of every ZIP file — 50 4B in hexadecimal, the ASCII letters "PK" — are Phil Katz's initials, permanently encoded into every ZIP archive ever created. Katz was found dead in a hotel room on April 14, 2000, at age 37. His creation outlived him by decades and now underpins billions of files across every operating system on Earth.

Reading ZIP Files Backwards: The EOCD-First Model

ZIP has an architectural quirk that surprises most developers: the file is designed to be read from the end, not the beginning. A ZIP archive has three major sections:

  1. Local file headers + compressed data — one pair per archived file, written sequentially
  2. Central directory — a repeated set of entries mirroring the local headers, with full metadata and byte offsets pointing back to each local header
  3. End of Central Directory Record (EOCD) — a single record at the very end that points to the central directory

A ZIP parser starts by seeking to the end of the file and scanning backward to locate the EOCD signature (50 4B 05 06). It then reads the central directory offset from the EOCD, jumps to the central directory, and from there locates every file in the archive. This end-anchored design has three practical consequences. First, files can be appended to a ZIP without rewriting the existing entries: the tool adds new local entries, then writes a new central directory and EOCD after them. Second, a ZIP can be embedded inside another file (such as a self-extracting EXE) because the parser ignores everything before the ZIP data and reads from the end. Third, if the EOCD is lost to truncation or corruption, a standard parser can no longer open the archive even though the compressed file data may be intact; recovery tools have to rebuild the directory by scanning for local file header signatures.
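
The backward scan can be sketched in Python. This is a minimal illustration rather than a full parser: the function name find_eocd and the keys of the returned dictionary are invented for this example, the test archive is built in memory with the standard zipfile module, and ZIP64 and multi-disk archives are ignored.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # 50 4B 05 06
EOCD_FIXED_SIZE = 22            # size of the record without its trailing comment
MAX_COMMENT_SIZE = 0xFFFF       # the comment length is a 16-bit field


def find_eocd(data: bytes) -> dict:
    """Locate the EOCD by searching backward from the end of the archive."""
    # The record cannot start earlier than 22 + 65535 bytes before EOF.
    search_start = max(0, len(data) - EOCD_FIXED_SIZE - MAX_COMMENT_SIZE)
    pos = data.rfind(EOCD_SIGNATURE, search_start)
    if pos < 0:
        raise ValueError("no End of Central Directory record found")
    (_sig, _disk, _cd_disk, _entries_on_disk, total_entries,
     cd_size, cd_offset, _comment_len) = struct.unpack_from("<4sHHHHIIH", data, pos)
    return {
        "eocd_offset": pos,
        "total_entries": total_entries,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
    }


# Build a small two-file archive in memory to exercise the scan.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("b.txt", "world")
archive = buffer.getvalue()

eocd = find_eocd(archive)
# The bytes at the recorded offset should be a central directory signature.
first_cd_signature = archive[eocd["cd_offset"]:eocd["cd_offset"] + 4]
print(eocd, first_cd_signature)
```

A production parser would also check that the comment length stored in the record is consistent with the end of the file, because the four signature bytes can legitimately appear inside an archive comment.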

DOCX, XLSX, EPUB, APK — They Are All ZIP Files

One of the least-known facts in computing: many common file formats are simply ZIP archives with specific internal directory structures. Rename a .docx file to .zip and unzip it — inside you will find XML files describing document content, styles, relationships, and embedded media. The same applies to:

  • DOCX / XLSX / PPTX — Microsoft Office Open XML (OOXML), standardized as ECMA-376 and ISO/IEC 29500
  • ODT / ODS / ODP — OpenDocument Format, standardized as ISO/IEC 26300
  • EPUB — E-book container defined by the W3C
  • JAR / WAR / EAR — Java archive formats
  • APK — Android application packages
  • XPI — Firefox browser extensions
  • KMZ — Google Earth placemarks
  • CBZ — Comic book archives

ISO 21320-1:2015 formalized this practice by defining a restricted ZIP profile for document containers. The ISO standard permits only Store (method 0) and DEFLATE (method 8) compression, forbids encryption, forbids multi-disk spanning, and normatively references APPNOTE version 6.3.3. Any file format that claims ISO 21320-1 conformance must be a valid ZIP file meeting these restrictions.
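
Two of those restrictions (compression method and encryption) are easy to check with Python's zipfile module. The sketch below is only a partial check under stated assumptions: profile_violations is a hypothetical helper name, it inspects just the method and the encryption flag, and passing it does not establish ISO 21320-1 conformance.

```python
import io
import zipfile

ALLOWED_METHODS = {zipfile.ZIP_STORED, zipfile.ZIP_DEFLATED}  # methods 0 and 8
ENCRYPTED_FLAG = 0x0001  # general purpose bit 0


def profile_violations(entries) -> list:
    """Return readable problems for an iterable of zipfile.ZipInfo objects."""
    problems = []
    for info in entries:
        if info.compress_type not in ALLOWED_METHODS:
            problems.append(f"{info.filename}: compression method {info.compress_type}")
        if info.flag_bits & ENCRYPTED_FLAG:
            problems.append(f"{info.filename}: encrypted entry")
    return problems


# An archive that uses only Store and DEFLATE should produce no findings.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
    zf.writestr("content.xml", "<doc/>")
with zipfile.ZipFile(buffer) as zf:
    clean_report = profile_violations(zf.infolist())

# Hand-built entry describing a bzip2-compressed member (method 12).
odd_entry = zipfile.ZipInfo("legacy.dat")
odd_entry.compress_type = zipfile.ZIP_BZIP2
odd_report = profile_violations([odd_entry])

print(clean_report, odd_report)
```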

DEFLATE: The Algorithm Inside

DEFLATE, the default and overwhelmingly dominant compression method in ZIP files, combines LZ77 sliding-window matching with Huffman coding. The compressor scans input data for repeated byte sequences within a 32 KB window, replacing matches with back-references (distance, length pairs), then encodes the result using variable-length Huffman codes. DEFLATE is also the algorithm inside gzip, PNG, and HTTP content encoding — making it arguably the most widely deployed compression algorithm in history.

DEFLATE achieves typical compression ratios of 2:1 to 5:1 on text, source code, and structured data. Already-compressed data like JPEG images, MP4 video, or MP3 audio compresses poorly or not at all — this is why zipping a folder of photos barely reduces its size. ZIP stores each file independently, so tools can choose to store incompressible files uncompressed (method 0) while compressing text files with DEFLATE.
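
The contrast between redundant and already-compressed input can be reproduced with Python's zlib module, which can emit the same raw DEFLATE streams that ZIP stores. This is an illustrative sketch: the helper names are invented here, random bytes stand in for already-compressed media, and the artificially repetitive sample text compresses far better than real-world text would.

```python
import os
import zlib


def raw_deflate(data: bytes, level: int = 9) -> bytes:
    # wbits=-15 requests a raw DEFLATE stream (32 KB window, no zlib header or trailer).
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def raw_inflate(blob: bytes) -> bytes:
    return zlib.decompress(blob, wbits=-15)


text = b"the quick brown fox jumps over the lazy dog\n" * 1000
noise = os.urandom(len(text))  # stands in for JPEG/MP4/MP3-like data

text_packed = raw_deflate(text)
noise_packed = raw_deflate(noise)

text_ratio = len(text) / len(text_packed)
noise_ratio = len(noise) / len(noise_packed)

print(f"repetitive text: {text_ratio:.1f}:1")
print(f"random bytes:    {noise_ratio:.3f}:1")
```

Decompressing either result with raw_inflate returns the exact input bytes, which is what lossless means in practice.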

ZIP64: Breaking the 4 GB Barrier

The original ZIP specification stores file sizes and archive offsets in 32-bit fields, limiting individual files and total archive size to approximately 4 GB (2^32 bytes). ZIP64 extensions, introduced in APPNOTE version 4.5 in 2001, use extra fields with header ID 0x0001 to store 64-bit sizes. When any size field would overflow 32 bits, the local header writes 0xFFFFFFFF as a sentinel value and stores the actual 64-bit size in the ZIP64 extended information extra field.
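
The sentinel mechanism can be sketched as follows. This is a simplified reader, assuming the extra field contains a sequence of (header ID, size, data) records and that the ZIP64 record lists the uncompressed size before the compressed size, each present only when its 32-bit counterpart holds the sentinel. The function name resolve_sizes and the sample records are hypothetical.

```python
import struct

ZIP64_HEADER_ID = 0x0001
SENTINEL = 0xFFFFFFFF


def resolve_sizes(uncompressed: int, compressed: int, extra: bytes) -> tuple:
    """Replace sentinel 32-bit sizes with the 64-bit values from the extra field."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        body = extra[pos + 4:pos + 4 + size]
        pos += 4 + size
        if header_id != ZIP64_HEADER_ID:
            continue  # some other extra record; skip it
        cursor = 0
        if uncompressed == SENTINEL:
            (uncompressed,) = struct.unpack_from("<Q", body, cursor)
            cursor += 8
        if compressed == SENTINEL:
            (compressed,) = struct.unpack_from("<Q", body, cursor)
        break
    return uncompressed, compressed


five_gib = 5 * 1024 ** 3
# An arbitrary unrelated record (made-up ID 0x9999) followed by a ZIP64 record.
unrelated = struct.pack("<HH4s", 0x9999, 4, b"\x00\x00\x00\x00")
zip64_record = struct.pack("<HHQQ", ZIP64_HEADER_ID, 16, five_gib, 1234)

sizes = resolve_sizes(SENTINEL, SENTINEL, unrelated + zip64_record)
print(sizes)
```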

All modern tools, including 7-Zip, WinRAR, Windows Explorer (since Windows 8), macOS Archive Utility, and Python's zipfile module (where ZIP64 has been enabled by default since Python 3.4), handle ZIP64 transparently. Legacy tools from before 2001 silently truncate or refuse ZIP64 archives.

Security: ZIP Bombs, Zip Slip, and Weak Encryption

ZIP's ubiquity makes it a frequent attack vector. Three vulnerabilities stand out:

ZIP bomb (42.zip): A 42 KB file that decompresses to 4.5 petabytes through nested layers of 16 ZIP files each. Non-recursive variants use overlapping file references within the central directory to bypass depth-checking defenses, achieving massive expansion ratios in a single flat archive. Any extraction tool without decompressed-size limits is vulnerable.
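
A decompressed-size limit can be approximated before extraction by inspecting the sizes declared in the central directory. The sketch below uses Python's zipfile module; check_declared_sizes and make_archive are hypothetical helper names and the thresholds are arbitrary. Declared sizes can be forged, so a careful extractor would additionally count bytes while decompressing.

```python
import io
import zipfile


def check_declared_sizes(zf, max_total_bytes, max_ratio):
    """Raise ValueError if declared sizes exceed a budget or look implausible."""
    total = 0
    for info in zf.infolist():
        total += info.file_size
        if total > max_total_bytes:
            raise ValueError("declared uncompressed size exceeds the limit")
        if info.compress_size > 0 and info.file_size / info.compress_size > max_ratio:
            raise ValueError(f"{info.filename}: suspicious compression ratio")
    return total


def make_archive(name, payload):
    # Build a one-member DEFLATE archive in memory and reopen it for reading.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, payload)
    buffer.seek(0)
    return zipfile.ZipFile(buffer)


benign_total = check_declared_sizes(make_archive("note.txt", b"hello"), 1024, 100)

try:
    # One mebibyte of zero bytes deflates to roughly a kilobyte.
    check_declared_sizes(make_archive("zeros.bin", bytes(1024 * 1024)), 10 * 1024 * 1024, 100)
    bomb_rejected = False
except ValueError:
    bomb_rejected = True

print(benign_total, bomb_rejected)
```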

Zip Slip (CVE-2018-1002200 and related CVEs): A path traversal attack where archived filenames contain ../ sequences. During extraction, vulnerable libraries write files outside the intended directory, potentially overwriting executables, configuration files, or SSH keys. Discovered by Snyk in 2018, this vulnerability affected Java, .NET, Ruby, Go, and Python extraction libraries.
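
The usual defence is to resolve each member's destination path and refuse anything that lands outside the extraction directory. A minimal sketch, assuming a hypothetical helper named safe_target that an extractor would call before writing each file:

```python
import os
import tempfile


def safe_target(dest_dir: str, member_name: str) -> str:
    """Return the absolute output path, or raise if it escapes dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    target = os.path.realpath(os.path.join(dest_root, member_name))
    # commonpath equals dest_root only when target is inside (or is) dest_root.
    if os.path.commonpath([dest_root, target]) != dest_root:
        raise ValueError(f"blocked path traversal: {member_name!r}")
    return target


with tempfile.TemporaryDirectory() as dest:
    ok_path = safe_target(dest, "docs/readme.txt")
    try:
        safe_target(dest, "../../etc/passwd")
        traversal_blocked = False
    except ValueError:
        traversal_blocked = True

print(ok_path, traversal_blocked)
```

Because the check uses realpath, a member name that is an absolute path is rejected by the same comparison, and symbolic links that already exist on disk are resolved before the containment test.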

ZipCrypto weakness: The traditional PKWARE encryption scheme (ZipCrypto) is cryptographically broken. It uses a stream cipher seeded from a 96-bit internal state derived from the password, and is vulnerable to known-plaintext attacks. Tools like pkcrack can recover the internal keys, without ever learning the password, when about 12 bytes of plaintext from one encrypted file are known. AES-256 encryption (added in APPNOTE version 5.2, implemented by WinZip and 7-Zip) is secure but is not supported by every extraction tool.

Compression Comparison

ZIP compresses each file independently, which enables random access but limits compression ratio. Solid-archive formats like 7z and RAR compress all files as a single stream, exploiting cross-file redundancy for 30-70% better ratios on collections of similar files (such as source code repositories). The tradeoff: extracting a single file from a solid archive requires decompressing everything before it. ZIP wins on compatibility and random access; 7z wins on compression ratio; TAR.GZ wins on Unix metadata preservation.
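
The effect of solid compression can be isolated with a small experiment. The sketch below uses zlib's DEFLATE for both cases so that only the grouping changes; real 7z archives use LZMA2, so the absolute numbers are not representative, and the sample files are synthetic.

```python
import zlib

# Fifty small, similar source files (synthetic stand-ins for a code repository).
files = [
    (
        f"module_{i}.py",
        (
            "import os\nimport sys\n\n\n"
            f"def handler_{i}(request):\n"
            "    return request.path.upper()\n"
        ).encode(),
    )
    for i in range(50)
]

# ZIP-style: every file is compressed on its own.
per_file_total = sum(len(zlib.compress(body, 9)) for _, body in files)

# Solid-style: all files are compressed as one continuous stream.
solid_total = len(zlib.compress(b"".join(body for _, body in files), 9))

print(f"per-file: {per_file_total} bytes, solid: {solid_total} bytes")
```

The solid stream is smaller because each file can reference byte sequences from the files before it, which is also why a single member cannot be extracted without decompressing what precedes it.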

.ZIP compared to alternatives

Formats: .ZIP vs .7Z
Criteria: Compression ratio vs compatibility
Winner: 7Z
7z's LZMA2 algorithm achieves 30-70% better compression than ZIP's DEFLATE on text and code, especially in solid mode where cross-file redundancy is exploited. ZIP wins on compatibility: every OS opens it natively.

Formats: .ZIP vs .RAR
Criteria: Platform support and recovery
Winner: Draw
ZIP is natively supported by every major OS without additional software. RAR requires third-party tools but offers recovery records that can repair partially corrupted archives, a feature ZIP entirely lacks.

Formats: .ZIP vs .TAR.GZ
Criteria: Random access vs Unix metadata
Winner: Draw
ZIP stores each file independently, enabling extraction of any single file without processing the entire archive. TAR.GZ preserves Unix permissions, symlinks, and ownership that ZIP discards, making it the standard for Linux source distribution.

Technical reference

MIME Type: application/zip
Magic Bytes: 50 4B 03 04 (PK signature, Phil Katz's initials)
Developer: Phil Katz / PKWARE
Year Introduced: 1989
Open Standard: Yes

Hex dump at offset 00000000: 50 4B 03 04  PK..

Binary Structure

ZIP uses a three-section layout: local file entries, central directory, and End of Central Directory Record (EOCD). Each local file entry starts with signature 50 4B 03 04, followed by a 26-byte fixed header containing version, flags, compression method, timestamps, CRC-32, compressed and uncompressed sizes, and filename/extra field lengths. The compressed file data immediately follows the variable-length filename and extra field. The central directory at the end of the file mirrors each local entry with signature 50 4B 01 02, adding file attributes, comments, and the byte offset to each local header. The EOCD record (50 4B 05 06) stores the total entry count and the byte offset to the central directory start. Parsers locate the EOCD by scanning backward from the file end — this end-anchored design enables appending without rewriting but means truncation of even a few bytes at the file end destroys the entire archive's navigability.

| Offset | Length | Field | Example | Description |
| --- | --- | --- | --- | --- |
| 0x00 | 4 bytes | Local File Header Signature | 50 4B 03 04 | PK (Phil Katz's initials). Marks the start of each local file entry. |
| 0x04 | 2 bytes | Version Needed to Extract | 14 00 (v2.0) | Minimum ZIP spec version required. 0x0014 = 2.0 (DEFLATE). 0x002D = 4.5 (ZIP64). |
| 0x06 | 2 bytes | General Purpose Bit Flag | 00 00 | Bit 0: encrypted. Bit 3: data descriptor follows. Bit 11: UTF-8 filenames. |
| 0x08 | 2 bytes | Compression Method | 08 00 (DEFLATE) | 0 = Store (none). 8 = DEFLATE. 14 = LZMA. 93 = Zstandard. |
| 0x0E | 4 bytes | CRC-32 | 48 C9 11 48 | CRC-32 checksum of uncompressed file data. Verified on extraction to detect corruption. |
| 0x12 | 4 bytes | Compressed Size | 0C 00 00 00 | Compressed data size in bytes. 0xFFFFFFFF triggers ZIP64 extended field. |
| 0x16 | 4 bytes | Uncompressed Size | 0A 00 00 00 | Original file size before compression. 0xFFFFFFFF triggers ZIP64. |
| EOF-22 | 4 bytes | EOCD Signature | 50 4B 05 06 | End of Central Directory Record. Parsers scan backward from EOF to find this. Contains entry count and central directory offset. |
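
The local header fields can be decoded with Python's struct module. This is a sketch under simple assumptions: the archive is generated in memory by the standard zipfile module, the first entry starts at offset 0, and parse_local_header with its dictionary keys is a hypothetical helper, not part of any library.

```python
import io
import struct
import zipfile
import zlib

# Signature plus the 26-byte fixed part: 30 bytes, little-endian.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def parse_local_header(data: bytes, offset: int = 0) -> dict:
    (signature, version, flags, method, _mod_time, _mod_date, crc32,
     compressed_size, uncompressed_size, name_len, extra_len) = (
        LOCAL_HEADER.unpack_from(data, offset))
    if signature != b"PK\x03\x04":
        raise ValueError("not a local file header")
    name_start = offset + LOCAL_HEADER.size
    raw_name = data[name_start:name_start + name_len]
    # Bit 11 of the flags marks a UTF-8 filename; otherwise assume CP437.
    name = raw_name.decode("utf-8" if flags & 0x0800 else "cp437")
    return {
        "version": version,
        "flags": flags,
        "method": method,
        "crc32": crc32,
        "compressed_size": compressed_size,
        "uncompressed_size": uncompressed_size,
        "name": name,
        "data_start": name_start + name_len + extra_len,
    }


payload = b"hello hello hello hello"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("hello.txt", payload)
archive = buffer.getvalue()

header = parse_local_header(archive)
start = header["data_start"]
raw = archive[start:start + header["compressed_size"]]
restored = zlib.decompress(raw, wbits=-15)  # ZIP stores raw DEFLATE streams
print(header, restored)
```

Reading sizes from the local header only works when bit 3 of the flags is clear; archives written to non-seekable streams put the CRC and sizes in a data descriptor after the file data, which is one reason parsers prefer the central directory.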
  • 1989: Phil Katz releases PKZIP 1.0 and publishes the ZIP specification (APPNOTE.TXT) as an open standard after losing a lawsuit over his PKARC utility
  • 1993: DEFLATE compression (method 8) added in PKZIP 2.0, replacing the original Shrink/Reduce/Implode methods and becoming the dominant compression algorithm
  • 1993: Info-ZIP project provides free cross-platform zip/unzip tools, spreading ZIP support to Unix and mainframe systems
  • 2000: Phil Katz dies on April 14 at age 37. His creation lives on as the most widely used archive format in computing
  • 2001: ZIP64 extensions published in APPNOTE v4.5, breaking the 4 GB file and archive size limitation with 64-bit size fields
  • 2003: WinZip implements AES-256 encryption, providing a secure alternative to the broken ZipCrypto scheme
  • 2006: Microsoft Office 2007 adopts ZIP-based OOXML format (.docx, .xlsx, .pptx), making every Office document a ZIP archive
  • 2015: ISO 21320-1 published, defining a restricted ZIP profile for document containers used by OOXML, ODF, and EPUB
  • 2020: APPNOTE v6.3.9 adds Zstandard (method 93) as a supported compression algorithm, modernizing ZIP compression options

Create a ZIP archive recursively
zip -r archive.zip /path/to/folder/

-r recurses into subdirectories, including all nested files and folders in the archive.

Extract ZIP to a specific directory
unzip archive.zip -d /output/path/

-d specifies the destination directory. Without it, files extract to the current working directory.

Test ZIP integrity without extracting
unzip -t archive.zip

Tests each file's CRC-32 checksum against the stored value without writing any data to disk.
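
A comparable check is available from Python through ZipFile.testzip(), which reads every member and returns the name of the first one whose CRC-32 does not match, or None. The sketch below builds a stored (uncompressed) archive in memory and flips one byte of the file data to simulate corruption; first_corrupt_member is a hypothetical wrapper name.

```python
import io
import zipfile


def first_corrupt_member(archive_bytes: bytes):
    """Return the first member name that fails its CRC check, or None."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.testzip()


payload = b"integrity check payload"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("data.bin", payload)
good = buffer.getvalue()

# With method 0 the payload appears verbatim, so it can be located and damaged.
damaged = bytearray(good)
damaged[good.find(payload)] ^= 0xFF

good_result = first_corrupt_member(good)
bad_result = first_corrupt_member(bytes(damaged))
print(good_result, bad_result)
```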

Create AES-256 encrypted ZIP with 7-Zip
7z a -tzip -mem=AES256 -p secure.zip files/

-tzip sets output format to ZIP. -mem=AES256 selects AES-256 encryption instead of the weak ZipCrypto. -p prompts for the encryption password.

List ZIP contents with Python (stdlib)
python3 -m zipfile -l archive.zip

Python's built-in zipfile module lists each entry's name, modification time, and uncompressed size. No external dependencies required.

Risk level: HIGH

Attack Vectors

  • ZIP Bomb (42.zip)
  • Zip Slip (path traversal)
  • ZipCrypto weak encryption
  • Symlink attacks
  • Malicious payload delivery

Mitigation: Extract to a new empty directory, never to system paths. Use AES-256 encryption (7-Zip -mem=AES256), never ZipCrypto. Set decompressed-size limits to prevent ZIP bomb attacks. Validate all filenames for path traversal before extraction. Scan contents with antivirus before opening extracted files.

7-Zip tool
Open-source file archiver supporting ZIP, 7z, RAR, and 20+ formats with AES-256 encryption
Info-ZIP tool
Classic open-source zip/unzip CLI tools — the de facto standard on Unix systems
WinRAR tool
Commercial archive manager with ZIP, RAR, and 7z support
Python zipfile library
Python standard library module for reading, writing, and testing ZIP archives
fflate library
High-performance JavaScript ZIP compression/decompression library for browser and Node.js
APPNOTE.TXT specification
The canonical ZIP format specification maintained by PKWARE since 1989, currently at version 6.3.10
The Unarchiver tool
Free macOS archive utility that handles edge cases like non-UTF-8 filenames and legacy encodings