Apache Parquet
Apache Parquet is an open-source columnar storage format designed for efficient analytical queries at scale. Created by Twitter and Cloudera and released in 2013, it stores data column by column using the record shredding and assembly model from Google's Dremel paper, and supports Snappy, Gzip, LZ4, and Zstd compression.
Parquet is a columnar binary format that requires specialized deserialization and cannot be previewed as plain text in the browser.
Frequently asked questions
How do I open or inspect a Parquet file?
Parquet is a binary format and cannot be opened in a text editor. Use DuckDB (`SELECT * FROM 'file.parquet' LIMIT 10`), Python with PyArrow (`pq.read_table`), or the Parquet Viewer VS Code extension for visual inspection.
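Before handing a file to a full reader, it can be useful to confirm it really is Parquet. A minimal stdlib-only sketch (the `is_parquet` helper is hypothetical, not part of any library) checks the `PAR1` magic at both ends of the file, which every valid Parquet file carries:

```python
import io

PARQUET_MAGIC = b"PAR1"  # bytes 50 41 52 31

def is_parquet(stream) -> bool:
    """Cheap sniff: a Parquet file starts and ends with the PAR1 magic."""
    stream.seek(0)
    head = stream.read(4)
    stream.seek(-4, io.SEEK_END)
    tail = stream.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC

# A minimal byte layout that passes the sniff (not a valid full file):
fake = io.BytesIO(b"PAR1" + b"\x00" * 16 + b"PAR1")
print(is_parquet(fake))  # True
```

A real inspection still needs DuckDB or PyArrow; this only rules out obviously mislabeled files.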
Why is Parquet better than CSV for analytics?
Parquet stores data by column, so queries reading a few columns from a wide table skip irrelevant data entirely. Combined with columnar compression (5-10x smaller than CSV) and embedded statistics for predicate pushdown, Parquet queries can be 10-100x faster.
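The column-pruning benefit can be illustrated in plain Python with toy, invented data: the same table stored row-wise forces every row to be touched, while the columnar layout lets an aggregate read only the one column it needs — which is what lets Parquet skip whole column chunks on disk.

```python
# Same toy table stored two ways (illustrative data only).
rows = [("alice", 30, "NYC"), ("bob", 25, "LA"), ("eve", 35, "SF")]
columns = {
    "name": ["alice", "bob", "eve"],
    "age":  [30, 25, 35],
    "city": ["NYC", "LA", "SF"],
}

# Row store: an aggregate over one column still walks every full row.
avg_age_rowwise = sum(r[1] for r in rows) / len(rows)

# Column store: the query touches only 'age'; 'name' and 'city' are
# never read. On disk, this is the data Parquet skips entirely.
avg_age_columnar = sum(columns["age"]) / len(columns["age"])

print(avg_age_rowwise, avg_age_columnar)  # 30.0 30.0
```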
Can I append rows to an existing Parquet file?
Parquet files are immutable once written — the footer metadata must reference all row groups. To add data, write a new Parquet file or use a table format like Delta Lake or Apache Iceberg that manages multiple Parquet files as a single logical table.
What compression should I use with Parquet?
Snappy is the default and offers the best decompression speed. Zstd provides better compression ratios with slightly slower decompression. Gzip is widely compatible but slower. For most analytical workloads, Snappy or Zstd at default level is recommended.
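Snappy and Zstd are not in the Python standard library, but the ratio-versus-speed dial they represent can be sketched with stdlib `zlib` (the DEFLATE codec behind Gzip) at different levels. Low-cardinality column data of the kind Parquet stores per column chunk compresses dramatically at any level:

```python
import zlib

# Repetitive, low-cardinality column data, typical of a Parquet column chunk.
column = ("NYC," * 500 + "LA," * 500).encode()

fast = zlib.compress(column, level=1)   # faster, usually larger
small = zlib.compress(column, level=9)  # slower, usually smaller
print(len(column), len(fast), len(small))
assert len(small) <= len(fast) < len(column)
```

The same tradeoff applies when choosing between Snappy (speed) and Zstd (ratio) in a real Parquet writer.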
What makes .PARQUET distinctive
What is a Parquet file?
Apache Parquet is an open-source columnar storage format designed for efficient data processing at scale. Unlike row-based formats like CSV, Parquet stores data by column, enabling excellent compression and fast analytical queries that only read relevant columns. It is the de facto standard for big data lakes.
Discover the technical details
How to open Parquet files
- DuckDB — `SELECT * FROM 'file.parquet'` for fast SQL queries
- Python pandas — `pd.read_parquet('file.parquet')`
- Apache Spark — Distributed processing
- Parquet Viewer (VS Code extension) — Visual inspection
Technical specifications
| Property | Value |
|---|---|
| Storage | Columnar |
| Compression | Snappy, Gzip, LZ4, Zstd |
| Encoding | Dictionary, RLE, Delta, Bit-packing |
| Schema | Self-describing (embedded schema) |
| Types | Primitive + logical types (decimal, date, timestamp) |
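The dictionary and RLE encodings from the table above can be sketched in a few lines of stdlib Python. This is a simplified illustration, not Parquet's actual wire format (real dictionary pages and RLE/bit-packed runs are byte-level encodings), and the helper names are mine:

```python
from itertools import groupby

def dictionary_encode(values):
    """Map each distinct value to a small integer index (cf. a dictionary page)."""
    dictionary = {}
    indices = []
    for v in values:
        indices.append(dictionary.setdefault(v, len(dictionary)))
    return list(dictionary), indices

def run_length_encode(indices):
    """Collapse runs of repeated indices into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(indices)]

column = ["NYC"] * 4 + ["LA"] * 3 + ["NYC"] * 2
dictionary, indices = dictionary_encode(column)
print(dictionary)                  # ['NYC', 'LA']
print(run_length_encode(indices))  # [(0, 4), (1, 3), (0, 2)]
```

Nine strings collapse to a two-entry dictionary plus three runs, which is why repetitive columns compress so well even before Snappy or Zstd is applied.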
Common use cases
- Data lakes: S3/GCS storage for analytics.
- ETL pipelines: Efficient intermediate data format.
- Machine learning: Feature stores and training datasets.
- Business intelligence: Fast analytical queries.
Technical reference
- MIME type: application/vnd.apache.parquet
- Magic bytes: 50 41 52 31 — PAR1 magic at start and end of file
- Developer: Apache Software Foundation
- Year introduced: 2013
- Open standard: Yes
Binary structure
A Parquet file opens and closes with the 4-byte magic PAR1 (50 41 52 31). Between the magics, the file contains one or more row groups, each holding a horizontal partition of the dataset. Within each row group, data is stored as column chunks — one per column. Each column chunk contains one or more data pages (plain, dictionary, or data page v2) with page-level encoding (dictionary, RLE, delta, bit-packing) and optional page-level compression (Snappy, Gzip, LZ4, Zstd, or Brotli). After the last row group, a file footer encoded in Apache Thrift compact protocol stores the full schema definition, row group metadata (column chunk offsets, sizes, statistics, encoding info), and key-value metadata. The footer length is stored as a 4-byte little-endian integer immediately before the closing PAR1 magic.
| Offset | Length | Field | Example | Description |
|---|---|---|---|---|
| 0x00 | 4 bytes | Header Magic | 50 41 52 31 (PAR1) | Identifies the file as Apache Parquet. Must match the footer magic at EOF. |
| 0x04 | variable | Row Group 1 | varies | First row group containing column chunks. Each column chunk has page headers followed by compressed/encoded page data. |
| varies | variable | Additional Row Groups | varies | Subsequent row groups, each independently readable for parallel processing. |
| footer start | variable | Footer (Thrift) | varies | Thrift compact protocol encoding of FileMetaData: schema, row group metadata, column statistics, key-value pairs. |
| EOF-8 | 4 bytes | Footer Length | varies | Little-endian 32-bit integer specifying the size of the Thrift footer in bytes. |
| EOF-4 | 4 bytes | Footer Magic | 50 41 52 31 (PAR1) | Closing magic that must match the header magic. Readers seeking from EOF use this to locate the footer. |
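The EOF-relative lookup in the table can be exercised on a mock byte layout. The sketch below assembles placeholder bytes in the documented order (the "footer" and "row group" contents are dummies, not real Thrift or page data) and then locates the footer exactly as a reader would, seeking 8 bytes back from EOF:

```python
import struct

MAGIC = b"PAR1"
footer = b"\x15\x00" * 10   # placeholder standing in for Thrift FileMetaData
row_groups = b"\x00" * 32   # placeholder standing in for row-group data

# Assemble the documented layout: magic, row groups, footer, length, magic.
blob = MAGIC + row_groups + footer + struct.pack("<I", len(footer)) + MAGIC

# A reader seeks 8 bytes back from EOF: 4-byte footer length + 4-byte magic.
assert blob[:4] == MAGIC and blob[-4:] == MAGIC
(footer_len,) = struct.unpack("<I", blob[-8:-4])
footer_start = len(blob) - 8 - footer_len
print(footer_len, blob[footer_start:footer_start + footer_len] == footer)  # 20 True
```

Because the footer sits at the end, readers over object stores like S3 can fetch it with one ranged request before deciding which row groups to download.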
Known weaknesses
- Deserialization of complex Thrift-encoded footer metadata can trigger buffer overflows in older Parquet readers
- Malformed column statistics in footer can cause predicate pushdown to return incorrect results silently
- Crafted Parquet files with extreme row group counts or column counts can exhaust reader memory during metadata parsing
Protection: FileDex does not parse, deserialize, or execute Parquet content. Reference page only — no server-side processing.