Apache Parquet (.parquet)

Apache Parquet is an open-source columnar storage format designed for efficient analytical queries at scale. Developed by Twitter and Cloudera in 2013, it stores data by column, encodes nested structures using the record-shredding technique from Google's Dremel paper, and supports Snappy, Gzip, LZ4, and Zstd compression.

Format structure
Header: schema
Records: structured data
Data: columnar, Dremel encoding, self-describing, introduced 2013
Not convertible in-browser

Parquet is a columnar binary format requiring specialized deserialization not available in browser WASM.

Frequently asked questions

How do I open or inspect a Parquet file?

Parquet is a binary format and cannot be opened in a text editor. Use DuckDB (SELECT * FROM 'file.parquet' LIMIT 10), Python with PyArrow (pq.read_table), or the Parquet Viewer VS Code extension for visual inspection.

Why is Parquet better than CSV for analytics?

Parquet stores data by column, so queries reading a few columns from a wide table skip irrelevant data entirely. Combined with columnar compression (5-10x smaller than CSV) and embedded statistics for predicate pushdown, Parquet queries can be 10-100x faster.
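The column-skipping advantage can be illustrated with a minimal pure-Python sketch (this is not Parquet's internal representation, just the row-wise vs column-wise idea):

```python
# Illustrative sketch: the same table stored row-wise vs column-wise.
# Summing one column must touch every record in the row layout, but
# only a single list in the columnar layout.
rows = [
    {"id": 1, "name": "a", "price": 9.5},
    {"id": 2, "name": "b", "price": 3.0},
    {"id": 3, "name": "c", "price": 7.5},
]

columns = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "price": [9.5, 3.0, 7.5],
}

# Row layout: deserialize every record, then pick out the field.
row_total = sum(r["price"] for r in rows)

# Columnar layout: read one column; "id" and "name" are never touched.
col_total = sum(columns["price"])

assert row_total == col_total == 20.0
```

In a real Parquet file each column chunk also carries min/max statistics, which is what lets engines skip entire row groups during predicate pushdown.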

Can I append rows to an existing Parquet file?

Parquet files are immutable once written — the footer metadata must reference all row groups. To add data, write a new Parquet file or use a table format like Delta Lake or Apache Iceberg that manages multiple Parquet files as a single logical table.

What compression should I use with Parquet?

Snappy is the default and offers the best decompression speed. Zstd provides better compression ratios with slightly slower decompression. Gzip is widely compatible but slower. For most analytical workloads, Snappy or Zstd at default level is recommended.
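Snappy and Zstd are not in the Python standard library, but the underlying trade-off (compression speed vs output size) can be sketched with stdlib zlib, the DEFLATE algorithm behind Gzip:

```python
# Illustrative only: zlib stands in for Parquet's codecs to show the
# same knob they expose. Higher levels trade compression speed for a
# smaller output; decompression is lossless either way.
import zlib

data = b"price,quantity\n" * 10_000  # repetitive, like a real column

fast = zlib.compress(data, level=1)   # cheap, larger output
small = zlib.compress(data, level=9)  # slower, smaller output

assert len(small) <= len(fast) < len(data)
assert zlib.decompress(small) == data
```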

What distinguishes .PARQUET

What is a Parquet file?

Apache Parquet is an open-source columnar storage format designed for efficient data processing at scale. Unlike row-based formats like CSV, Parquet stores data by column, enabling excellent compression and fast analytical queries that only read relevant columns. It is the de facto standard for big data lakes.

Discover the technical details

How to open Parquet files

  • DuckDB — SELECT * FROM 'file.parquet' for fast SQL queries
  • Python pandas — pd.read_parquet('file.parquet')
  • Apache Spark — Distributed processing
  • Parquet Viewer (VS Code extension) — Visual inspection

Technical specifications

Property      Value
Storage       Columnar
Compression   Snappy, Gzip, LZ4, Zstd
Encoding      Dictionary, RLE, Delta, Bit-packing
Schema        Self-describing (embedded schema)
Types         Primitive + logical types (decimal, date, timestamp)
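Two of the encodings listed above, dictionary encoding and run-length encoding (RLE), can be sketched in a few lines of Python. This is a conceptual sketch, not Parquet's exact bit-level layout:

```python
# Dictionary encoding replaces repeated values with small integer
# indexes; run-length encoding then collapses runs of the same index
# into (index, count) pairs. Low-cardinality columns shrink dramatically.
def dictionary_encode(values):
    dictionary = []
    index_of = {}
    indexes = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indexes.append(index_of[v])
    return dictionary, indexes

def rle_encode(indexes):
    runs = []
    for i in indexes:
        if runs and runs[-1][0] == i:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return runs

column = ["US", "US", "US", "DE", "DE", "US"]
dictionary, indexes = dictionary_encode(column)
runs = rle_encode(indexes)

assert dictionary == ["US", "DE"]
assert indexes == [0, 0, 0, 1, 1, 0]
assert runs == [[0, 3], [1, 2], [0, 1]]
```

Real Parquet additionally bit-packs the indexes, so a two-value column needs only one bit per row before RLE even applies.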

Common use cases

  • Data lakes: S3/GCS storage for analytics.
  • ETL pipelines: Efficient intermediate data format.
  • Machine learning: Feature stores and training datasets.
  • Business intelligence: Fast analytical queries.

Technical reference

MIME type
application/vnd.apache.parquet
Magic bytes
50 41 52 31 (PAR1) at start and end of file.
Developer
Apache Software Foundation
Year introduced
2013
Open standard
Yes

Binary structure

A Parquet file opens and closes with the 4-byte magic PAR1 (50 41 52 31). Between the magics, the file contains one or more row groups, each holding a horizontal partition of the dataset. Within each row group, data is stored as column chunks — one per column. Each column chunk contains one or more data pages (plain, dictionary, or data page v2) with page-level encoding (dictionary, RLE, delta, bit-packing) and optional page-level compression (Snappy, Gzip, LZ4, Zstd, or Brotli). After the last row group, a file footer encoded in Apache Thrift compact protocol stores the full schema definition, row group metadata (column chunk offsets, sizes, statistics, encoding info), and key-value metadata. The footer length is stored as a 4-byte little-endian integer immediately before the closing PAR1 magic.

Offset | Length | Field | Example | Description
0x00 4 bytes Header Magic 50 41 52 31 (PAR1) Identifies the file as Apache Parquet. Must match the footer magic at EOF.
0x04 variable Row Group 1 varies First row group containing column chunks. Each column chunk has page headers followed by compressed/encoded page data.
varies variable Additional Row Groups varies Subsequent row groups, each independently readable for parallel processing.
footer start variable Footer (Thrift) varies Thrift compact protocol encoding of FileMetaData: schema, row group metadata, column statistics, key-value pairs.
EOF-8 4 bytes Footer Length varies Little-endian 32-bit integer specifying the size of the Thrift footer in bytes.
EOF-4 4 bytes Footer Magic 50 41 52 31 (PAR1) Closing magic that must match the header magic. Readers seeking from EOF use this to locate the footer.
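A minimal stdlib sketch of the layout above, using a synthetic byte string: a real file carries Thrift-encoded FileMetaData where the b"fake-footer" stand-in sits here. The footer length at EOF-8 is what lets readers seek straight to the metadata:

```python
# Synthetic (not schema-valid) Parquet container: magic, body, footer,
# 4-byte little-endian footer length, closing magic.
import struct

footer = b"fake-footer"    # stand-in for Thrift FileMetaData
body = b"row-group-bytes"  # stand-in for column chunk data
blob = b"PAR1" + body + footer + struct.pack("<I", len(footer)) + b"PAR1"

# A reader checks both magics, then walks back from EOF to the footer.
assert blob[:4] == b"PAR1" and blob[-4:] == b"PAR1"
(footer_len,) = struct.unpack("<I", blob[-8:-4])
recovered = blob[-8 - footer_len : -8]

assert footer_len == len(footer)
assert recovered == footer
```

Because all metadata lives at the end, a reader needs only two small reads (the last 8 bytes, then the footer) before it can plan which row groups and column chunks to fetch.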
2013: Twitter and Cloudera release Apache Parquet, inspired by Google's Dremel paper on columnar storage.
2015: Parquet graduates to top-level Apache project; adopted by Apache Spark as its default data format.
2017: Parquet format version 2.0 adds data page v2 with separate encoding for repetition and definition levels.
2020: Column encryption (Parquet Modular Encryption) is added to the specification for field-level security.
2022: DuckDB and Polars adopt Parquet as their primary on-disk format, accelerating adoption outside Hadoop.
2024: Parquet becomes the de facto standard for data lake storage on S3, GCS, and Azure Blob, with Delta Lake and Iceberg table formats built on top.
Query a Parquet file with DuckDB
duckdb -c "SELECT * FROM 'data.parquet' LIMIT 10;"

DuckDB reads Parquet natively with no import step. Queries execute using columnar vectorized execution for fast results.

Inspect Parquet schema and metadata
duckdb -c "DESCRIBE SELECT * FROM 'data.parquet';"

Shows column names, types, and nullability from the Parquet footer metadata without reading row data.

Convert Parquet to CSV
duckdb -c "COPY (SELECT * FROM 'input.parquet') TO 'output.csv' (HEADER, DELIMITER ',');"

Exports all rows and columns from Parquet to a CSV file with header row. DuckDB handles decompression and type conversion.

Read Parquet with Python and PyArrow
python -c "import pyarrow.parquet as pq; t = pq.read_table('data.parquet'); print(t.schema); print(t.to_pandas().head())"

PyArrow reads Parquet files with full schema support and converts to pandas DataFrame for analysis.

Convert CSV to Parquet with DuckDB
duckdb -c "COPY (SELECT * FROM 'input.csv') TO 'output.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);"

Creates a Parquet file from CSV with Zstd compression. DuckDB auto-infers column types from CSV content.

  • PARQUET → CSV (export, lossy): Flatten columnar Parquet data to row-based CSV for spreadsheet analysis, legacy system import, or human-readable inspection.
  • PARQUET → JSON (export, lossless): Convert columnar data to JSON for web APIs, configuration systems, or tools that consume JSON input. Nested Parquet schemas map naturally to JSON objects.
  • PARQUET → NDJSON (export, lossless): Export to NDJSON for streaming pipelines, Elasticsearch bulk indexing, or line-by-line processing. Each Parquet row becomes one JSON line.
Security risk: medium

Weaknesses

  • Deserialization of complex Thrift-encoded footer metadata can trigger buffer overflows in older Parquet readers
  • Malformed column statistics in footer can cause predicate pushdown to return incorrect results silently
  • Crafted Parquet files with extreme row group counts or column counts can exhaust reader memory during metadata parsing

Mitigation: FileDex does not parse, deserialize, or execute Parquet content. Reference page only — no server-side processing.

DuckDB (tool)
Analytical SQL engine with native Parquet reading and columnar execution
Apache Spark (tool)
Distributed data processing engine using Parquet as default storage format
PyArrow (library)
Python library for reading and writing Parquet with full schema support
Polars (library)
Fast DataFrame library using Parquet as primary on-disk format
parquet-tools (tool)
CLI for inspecting Parquet metadata, schema, row counts, and column stats
Delta Lake (library)
ACID table format built on Parquet files with transaction log