Apache Parquet (.parquet)

Apache Parquet is an open-source columnar storage format designed for efficient analytical queries at scale. Developed by Twitter and Cloudera in 2013, it stores data by column, encodes nested structures using the record-shredding technique from Google's Dremel paper, and supports Snappy, Gzip, LZ4, and Zstd compression.

Format structure
Header: schema
Records: structured data
Data: columnar, Dremel encoding, self-describing, introduced 2013
Not convertible in-browser

Parquet is a columnar binary format requiring specialized deserialization not available in browser WASM.

Frequently asked questions

How do I open or inspect a Parquet file?

Parquet is a binary format and cannot be opened in a text editor. Use DuckDB (SELECT * FROM 'file.parquet' LIMIT 10), Python with PyArrow (pq.read_table), or the Parquet Viewer VS Code extension for visual inspection.

Why is Parquet better than CSV for analytics?

Parquet stores data by column, so queries reading a few columns from a wide table skip irrelevant data entirely. Combined with columnar compression (5-10x smaller than CSV) and embedded statistics for predicate pushdown, Parquet queries can be 10-100x faster.
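The column-skipping advantage can be illustrated with a minimal pure-Python sketch (this is not Parquet's internal representation, just the row-wise vs column-wise idea):

```python
# Illustrative sketch: the same table stored row-wise vs column-wise.
# Summing one column must touch every record in the row layout, but
# only a single list in the columnar layout.
rows = [
    {"id": 1, "name": "a", "price": 9.5},
    {"id": 2, "name": "b", "price": 3.0},
    {"id": 3, "name": "c", "price": 7.5},
]

columns = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "price": [9.5, 3.0, 7.5],
}

# Row layout: deserialize every record, then pick out the field.
row_total = sum(r["price"] for r in rows)

# Columnar layout: read one column; "id" and "name" are never touched.
col_total = sum(columns["price"])

assert row_total == col_total == 20.0
```

In a real Parquet file each column chunk also carries min/max statistics, which is what lets engines skip entire row groups during predicate pushdown.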

Can I append rows to an existing Parquet file?

Parquet files are immutable once written — the footer metadata must reference all row groups. To add data, write a new Parquet file or use a table format like Delta Lake or Apache Iceberg that manages multiple Parquet files as a single logical table.

What compression should I use with Parquet?

Snappy is the default and offers the best decompression speed. Zstd provides better compression ratios with slightly slower decompression. Gzip is widely compatible but slower. For most analytical workloads, Snappy or Zstd at default level is recommended.
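Snappy and Zstd are not in the Python standard library, but the underlying trade-off (compression speed vs output size) can be sketched with stdlib zlib, the DEFLATE algorithm behind Gzip:

```python
# Illustrative only: zlib stands in for Parquet's codecs to show the
# same knob they expose. Higher levels trade compression speed for a
# smaller output; decompression is lossless either way.
import zlib

data = b"price,quantity\n" * 10_000  # repetitive, like a real column

fast = zlib.compress(data, level=1)   # cheap, larger output
small = zlib.compress(data, level=9)  # slower, smaller output

assert len(small) <= len(fast) < len(data)
assert zlib.decompress(small) == data
```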

What distinguishes .PARQUET

What is a Parquet file?

Apache Parquet is an open-source columnar storage format designed for efficient data processing at scale. Unlike row-based formats like CSV, Parquet stores data by column, enabling excellent compression and fast analytical queries that only read relevant columns. It is the de facto standard for big data lakes.

Discover the technical details

How to open Parquet files

  • DuckDB — SELECT * FROM 'file.parquet' for fast SQL queries
  • Python pandas — pd.read_parquet('file.parquet')
  • Apache Spark — Distributed processing
  • Parquet Viewer (VS Code extension) — Visual inspection

Technical specifications

Property      Value
Storage       Columnar
Compression   Snappy, Gzip, LZ4, Zstd
Encoding      Dictionary, RLE, Delta, Bit-packing
Schema        Self-describing (embedded schema)
Types         Primitive + logical types (decimal, date, timestamp)
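Two of the encodings listed above, dictionary encoding and run-length encoding (RLE), can be sketched in a few lines of Python. This is a conceptual sketch, not Parquet's exact bit-level layout:

```python
# Dictionary encoding replaces repeated values with small integer
# indexes; run-length encoding then collapses runs of the same index
# into (index, count) pairs. Low-cardinality columns shrink dramatically.
def dictionary_encode(values):
    dictionary = []
    index_of = {}
    indexes = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indexes.append(index_of[v])
    return dictionary, indexes

def rle_encode(indexes):
    runs = []
    for i in indexes:
        if runs and runs[-1][0] == i:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return runs

column = ["US", "US", "US", "DE", "DE", "US"]
dictionary, indexes = dictionary_encode(column)
runs = rle_encode(indexes)

assert dictionary == ["US", "DE"]
assert indexes == [0, 0, 0, 1, 1, 0]
assert runs == [[0, 3], [1, 2], [0, 1]]
```

Real Parquet additionally bit-packs the indexes, so a two-value column needs only one bit per row before RLE even applies.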

Common use cases

  • Data lakes: S3/GCS storage for analytics.
  • ETL pipelines: Efficient intermediate data format.
  • Machine learning: Feature stores and training datasets.
  • Business intelligence: Fast analytical queries.

Technical reference

MIME type
application/vnd.apache.parquet
Magic bytes
50 41 52 31 (PAR1) at start and end of file.
Developer
Apache Software Foundation
Year introduced
2013
Open standard
Yes

Binary structure

A Parquet file opens and closes with the 4-byte magic PAR1 (50 41 52 31). Between the magics, the file contains one or more row groups, each holding a horizontal partition of the dataset. Within each row group, data is stored as column chunks — one per column. Each column chunk contains one or more data pages (plain, dictionary, or data page v2) with page-level encoding (dictionary, RLE, delta, bit-packing) and optional page-level compression (Snappy, Gzip, LZ4, Zstd, or Brotli). After the last row group, a file footer encoded in Apache Thrift compact protocol stores the full schema definition, row group metadata (column chunk offsets, sizes, statistics, encoding info), and key-value metadata. The footer length is stored as a 4-byte little-endian integer immediately before the closing PAR1 magic.

Offset | Length | Field | Example | Description
0x00 4 bytes Header Magic 50 41 52 31 (PAR1) Identifies the file as Apache Parquet. Must match the footer magic at EOF.
0x04 variable Row Group 1 varies First row group containing column chunks. Each column chunk has page headers followed by compressed/encoded page data.
varies variable Additional Row Groups varies Subsequent row groups, each independently readable for parallel processing.
footer start variable Footer (Thrift) varies Thrift compact protocol encoding of FileMetaData: schema, row group metadata, column statistics, key-value pairs.
EOF-8 4 bytes Footer Length varies Little-endian 32-bit integer specifying the size of the Thrift footer in bytes.
EOF-4 4 bytes Footer Magic 50 41 52 31 (PAR1) Closing magic that must match the header magic. Readers seeking from EOF use this to locate the footer.
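A minimal stdlib sketch of the layout above, using a synthetic byte string: a real file carries Thrift-encoded FileMetaData where the b"fake-footer" stand-in sits here. The footer length at EOF-8 is what lets readers seek straight to the metadata:

```python
# Synthetic (not schema-valid) Parquet container: magic, body, footer,
# 4-byte little-endian footer length, closing magic.
import struct

footer = b"fake-footer"    # stand-in for Thrift FileMetaData
body = b"row-group-bytes"  # stand-in for column chunk data
blob = b"PAR1" + body + footer + struct.pack("<I", len(footer)) + b"PAR1"

# A reader checks both magics, then walks back from EOF to the footer.
assert blob[:4] == b"PAR1" and blob[-4:] == b"PAR1"
(footer_len,) = struct.unpack("<I", blob[-8:-4])
recovered = blob[-8 - footer_len : -8]

assert footer_len == len(footer)
assert recovered == footer
```

Because all metadata lives at the end, a reader needs only two small reads (the last 8 bytes, then the footer) before it can plan which row groups and column chunks to fetch.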
2013: Twitter and Cloudera release Apache Parquet, inspired by Google's Dremel paper on columnar storage.
2015: Parquet graduates to top-level Apache project; adopted by Apache Spark as its default data format.
2017: Parquet format version 2.0 adds data page v2 with separate encoding for repetition and definition levels.
2020: Column encryption (Parquet Modular Encryption) is added to the specification for field-level security.
2022: DuckDB and Polars adopt Parquet as their primary on-disk format, accelerating adoption outside Hadoop.
2024: Parquet becomes the de facto standard for data lake storage on S3, GCS, and Azure Blob, with Delta Lake and Iceberg table formats built on top.
Query a Parquet file with DuckDB
duckdb -c "SELECT * FROM 'data.parquet' LIMIT 10;"

DuckDB reads Parquet natively with no import step. Queries execute using columnar vectorized execution for fast results.

Inspect Parquet schema and metadata
duckdb -c "DESCRIBE SELECT * FROM 'data.parquet';"

Shows column names, types, and nullability from the Parquet footer metadata without reading row data.

Convert Parquet to CSV
duckdb -c "COPY (SELECT * FROM 'input.parquet') TO 'output.csv' (HEADER, DELIMITER ',');"

Exports all rows and columns from Parquet to a CSV file with header row. DuckDB handles decompression and type conversion.

Read Parquet with Python and PyArrow
python -c "import pyarrow.parquet as pq; t = pq.read_table('data.parquet'); print(t.schema); print(t.to_pandas().head())"

PyArrow reads Parquet files with full schema support and converts to pandas DataFrame for analysis.

Convert CSV to Parquet with DuckDB
duckdb -c "COPY (SELECT * FROM 'input.csv') TO 'output.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);"

Creates a Parquet file from CSV with Zstd compression. DuckDB auto-infers column types from CSV content.

  • PARQUET → CSV (export, lossy): Flatten columnar Parquet data to row-based CSV for spreadsheet analysis, legacy system import, or human-readable inspection.
  • PARQUET → JSON (export, lossless): Convert columnar data to JSON for web APIs, configuration systems, or tools that consume JSON input. Nested Parquet schemas map naturally to JSON objects.
  • PARQUET → NDJSON (export, lossless): Export to NDJSON for streaming pipelines, Elasticsearch bulk indexing, or line-by-line processing. Each Parquet row becomes one JSON line.
Security risk: medium

Weaknesses

  • Deserialization of complex Thrift-encoded footer metadata can trigger buffer overflows in older Parquet readers
  • Malformed column statistics in footer can cause predicate pushdown to return incorrect results silently
  • Crafted Parquet files with extreme row group counts or column counts can exhaust reader memory during metadata parsing

Mitigation: FileDex does not parse, deserialize, or execute Parquet content. Reference page only — no server-side processing.

DuckDB (tool)
Analytical SQL engine with native Parquet reading and columnar execution
Apache Spark (tool)
Distributed data processing engine using Parquet as default storage format
PyArrow (library)
Python library for reading and writing Parquet with full schema support
Polars (library)
Fast DataFrame library using Parquet as primary on-disk format
parquet-tools (tool)
CLI for inspecting Parquet metadata, schema, row counts, and column stats
Delta Lake (library)
ACID table format built on Parquet files with transaction log