Apache Parquet (.parquet)

Apache Parquet is an open-source columnar storage format designed for efficient analytical queries at scale. Developed by Twitter and Cloudera in 2013, it stores data column by column, uses the record shredding and assembly model from Google's Dremel paper to represent nested data, and supports Snappy, Gzip, LZ4, and Zstd compression.

Quick facts

  • Data layout: Columnar
  • Content: Structured records
  • Encoding: Dremel-style record shredding
  • Schema: Self-describing (embedded in the file)
  • Introduced: 2013

Parquet is a columnar binary format that requires specialized deserialization, which is not available in the browser's WASM environment.

Common questions

How do I open or inspect a Parquet file?

Parquet is a binary format and cannot be opened in a text editor. Use DuckDB (SELECT * FROM 'file.parquet' LIMIT 10), Python with PyArrow (pq.read_table), or the Parquet Viewer VS Code extension for visual inspection.

Why is Parquet better than CSV for analytics?

Parquet stores data by column, so queries reading a few columns from a wide table skip irrelevant data entirely. Combined with columnar compression (5-10x smaller than CSV) and embedded statistics for predicate pushdown, Parquet queries can be 10-100x faster.

Can I append rows to an existing Parquet file?

Parquet files are immutable once written — the footer metadata must reference all row groups. To add data, write a new Parquet file or use a table format like Delta Lake or Apache Iceberg that manages multiple Parquet files as a single logical table.

What compression should I use with Parquet?

Snappy is the most common default and decompresses very quickly. Zstd provides better compression ratios with slightly slower decompression. Gzip is widely compatible but slower. For most analytical workloads, Snappy or Zstd at the default level is recommended.

What makes .PARQUET special

What is a Parquet file?

Apache Parquet is an open-source columnar storage format designed for efficient data processing at scale. Unlike row-based formats like CSV, Parquet stores data by column, enabling excellent compression and fast analytical queries that only read relevant columns. It is the de facto standard for big data lakes.


How to open Parquet files

  • DuckDB — SELECT * FROM 'file.parquet' for fast SQL queries
  • Python pandas — pd.read_parquet('file.parquet')
  • Apache Spark — Distributed processing
  • Parquet Viewer (VS Code extension) — Visual inspection

Technical specifications

  • Storage: Columnar
  • Compression: Snappy, Gzip, LZ4, Zstd
  • Encoding: Dictionary, RLE, Delta, Bit-packing
  • Schema: Self-describing (embedded schema)
  • Types: Primitive + logical types (decimal, date, timestamp)

Common use cases

  • Data lakes: S3/GCS storage for analytics.
  • ETL pipelines: Efficient intermediate data format.
  • Machine learning: Feature stores and training datasets.
  • Business intelligence: Fast analytical queries.

.PARQUET compared to alternative formats

  • .PARQUET vs .CSV — Query performance — PARQUET wins.
    Parquet's columnar layout allows reading only the columns needed for a query, skipping irrelevant data. CSV requires scanning every row. On wide tables, Parquet can be 10-100x faster.
  • .PARQUET vs .CSV — Storage size — PARQUET wins.
    Parquet with Snappy compression typically stores data 5-10x smaller than equivalent CSV due to columnar compression and encoding schemes that exploit per-column value patterns.
  • .PARQUET vs .AVRO — Analytical queries — PARQUET wins.
    Parquet's columnar storage reads only required columns, while Avro's row-based format must deserialize entire rows. For SELECT on a few columns from wide tables, Parquet is significantly faster.
  • .PARQUET vs .AVRO — Write performance — AVRO wins.
    Avro's row-based format supports faster sequential writes and append operations. Parquet requires buffering a full row group before writing, adding latency for streaming ingestion.

Technical reference

  • MIME Type: application/vnd.apache.parquet
  • Magic Bytes: 50 41 52 31 (PAR1) at start and end of file
  • Developer: Apache Software Foundation
  • Year Introduced: 2013
  • Open Standard: Yes

Binary Structure

A Parquet file opens and closes with the 4-byte magic PAR1 (50 41 52 31). Between the magics, the file contains one or more row groups, each holding a horizontal partition of the dataset. Within each row group, data is stored as column chunks — one per column. Each column chunk contains one or more data pages (plain, dictionary, or data page v2) with page-level encoding (dictionary, RLE, delta, bit-packing) and optional page-level compression (Snappy, Gzip, LZ4, Zstd, or Brotli). After the last row group, a file footer encoded in Apache Thrift compact protocol stores the full schema definition, row group metadata (column chunk offsets, sizes, statistics, encoding info), and key-value metadata. The footer length is stored as a 4-byte little-endian integer immediately before the closing PAR1 magic.

Offset — Length — Field — Description

  • 0x00 — 4 bytes — Header Magic (50 41 52 31, PAR1) — Identifies the file as Apache Parquet. Must match the footer magic at EOF.
  • 0x04 — variable — Row Group 1 — First row group containing column chunks. Each column chunk has page headers followed by compressed/encoded page data.
  • varies — variable — Additional Row Groups — Subsequent row groups, each independently readable for parallel processing.
  • footer start — variable — Footer (Thrift) — Thrift compact protocol encoding of FileMetaData: schema, row group metadata, column statistics, key-value pairs.
  • EOF-8 — 4 bytes — Footer Length — Little-endian 32-bit integer specifying the size of the Thrift footer in bytes.
  • EOF-4 — 4 bytes — Footer Magic (50 41 52 31, PAR1) — Closing magic that must match the header magic. Readers seeking from EOF use this to locate the footer.
Timeline

  • 2013 — Twitter and Cloudera release Apache Parquet, inspired by Google's Dremel paper on columnar storage.
  • 2015 — Parquet graduates to a top-level Apache project; adopted by Apache Spark as its default data format.
  • 2017 — Parquet format version 2.0 adds data page v2 with separate encoding for repetition and definition levels.
  • 2020 — Column encryption (Parquet Modular Encryption) is added to the specification for field-level security.
  • 2022 — DuckDB and Polars adopt Parquet as a primary on-disk format, accelerating adoption outside Hadoop.
  • 2024 — Parquet becomes the de facto standard for data lake storage on S3, GCS, and Azure Blob, with Delta Lake and Iceberg table formats built on top.
Query a Parquet file with DuckDB
duckdb -c "SELECT * FROM 'data.parquet' LIMIT 10;"

DuckDB reads Parquet natively with no import step. Queries execute using columnar vectorized execution for fast results.

Inspect Parquet schema and metadata
duckdb -c "DESCRIBE SELECT * FROM 'data.parquet';"

Shows column names, types, and nullability from the Parquet footer metadata without reading row data.

Convert Parquet to CSV
duckdb -c "COPY (SELECT * FROM 'input.parquet') TO 'output.csv' (HEADER, DELIMITER ',');"

Exports all rows and columns from Parquet to a CSV file with header row. DuckDB handles decompression and type conversion.

Read Parquet with Python and PyArrow
python -c "import pyarrow.parquet as pq; t = pq.read_table('data.parquet'); print(t.schema); print(t.to_pandas().head())"

PyArrow reads Parquet files with full schema support and converts to pandas DataFrame for analysis.

Convert CSV to Parquet with DuckDB
duckdb -c "COPY (SELECT * FROM 'input.csv') TO 'output.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);"

Creates a Parquet file from CSV with Zstd compression. DuckDB auto-infers column types from CSV content.

  • PARQUET → CSV (export, lossy): Flatten columnar Parquet data to row-based CSV for spreadsheet analysis, legacy system import, or human-readable inspection.
  • PARQUET → JSON (export, lossless): Convert columnar data to JSON for web APIs, configuration systems, or tools that consume JSON input. Nested Parquet schemas map naturally to JSON objects.
  • PARQUET → NDJSON (export, lossless): Export to NDJSON for streaming pipelines, Elasticsearch bulk indexing, or line-by-line processing. Each Parquet row becomes one JSON line.
Security risk: MEDIUM

Attack Vectors

  • Deserialization of complex Thrift-encoded footer metadata can trigger buffer overflows in older Parquet readers
  • Malformed column statistics in footer can cause predicate pushdown to return incorrect results silently
  • Crafted Parquet files with extreme row group counts or column counts can exhaust reader memory during metadata parsing

Mitigation: FileDex does not parse, deserialize, or execute Parquet content. Reference page only — no server-side processing.

Tools and libraries

  • DuckDB (tool) — Analytical SQL engine with native Parquet reading and columnar execution
  • Apache Spark (tool) — Distributed data processing engine using Parquet as its default storage format
  • PyArrow (library) — Python library for reading and writing Parquet with full schema support
  • Polars (library) — Fast DataFrame library using Parquet as a primary on-disk format
  • parquet-tools (tool) — CLI for inspecting Parquet metadata, schema, row counts, and column stats
  • Delta Lake (library) — ACID table format built on Parquet files with a transaction log