Apache Parquet (.parquet)

Apache Parquet is an open-source columnar storage format designed for efficient analytical queries at scale. Developed by Twitter and Cloudera in 2013, it stores data column by column, uses the record shredding and assembly model from Google's Dremel paper to represent nested data, and supports Snappy, Gzip, LZ4, and Zstd compression.

Quick facts

  • Data layout: Columnar
  • Content: Structured records
  • Encoding: Dremel-style record shredding
  • Schema: Self-describing (embedded in the file)
  • Introduced: 2013

Parquet is a columnar binary format that requires specialized deserialization, which is not available in the browser's WASM environment.

Common questions

How do I open or inspect a Parquet file?

Parquet is a binary format and cannot be opened in a text editor. Use DuckDB (SELECT * FROM 'file.parquet' LIMIT 10), Python with PyArrow (pq.read_table), or the Parquet Viewer VS Code extension for visual inspection.

Why is Parquet better than CSV for analytics?

Parquet stores data by column, so queries reading a few columns from a wide table skip irrelevant data entirely. Combined with columnar compression (5-10x smaller than CSV) and embedded statistics for predicate pushdown, Parquet queries can be 10-100x faster.

Can I append rows to an existing Parquet file?

Parquet files are immutable once written — the footer metadata must reference all row groups. To add data, write a new Parquet file or use a table format like Delta Lake or Apache Iceberg that manages multiple Parquet files as a single logical table.

What compression should I use with Parquet?

Snappy is the most common default and decompresses very quickly. Zstd provides better compression ratios with slightly slower decompression. Gzip is widely compatible but slower. For most analytical workloads, Snappy or Zstd at the default level is recommended.

What makes .PARQUET special

What is a Parquet file?

Apache Parquet is an open-source columnar storage format designed for efficient data processing at scale. Unlike row-based formats like CSV, Parquet stores data by column, enabling excellent compression and fast analytical queries that only read relevant columns. It is the de facto standard for big data lakes.


How to open Parquet files

  • DuckDB — SELECT * FROM 'file.parquet' for fast SQL queries
  • Python pandas — pd.read_parquet('file.parquet')
  • Apache Spark — Distributed processing
  • Parquet Viewer (VS Code extension) — Visual inspection

Technical specifications

  • Storage: Columnar
  • Compression: Snappy, Gzip, LZ4, Zstd
  • Encoding: Dictionary, RLE, Delta, Bit-packing
  • Schema: Self-describing (embedded schema)
  • Types: Primitive + logical types (decimal, date, timestamp)

Common use cases

  • Data lakes: S3/GCS storage for analytics.
  • ETL pipelines: Efficient intermediate data format.
  • Machine learning: Feature stores and training datasets.
  • Business intelligence: Fast analytical queries.

.PARQUET compared to alternative formats

  • .PARQUET vs .CSV — Query performance — PARQUET wins.
    Parquet's columnar layout allows reading only the columns needed for a query, skipping irrelevant data. CSV requires scanning every row. On wide tables, Parquet can be 10-100x faster.
  • .PARQUET vs .CSV — Storage size — PARQUET wins.
    Parquet with Snappy compression typically stores data 5-10x smaller than equivalent CSV due to columnar compression and encoding schemes that exploit per-column value patterns.
  • .PARQUET vs .AVRO — Analytical queries — PARQUET wins.
    Parquet's columnar storage reads only required columns, while Avro's row-based format must deserialize entire rows. For SELECT on a few columns from wide tables, Parquet is significantly faster.
  • .PARQUET vs .AVRO — Write performance — AVRO wins.
    Avro's row-based format supports faster sequential writes and append operations. Parquet requires buffering a full row group before writing, adding latency for streaming ingestion.

Technical reference

  • MIME Type: application/vnd.apache.parquet
  • Magic Bytes: 50 41 52 31 (PAR1) at start and end of file
  • Developer: Apache Software Foundation
  • Year Introduced: 2013
  • Open Standard: Yes

Binary Structure

A Parquet file opens and closes with the 4-byte magic PAR1 (50 41 52 31). Between the magics, the file contains one or more row groups, each holding a horizontal partition of the dataset. Within each row group, data is stored as column chunks — one per column. Each column chunk contains one or more data pages (plain, dictionary, or data page v2) with page-level encoding (dictionary, RLE, delta, bit-packing) and optional page-level compression (Snappy, Gzip, LZ4, Zstd, or Brotli). After the last row group, a file footer encoded in Apache Thrift compact protocol stores the full schema definition, row group metadata (column chunk offsets, sizes, statistics, encoding info), and key-value metadata. The footer length is stored as a 4-byte little-endian integer immediately before the closing PAR1 magic.

Offset — Length — Field — Description

  • 0x00 — 4 bytes — Header Magic (50 41 52 31, PAR1) — Identifies the file as Apache Parquet. Must match the footer magic at EOF.
  • 0x04 — variable — Row Group 1 — First row group containing column chunks. Each column chunk has page headers followed by compressed/encoded page data.
  • varies — variable — Additional Row Groups — Subsequent row groups, each independently readable for parallel processing.
  • footer start — variable — Footer (Thrift) — Thrift compact protocol encoding of FileMetaData: schema, row group metadata, column statistics, key-value pairs.
  • EOF-8 — 4 bytes — Footer Length — Little-endian 32-bit integer specifying the size of the Thrift footer in bytes.
  • EOF-4 — 4 bytes — Footer Magic (50 41 52 31, PAR1) — Closing magic that must match the header magic. Readers seeking from EOF use this to locate the footer.
Timeline

  • 2013 — Twitter and Cloudera release Apache Parquet, inspired by Google's Dremel paper on columnar storage.
  • 2015 — Parquet graduates to a top-level Apache project; adopted by Apache Spark as its default data format.
  • 2017 — Parquet format version 2.0 adds data page v2 with separate encoding for repetition and definition levels.
  • 2020 — Column encryption (Parquet Modular Encryption) is added to the specification for field-level security.
  • 2022 — DuckDB and Polars adopt Parquet as a primary on-disk format, accelerating adoption outside Hadoop.
  • 2024 — Parquet becomes the de facto standard for data lake storage on S3, GCS, and Azure Blob, with Delta Lake and Iceberg table formats built on top.
Query a Parquet file with DuckDB
duckdb -c "SELECT * FROM 'data.parquet' LIMIT 10;"

DuckDB reads Parquet natively with no import step. Queries execute using columnar vectorized execution for fast results.

Inspect Parquet schema and metadata
duckdb -c "DESCRIBE SELECT * FROM 'data.parquet';"

Shows column names, types, and nullability from the Parquet footer metadata without reading row data.

Convert Parquet to CSV
duckdb -c "COPY (SELECT * FROM 'input.parquet') TO 'output.csv' (HEADER, DELIMITER ',');"

Exports all rows and columns from Parquet to a CSV file with header row. DuckDB handles decompression and type conversion.

Read Parquet with Python and PyArrow
python -c "import pyarrow.parquet as pq; t = pq.read_table('data.parquet'); print(t.schema); print(t.to_pandas().head())"

PyArrow reads Parquet files with full schema support and converts to pandas DataFrame for analysis.

Convert CSV to Parquet with DuckDB
duckdb -c "COPY (SELECT * FROM 'input.csv') TO 'output.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);"

Creates a Parquet file from CSV with Zstd compression. DuckDB auto-infers column types from CSV content.

  • PARQUET → CSV (export, lossy): Flatten columnar Parquet data to row-based CSV for spreadsheet analysis, legacy system import, or human-readable inspection.
  • PARQUET → JSON (export, lossless): Convert columnar data to JSON for web APIs, configuration systems, or tools that consume JSON input. Nested Parquet schemas map naturally to JSON objects.
  • PARQUET → NDJSON (export, lossless): Export to NDJSON for streaming pipelines, Elasticsearch bulk indexing, or line-by-line processing. Each Parquet row becomes one JSON line.
Security risk: MEDIUM

Attack Vectors

  • Deserialization of complex Thrift-encoded footer metadata can trigger buffer overflows in older Parquet readers
  • Malformed column statistics in footer can cause predicate pushdown to return incorrect results silently
  • Crafted Parquet files with extreme row group counts or column counts can exhaust reader memory during metadata parsing

Mitigation: FileDex does not parse, deserialize, or execute Parquet content. Reference page only — no server-side processing.

Tools and libraries

  • DuckDB (tool) — Analytical SQL engine with native Parquet reading and columnar execution
  • Apache Spark (tool) — Distributed data processing engine using Parquet as its default storage format
  • PyArrow (library) — Python library for reading and writing Parquet with full schema support
  • Polars (library) — Fast DataFrame library using Parquet as a primary on-disk format
  • parquet-tools (tool) — CLI for inspecting Parquet metadata, schema, row counts, and column stats
  • Delta Lake (library) — ACID table format built on Parquet files with a transaction log