Apache Avro
Apache Avro is a row-based data serialization format that embeds its JSON schema in the file header. It supports schema evolution and compact binary encoding, and is the de facto standard serialization format in the Apache Kafka ecosystem. This is a reference page only.
Avro is a binary data serialization format. Schema-dependent conversion requires runtime deserialization, which is not available in the browser.
Frequently asked questions
What is an Avro file?
An Avro file is a binary data container that stores records in row-based format alongside a JSON schema in the file header. It supports schema evolution, meaning you can add or remove fields without breaking existing readers. Avro is widely used in Apache Kafka and Hadoop pipelines.
How do I open an Avro file?
Use avro-tools (java -jar avro-tools.jar tojson file.avro) to convert to readable JSON, or Python's fastavro library to read programmatically. VS Code with the Avro Viewer extension can display Avro contents visually.
What is the difference between Avro and Parquet?
Avro is row-based (good for streaming, writing, and full-record access), while Parquet is columnar (good for analytical queries that read few columns from many rows). Avro embeds its schema in the file; Parquet stores schema in footer metadata. Use Avro for Kafka messages and data ingestion, Parquet for data warehouse queries.
Can I convert Avro to CSV?
Yes, but with limitations. Avro supports nested records, arrays, maps, and union types that CSV cannot represent. Flat Avro records convert cleanly; nested structures require manual flattening. Use fastavro in Python or Apache Spark for the conversion.
What distinguishes .AVRO
What is an Avro file?
Apache Avro is a row-based data serialization system that stores data alongside its schema in JSON format. Developed within the Hadoop ecosystem, Avro supports schema evolution (adding/removing fields without breaking readers), compact binary encoding, and RPC. It is widely used in streaming data platforms.
Explore the technical details
How to open Avro files
- avro-tools — java -jar avro-tools.jar tojson file.avro
- Python fastavro — pip install fastavro for reading
- Apache Spark — Distributed processing
- Avro Viewer (VS Code extension) — Visual inspection
Technical specifications
| Property | Value |
|---|---|
| Storage | Row-based |
| Encoding | Binary or JSON |
| Schema | JSON (embedded in file header) |
| Schema Evolution | Forward and backward compatible |
| Compression | Snappy, Deflate, Bzip2, Zstd |
Common use cases
- Apache Kafka: De facto standard serialization format for messages.
- Data pipelines: Hadoop and Spark data processing.
- Schema registry: Confluent Schema Registry integration.
- Event sourcing: Serializing domain events.
Technical reference
- MIME type: application/avro
- Magic bytes: 4F 62 6A 01 ('Obj' followed by version byte 01)
- Developer: Apache Software Foundation
- Year introduced: 2009
- Open standard: Yes
Binary structure
Avro files begin with a 4-byte magic sequence (4F 62 6A 01 — ASCII 'Obj' followed by version 0x01), followed by a file header containing the schema as a JSON string and a sync marker (16-byte random token). Data is stored in blocks — each block has a count of objects, the byte size of serialized data, the compressed data bytes, and a copy of the 16-byte sync marker. The sync marker allows readers to recover from corruption by scanning forward to the next valid block boundary. Blocks can be independently compressed using null, deflate, snappy, zstd, or bzip2 codecs.
| Offset | Length | Field | Example | Description |
|---|---|---|---|---|
| 0x00 | 4 bytes | Magic | 4F 62 6A 01 | ASCII 'Obj' + version byte 0x01 — identifies file as Avro Object Container. |
| 0x04 | variable | File metadata | (map of string->bytes) | Avro map containing 'avro.schema' (JSON string) and 'avro.codec' (compression codec name). Encoded as Avro long-prefixed key-value pairs. |
| variable | 16 bytes | Sync marker | (random 16-byte token) | Randomly generated sync marker unique to this file. Repeated at the end of every data block for block boundary detection and corruption recovery. |
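The header layout above can be parsed with a short stdlib-only sketch. Avro longs are zig-zag varints; this minimal decoder assumes a well-formed header and covers only what the table describes.

```python
# Stdlib-only sketch: verify the 4-byte Avro magic and decode the header
# metadata map far enough to extract the embedded JSON schema bytes.
AVRO_MAGIC = b"Obj\x01"  # 4F 62 6A 01

def read_long(buf, pos):
    """Decode one zig-zag varint long, returning (value, new_pos)."""
    shift, acc = 0, 0
    while True:
        b = buf[pos]; pos += 1
        acc |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos  # undo zig-zag encoding

def read_header(data):
    """Return (metadata dict, 16-byte sync marker) from raw file bytes."""
    assert data[:4] == AVRO_MAGIC, "not an Avro object container"
    pos, meta = 4, {}
    while True:
        count, pos = read_long(data, pos)
        if count == 0:              # zero count terminates the map
            break
        if count < 0:               # negative count: a byte-size long follows
            _, pos = read_long(data, pos)
            count = -count
        for _ in range(count):
            klen, pos = read_long(data, pos)
            key = data[pos:pos + klen].decode("utf-8"); pos += klen
            vlen, pos = read_long(data, pos)
            meta[key] = data[pos:pos + vlen]; pos += vlen
    sync = data[pos:pos + 16]       # sync marker follows the metadata map
    return meta, sync

# meta["avro.schema"] holds the schema as JSON bytes,
# e.g. json.loads(meta["avro.schema"]) yields the schema document.
```

A real reader would go on to decode data blocks and apply the declared codec; this sketch stops at the sync marker, which is all corruption recovery needs to locate.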
Vulnerabilities
- Maliciously crafted schema JSON in the file header could exploit JSON parser vulnerabilities
- Extremely large block sizes declared in header could cause out-of-memory conditions during deserialization
Protection: FileDex does not open, execute, or parse these files. Reference page only.