Avro, Parquet, JSON, ORC Compression using ADF
Summary
The source file is a 126.48 MB CSV file which contains Postcodes and Local Authority data.
In this post, I’ll compare the output file size when the same transformation is written to various file formats (Avro, JSON, Parquet and ORC).
The file is ingested from Azure Blob Storage (Data Lake). The transformation itself is fairly simple: remove duplicate rows from the source file. The Dataflow uses Surrogate Keys to filter out matching rows. This process is documented here:
ADF Dataflow
The Data Flow pipeline is as follows. Duplicate entries are removed.
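For readers who prefer to see the logic outside the ADF UI, the sketch below reproduces the same surrogate-key de-duplication idea in Python with pandas. The file names are assumptions for illustration; the actual pipeline reads from Blob Storage and runs inside the Dataflow.

```python
import pandas as pd

# Hypothetical local file names; the original pipeline reads from Azure Blob Storage.
SOURCE_CSV = "postcodes_local_authority.csv"
OUTPUT_CSV = "postcodes_deduplicated.csv"

# Load the source data.
df = pd.read_csv(SOURCE_CSV)

# Mimic the Dataflow's surrogate-key approach: number every row, then keep
# only the first occurrence of each distinct record.
df["surrogate_key"] = range(1, len(df) + 1)
data_columns = list(df.columns.drop("surrogate_key"))
first_keys = df.groupby(data_columns)["surrogate_key"].transform("min")
deduplicated = df[df["surrogate_key"] == first_keys].drop(columns="surrogate_key")

deduplicated.to_csv(OUTPUT_CSV, index=False)
print(f"Rows before: {len(df)}, after de-duplication: {len(deduplicated)}")
```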
File Compression Results
| File    | Size (MB) |
|---------|-----------|
| CSV     | 126.48    |
| Avro    | 112.81    |
| JSON    | 280.08    |
| Parquet | 21.63     |
| ORC     | 21.26     |
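The figures above come from the ADF Dataflow sinks, but a similar comparison can be sketched locally. The snippet below assumes pandas with pyarrow installed and hypothetical local file names; it writes one DataFrame to several formats and prints the resulting sizes (Avro is omitted because pandas has no built-in writer for it).

```python
import os
import pandas as pd

# Hypothetical local path; in the post the files are produced by ADF Dataflow sinks.
df = pd.read_csv("postcodes_deduplicated.csv")

outputs = {
    "csv": lambda path: df.to_csv(path, index=False),
    "json": lambda path: df.to_json(path, orient="records", lines=True),
    "parquet": lambda path: df.to_parquet(path, index=False),  # requires pyarrow
    "orc": lambda path: df.to_orc(path, index=False),          # requires pyarrow, pandas >= 1.5
}

for fmt, writer in outputs.items():
    path = f"postcodes.{fmt}"
    writer(path)
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"{fmt:<8} {size_mb:8.2f} MB")
```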
Which file types to use?
The Avro file format should be preferred when I/O patterns are write-heavy, or when queries return multiple whole rows of records per request. Avro works well with a message broker such as Event Hub or Kafka, which writes many events/messages in quick succession.
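As a rough illustration of that write-heavy, row-at-a-time pattern, here is a minimal sketch using the fastavro library; the schema and event values are made up for the example.

```python
from fastavro import parse_schema, writer

# Hypothetical schema for a stream of postcode events. Avro stores the schema
# once in the file header and then appends compact binary records row by row,
# which suits write-heavy, event-style workloads.
schema = parse_schema({
    "type": "record",
    "name": "PostcodeEvent",
    "fields": [
        {"name": "postcode", "type": "string"},
        {"name": "local_authority", "type": "string"},
    ],
})

events = [
    {"postcode": "SW1A 1AA", "local_authority": "Westminster"},
    {"postcode": "M1 1AE", "local_authority": "Manchester"},
]

# Append a batch of events to a single Avro container file.
with open("postcode_events.avro", "wb") as out:
    writer(out, schema, events)
```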
Parquet and ORC file formats should be used when I/O patterns are read-heavy, or when queries return only a specific subset of columns in the records (for example, customer name, phone and email only). Because the data is stored column by column, read transactions can be optimised to retrieve just those columns instead of reading the entire record.
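A short sketch of that column-pruning behaviour, assuming a local Parquet file and hypothetical column names, using pyarrow:

```python
import pyarrow.parquet as pq

# Read only the columns the query actually needs; with Parquet's columnar
# layout the remaining columns are never pulled from storage.
table = pq.read_table("postcodes.parquet", columns=["postcode", "local_authority"])
print(table.num_rows, table.column_names)
```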
Best Use Case
| File         | Optimal Use Case                          |
|--------------|-------------------------------------------|
| ORC, Parquet | Read-heavy operations (ADF, Databricks)   |
| Avro         | Write-heavy operations (Event Hub, Kafka) |