Avro, Parquet, JSON, ORC Compression using ADF

Summary

The source file is a 126.48 MB CSV containing Postcode and Local Authority data.

In this post, I’ll be comparing the output file sizes when the same transformation is written to various file formats (Avro, JSON, Parquet and ORC).

The file is ingested from Azure Blob Storage (Data Lake). The transformation itself is fairly simple: it removes duplicate rows from the source file. The Data Flow uses Surrogate Keys to filter out the matching rows. This process is documented here:

ADF Dataflow

The Data Flow pipeline is as follows. Duplicate entries are removed.
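
For readers who prefer to see the logic in code, here is a minimal pandas sketch of the same deduplication step. The file paths and column handling are assumptions; the actual ADF Data Flow is built in the visual designer, so this only illustrates the Surrogate Key + filter pattern, not the pipeline’s real implementation.

```python
import pandas as pd

# Hypothetical local copy of the 126 MB source CSV.
df = pd.read_csv("postcodes_local_authority.csv")

# Assign a surrogate key (1, 2, 3, ...) within each group of identical rows.
# dropna=False keeps rows containing missing values in their own groups.
key_cols = list(df.columns)
df["surrogate_key"] = df.groupby(key_cols, dropna=False).cumcount() + 1

# Keep the first occurrence of each row and drop the rest (the duplicates).
deduped = df[df["surrogate_key"] == 1].drop(columns="surrogate_key")

deduped.to_csv("postcodes_deduped.csv", index=False)
```

In plain pandas this collapses to df.drop_duplicates(), but the explicit key mirrors the Surrogate Key transformation used in the Data Flow.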

File Compression Results

File       Size (MB)
CSV        126.48
Avro       112.81
JSON       280.08
Parquet     21.63
ORC         21.26
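
The sizes above were measured on the Data Flow sink outputs. As a rough way to reproduce the comparison locally, the sketch below writes one DataFrame to each format and prints the resulting file sizes. The library choices (pandas, pyarrow, fastavro) and the string-only Avro schema are assumptions, and the exact numbers will differ from ADF’s output depending on the compression codecs in play.

```python
import os

import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc
from fastavro import parse_schema, writer

deduped = pd.read_csv("postcodes_deduped.csv")  # output of the dedup sketch above

# CSV and JSON: uncompressed, row-oriented text formats.
deduped.to_csv("out.csv", index=False)
deduped.to_json("out.json", orient="records", lines=True)

# Parquet and ORC: columnar binary formats (pyarrow handles both).
deduped.to_parquet("out.parquet", index=False)
orc.write_table(pa.Table.from_pandas(deduped, preserve_index=False), "out.orc")

# Avro: row-oriented binary format; fastavro needs an explicit schema, and
# Avro field names cannot contain spaces, so the columns are sanitised first.
avro_df = deduped.rename(columns=lambda c: c.strip().replace(" ", "_")).astype(str)
schema = parse_schema({
    "name": "Row",
    "type": "record",
    "fields": [{"name": c, "type": "string"} for c in avro_df.columns],
})
with open("out.avro", "wb") as fo:
    writer(fo, schema, avro_df.to_dict("records"))

for path in ("out.csv", "out.json", "out.parquet", "out.orc", "out.avro"):
    print(f"{path}: {os.path.getsize(path) / 1024 ** 2:.2f} MB")
```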

Which file types to use?

The Avro file format should be preferred when I/O patterns are more write-heavy, or when query patterns return multiple whole rows of records for every request. Avro works well with message brokers such as Event Hub or Kafka, which write multiple events/messages in succession.
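
As a small illustration of that row-oriented write pattern, the sketch below streams records into an Avro file with fastavro. The Event schema and record counts are made up, and a real Event Hub or Kafka pipeline would use the respective client SDKs rather than a local file.

```python
from fastavro import parse_schema, writer

# Hypothetical schema for a stream of broker messages.
schema = parse_schema({
    "name": "Event",
    "type": "record",
    "fields": [
        {"name": "event_id", "type": "long"},
        {"name": "payload", "type": "string"},
    ],
})

def event_stream(count):
    """Yield events one at a time, as they would arrive from a broker."""
    for i in range(count):
        yield {"event_id": i, "payload": f"message {i}"}

# writer() consumes the iterable lazily, encoding and flushing whole rows in
# blocks as they arrive, which is what makes Avro cheap for append-style writes.
with open("events.avro", "wb") as fo:
    writer(fo, schema, event_stream(100_000))
```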

The Parquet and ORC file formats should be used when I/O patterns are more read-heavy, or when query patterns return a specific subset of columns in the records (e.g. customer name, phone and email only). Read transactions can be optimised to retrieve specific columns instead of reading the entire record.
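
To make that read-side benefit concrete, here is a minimal sketch of column pruning against the Parquet and ORC outputs written earlier. The column names are hypothetical and would need to match the actual CSV headers.

```python
import pandas as pd
import pyarrow.orc as orc

# Parquet: only the requested columns are read from disk, not the full record.
subset = pd.read_parquet("out.parquet", columns=["postcode", "local_authority"])
print(subset.head())

# ORC supports the same column projection via pyarrow.
orc_subset = orc.read_table("out.orc", columns=["postcode", "local_authority"])
print(orc_subset.num_rows, orc_subset.column_names)
```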

Best Use Case

File          Optimal Use Case
ORC, Parquet  Read-heavy operations (ADF, Databricks)
Avro          Write-heavy operations (Event Hub, Kafka)