file formats for big data
There are certain parameters to consider when choosing a file format.
- storage space consumption
- processing time
- io consumption
- read/write speed
- whether files can be split (for parallel reads)
- schema evolution
- compression
- compatibility with frameworks like hive/spark etc.
ways in which data can be stored
row based: values of a row are stored together. Writes are simple, but a read has to scan full rows even when only a subset of columns is needed.
column based: all values of a column are stored together. Writes are comparatively slower, but reads of a subset of columns are efficient, and we also get good compression because similar values sit together (see the sketch below).
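A minimal sketch of the difference, using plain Python lists to stand in for the on-disk layout (the records and column names are made up):

```python
# Toy sketch of the same three records laid out row-wise vs column-wise.
records = [
    (1, "alice", 100),   # (id, name, amount)
    (2, "bob",   200),
    (3, "carol", 300),
]

# row based: each record is stored contiguously
row_store = records

# column based: each column is stored contiguously
col_store = {
    "id":     [r[0] for r in records],
    "name":   [r[1] for r in records],
    "amount": [r[2] for r in records],
}

# Reading only the "amount" column:
# the row store must touch every full record...
amounts_from_rows = [r[2] for r in row_store]
# ...while the column store reads just one contiguous list.
amounts_from_cols = col_store["amount"]
print(amounts_from_rows == amounts_from_cols)  # True
```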
file formats
csv:
- all data is stored as text, so it takes a lot of storage; for example, an integer column consumes more space because each number is stored as its text representation (illustrated below).
- processing is slow because every value has to be parsed/converted from text.
- io is slow because more bytes are stored, so more io has to be done.
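A quick illustration of the text-vs-binary point in plain Python; the 4-byte size assumes the number is stored as a 32-bit integer in a binary format:

```python
import struct

value = 1234567

as_text = str(value).encode("utf-8")   # how csv stores it
as_binary = struct.pack("<i", value)   # fixed-width 32-bit binary representation

print(len(as_text))    # 7 bytes as text
print(len(as_binary))  # 4 bytes as binary

# reading from csv also pays a parse cost for every value
parsed = int(as_text.decode("utf-8"))
print(parsed == value)  # True
```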
xml:
- semi structured
- all the negatives of csv apply here as well.
- these files cannot be split.
json:
- semi structured
- all the negatives of csv apply here as well.
avro:
- row based storage
- faster writes as it is row based
- slower reads when only a subset of columns is needed
- the schema of the file is stored as json (example below)
- data is self describing because the schema is embedded as part of the data
- the actual data is stored in binary format
- general purpose and programming language agnostic; can be used from many languages
- mature schema evolution support
- serialization format
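A small sketch of writing and reading an avro file from Python. It assumes the third-party fastavro package is installed; the schema, field names, and file name are made up for illustration:

```python
from fastavro import writer, reader, parse_schema

# json schema, embedded into the file so the data stays self describing
schema = parse_schema({
    "type": "record",
    "name": "Sale",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "product", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

records = [
    {"id": 1, "product": "pen", "amount": 9.5},
    {"id": 2, "product": "book", "amount": 120.0},
]

with open("sales.avro", "wb") as out:
    writer(out, schema, records)   # rows are written together (row based)

with open("sales.avro", "rb") as inp:
    for rec in reader(inp):        # the reader picks up the embedded schema
        print(rec)
```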
orc
- optimized row columnar
- writes are not efficient
- efficient reads
- highly efficient in terms of storage
- compression (dictionary encoding, bit packing, delta encoding, run length encoding, along with generalized compression like snappy/gzip)
- supports predicate pushdown, skipping stripes/row groups using the stored min/max statistics (see the spark sketch after this list)
- best fit for hive; supports all datatypes used in hive, including complex types
- was initially designed specifically for hive
- supports schema evolution, but not as mature as avro
- self describing, as it stores metadata (using protocol buffers) at the end of the file itself
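A small pyspark sketch of writing and reading orc; the column names and the /tmp path are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "pen", 9.5), (2, "book", 120.0)],
    ["id", "product", "amount"],
)

# columnar write (slower than a row format, but the layout pays off on read)
df.write.mode("overwrite").orc("/tmp/sales_orc")

# reading one column with a filter: spark prunes columns and can push the
# predicate down to the orc reader, which skips stripes/row groups using
# the min/max statistics stored in the file
result = (
    spark.read.orc("/tmp/sales_orc")
    .select("amount")
    .where("amount > 100")
)
result.show()
```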
parquet
- column based storage
- writes are not efficient
- efficient reads
- shares many design patterns with orc, but is more general purpose
- very good at handling nested data (see the sketch after this list)
- compression is efficient
- self describing, as it stores metadata at the end of the file itself
- supports schema evolution (adding/removing columns at the end of the schema)
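A small sketch of nested data in parquet using pyarrow; the column names and file name are made up for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# a nested struct column plus a list column
table = pa.table({
    "order_id": [1, 2],
    "customer": [
        {"name": "alice", "city": "pune"},
        {"name": "bob", "city": "delhi"},
    ],
    "items": [["pen", "book"], ["pencil"]],
})

pq.write_table(table, "orders.parquet")

# read back only the nested customer column
subset = pq.read_table("orders.parquet", columns=["customer"])
print(subset.schema)
```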
orc storage internals
An orc file mainly has the below sections.
Header
It contains the text ORC (the magic bytes).
Body
In it, data is divided into multiple stripes (default size is 250MB), and each stripe has:
- Index data: min, max, and count of each column in every row group in the stripe
- Row data: the actual data, broken into row groups; each row group has 10,000 rows by default
- Stripe footer: stores the encodings used
Footer
- File footer: contains metadata at file and stripe level, like min, max, count.
- Postscript: stores which compression is used, like snappy/gzip; the postscript itself is never compressed
Note: the read flow is: the header is read to identify the file as orc, then the postscript is read to get the compression used, then the file footer, and then the stripes and row data.
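A rough sketch of inspecting this structure from Python; it assumes pyarrow is installed with orc support, and the file name is made up:

```python
import pyarrow as pa
from pyarrow import orc

table = pa.table({"id": [1, 2, 3], "amount": [9.5, 120.0, 30.0]})
orc.write_table(table, "sales.orc")

f = orc.ORCFile("sales.orc")
print(f.nrows)     # total rows in the file
print(f.nstripes)  # number of stripes (1 here, the data is tiny)
print(f.schema)    # schema reconstructed from the file footer

# read a single stripe, projecting only one column
print(f.read_stripe(0, columns=["amount"]))
```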
parquet storage internals
A parquet file mainly has the below sections.
Header
It contains the text PAR1 (the magic bytes).
Row group
In it, data is divided into column chunks, which are further divided into pages.
Footer
- File metadata: encoding, schema, etc.
- Length of the file metadata
- Magic number PAR1
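A small sketch of checking the magic bytes and reading the footer metadata with pyarrow; the file name is made up:

```python
import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(pa.table({"id": [1, 2, 3]}), "demo.parquet")

# the file starts and ends with the PAR1 magic
with open("demo.parquet", "rb") as f:
    data = f.read()
print(data[:4], data[-4:])   # b'PAR1' b'PAR1'

# the footer's file metadata describes row groups, schema and compression
meta = pq.ParquetFile("demo.parquet").metadata
print(meta.num_row_groups)
print(meta.row_group(0).column(0).compression)
```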
compression
dictionary encoding
Suppose we have sales data with product name and customer name columns; those columns will repeat the same values across many rows. Dictionary encoding helps here by building a dictionary of the distinct values and storing only references (indexes) into it. A sketch follows.
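A toy sketch of the idea in plain Python (not the exact encoding parquet/orc use):

```python
# Toy dictionary encoding of a repetitive column.
column = ["pen", "book", "pen", "pen", "book", "pencil", "pen"]

dictionary = sorted(set(column))              # distinct values: ['book', 'pen', 'pencil']
index = {value: i for i, value in enumerate(dictionary)}
encoded = [index[value] for value in column]  # small integers instead of repeated strings

print(dictionary)  # ['book', 'pen', 'pencil']
print(encoded)     # [1, 0, 1, 1, 0, 2, 1]

# decoding just looks the indexes back up
decoded = [dictionary[i] for i in encoded]
print(decoded == column)  # True
```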
bit packing
Suppose we have an int column in the dataset; if its values are small (they need fewer bits than the full 32/64-bit type), bit packing can represent the same numbers with fewer bits. A sketch follows.
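A toy sketch in plain Python, assuming all values fit in 4 bits (not the exact packing parquet/orc use):

```python
# Toy bit packing: values 0-15 fit in 4 bits, so two of them fit in one byte.
values = [3, 9, 15, 0, 7, 12]            # would take 4 bytes each as int32

packed = bytearray()
for i in range(0, len(values), 2):
    pair = values[i:i + 2] + [0]         # pad if the count is odd
    packed.append((pair[0] << 4) | pair[1])

print(len(values) * 4, "bytes unpacked (as int32)")  # 24
print(len(packed), "bytes packed")                   # 3

# unpacking reverses the shifts/masks
unpacked = []
for byte in packed:
    unpacked.extend([(byte >> 4) & 0xF, byte & 0xF])
print(unpacked[:len(values)] == values)  # True
```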
delta encoding
Suppose we have a timestamp column in the dataset; the first timestamp is stored as-is, and for each following value only the difference is stored. For example, if a value is 123456 and the next value is 123457, the base value stored is 123456 and the next stored value is just 1. A sketch follows.
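A toy sketch in plain Python (not the exact encoding parquet/orc use):

```python
# Toy delta encoding of an increasing timestamp-like column.
values = [123456, 123457, 123460, 123465]

base = values[0]
deltas = [b - a for a, b in zip(values, values[1:])]
print(base, deltas)  # 123456 [1, 3, 5]

# small deltas need far fewer bits than the full values;
# decoding is a running sum starting from the base
decoded = [base]
for d in deltas:
    decoded.append(decoded[-1] + d)
print(decoded == values)  # True
```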
run length encoding
Suppose we have a column in the dataset whose value is dddddfffgg; then the value stored is d5f3g2. A sketch follows.
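A toy sketch in plain Python of the same example (real formats combine this with the other encodings):

```python
from itertools import groupby

# Toy run length encoding of a repetitive sequence.
data = "dddddfffgg"

encoded = "".join(f"{char}{len(list(run))}" for char, run in groupby(data))
print(encoded)  # d5f3g2

# decoding expands each (char, count) pair; this simple scheme assumes
# the data characters themselves are not digits
decoded = ""
i = 0
while i < len(encoded):
    char = encoded[i]
    count = ""
    i += 1
    while i < len(encoded) and encoded[i].isdigit():
        count += encoded[i]
        i += 1
    decoded += char * int(count)
print(decoded == data)  # True
```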
Note:
- serialization is converting data into a form that can easily be transferred over the network and stored in a file system.
- no other file format handles schema evolution better than avro.
- avro can be the best fit for landing raw data in a data lake.
- in avro, orc, and parquet any compression codec can be used; the codec is stored in the metadata, so a reader can find out from the metadata which codec was used.
Sources
- https://orc.apache.org/specification/ORCv2/
- https://parquet.apache.org/documentation/latest/