Compression in Hadoop

Compression helps to:

  • save storage space
  • reduce I/O cost

Note: compressing and decompressing data adds some CPU cost, but the I/O savings usually outweigh it.
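
For example, in a plain MapReduce job the intermediate map output can be compressed to cut shuffle I/O at the price of some CPU. A minimal sketch in Java, assuming Hadoop 2+ property names; the job name is hypothetical and the mapper/reducer setup is elided:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class CompressedShuffleJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // trade some CPU for less disk and network I/O during the shuffle
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);
            Job job = Job.getInstance(conf, "compressed-shuffle"); // hypothetical name
            // ... set mapper, reducer, input/output paths, then submit the job
        }
    }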

Compression techniques

Some compression codecs are optimized for:

  • storage
  • speed

Snappy:

  • fast compression codec
  • optimized for speed rather than storage
  • not splittable on its own, but container file formats like Avro/ORC/Parquet handle splitting internally, so Snappy works well with them (a write example follows this list)
  • distributed with Hadoop
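
As an illustration, here is a minimal sketch that writes a Snappy-compressed SequenceFile using the stock Hadoop Java API; the output path and record contents are made up for the example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class SnappySequenceFileWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/tmp/data.seq"); // hypothetical output path
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    // block-level compression with the Snappy codec
                    SequenceFile.Writer.compression(
                            SequenceFile.CompressionType.BLOCK, new SnappyCodec()))) {
                for (int i = 0; i < 100; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            }
        }
    }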

LZO:

  • optimized for speed rather than storage
  • splittable, but only after an additional indexing step (e.g. the LzoIndexer tool that ships with the hadoop-lzo library)
  • good for plain text files
  • not distributed with Hadoop; requires a separate install
  • comparatively slower than Snappy

gzip:

  • optimized for storage, roughly 2.5x the compression of Snappy
  • not splittable, but like Snappy it can be used inside container file formats such as Avro/ORC/Parquet
  • processing is slower: since the data compresses into fewer blocks there is less parallelism, though reducing the block size creates more splits and speeds processing back up (an output-compression sketch follows this list)
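
To show where gzip typically fits, below is a hedged sketch that compresses a MapReduce job's final output with the built-in GzipCodec; the job name is hypothetical and the rest of the job setup is elided:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class GzipOutputJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "gzip-output"); // hypothetical
            // compress the final job output with gzip for maximum storage savings
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            // ... set mapper, reducer, input/output paths, then submit the job
        }
    }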

bzip2:

  • excellent storage savings, compressing around 9% better than gzip
  • significantly slower, around 10x slower than gzip
  • splittable
  • a good fit for archival data (a read-back sketch follows this list)
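
Since archived bzip2 files eventually need to be read back, this small sketch decompresses any codec Hadoop recognizes by file extension (CompressionCodecFactory resolves a .bz2 suffix to BZip2Codec); the input path is hypothetical:

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class ReadCompressedFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/archive/logs.bz2"); // hypothetical input path
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
            if (codec == null) {
                System.err.println("no codec found for " + path);
                return;
            }
            FileSystem fs = path.getFileSystem(conf);
            // wrap the raw HDFS stream in a decompressing stream and copy to stdout
            try (InputStream in = codec.createInputStream(fs.open(path))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }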

Note: Snappy is the most commonly used codec, as it provides a good trade-off between speed and size.
