Posts by Tag

hadoop

hive basics

6 minute read

It is an open-source data warehouse for processing structured data on top of Hadoop.

yarn

3 minute read

Yet Another Resource Negotiator. Let's first go through how things worked in Hadoop's initial version and which of its limitations YARN solves.

spark part-I

11 minute read

Hadoop offers HDFS for storage, MapReduce for computation, and YARN for resource management.

MapReduce working

2 minute read

MapReduce is a programming model for processing big datasets. It consists of two stages: map and reduce.

hdfs architecture

2 minute read

HDFS is the Hadoop Distributed File System. It is highly fault tolerant and designed to run on low-cost machines.
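The fault tolerance mentioned in the excerpt comes from block replication. A minimal, hypothetical Python sketch (not code from the post) of how a file's storage footprint works out, assuming the common defaults of a 128 MB block size and a replication factor of 3:

```python
import math

# Common HDFS defaults (configurable per cluster); assumptions here.
BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_footprint(file_size_mb):
    # A file is split into fixed-size blocks; each block is
    # replicated across DataNodes for fault tolerance.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION, file_size_mb * REPLICATION

blocks, replicas, total_mb = hdfs_footprint(500)
print(blocks, replicas, total_mb)  # 4 blocks, 12 replicas, 1500 MB stored
```

Losing one low-cost machine loses at most one replica of each block it held, which is why cheap hardware is acceptable.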

Back to Top ↑

hdfs

yarn

3 minute read

Yet Another Resource Negotiator. Let's first go through how things worked in Hadoop's initial version and which of its limitations YARN solves.

spark part-I

11 minute read

Hadoop offers HDFS for storage, MapReduce for computation, and YARN for resource management.

sqoop working

1 minute read

It is used to transfer data from an RDBMS to HDFS and vice versa.

hdfs architecture

2 minute read

HDFS is the Hadoop Distributed File System. It is highly fault tolerant and designed to run on low-cost machines.

Back to Top ↑

hive

hive basics

6 minute read

It is an open-source data warehouse for processing structured data on top of Hadoop.

Back to Top ↑

nosql

cap theorem

less than 1 minute read

It applies to distributed databases and says that we can have only two out of three guarantees.

hbase

5 minute read

RDBMS properties

Back to Top ↑

database

cap theorem

less than 1 minute read

It applies to distributed databases and says that we can have only two out of three guarantees.

hbase

5 minute read

RDBMS properties

Back to Top ↑

spark

spark optimizations

8 minute read

Optimizations can happen at the application-code level or at the cluster level; here we look more at cluster-level optimizations.

spark part-II

8 minute read

Spark Core works on RDDs (Spark 1 style), but we have higher-level constructs, DataFrames/Datasets, to query and process data easily.

yarn

3 minute read

Yet Another Resource Negotiator. Let's first go through how things worked in Hadoop's initial version and which of its limitations YARN solves.

spark part-I

11 minute read

Hadoop offers HDFS for storage, MapReduce for computation, and YARN for resource management.

Back to Top ↑

hbase

hbase

5 minute read

RDBMS properties

Back to Top ↑

scala

spark part-I

11 minute read

Hadoop offers HDFS for storage, MapReduce for computation, and YARN for resource management.

scala part II

5 minute read

Scala runs on top of the JVM. Like Java, it requires a main method, or we can extend App so we don't have to define one.

scala part I

9 minute read

Spark code can be written in different languages (Scala, Python, Java, R); Scala is hybrid, OOP + functional.

Back to Top ↑

compression

Back to Top ↑

jekyll

Build and deploy static website

1 minute read

There are different ways to host a website; for this blog we are focusing on the use case of a blogging website, and we are using: je...

Back to Top ↑

github

Build and deploy static website

1 minute read

There are different ways to host a website; for this blog we are focusing on the use case of a blogging website, and we are using: je...

Back to Top ↑

mapreduce

MapReduce working

2 minute read

MapReduce is a programming model for processing big datasets. It consists of two stages: map and reduce.
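The two stages can be illustrated with a minimal, pure-Python word-count sketch (a hypothetical example, not code from the post; real Hadoop jobs use the Java API or Hadoop Streaming):

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # shuffle + reduce: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big datasets", "data"]
print(reduce_phase(map_phase(lines)))  # {'big': 2, 'data': 2, 'datasets': 1}
```

In a real cluster the map and reduce calls run in parallel on different nodes, with the framework handling the shuffle between them.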

Back to Top ↑

sqoop

sqoop working

1 minute read

It is used to transfer data from an RDBMS to HDFS and vice versa.

Back to Top ↑

csv

Back to Top ↑

xml

Back to Top ↑

json

Back to Top ↑

avro

Back to Top ↑

orc

Back to Top ↑

parquet

Back to Top ↑

optimization

Back to Top ↑

scd

slowly changing dimensions

less than 1 minute read

It is for dimension tables that change infrequently in the source RDBMS and that we want to load into a data warehouse or HDFS.

Back to Top ↑

datawarehouse

slowly changing dimensions

less than 1 minute read

It is for dimension tables that change infrequently in the source RDBMS and that we want to load into a data warehouse or HDFS.

Back to Top ↑

cap theorem

cap theorem

less than 1 minute read

It applies to distributed databases and says that we can have only two out of three guarantees.

Back to Top ↑

cassandra

Back to Top ↑

time complexity

time complexity

2 minute read

A way to express the time consumed by an algorithm as a function of its input size.
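As a quick illustration of why this matters (a hypothetical sketch, not code from the post), counting comparisons contrasts O(n) linear search with O(log n) binary search on the same sorted input:

```python
def linear_steps(arr, target):
    # O(n): scan every element until the target is found
    steps = 0
    for x in arr:
        steps += 1
        if x == target:
            break
    return steps

def binary_steps(arr, target):
    # O(log n): halve the sorted search range each iteration
    steps, lo, hi = 0, 0, len(arr) - 1
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if arr[mid] == target:
            break
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

arr = list(range(1024))
print(linear_steps(arr, 1000))  # 1001 comparisons
print(binary_steps(arr, 1000))  # 10 comparisons
```

Doubling the input roughly doubles the linear count but adds only one comparison to the binary count.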

Back to Top ↑

data structures

time complexity

2 minute read

A way to express the time consumed by an algorithm as a function of its input size.

Back to Top ↑

pyspark

spark part-I

11 minute read

Hadoop offers HDFS for storage, MapReduce for computation, and YARN for resource management.

Back to Top ↑

python

Back to Top ↑

yarn

yarn

3 minute read

Yet Another Resource Negotiator. Let's first go through how things worked in Hadoop's initial version and which of its limitations YARN solves.

Back to Top ↑

spark streaming

spark streaming

8 minute read

In batch processing, batch jobs run at a certain frequency, but in some cases we need that batch to be very small (depending on the requirement, let's say we h...

Back to Top ↑