Posts by Tag

hadoop

hive basics

6 minute read

It is an open-source data warehouse for processing structured data on top of Hadoop.

yarn

3 minute read

Yet Another Resource Negotiator. Let's first go through how things worked in Hadoop's initial version and which of its limitations YARN solves.

spark part-I

11 minute read

Hadoop offers HDFS for storage, MapReduce for computation, and YARN for resource management.

MapReduce working

2 minute read

MapReduce is a programming model for processing big datasets. It consists of two stages: map and reduce.

hdfs architecture

2 minute read

HDFS is the Hadoop Distributed File System. It is highly fault tolerant and designed to run on low-cost machines.
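The fault tolerance mentioned in the excerpt comes from block replication. A minimal, hypothetical Python sketch (not code from the post) of how a file's storage footprint works out, assuming the common defaults of a 128 MB block size and a replication factor of 3:

```python
import math

# Common HDFS defaults (configurable per cluster); assumptions here.
BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_footprint(file_size_mb):
    # A file is split into fixed-size blocks; each block is
    # replicated across DataNodes for fault tolerance.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION, file_size_mb * REPLICATION

blocks, replicas, total_mb = hdfs_footprint(500)
print(blocks, replicas, total_mb)  # 4 blocks, 12 replicas, 1500 MB stored
```

Losing one low-cost machine loses at most one replica of each block it held, which is why cheap hardware is acceptable.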

Back to Top ↑

hdfs

yarn

3 minute read

Yet Another Resource Negotiator. Let's first go through how things worked in Hadoop's initial version and which of its limitations YARN solves.

spark part-I

11 minute read

Hadoop offers HDFS for storage, MapReduce for computation, and YARN for resource management.

sqoop working

1 minute read

It is used to transfer data from an RDBMS to HDFS and vice versa.

hdfs architecture

2 minute read

HDFS is the Hadoop Distributed File System. It is highly fault tolerant and designed to run on low-cost machines.

Back to Top ↑

hive

hive basics

6 minute read

It is an open-source data warehouse for processing structured data on top of Hadoop.

Back to Top ↑

nosql

cap theorem

less than 1 minute read

It applies to distributed databases and says that we can have only two out of three guarantees.

hbase

5 minute read

RDBMS properties

Back to Top ↑

database

cap theorem

less than 1 minute read

It applies to distributed databases and says that we can have only two out of three guarantees.

hbase

5 minute read

RDBMS properties

Back to Top ↑

spark

spark optimizations

8 minute read

Optimizations can happen at the application-code level or at the cluster level; here we look more at cluster-level optimizations.

spark part-II

8 minute read

Spark Core works on RDDs (Spark 1 style), but we have higher-level constructs, DataFrames/Datasets, to query and process data easily.

yarn

3 minute read

Yet Another Resource Negotiator. Let's first go through how things worked in Hadoop's initial version and which of its limitations YARN solves.

spark part-I

11 minute read

Hadoop offers HDFS for storage, MapReduce for computation, and YARN for resource management.

Back to Top ↑

hbase

hbase

5 minute read

RDBMS properties

Back to Top ↑

scala

spark part-I

11 minute read

Hadoop offers HDFS for storage, MapReduce for computation, and YARN for resource management.

scala part II

5 minute read

Scala runs on top of the JVM. Like Java, it requires a main method, or we can extend App so we don't have to define one.

scala part I

9 minute read

Spark code can be written in different languages (Scala, Python, Java, R); Scala is hybrid, OOP + functional.

Back to Top ↑

compression

Back to Top ↑

jekyll

Build and deploy static website

1 minute read

There are different ways to host a website; for this blog we are focusing on the use case of a blogging website, and we are using: je...

Back to Top ↑

github

Build and deploy static website

1 minute read

There are different ways to host a website; for this blog we are focusing on the use case of a blogging website, and we are using: je...

Back to Top ↑

mapreduce

MapReduce working

2 minute read

MapReduce is a programming model for processing big datasets. It consists of two stages: map and reduce.
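The two stages can be illustrated with a minimal, pure-Python word-count sketch (a hypothetical example, not code from the post; real Hadoop jobs use the Java API or Hadoop Streaming):

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # shuffle + reduce: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big datasets", "data"]
print(reduce_phase(map_phase(lines)))  # {'big': 2, 'data': 2, 'datasets': 1}
```

In a real cluster the map and reduce calls run in parallel on different nodes, with the framework handling the shuffle between them.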

Back to Top ↑

sqoop

sqoop working

1 minute read

It is used to transfer data from an RDBMS to HDFS and vice versa.

Back to Top ↑

csv

Back to Top ↑

xml

Back to Top ↑

json

Back to Top ↑

avro

Back to Top ↑

orc

Back to Top ↑

parquet

Back to Top ↑

optimization

Back to Top ↑

scd

slowly changing dimensions

less than 1 minute read

It is for dimension tables that change infrequently in the source RDBMS and that we want to load into a data warehouse or HDFS.

Back to Top ↑

datawarehouse

slowly changing dimensions

less than 1 minute read

It is for dimension tables that change infrequently in the source RDBMS and that we want to load into a data warehouse or HDFS.

Back to Top ↑

cap theorem

cap theorem

less than 1 minute read

It applies to distributed databases and says that we can have only two out of three guarantees.

Back to Top ↑

cassandra

Back to Top ↑

time complexity

time complexity

2 minute read

A way to express the time consumed by an algorithm as a function of its input size.
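As a quick illustration of why this matters (a hypothetical sketch, not code from the post), counting comparisons contrasts O(n) linear search with O(log n) binary search on the same sorted input:

```python
def linear_steps(arr, target):
    # O(n): scan every element until the target is found
    steps = 0
    for x in arr:
        steps += 1
        if x == target:
            break
    return steps

def binary_steps(arr, target):
    # O(log n): halve the sorted search range each iteration
    steps, lo, hi = 0, 0, len(arr) - 1
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if arr[mid] == target:
            break
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

arr = list(range(1024))
print(linear_steps(arr, 1000))  # 1001 comparisons
print(binary_steps(arr, 1000))  # 10 comparisons
```

Doubling the input roughly doubles the linear count but adds only one comparison to the binary count.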

Back to Top ↑

data structures

time complexity

2 minute read

A way to express the time consumed by an algorithm as a function of its input size.

Back to Top ↑

pyspark

spark part-I

11 minute read

Hadoop offers HDFS for storage, MapReduce for computation, and YARN for resource management.

Back to Top ↑

python

Back to Top ↑

yarn

yarn

3 minute read

Yet Another Resource Negotiator. Let's first go through how things worked in Hadoop's initial version and which of its limitations YARN solves.

Back to Top ↑

spark streaming

spark streaming

8 minute read

In batch processing, batch jobs run at a certain frequency, but in some cases we need that batch to be very small (depending on the requirement, let's say we h...

Back to Top ↑