yarn
Yet Another Resource Negotiator. Let's first go through how things worked in the initial version of Hadoop and which limitations YARN solves.
processing in hadoop / mr version 1
in hdfs, metadata is stored on the name node and the actual data blocks on the data nodes. job execution is controlled by the job tracker (running on the master) and task trackers (running on the slaves)
job tracker
it does a lot of work in terms of:
scheduling:
- deciding which job to execute first based on scheduling logic
- priority of jobs
- checking available resources
- allocating resources to jobs
monitoring:
- tracking the progress of each job
- rerunning tasks to recover from failures
- starting a copy of a slow task on another node (speculative execution)
task tracker
tracks the tasks running on its data node and reports their status back to the job tracker
limitations of hadoop / mr version 1
scalability: when the cluster grew beyond roughly 4,000 data nodes, the job tracker became a bottleneck
resource utilization: a fixed number of map and reduce slots had to be configured at cluster setup. if a job needed more map slots than were configured, it had to wait, even while reduce slots sat idle
job support: only mapreduce jobs were supported
solution
the major bottleneck was that the job tracker did too much work, handling both scheduling and monitoring. yarn solves this by splitting the responsibilities.
it has three components:
resource manager
- the monitoring aspect is taken away from the job tracker; what remains is called the resource manager in yarn
- only does scheduling
- the resource manager creates a container (cpu + memory) on one of the node managers (the yarn counterpart of a task tracker), and inside this container it launches an application master
- allocates resources in the form of containers; the details (container id, hostname) are sent to the application master
node manager
plays the same role as the task tracker: it manages its node's resources and launches containers on it
application master
- takes care of end-to-end monitoring for its application
- asks the resource manager for resources
- requests resources as a number of containers (cpu + memory); the flow is sketched below
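To make the flow concrete, here is a toy model of the negotiation in Python. It is purely illustrative, not the real YARN API; the class and method names (ResourceManager.allocate, ApplicationMaster.run) are invented for this sketch.

```python
# Toy model of the container negotiation above -- NOT the real YARN API.
from dataclasses import dataclass
import itertools

@dataclass
class Container:
    container_id: int
    hostname: str
    vcores: int
    memory_mb: int

class ResourceManager:
    """Only schedules: hands out containers, does no monitoring."""
    def __init__(self, hosts):
        self._hosts = itertools.cycle(hosts)   # stand-in for a real scheduler
        self._ids = itertools.count(1)

    def allocate(self, count, vcores, memory_mb):
        # RM -> AM: container details (container id, hostname)
        return [Container(next(self._ids), next(self._hosts), vcores, memory_mb)
                for _ in range(count)]

class ApplicationMaster:
    """Monitors its own application and asks the RM for containers."""
    def __init__(self, rm):
        self._rm = rm

    def run(self):
        for c in self._rm.allocate(count=3, vcores=2, memory_mb=4096):
            print(f"launching task in container {c.container_id} on {c.hostname}")

ApplicationMaster(ResourceManager(["node1", "node2"])).run()
```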
limitations of hadoop / mr version 1 (solved)
scalability: monitoring is offloaded to the application master, and each application gets its own application master, so no single daemon has to track every task
resource utilization: no more fixed map/reduce slots. with logical containers, resource allocation is much more dynamic: any amount of cpu or memory can be requested, which improves cluster utilization
job support: frameworks other than mapreduce, such as spark and tez, can run on yarn
uberization
if a job is small enough to be handled inside the application master itself, the application master does not ask for more containers and runs the tasks within its own jvm; this mode is called uberization (enabled in mapreduce via mapreduce.job.ubertask.enable).
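Roughly, the decision looks like the check below. The thresholds mirror the defaults of the real mapreduce.job.ubertask.maxmaps / maxreduces settings, but the function itself is invented for this sketch, not Hadoop's actual code.

```python
# Illustrative only -- models the uber-mode decision, not Hadoop's implementation.
UBER_MAX_MAPS = 9      # default of mapreduce.job.ubertask.maxmaps
UBER_MAX_REDUCES = 1   # default of mapreduce.job.ubertask.maxreduces

def should_uberize(num_maps: int, num_reduces: int, uber_enabled: bool) -> bool:
    """Small enough? Then run all tasks inside the application master's own jvm."""
    return uber_enabled and num_maps <= UBER_MAX_MAPS and num_reduces <= UBER_MAX_REDUCES

print(should_uberize(num_maps=3, num_reduces=1, uber_enabled=True))   # True -> uber mode
print(should_uberize(num_maps=50, num_reduces=5, uber_enabled=True))  # False -> normal containers
```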
yarn architecture
ways to execute spark jobs
- interactive mode via spark-shell/pyspark/notebooks
- as a batch job via spark-submit
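For example, a minimal batch application looks like the sketch below (the file name app.py is made up); in interactive mode, pyspark/spark-shell build the session for you instead.

```python
# app.py (hypothetical file name) -- run with: spark-submit app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()
print(spark.range(1000).count())   # 1000
spark.stop()
```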
cluster manager
controls the cluster and allocates resources for the driver and executors. different cluster managers are:
- yarn
- k8s
- mesos
here we will look at yarn only
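The cluster manager is selected by the master URL. A small sketch, assuming the respective endpoints exist (the kubernetes and mesos addresses below are placeholders):

```python
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("demo")
builder.master("yarn")                          # yarn (reads HADOOP_CONF_DIR / YARN_CONF_DIR)
builder.master("k8s://https://apiserver:6443")  # kubernetes (placeholder address)
builder.master("mesos://mesos-master:5050")     # mesos (placeholder address)
builder.master("local[*]")                      # no cluster manager, single local jvm
spark = builder.getOrCreate()                   # last master() call wins: local here
```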
how a program is executed on the cluster
- master/slave architecture
- the driver analyses the work, divides it into many tasks, then distributes, schedules, and monitors those tasks
- an executor is a jvm process with some cpu and memory that executes the code
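A small runnable sketch of that division of labour (local mode, so it runs anywhere): the driver splits the data into 4 partitions, so the job becomes 4 tasks run on executor threads.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("tasks-demo").getOrCreate()

# one task per partition: 4 partitions -> 4 tasks, scheduled by the driver
rdd = spark.sparkContext.parallelize(range(100), numSlices=4)
print(rdd.map(lambda x: x * x).sum())   # executors run the lambda; driver gets 328350
spark.stop()
```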
When a client submits a spark job (via spark-submit), a driver and executors are launched; each job gets its own driver and executors.
Note:
- executors are always launched on the cluster (worker nodes)
- the driver can run on the client machine or on the cluster
client mode: the driver runs on the client machine
- launching spark-shell creates a spark session
- once the spark session is created, a request goes to the resource manager
- the resource manager creates a container on one of the node managers and launches an application master for the submitted spark application
- the application master then negotiates resources from the resource manager in the form of containers
- the resource manager creates containers on the node managers
- the application master launches executors in those containers
- the driver and executors can now communicate directly
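In code, client mode just means building the session in your own process. A sketch, assuming a reachable yarn cluster; the executor sizes are arbitrary example values that map onto the container requests described above:

```python
from pyspark.sql import SparkSession

# This process IS the driver (client mode); executors come from yarn containers.
spark = (SparkSession.builder
         .appName("yarn-client-demo")
         .master("yarn")
         .config("spark.submit.deployMode", "client")
         .config("spark.executor.instances", "2")   # number of executor containers
         .config("spark.executor.cores", "2")       # cpu per container
         .config("spark.executor.memory", "2g")     # memory per container
         .getOrCreate())

print(spark.range(10000).count())
spark.stop()
```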
cluster mode: the driver runs on the cluster
it is similar to client mode, with one difference: the driver runs inside the application master container and negotiates with the resource manager from there. this mode is selected with spark-submit --deploy-mode cluster.
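Cluster mode is chosen at submission time; from Python, the usual route is to shell out to spark-submit (the application script name below is hypothetical):

```python
import subprocess

# The driver will run inside the application master container on the cluster.
subprocess.run(
    ["spark-submit",
     "--master", "yarn",
     "--deploy-mode", "cluster",
     "app.py"],            # hypothetical application script
    check=True,
)
```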
local mode: a single jvm process holds both the driver and the executor; used for local testing
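A minimal local-mode session for testing:

```python
from pyspark.sql import SparkSession

# Driver and executor threads all live in this one jvm process.
spark = SparkSession.builder.master("local[*]").appName("local-test").getOrCreate()
print(spark.range(10).count())   # 10
spark.stop()
```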
Note: the spark session acts as the driver's handle on the application: through it the driver maintains all information, including executor locations, status, and configuration.