hive basics

6 minute read

hive is an open source data warehouse used to process structured data on top of hadoop

why hive

writing raw map reduce code to process data on hadoop takes a lot of effort. hive was developed to address this: queries are written in an sql-like language called hiveql, and hive internally translates them into map reduce jobs and runs them on the hadoop cluster

architecture

  • actual data: stored in hdfs as files

  • metadata: stored in a relational database. hdfs is write once / read many, so it does not support the in-place updates needed for schema evolution, and it also has higher latency than a database, which is why a database is used for metadata

(hive architecture diagram)

  • hive is schema on read, no constraints
  • minimal index support
  • only equi-joins
  • restricted sub queries

metastore

  • stores metadata of all hive tables
  • maps files and directories in hdfs to tables
  • holds table schema
  • any database with jdbc driver can be used as metastore
  • when a query runs in hive, the schema from the metastore is imposed on the data to get a tabular view
  • default is the derby database, not recommended as it allows only one connection at a time

In cloudera quickstart, mysql is used for the metastore. The database name is metastore, and we can connect to it to query what metadata is stored. For example, the TBLS table has details of the hive tables.
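a quick look from the mysql shell might go like this (the TBLS columns shown come from a typical metastore schema and may differ between hive versions):

-- connect to the metastore database and inspect table metadata
-- (column names are from a typical metastore schema, may vary by hive version)
use metastore;
select TBL_ID, TBL_NAME, TBL_TYPE from TBLS;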

data storage details

  • by default data is stored in the warehouse directory /user/hive/warehouse
  • creating a new database dbname, the location will be /user/hive/warehouse/dbname.db
  • creating a new table abc in database dbname, the location will be /user/hive/warehouse/dbname.db/abc, which will have multiple files depending on the data
  • the default location can be changed by setting the property hive.metastore.warehouse.dir in hive-site.xml (see the snippet below)
  • tables created in the default database land directly in /user/hive/warehouse
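a minimal sketch of that override in hive-site.xml (the value shown is just an example path):

<property>
  <name>hive.metastore.warehouse.dir</name>
  <!-- example path, use whatever location fits the cluster -->
  <value>/user/hive/custom_warehouse</value>
</property>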

complex data types

array

  • all elements of the array have the same data type
  • no fixed size
create table table1 (
  id int,
  col1 array<string>
);

insert into table1
select 1, array("value1", "value2");

--output: 1 value1
select id, col1[0] from table1;

/*
if the table has more columns than the file we load data from,
the load will still succeed; the extra table columns will get null values
*/

map

  • key, value pairs
  • keys should be distinct
  • a value can be accessed using its key
  • key and value can have different data types
  • unordered, order does not matter
  • each key, value pair is called an entry
create table table1 (
  col1 int,
  col2 array<string>,
  col3 map<string, boolean>
)
row format delimited
fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
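with the delimiters above, a data row might look like the made-up line below, and a map value is read by its key:

--sample row (hypothetical values): 1,a#b,k1:true#k2:false

--access a map value by key
select col3['k1'] from table1;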

struct

  • it's like a class
  • logical group of data
  • can have different data types
  • can hold any number of values
  • values are referenced by name
create table table1 (
  col1 int,
  col2 array<string>,
  col3 struct<s1:string, s2:string>
);

select col1, col3.s1 from table1;
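one way to populate such a table from a query is with the built-in named_struct function (the values here are only placeholders):

insert into table1
select 1, array('a', 'b'), named_struct('s1', 'x', 's2', 'y');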

built in functions

UDF (User Defined Functions)

  • system functions
  • works on a single row and outputs a single row
  • example: concat(), round(), length()

UDAF (User Defined Aggregate Functions)

  • works on multiple rows and outputs a single row
  • example: sum(), avg() (see the query below)
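as a sketch, a single query can combine both kinds; the employee table and its columns are hypothetical:

--round() works per row, count() and avg() aggregate over the group
select dept, count(*), round(avg(salary), 2)
from employee
group by dept;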

UDTF (User Defined Table-generating Functions)

  • works on a single row and outputs multiple rows
  • example: explode(), posexplode()
create table table1 (col1 string, col2 array<string>)
row format delimited
fields terminated by ','
collection items terminated by ':';

load data local inpath '/path/to/file'
into table table1;

/*
--output:
name1 ["a", "b", "c"]
name2 ["d", "e", "f"]
.
.
*/
select * from table1;

/*
--output:
a
b
c
d
e
f
.
.
*/
select explode(col2) from table1;

--with explode we only get the exploded column, not the other columns; to select other columns alongside it we need a lateral view

lateral view

the exploded output forms a virtual table, which is joined with the original table so the other columns can be selected in the same query

select col1, exploded_col from table1
lateral view explode(col2) exploded_data as exploded_col;
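posexplode(), mentioned in the UDTF list above, works the same way but also returns the position of each element (the column aliases here are arbitrary):

select col1, pos, val from table1
lateral view posexplode(col2) exploded_data as pos, val;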

custom functions (UDF)

when the built-in functions cannot meet a requirement, we can write custom functions for hive in java

example: let's say we want a function that converts a string to uppercase

  • write java code that converts the input to uppercase
  • bundle the code in a jar file (a sketch of this step follows the java code below)
  • add the jar and create a temporary function
package example;

import org.apache.hadoop.hive.ql.exec.UDF;

// simple hive UDF: converts the input string to uppercase
public class ExampleClass extends UDF {
  public String evaluate(String input) {
    if (input == null) {
      return null;
    }
    return input.toUpperCase();
  }
}
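to bundle the code in a jar (the second step above), something along these lines should work; the exact hive jar name and version depend on the cluster:

# compile against the hive jar and package the class into example.jar
mkdir -p classes
javac -cp hive-exec-1.2.2.jar -d classes example/ExampleClass.java
jar -cf example.jar -C classes .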
ADD JAR /path/to/jar/example.jar;
create temporary function ucase as 'example.ExampleClass';

create table table1 (
  col1 string
);

insert into table1
select 'val1';

select ucase(col1) from table1;

Note: some external jars, like hive-exec-1.2.2.jar, might be required on the classpath to compile the code.

This temporary function is available only in the current hive session, not globally. To make it permanent, add the jar to hdfs and create a permanent function.

hadoop fs -put example.jar /path/to/hdfs/
CREATE FUNCTION ucase AS
'example.ExampleClass' using JAR
'hdfs://localhost:8020/path/to/hdfs/example.jar';

set operations

union

  • supported in hive
  • union removes duplicates
  • union all preserves duplicates (see the example below)
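for example (in older hive versions the union may need to be wrapped in a subquery, as in the from clause example further below):

--keeps only distinct ids
select id from table1
union
select id from table2;

--keeps every row, including duplicates
select id from table1
union all
select id from table2;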

minus

not supported in hive

intersect

not supported in hive
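a common workaround is to rewrite minus and intersect with joins, which hive does support; a sketch:

--minus: rows in table1 that are not in table2
select t1.id from table1 t1
left join table2 t2 on t1.id = t2.id
where t2.id is null;

--intersect: rows present in both tables
select distinct t1.id from table1 t1
join table2 t2 on t1.id = t2.id;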

sub queries

from clause

select * from (
  select id from table1
  union all
  select id from table2
) t;

where clause

only two types of subqueries are supported in the where clause:

in/not in:

  • the subquery should select a single column
  • the subquery should return a list of values for that column
  • the subquery should not reference the parent query
select id from table1
where table1.id in (
  select id from table2
);

exists/not exists:

  • the subquery should reference the parent query
select id from table1
where exists (
  select id from table2 where table2.id=table1.id
);

views

  • virtual table with subset of data from a larger table
  • stored as query in hive metastore
create view view1
as
select col1, col2
from table1;

views can be used to:

  • restrict access to data
  • create different logical tables from a base table
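a view is then queried like any other table, and dropping it only removes its definition from the metastore:

select * from view1;

drop view view1;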

basic commands

--this will give all the databases in hive
show databases;

--to create new database
create database abc;

--to change database
use abc;

--to create new table
create table new (
  id int,
  name string
);

--to show tables
show tables;

--to show schema of table
describe new;

--to show extended details of table like location, type, etc.
describe formatted new;

--to insert in table, it will run map reduce job
insert into new values
(1, "abc");

--to insert multiple rows
insert into new values
(1, "A"),(2, "B");

to check the files in hdfs where the hive table is stored

hadoop fs -ls /user/hive/warehouse/abc.db/new

to check the content of the files in hdfs where the hive table is stored

hadoop fs -cat /user/hive/warehouse/abc.db/new/*

connecting to hive

hive prompt: launch hive directly from the shell and run queries

hue: hadoop user experience, a web UI to query hive, spark, etc.

beeline: a jdbc client for hiveserver2, gives better formatted output and more secure connections

beeline -u jdbc:hive2://

#to execute query from terminal
beeline -u jdbc:hive2:// -e "select * from new.abc"

#to execute file with hive queries from terminal
beeline -u jdbc:hive2:// -f /path/to/file.hql
--to quit from beeline
!q

--to execute file from beeline
source /path/to/file.hql
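to connect to a remote hiveserver2 instead of an embedded one, the jdbc url takes the host and port (10000 is the usual default; the host and user below are placeholders):

beeline -u jdbc:hive2://localhost:10000 -n username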

types of table

managed

  • hive manages the table and has control of the data (files & directories)
  • if the table is dropped, both data and metadata are deleted
  • by default, a created table is managed
  • table type can be checked with describe formatted table_name;

external

  • hive manages only the metadata
  • if the table is dropped, only the metadata is deleted
  • if the data is also used by other apps like spark, we should use external tables
create external table prod(
  id int,
  name string
)
location '/path/to/data';
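a managed table can also be switched to external (and back) by changing a table property; note that in some hive versions the value is case sensitive:

alter table prod set tblproperties('EXTERNAL'='TRUE');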

Note: from the hive prompt, hdfs commands can be run using dfs

example: dfs -ls /path/to/file

load data in hive

insert command

not recommended as it takes a lot of time; doing inserts may also create temporary tables that remain until the session ends.

load from file

from local to hive table

create table if not exists table1(
  id int,
  name string
)
row format delimited
fields terminated by ','
stored as textfile;

--to load data in the table
load data local inpath
'/local/path/to/file'
into table table1;

here the table is managed, since that is the default type, and the data lands in the default warehouse location. The file is copied from the local filesystem to hdfs.

from hdfs to hive table

hadoop fs -copyFromLocal /local/path/to/file /path/on/hdfs
--to load data in the table
load data inpath
'/path/on/hdfs'
into table table1;

--to overwrite
load data inpath
'/path/on/hdfs'
overwrite into table table1;

here the file is moved (cut-paste) from its hdfs location into the table directory instead of being copied

table to table

create table if not exists table2(
  id int,
  name string
)
row format delimited
fields terminated by ','
stored as textfile;

insert into table table2
select * from table1;
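to replace the existing data instead of appending, or to create and load the target table in one step, something like this can be used (table3 is just an example name):

--overwrite replaces the data already in table2
insert overwrite table table2
select * from table1;

--create table as select (ctas)
create table table3 as
select * from table1;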
