sqoop working

1 minute read

It is used to transfer data between rdbms to hdfs and vice versa.

  • sqoop import: transfer from rdbms to hdfs
  • sqoop export: transfer from hdfs to rdbms
  • sqoop eval: to run queries on databases

Setup cloudera quickstart machine on local

We have multiple ways to setup cloudera machine either as VM or docker image, in this case we will use docker and local machine is ubuntu. So, prerequisite is to:

  • setup docker, follow here
  • run below to pull cloudera quickstart image and then run image which will setup cloudera machine to interact via terminal
docker pull cloudera/quickstart:latest
docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 -p 80:80 -p 8080:8080 cloudera/quickstart /usr/bin/docker-quickstart

Commands

list databases

sqoop-list-databases \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username root \
--password cloudera

list tables

sqoop-list-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username root \
--password cloudera

query database

sqoop-eval \
--connect "jdbc:mysql://quickstart.cloudera:3306" \
--username root \
--password cloudera \
--query "select * from retail_db.customers limit 5"

import one table

It will run mapreduce job where only mappers will work. Default four mappers will run but can be changed. It divide work based on primary key. If no primary key then command will failm in that case we have two options:

  1. change number of mapper to one: use -m 1 parameter
  2. use split by column.

Example: Here orders table has primary key and we testing import on this table.

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--target-dir /result

Run below command after above import command to check files produced by import command, default number of mappers are 4 so thats why it will have four part-m files.

hadoop fs -ls /result

import all tables

sqoop-import-all-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--warehouse-dir /result

Note:

  • target-dir: files will directly be stored in the directory mentioned in command.
  • warehouse-dir: files will be stored in the subdirectory (same as tablename) of the directory mentioned in command.

Note: sqoop is retired as of June 2021

Sources:

  • Cloudera quickstart docker image for testing, source

Comments