Export from Hadoop File System to a SQL Server Database Table

Export from Hadoop File System to a SQL Server Database Table

Problem

I need to export data from the Hadoop Distributed File System (HDFS) to a SQL Server database table. How can I do this?

Solution

Apache’s Sqoop allows for importing data from a database such as SQL Server to the HDFS, and for exporting data from the HDFS to a database table. In this tip we will use Apache Sqoop’s export functionality to move data stored on the HDFS to a SQL Server database table. As with previous tips in this series, this tip is written using SQL Server 2014 and a Cloudera virtual machine running on a laptop.

The contents of the file on the HDFS are shown below. There are three columns in this comma-separated value file. The first column is a unique integer identifier. The second column is a string used for a description. The third column is the unit cost. The destination table will have columns to accommodate these three columns and their data types.

[hdfs@localhost:/mssql]$ hdfs dfs -cat mssql/linuxDataFile.csv
1,Product A,1.01
2,Product B,2.02
3,Product C,3.03
4,Product D,4.04
5,Product E,5.05
6,Product F,6.06
7,Product G,7.07
8,Product H,8.08
9,Product I,9.09
10,Product J,10.10
11,Product K,11.11
12,Product L,12.12
13,Product M,13.13
14,Product N,14.14
15,Product O,15.15
16,Product P,16.16
17,Product Q,17.17
18,Product R,18.18
19,Product S,19.19
20,Product T,20.20
21,Product U,21.21
22,Product V,22.22
23,Product W,23.23
24,Product X,24.24
25,Product Y,25.25
26,Product Z,26.26
[hdfs@localhost:/mssql]$

The T-SQL below was used to create the destination table for this tip.

create table dbo.tblProductData
(
ProductKey int not null PRIMARY KEY,
ProductName varchar(50),
UnitCost money
)

The image below shows the destination table in the AdventureWorks2014 database as displayed in the Object Explorer.

Hadoop, SQL Server Database

The Sqoop command we will examine is listed below and we will break down each argument in the following bullet points. Please note that the command is supposed to be on one complete line or with the backslash (the Linux command line continuation character) at the end of each line except the last.

 

sqoop export –connect ‘jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014’ –username ‘sqoopLogin’ -P –export-dir ‘mssql’ –table ‘tblProductData’ — –schema dbo

  • sqoop export – The executable is named sqoop and we are instructing it to export the data from an HDFS file to a database.
  • –connect – With the –connect argument, we are passing in the jdbc connect string for SQL Server. In this case, we use the IP address, port number, and database name.
  • –username – In this example, the user name is a SQL Server login, not a Windows login. Our database is set up to authenticate in mixed mode. We have a server login named sqoopLogin, and we have a database user name sqoopUser which is a member of the db_datawriter role and has a default schema of dbo.
  • -P – This will prompt the command line user for the password. If Sqoop is rarely executed, this might be a good option. There are multiple other ways to automatically pass the password to the command, but we are trying to keep it simple for this tip.
  • –export-dir – The export directory tells sqoop in which directory the file is stored on the HDFS. This argument is required by Sqoop when using the export option.
  • –table – This argument provides sqoop with the destination table on the SQL Server database. This argument is also required by Sqoop when using the export option.
  • This is not a typographical error. The double dash tells Sqoop that the following arguments are database specific.
  • –schema – This argument specifies the schema in which our table resides.

The output from the sqoop command is shown below.

[hdfs@localhost:/mssql]$ ./sqoopExportCommand.sh
16/09/19 12:03:22 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password:
16/09/19 12:03:25 INFO manager.SqlManager: Using default fetchSize of 1000
16/09/19 12:03:25 INFO manager.SQLServerManager: We will use schema dbo
16/09/19 12:03:25 INFO tool.CodeGenTool: Beginning code generation
16/09/19 12:03:26 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM [dbo].[tblProductData] AS t WHERE 1=0
16/09/19 12:03:26 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/3eaf9ad50f5ebf5ef893fba0e41859b1/tblProductData.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/09/19 12:03:29 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/3eaf9ad50f5ebf5ef893fba0e41859b1/tblProductData.jar
16/09/19 12:03:29 INFO mapreduce.ExportJobBase: Beginning export of tblProductData
16/09/19 12:03:30 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/09/19 12:03:32 INFO input.FileInputFormat: Total input paths to process : 1
16/09/19 12:03:32 INFO input.FileInputFormat: Total input paths to process : 1
16/09/19 12:03:33 INFO mapred.JobClient: Running job: job_201609071028_0005
16/09/19 12:03:34 INFO mapred.JobClient: map 0% reduce 0%
16/09/19 12:03:48 INFO mapred.JobClient: map 25% reduce 0%
16/09/19 12:03:49 INFO mapred.JobClient: map 50% reduce 0%
16/09/19 12:04:00 INFO mapred.JobClient: map 100% reduce 0%
16/09/19 12:04:02 INFO mapred.JobClient: Job complete: job_201609071028_0005
16/09/19 12:04:02 INFO mapred.JobClient: Counters: 24
16/09/19 12:04:02 INFO mapred.JobClient: File System Counters
16/09/19 12:04:02 INFO mapred.JobClient: FILE: Number of bytes read=0
16/09/19 12:04:02 INFO mapred.JobClient: FILE: Number of bytes written=1185776
16/09/19 12:04:02 INFO mapred.JobClient: FILE: Number of read operations=0
16/09/19 12:04:02 INFO mapred.JobClient: FILE: Number of large read operations=0
16/09/19 12:04:02 INFO mapred.JobClient: FILE: Number of write operations=0
16/09/19 12:04:02 INFO mapred.JobClient: HDFS: Number of bytes read=1778
16/09/19 12:04:02 INFO mapred.JobClient: HDFS: Number of bytes written=0
16/09/19 12:04:02 INFO mapred.JobClient: HDFS: Number of read operations=16
16/09/19 12:04:02 INFO mapred.JobClient: HDFS: Number of large read operations=0
16/09/19 12:04:02 INFO mapred.JobClient: HDFS: Number of write operations=0
16/09/19 12:04:02 INFO mapred.JobClient: Job Counters
16/09/19 12:04:02 INFO mapred.JobClient: Launched map tasks=4
16/09/19 12:04:02 INFO mapred.JobClient: Data-local map tasks=4
16/09/19 12:04:02 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=47822
16/09/19 12:04:02 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
16/09/19 12:04:02 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
16/09/19 12:04:02 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
16/09/19 12:04:02 INFO mapred.JobClient: Map-Reduce Framework
16/09/19 12:04:02 INFO mapred.JobClient: Map input records=26
16/09/19 12:04:02 INFO mapred.JobClient: Map output records=26
16/09/19 12:04:02 INFO mapred.JobClient: Input split bytes=576
16/09/19 12:04:02 INFO mapred.JobClient: Spilled Records=0
16/09/19 12:04:02 INFO mapred.JobClient: CPU time spent (ms)=4580
16/09/19 12:04:02 INFO mapred.JobClient: Physical memory (bytes) snapshot=452489216
16/09/19 12:04:02 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2948849664
16/09/19 12:04:02 INFO mapred.JobClient: Total committed heap usage (bytes)=127401984
16/09/19 12:04:02 INFO mapreduce.ExportJobBase: Transferred 1.7363 KB in 31.5282 seconds (56.394 bytes/sec)
16/09/19 12:04:02 INFO mapreduce.ExportJobBase: Exported 26 records.
[hdfs@localhost:/mssql]$

Notice the last line of output above shows that 26 records were exported. This corresponds to the 26 records in the file on the HDFS. When we query the count of rows in our database table, we see that 26 rows are present as displayed in the image below. Also in the image below, we see the first 17 records returned by the query.

Hadoop, SQL Server Database

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s