
Several Ways to Import and Export Hive Data


I. Several Ways to Import Data into Hive

First, an overview of the import methods covered below, together with the data files and Hive tables used in the examples. Import:

1. Load a local file into a Hive table;
2. Load one Hive table into another Hive table;
3. Load an HDFS file into a Hive table;
4. Load from another table while creating a new table;
5. Import a MySQL table into a Hive table via sqoop (examples in 《》 and 《》; a rough sqoop sketch also follows the export list below).

Export:

1. Export a Hive table to the local filesystem;
2. Export a Hive table to HDFS;

3. Export a Hive table to a MySQL database via sqoop.
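The sqoop steps themselves are not demonstrated in this post. As a rough sketch only — the JDBC URL, database name, credentials, and warehouse path below are placeholders, not values from the original — the two directions look roughly like this with sqoop 1.x:

# Hypothetical sketch: import a MySQL table into Hive
$ sqoop import \
    --connect jdbc:mysql://mysqlhost:3306/testdb \
    --username hadoop -P \
    --table testA \
    --hive-import --hive-table testA

# Hypothetical sketch: export a Hive table's warehouse files back to MySQL
$ sqoop export \
    --connect jdbc:mysql://mysqlhost:3306/testdb \
    --username hadoop -P \
    --table testA \
    --export-dir /home/hadoop/hivedata/warehouse/testa \
    --input-fields-terminated-by ','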

The Hive tables and data files used in the examples are set up below. Create testA:

CREATE TABLE testA (
    id INT,
    name string,
    area string
)
PARTITIONED BY (create_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Create testB:

CREATE TABLE testB (
    id INT,
    name string,
    area string,
    code string
)
PARTITIONED BY (create_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Data file (sourceA.txt):

1,fish1,SZ
2,fish2,SH
3,fish3,HZ
4,fish4,QD
5,fish5,SR

Data file (sourceB.txt):

1,zy1,SZ,1001
2,zy2,SH,1002
3,zy3,HZ,1003
4,zy4,QD,1004
5,zy5,SR,1005

(1) Load a local file into a Hive table

hive> LOAD DATA LOCAL INPATH '/home/hadoop/sourceA.txt' INTO TABLE testA PARTITION(create_time='2015-07-08');
Copying data from file:/home/hadoop/sourceA.txt
Copying file: file:/home/hadoop/sourceA.txt
Loading data to table default.testa partition (create_time=2015-07-08)
Partition default.testa{create_time=2015-07-08} stats: [numFiles=1, numRows=0, totalSize=58, rawDataSize=0]
OK
Time taken: 0.237 seconds

hive> LOAD DATA LOCAL INPATH '/home/hadoop/sourceB.txt' INTO TABLE testB PARTITION(create_time='2015-07-09');
Copying data from file:/home/hadoop/sourceB.txt
Copying file: file:/home/hadoop/sourceB.txt
Loading data to table default.testb partition (create_time=2015-07-09)
Partition default.testb{create_time=2015-07-09} stats: [numFiles=1, numRows=0, totalSize=73, rawDataSize=0]
OK
Time taken: 0.212 seconds
hive> select * from testA;
OK
1	fish1	SZ	2015-07-08
2	fish2	SH	2015-07-08
3	fish3	HZ	2015-07-08
4	fish4	QD	2015-07-08
5	fish5	SR	2015-07-08
Time taken: 0.029 seconds, Fetched: 5 row(s)
hive> select * from testB;
OK
1	zy1	SZ	1001	2015-07-09
2	zy2	SH	1002	2015-07-09
3	zy3	HZ	1003	2015-07-09
4	zy4	QD	1004	2015-07-09
5	zy5	SR	1005	2015-07-09
Time taken: 0.047 seconds, Fetched: 5 row(s)

(2) Load one Hive table into another Hive table

Load the data from testB into testA:

hive> INSERT INTO TABLE testA PARTITION(create_time='2015-07-11') select id, name, area from testB where id = 1;
...(output omitted)
OK
Time taken: 14.744 seconds
hive> INSERT INTO TABLE testA PARTITION(create_time) select id, name, area, code from testB where id = 2;
...(output omitted)
OK
Time taken: 19.852 seconds
hive> select * from testA;
OK
2	zy2	SH	1002
1	fish1	SZ	2015-07-08
2	fish2	SH	2015-07-08
3	fish3	HZ	2015-07-08
4	fish4	QD	2015-07-08
5	fish5	SR	2015-07-08
1	zy1	SZ	2015-07-11
Time taken: 0.032 seconds, Fetched: 7 row(s)

Notes:

1. The row in testB with id = 1 is loaded into testA under the static partition create_time='2015-07-11'.

2. The row in testB with id = 2 is loaded into testA via dynamic partitioning: the partition column create_time takes the value of that row's code column (1002), supplied as the extra trailing column in the select list. A sketch of the settings this typically requires follows.
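The post does not show it, but a dynamic-partition insert like the one in note 2 normally requires dynamic partitioning to be enabled first; with the default hive.exec.dynamic.partition.mode=strict, an INSERT with no static partition column is rejected. A minimal sketch, assuming default settings:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT INTO TABLE testA PARTITION(create_time)
    > select id, name, area, code from testB where id = 2;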

(3) Load an HDFS file into a Hive table

Upload sourceA.txt and sourceB.txt to HDFS, at /home/hadoop/sourceA.txt and /home/hadoop/sourceB.txt respectively; a sketch of the upload step follows.
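The upload itself is not shown in the original; a minimal sketch with the standard hadoop fs -put command, using the paths given above:

$ hadoop fs -put /home/hadoop/sourceA.txt /home/hadoop/sourceA.txt
$ hadoop fs -put /home/hadoop/sourceB.txt /home/hadoop/sourceB.txt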

hive> LOAD DATA INPATH '/home/hadoop/sourceA.txt' INTO TABLE testA PARTITION(create_time='2015-07-08');
...(output omitted)
OK
Time taken: 0.237 seconds
hive> LOAD DATA INPATH '/home/hadoop/sourceB.txt' INTO TABLE testB PARTITION(create_time='2015-07-09');
...(output omitted)
OK
Time taken: 0.212 seconds
hive> select * from testA;
OK
1	fish1	SZ	2015-07-08
2	fish2	SH	2015-07-08
3	fish3	HZ	2015-07-08
4	fish4	QD	2015-07-08
5	fish5	SR	2015-07-08
Time taken: 0.029 seconds, Fetched: 5 row(s)
hive> select * from testB;
OK
1	zy1	SZ	1001	2015-07-09
2	zy2	SH	1002	2015-07-09
3	zy3	HZ	1003	2015-07-09
4	zy4	QD	1004	2015-07-09
5	zy5	SR	1005	2015-07-09
Time taken: 0.047 seconds, Fetched: 5 row(s)

This loads /home/hadoop/sourceA.txt into testA and /home/hadoop/sourceB.txt into testB. Note that LOAD DATA INPATH without LOCAL moves the source files on HDFS into the table's warehouse directory rather than copying them.

(4) Load from another table while creating a new table

hive> create table testC as select name, code from testB;
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1449746265797_0106, Tracking URL = http://hadoopcluster79:8088/proxy/application_1449746265797_0106/
Kill Command = /home/hadoop/apache/hadoop-2.4.1/bin/hadoop job -kill job_1449746265797_0106
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-12-24 16:40:17,981 Stage-1 map = 0%, reduce = 0%
2015-12-24 16:40:23,115 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.11 sec
MapReduce Total cumulative CPU time: 1 seconds 110 msec
Ended Job = job_1449746265797_0106
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop2cluster/tmp/hive-root/hive_2015-12-24_16-40-09_983_6048680148773453194-1/-ext-10001
Moving data to: hdfs://hadoop2cluster/home/hadoop/hivedata/warehouse/testc
Table default.testc stats: [numFiles=1, numRows=0, totalSize=45, rawDataSize=0]
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.11 sec   HDFS Read: 297 HDFS Write: 45 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 110 msec
OK
Time taken: 14.292 seconds
hive> desc testC;
OK
name	string
code	string
Time taken: 0.032 seconds, Fetched: 2 row(s)

As desc shows, CREATE TABLE ... AS SELECT derives the new table's schema from the select list: testC has only name and code, and the partition column is not carried over.

II. Several Ways to Export Data from Hive

(1) Export to the local filesystem

hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/output' ROW FORMAT DELIMITED FIELDS TERMINATED by ',' select * from testA;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1451024007879_0001, Tracking URL = http://hadoopcluster79:8088/proxy/application_1451024007879_0001/
Kill Command = /home/hadoop/apache/hadoop-2.4.1/bin/hadoop job -kill job_1451024007879_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-12-25 17:04:30,447 Stage-1 map = 0%, reduce = 0%
2015-12-25 17:04:35,616 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
MapReduce Total cumulative CPU time: 1 seconds 160 msec
Ended Job = job_1451024007879_0001
Copying data to local directory /home/hadoop/output
Copying data to local directory /home/hadoop/output
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.16 sec   HDFS Read: 305 HDFS Write: 110 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 160 msec
OK
Time taken: 16.701 seconds

Check the result:

[hadoop@hadoopcluster78 output]$ cat /home/hadoop/output/000000_0
1,fish1,SZ,2015-07-08
2,fish2,SH,2015-07-08
3,fish3,HZ,2015-07-08
4,fish4,QD,2015-07-08
5,fish5,SR,2015-07-08

This uses INSERT OVERWRITE LOCAL DIRECTORY to export the testA data to the local directory /home/hadoop/output. The HQL statement runs as a MapReduce job, and /home/hadoop/output is in effect the job's output directory, so the result is written to a file named 000000_0 (Hive produces one such file per final task).

(2) Export to HDFS

Exporting to HDFS is similar to exporting to the local filesystem: just drop the LOCAL keyword from the HQL statement.

hive> INSERT OVERWRITE DIRECTORY '/home/hadoop/output' select * from testA;
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1451024007879_0002, Tracking URL = http://hadoopcluster79:8088/proxy/application_1451024007879_0002/
Kill Command = /home/hadoop/apache/hadoop-2.4.1/bin/hadoop job -kill job_1451024007879_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-12-25 17:08:51,034 Stage-1 map = 0%, reduce = 0%
2015-12-25 17:08:59,313 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.4 sec
MapReduce Total cumulative CPU time: 1 seconds 400 msec
Ended Job = job_1451024007879_0002
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to: hdfs://hadoop2cluster/home/hadoop/hivedata/hive-hadoop/hive_2015-12-25_17-08-43_733_1768532778392261937-1/-ext-10000
Moving data to: /home/hadoop/output
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.4 sec   HDFS Read: 305 HDFS Write: 110 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 400 msec
OK
Time taken: 16.667 seconds

Check the HDFS output file:

[hadoop@hadoopcluster78 bin]$ ./hadoop fs -cat /home/hadoop/output/000000_0
1fish1SZ2015-07-08
2fish2SH2015-07-08
3fish3HZ2015-07-08
4fish4QD2015-07-08
5fish5SR2015-07-08
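The fields above appear to run together because the statement did not specify a ROW FORMAT, so Hive used its default field delimiter \001 (Ctrl-A), which the terminal does not display. From Hive 0.11 on (this post runs hive-0.13.1), INSERT OVERWRITE DIRECTORY also accepts a ROW FORMAT clause; a minimal sketch:

hive> INSERT OVERWRITE DIRECTORY '/home/hadoop/output'
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > select * from testA;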

Other methods

You can also export data with hive's -e and -f command-line options.

With -e, the SQL statement follows the option directly; the path after >> is the output file:

[hadoop@hadoopcluster78 bin]$ ./hive -e "select * from testA" >> /home/hadoop/output/testA.txt

15/12/25 17:15:07 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead

Logging initialized using configuration in file:/home/hadoop/apache/hive-0.13.1/conf/hive-log4j.properties
OK

Time taken: 1.128 seconds, Fetched: 5 row(s)

[hadoop@hadoopcluster78 bin]$ cat /home/hadoop/output/testA.txt
1	fish1	SZ	2015-07-08
2	fish2	SH	2015-07-08
3	fish3	HZ	2015-07-08
4	fish4	QD	2015-07-08
5	fish5	SR	2015-07-08

With -f, a file containing the SQL statement follows the option; the path after >> is the output file. The SQL file:

[hadoop@hadoopcluster78 bin]$ cat /home/hadoop/output/sql.sql
select * from testA

Run it with -f:

[hadoop@hadoopcluster78 bin]$ ./hive -f /home/hadoop/output/sql.sql >> /home/hadoop/output/testB.txt

15/12/25 17:20:52 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead

Logging initialized using configuration in file:/home/hadoop/apache/hive-0.13.1/conf/hive-log4j.properties
OK

Time taken: 1.1 seconds, Fetched: 5 row(s)

Check the result:

[hadoop@hadoopcluster78 bin]$ cat /home/hadoop/output/testB.txt
1	fish1	SZ	2015-07-08
2	fish2	SH	2015-07-08
3	fish3	HZ	2015-07-08
4	fish4	QD	2015-07-08
5	fish5	SR	2015-07-08
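An aside not in the original post: hive's -S (silent) flag suppresses informational output such as the logging and "Time taken" messages, which keeps scripted exports cleaner. A sketch in the same style:

[hadoop@hadoopcluster78 bin]$ ./hive -S -e "select * from testA" > /home/hadoop/output/testA.txt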
