Hive分区表新增字段+重刷历史方法(避免旧分区新增字段为NULL)

  • Post author:
  • Post category:其他


1-1.建立测试表

hive> CREATE  EXTERNAL TABLE table_for_test_add_column(
    >    original_column string COMMENT '原始数据'
    >  )
    >  COMMENT 'add_column的测试表'
    >  PARTITIONED BY (
    >  `dt` string COMMENT '日'
    >  )
    >  row format delimited fields terminated by '\t'
    >  STORED AS textfile;
OK
Time taken: 0.734 seconds

1-2.插入数据

hive> insert into table hive_test.table_for_test_add_column partition(dt='20180114')  select '测试数据_0114';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = xxx_20180114135008_cd63d301-89d5-461d-b0c0-be292b308e24
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
....
Loading data to table hive_test.table_for_test_add_column partition (dt=20180114)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   HDFS Read: 4304 HDFS Write: 121 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 31.641 seconds

1-3.查询现有数据

hive> select * from hive_test.table_for_test_add_column;
OK
测试数据_0114   20180114
Time taken: 0.125 seconds, Fetched: 1 row(s)

1-4.添加列(alter table add column)


Add/Replace Columns

ALTER TABLE table_name 
  [PARTITION partition_spec]                 -- (Note: Hive 0.14.0 and later)
  ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
  [CASCADE|RESTRICT]                         -- (Note: Hive 1.1.0 and later)

ADD COLUMNS lets you add new columns to the end of the existing columns but before the partition columns. This is supported for Avro backed tables as well, for Hive 0.14 and later.

添加一列added_column 【

备注

DDL语句最后添加CASCADE

否则新增的列在旧分区中不可见,查询数据时为NULL,重新刷数据时仍为NULL

相似问题见

hive分区表增加字段新增字段值为空的bug

hive> alter table hive_test.table_for_test_add_column add columns (added_column string COMMENT '新添加的列') CASCADE;
OK
Time taken: 0.236 seconds

查看建表语句,已新增一列

hive> show create table  hive_test.table_for_test_add_column ;
OK
CREATE EXTERNAL TABLE `hive_test.table_for_test_add_column`(
  `original_column` string COMMENT '原始数据',
  `added_column` string COMMENT '新添加的列')
COMMENT 'add_column的测试表'
PARTITIONED BY (
  `dt` string COMMENT '日')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://localhost:9000/user/hive/warehouse/hive_test.db/table_for_test_add_column'
TBLPROPERTIES (
  'last_modified_by'='xxx',
  'last_modified_time'='1515909925',
  'transient_lastDdlTime'='1515909925')
Time taken: 0.125 seconds, Fetched: 21 row(s)

查看数据

hive> select * from hive_test.table_for_test_add_column;
OK
测试数据_0114   NULL    20180114
Time taken: 0.121 seconds, Fetched: 1 row(s)

1-5.新增数据

hive>  insert into table hive_test.table_for_test_add_column partition(dt='20180115')  select '测试数据_0115', '新增列数据_0115';

校验数据

hive> select * from hive_test.table_for_test_add_column where dt >= 20180114;
OK
测试数据_0114   NULL    20180114
测试数据_0115   新增列数据_0115  20180115
Time taken: 0.453 seconds, Fetched: 2 row(s)

1-6.重刷旧分区数据

hive>  insert overwrite table hive_test.table_for_test_add_column partition(dt='20180114')  select original_column, '新增列数据_0114' from hive_test.table_for_ted_column where dt = '20180114';
OK
测试数据_0114   新增列数据_0114  20180114
测试数据_0115   新增列数据_0115  20180115
Time taken: 0.073 seconds, Fetched: 2 row(s)

1-7.Hive 修改字段


参考hive wiki

备注

修改字段类型有诸多限制,要格外注意,可能导致写入成功但查询失败

Change Column Name/Type/Position/Comment

ALTER TABLE table_name [PARTITION partition_spec] CHANGE [COLUMN] col_old_name col_new_name column_type

[COMMENT col_comment] [FIRST|AFTER column_name] [CASCADE|RESTRICT];

This command will allow users to change a column’s name, data type, comment, or position, or an arbitrary combination of them. The PARTITION clause is available in Hive 0.14.0 and later; see Upgrading Pre-Hive 0.13.0 Decimal Columns for usage. A patch for Hive 0.13 is also available (see HIVE-7971).

The CASCADE|RESTRICT clause is available in Hive 1.1.0. ALTER TABLE CHANGE COLUMN with CASCADE command changes the columns of a table’s metadata, and cascades the same change to all the partition metadata. RESTRICT is the default, limiting column change only to table metadata.

ALTER TABLE CHANGE COLUMN CASCADE clause will override the table partition’s column metadata regardless of the table or partition’s protection mode. Use with discretion.

The column change command will only modify Hive’s metadata, and will not modify data. Users should make sure the actual data layout of the table/partition conforms with the metadata definition.

Example:

CREATE TABLE test_change (a int, b int, c int);

// First change column a's name to a1.
ALTER TABLE test_change CHANGE a a1 INT;

// Next change column a1's name to a2, its data type to string, and put it after column b.
ALTER TABLE test_change CHANGE a1 a2 STRING AFTER b;
// The new table's structure is:  b int, a2 string, c int.

// Then change column c's name to c1, and put it as the first column.
ALTER TABLE test_change CHANGE c c1 INT FIRST;
// The new table's structure is:  c1 int, b int, a2 string.

// Add a comment to column a1
ALTER TABLE test_change CHANGE a1 a1 INT COMMENT 'this is column a1';

参考



版权声明:本文为hjw199089原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明。