Importing log data into the Hive ODS layer

Post author:xfxia
Post published: July 23, 2023
Post category: Other

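I'm just getting started and haven't studied this in much depth, so there are no elaborate rules here: import the data directly, then work on it afterwards with SQL statements.

The result should be a fairly wide table.

First, copy the data onto the virtual machine (it contains the customer's device, the timestamp, the customer id, the channel used to go online, and so on).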

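To start hive, first run start-all.sh (the Hadoop startup script). Once that has finished, start the metastore service with hive --service metastore, then start hiveserver2, and finally connect through the remote port.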

Create a table. In the data used here, each row has just one field (a JSON string).

create table tb_log(
log String
)partitioned by(dt String);
load data local inpath "/home/event.log" into table tb_log partition(dt='202001007');
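-- Partition by the date derived from the data's timestamp (every timestamp in this data falls on the same day, so a token static partition is enough)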

select * from tb_log limit 10 ;
-- Take a look at the data; it should be more or less right
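To confirm that the load landed in the intended partition, a quick check along these lines may help. This is only a minimal sketch; it uses nothing beyond the tb_log table and the dt partition column defined above.

```sql
-- List the partitions that exist on tb_log after the load.
show partitions tb_log;

-- Count the rows in each partition; after a single static load
-- only one dt value is expected to appear.
select dt, count(*) as cnt
from tb_log
group by dt;
```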

Reading it in that form is a bit uncomfortable. Given the data format, parse it with json_tuple and give the resulting columns aliases.

select
json_tuple(log,'account' ,'appId' ,'appVersion','carrier','deviceId','deviceType','eventId','ip','latitude','longitude','netType','osName','osVersion','properties','releaseChannel','resolution','sessionId' ,'timeStamp') 
as (account ,appId ,appVersion,carrier,deviceId,deviceType,eventId,ip,latitude,longitude,netType,osName,osVersion,properties,releaseChannel,resolution,sessionId ,`timeStamp`)
from
tb_log limit 10; 
-- The output is wide and long, so I won't paste it here
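The same parsing can also be written with lateral view, which makes it easy to keep the partition column next to the parsed fields, and a single field can be pulled out with get_json_object. The sketch below is just one possible way to write it, not necessarily how the original author would; it extracts only a few of the keys listed above to stay short.

```sql
-- json_tuple through lateral view: the parsed columns sit beside dt.
select
  dt,
  t.account,
  t.deviceId,
  t.eventId
from tb_log
lateral view json_tuple(log, 'account', 'deviceId', 'eventId') t
  as account, deviceId, eventId
limit 10;

-- get_json_object extracts one value per call using a JSON path.
select get_json_object(log, '$.eventId') as eventId
from tb_log
limit 10;
```

json_tuple parses the string once for all requested keys, so it is usually the better fit when many fields are needed, as in this post.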

Next, filter out the rows where account and deviceId are both empty, to clean the data a little.

create table tb_ods_log as
select 
if(account='' , deviceId,account) as guid, -- if account is empty, fall back to deviceId
* 
from
(select
json_tuple(log,'account' ,'appId' ,'appVersion','carrier','deviceId','deviceType','eventId','ip','latitude','longitude','netType','osName','osVersion','properties','releaseChannel','resolution','sessionId' ,'timeStamp') 
as (account ,appId ,appVersion,carrier,deviceId,deviceType,eventId,ip,latitude,longitude,netType,osName,osVersion,properties,releaseChannel,resolution,sessionId ,`timeStamp`)
from
tb_log) t
where account != '' or deviceId !=''; -- check the condition: a row with both fields empty is of no real use, so be decisive and drop it
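One thing worth checking: json_tuple returns NULL for a key that is missing from the JSON, and account='' evaluates to NULL in that case, so the if above would return the NULL account instead of falling back to deviceId. Whether this data actually has missing keys is not known from the post; the sketch below is a NULL-safe variant of the guid expression under that assumption, written as a plain select so it does not touch tb_ods_log.

```sql
-- Sketch only: treat both NULL and '' as "empty" when choosing guid.
select
  if(account is null or account = '', deviceId, account) as guid,
  account,
  deviceId
from tb_log
lateral view json_tuple(log, 'account', 'deviceId') t
  as account, deviceId
where (account is not null and account != '')
   or (deviceId is not null and deviceId != '')
limit 10;
```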

Let's count the records and take a look

select guid , count(1) from tb_ods_log group by guid;
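As a rough sanity check on the cleaning step, the row counts before and after can be compared, and the busiest guid values listed. This is a minimal sketch that uses only the two tables created in this post; the alias cnt is introduced here for illustration.

```sql
-- Rows in the raw table versus rows that survived the filter.
select count(*) as raw_rows from tb_log;
select count(*) as ods_rows from tb_ods_log;

-- The ten guid values with the most records.
select guid, count(1) as cnt
from tb_ods_log
group by guid
order by cnt desc
limit 10;
```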

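This only gets the data loaded. Later on, when creating tables for your own needs, take care not to carry along too much useless information.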

Original link: https://blog.csdn.net/fqcbb/article/details/110843367
