数据仓库搭建之DIM层搭建
在开发数据仓库的DIM层时,我们需要注意以下几点:
1)DIM层的设计依据是维度建模理论,该层存储维度模型的维度表。
2)在我们该项目当中,DIM层的数据存储格式为orc列式存储+snappy压缩。
3)DIM层表名的命名规范为dim_表名_全量表或者拉链表标识(full/zip)。
1.维度确定
我们根据之前构建的业务总线矩阵,来确定我们当前需要构建的维度表。
我们可以看到,我们所有的业务过程所涉及到的维度有时间、用户、商品、地区、活动、优惠券、支付方式、退单类型、退单原因类型、渠道以及设备。
虽然有这么多的维度,但是我们并不会将这些维度都构建成维度表。但是我们考虑到维度退化,一些维度中字段比较少,较为简单,因此我们将该维度中的字段退化到与之对应的事实表当中。
因此,我们最终选择的维度有时间、用户、商品、地区、活动和优惠券共六个维度。
2.维度表设计
2.1商品维度表
2.1.1确定维度
这里的维度已经确定,是商品维度。
2.1.2确定主维表和相关维表
此处的主维表和相关维表均指电商业务系统中与某维度相关的表。
由于我们表中的字段有部分冗余(为了提高查询的速度),因此我们最终的主维表和相关维表如下图所示(图中有颜色的是最终的主维表和相关维表):
2.1.3确定维度属性
DROP TABLE IF EXISTS dim_sku_full;
CREATE EXTERNAL TABLE dim_sku_full
(
`id` STRING COMMENT 'sku_id',
`price` DECIMAL(16, 2) COMMENT '商品价格',
`sku_name` STRING COMMENT '商品名称',
`sku_desc` STRING COMMENT '商品描述',
`weight` DECIMAL(16, 2) COMMENT '重量',
`is_sale` BOOLEAN COMMENT '是否在售',
`spu_id` STRING COMMENT 'spu编号',
`spu_name` STRING COMMENT 'spu名称',
`category3_id` STRING COMMENT '三级分类id',
`category3_name` STRING COMMENT '三级分类名称',
`category2_id` STRING COMMENT '二级分类id',
`category2_name` STRING COMMENT '二级分类名称',
`category1_id` STRING COMMENT '一级分类id',
`category1_name` STRING COMMENT '一级分类名称',
`tm_id` STRING COMMENT '品牌id',
`tm_name` STRING COMMENT '品牌名称',
`sku_attr_values` ARRAY<STRUCT<attr_id :STRING,value_id :STRING,attr_name :STRING,value_name:STRING>> COMMENT '平台属性',
`sku_sale_attr_values` ARRAY<STRUCT<sale_attr_id :STRING,sale_attr_value_id :STRING,sale_attr_name :STRING,sale_attr_value_name:STRING>> COMMENT '销售属性',
`create_time` STRING COMMENT '创建时间'
) COMMENT '商品维度表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dim/dim_sku_full/'
TBLPROPERTIES ('orc.compress' = 'snappy');
2.1.4数据的装载逻辑(以2022-05-01为例)
我们需要从每一个表当中取出需要的数据,之后再通过join连接起来。
(1)sku_info表
select
id,
price,
sku_name,
sku_desc,
weight,
is_sale,
spu_id,
category3_id,
tm_id,
create_time
from ods_sku_info_full
where dt='2022-05-01'
(2)spu_info表
select
id,
spu_name
from ods_spu_info_full
where dt='2022-05-01'
(3)base_caregory3表
select
id,
name,
category2_id
from ods_base_category3_full
where dt='2022-05-01'
(3)base_caregory2表
select
id,
name,
category1_id
from ods_base_category2_full
where dt='2022-05-01'
(4)base_caregory1表
select
id,
name
from ods_base_category1_full
where dt='2022-05-01'
(5)base_trademark表
select
id,
tm_name
from ods_base_trademark_full
where dt='2022-05-01'
(6)sku_attr_value表
select
sku_id, collect_set(named_struct('attr_id',attr_id,'value_id',value_id,'attr_name',attr_name,'value_name',value_name)) attrs
from ods_sku_attr_value_full
where dt='2022-05-01'
group by sku_id
(7)sku_sale_attr_value表
select
sku_id, collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sale_attrs
from ods_sku_sale_attr_value_full
where dt='2022-05-01'
group by sku_id
最终,我们将上述从表中取到的数据进行join,然后装载到该商品维度表当中:
with
sku as
(
select
id,
price,
sku_name,
sku_desc,
weight,
is_sale,
spu_id,
category3_id,
tm_id,
create_time
from ods_sku_info_full
where dt='2022-05-01'
),
spu as
(
select
id,
spu_name
from ods_spu_info_full
where dt='2022-05-01'
),
c3 as
(
select
id,
name,
category2_id
from ods_base_category3_full
where dt='2022-05-01'
),
c2 as
(
select
id,
name,
category1_id
from ods_base_category2_full
where dt='2022-05-01'
),
c1 as
(
select
id,
name
from ods_base_category1_full
where dt='2022-05-01'
),
tm as
(
select
id,
tm_name
from ods_base_trademark_full
where dt='2022-05-01'
),
attr as
(
select
sku_id,
collect_set(named_struct('attr_id',attr_id,'value_id',value_id,'attr_name',attr_name,'value_name',value_name)) attrs
from ods_sku_attr_value_full
where dt='2022-05-01'
group by sku_id
),
sale_attr as
(
select
sku_id,
collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sale_attrs
from ods_sku_sale_attr_value_full
where dt='2022-05-01'
group by sku_id
)
insert overwrite table dim_sku_full partition(dt='2022-05-01')
select
sku.id,
sku.price,
sku.sku_name,
sku.sku_desc,
sku.weight,
sku.is_sale,
sku.spu_id,
spu.spu_name,
sku.category3_id,
c3.name,
c3.category2_id,
c2.name,
c2.category1_id,
c1.name,
sku.tm_id,
tm.tm_name,
attr.attrs,
sale_attr.sale_attrs,
sku.create_time
from sku
left join spu on sku.spu_id=spu.id
left join c3 on sku.category3_id=c3.id
left join c2 on c3.category2_id=c2.id
left join c1 on c2.category1_id=c1.id
left join tm on sku.tm_id=tm.id
left join attr on sku.id=attr.sku_id
left join sale_attr on sku.id=sale_attr.sku_id;
2.2用户维度表
2.2.1确定维度
这里的维度已经确定,是用户维度。
2.2.2确定主维表和相关维表
此处的主维表和相关维表均指电商业务系统中与某维度相关的表。
我们最终只选择user_info一张表作为用户维度的主维表,因为我们对用户的地址表不经常使用,因此此处不添加用户地址表进行join。
2.2.3确定维度属性
DROP TABLE IF EXISTS dim_user_zip;
CREATE EXTERNAL TABLE dim_user_zip
(
`id` STRING COMMENT '用户id',
`login_name` STRING COMMENT '用户名称',
`nick_name` STRING COMMENT '用户昵称',
`name` STRING COMMENT '用户姓名',
`phone_num` STRING COMMENT '手机号码',
`email` STRING COMMENT '邮箱',
`user_level` STRING COMMENT '用户等级',
`birthday` STRING COMMENT '生日',
`gender` STRING COMMENT '性别',
`create_time` STRING COMMENT '创建时间',
`operate_time` STRING COMMENT '操作时间',
`start_date` STRING COMMENT '开始日期',
`end_date` STRING COMMENT '结束日期'
) COMMENT '用户表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dim/dim_user_zip/'
TBLPROPERTIES ('orc.compress' = 'snappy');
2.2.4数据的分区规划
拉链表的意义就在于能够更加高效的保存维度信息的历史状态。拉链表适合于数据会发生变化,但是变化频率并不高的维度(缓慢变化维)。因此,我们的用户维度表设计为拉链表,因为每天变化的比例并不高。
我们的数据分区规划如下所示:
我们将全量最新的用户数据存储到dt=9999-12-31分区当中。
2.2.4数据装载
2.2.4.1拉链表数据装载过程
2.2.4.3拉链表数据流向
2.2.4.4用户维表首日装载(以2022-05-01作为首日)
用户的首日装载较为简单,即装载全量的数据:
insert overwrite table dim_user_zip partition (dt='9999-12-31')
select
data.id,
data.login_name,
data.nick_name,
md5(data.name),
md5(data.phone_num),
md5(data.email),
data.user_level,
data.birthday,
data.gender,
data.create_time,
data.operate_time,
'2022-05-01' start_date,
'9999-12-31' end_date
from ods_user_info_inc
where dt='2022-05-01'
and type='bootstrap-insert';
2.2.4.5用户维表每日装载
(1)用户维度表每日装载思路
(2)具体装载语句
with
tmp as
(
select
old.id old_id,
old.login_name old_login_name,
old.nick_name old_nick_name,
old.name old_name,
old.phone_num old_phone_num,
old.email old_email,
old.user_level old_user_level,
old.birthday old_birthday,
old.gender old_gender,
old.create_time old_create_time,
old.operate_time old_operate_time,
old.start_date old_start_date,
old.end_date old_end_date,
new.id new_id,
new.login_name new_login_name,
new.nick_name new_nick_name,
new.name new_name,
new.phone_num new_phone_num,
new.email new_email,
new.user_level new_user_level,
new.birthday new_birthday,
new.gender new_gender,
new.create_time new_create_time,
new.operate_time new_operate_time,
new.start_date new_start_date,
new.end_date new_end_date
from
(
select
id,
login_name,
nick_name,
name,
phone_num,
email,
user_level,
birthday,
gender,
create_time,
operate_time,
start_date,
end_date
from dim_user_zip
where dt='9999-12-31'
)old
full outer join
(
select
id,
login_name,
nick_name,
md5(name) name,
md5(phone_num) phone_num,
md5(email) email,
user_level,
birthday,
gender,
create_time,
operate_time,
'2020-05-02' start_date,
'9999-12-31' end_date
from
(
select
data.id,
data.login_name,
data.nick_name,
data.name,
data.phone_num,
data.email,
data.user_level,
data.birthday,
data.gender,
data.create_time,
data.operate_time,
row_number() over (partition by data.id order by ts desc) rn
from ods_user_info_inc
where dt='2022-05-02'
)t1
where rn=1
)new
on old.id=new.id
)
insert overwrite table dim_user_zip partition(dt)
select
if(new_id is not null,new_id,old_id),
if(new_id is not null,new_login_name,old_login_name),
if(new_id is not null,new_nick_name,old_nick_name),
if(new_id is not null,new_name,old_name),
if(new_id is not null,new_phone_num,old_phone_num),
if(new_id is not null,new_email,old_email),
if(new_id is not null,new_user_level,old_user_level),
if(new_id is not null,new_birthday,old_birthday),
if(new_id is not null,new_gender,old_gender),
if(new_id is not null,new_create_time,old_create_time),
if(new_id is not null,new_operate_time,old_operate_time),
if(new_id is not null,new_start_date,old_start_date),
if(new_id is not null,new_end_date,old_end_date),
if(new_id is not null,new_end_date,old_end_date) dt
from tmp
union all
select
old_id,
old_login_name,
old_nick_name,
old_name,
old_phone_num,
old_email,
old_user_level,
old_birthday,
old_gender,
old_create_time,
old_operate_time,
old_start_date,
cast(date_add('2022-05-02',-1) as string) old_end_date,
cast(date_add('2022-05-02',-1) as string) dt
from tmp
where old_id is not null
and new_id is not null;
2.3地区维度表
2.3.1确定维度
这里的维度已经确定,是地区维度。
2.3.2确定主维表和相关维表
我们选择的主维表是省份表,相关维表是地区表。
2.3.3确定维度属性
DROP TABLE IF EXISTS dim_province_full;
CREATE EXTERNAL TABLE dim_province_full
(
`id` STRING COMMENT 'id',
`province_name` STRING COMMENT '省市名称',
`area_code` STRING COMMENT '地区编码',
`iso_code` STRING COMMENT '旧版ISO-3166-2编码,供可视化使用',
`iso_3166_2` STRING COMMENT '新版IOS-3166-2编码,供可视化使用',
`region_id` STRING COMMENT '地区id',
`region_name` STRING COMMENT '地区名称'
) COMMENT '地区维度表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dim/dim_province_full/'
TBLPROPERTIES ('orc.compress' = 'snappy');
2.3.4数据的装载逻辑(以2022-05-01为例)
我们需要从每一个表当中取出需要的数据,之后再通过join连接起来。
(1)base_province表
select
id,
name,
region_id,
area_code,
iso_code,
iso_3166_2
from ods_base_province_full
where dt='2022-05-01'
(2)base_region表
select
id,
region_name
from ods_base_region_full
where dt='2022-05-01'
最终,我们将上述从表中取到的数据进行join,然后装载到该地区维度表当中:
insert overwrite table dim_province_full partition(dt='2022-05-01')
select
province.id,
province.name,
province.area_code,
province.iso_code,
province.iso_3166_2,
region_id,
region_name
from
(
select
id,
name,
region_id,
area_code,
iso_code,
iso_3166_2
from ods_base_province_full
where dt='2022-05-01'
)province
left join
(
select
id,
region_name
from ods_base_region_full
where dt='2022-05-01'
)region
on province.region_id=region.id;
2.4优惠券维度表
2.4.1确定维度
这里的维度是优惠券维度。
2.4.2确定主维表和相关维表
我们选择的主维表是优惠券表,相关维表是字典表(需要获取购物券类型和优惠券范围)。
2.4.3确定维度属性
DROP TABLE IF EXISTS dim_coupon_full;
CREATE EXTERNAL TABLE dim_coupon_full
(
`id` STRING COMMENT '购物券编号',
`coupon_name` STRING COMMENT '购物券名称',
`coupon_type_code` STRING COMMENT '购物券类型编码',
`coupon_type_name` STRING COMMENT '购物券类型名称',
`condition_amount` DECIMAL(16, 2) COMMENT '满额数',
`condition_num` BIGINT COMMENT '满件数',
`activity_id` STRING COMMENT '活动编号',
`benefit_amount` DECIMAL(16, 2) COMMENT '减金额',
`benefit_discount` DECIMAL(16, 2) COMMENT '折扣',
`benefit_rule` STRING COMMENT '优惠规则:满元*减*元,满*件打*折',
`create_time` STRING COMMENT '创建时间',
`range_type_code` STRING COMMENT '优惠范围类型编码',
`range_type_name` STRING COMMENT '优惠范围类型名称',
`limit_num` BIGINT COMMENT '最多领取次数',
`taken_count` BIGINT COMMENT '已领取次数',
`start_time` STRING COMMENT '可以领取的开始日期',
`end_time` STRING COMMENT '可以领取的结束日期',
`operate_time` STRING COMMENT '修改时间',
`expire_time` STRING COMMENT '过期时间'
) COMMENT '优惠券维度表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dim/dim_coupon_full/'
TBLPROPERTIES ('orc.compress' = 'snappy');
2.4.4数据的装载逻辑(以2022-05-01为例)
我们需要从每一个表当中取出需要的数据,之后再通过join连接起来。
(1)coupon_info表
select
id,
coupon_name,
coupon_type,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
create_time,
range_type,
limit_num,
taken_count,
start_time,
end_time,
operate_time,
expire_time
from ods_coupon_info_full
where dt='2022-05-01'
(2)base_dic表(购物券类型)
select
dic_code,
dic_name
from ods_base_dic_full
where dt='2022-05-01'
and parent_code='32'
(3)base_region表(优惠券范围)
select
dic_code,
dic_name
from ods_base_dic_full
where dt='2022-05-01'
and parent_code='33'
最终,我们将上述从表中取到的数据进行join,然后装载到该优惠券维度表当中:
insert overwrite table dim_coupon_full partition(dt='2022-05-01')
select
id,
coupon_name,
coupon_type,
coupon_dic.dic_name,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
case coupon_type
when '3201' then concat('满',condition_amount,'元减',benefit_amount,'元')
when '3202' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折')
when '3203' then concat('减',benefit_amount,'元')
end benefit_rule,
create_time,
range_type,
range_dic.dic_name,
limit_num,
taken_count,
start_time,
end_time,
operate_time,
expire_time
from
(
select
id,
coupon_name,
coupon_type,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
create_time,
range_type,
limit_num,
taken_count,
start_time,
end_time,
operate_time,
expire_time
from ods_coupon_info_full
where dt='2022-05-01'
)ci
left join
(
select
dic_code,
dic_name
from ods_base_dic_full
where dt='2022-05-01'
and parent_code='32'
)coupon_dic
on ci.coupon_type=coupon_dic.dic_code
left join
(
select
dic_code,
dic_name
from ods_base_dic_full
where dt='2022-05-01'
and parent_code='33'
)range_dic
on ci.range_type=range_dic.dic_code;
2.5活动维度表
2.5.1确定维度
这里的维度是活动维度。
2.5.2确定主维表和相关维表
我们选择的主维表是活动规则表,相关维表是活动信息表,字典表(活动类型)。
2.5.3确定维度属性
DROP TABLE IF EXISTS dim_activity_full;
CREATE EXTERNAL TABLE dim_activity_full
(
`activity_rule_id` STRING COMMENT '活动规则ID',
`activity_id` STRING COMMENT '活动ID',
`activity_name` STRING COMMENT '活动名称',
`activity_type_code` STRING COMMENT '活动类型编码',
`activity_type_name` STRING COMMENT '活动类型名称',
`activity_desc` STRING COMMENT '活动描述',
`start_time` STRING COMMENT '开始时间',
`end_time` STRING COMMENT '结束时间',
`create_time` STRING COMMENT '创建时间',
`condition_amount` DECIMAL(16, 2) COMMENT '满减金额',
`condition_num` BIGINT COMMENT '满减件数',
`benefit_amount` DECIMAL(16, 2) COMMENT '优惠金额',
`benefit_discount` DECIMAL(16, 2) COMMENT '优惠折扣',
`benefit_rule` STRING COMMENT '优惠规则',
`benefit_level` STRING COMMENT '优惠级别'
) COMMENT '活动信息表'
PARTITIONED BY (`dt` STRING)
STORED AS ORC
LOCATION '/warehouse/gmall/dim/dim_activity_full/'
TBLPROPERTIES ('orc.compress' = 'snappy');
2.5.4数据的装载逻辑(以2022-05-01为例)
我们需要从每一个表当中取出需要的数据,之后再通过join连接起来。
(1)activity_rule表
select
id,
activity_id,
activity_type,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
benefit_level
from ods_activity_rule_full
where dt='2022-05-01'
(2)activity_info表
select
id,
activity_name,
activity_type,
activity_desc,
start_time,
end_time,
create_time
from ods_activity_info_full
where dt='2022-05-01'
(3)base_dic表(活动类型)
select
dic_code,
dic_name
from ods_base_dic_full
where dt='2022-05-01'
and parent_code='31'
最终,我们将上述从表中取到的数据进行join,然后装载到该优惠券维度表当中:
insert overwrite table dim_activity_full partition(dt='2022-05-01')
select
rule.id,
info.id,
activity_name,
rule.activity_type,
dic.dic_name,
activity_desc,
start_time,
end_time,
create_time,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
case rule.activity_type
when '3101' then concat('满',condition_amount,'元减',benefit_amount,'元')
when '3102' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折')
when '3103' then concat('打',10*(1-benefit_discount),'折')
end benefit_rule,
benefit_level
from
(
select
id,
activity_id,
activity_type,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
benefit_level
from ods_activity_rule_full
where dt='2022-05-01'
)rule
left join
(
select
id,
activity_name,
activity_type,
activity_desc,
start_time,
end_time,
create_time
from ods_activity_info_full
where dt='2022-05-01'
)info
on rule.activity_id=info.id
left join
(
select
dic_code,
dic_name
from ods_base_dic_full
where dt='2022-05-01'
and parent_code='31'
)dic
on rule.activity_type=dic.dic_code;
2.6日期维度表
2.6.1确定维度
这里的维度是日期维度。
2.6.2确定主维表和相关维表
该表比较特殊,因为日期是固定的,可以算出来的,因此我们的主维表就是日期表,通常会提前导入一到三年的时间数据。
2.5.3确定维度属性
DROP TABLE IF EXISTS dim_date;
CREATE EXTERNAL TABLE dim_date
(
`date_id` STRING COMMENT '日期ID',
`week_id` STRING COMMENT '周ID,一年中的第几周',
`week_day` STRING COMMENT '周几',
`day` STRING COMMENT '每月的第几天',
`month` STRING COMMENT '一年中的第几月',
`quarter` STRING COMMENT '一年中的第几季度',
`year` STRING COMMENT '年份',
`is_workday` STRING COMMENT '是否是工作日',
`holiday_id` STRING COMMENT '节假日'
) COMMENT '时间维度表'
STORED AS ORC
LOCATION '/warehouse/gmall/dim/dim_date/'
TBLPROPERTIES ('orc.compress' = 'snappy');
2.5.4数据的装载逻辑
通常情况下,时间维度表的数据并不是来自于业务系统,而是手动写入,并且由于时间维度表数据的可预见性,无须每日导入,一般可一次性导入一年的数据。
(1)创建临时表
DROP TABLE IF EXISTS tmp_dim_date_info;
CREATE EXTERNAL TABLE tmp_dim_date_info (
`date_id` STRING COMMENT '日',
`week_id` STRING COMMENT '周ID',
`week_day` STRING COMMENT '周几',
`day` STRING COMMENT '每月的第几天',
`month` STRING COMMENT '第几月',
`quarter` STRING COMMENT '第几季度',
`year` STRING COMMENT '年',
`is_workday` STRING COMMENT '是否是工作日',
`holiday_id` STRING COMMENT '节假日'
) COMMENT '时间维度表'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/warehouse/gmall/tmp/tmp_dim_date_info/';
(2)将数据文件上传到HFDS上临时表路径/warehouse/gmall/tmp/tmp_dim_date_info
(3)执行以下语句将其导入时间维度表
insert overwrite table dim_date select * from tmp_dim_date_info;
(4)检查数据是否导入成功
select * from dim_date;
3.维度层数据装载脚本的编写
由于用户维表是拉链表,该表的首日装载和每日装载数据有区别,因此我们需要编写首日装载脚本和每日装载脚本。
3.1首日装载脚本
(1)在hadoop102的/home/root/bin目录下创建ods_to_dim_init.sh
[root@hadoop102 bin]$ vim ods_to_dim_init.sh
(2)具体的内容如下图所示:
#!/bin/bash
APP=gmall
if [ -n "$2" ] ;then
do_date=$2
else
echo "请传入日期参数"
exit
fi
dim_user_zip="
insert overwrite table ${APP}.dim_user_zip partition (dt='9999-12-31')
select
data.id,
data.login_name,
data.nick_name,
md5(data.name),
md5(data.phone_num),
md5(data.email),
data.user_level,
data.birthday,
data.gender,
data.create_time,
data.operate_time,
'$do_date' start_date,
'9999-12-31' end_date
from ${APP}.ods_user_info_inc
where dt='$do_date'
and type='bootstrap-insert';
"
dim_sku_full="
with
sku as
(
select
id,
price,
sku_name,
sku_desc,
weight,
is_sale,
spu_id,
category3_id,
tm_id,
create_time
from ${APP}.ods_sku_info_full
where dt='$do_date'
),
spu as
(
select
id,
spu_name
from ${APP}.ods_spu_info_full
where dt='$do_date'
),
c3 as
(
select
id,
name,
category2_id
from ${APP}.ods_base_category3_full
where dt='$do_date'
),
c2 as
(
select
id,
name,
category1_id
from ${APP}.ods_base_category2_full
where dt='$do_date'
),
c1 as
(
select
id,
name
from ${APP}.ods_base_category1_full
where dt='$do_date'
),
tm as
(
select
id,
tm_name
from ${APP}.ods_base_trademark_full
where dt='$do_date'
),
attr as
(
select
sku_id,
collect_set(named_struct('attr_id',attr_id,'value_id',value_id,'attr_name',attr_name,'value_name',value_name)) attrs
from ${APP}.ods_sku_attr_value_full
where dt='$do_date'
group by sku_id
),
sale_attr as
(
select
sku_id,
collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sale_attrs
from ${APP}.ods_sku_sale_attr_value_full
where dt='$do_date'
group by sku_id
)
insert overwrite table ${APP}.dim_sku_full partition(dt='$do_date')
select
sku.id,
sku.price,
sku.sku_name,
sku.sku_desc,
sku.weight,
sku.is_sale,
sku.spu_id,
spu.spu_name,
sku.category3_id,
c3.name,
c3.category2_id,
c2.name,
c2.category1_id,
c1.name,
sku.tm_id,
tm.tm_name,
attr.attrs,
sale_attr.sale_attrs,
sku.create_time
from sku
left join spu on sku.spu_id=spu.id
left join c3 on sku.category3_id=c3.id
left join c2 on c3.category2_id=c2.id
left join c1 on c2.category1_id=c1.id
left join tm on sku.tm_id=tm.id
left join attr on sku.id=attr.sku_id
left join sale_attr on sku.id=sale_attr.sku_id;
"
dim_province_full="
insert overwrite table ${APP}.dim_province_full partition(dt='$do_date')
select
province.id,
province.name,
province.area_code,
province.iso_code,
province.iso_3166_2,
region_id,
region_name
from
(
select
id,
name,
region_id,
area_code,
iso_code,
iso_3166_2
from ${APP}.ods_base_province_full
where dt='$do_date'
)province
left join
(
select
id,
region_name
from ${APP}.ods_base_region_full
where dt='$do_date'
)region
on province.region_id=region.id;
"
dim_coupon_full="
insert overwrite table ${APP}.dim_coupon_full partition(dt='$do_date')
select
id,
coupon_name,
coupon_type,
coupon_dic.dic_name,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
case coupon_type
when '3201' then concat('满',condition_amount,'元减',benefit_amount,'元')
when '3202' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折')
when '3203' then concat('减',benefit_amount,'元')
end benefit_rule,
create_time,
range_type,
range_dic.dic_name,
limit_num,
taken_count,
start_time,
end_time,
operate_time,
expire_time
from
(
select
id,
coupon_name,
coupon_type,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
create_time,
range_type,
limit_num,
taken_count,
start_time,
end_time,
operate_time,
expire_time
from ${APP}.ods_coupon_info_full
where dt='$do_date'
)ci
left join
(
select
dic_code,
dic_name
from ${APP}.ods_base_dic_full
where dt='$do_date'
and parent_code='32'
)coupon_dic
on ci.coupon_type=coupon_dic.dic_code
left join
(
select
dic_code,
dic_name
from ${APP}.ods_base_dic_full
where dt='$do_date'
and parent_code='33'
)range_dic
on ci.range_type=range_dic.dic_code;
"
dim_activity_full="
insert overwrite table ${APP}.dim_activity_full partition(dt='$do_date')
select
rule.id,
info.id,
activity_name,
rule.activity_type,
dic.dic_name,
activity_desc,
start_time,
end_time,
create_time,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
case rule.activity_type
when '3101' then concat('满',condition_amount,'元减',benefit_amount,'元')
when '3102' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折')
when '3103' then concat('打',10*(1-benefit_discount),'折')
end benefit_rule,
benefit_level
from
(
select
id,
activity_id,
activity_type,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
benefit_level
from ${APP}.ods_activity_rule_full
where dt='$do_date'
)rule
left join
(
select
id,
activity_name,
activity_type,
activity_desc,
start_time,
end_time,
create_time
from ${APP}.ods_activity_info_full
where dt='$do_date'
)info
on rule.activity_id=info.id
left join
(
select
dic_code,
dic_name
from ${APP}.ods_base_dic_full
where dt='$do_date'
and parent_code='31'
)dic
on rule.activity_type=dic.dic_code;
"
case $1 in
"dim_user_zip")
hive -e "$dim_user_zip"
;;
"dim_sku_full")
hive -e "$dim_sku_full"
;;
"dim_province_full")
hive -e "$dim_province_full"
;;
"dim_coupon_full")
hive -e "$dim_coupon_full"
;;
"dim_activity_full")
hive -e "$dim_activity_full"
;;
"all")
hive -e "$dim_user_zip$dim_sku_full$dim_province_full$dim_coupon_full$dim_activity_full"
;;
esac
(3)我们为该脚本增加执行权限
[root@hadoop102 bin]$ chmod +x ods_to_dim_init.sh
(4)我们需要在数仓的首日执行该脚本(我们数仓的首日定在2022-05-01)
[root@hadoop102 bin]$ ods_to_dim_init.sh all 2022-05-01
3.2每日装载脚本
(1)在hadoop102的/home/root/bin目录下创建ods_to_dim.sh
[root@hadoop102 bin]$ vim ods_to_dim.sh
(2)脚本中的内容如下所示:
#!/bin/bash
APP=gmall
# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$2" ] ;then
do_date=$2
else
do_date=`date -d "-1 day" +%F`
fi
dim_user_zip="
set hive.exec.dynamic.partition.mode=nonstrict;
with
tmp as
(
select
old.id old_id,
old.login_name old_login_name,
old.nick_name old_nick_name,
old.name old_name,
old.phone_num old_phone_num,
old.email old_email,
old.user_level old_user_level,
old.birthday old_birthday,
old.gender old_gender,
old.create_time old_create_time,
old.operate_time old_operate_time,
old.start_date old_start_date,
old.end_date old_end_date,
new.id new_id,
new.login_name new_login_name,
new.nick_name new_nick_name,
new.name new_name,
new.phone_num new_phone_num,
new.email new_email,
new.user_level new_user_level,
new.birthday new_birthday,
new.gender new_gender,
new.create_time new_create_time,
new.operate_time new_operate_time,
new.start_date new_start_date,
new.end_date new_end_date
from
(
select
id,
login_name,
nick_name,
name,
phone_num,
email,
user_level,
birthday,
gender,
create_time,
operate_time,
start_date,
end_date
from ${APP}.dim_user_zip
where dt='9999-12-31'
)old
full outer join
(
select
id,
login_name,
nick_name,
md5(name) name,
md5(phone_num) phone_num,
md5(email) email,
user_level,
birthday,
gender,
create_time,
operate_time,
'$do_date' start_date,
'9999-12-31' end_date
from
(
select
data.id,
data.login_name,
data.nick_name,
data.name,
data.phone_num,
data.email,
data.user_level,
data.birthday,
data.gender,
data.create_time,
data.operate_time,
row_number() over (partition by data.id order by ts desc) rn
from ${APP}.ods_user_info_inc
where dt='$do_date'
)t1
where rn=1
)new
on old.id=new.id
)
insert overwrite table ${APP}.dim_user_zip partition(dt)
select
if(new_id is not null,new_id,old_id),
if(new_id is not null,new_login_name,old_login_name),
if(new_id is not null,new_nick_name,old_nick_name),
if(new_id is not null,new_name,old_name),
if(new_id is not null,new_phone_num,old_phone_num),
if(new_id is not null,new_email,old_email),
if(new_id is not null,new_user_level,old_user_level),
if(new_id is not null,new_birthday,old_birthday),
if(new_id is not null,new_gender,old_gender),
if(new_id is not null,new_create_time,old_create_time),
if(new_id is not null,new_operate_time,old_operate_time),
if(new_id is not null,new_start_date,old_start_date),
if(new_id is not null,new_end_date,old_end_date),
if(new_id is not null,new_end_date,old_end_date) dt
from tmp
union all
select
old_id,
old_login_name,
old_nick_name,
old_name,
old_phone_num,
old_email,
old_user_level,
old_birthday,
old_gender,
old_create_time,
old_operate_time,
old_start_date,
cast(date_add('$do_date',-1) as string) old_end_date,
cast(date_add('$do_date',-1) as string) dt
from tmp
where old_id is not null
and new_id is not null;
"
dim_sku_full="
with
sku as
(
select
id,
price,
sku_name,
sku_desc,
weight,
is_sale,
spu_id,
category3_id,
tm_id,
create_time
from ${APP}.ods_sku_info_full
where dt='$do_date'
),
spu as
(
select
id,
spu_name
from ${APP}.ods_spu_info_full
where dt='$do_date'
),
c3 as
(
select
id,
name,
category2_id
from ${APP}.ods_base_category3_full
where dt='$do_date'
),
c2 as
(
select
id,
name,
category1_id
from ${APP}.ods_base_category2_full
where dt='$do_date'
),
c1 as
(
select
id,
name
from ${APP}.ods_base_category1_full
where dt='$do_date'
),
tm as
(
select
id,
tm_name
from ${APP}.ods_base_trademark_full
where dt='$do_date'
),
attr as
(
select
sku_id,
collect_set(named_struct('attr_id',attr_id,'value_id',value_id,'attr_name',attr_name,'value_name',value_name)) attrs
from ${APP}.ods_sku_attr_value_full
where dt='$do_date'
group by sku_id
),
sale_attr as
(
select
sku_id,
collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sale_attrs
from ${APP}.ods_sku_sale_attr_value_full
where dt='$do_date'
group by sku_id
)
insert overwrite table ${APP}.dim_sku_full partition(dt='$do_date')
select
sku.id,
sku.price,
sku.sku_name,
sku.sku_desc,
sku.weight,
sku.is_sale,
sku.spu_id,
spu.spu_name,
sku.category3_id,
c3.name,
c3.category2_id,
c2.name,
c2.category1_id,
c1.name,
sku.tm_id,
tm.tm_name,
attr.attrs,
sale_attr.sale_attrs,
sku.create_time
from sku
left join spu on sku.spu_id=spu.id
left join c3 on sku.category3_id=c3.id
left join c2 on c3.category2_id=c2.id
left join c1 on c2.category1_id=c1.id
left join tm on sku.tm_id=tm.id
left join attr on sku.id=attr.sku_id
left join sale_attr on sku.id=sale_attr.sku_id;
"
dim_province_full="
insert overwrite table ${APP}.dim_province_full partition(dt='$do_date')
select
province.id,
province.name,
province.area_code,
province.iso_code,
province.iso_3166_2,
region_id,
region_name
from
(
select
id,
name,
region_id,
area_code,
iso_code,
iso_3166_2
from ${APP}.ods_base_province_full
where dt='$do_date'
)province
left join
(
select
id,
region_name
from ${APP}.ods_base_region_full
where dt='$do_date'
)region
on province.region_id=region.id;
"
dim_coupon_full="
insert overwrite table ${APP}.dim_coupon_full partition(dt='$do_date')
select
id,
coupon_name,
coupon_type,
coupon_dic.dic_name,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
case coupon_type
when '3201' then concat('满',condition_amount,'元减',benefit_amount,'元')
when '3202' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折')
when '3203' then concat('减',benefit_amount,'元')
end benefit_rule,
create_time,
range_type,
range_dic.dic_name,
limit_num,
taken_count,
start_time,
end_time,
operate_time,
expire_time
from
(
select
id,
coupon_name,
coupon_type,
condition_amount,
condition_num,
activity_id,
benefit_amount,
benefit_discount,
create_time,
range_type,
limit_num,
taken_count,
start_time,
end_time,
operate_time,
expire_time
from ${APP}.ods_coupon_info_full
where dt='$do_date'
)ci
left join
(
select
dic_code,
dic_name
from ${APP}.ods_base_dic_full
where dt='$do_date'
and parent_code='32'
)coupon_dic
on ci.coupon_type=coupon_dic.dic_code
left join
(
select
dic_code,
dic_name
from ${APP}.ods_base_dic_full
where dt='$do_date'
and parent_code='33'
)range_dic
on ci.range_type=range_dic.dic_code;
"
dim_activity_full="
insert overwrite table ${APP}.dim_activity_full partition(dt='$do_date')
select
rule.id,
info.id,
activity_name,
rule.activity_type,
dic.dic_name,
activity_desc,
start_time,
end_time,
create_time,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
case rule.activity_type
when '3101' then concat('满',condition_amount,'元减',benefit_amount,'元')
when '3102' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折')
when '3103' then concat('打',10*(1-benefit_discount),'折')
end benefit_rule,
benefit_level
from
(
select
id,
activity_id,
activity_type,
condition_amount,
condition_num,
benefit_amount,
benefit_discount,
benefit_level
from ${APP}.ods_activity_rule_full
where dt='$do_date'
)rule
left join
(
select
id,
activity_name,
activity_type,
activity_desc,
start_time,
end_time,
create_time
from ${APP}.ods_activity_info_full
where dt='$do_date'
)info
on rule.activity_id=info.id
left join
(
select
dic_code,
dic_name
from ${APP}.ods_base_dic_full
where dt='$do_date'
and parent_code='31'
)dic
on rule.activity_type=dic.dic_code;
"
case $1 in
"dim_user_zip")
hive -e "$dim_user_zip"
;;
"dim_sku_full")
hive -e "$dim_sku_full"
;;
"dim_province_full")
hive -e "$dim_province_full"
;;
"dim_coupon_full")
hive -e "$dim_coupon_full"
;;
"dim_activity_full")
hive -e "$dim_activity_full"
;;
"all")
hive -e "$dim_user_zip$dim_sku_full$dim_province_full$dim_coupon_full$dim_activity_full"
;;
esac
(3)增加该脚本执行权限
[root@hadoop102 bin]$ chmod +x ods_to_dim.sh
(4)我们每日调用该脚本即可(以2022-05-02为例)
[root@hadoop102 bin]$ ods_to_dim.sh all 2022-05-02