开窗——聚合函数与分析函数
应用场景
(1)排序
(2)动态分组group by
(3)top n
(4)累计计算
(5)层次查询
窗口函数
first_value(col):取分组内排序后,截止到当前行,第一个值
last_value(col): 取分组内排序后,截止到当前行,最后一个值
lead(col,n,default) :用于统计窗口内往下第n行值。第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为null时候,取默认值,如不指定,则为null)
lag(col,n,default) :用于统计窗口内往上第n行值。第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为null时候,取默认值,如不指定,则为null)
over从句
1、over前使用标准的聚合函数count、sum、min、max、avg,也可以使用其他分析函数row_number()、rank()、dense_rank()、cume_dist、percent_rank、ntile(n)
2、over后括号内使用partition by语句进行分组,可以是一列或多列
3、over后括号内使用order by语句进行排序,可以是一列或多列
4、综合使用partition by与order by语句先后组再排序,可以是一列或多列,数据为分组内排序后截止当前行之前计算内容
窗口规范
开窗——聚合函数
count、sum、min、max、avg、collect_set、collect_list
with tmp as
(
select 'd001' as order_id,123 as user_id,20 as amt,'2020-01-01 00:23:33' as pay_time
union all select 'd002' as order_id,123 as user_id,25 as amt,'2020-01-02 00:23:33' as pay_time
union all select 'd005' as order_id,123 as user_id,250 as amt,'2020-01-04 00:23:33' as pay_time
union all select 'd007' as order_id,123 as user_id,100 as amt,'2020-01-06 00:23:33' as pay_time
union all select 'd003' as order_id,125 as user_id,15 as amt,'2020-01-03 00:23:33' as pay_time
union all select 'd004' as order_id,125 as user_id,39 as amt,'2020-01-03 00:23:33' as pay_time
union all select 'd006' as order_id,125 as user_id,25 as amt,'2020-01-05 00:23:33' as pay_time
union all select 'd008' as order_id,125 as user_id,359 as amt,'2020-01-07 00:23:33' as pay_time
)
select user_id,order_id,amt,pay_time
,count(*)over(partition by user_id) as `当前用户订单数`
,count(distinct user_id)over(partition by 1) as `当前用户数`
,sum(amt)over(partition by user_id) as `当前用户总金额`
,sum(amt)over(order by 1) as `总金额`
,sum(amt)over(partition by user_id order by pay_time) as `当前用户截止当前支付时间累计总金额`
,max(amt)over(partition by user_id) as `当前用户最大金额`
,max(amt)over(order by pay_time) as `截止当前成单时间最大成交金额`
,min(amt)over(partition by user_id) as `当前用户最小金额`
,avg(amt)over(partition by user_id) as `当前用户订单均额`
from tmp
user_id |
order_id |
amt |
pay_time |
当前用户订单数 |
当前用户数 |
当前用户总金额 |
总金额 |
当前用户截止当前支付时间累计总金额 |
当前用户最大金额 |
截止当前成单时间最大成交金额 |
当前用户最小金额 |
当前用户订单均额 |
123 |
d001 |
20 |
2020-01-01 00:23:33 |
4 |
2 |
395 |
833 |
20 |
250 |
20 |
20 |
98.75 |
123 |
d002 |
25 |
2020-01-02 00:23:33 |
4 |
2 |
395 |
833 |
45 |
250 |
25 |
20 |
98.75 |
125 |
d004 |
39 |
2020-01-03 00:23:33 |
4 |
2 |
438 |
833 |
54 |
359 |
39 |
15 |
109.5 |
125 |
d003 |
15 |
2020-01-03 00:23:33 |
4 |
2 |
438 |
833 |
54 |
359 |
39 |
15 |
109.5 |
123 |
d005 |
250 |
2020-01-04 00:23:33 |
4 |
2 |
395 |
833 |
295 |
250 |
250 |
20 |
98.75 |
125 |
d006 |
25 |
2020-01-05 00:23:33 |
4 |
2 |
438 |
833 |
79 |
359 |
250 |
15 |
109.5 |
123 |
d007 |
100 |
2020-01-06 00:23:33 |
4 |
2 |
395 |
833 |
395 |
250 |
250 |
20 |
98.75 |
125 |
d008 |
359 |
2020-01-07 00:23:33 |
4 |
2 |
438 |
833 |
438 |
359 |
359 |
15 |
109.5 |
开窗——分析函数
row_number
() 从1开始,顺序生成序列号,如按成绩排名,也可以用于获取某学科成绩第一名,或获取某学科成绩第一名的学员姓名或其他信息等。
row_number() over(partition by col1 order by col2) as rowid
结果:1,2,3,4
rank
() 并列排序,即下一个数字会跳过
rank() over(partition by col1 order by col2) as rowid
结果:1,2,2,4,5
dense_rank
() 并列排序,下一个数据不会跳过
dense_rank() over(partition by col1 order by col2) as rowid
结果:1,2,2,3,4
ntile
(n) 用于将分组数据按照顺序切分成n片,返回当前切片值,如果切片不均匀,默认增加第一个切片的分布。(
https://blog.csdn.net/weixin_36190755/article/details/109056534
)
ntile不支持rows between,比如 ntile(2) over(partition by col1 order by col2 rows between 3 preceding and current row)。
first_value
(col) 当前分组内排序后col第一个值
last_value
(col) 当前分组内排序后col最后一个值
lag
(col1,n,default) over(partition by col1 order by col2) as up 向前取第n行
lead
(col1,n,default) over(partition by col1 order by col2) as down 向后取第n行
cume_dist
() 小于等于当前值的行数/分组内总行数。比如,统计小于等于当前薪水的人数,所占总人数的比例
percent_rank
() 分组内当前行的rank值-1/分组内总行数-1