开窗——聚合函数与分析函数

  • Post author:
  • Post category:其他


开窗——聚合函数与分析函数

应用场景

(1)排序

(2)动态分组group by

(3)top n

(4)累计计算

(5)层次查询

窗口函数

first_value(col):取分组内排序后,截止到当前行,第一个值

last_value(col): 取分组内排序后,截止到当前行,最后一个值

lead(col,n,default) :用于统计窗口内往下第n行值。第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为null时候,取默认值,如不指定,则为null)

lag(col,n,default) :用于统计窗口内往上第n行值。第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为null时候,取默认值,如不指定,则为null)

over从句

1、over前使用标准的聚合函数count、sum、min、max、avg,也可以使用其他分析函数row_number()、rank()、dense_rank()、cume_dist、percent_rank、ntile(n)

2、over后括号内使用partition by语句进行分组,可以是一列或多列

3、over后括号内使用order by语句进行排序,可以是一列或多列

4、综合使用partition by与order by语句先后组再排序,可以是一列或多列,数据为分组内排序后截止当前行之前计算内容

窗口规范

开窗——聚合函数

count、sum、min、max、avg、collect_set、collect_list


with tmp as
(
select 'd001' as order_id,123 as user_id,20 as amt,'2020-01-01 00:23:33' as pay_time
union all select 'd002' as order_id,123 as user_id,25 as amt,'2020-01-02 00:23:33' as pay_time
union all select 'd005' as order_id,123 as user_id,250 as amt,'2020-01-04 00:23:33' as pay_time
union all select 'd007' as order_id,123 as user_id,100 as amt,'2020-01-06 00:23:33' as pay_time
union all select 'd003' as order_id,125 as user_id,15 as amt,'2020-01-03 00:23:33' as pay_time
union all select 'd004' as order_id,125 as user_id,39 as amt,'2020-01-03 00:23:33' as pay_time
union all select 'd006' as order_id,125 as user_id,25 as amt,'2020-01-05 00:23:33' as pay_time
union all select 'd008' as order_id,125 as user_id,359 as amt,'2020-01-07 00:23:33' as pay_time
)
select user_id,order_id,amt,pay_time
,count(*)over(partition by user_id) as `当前用户订单数`
,count(distinct user_id)over(partition by 1) as `当前用户数`
,sum(amt)over(partition by user_id) as `当前用户总金额`
,sum(amt)over(order by 1) as `总金额`
,sum(amt)over(partition by user_id order by pay_time) as `当前用户截止当前支付时间累计总金额`
,max(amt)over(partition by user_id) as `当前用户最大金额`
,max(amt)over(order by pay_time) as `截止当前成单时间最大成交金额`
,min(amt)over(partition by user_id) as `当前用户最小金额`
,avg(amt)over(partition by user_id) as `当前用户订单均额`
from tmp

user_id

order_id

amt

pay_time

当前用户订单数

当前用户数

当前用户总金额

总金额

当前用户截止当前支付时间累计总金额

当前用户最大金额

截止当前成单时间最大成交金额

当前用户最小金额

当前用户订单均额

123

d001

20

2020-01-01 00:23:33

4

2

395

833

20

250

20

20

98.75

123

d002

25

2020-01-02 00:23:33

4

2

395

833

45

250

25

20

98.75

125

d004

39

2020-01-03 00:23:33

4

2

438

833

54

359

39

15

109.5

125

d003

15

2020-01-03 00:23:33

4

2

438

833

54

359

39

15

109.5

123

d005

250

2020-01-04 00:23:33

4

2

395

833

295

250

250

20

98.75

125

d006

25

2020-01-05 00:23:33

4

2

438

833

79

359

250

15

109.5

123

d007

100

2020-01-06 00:23:33

4

2

395

833

395

250

250

20

98.75

125

d008

359

2020-01-07 00:23:33

4

2

438

833

438

359

359

15

109.5

开窗——分析函数


row_number

() 从1开始,顺序生成序列号,如按成绩排名,也可以用于获取某学科成绩第一名,或获取某学科成绩第一名的学员姓名或其他信息等。

row_number() over(partition  by col1 order by col2) as rowid

结果:1,2,3,4


rank

() 并列排序,即下一个数字会跳过

rank() over(partition by col1 order by col2) as rowid

结果:1,2,2,4,5


dense_rank

() 并列排序,下一个数据不会跳过

dense_rank() over(partition by col1 order by col2) as rowid

结果:1,2,2,3,4


ntile

(n) 用于将分组数据按照顺序切分成n片,返回当前切片值,如果切片不均匀,默认增加第一个切片的分布。(

https://blog.csdn.net/weixin_36190755/article/details/109056534

)

ntile不支持rows between,比如 ntile(2) over(partition by col1 order by col2 rows between 3 preceding and current row)。


first_value

(col) 当前分组内排序后col第一个值


last_value

(col) 当前分组内排序后col最后一个值


lag

(col1,n,default) over(partition by col1 order by col2) as up 向前取第n行


lead

(col1,n,default) over(partition by col1 order by col2) as down 向后取第n行


cume_dist

() 小于等于当前值的行数/分组内总行数。比如,统计小于等于当前薪水的人数,所占总人数的比例


percent_rank

() 分组内当前行的rank值-1/分组内总行数-1



版权声明:本文为weixin_36190755原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明。