开窗——聚合函数与分析函数

应用场景

（1）排序

（2）动态分组group by

（3）top n

（4）累计计算

（5）层次查询

窗口函数

first_value(col)：取分组内排序后，截止到当前行，第一个值

last_value(col)：取分组内排序后，截止到当前行，最后一个值

lead(col,n,default) ：用于统计窗口内往下第n行值。第一个参数为列名，第二个参数为往下第n行（可选，默认为1），第三个参数为默认值（当往下第n行为null时候，取默认值，如不指定，则为null）

lag(col,n,default) ：用于统计窗口内往上第n行值。第一个参数为列名，第二个参数为往上第n行（可选，默认为1），第三个参数为默认值（当往上第n行为null时候，取默认值，如不指定，则为null）

over从句

1、over前使用标准的聚合函数count、sum、min、max、avg，也可以使用其他分析函数row_number()、rank()、dense_rank()、cume_dist、percent_rank、ntile(n)

2、over后括号内使用partition by语句进行分组，可以是一列或多列

3、over后括号内使用order by语句进行排序，可以是一列或多列

4、综合使用partition by与order by语句先后组再排序，可以是一列或多列，数据为分组内排序后截止当前行之前计算内容

窗口规范

开窗——聚合函数

count、sum、min、max、avg、collect_set、collect_list


with tmp as
(
select 'd001' as order_id,123 as user_id,20 as amt,'2020-01-01 00:23:33' as pay_time
union all select 'd002' as order_id,123 as user_id,25 as amt,'2020-01-02 00:23:33' as pay_time
union all select 'd005' as order_id,123 as user_id,250 as amt,'2020-01-04 00:23:33' as pay_time
union all select 'd007' as order_id,123 as user_id,100 as amt,'2020-01-06 00:23:33' as pay_time
union all select 'd003' as order_id,125 as user_id,15 as amt,'2020-01-03 00:23:33' as pay_time
union all select 'd004' as order_id,125 as user_id,39 as amt,'2020-01-03 00:23:33' as pay_time
union all select 'd006' as order_id,125 as user_id,25 as amt,'2020-01-05 00:23:33' as pay_time
union all select 'd008' as order_id,125 as user_id,359 as amt,'2020-01-07 00:23:33' as pay_time
)
select user_id,order_id,amt,pay_time
,count(*)over(partition by user_id) as `当前用户订单数`
,count(distinct user_id)over(partition by 1) as `当前用户数`
,sum(amt)over(partition by user_id) as `当前用户总金额`
,sum(amt)over(order by 1) as `总金额`
,sum(amt)over(partition by user_id order by pay_time) as `当前用户截止当前支付时间累计总金额`
,max(amt)over(partition by user_id) as `当前用户最大金额`
,max(amt)over(order by pay_time) as `截止当前成单时间最大成交金额`
,min(amt)over(partition by user_id) as `当前用户最小金额`
,avg(amt)over(partition by user_id) as `当前用户订单均额`
from tmp

user_id	order_id	amt	pay_time	当前用户订单数	当前用户数	当前用户总金额	总金额	当前用户截止当前支付时间累计总金额	当前用户最大金额	截止当前成单时间最大成交金额	当前用户最小金额	当前用户订单均额
123	d001	20	2020-01-01 00:23:33	4	2	395	833	20	250	20	20	98.75
123	d002	25	2020-01-02 00:23:33	4	2	395	833	45	250	25	20	98.75
125	d004	39	2020-01-03 00:23:33	4	2	438	833	54	359	39	15	109.5
125	d003	15	2020-01-03 00:23:33	4	2	438	833	54	359	39	15	109.5
123	d005	250	2020-01-04 00:23:33	4	2	395	833	295	250	250	20	98.75
125	d006	25	2020-01-05 00:23:33	4	2	438	833	79	359	250	15	109.5
123	d007	100	2020-01-06 00:23:33	4	2	395	833	395	250	250	20	98.75
125	d008	359	2020-01-07 00:23:33	4	2	438	833	438	359	359	15	109.5

开窗——分析函数

row_number

() 从1开始，顺序生成序列号，如按成绩排名，也可以用于获取某学科成绩第一名，或获取某学科成绩第一名的学员姓名或其他信息等。

row_number() over(partition by col1 order by col2) as rowid

结果：1,2,3,4

rank

() 并列排序，即下一个数字会跳过

rank() over(partition by col1 order by col2) as rowid

结果：1，2，2，4，5

dense_rank

() 并列排序，下一个数据不会跳过

dense_rank() over(partition by col1 order by col2) as rowid

结果：1，2，2，3，4

ntile

(n) 用于将分组数据按照顺序切分成n片，返回当前切片值，如果切片不均匀，默认增加第一个切片的分布。(

https://blog.csdn.net/weixin_36190755/article/details/109056534

)

ntile不支持rows between，比如 ntile(2) over(partition by col1 order by col2 rows between 3 preceding and current row)。

first_value

(col) 当前分组内排序后col第一个值

last_value

(col) 当前分组内排序后col最后一个值

lag

(col1,n,default) over(partition by col1 order by col2) as up 向前取第n行

lead

(col1,n,default) over(partition by col1 order by col2) as down 向后取第n行

cume_dist

() 小于等于当前值的行数/分组内总行数。比如，统计小于等于当前薪水的人数，所占总人数的比例

percent_rank

() 分组内当前行的rank值-1/分组内总行数-1

原文链接：https://blog.csdn.net/weixin_36190755/article/details/109057442