Hive 高级函数----开窗函数
2022/2/22 0:12:21
本文主要是介绍Hive 高级函数----开窗函数,对大家解决编程问题具有一定的参考价值,需要的程序猿们随着小编来一起学习吧!
目录- Hive 高级函数----开窗函数
- 测试数据
- 建表语句
- 一、排列函数
- row_number:无并列排名
- dense_rank:有并列排名,并且依次递增
- rank:有并列排名,不依次递增
- percent_rank:(rank的结果-1)/(分区内数据的个数-1)
- cume_dist:计算某个窗口或分区中某个值的累积分布。
- NTILE(n):对分区内数据再分成n组,然后打上组号
- max、min、avg、count、sum:基于每个partition分区内的数据做对应的计算
- 窗口帧:用于从分区中选择指定的多条记录,供窗口函数处理
- 二、窗口函数
- LAG(col,n):查看往前第n行数据
- LEAD(col,n):查看往后第n行数据
- FIRST_VALUE:取分组内排序后,截止到当前行,第一个值
- LAST_VALUE:取分组内排序后,截止到当前行,最后一个值,对于并列的排名,取最后一个
Hive 高级函数----开窗函数
用户分组中开窗,好像给每一份数据 开一扇窗户 所以叫开窗函数
在sql中有一类函数叫做聚合函数,例如sum()、avg()、max()等等,这类函数可以将多行数据按照规则聚集为一行,一般来讲聚集后的行数是要少于聚集前的行数的.但是有时我们想要既显示聚集前的数据,又要显示聚集后的数据,这时我们便引入了窗口函数.
测试数据
111,69,class1,department1 112,80,class1,department1 113,74,class1,department1 114,94,class1,department1 115,93,class1,department1 121,74,class2,department1 122,86,class2,department1 123,78,class2,department1 124,70,class2,department1 211,93,class1,department2 212,83,class1,department2 213,94,class1,department2 214,94,class1,department2 215,82,class1,department2 216,74,class1,department2 221,99,class2,department2 222,78,class2,department2 223,74,class2,department2 224,80,class2,department2 225,85,class2,department2
建表语句
create table new_score( id int ,score int ,clazz string ,department string ) row format delimited fields terminated by ",";
一、排列函数
row_number:无并列排名
- 用法: select xxxx, row_number() over(partition by 分组字段 order by 排序字段 desc) as rn from tb group by xxxx
dense_rank:有并列排名,并且依次递增
rank:有并列排名,不依次递增
percent_rank:(rank的结果-1)/(分区内数据的个数-1)
cume_dist:计算某个窗口或分区中某个值的累积分布。
假定升序排序,则使用以下公式确定累积分布: 小于等于当前值x的行数 / 窗口或partition分区内的总行数。其中,x 等于 order by 子句中指定的列的当前行中的值。
NTILE(n):对分区内数据再分成n组,然后打上组号
max、min、avg、count、sum:基于每个partition分区内的数据做对应的计算
窗口帧:用于从分区中选择指定的多条记录,供窗口函数处理
Hive 提供了两种定义窗口帧的形式:
ROWS
和RANGE
。两种类型都需要配置上界和下界。例如,ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
表示选择分区起始记录到当前记录的所有行;SUM(close) RANGE BETWEEN 100 PRECEDING AND 200 FOLLOWING
则通过 字段差值 来进行选择。如当前行的close
字段值是200
,那么这个窗口帧的定义就会选择分区中close
字段值落在100
至400
区间的记录。以下是所有可能的窗口帧定义组合。如果没有定义窗口帧,则默认为RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
。只能运用在max、min、avg、count、sum、FIRST_VALUE、LAST_VALUE这几个窗口函数上
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING) (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING) (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING range between 3 PRECEDING and 11 FOLLOWINGSELECT id ,score ,clazz ,SUM(score) OVER w as sum_w ,round(avg(score) OVER w,3) as avg_w ,count(score) OVER w as cnt_w FROM new_score WINDOW w AS (PARTITION BY clazz ORDER BY score rows between 2 PRECEDING and 2 FOLLOWING);111 69 class1 217 72.333 3 113 74 class1 297 74.25 4 216 74 class1 379 75.8 5 112 80 class1 393 78.6 5 215 82 class1 412 82.4 5 212 83 class1 431 86.2 5 211 93 class1 445 89.0 5 115 93 class1 457 91.4 5 213 94 class1 468 93.6 5 114 94 class1 375 93.75 4 214 94 class1 282 94.0 3 124 70 class2 218 72.667 3 121 74 class2 296 74.0 4 223 74 class2 374 74.8 5 222 78 class2 384 76.8 5 123 78 class2 395 79.0 5 224 80 class2 407 81.4 5 225 85 class2 428 85.6 5 122 86 class2 350 87.5 4 221 99 class2 270 90.0 3
select id ,score ,clazz ,department ,row_number() over (partition by clazz order by score desc) as rn_rk ,dense_rank() over (partition by clazz order by score desc) as dense_rk ,rank() over (partition by clazz order by score desc) as rk ,percent_rank() over (partition by clazz order by score desc) as percent_rk ,round(cume_dist() over (partition by clazz order by score desc),3) as cume_rk ,NTILE(3) over (partition by clazz order by score desc) as ntile_num ,max(score) over (partition by clazz order by score desc range between 3 PRECEDING and 11 FOLLOWING) as max_p //用到了窗口帧 from new_score; //partition by clazz order by score desc:表示在每个班里学生的成绩按照降序排序 id score clazz department rn_rk ds_rk rk percent_rk cume_rk ntile_num max_p 114 94 class1 department1 1 1 1 0.0 0.273 1 94 214 94 class1 department2 2 1 1 0.0 0.273 1 94 213 94 class1 department2 3 1 1 0.0 0.273 1 94 211 93 class1 department2 4 2 4 0.3 0.455 1 94 115 93 class1 department1 5 2 4 0.3 0.455 2 94 212 83 class1 department2 6 3 6 0.5 0.545 2 94 215 82 class1 department2 7 4 7 0.6 0.636 2 94 112 80 class1 department1 8 5 8 0.7 0.727 2 94 113 74 class1 department1 9 6 9 0.8 0.909 3 94 216 74 class1 department2 10 6 9 0.8 0.909 3 94 111 69 class1 department1 11 7 11 1.0 1.0 3 94 221 99 class2 department2 1 1 1 0.0 0.111 1 99 122 86 class2 department1 2 2 2 0.125 0.222 1 99 225 85 class2 department2 3 3 3 0.25 0.333 1 99 224 80 class2 department2 4 4 4 0.375 0.444 2 99 123 78 class2 department1 5 5 5 0.5 0.667 2 99 222 78 class2 department2 6 5 5 0.5 0.667 2 99 121 74 class2 department1 7 6 7 0.75 0.889 3 99 223 74 class2 department2 8 6 7 0.75 0.889 3 99 124 70 class2 department1 9 7 9 1.0 1.0 3 99
二、窗口函数
LAG(col,n):查看往前第n行数据
LEAD(col,n):查看往后第n行数据
FIRST_VALUE:取分组内排序后,截止到当前行,第一个值
LAST_VALUE:取分组内排序后,截止到当前行,最后一个值,对于并列的排名,取最后一个
select id ,score ,clazz ,department ,lag(id,2) over (partition by clazz order by score desc) as lag_num ,LEAD(id,2) over (partition by clazz order by score desc) as lead_num ,FIRST_VALUE(id) over (partition by clazz order by score desc) as first_v_num ,LAST_VALUE(id) over (partition by clazz order by score desc) as last_v_num ,NTILE(3) over (partition by clazz order by score desc) as ntile_num from new_score; id score clazz department lag_num lead_num first_v_num last_v_num ntile_num 114 94 class1 department1 NULL 213 114 213 1 214 94 class1 department2 NULL 211 114 213 1 213 94 class1 department2 114 115 114 213 1 211 93 class1 department2 214 212 114 115 1 115 93 class1 department1 213 215 114 115 2 212 83 class1 department2 211 112 114 212 2 215 82 class1 department2 115 113 114 215 2 112 80 class1 department1 212 216 114 112 2 113 74 class1 department1 215 111 114 216 3 216 74 class1 department2 112 NULL 114 216 3 111 69 class1 department1 113 NULL 114 111 3 221 99 class2 department2 NULL 225 221 221 1 122 86 class2 department1 NULL 224 221 122 1 225 85 class2 department2 221 123 221 225 1 224 80 class2 department2 122 222 221 224 2 123 78 class2 department1 225 121 221 222 2 222 78 class2 department2 224 223 221 222 2 121 74 class2 department1 123 124 221 223 3 223 74 class2 department2 222 NULL 221 223 3 124 70 class2 department1 121 NULL 221 124 3
https://blog.csdn.net/qq_26937525/article/details/54925827
这篇关于Hive 高级函数----开窗函数的文章就介绍到这儿,希望我们推荐的文章对大家有所帮助,也希望大家多多支持为之网!
- 2024-06-15matplotlib作图不显示3D图,怎么办?
- 2024-06-1503-Loki 日志监控
- 2024-06-1504-让LLM理解知识 -Prompt
- 2024-06-05做软件测试需要懂代码吗?
- 2024-06-0514-ShardingSphere的分布式主键实现
- 2024-06-03为什么以及如何要进行架构设计权衡?
- 2024-05-31全网首发第二弹!软考2024年5月《软件设计师》真题+解析+答案!(11-20题)
- 2024-05-31全网首发!软考2024年5月《软件设计师》真题+解析+答案!(21-30题)
- 2024-05-30【Java】百万数据excel导出功能如何实现
- 2024-05-30我们小公司,哪像华为一样,用得上IPD(集成产品开发)?