Flink基础(132):FLINK-SQL语法 (26) DQL(18) OPERATIONS(15)Deduplication 去重
2021/8/28 19:08:12
本文主要是介绍Flink基础(132):FLINK-SQL语法 (26) DQL(18) OPERATIONS(15)Deduplication 去重,对大家解决编程问题具有一定的参考价值,需要的程序猿们随着小编来一起学习吧!
Deduplication
Batch Streaming
Deduplication removes rows that duplicate over a set of columns, keeping only the first one or the last one. In some cases, the upstream ETL jobs are not end-to-end exactly-once; this may result in duplicate records in the sink in case of failover. However, the duplicate records will affect the correctness of downstream analytical jobs - e.g. SUM
, COUNT
- so deduplication is needed before further analysis.
Flink uses ROW_NUMBER()
to remove duplicates, just like the way of Top-N query. In theory, deduplication is a special case of Top-N in which the N is one and order by the processing time or event time.
The following shows the syntax of the Deduplication statement:
SELECT [column_list] FROM ( SELECT [column_list], ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]] ORDER BY time_attr [asc|desc]) AS rownum FROM table_name) WHERE rownum = 1
Parameter Specification:
ROW_NUMBER()
: Assigns an unique, sequential number to each row, starting with one.PARTITION BY col1[, col2...]
: Specifies the partition columns, i.e. the deduplicate key.ORDER BY time_attr [asc|desc]
: Specifies the ordering column, it must be a time attribute. Currently Flink supports processing time attribute and event time atttribute. Ordering by ASC means keeping the first row, ordering by DESC means keeping the last row.WHERE rownum = 1
: Therownum = 1
is required for Flink to recognize this query is deduplication.
Note: the above pattern must be followed exactly, otherwise the optimizer won’t be able to translate the query.
The following examples show how to specify SQL queries with Deduplication on streaming tables.
CREATE TABLE Orders ( order_time STRING, user STRING, product STRING, num BIGINT, proctime AS PROCTIME() ) WITH (...); -- remove duplicate rows on order_id and keep the first occurrence row, -- because there shouldn't be two orders with the same order_id. SELECT order_id, user, product, num FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime ASC) AS row_num FROM Orders) WHERE row_num = 1
这篇关于Flink基础(132):FLINK-SQL语法 (26) DQL(18) OPERATIONS(15)Deduplication 去重的文章就介绍到这儿,希望我们推荐的文章对大家有所帮助,也希望大家多多支持为之网!
- 2024-11-01UniApp 中组件的生命周期是多少-icode9专业技术文章分享
- 2024-11-01如何使用Svg Sprite Icon简化网页图标管理
- 2024-10-31Excel数据导出课程:新手从入门到精通的实用教程
- 2024-10-31Excel数据导入课程:新手入门指南
- 2024-10-31RBAC的权限课程:新手入门教程
- 2024-10-31Svg Sprite Icon课程:新手入门必备指南
- 2024-10-31怎么配置 L2TP 允许多用户连接-icode9专业技术文章分享
- 2024-10-31怎么在FreeBSD上 安装 OpenResty-icode9专业技术文章分享
- 2024-10-31运行 modprobe l2tp_ppp 时收到“module not found”消息提醒是什么-icode9专业技术文章分享
- 2024-10-31FreeBSD的下载命令有哪些-icode9专业技术文章分享