Spark SQL deduplication: distinct and dropDuplicates

2022/1/6 19:35:09

This article introduces two ways to remove duplicate rows in Spark SQL, distinct and dropDuplicates, and shows how they differ.

1. distinct removes duplicates at the row level: two rows count as duplicates only when every column has the same value.

df.distinct()

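2. dropDuplicates can deduplicate on a chosen subset of columns. Called with no arguments it behaves like distinct. Called with a list of column names it keeps one row per distinct combination of those columns and retains all other columns; which of the matching rows is kept is not guaranteed.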

>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
...     Row(name='Alice', age=5, height=80), \
...     Row(name='Alice', age=5, height=80), \
...     Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|    80|Alice|
+---+------+-----+

>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
+---+------+-----+
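
The difference between the two calls above can be modeled in plain Python. This is only a sketch of the semantics, not how Spark implements them: the helper `drop_duplicates` is hypothetical, rows are represented as dicts, and the model keeps the first row it sees for each key, whereas Spark makes no promise about which row survives.

```python
def drop_duplicates(rows, subset=None):
    """Hypothetical model of dropDuplicates over a list of dict rows.

    With subset=None every column forms the key (the same as distinct);
    otherwise only the listed columns are compared.
    """
    seen = set()
    kept = []
    for row in rows:
        cols = subset if subset is not None else sorted(row)
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            # First-seen wins in this model; Spark does not guarantee that.
            kept.append(row)
    return kept


rows = [
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 10, "height": 80},
]

all_columns = drop_duplicates(rows)                 # two rows remain
by_subset = drop_duplicates(rows, ["name", "height"])  # one row remains
print(len(all_columns), len(by_subset))
```

Unlike selecting columns before calling distinct, the subset form keeps every column of the surviving row, which is why `age` still appears in the output above.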
