how to filter dataset with ArrayType column such that Array doesn't contain duplicates

Question
I have a dataset:

+-----+---------+
| A   | B       |
+-----+---------+
| a   | [1,2]   |
| a   | [2,2]   |
| a   | [1,3]   |
| a   | [1,2,3] |
| a   | [1,1,3] |
+-----+---------+

Column B is of ArrayType, and some of the arrays contain duplicate elements. I need to keep only the rows whose array has no duplicate elements.

Expected result:

+-----+---------+
| A   | B       |
+-----+---------+
| a   | [1,2]   |
| a   | [1,3]   |
| a   | [1,2,3] |
+-----+---------+

The second row is dropped because it has a duplicate 2, and the fifth row is dropped because it has a duplicate 1.

How can I do this in PySpark?
Answer 1

Score: 1

Try this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, size, array_distinct

spark = SparkSession.builder.getOrCreate()

data = [("Alice", [1, 2, 3, 4, 5]),
        ("Bob", [6, 7, 8, 8, 9]),
        ("Charlie", [10, 11, 12, 13, 14])]
df = spark.createDataFrame(data, ["Name", "Numbers"])

# HasD is True when the array shrinks after array_distinct, i.e. it had duplicates;
# keep only the rows where it is False.
df_d = df.withColumn("HasD", size(col("Numbers")) > size(array_distinct(col("Numbers")))) \
         .where(col("HasD") == False)
df_d.show()
Comments