How to filter a dataset with an ArrayType column so that the array doesn't contain duplicates

Question

I have a dataset like this:

+---+---------+
|  A|        B|
+---+---------+
|  a|   [1, 2]|
|  a|   [2, 2]|
|  a|   [1, 3]|
|  a|[1, 2, 3]|
|  a|[1, 1, 3]|
+---+---------+

Column B is an ArrayType column, and some of the arrays contain duplicates within themselves. I need to keep only the rows whose array has no duplicate elements.

Expected result:

+---+---------+
|  A|        B|
+---+---------+
|  a|   [1, 2]|
|  a|   [1, 3]|
|  a|[1, 2, 3]|
+---+---------+

The second row is dropped because it has a duplicate 2, and the fifth row is dropped because it has a duplicate 1.

How can I do this in PySpark?
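
For reference, the sample data above can be reproduced with the following sketch (it assumes an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the example DataFrame from the question: column B is an array of integers.
data = [("a", [1, 2]),
        ("a", [2, 2]),
        ("a", [1, 3]),
        ("a", [1, 2, 3]),
        ("a", [1, 1, 3])]
df = spark.createDataFrame(data, ["A", "B"])
df.show()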

Answer 1

Score: 1

Try this:

from pyspark.sql.functions import array_distinct, col, size

data = [("Alice", [1, 2, 3, 4, 5]),
        ("Bob", [6, 7, 8, 8, 9]),
        ("Charlie", [10, 11, 12, 13, 14])]

df = spark.createDataFrame(data, ["Name", "Numbers"])

# Flag rows whose array shrinks after deduplication (i.e. it contains duplicates),
# then keep only the rows where that flag is False.
df_d = df.withColumn("HasD", size(col("Numbers")) > size(array_distinct(col("Numbers")))) \
    .where(col("HasD") == False)

df_d.show()
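
Applied to the question's own schema, the same check can be written as a single filter without the helper column. This is only a sketch and assumes a DataFrame df with columns A and B, where B is the array column:

from pyspark.sql.functions import array_distinct, col, size

# Keep a row only if deduplicating B does not change its length,
# i.e. B contains no repeated elements.
result = df.filter(size(col("B")) == size(array_distinct(col("B"))))
result.show()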
