How to filter a dataset with an ArrayType column so that the array doesn't contain duplicates

Question

I have a dataset like this:

+---+---------+
|  A|        B|
+---+---------+
|  a|   [1, 2]|
|  a|   [2, 2]|
|  a|   [1, 3]|
|  a|[1, 2, 3]|
|  a|[1, 1, 3]|
+---+---------+

Column B is an ArrayType column, and some of the arrays contain duplicates within themselves. I need to keep only the rows whose array has no duplicate elements.

Expected result:

+---+---------+
|  A|        B|
+---+---------+
|  a|   [1, 2]|
|  a|   [1, 3]|
|  a|[1, 2, 3]|
+---+---------+

The second row is dropped because it has a duplicate 2, and the fifth row is dropped because it has a duplicate 1.

How can I do this in PySpark?
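
For reference, the sample data above can be reproduced with the following sketch (it assumes an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the example DataFrame from the question: column B is an array of integers.
data = [("a", [1, 2]),
        ("a", [2, 2]),
        ("a", [1, 3]),
        ("a", [1, 2, 3]),
        ("a", [1, 1, 3])]
df = spark.createDataFrame(data, ["A", "B"])
df.show()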

Answer 1

Score: 1

Try this:

from pyspark.sql.functions import array_distinct, col, size

data = [("Alice", [1, 2, 3, 4, 5]),
        ("Bob", [6, 7, 8, 8, 9]),
        ("Charlie", [10, 11, 12, 13, 14])]

df = spark.createDataFrame(data, ["Name", "Numbers"])

# Flag rows whose array shrinks after deduplication (i.e. it contains duplicates),
# then keep only the rows where that flag is False.
df_d = df.withColumn("HasD", size(col("Numbers")) > size(array_distinct(col("Numbers")))) \
    .where(col("HasD") == False)

df_d.show()
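
Applied to the question's own schema, the same check can be written as a single filter without the helper column. This is only a sketch and assumes a DataFrame df with columns A and B, where B is the array column:

from pyspark.sql.functions import array_distinct, col, size

# Keep a row only if deduplicating B does not change its length,
# i.e. B contains no repeated elements.
result = df.filter(size(col("B")) == size(array_distinct(col("B"))))
result.show()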
