March 7, 2023 01:40:25

Use indexed position or condition in one PySpark column to extract a value in another

Question

I am new to PySpark/Python. I am trying to create a new column that holds the string from one array column whose corresponding entry in another array column has the highest frequency.

I tried the following:

df.select(df.Name, df.Freq,
     expr("element_at(Name, array_position(Freq, array_max(Freq)))")
  .alias("Popular")).display()

to produce this:

Name	Freq	Popular
['AB', 'DC', 'TZ','ETC','RX']	[1,4,1,2,2]	DC
['XYS', 'FD', 'PA']	[2,1,6]	PA

But I get a "Column is not iterable" TypeError. After much trial and error, I traced the problem to the array_position() function, because when I use this:

df.select(df.Name, df.Freq,
     expr("element_at(Name, array_max(Freq))")
  .alias("Popular")).display()

I got this:

Name	Freq	Popular
['AB', 'DC', 'TZ','ETC','RX']	[1,4,1,2,2]	ETC
['XYS', 'FD', 'PA']	[2,1,6]	Null

element_at() uses the frequency value as the position: it picks the 4th element of the Name array in the first row and returns NULL in the second row, since that array has only 3 elements. This tells me that element_at() and array_max() work just fine.

I checked some websites, and both array_max() and array_position() take Column objects. So I'm confused about why array_max() works but array_position() doesn't.

Answer 1

Score: 0

The issue is that these functions have different return types.

array_max

> Returns: the result matches the type of the elements. NULL elements are skipped.

array_position

> Returns a long type.

I am guessing your current Freq column is an array of integers, so array_max will return an integer type, whereas array_position will return a long type.

When used with element_at, the second argument must be an integer value. That's why it works with array_max but not with array_position.

Cast the result of array_position to integer:

from pyspark.sql import functions as F

df.select(df.Name, df.Freq,
          F.expr("element_at(Name, cast(array_position(Freq, array_max(Freq)) as int))").alias("Popular"))

Or use the array subscript syntax. Note that this syntax uses a 0-based index while array_position returns a 1-based index, so subtract 1 to adjust the value.

df.select(df.Name, df.Freq,
          F.expr("Name[array_position(Freq, array_max(Freq))-1]").alias("Popular"))
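
To make the index arithmetic concrete, here is a rough plain-Python model of the three functions, run on the two sample rows from the question. The helpers below are hypothetical stand-ins written for illustration, not the Spark implementations; they only mimic the documented behaviour as I understand it (1-based positions, 0 when the value is not found, NULL for an out-of-range position), so treat this as a sketch rather than a reference.

```python
# Plain-Python stand-ins for the Spark SQL functions used above.
# Assumption: these only imitate the indexing rules; they are not Spark code.

def array_max(values):
    """Largest non-None element, or None if there is none."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


def array_position(values, target):
    """1-based index of the first match, or 0 if target is absent."""
    for i, v in enumerate(values, start=1):
        if v == target:
            return i
    return 0


def element_at(values, index):
    """1-based lookup; None when the index is past the end of the list."""
    if index == 0:
        raise ValueError("index 0 is invalid for a 1-based lookup")
    if abs(index) > len(values):
        return None
    # Positive indexes count from the front, negative ones from the back.
    return values[index - 1] if index > 0 else values[index]


rows = [
    (["AB", "DC", "TZ", "ETC", "RX"], [1, 4, 1, 2, 2]),
    (["XYS", "FD", "PA"], [2, 1, 6]),
]

# Intended lookup: position of the max frequency, then the name at that position.
popular = [element_at(name, array_position(freq, array_max(freq)))
           for name, freq in rows]

# Same lookup with 0-based subscripting, hence the "- 1".
by_subscript = [name[array_position(freq, array_max(freq)) - 1]
                for name, freq in rows]

# What the question's second query did: the max frequency used as a position.
max_as_position = [element_at(name, array_max(freq)) for name, freq in rows]

print(popular)          # ['DC', 'PA']
print(by_subscript)     # ['DC', 'PA']
print(max_as_position)  # ['ETC', None]
```

The last list reproduces the ETC / NULL result from the question: the maximum frequency (4, then 6) is being read as a position in Name, which is why it only appeared to work.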
