Use indexed position or condition in one PySpark column to extract a value in another
Question
I am new to PySpark/Python. I am trying to create a new column that holds the string from one array column whose position matches the highest frequency in another array column.
I tried the following:
df.select(df.Name, df.Freq,
expr("element_at(Name, array_position(Freq, array_max(Freq)))")
.alias("Popular")).display()
to produce this:
Name | Freq | Popular |
---|---|---|
['AB', 'DC', 'TZ','ETC','RX'] | [1,4,1,2,2] | DC |
['XYS', 'FD', 'PA'] | [2,1,6] | PA |
But I get a "Column is not iterable" TypeError. After much trial and error, I traced the problem to the array_position() function, because when I use this:
df.select(df.Name, df.Freq,
expr("element_at(Name, array_max(Freq))")
.alias("Popular")).display()
I get this:
Name | Freq | Popular |
---|---|---|
['AB', 'DC', 'TZ','ETC','RX'] | [1,4,1,2,2] | ETC |
['XYS', 'FD', 'PA'] | [2,1,6] | Null |
element_at() uses the maximum frequency as the position: it returns the 4th element of the Name array in the first row and NULL in the second row, since that array has only 3 elements. This tells me that element_at() and array_max() work just fine.
I checked some websites, and both array_max() and array_position() take col object types. So I'm confused about why array_max() works but array_position() doesn't.
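To make the observation above concrete, here is a minimal plain-Python sketch of the indexing rules involved. The helpers `element_at` and `array_position` below are hypothetical stand-ins written for illustration, not the Spark functions themselves; they assume 1-based positions, a result of 0 when the value is absent, and a null (None) result for an out-of-range position, and they ignore negative indices.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def array_position(values: Sequence[T], target: T) -> int:
    """Illustrative model: 1-based index of the first match, 0 if absent."""
    for position, value in enumerate(values, start=1):
        if value == target:
            return position
    return 0


def element_at(values: Sequence[T], position: int) -> Optional[T]:
    """Illustrative model: 1-based lookup, None when out of range."""
    if 1 <= position <= len(values):
        return values[position - 1]
    return None


# Sample rows taken from the question.
rows = [
    (["AB", "DC", "TZ", "ETC", "RX"], [1, 4, 1, 2, 2]),
    (["XYS", "FD", "PA"], [2, 1, 6]),
]

# Using the maximum frequency itself as the position (the second attempt).
by_value = [element_at(names, max(freqs)) for names, freqs in rows]

# Using the position of the maximum frequency (the intended result).
by_position = [
    element_at(names, array_position(freqs, max(freqs))) for names, freqs in rows
]

print(by_value)
print(by_position)
```

The first list reproduces the "ETC" and null results from the second attempt, while the second list gives the intended "DC" and "PA".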
Answer 1
Score: 0
The issue arises because these functions have different return types.
array_max
> Returns a result whose type matches the type of the elements. NULL elements are skipped.
array_position
> Returns a long type.
I am guessing your current Freq column is an array of integers, so array_max will return an integer type, whereas array_position will return a long type.
When the first argument is an array, element_at requires an integer value for its second argument. That is why the expression works with array_max but not with array_position.
<h3>How to fix</h3>
Cast the result of array_position to an integer:
from pyspark.sql import functions as F
df.select(df.Name, df.Freq,
F.expr("element_at(Name, cast(array_position(Freq, array_max(Freq)) as int))").alias("Popular"))
Or use the array subscript syntax. Note that subscripts use a 0-based index while array_position returns a 1-based index, so subtract 1 to adjust the value.
df.select(df.Name, df.Freq,
F.expr("Name[array_position(Freq, array_max(Freq))-1]").alias("Popular"))