Use indexed position or condition in one PySpark column to extract a value in another

Question

I am new to PySpark/Python. I am trying to create a new column that holds the string from one array column whose matching entry in another array column has the highest frequency.

I tried the following:

df.select(df.Name, df.Freq,
     expr("element_at(Name, array_position(Freq, array_max(Freq)))")
  .alias("Popular")).display()

to produce this:

Name                             Freq          Popular
['AB', 'DC', 'TZ','ETC','RX']    [1,4,1,2,2]   DC
['XYS', 'FD', 'PA']              [2,1,6]       PA

But I get a "Column is not iterable" TypeError. After a lot of trial and error, I narrowed the problem down to the array_position() function, because when I use this:

df.select(df.Name, df.Freq,
     expr("element_at(Name, array_max(Freq))")
  .alias("Popular")).display()

I get this:

Name                             Freq          Popular
['AB', 'DC', 'TZ','ETC','RX']    [1,4,1,2,2]   ETC
['XYS', 'FD', 'PA']              [2,1,6]       NULL

element_at() uses the maximum frequency as the position: it picks the 4th element of the Name array in the first row and returns NULL in the second row, since that array has only 3 elements. This tells me that element_at() and array_max() work just fine.

I checked some websites, and both array_max() and array_position() accept col objects. So I'm confused about why array_max() works but array_position() doesn't.

Answer 1

Score: 0

The issue is caused by the different return types of these functions.

array_max

> Returns a result that matches the type of the elements. NULL elements are skipped.

array_position

> Returns a long type.

I am guessing your Freq column is an array of integers, so array_max returns an integer, whereas array_position returns a long.

When you pass either result to element_at, element_at requires an integer value for its 2nd argument. That's why it works with array_max but not with array_position.

How to fix

Cast the result of array_position to an integer:

from pyspark.sql import functions as F

df.select(df.Name, df.Freq,
          F.expr("element_at(Name, cast(array_position(Freq, array_max(Freq)) as int))").alias("Popular"))
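
For a quick way to try the cast on the sample data from the question, something like the following could be used. It is a sketch rather than a definitive fix: the local SparkSession, the DDL schema string, and the names POPULAR_EXPR, with_popular and main are assumptions made for this example. pyspark is imported inside the functions so the file can be loaded without starting Spark.

```python
# SQL expression from the answer: cast the 1-based bigint position to int
# before handing it to element_at.
POPULAR_EXPR = (
    "element_at(Name, cast(array_position(Freq, array_max(Freq)) as int))"
)


def with_popular(df):
    """Return df with an extra 'Popular' column computed by POPULAR_EXPR."""
    # Lazy import (an assumption of this sketch) keeps the module importable
    # on machines where pyspark is not installed.
    from pyspark.sql import functions as F

    return df.withColumn("Popular", F.expr(POPULAR_EXPR))


def main():
    from pyspark.sql import SparkSession

    # Hypothetical local session and the two sample rows from the question.
    spark = (
        SparkSession.builder.master("local[1]").appName("popular").getOrCreate()
    )
    df = spark.createDataFrame(
        [
            (["AB", "DC", "TZ", "ETC", "RX"], [1, 4, 1, 2, 2]),
            (["XYS", "FD", "PA"], [2, 1, 6]),
        ],
        "Name array<string>, Freq array<int>",
    )
    with_popular(df).show(truncate=False)
    spark.stop()


if __name__ == "__main__":
    main()
```

Keeping the expression in one string makes it easy to paste into F.expr or a SQL cell unchanged.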

Or use array subscript syntax. Note that subscripts use a 0-based index, while array_position returns a 1-based index, so subtract 1 to adjust the value.

df.select(df.Name, df.Freq,
          F.expr("Name[array_position(Freq, array_max(Freq))-1]").alias("Popular"))
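
To see why the subscript form needs the -1 while element_at does not, here is a plain-Python model of the indexing rules involved. It does not use Spark: the helper names only mirror the SQL functions, and they cover just the positive-index case discussed in this thread, so treat it as an illustration of the arithmetic and not of Spark's full behavior.

```python
def array_max(values):
    """Largest non-None element (the SQL function skips NULL elements)."""
    return max(v for v in values if v is not None)


def array_position(values, target):
    """1-based index of the first match, or 0 when target is absent."""
    for index, value in enumerate(values, start=1):
        if value == target:
            return index
    return 0


def element_at(values, index):
    """1-based lookup for positive indices; None when out of range."""
    if 1 <= index <= len(values):
        return values[index - 1]
    return None


def subscript(values, index):
    """0-based lookup, like Name[i] in the SQL expression."""
    if 0 <= index < len(values):
        return values[index]
    return None


names = ["AB", "DC", "TZ", "ETC", "RX"]
freq = [1, 4, 1, 2, 2]

# 1-based position of the highest frequency in freq.
pos = array_position(freq, array_max(freq))

via_element_at = element_at(names, pos)     # 1-based, use pos directly
via_subscript = subscript(names, pos - 1)   # 0-based, shift by one

# What the question's second snippet computed: the max itself as the index.
max_as_index = element_at(names, array_max(freq))
```

Both lookups land on the same element, while max_as_index reproduces the unintended result from the question, where the frequency value itself was used as the position.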

huangapple
  • Posted on March 7, 2023 at 01:40:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75654084.html