2023-08-10 18:29:54

TypeError: 'Column' object is not callable when adding column to Struct

Question

I was implementing the answer mentioned here.
This is my struct and I want to add a new col to it.

root
 |-- shops: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- epoch: double (nullable = true)
 |    |    |-- request: string (nullable = true)

So I executed this:

from pyspark.sql import functions as F
df = new_df.withColumn('state', F.col('shops').withField('a', F.lit(1)))
df.printSchema()

But I get this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-47-1749b2131995> in <module>
      1 from pyspark.sql import functions as F
----> 2 df = new_df.withColumn('state', F.col('shops').withField('a', F.lit(1)))
      3 df.printSchema()
TypeError: 'Column' object is not callable

EDIT: My versions are Python 3.9 and Spark 3.0.3 (the maximum possible).

Answer 1

Score: 2

Try the transform higher-order function, since you are trying to add a new field to each struct inside an array.

Example:

from pyspark.sql.functions import *
jsn_str = """{"shop_time":[{"seconds":10,"shop":"Texmex"},{"seconds":5,"shop":"Tex"}]}"""
df = spark.read.json(sc.parallelize([jsn_str]), multiLine=True)
df.\
  withColumn("shop_time", transform('shop_time', lambda x: x.withField('diff_sec', lit(1)))).\
    show(10,False)
#+------------------------------+
#|shop_time                     |
#+------------------------------+
#|[{10, Texmex, 1}, {5, Tex, 1}]|
#+------------------------------+
df.withColumn("shop_time", transform('shop_time', lambda x: x.withField('diff_sec', lit(1)))).\
    printSchema()
#root
# |-- shop_time: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- seconds: long (nullable = true)
# |    |    |-- shop: string (nullable = true)
# |    |    |-- diff_sec: integer (nullable = false)

UPDATE:

Using Spark SQL:

df.createOrReplaceTempView("tmp")
spark.sql("select transform(shop_time, x -> struct(1 as diff_sec, x.seconds, x.shop)) as shop_time from tmp").\
  show(10,False)
#+------------------------------+
#|shop_time                     |
#+------------------------------+
#|[{1, 10, Texmex}, {1, 5, Tex}]|
#+------------------------------+
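The asker reports Spark 3.0.3. To the best of my knowledge, the Python helpers `Column.withField` and `pyspark.sql.functions.transform` only arrived in Spark 3.1, whereas the SQL `transform` higher-order function used in the query above has been available since Spark 2.4. A minimal sketch, assuming that holds, builds the SQL expression as a string and passes it to `F.expr`, so no temp view is needed. The helper names `add_field_expr` and `with_state` and their parameters are hypothetical and not part of PySpark; field names that need backtick quoting are not handled.

```python
def add_field_expr(array_col, existing_fields, new_field, new_value_sql):
    """Build a Spark SQL expression that rebuilds every struct in an array
    column with one extra field appended.

    Hypothetical helper: all names here are illustrative. `new_value_sql`
    is pasted into the expression verbatim, so pass trusted SQL only.
    """
    # Carry over each existing field, keeping its name explicitly.
    kept = [f"x.{name} as {name}" for name in existing_fields]
    # Append the new field last, after the existing ones.
    parts = ", ".join(kept + [f"{new_value_sql} as {new_field}"])
    return f"transform({array_col}, x -> struct({parts}))"


expr_str = add_field_expr("shops", ["epoch", "request"], "a", "1")
print(expr_str)
# transform(shops, x -> struct(x.epoch as epoch, x.request as request, 1 as a))


def with_state(df):
    """Apply the expression to a DataFrame that has a `shops` array column.

    PySpark is imported lazily so the string builder above can be tried
    without a Spark installation.
    """
    from pyspark.sql import functions as F

    return df.withColumn("state", F.expr(expr_str))
```

Under these assumptions, `with_state(new_df).printSchema()` should show a `state` array whose structs carry `epoch`, `request`, and the new field `a`.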

Answer 2

Score: 1

Your issue is that you're using the withField method on a column (your shops column) that is of type ArrayType and not of StructType.

You can fix this by using pyspark.sql.functions's transform function. From the docs:

> Returns an array of elements after applying a transformation to each element in the input array.

So let's first create some input data:

from pyspark.sql.types import StringType, StructType, StructField, ArrayType, DoubleType
from pyspark.sql import functions as F
schema = StructType(
    [
        StructField(
            "shops",
            ArrayType(
                StructType(
                    [
                        StructField("epoch", DoubleType()),
                        StructField("request", StringType()),
                    ]
                )
            ),
        )
    ]
)
df = spark.createDataFrame(
    [
        [[(5.0, "haha")]],
        [[(6.0, "hoho")]],
    ],
    schema=schema,
)

And now use the transform function to apply your withField operation on each element of the shops column.

new_df = df.withColumn(
    "state", F.transform(F.col("shops"), lambda x: x.withField("a", F.lit(1)))
)
>>> new_df.printSchema()
root
 |-- shops: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- epoch: double (nullable = true)
 |    |    |-- request: string (nullable = true)
 |-- state: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- epoch: double (nullable = true)
 |    |    |-- request: string (nullable = true)
 |    |    |-- a: integer (nullable = false)
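
A note on why the message in the question is "'Column' object is not callable" rather than an AttributeError: PySpark's `Column` defines `__getattr__` so that `col.some_field` means "get the nested field `some_field`". If a method such as `withField` does not exist on the installed version (my understanding is that it was added in Spark 3.1, and the asker is on 3.0.3), the attribute lookup quietly returns another `Column`, and the parentheses that follow then try to call it. The toy class below is a stand-in written to mimic that behaviour; `FakeColumn` is not a PySpark class.

```python
class FakeColumn:
    """Toy stand-in for pyspark.sql.Column (illustration only)."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Only called when normal lookup fails, i.e. for unknown attributes.
        if item.startswith("__"):
            raise AttributeError(item)
        # Unknown names are treated as nested-field access.
        return FakeColumn(f"{self.name}.{item}")


shops = FakeColumn("shops")
field = shops.withField          # no such method: yields a field reference
print(field.name)                # shops.withField

try:
    shops.withField("a", 1)      # calling a FakeColumn instance
except TypeError as err:
    message = str(err)
    print(message)               # 'FakeColumn' object is not callable
```

If this reading is right, the traceback in the question points at a missing method on the installed version, and the array-versus-struct mismatch described above would only surface once `withField` is actually available.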
