
TypeError: 'Column' object is not callable when adding column to Struct


I was implementing the answer mentioned here.
This is my struct and I want to add a new column to it.

    root
     |-- shops: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- epoch: double (nullable = true)
     |    |    |-- request: string (nullable = true)

So I executed this:

    from pyspark.sql import functions as F
    df = new_df.withColumn('state', F.col('shops').withField('a', F.lit(1)))
    df.printSchema()

But I get this error:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-47-1749b2131995> in <module>
          1 from pyspark.sql import functions as F
    ----> 2 df = new_df.withColumn('state', F.col('shops').withField('a', F.lit(1)))
          3 df.printSchema()

    TypeError: 'Column' object is not callable

EDIT: My versions are Python 3.9 and Spark 3.0.3 (the maximum possible).

Answer 1

Score: 2


Try the transform higher-order function, as you are trying to add a new column to an array.

Example:

    from pyspark.sql.functions import *

    jsn_str = """{"shop_time":[{"seconds":10,"shop":"Texmex"},{"seconds":5,"shop":"Tex"}]}"""
    df = spark.read.json(sc.parallelize([jsn_str]), multiLine=True)

    df.\
        withColumn("shop_time", transform('shop_time', lambda x: x.withField('diff_sec', lit(1)))).\
        show(10, False)
    #+------------------------------+
    #|shop_time                     |
    #+------------------------------+
    #|[{10, Texmex, 1}, {5, Tex, 1}]|
    #+------------------------------+

    df.withColumn("shop_time", transform('shop_time', lambda x: x.withField('diff_sec', lit(1)))).\
        printSchema()
    #root
    # |-- shop_time: array (nullable = true)
    # |    |-- element: struct (containsNull = true)
    # |    |    |-- seconds: long (nullable = true)
    # |    |    |-- shop: string (nullable = true)
    # |    |    |-- diff_sec: integer (nullable = false)

UPDATE:

Using Spark SQL:

    df.createOrReplaceTempView("tmp")

    spark.sql("select transform(shop_time, x -> struct(1 as diff_sec, x.seconds, x.shop)) as shop_time from tmp").\
        show(10, False)
    #+------------------------------+
    #|shop_time                     |
    #+------------------------------+
    #|[{1, 10, Texmex}, {1, 5, Tex}]|
    #+------------------------------+
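
A note on versions, since the question mentions Spark 3.0.3: Column.withField and the Python transform function used above were only added in PySpark 3.1. That is also the likely reason for the TypeError itself: on older versions, Column.__getattr__ treats .withField as a nested-field lookup and returns a Column, which then fails when called. The SQL transform function, however, has been available since Spark 2.4, so on 3.0.x the same rewrite can be routed through F.expr. A minimal sketch, assuming the df from the example above (df_30 is just an illustrative name):

    from pyspark.sql import functions as F

    # Sketch for Spark 3.0.x, where the Python transform/withField APIs are
    # unavailable: express the same per-element rewrite as a SQL lambda.
    df_30 = df.withColumn(
        "shop_time",
        F.expr("transform(shop_time, x -> struct(x.seconds, x.shop, 1 as diff_sec))"),
    )
    df_30.printSchema()  # shop_time elements now carry the extra diff_sec field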

Answer 2

Score: 1


Your issue is that you're using the withField method on a column (your shops column) that is of type ArrayType and not of StructType.

You can fix this by using the transform function from pyspark.sql.functions. From the docs:

> Returns an array of elements after applying a transformation to each element in the input array.

So let's first create some input data:

    from pyspark.sql.types import StringType, StructType, StructField, ArrayType, DoubleType
    from pyspark.sql import functions as F

    schema = StructType(
        [
            StructField(
                "shops",
                ArrayType(
                    StructType(
                        [
                            StructField("epoch", DoubleType()),
                            StructField("request", StringType()),
                        ]
                    )
                ),
            )
        ]
    )
    df = spark.createDataFrame(
        [
            [[(5.0, "haha")]],
            [[(6.0, "hoho")]],
        ],
        schema=schema,
    )

And now use the transform function to apply your withField operation on each element of the shops column.

    new_df = df.withColumn(
        "state", F.transform(F.col("shops"), lambda x: x.withField("a", F.lit(1)))
    )

    >>> new_df.printSchema()
    root
     |-- shops: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- epoch: double (nullable = true)
     |    |    |-- request: string (nullable = true)
     |-- state: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- epoch: double (nullable = true)
     |    |    |-- request: string (nullable = true)
     |    |    |-- a: integer (nullable = false)
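
If the goal is to overwrite shops in place rather than add a separate state column, the same transform can simply be written back to the original column name. A small variation on the snippet above (new_df2 is just an illustrative name):

    new_df2 = df.withColumn(
        "shops", F.transform(F.col("shops"), lambda x: x.withField("a", F.lit(1)))
    )
    new_df2.printSchema()  # shops elements now include the added integer field a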
