TypeError: 'Column' object is not callable when adding column to Struct
Question
I was implementing the answer mentioned here.
This is my struct, and I want to add a new column to it:
root
|-- shops: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- epoch: double (nullable = true)
| | |-- request: string (nullable = true)
So I executed this:
from pyspark.sql import functions as F
df = new_df.withColumn('state', F.col('shops').withField('a', F.lit(1)))
df.printSchema()
But I get this error:
TypeError Traceback (most recent call last)
<ipython-input-47-1749b2131995> in <module>
1 from pyspark.sql import functions as F
----> 2 df = new_df.withColumn('state', F.col('shops').withField('a', F.lit(1)))
3 df.printSchema()
TypeError: 'Column' object is not callable
EDIT: My versions are Python 3.9 and Spark 3.0.3 (the maximum possible).
Answer 1
Score: 2
Try the transform higher-order function, since you are trying to add a new field to the structs inside an array.
Example:
from pyspark.sql.functions import *
jsn_str = """{"shop_time":[{"seconds":10,"shop":"Texmex"},{"seconds":5,"shop":"Tex"}]}"""
df = spark.read.json(sc.parallelize([jsn_str]), multiLine=True)
df.\
withColumn("shop_time", transform('shop_time', lambda x: x.withField('diff_sec', lit(1)))).\
show(10,False)
#+------------------------------+
#|shop_time |
#+------------------------------+
#|[{10, Texmex, 1}, {5, Tex, 1}]|
#+------------------------------+
df.withColumn("shop_time", transform('shop_time', lambda x: x.withField('diff_sec', lit(1)))).\
printSchema()
#root
# |-- shop_time: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- seconds: long (nullable = true)
# | | |-- shop: string (nullable = true)
# | | |-- diff_sec: integer (nullable = false)
UPDATE: Using Spark SQL:
df.createOrReplaceTempView("tmp")
spark.sql("select transform(shop_time,x -> struct(1 as diff_sec, x.seconds,x.shop)) as shop_time from tmp").\
show(10,False)
#+------------------------------+
#|shop_time |
#+------------------------------+
#|[{1, 10, Texmex}, {1, 5, Tex}]|
#+------------------------------+
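A note on versions, since the question mentions Spark 3.0.3: both Column.withField and the Python-side transform function used above were only added in Spark 3.1, while the SQL higher-order function transform has been available since Spark 2.4. So on 3.0.x you can call the same SQL expression through expr, without creating a temp view. A minimal sketch, assuming the same df as above:

from pyspark.sql.functions import expr

# Same rewrite as the spark.sql version, but inline via expr();
# the transform(...) lambda here is parsed by the SQL engine,
# so it also works on Spark 3.0.x.
df.withColumn(
    "shop_time",
    expr("transform(shop_time, x -> struct(1 as diff_sec, x.seconds, x.shop))"),
).show(10, False)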
Answer 2
Score: 1
Your issue is that you're using the withField method on a column (your shops column) that is of type ArrayType and not of StructType.
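As a side note on why the error says 'Column' object is not callable rather than complaining about the array type: Column.withField only exists from Spark 3.1 onward, and on the Spark 3.0.3 mentioned in the question PySpark treats an unknown attribute on a Column as a nested-field lookup. A minimal sketch of that fallback, assuming a pre-3.1 version:

from pyspark.sql import functions as F

# On Spark < 3.1, Column has no withField method, and Column.__getattr__
# interprets the unknown name as a struct-field reference instead:
c = F.col("shops").withField   # c is a Column (a field named "withField")
c("a", F.lit(1))               # calling a Column raises:
                               # TypeError: 'Column' object is not callable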
You can fix this by using pyspark.sql.functions's transform function. From the docs:
> Returns an array of elements after applying a transformation to each element in the input array.
So let's first create some input data:
from pyspark.sql.types import StringType, StructType, StructField, ArrayType, DoubleType
from pyspark.sql import functions as F
schema = StructType(
[
StructField(
"shops",
ArrayType(
StructType(
[
StructField("epoch", DoubleType()),
StructField("request", StringType()),
]
)
),
)
]
)
df = spark.createDataFrame(
[
[[(5.0, "haha")]],
[[(6.0, "hoho")]],
],
schema=schema,
)
And now use the transform function to apply your withField operation on each element of the shops column:
new_df = df.withColumn(
"state", F.transform(F.col("shops"), lambda x: x.withField("a", F.lit(1)))
)
>>> new_df.printSchema()
root
|-- shops: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- epoch: double (nullable = true)
| | |-- request: string (nullable = true)
|-- state: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- epoch: double (nullable = true)
| | |-- request: string (nullable = true)
| | |-- a: integer (nullable = false)
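If the goal is to modify shops in place rather than add a separate state column, the same expression can simply overwrite the original column name. A minimal sketch reusing the df defined above:

# Overwrite the existing column instead of creating a new one
new_df = df.withColumn(
    "shops", F.transform(F.col("shops"), lambda x: x.withField("a", F.lit(1)))
)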