Not able to write spark dataframe. Error Found nested NullType in column 'colname' which is of ArrayType

Question


I have a pandas DataFrame named df, where a few of the columns contain lists of strings:

    id    colname    colname1
    a1    []         []
    a2    []         []
    a3    []         ['anc', 'asf']

I want to write it into a Delta table. According to the table's schema, the datatype of colname and colname1 is array<string>.

But as you can see, colname doesn't contain any data, so when I try to write it into the table, I get this error:

    AnalysisException: Found nested NullType in column 'colname' which is of ArrayType. Delta doesn't support writing NullType in complex types.

This is the code for writing it to the table:

    spark_df = spark.createDataFrame(df)
    spark_df.write.mode("append").option("overwriteSchema", "true").saveAsTable("dbname.tbl_name")

I tried searching everywhere but didn't find a solution.

What can I do so that the data is inserted into the table successfully even if the colname column is entirely empty (as in this case)?

Answer 1

Score: 1

If your column contains only empty arrays, Spark cannot tell whether it is an array of ints, strings, or something else, so it falls back to an array of nulls.

Provide the schema explicitly when creating the DataFrame:

    from pyspark.sql.types import *
    schema = StructType([
               StructField("id", StringType(), True),
               StructField("colname", ArrayType(StringType()), True),
               StructField("colname1", ArrayType(StringType()), True)
             ])
    spark_df = spark.createDataFrame(df, schema)

Answer 2

Score: 0

You can also just cast the array column before writing it to Delta. With this solution you don't have to define the whole schema up front (the cast is applied to the Spark DataFrame, spark_df in the question's code):

    from pyspark.sql import functions as sf
    from pyspark.sql.types import StringType, ArrayType
    spark_df = spark_df.withColumn("colname", sf.col("colname").cast(ArrayType(StringType())))

huangapple
  • Published on 2023-03-31 16:48:55
  • Please keep this link when reposting: https://go.coder-hub.com/75896561.html