Not able to write Spark DataFrame: "Found nested NullType in column 'colname' which is of ArrayType"



Hi, I have a pandas DataFrame named df in which a few of the columns contain lists of strings.

    id  colname  colname1
    a1  []       []
    a2  []       []
    a3  []       ['anc', 'asf']

I want to write it to a Delta table. According to the table's schema, colname and colname1 are both of type array<string>.

But as you can see, colname doesn't contain any data, so when I try to write it to the table, I get this error:

    AnalysisException: Found nested NullType in column 'colname' which is of ArrayType. Delta doesn't support writing NullType in complex types.

This is the code I use to write it to the table:

    spark_df = spark.createDataFrame(df)
    spark_df.write.mode("append").option("overwriteSchema", "true").saveAsTable("dbname.tbl_name")

I have searched everywhere but couldn't find a solution.

What can I do so that the data is inserted successfully even when the colname column is entirely empty (as in this case)?


Score: 1

If a column contains only empty arrays, Spark cannot tell whether it should be an array of ints, strings, or anything else, so it falls back to treating it as an array of nulls (element type NullType), which Delta refuses to write.

Provide the schema explicitly when creating the DataFrame:

    from pyspark.sql.types import ArrayType, StringType, StructField, StructType

    # Declare the array element type up front so nothing has to be inferred
    schema = StructType([
        StructField("id", StringType(), True),
        StructField("colname", ArrayType(StringType()), True),
        StructField("colname1", ArrayType(StringType()), True),
    ])
    spark_df = spark.createDataFrame(df, schema)


Score: 0

You can also just cast the array column before writing it to Delta. With this approach you don't have to define the whole schema up front.

    from pyspark.sql import functions as sf
    from pyspark.sql.types import StringType, ArrayType

    # spark_df is the Spark DataFrame from the question, not the pandas df
    spark_df = spark_df.withColumn("colname", sf.col("colname").cast(ArrayType(StringType())))

