March 31, 2023, 16:48:55

Unable to write Spark DataFrame: "Found nested NullType in column 'colname' which is of ArrayType"

Question

I have a pandas DataFrame named df in which a few of the columns contain lists of strings.

    id    colname    colname1
    a1    []         []
    a2    []         []
    a3    []         ['anc', 'asf']

I want to write it to a Delta table. According to the table's schema, colname and colname1 both have the data type array<string>.

As you can see, colname contains no data, so when I try to write the DataFrame to the table I get this error:

    AnalysisException: Found nested NullType in column 'colname' which is of ArrayType. Delta doesn't support writing NullType in complex types.

This is the code I use to write to the table:

    spark_df = spark.createDataFrame(df)
    spark_df.write.mode("append").option("overwriteSchema", "true").saveAsTable("dbname.tbl_name")

I have searched extensively but could not find a solution.

What can I do so that the data is inserted successfully even when the colname column is entirely empty, as it is here?

Answer 1

Score: 1

If a column contains only empty arrays, Spark cannot tell whether the elements are ints, strings, or something else, so it infers an array of NullType.
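The idea can be illustrated with a toy model in plain Python. This is only a sketch of the reasoning, not Spark's actual inference code; the function name infer_element_type and the "null" and "mixed" labels are made up for illustration.

```python
def infer_element_type(column_values):
    """Toy inference: name the element type seen across all lists in a column."""
    seen = {type(item).__name__ for cell in column_values for item in cell}
    if not seen:
        return "null"  # no elements anywhere, so there is nothing to infer from
    if len(seen) == 1:
        return next(iter(seen))
    return "mixed"


# The two list columns from the question.
colname = [[], [], []]
colname1 = [[], [], ["anc", "asf"]]

print(infer_element_type(colname))   # null
print(infer_element_type(colname1))  # str
```

A single non-empty row is enough to settle the type for colname1, while colname offers no evidence at all.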

Provide the schema explicitly when creating the DataFrame:

    from pyspark.sql.types import ArrayType, StringType, StructField, StructType

    # Declare the element type so empty lists are not inferred as arrays of NullType.
    schema = StructType([
        StructField("id", StringType(), True),
        StructField("colname", ArrayType(StringType()), True),
        StructField("colname1", ArrayType(StringType()), True),
    ])
    spark_df = spark.createDataFrame(df, schema)

Answer 2

Score: 0

You can also cast the array column before writing it to Delta. With this approach you don't have to define the whole schema up front.

    from pyspark.sql import functions as sf
    from pyspark.sql.types import ArrayType, StringType

    # spark_df is the Spark DataFrame created from the pandas df in the question.
    spark_df = spark_df.withColumn("colname", sf.col("colname").cast(ArrayType(StringType())))
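Either fix requires knowing which columns are affected. As a rough sketch, the check below flags columns that hold nothing but empty lists, working on plain Python records so it runs without Spark. The helper name all_empty_list_columns is hypothetical, and it assumes the rows are available as dictionaries (with pandas, something like df.to_dict("records") would produce that shape).

```python
def all_empty_list_columns(records):
    """Return names of columns whose values are all empty lists (or None)."""
    if not records:
        return []
    flagged = []
    for name in records[0]:
        values = [row.get(name) for row in records]
        has_list = any(isinstance(v, list) for v in values)
        all_empty = all(
            v is None or (isinstance(v, list) and len(v) == 0) for v in values
        )
        # Only flag list-valued columns that never contain an element.
        if has_list and all_empty:
            flagged.append(name)
    return flagged


# Same shape as the data in the question.
records = [
    {"id": "a1", "colname": [], "colname1": []},
    {"id": "a2", "colname": [], "colname1": []},
    {"id": "a3", "colname": [], "colname1": ["anc", "asf"]},
]

print(all_empty_list_columns(records))  # ['colname']
```

The flagged names are the columns to cover in an explicit schema or to cast before writing.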
