June 8, 2023 12:14:51

How to display the size of each record of a PySpark Dataframe?

Question

We read a Parquet file into a PySpark DataFrame and load it into Synapse. Apparently, our DataFrame has records that exceed the 1 MB row limit on Synapse (PolyBase). Our Databricks ingestion scripts keep throwing the error below:

The size of the schema/row at ordinal 'n' exceeds the maximum allowed row size of 1000000 bytes.

I'm trying to find out which row in my dataframe has this issue but I'm unable to identify the faulty row.

I was able to print the length of each column of a dataframe but how do I print the size of each record?

Is there a way to do this? Can someone please help?

Answer 1

Score: 0

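Use the code below to get the size of each row.

# collect() brings every row to the driver, so use this only on data that fits in memory.
rows = df.collect()
for i, rw in enumerate(rows):
    # Convert each value to a string first; join() fails on non-string columns.
    text = ''.join(str(v) for v in rw if v is not None)
    # Report the UTF-8 encoded length rather than the Python object size.
    print(f"row {i}: {len(text.encode('utf-8'))} bytes")

This gives you the approximate size of each row in bytes.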

Once you have the sizes, look for the records that come close to or exceed the 1,000,000-byte limit.
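If the DataFrame is too large to collect in one go, the same idea can be written as a pair of small helpers that work on any iterable of rows. This is only a sketch: `row_size_bytes` and `oversized_rows` are hypothetical names, not part of PySpark, and the size they compute (the UTF-8 length of each value's string form) is an approximation that may not match how PolyBase itself measures a row.

```python
def row_size_bytes(row):
    """Approximate a row's size as the UTF-8 byte length of its values."""
    total = 0
    for value in row:
        if value is None:
            # NULLs contribute nothing in this approximation.
            continue
        if isinstance(value, (bytes, bytearray)):
            # Binary columns: count the raw bytes.
            total += len(value)
        else:
            # Everything else: count the bytes of its string form.
            total += len(str(value).encode("utf-8"))
    return total


def oversized_rows(rows, limit=1_000_000):
    """Yield (index, size) for every row whose approximate size exceeds limit."""
    for index, row in enumerate(rows):
        size = row_size_bytes(row)
        if size > limit:
            yield index, size
```

Because a PySpark `Row` behaves like a tuple, something like `for index, size in oversized_rows(df.toLocalIterator()): print(index, size)` should stream rows to the driver one partition at a time instead of loading them all with `collect()`. The reported index is the position in iteration order, not a stable row identifier, so it helps to print a key column alongside it.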
