2023-05-14 02:11:39

Python unable to write 3 million records to an .xpt SAS dataset

Question

I have a pandas DataFrame with 3 million records, but I cannot write it to an .xpt file. The same script, unmodified, works for other DataFrames with 2 million records.

I am writing with the df.from_dataframe() function. A couple of minutes into writing the .xpt file, the process stops and "KILLED" is printed on the console.

System: RHEL Linux
Python: 3.7
pandas: 1.2
Writing with the xport module and its v56 subpackage.

I need guidance on the following:

Is this a memory leak?
Could there be bad data in the DataFrame?
What is the correct approach to debugging a single built-in function?

I am not sure how writing a pandas DataFrame to a SAS dataset in .xpt format works.

Answer 1

Score: 1

A related error was reported on the xport GitHub page and was fixed in a version of xport that is, unfortunately, not compatible with pandas==1.2.4.

I ran the code below with pandas==1.2.4 and xport==3.2.1 (the highest xport version compatible with this pandas version), which raised the following error: NotImplementedError: Can't copy SAS variable metadata to dataframe

After upgrading to pandas==1.3.5 and xport==3.6.1, the code ran without errors. The good news: according to its official documentation, pandas==1.3.5 still supports Python 3.7.1 and above.

Code (the example from the xport documentation, with some adaptations):
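Before rerunning the export, it may help to confirm that the environment actually has the versions reported as working above. The following is a minimal sketch using only the standard library; the helper names (version_tuple, meets_minimum, check_versions) are hypothetical, and the version strings are passed in by hand rather than read from the installed packages.

```python
import re


def version_tuple(version):
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version found in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum` (numeric components only)."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "1.3" compares equal to "1.3.0".
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Versions the answer reports as working together.
REPORTED_WORKING = {"pandas": "1.3.5", "xport": "3.6.1"}


def check_versions(installed_versions):
    """Map each package name to whether it meets the reported version."""
    return {
        name: meets_minimum(installed_versions[name], minimum)
        for name, minimum in REPORTED_WORKING.items()
    }


# The combination that failed in the answer's first attempt.
print(check_versions({"pandas": "1.2.4", "xport": "3.2.1"}))
```

In a real session the pandas version string could come from pd.__version__, and the xport version from the output of pip show xport.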

import pandas as pd
import xport
import xport.v56

# Create two lists with 3 million items each.
alpha_list = [10, 20, 30] * 1_000_000
beta_list = ['x', 'y', 'z'] * 1_000_000
df = pd.DataFrame({
    'alpha': alpha_list,
    'beta': beta_list
})
ds = xport.Dataset(df, name='DATA', label='Wonderful data')

# Variable names are limited to 8 characters in SAS transport
# files (.xpt). In SAS datasets, variable names may be up to
# 32 characters. As with pandas DataFrames, you must change the
# name on the dataset rather than on the column directly.
ds = ds.rename(columns={k: k.upper()[:8] for k in ds})

# Other SAS metadata can be set on the columns themselves.
for k, v in ds.items():
    v.label = k.title()
    if v.dtype == 'object':
        v.format = '$CHAR20.'
    else:
        v.format = '10.2'

# Libraries can have multiple datasets.
library = xport.Library({'DATA': ds})
with open('example.xpt', 'wb') as f:
    xport.v56.dump(library, f)

# Load the exported file to check that everything is working.
# Note: DataFrame.to_markdown requires the optional tabulate package.
df = pd.read_sas('example.xpt')
print(df.head(n=10).to_markdown(index=False))

prints:

|   ALPHA | BETA   |
|--------:|:-------|
|      10 | x      |
|      20 | y      |
|      30 | z      |
|      10 | x      |
|      20 | y      |
|      30 | z      |
|      10 | x      |
|      20 | y      |
|      30 | z      |
|      10 | x      |
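
On the question of whether memory is involved: a bare "KILLED" message on Linux commonly means the process received SIGKILL, often from the kernel's out-of-memory killer. That is an assumption here, since the question does not include system logs. One way to investigate a single function is to measure its peak memory on progressively larger inputs and see how it scales. The sketch below uses the standard library's tracemalloc; run_with_peak_memory and build_rows are hypothetical names, and build_rows is only a stand-in for the real export call.

```python
import tracemalloc


def run_with_peak_memory(func, *args, **kwargs):
    """Call func and return (result, peak_bytes) as traced by tracemalloc."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_rows(n):
    """Hypothetical stand-in for the export step being investigated."""
    return [(i, str(i)) for i in range(n)]


# Measure on growing inputs to see how peak memory scales with row count.
for n in (10_000, 100_000):
    rows, peak = run_with_peak_memory(build_rows, n)
    print(f"{n:>7} rows: peak {peak / 1e6:.1f} MB")
```

If the peak grows roughly in proportion to the row count, a 3-million-row export may simply need more memory than the machine has, which would point away from bad data. tracemalloc only sees allocations made through Python's memory allocators, so the numbers should be read as a lower bound.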
