Python unable to write 3 million records to .xpt sas dataset
Question
I have a pandas DataFrame with 3 million records, but I am not able to write it to an .xpt file. The same script, unmodified, works for other DataFrames with 2 million records.
To write the file I am using the df.from_dataframe() function. A couple of minutes into the write, I get a "KILLED" message on the console.
- System: RHEL Linux
- Python: 3.7
- Pandas version: 1.2
- Writing with the xport module and its .v56 subpackage.
I need guidance in the following areas:
- Is this a case of a memory leak?
- Is there any chance of bad data in the DataFrame?
What is the correct approach to debugging a single built-in function?
I am not sure how writing a pandas DataFrame to a SAS dataset in .xpt format works.
Answer 1
Score: 1
A related error was reported on the GitHub page of xport and was resolved in a version of xport which is, unfortunately, not compatible with pandas==1.2.4.
I ran the code below with pandas==1.2.4 and xport==3.2.1 (the highest version of xport compatible with this pandas version), which gave me the following error: NotImplementedError: Can't copy SAS variable metadata to dataframe
After upgrading to pandas==1.3.5 and xport==3.6.1, the code worked like a charm. Here's the good news: according to its official documentation, pandas==1.3.5 still works on Python 3.7.1 and above.
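To check quickly whether an environment is below the combination that worked for me, a small helper along these lines may be useful. It is only a sketch: the names version_tuple, outdated and REPORTED_WORKING are made up for illustration, and treating the versions above as minimums is my assumption, not something the xport or pandas documentation states.

```python
from itertools import takewhile

# Versions reported to work in this answer; using them as minimums is an assumption.
REPORTED_WORKING = {"pandas": "1.3.5", "xport": "3.6.1"}


def version_tuple(text):
    """Turn '1.3.5' into (1, 3, 5); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def outdated(installed):
    """Return the package names whose installed version is below REPORTED_WORKING."""
    return [
        name
        for name, wanted in REPORTED_WORKING.items()
        if version_tuple(installed.get(name, "0")) < version_tuple(wanted)
    ]


if __name__ == "__main__":
    # The version strings here are the ones from the failing setup described above.
    print(outdated({"pandas": "1.2.4", "xport": "3.2.1"}))
```

The version strings can be read from pip list or pip show; for pandas, pd.__version__ gives the same information.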
Code (example code from the xport documentation, with some adaptations):
import pandas as pd
import xport
import xport.v56

# Create two lists with 3 million items each.
alpha_list = [10, 20, 30] * 1_000_000
beta_list = ['x', 'y', 'z'] * 1_000_000

df = pd.DataFrame({
    'alpha': alpha_list,
    'beta': beta_list
})

ds = xport.Dataset(df, name='DATA', label='Wonderful data')

# Variable names are limited to 8 characters in SAS transport
# files (.xpt). In SAS datasets, variable names may be up to
# 32 characters. As with Pandas dataframes, you must change the
# name on the dataset rather than the column directly.
ds = ds.rename(columns={k: k.upper()[:8] for k in ds})

# Other SAS metadata can be set on the columns themselves.
for k, v in ds.items():
    v.label = k.title()
    if v.dtype == 'object':
        v.format = '$CHAR20.'
    else:
        v.format = '10.2'

# Libraries can have multiple datasets.
library = xport.Library({'DATA': ds})

with open('example.xpt', 'wb') as f:
    xport.v56.dump(library, f)

# Load the exported file to check that everything is working.
# Note: DataFrame.to_markdown() needs the tabulate package.
df = pd.read_sas('example.xpt')
print(df.head(n=10).to_markdown(index=False))
prints:
| ALPHA | BETA |
|--------:|:-------|
| 10 | x |
| 20 | y |
| 30 | z |
| 10 | x |
| 20 | y |
| 30 | z |
| 10 | x |
| 20 | y |
| 30 | z |
| 10 | x |
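On the memory question: a "KILLED" message on Linux is often the kernel's out-of-memory killer rather than bad data, although I can't confirm that for this case. One rough way to investigate is to measure the peak memory of the process around the write. The sketch below uses only the standard library (the resource module, available on Linux and macOS); peak_rss_mib and run_and_report are names I made up for illustration.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process in MiB.

    ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_and_report(func, *args, **kwargs):
    """Call func and report how much the peak RSS grew while it ran."""
    before = peak_rss_mib()
    result = func(*args, **kwargs)
    after = peak_rss_mib()
    print(f"peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
    return result, after - before


if __name__ == "__main__":
    # Stand-in workload; replace it with the function that writes the .xpt file.
    _, growth = run_and_report(lambda: [0] * 5_000_000)
    print(f"growth: {growth:.1f} MiB")
```

Wrapping the write in run_and_report and trying it with 1 million and then 2 million rows should show how the peak grows with the row count. If the projected peak for 3 million rows approaches the RAM of the machine, running out of memory is the more likely explanation, and the kernel log (for example via dmesg) would normally contain a matching out-of-memory entry.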