Python unable to write 3 million records to an .xpt SAS dataset

Question

I have a pandas DataFrame with 3 million records, but I am not able to write it to an .xpt file. The same script, without any modification, works for other DataFrames with 2 million records.

To write the file I am using the df.from_dataframe() function. A couple of minutes into writing the .xpt file, I suddenly get a "KILLED" message on the console.

  • System: RHEL Linux
  • Python: 3.7
  • pandas version: 1.2
  • Writing with the xport module and its .v56 subpackage

I need guidance in the following areas:

  • Is this a case of a memory leak?
  • Is there any chance of bad data in the DataFrame?

What is the correct approach to debugging a single built-in function?

I am not sure how writing a pandas DataFrame to a SAS dataset in .xpt format works.

Answer 1

Score: 1

A related error was reported on the xport GitHub page and was fixed in a version of xport that is, unfortunately, not compatible with pandas==1.2.4.

I ran the code below with pandas==1.2.4 and xport==3.2.1 (the highest xport version compatible with this pandas version), which gave me the following error: NotImplementedError: Can't copy SAS variable metadata to dataframe

After upgrading to pandas==1.3.5 and xport==3.6.1, the code worked like a charm. Here's the good news: according to its official documentation, pandas==1.3.5 still works on Python 3.7.1 and above.
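
Before re-running the export, it can help to confirm which versions are actually installed in the environment. The following is a minimal sketch using only the standard library. `installed_version` is a hypothetical helper name, and because `importlib.metadata` only joined the standard library in Python 3.8, the sketch assumes the `importlib_metadata` backport is available on Python 3.7.

```python
import sys

try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:
    # Assumption: on Python 3.7 the importlib_metadata backport is installed.
    from importlib_metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("Python:", sys.version.split()[0])
for name in ("pandas", "xport"):
    print(name + ":", installed_version(name) or "not installed")
```

If the printed versions differ from the pair that worked above (pandas 1.3.5 with xport 3.6.1), that mismatch is the first thing to rule out.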

Code (example code from the xport documentation, with some adaptations):

import pandas as pd
import xport
import xport.v56

# create two lists with 3 million items each
alpha_list = [10, 20, 30] * 1_000_000
beta_list = ['x', 'y', 'z'] * 1_000_000

df = pd.DataFrame({
    'alpha': alpha_list,
    'beta': beta_list
})


ds = xport.Dataset(df, name='DATA', label='Wonderful data')

# Variable names are limited to 8 characters in SAS transport
# files (.xpt). In SAS datasets, variable names may be up to
# 32 characters. As with pandas DataFrames, you must change the
# name on the dataset rather than on the column directly.
ds = ds.rename(columns={k: k.upper()[:8] for k in ds})

# Other SAS metadata can be set on the columns themselves.
for k, v in ds.items():
    v.label = k.title()
    if v.dtype == 'object':
        v.format = '$CHAR20.'
    else:
        v.format = '10.2'

# Libraries can have multiple datasets.
library = xport.Library({'DATA': ds})

with open('example.xpt', 'wb') as f:
    xport.v56.dump(library, f)
    
# load exported file to check if everything's working
df = pd.read_sas('example.xpt')

# to_markdown() needs the optional "tabulate" package
print(df.head(n=10).to_markdown(index=False))

prints:

|   ALPHA | BETA   |
|--------:|:-------|
|      10 | x      |
|      20 | y      |
|      30 | z      |
|      10 | x      |
|      20 | y      |
|      30 | z      |
|      10 | x      |
|      20 | y      |
|      30 | z      |
|      10 | x      |
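
On the "KILLED" message from the question: on Linux this is typically what the shell prints when a process receives SIGKILL, which the kernel's out-of-memory killer sends when the machine runs out of RAM. Whether that is what happened here is an assumption, not something the question confirms. One way to check is to log the peak resident memory of the process around the export step. The sketch below uses only the standard library (Unix only); `peak_rss_mib` and `report_peak_memory` are hypothetical helper names, and the list allocation is a stand-in for the real `xport.v56.dump(library, f)` call.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def report_peak_memory(step, func, *args, **kwargs):
    """Run func, then print the process peak memory before and after it."""
    before = peak_rss_mib()
    result = func(*args, **kwargs)
    after = peak_rss_mib()
    print(f"{step}: peak RSS {before:.0f} MiB -> {after:.0f} MiB")
    return result


# Stand-in workload (assumption); replace with the real export step.
data = report_peak_memory("build list", lambda: list(range(1_000_000)))
```

If the peak climbs toward the machine's physical RAM shortly before the process dies, memory exhaustion is a more likely cause than bad data in the DataFrame. The kernel log on the host (for example via `dmesg`) usually records out-of-memory kills as well.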

huangapple
  • Posted on May 14, 2023, 02:11:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76244241.html