Python unable to write 3 million records to a .xpt SAS dataset

Question

I have a pandas DataFrame with 3 million records, but I cannot write it to a .xpt file. The same script, unmodified, works for other DataFrames with 2 million records.

I write the file with the df.from_dataframe() function. A couple of minutes into the write, the process stops and the console prints "KILLED".

  • System - RHEL Linux
  • Python - 3.7
  • Pandas version - 1.2
  • Using the xport module with .v56 subpackage while writing.

I need guidance on the following:

  • Is this a memory leak?
  • Could there be bad data in the DataFrame?

What is the correct approach to debugging a single built-in function?

I am not sure how writing a pandas DataFrame to a SAS dataset in .xpt format is supposed to work.
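
On Linux, a bare "Killed" message with no Python traceback usually means the process received SIGKILL, and the kernel's out-of-memory killer is a common sender. That would point to memory exhaustion rather than bad data, though this is an inference from the symptom, not something confirmed for this script. One way to investigate a single call is to run it on progressively larger slices and record the peak memory each time. The sketch below does that with the standard library's tracemalloc. build_rows is a hypothetical stand-in for the real export step, and tracemalloc only sees memory allocated through Python's allocators.

```python
import tracemalloc


def peak_python_memory(func, *args, **kwargs):
    """Run func and return (result, peak bytes traced during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_rows(n):
    # Hypothetical stand-in for the real export step; replace it with
    # the call that writes the .xpt file for the first n records.
    return [(i, str(i)) for i in range(n)]


if __name__ == "__main__":
    # Double the input size each time and watch how the peak grows.
    for n in (10_000, 20_000, 40_000):
        _, peak = peak_python_memory(build_rows, n)
        print(f"{n:>6} rows -> peak {peak / 1e6:.1f} MB")
```

If the peak grows much faster than the row count, or approaches the machine's available RAM well before 3 million rows, memory is the likely cause.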


Answer 1

Score: 1


A related error was reported on the xport GitHub page and fixed in a version of xport that is, unfortunately, not compatible with pandas==1.2.4.

I ran the code below with pandas==1.2.4 and xport==3.2.1 (the highest version of xport compatible with this pandas version), and it raised the following error: NotImplementedError: Can't copy SAS variable metadata to dataframe

After upgrading to pandas==1.3.5 and xport==3.6.1, the code ran without errors. The good news is that pandas==1.3.5 still supports Python 3.7.1 and above, according to its official documentation.
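
Before rerunning the export, it may help to confirm which versions are actually installed in the environment. The following is a minimal sketch, not part of the original answer. It treats the combination reported to work above (pandas 1.3.5, xport 3.6.1) as a minimum, which is an assumption, and it relies on importlib.metadata (Python 3.8+) or the importlib_metadata backport on Python 3.7. The helper names are illustrative.

```python
import re

try:
    from importlib import metadata  # standard library from Python 3.8
except ImportError:
    import importlib_metadata as metadata  # backport, needed on Python 3.7

# Assumption: the versions reported to work in the answer are used as minimums.
MINIMUMS = {"pandas": "1.3.5", "xport": "3.6.1"}


def version_tuple(version):
    """Turn '1.3.5' into (1, 3, 5), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_versions(minimums=MINIMUMS):
    """Return {package: (installed version or None, meets minimum?)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions().items():
        print(f"{name}: installed={installed} ok={ok}")
```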

Code (example code from xport documentation with some adaptations):

    import pandas as pd
    import xport
    import xport.v56

    # Create two lists with 3 million items each.
    alpha_list = [10, 20, 30] * 1_000_000
    beta_list = ['x', 'y', 'z'] * 1_000_000

    df = pd.DataFrame({
        'alpha': alpha_list,
        'beta': beta_list
    })

    ds = xport.Dataset(df, name='DATA', label='Wonderful data')

    # Variable names are limited to 8 characters in SAS transport
    # files (.xpt). In SAS datasets, variable names may be up to
    # 32 characters. As with Pandas dataframes, you must change the
    # name on the dataset rather than the column directly.
    ds = ds.rename(columns={k: k.upper()[:8] for k in ds})

    # Other SAS metadata can be set on the columns themselves.
    for k, v in ds.items():
        v.label = k.title()
        if v.dtype == 'object':
            v.format = '$CHAR20.'
        else:
            v.format = '10.2'

    # Libraries can have multiple datasets.
    library = xport.Library({'DATA': ds})

    with open('example.xpt', 'wb') as f:
        xport.v56.dump(library, f)

    # Load the exported file to check that everything is working.
    df = pd.read_sas('example.xpt')
    print(df.head(n=10).to_markdown(index=False))

prints:

    |   ALPHA | BETA   |
    |--------:|:-------|
    |      10 | x      |
    |      20 | y      |
    |      30 | z      |
    |      10 | x      |
    |      20 | y      |
    |      30 | z      |
    |      10 | x      |
    |      20 | y      |
    |      30 | z      |
    |      10 | x      |
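
The rename step in the code cuts every column name to its first 8 upper-cased characters. With only two short columns that is harmless, but on a wider DataFrame two long names can collapse into the same short name. The sketch below is an illustrative check that can be run on the column names before building the dataset. The helper names and the sample columns are hypothetical and not part of xport.

```python
from collections import defaultdict


def xpt_names(columns, limit=8):
    """Map each column to its upper-cased, truncated name (k.upper()[:8])."""
    return {col: col.upper()[:limit] for col in columns}


def find_collisions(columns, limit=8):
    """Return {short_name: [original columns]} for names used more than once."""
    groups = defaultdict(list)
    for col, short in xpt_names(columns, limit).items():
        groups[short].append(col)
    return {short: cols for short, cols in groups.items() if len(cols) > 1}


# Hypothetical column names; the last two share their first 8 characters.
columns = ["alpha", "beta", "measurement_start", "measurement_stop"]
print(find_collisions(columns))
# {'MEASUREM': ['measurement_start', 'measurement_stop']}
```

An empty result means the truncated names are unique. Otherwise the listed columns need distinct short names chosen by hand.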

huangapple
  • Published on May 14, 2023, 02:11:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76244241.html