Python program to load only a certain number of megabytes of an Excel file into the dataframe and convert it to string
Question
I am pretty new to Python and I was going through some of the uses of the pandas library. However, I could not find a way to load only part of an Excel file into memory and play with it. For example, if I set the memory limit as 1MB, the program should be able to read the first 1MB of an Excel file whose size is larger than 1MB.
From the answer mentioned here, I see an option to load a certain number of rows. But I would not know the number of rows in the input file. Also, I do not know how many bytes of data have been read by that code.
Is there a way to load rows iteratively, where the number of bytes read can be calculated in each iteration and cumulatively summed?
Answer 1
Score: 1
1.) conversion factor
"Taste" some example data near the head of the worksheet, compute an average of how many bytes per row, then use that to predict how many rows fit in your memory budget.
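A minimal sketch of that estimate, assuming openpyxl is installed; rows_per_budget, the 100-row sample size, and the file/sheet names are illustrative choices, and the nrows parameter of pd.read_excel then performs the truncated load:

```python
from sys import getsizeof

import openpyxl
import pandas as pd


def rows_per_budget(filespec, sheet: str, budget: int, sample: int = 100) -> int:
    """Taste the first `sample` rows to estimate how many rows fit in `budget` bytes."""
    wb = openpyxl.load_workbook(filespec, read_only=True, data_only=True)
    try:
        rows = list(wb[sheet].iter_rows(max_row=sample, values_only=True))
    finally:
        wb.close()
    if not rows:
        return 0
    bytes_per_row = sum(sum(map(getsizeof, row)) for row in rows) / len(rows)
    return int(budget / bytes_per_row)


n = rows_per_budget("big.xlsx", "Sheet1", budget=1_000_000)
df = pd.read_excel("big.xlsx", sheet_name="Sheet1", nrows=n)
```

The estimate is only as good as the sample: rows with unusually long strings further down the sheet will blow the budget.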
2.) polars
The polars project has a heavy emphasis on "use less RAM!" and on rapid I/O. A convenient .to_pandas() method makes it trivially easy to convert a polars DataFrame to your favorite format. Consider doing the filtering in polars and handing off the result to pandas, formatted as the rest of your app expects it.
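A sketch of that hand-off, assuming polars is installed together with one of its Excel engines (pl.read_excel delegates the parsing) and pyarrow for .to_pandas(); the n_rows cutoff is a placeholder for whatever estimate you computed:

```python
import polars as pl

n_rows = 10_000  # placeholder: e.g. the rows_per_budget() estimate sketched above

df = (
    pl.read_excel("big.xlsx", sheet_name="Sheet1")  # fast, memory-frugal load
    .head(n_rows)                                   # keep only the budgeted rows
    .to_pandas()                                    # format the rest of the app expects
)
```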
3.) generator
For CSV this is easy, and it definitely won't do extra mallocs. For other formats we might do an allocation for the entire sheet, but then we can definitely avoid pandas allocations for unwanted rows. We will use a dict reader, plus a generator for early termination.
```python
from pathlib import Path
from sys import getsizeof

import openpyxl_dictreader
import pandas as pd


def read_initial(budget: int, filespec: Path, sheet: str):
    """Yield rows until roughly `budget` bytes have been read."""
    size = 0
    reader = openpyxl_dictreader.DictReader(
        filespec, sheet, read_only=True, data_only=True)
    for row in reader:
        # Tally a shallow byte estimate for this row's keys and values.
        size += (sum(map(getsizeof, row.values()))
                 + sum(map(getsizeof, row.keys())))
        if size > budget:
            break
        yield row


# filespec and sheet are the caller's workbook path and sheet name.
df = pd.DataFrame(read_initial(1_000_000, filespec, sheet))
```
Feel free to use a fancier cost estimate if the accuracy of this shallow getsizeof tally isn't to your liking.
Consider converting *.xlsx files to a more stream-friendly format like .csv.
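Once the data is in .csv form, the same byte-budget idea needs no sheet-sized allocation at all. A sketch, with read_csv_initial as a hypothetical analogue of read_initial above:

```python
import csv
from sys import getsizeof


def read_csv_initial(budget: int, filespec):
    """Stream CSV rows, stopping once roughly `budget` bytes have been read."""
    size = 0
    with open(filespec, newline="") as f:
        for row in csv.DictReader(f):
            size += (sum(map(getsizeof, row.values()))
                     + sum(map(getsizeof, row.keys())))
            if size > budget:
                break
            yield row
```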
We prefer the read_only=True keyword arg so we consume only constant memory despite a large file size. If you're unable to evaluate formulas and essentially wish the Excel file were a CSV file, then supply the data_only=True kwarg, which returns the cached cell values instead of the formulas.