Reading a large CSV file in chunks with pandas
Question
I am trying to use pandas to read a large CSV file (84 GB) in chunks, filter out the necessary rows, and convert them to a DataFrame.
import pandas as pd

chunk_size = 1000000  # Number of rows to read per chunk
my_df = pd.DataFrame()
i = 1

def convert_data(value):
    try:
        return float(value)
    except:
        return float(0.778)

for chunk in pd.read_csv(path, delimiter='~', dtype={'FIELD': 'object', 'ID_TAXPAYER': 'object', 'PYEAR': 'object'}, usecols=['PYEAR', 'DATA', 'FIELD', 'ID_TAXPAYER'], chunksize=chunk_size, converters={'DATA': convert_data}, engine='python'):
    chunk = chunk[chunk['FIELD'].str.contains("field", na=False)]
    chunk['FIELD'] = [i.replace('field_', '').replace('_', '.') for i in chunk['FIELD']]
    filtered_df = chunk[chunk['FIELD'] == '910.00.001']
    print(i)
    i += 1
    my_df = pd.concat([my_df, filtered_df], ignore_index=True)

# Print the resulting dataframe
print(my_df)
My laptop has 16 GB of RAM and a 3.5 GHz CPU with 4 cores. After running for some time, when the 'i' variable reaches 323, the error below appears. I understand that my RAM isn't enough, but I thought that reading the DataFrame in chunks would solve the problem. I have also noticed that with each loop iteration my memory fills up more and more. I tried using "del chunk", but the error still appears at 323. Any thoughts? Thank you in advance!
MemoryError Traceback (most recent call last)
Cell In[3], line 14
11 return float(0.778)
13 # Iterate over the chunks
---> 14 for chunk in pd.read_csv(path, delimiter='~', dtype={'FIELD': 'object', 'ID_TAXPAYER': 'object', 'PYEAR': 'object'}, usecols=['PYEAR', 'DATA', 'FIELD', 'ID_TAXPAYER'], chunksize=chunk_size, converters={'DATA': convert_data}, engine='python'):
15 chunk = chunk[chunk['FIELD'].str.contains("field", na=False)]
16 chunk['FIELD'] = [i.replace('field_', '').replace('_', '.') for i in chunk['FIELD']]
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1624, in TextFileReader.__next__(self)
1622 def __next__(self) -> DataFrame:
1623 try:
--> 1624 return self.get_chunk()
1625 except StopIteration:
1626 self.close()
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1733, in TextFileReader.get_chunk(self, size)
1731 raise StopIteration
1732 size = min(size, self.nrows - self._currow)
-> 1733 return self.read(nrows=size)
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1704, in TextFileReader.read(self, nrows)
1697 nrows = validate_integer("nrows", nrows)
1698 try:
1699 # error: "ParserBase" has no attribute "read"
1700 (
1701 index,
1702 columns,
1703 col_dict,
-> 1704 ) = self._engine.read( # type: ignore[attr-defined]
1705 nrows
1706 )
1707 except Exception:
1708 self.close()
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\python_parser.py:251, in PythonParser.read(self, rows)
245 def read(
246 self, rows: int | None = None
247 ) -> tuple[
248 Index | None, Sequence[Hashable] | MultiIndex, Mapping[Hashable, ArrayLike]
249 ]:
250 try:
--> 251 content = self._get_lines(rows)
252 except StopIteration:
253 if self._first_chunk:
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\python_parser.py:1114, in PythonParser._get_lines(self, rows)
1110 rows_to_skip = 0
1111 if self.skiprows is not None and self.pos is not None:
1112 # Only read additional rows if pos is in skiprows
1113 rows_to_skip = len(
-> 1114 set(self.skiprows) - set(range(self.pos))
1115 )
1117 for _ in range(rows + rows_to_skip):
1118 # assert for mypy, data is Iterator[str] or None, would
1119 # error in next
1120 assert self.data is not None
MemoryError:
Answer 1
Score: 1
You are doing it right.
The only thing I would try is changing the line:
filtered_df = chunk[chunk['FIELD'] == '910.00.001']
to:
chunk = chunk[chunk['FIELD'] == '910.00.001']
to avoid creating an additional DataFrame.
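For reference, here is a minimal sketch of the loop with that change applied. It assumes the same delimiter and column names as in the question, the file path below is only a placeholder, and the pd.concat call is updated to use chunk as well, since filtered_df no longer exists:

import pandas as pd

path = "big_file.csv"  # placeholder; point this at the real 84 GB file
chunk_size = 1000000   # number of rows to read per chunk
my_df = pd.DataFrame()
i = 1

def convert_data(value):
    # Fall back to 0.778 when a value cannot be parsed as a float
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.778

for chunk in pd.read_csv(path, delimiter='~',
                         dtype={'FIELD': 'object', 'ID_TAXPAYER': 'object', 'PYEAR': 'object'},
                         usecols=['PYEAR', 'DATA', 'FIELD', 'ID_TAXPAYER'],
                         chunksize=chunk_size, converters={'DATA': convert_data},
                         engine='python'):
    chunk = chunk[chunk['FIELD'].str.contains("field", na=False)]
    chunk['FIELD'] = [x.replace('field_', '').replace('_', '.') for x in chunk['FIELD']]
    # Reuse the name "chunk" instead of binding a second DataFrame (filtered_df)
    chunk = chunk[chunk['FIELD'] == '910.00.001']
    print(i)
    i += 1
    my_df = pd.concat([my_df, chunk], ignore_index=True)

print(my_df)

Rebinding chunk lets the unfiltered chunk be garbage-collected before the next iteration, so each pass holds only the small filtered DataFrame plus the accumulated my_df.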