用pandas逐块读取大型CSV文件

huangapple go评论53阅读模式
英文:

Reading large csv file in chunks with pandas

问题

我正在尝试使用pandas按块读取大型CSV文件(84GB),筛选出必要的行并将其转换为DataFrame。

import pandas as pd

chunk_size = 1000000  # 每块要读取的行数
my_df = pd.DataFrame()
i = 1
def convert_data(value):
    try:
        return float(value)
    except:
        return float(0.778)

for chunk in pd.read_csv(path, delimiter='~', dtype={'FIELD': 'object', 'ID_TAXPAYER': 'object', 'PYEAR': 'object'}, usecols=['PYEAR', 'DATA', 'FIELD', 'ID_TAXPAYER'], chunksize=chunk_size, converters={'DATA': convert_data}, engine='python'):
    chunk = chunk[chunk['FIELD'].str.contains("field", na=False)]
    chunk['FIELD'] = [i.replace('field_', '').replace('_', '.') for i in chunk['FIELD']]
    filtered_df = chunk[chunk['FIELD'] == '910.00.001']
    print(i)
    i += 1
    my_df = pd.concat([my_df, filtered_df], ignore_index=True)

# 打印结果DataFrame
print(my_df)

我的笔记本电脑具有16GB RAM和3.5 GHz CPU,带有4个核心。在运行一段时间后,当'i'变量达到323时,出现错误。我理解我的RAM不足。但我认为将DataFrame分成块会解决这个问题。此外,我注意到,每次循环迭代时,我的内存都会越来越多。我尝试使用"del chunk",但在323时仍然出现错误。有什么想法吗?提前谢谢你们!

MemoryError                               Traceback (most recent call last)
Cell In[3], line 14
     11         return float(0.778)
     13 # Iterate over the chunks
---> 14 for chunk in pd.read_csv(path, delimiter='~', dtype={'FIELD': 'object', 'ID_TAXPAYER': 'object', 'PYEAR': 'object'}, usecols=['PYEAR', 'DATA', 'FIELD', 'ID_TAXPAYER'], chunksize=chunk_size, converters={'DATA': convert_data}, engine='python'):
     15     chunk = chunk[chunk['FIELD'].str.contains("field", na=False)]
     16     chunk['FIELD'] = [i.replace('field_', '').replace('_', '.') for i in chunk['FIELD']]

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1624, in TextFileReader.__next__(self)
   1622 def __next__(self) -> DataFrame:
   1623     try:
--> 1624         return self.get_chunk()
   1625     except StopIteration:
   1626         self.close()

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1733, in TextFileReader.get_chunk(self, size)
   1731         raise StopIteration
   1732     size = min(size, self.nrows - self._currow)
-> 1733 return self.read(nrows=size)

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1704, in TextFileReader.read(self, nrows)
   1697 nrows = validate_integer("nrows", nrows)
   1698 try:
-> 1699     # error: "ParserBase" has no attribute "read"
   1700     (
   1701         index,
   1702         columns,
   1703         col_dict,
-> 1704     ) = self._engine.read(  # type: ignore[attr-defined]
   1705         nrows
   1706     )
   1707 except Exception:
   1708     self.close()

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\python_parser.py:251, in PythonParser.read(self, rows)
    245 def read(
    246     self, rows: int | None = None
    247 ) -> tuple[
    248     Index | None, Sequence[Hashable] | MultiIndex, Mapping[Hashable, ArrayLike]
    249 ]:
    250     try:
--> 251         content = self._get_lines(rows)
    252     except StopIteration:
    253         if self._first_chunk:

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\python_parser.py:1114, in PythonParser._get_lines(self, rows)
   1110 rows_to_skip = 0
   1111 if self.skiprows is not None and self.pos is not None:
   1112     # Only read additional rows if pos is in skiprows
   1113     rows_to_skip = len(
-> 1114         set(self.skiprows) - set(range(self.pos))
   1115     )
   1117 for _ in range(rows + rows_to_skip):
   1118     # assert for mypy, data is Iterator[str] or None, would
   1119     # error in next
   1120     assert self.data is not None

MemoryError:
英文:

I am trying to read large csv file (84GB) in chunks with pandas, filter out necessary rows and convert it to df

import pandas as pd
chunk_size = 1000000  # Number of rows to read per chunk
my_df = pd.DataFrame()
i = 1
def convert_data(value):
try:
return float(value)
except:
return float(0.778)
for chunk in pd.read_csv(path, delimiter='~', dtype={'FIELD': 'object', 'ID_TAXPAYER': 'object', 'PYEAR': 'object'}, usecols=['PYEAR', 'DATA', 'FIELD', 'ID_TAXPAYER'], chunksize=chunk_size, converters={'DATA': convert_data},engine='python'):
chunk = chunk[chunk['FIELD'].str.contains("field", na=False)]
chunk['FIELD'] = [i.replace('field_', '').replace('_', '.') for i in chunk['FIELD']]
filtered_df = chunk[chunk['FIELD'] == '910.00.001']
print(i)
i+=1
my_df = pd.concat([my_df, filtered_df], ignore_index=True)
# Print the resulting dataframe
print(my_df)

My laptop has 16 GB of RAM and 3.5 GHz CPU with 4 cores.After running for some time, when 'i' variable gets to 323, the error appears. I have an understanding that my RAM isn't enough. But I thought that dividing dataframe by chunks will solve the problem.Also, I have noticed that with each iteration of loop, my memory is filled more and more. I have tried to use "del chunk" but it still pops me error at 323.
Any thoughts? Thank you guys in advance!

MemoryError                               Traceback (most recent call last)
Cell In[3], line 14
11         return float(0.778)
13 # Iterate over the chunks
---> 14 for chunk in pd.read_csv(path, delimiter='~', dtype={'FIELD': 'object', 'ID_TAXPAYER': 'object', 'PYEAR': 'object'}, usecols=['PYEAR', 'DATA', 'FIELD', 'ID_TAXPAYER'], chunksize=chunk_size, converters={'DATA': convert_data},engine='python'):
15     chunk = chunk[chunk['FIELD'].str.contains("field", na=False)]
16     chunk['FIELD'] = [i.replace('field_', '').replace('_', '.') for i in chunk['FIELD']]
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1624, in TextFileReader.__next__(self)
1622 def __next__(self) -> DataFrame:
1623     try:
-> 1624         return self.get_chunk()
1625     except StopIteration:
1626         self.close()
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1733, in TextFileReader.get_chunk(self, size)
1731         raise StopIteration
1732     size = min(size, self.nrows - self._currow)
-> 1733 return self.read(nrows=size)
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1704, in TextFileReader.read(self, nrows)
1697 nrows = validate_integer("nrows", nrows)
1698 try:
1699     # error: "ParserBase" has no attribute "read"
1700     (
1701         index,
1702         columns,
1703         col_dict,
-> 1704     ) = self._engine.read(  # type: ignore[attr-defined]
1705         nrows
1706     )
1707 except Exception:
1708     self.close()
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\python_parser.py:251, in PythonParser.read(self, rows)
245 def read(
246     self, rows: int | None = None
247 ) -> tuple[
248     Index | None, Sequence[Hashable] | MultiIndex, Mapping[Hashable, ArrayLike]
249 ]:
250     try:
--> 251         content = self._get_lines(rows)
252     except StopIteration:
253         if self._first_chunk:
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\python_parser.py:1114, in PythonParser._get_lines(self, rows)
1110 rows_to_skip = 0
1111 if self.skiprows is not None and self.pos is not None:
1112     # Only read additional rows if pos is in skiprows
1113     rows_to_skip = len(
-> 1114         set(self.skiprows) - set(range(self.pos))
1115     )
1117 for _ in range(rows + rows_to_skip):
1118     # assert for mypy, data is Iterator[str] or None, would
1119     # error in next
1120     assert self.data is not None
MemoryError: 

答案1

得分: 1

你做得对。

唯一一件我会尝试改变的事是更改这一行:

    filtered_df = chunk[chunk['FIELD'] == '910.00.001']

为:

    chunk = chunk[chunk['FIELD'] == '910.00.001']

避免声明额外的数据框。

英文:

You are doing it right.

The only thing I would try changing the line:

filtered_df = chunk[chunk['FIELD'] == '910.00.001']

to:

chunk = chunk[chunk['FIELD'] == '910.00.001']

avoiding declaring additional dataframe

huangapple
  • 本文由 发表于 2023年5月28日 15:03:29
  • 转载请务必保留本文链接:https://go.coder-hub.com/76350327.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定