Reading a large CSV file in chunks with pandas
Question
I am trying to use pandas to read a large CSV file (84 GB) in chunks, filter out the necessary rows, and convert them to a DataFrame.
import pandas as pd

chunk_size = 1000000  # Number of rows to read per chunk
my_df = pd.DataFrame()
i = 1

def convert_data(value):
    try:
        return float(value)
    except:
        return float(0.778)

for chunk in pd.read_csv(path, delimiter='~', dtype={'FIELD': 'object', 'ID_TAXPAYER': 'object', 'PYEAR': 'object'}, usecols=['PYEAR', 'DATA', 'FIELD', 'ID_TAXPAYER'], chunksize=chunk_size, converters={'DATA': convert_data}, engine='python'):
    chunk = chunk[chunk['FIELD'].str.contains("field", na=False)]
    chunk['FIELD'] = [i.replace('field_', '').replace('_', '.') for i in chunk['FIELD']]
    filtered_df = chunk[chunk['FIELD'] == '910.00.001']
    print(i)
    i += 1
    my_df = pd.concat([my_df, filtered_df], ignore_index=True)

# Print the resulting dataframe
print(my_df)
My laptop has 16 GB of RAM and a 3.5 GHz CPU with 4 cores. After running for some time, when the 'i' variable reaches 323, the error below appears. I understand that my RAM isn't enough, but I thought that reading the DataFrame in chunks would solve the problem. I have also noticed that with each loop iteration my memory fills up more and more. I tried using "del chunk", but the error still appears at 323. Any thoughts? Thank you in advance!
MemoryError Traceback (most recent call last)
Cell In[3], line 14
11 return float(0.778)
13 # Iterate over the chunks
---> 14 for chunk in pd.read_csv(path, delimiter='~', dtype={'FIELD': 'object', 'ID_TAXPAYER': 'object', 'PYEAR': 'object'}, usecols=['PYEAR', 'DATA', 'FIELD', 'ID_TAXPAYER'], chunksize=chunk_size, converters={'DATA': convert_data}, engine='python'):
15 chunk = chunk[chunk['FIELD'].str.contains("field", na=False)]
16 chunk['FIELD'] = [i.replace('field_', '').replace('_', '.') for i in chunk['FIELD']]
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1624, in TextFileReader.__next__(self)
1622 def __next__(self) -> DataFrame:
1623 try:
--> 1624 return self.get_chunk()
1625 except StopIteration:
1626 self.close()
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1733, in TextFileReader.get_chunk(self, size)
1731 raise StopIteration
1732 size = min(size, self.nrows - self._currow)
-> 1733 return self.read(nrows=size)
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\readers.py:1704, in TextFileReader.read(self, nrows)
1697 nrows = validate_integer("nrows", nrows)
1698 try:
1699 # error: "ParserBase" has no attribute "read"
1700 (
1701 index,
1702 columns,
1703 col_dict,
-> 1704 ) = self._engine.read( # type: ignore[attr-defined]
1705 nrows
1706 )
1707 except Exception:
1708 self.close()
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\python_parser.py:251, in PythonParser.read(self, rows)
245 def read(
246 self, rows: int | None = None
247 ) -> tuple[
248 Index | None, Sequence[Hashable] | MultiIndex, Mapping[Hashable, ArrayLike]
249 ]:
250 try:
--> 251 content = self._get_lines(rows)
252 except StopIteration:
253 if self._first_chunk:
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\io\parsers\python_parser.py:1114, in PythonParser._get_lines(self, rows)
1110 rows_to_skip = 0
1111 if self.skiprows is not None and self.pos is not None:
1112 # Only read additional rows if pos is in skiprows
1113 rows_to_skip = len(
-> 1114 set(self.skiprows) - set(range(self.pos))
1115 )
1117 for _ in range(rows + rows_to_skip):
1118 # assert for mypy, data is Iterator[str] or None, would
1119 # error in next
1120 assert self.data is not None
MemoryError:
Answer 1
Score: 1
You are doing it right.
The only thing I would try is changing the line:
filtered_df = chunk[chunk['FIELD'] == '910.00.001']
to:
chunk = chunk[chunk['FIELD'] == '910.00.001']
to avoid creating an additional DataFrame.
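For reference, here is a minimal sketch of the loop with that change applied. It assumes the same delimiter and column names as in the question, the file path below is only a placeholder, and the pd.concat call is updated to use chunk as well, since filtered_df no longer exists:

import pandas as pd

path = "big_file.csv"  # placeholder; point this at the real 84 GB file
chunk_size = 1000000   # number of rows to read per chunk
my_df = pd.DataFrame()
i = 1

def convert_data(value):
    # Fall back to 0.778 when a value cannot be parsed as a float
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.778

for chunk in pd.read_csv(path, delimiter='~',
                         dtype={'FIELD': 'object', 'ID_TAXPAYER': 'object', 'PYEAR': 'object'},
                         usecols=['PYEAR', 'DATA', 'FIELD', 'ID_TAXPAYER'],
                         chunksize=chunk_size, converters={'DATA': convert_data},
                         engine='python'):
    chunk = chunk[chunk['FIELD'].str.contains("field", na=False)]
    chunk['FIELD'] = [x.replace('field_', '').replace('_', '.') for x in chunk['FIELD']]
    # Reuse the name "chunk" instead of binding a second DataFrame (filtered_df)
    chunk = chunk[chunk['FIELD'] == '910.00.001']
    print(i)
    i += 1
    my_df = pd.concat([my_df, chunk], ignore_index=True)

print(my_df)

Rebinding chunk lets the unfiltered chunk be garbage-collected before the next iteration, so each pass holds only the small filtered DataFrame plus the accumulated my_df.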