Pandas DataFrame Apply Function and Save every N steps
Question
My DataFrame is very large. Is there a nice way (without a for loop) to modify some values in it and save a copy every N steps? For example:
def modifier(x):
    x = x.split()  # more complex logic is applied here
    return x
df['new_col'] = df.old_col.apply(modifier)
Is there a nice way to add some code to the modifier function so that the following is called every 10,000 rows?
df.to_pickle('make_copy.pickle')
Answer 1
Score: 1
For saving every so many rows, the main issue is handling the edge case properly, since the last section might not be a full-size section. Using an approach discussed here, you could do something along the following lines. Although there is a loop, it runs once per section rather than once per row. If you save every section, you need a mechanism for saving each one under a new name (or else append the sections to a list of DataFrames and save that). The approach is efficient because it splits by integer position. Since the example selects rows with iloc, the positions do not depend on the index labels; if you select by label with loc instead, the default numerical index needs to be in place or restored using reset_index. If that is not an option, or you want to split into chunks without looping, you could explore numpy array_split; the same looping would still be required to save each chunk to a file.
from more_itertools import sliced  # this module might need to be installed using pip
SLICE_SIZE = 10000
slices = sliced(range(len(df)), SLICE_SIZE)
for index in slices:
    df_slice = df.iloc[index]
    print(df_slice)  # or do anything you want with this section of the DF, such as saving it
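To make the "save each section under a new name" point concrete, here is a minimal sketch rather than a definitive implementation. It assumes a non-empty DataFrame, uses a plain range step instead of more_itertools, and writes to a temporary directory. The helper name apply_in_chunks and the chunk_00000.pickle naming scheme are hypothetical choices for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd


def modifier(x):
    # Placeholder for the more complex logic from the question.
    return x.split()


def apply_in_chunks(df, column, func, chunk_size, out_dir):
    """Apply func to df[column] chunk by chunk, pickling each processed chunk.

    Hypothetical helper; assumes df has at least one row.
    """
    out_dir = Path(out_dir)
    processed = []
    for part, start in enumerate(range(0, len(df), chunk_size)):
        # Positional slicing; the final chunk may be shorter than chunk_size.
        chunk = df.iloc[start:start + chunk_size].copy()
        chunk["new_col"] = chunk[column].apply(func)
        # Assumed naming scheme: one file per chunk, so nothing is overwritten.
        chunk.to_pickle(out_dir / f"chunk_{part:05d}.pickle")
        processed.append(chunk)
    # Concatenate once at the end instead of growing a DataFrame in the loop.
    return pd.concat(processed)


df = pd.DataFrame({"old_col": ["a b", "c d e", "f", "g h", "i j k"]})
out_dir = tempfile.mkdtemp()
result = apply_in_chunks(df, "old_col", modifier, chunk_size=2, out_dir=out_dir)
saved = sorted(p.name for p in Path(out_dir).iterdir())
```

With five rows and a chunk size of two, the last file holds a single row, which is the edge case mentioned above. If a run is interrupted, the chunk files already written can be reloaded with pd.read_pickle and processing can resume from the next chunk.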
Answer 2
Score: 0
I wanted to achieve something similar. Here is my approach:
# Import packages
from more_itertools import sliced
import pandas as pd
# Create DataFrame
df = pd.DataFrame(data={'col_a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'col_b': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]})
# Split/slice the row positions into chunks of 4 (the last chunk holds the remaining 2)
slices = sliced(seq=range(len(df)), n=4)
# Create empty DataFrame to accumulate the processed chunks
data = pd.DataFrame()
# Apply function and save every N steps
for index in slices:
    chunk = df.iloc[index].copy()
    chunk['new_column'] = index  # Apply function/transformation here
    data = pd.concat([data, chunk], axis=0, ignore_index=True, sort=False)
    # Overwrite the pickle with everything processed so far
    data.to_pickle(path='df.pkl')
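A different angle, closer to what the question literally asks, is to put the checkpoint inside the function passed to apply. Calling df.to_pickle from inside modifier would not capture the new values, because new_col is only assigned after apply returns. The sketch below therefore keeps the results in a wrapper object and pickles that list every N calls. The class name CheckpointingModifier and its parameters are hypothetical, and it assumes the function is called once per value, in order. The demonstration uses a plain list so it runs without pandas.

```python
import pickle
import tempfile
from pathlib import Path


class CheckpointingModifier:
    """Wrap a function; keep its results and pickle them every `every` calls."""

    def __init__(self, func, every, path):
        self.func = func
        self.every = every
        self.path = Path(path)
        self.results = []

    def __call__(self, x):
        value = self.func(x)
        self.results.append(value)
        # Save whenever the number of processed values hits a multiple of `every`.
        if len(self.results) % self.every == 0:
            self.save()
        return value

    def save(self):
        with self.path.open("wb") as fh:
            pickle.dump(self.results, fh)


def modifier(x):
    # Placeholder for the more complex logic from the question.
    return x.split()


checkpoint_path = Path(tempfile.mkdtemp()) / "make_copy.pickle"
wrapped = CheckpointingModifier(modifier, every=2, path=checkpoint_path)

# With pandas this would be: df['new_col'] = df.old_col.apply(wrapped)
values = [wrapped(x) for x in ["a b", "c d e", "f"]]

# The checkpoint on disk holds the results as of the last multiple of `every`.
with checkpoint_path.open("rb") as fh:
    on_disk = pickle.load(fh)

# A final explicit save covers the trailing values after the last checkpoint.
wrapped.save()
```

The trade-off is that the wrapper holds every result in memory and rewrites the whole list at each checkpoint, so for very large inputs the chunk-based loop above is likely the better fit.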