Pandas DataFrame Apply Function and Save every N steps

Question

My DataFrame is very large. Is there a nice way (without a for loop) to modify some values within it and save every N steps? For example:

def modifier(x):
   x = x.split() # more complex logic is applied here
   return x

df['new_col'] = df.old_col.apply(modifier)

Is there a nice way to add some code to the modifier function so that, every 10,000 rows, the following is called?

df.to_pickle('make_copy.pickle')

Answer 1

Score: 1

When saving every so many rows, the main issue is handling the edge case properly, since the last section might not be a full-size section. Using an approach discussed here, you could do something along the following lines. There is still a loop, but it runs once per section rather than once per row.

If you save every section, you need a mechanism for saving each one under a new name (or else append the sections to a list of DataFrames and save that). The approach is efficient because it splits on row positions, which coincide with the default numerical index; if your DataFrame has a different index, replace it with reset_index (in place or by reassignment) so that labels and positions agree. If that is not an option, or you want to split into chunks without looping, you could explore numpy's array_split, but the same loop over chunks would still be needed to save each one to a file.

from more_itertools import sliced   # this module might need to be installed using pip
SLICE_SIZE = 10000

slices = sliced(range(len(df)), SLICE_SIZE)

for index in slices:
    df_slice = df.iloc[index]
    print(df_slice)          # or do anything you want with the section of the DF such as save it as required
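
As a rough sketch of the "save each section under a new name" idea, the snippet below writes every slice to its own pickle file. It makes a few assumptions that are not in the answer: it uses a plain range step instead of more_itertools so that it has no extra dependency, the helper name save_in_slices and the slice_0000.pickle naming scheme are made up for illustration, and the slice size is shrunk to 4 so the short last section is easy to see.

```python
import tempfile
from pathlib import Path

import pandas as pd

SLICE_SIZE = 4  # small for illustration; the answer above uses 10000


def save_in_slices(df, out_dir, slice_size=SLICE_SIZE):
    """Write each positional slice of df to its own pickle file; return the paths."""
    out_dir = Path(out_dir)
    paths = []
    # range(0, len(df), slice_size) yields 0, 4, 8, ...; the last slice may be short
    for part, start in enumerate(range(0, len(df), slice_size)):
        df_slice = df.iloc[start:start + slice_size]
        path = out_dir / f"slice_{part:04d}.pickle"  # hypothetical naming scheme
        df_slice.to_pickle(path)
        paths.append(path)
    return paths


# Toy data standing in for the large DataFrame from the question
df = pd.DataFrame({"old_col": [f"row {i}" for i in range(10)]})
out_dir = tempfile.mkdtemp()
saved_paths = save_in_slices(df, out_dir)

# Reading the files back in order and concatenating restores the rows
restored = pd.concat([pd.read_pickle(p) for p in saved_paths])
```

With 10 rows and a slice size of 4 this produces three files, the last one holding only the two remaining rows, so the edge case needs no special handling.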

Answer 2

Score: 0

I wanted to achieve something similar - here is my approach:

# Import packages
from more_itertools import sliced
import pandas as pd

# Create DataFrame
df = pd.DataFrame(data={'col_a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'col_b': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]})

# Split/slice DataFrame into chunks of 4
slices = sliced(seq=range(len(df)), n=4)

# Create empty DataFrame
data = pd.DataFrame()

# Apply function and save every N steps
for index in slices:
    chunk = df.iloc[index].copy()
    chunk['new_column'] = index # Apply function/transformation here
    data = pd.concat([data, chunk], axis=0, ignore_index=True, sort=False)
    data.to_pickle(path='df.pkl')
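
For what it is worth, here is a variation on the same idea, tied back to the question's modifier function. It is only a sketch under some assumptions: the helper name apply_with_checkpoints is invented, the chunks are cut with a plain range step rather than more_itertools, and finished chunks are kept in a list so the growing DataFrame is not re-concatenated with itself on every pass. The checkpoint file is still overwritten after each chunk, as in the loop above.

```python
import os
import tempfile

import pandas as pd


def modifier(x):
    return x.split()  # stand-in for the more complex logic in the question


def apply_with_checkpoints(df, chunk_size, checkpoint_path):
    """Apply modifier to old_col chunk by chunk, pickling the progress after each chunk."""
    processed = []  # finished chunks
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size].copy()
        chunk["new_col"] = chunk["old_col"].apply(modifier)
        processed.append(chunk)
        # Overwrite the checkpoint with everything processed so far
        pd.concat(processed).to_pickle(checkpoint_path)
    return pd.concat(processed)


# Toy data; chunk_size would be 10000 in the question's setting
df = pd.DataFrame({"old_col": ["a b", "c d e", "f", "g h", "i j k"]})
checkpoint_path = os.path.join(tempfile.mkdtemp(), "make_copy.pickle")
result = apply_with_checkpoints(df, chunk_size=2, checkpoint_path=checkpoint_path)
```

If the run is interrupted, the pickle at checkpoint_path holds all rows processed up to the last completed chunk.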

huangapple
  • Published on February 8, 2023, 20:15:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75385658.html