2023-02-08 20:15:56


Pandas DataFrame Apply Function and Save every N steps

Question

My DataFrame is very large. Is there a nice way (without a for loop) to modify some values within it and save every N steps? For example:

def modifier(x):
   x = x.split() # more complex logic is applied here
   return x
df['new_col'] = df.old_col.apply(modifier)

Is there a nice way to add some code to the modifier function so that, every 10,000 rows, the following is called?

df.to_pickle('make_copy.pickle')

Answer 1

Score: 1

When saving every so many rows, the main issue is handling the edge case properly, since the last section may be shorter than the others. Using an approach discussed here, you could do something along the following lines. There is still a loop, but it runs once per section rather than once per row. If you save every section, you need a mechanism for saving each one under a new name (or else append the sections to a list of DataFrames and save that). The approach is efficient because it splits on integer row positions; if you select by index label rather than by position, the DataFrame needs the default numeric index, which reset_index restores. If that is not an option, or you want to split into chunks without writing the slicing loop yourself, you could explore numpy's array_split, although a loop over the chunks would still be needed to save each one to a file.

from more_itertools import sliced   # this module might need to be installed using pip
SLICE_SIZE = 10000
slices = sliced(range(len(df)), SLICE_SIZE)
for index in slices:
    df_slice = df.iloc[index]
    print(df_slice)          # or do anything you want with the section of the DF such as save it as required
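
As a minimal sketch of how the loop above might be combined with the `modifier` from the question: the helper below processes the column in fixed-size positional chunks and writes a checkpoint after each chunk. The name `apply_with_checkpoints`, its `checkpoint_path` argument, and the toy data are assumptions made for illustration, not part of the original answer, and it steps through a plain `range` instead of using `more_itertools`.

```python
import os
import tempfile

import pandas as pd


def modifier(x):
    # Placeholder for the more complex logic from the question.
    return x.split()


def apply_with_checkpoints(df, func, chunk_size, checkpoint_path):
    """Apply func to df['old_col'] chunk by chunk, pickling progress after each chunk.

    Hypothetical helper: the name and signature are illustrative only.
    Assumes df has a unique index so the results align back onto the rows.
    """
    results = []
    for start in range(0, len(df), chunk_size):
        # iloc slicing is positional, so the final (possibly shorter) chunk
        # needs no special casing.
        chunk = df["old_col"].iloc[start:start + chunk_size]
        results.append(chunk.apply(func))
        # Checkpoint: everything processed so far, overwritten on each pass.
        pd.concat(results).to_pickle(checkpoint_path)
    out = df.copy()
    out["new_col"] = pd.concat(results)
    return out


df = pd.DataFrame({"old_col": ["a b", "c d e", "f", "g h", "i j k"]})
path = os.path.join(tempfile.mkdtemp(), "make_copy.pickle")
result = apply_with_checkpoints(df, modifier, chunk_size=2, checkpoint_path=path)
checkpoint = pd.read_pickle(path)
```

Overwriting a single checkpoint file keeps only the latest state; writing each chunk under its own name would avoid re-pickling earlier rows, at the cost of having to reassemble the files later.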

Answer 2

Score: 0

I wanted to achieve something similar - here is my approach:

# Import packages
from more_itertools import sliced
import pandas as pd
# Create DataFrame
df = pd.DataFrame(data={'col_a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'col_b': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]})
# Split/slice DataFrame into chunks of 4
slices = sliced(seq=range(len(df)), n=4)
# Create empty DataFrame
data = pd.DataFrame()
# Apply function and save every N steps
for index in slices:
    chunk = df.iloc[index].copy()
    chunk['new_column'] = index  # Apply function/transformation here
    data = pd.concat([data, chunk], axis=0, ignore_index=True, sort=False)
    data.to_pickle(path='df.pkl')
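
Calling `pd.concat` on the growing `data` frame inside the loop copies all previously processed rows on every pass. One possible variation, sketched here under the assumption that each chunk may be stored in its own file, collects the chunks in a list, pickles each one under a numbered name, and concatenates once at the end. The file names (`chunk_000.pkl` and so on), the temporary output directory, and the doubling transformation are illustrative choices, not part of the original answer.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"col_a": range(1, 11), "col_b": range(11, 21)})
chunk_size = 4
out_dir = tempfile.mkdtemp()  # assumption: any writable directory would do

chunks = []
for i, start in enumerate(range(0, len(df), chunk_size)):
    chunk = df.iloc[start:start + chunk_size].copy()
    chunk["new_column"] = chunk["col_a"] * 2  # stand-in for the real transformation
    # Save each chunk under its own numbered name so nothing is overwritten.
    chunk.to_pickle(os.path.join(out_dir, f"chunk_{i:03d}.pkl"))
    chunks.append(chunk)

# Concatenate once, after the loop, instead of on every iteration.
data = pd.concat(chunks, ignore_index=True)
saved_files = sorted(os.listdir(out_dir))
```

If the process is interrupted, the chunk files already on disk can be reloaded with `pd.read_pickle` and the loop resumed from the next chunk.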
