2023-06-26 19:07:05

Insert a new row in a Pandas dataframe based on missing values found in other rows and columns

Question

I have a very large but somewhat incomplete Pandas dataframe in Python where I need to insert rows that are missing, based on values found in other rows and columns. Here is a small example:

import pandas as pd
df_test = pd.DataFrame(data=None, columns=['file', 'quarter', 'status'])
df_test.file = ['file_1', 'file_1', 'file_2', 'file_2', 'file_3']
df_test.quarter = ['2022q4', '2023q2', '2022q3', '2022q4', '2023q1']
df_test.status = ['in', 'in', 'in', 'in', 'in']

What I have are different files that were used during different quarters, and if they were used, the 'status' for that file and quarter is set to 'in'. What I want to do in this dataframe is insert rows for when the file was not used and set the status to 'out' for the correct quarter. If the file was not used for two quarters or more in a row, only a single new entry that reads 'out' is needed for the first quarter when it was not used.

In this example, it means that for 'file_1', a new row should be added for quarter = '2023q1' with status = 'out'. For 'file_2' a new row for quarter '2023q1' with status 'out' should be added, but nothing new for '2023q2' is needed. For 'file_3', just '2023q2' and 'out' is needed.

I suppose that I should use a list like the one below and check if a unique file name has all entries in the list or not and from there create new rows, but I'm not sure how to do it. Any help at all is really appreciated, thanks!

quarters = ['2022q3', '2022q4', '2023q1', '2023q2']

Expected output:

     file quarter status
0  file_1  2022q4     in
1  file_1  2023q1    out
2  file_1  2023q2     in
3  file_2  2022q3     in
4  file_2  2022q4     in
5  file_2  2023q1    out
6  file_3  2023q1     in
7  file_3  2023q2    out

Answer 1

Score: 1

One classical approach to fill missing values is to stack/unstack (or melt/pivot).

Here you can do that and use ffill with limit=1, so that only the first missing quarter that follows an "in" is filled with "out":

tmp = df_test.set_index(['file', 'quarter']).unstack()
out = tmp.fillna(tmp.replace('in', 'out').ffill(axis=1, limit=1)).stack().reset_index()

Output:

     file quarter status
0  file_1  2022q4     in
1  file_1  2023q1    out
2  file_1  2023q2     in
3  file_2  2022q3     in
4  file_2  2022q4     in
5  file_2  2023q1    out
6  file_3  2023q1     in
7  file_3  2023q2    out
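
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea using the pivot/melt variant mentioned above. The helper name add_first_out is made up for illustration and is not part of pandas. It also reindexes the columns against the full quarters list from the question, which is an added assumption: without it, a quarter that no file used at all would not become a column, and the limit=1 fill would skip over it.

```python
import pandas as pd

df_test = pd.DataFrame({
    "file": ["file_1", "file_1", "file_2", "file_2", "file_3"],
    "quarter": ["2022q4", "2023q2", "2022q3", "2022q4", "2023q1"],
    "status": ["in", "in", "in", "in", "in"],
})
quarters = ["2022q3", "2022q4", "2023q1", "2023q2"]


def add_first_out(df, all_quarters):
    """Hypothetical helper: add one 'out' row right after each run of 'in' quarters."""
    # One row per file, one column per quarter; unused quarters become NaN.
    wide = df.pivot(index="file", columns="quarter", values="status")
    # Assumption: force every known quarter to be a column, in chronological order.
    wide = wide.reindex(columns=pd.Index(all_quarters, name="quarter"))
    # Turn every 'in' into 'out' and carry it at most one column to the right.
    candidates = wide.replace("in", "out").ffill(axis=1, limit=1)
    # Fill only the gaps; existing 'in' cells are left untouched.
    filled = wide.fillna(candidates)
    # Back to long format, dropping the cells that are still empty.
    long = filled.reset_index().melt(
        id_vars="file", var_name="quarter", value_name="status"
    )
    long = long.dropna(subset=["status"])
    # Sorting the quarter strings works here because 'YYYYqN' sorts chronologically.
    return long.sort_values(["file", "quarter"]).reset_index(drop=True)


result = add_first_out(df_test, quarters)
print(result)
```

Using melt plus an explicit dropna instead of stack is a deliberate choice in this sketch, so the result does not depend on whether stack drops the empty cells in the installed pandas version.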

Intermediate:

tmp.fillna(tmp.replace('in', 'out').ffill(axis=1, limit=1))
        status                     
quarter 2022q3 2022q4 2023q1 2023q2
file                               
file_1     NaN     in    out     in
file_2      in     in    out    NaN
file_3     NaN    NaN     in    out
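
If the wide table is too large to build, the idea from the question (walk the quarters list for each file) can also be written as an explicit loop. This is only a sketch and not part of the answer above: it assumes the quarters list is complete and in chronological order, and it adds an 'out' row only where a file was used in one quarter and not in the next. The rows are collected in a plain list and turned into a dataframe once, since DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

df_test = pd.DataFrame({
    "file": ["file_1", "file_1", "file_2", "file_2", "file_3"],
    "quarter": ["2022q4", "2023q2", "2022q3", "2022q4", "2023q1"],
    "status": ["in", "in", "in", "in", "in"],
})
# Assumption: complete and chronologically ordered list of quarters.
quarters = ["2022q3", "2022q4", "2023q1", "2023q2"]

rows = []
for file, group in df_test.groupby("file"):
    used = set(group["quarter"])
    # Compare each quarter with the one right before it.
    for previous, current in zip(quarters, quarters[1:]):
        # Usage stops here: used in the previous quarter, not in this one.
        if previous in used and current not in used:
            rows.append({"file": file, "quarter": current, "status": "out"})

# Build the extra rows in one go, then merge them with the original data.
out_rows = pd.DataFrame(rows, columns=["file", "quarter", "status"])
result_loop = (
    pd.concat([df_test, out_rows], ignore_index=True)
    .sort_values(["file", "quarter"])
    .reset_index(drop=True)
)
print(result_loop)
```

On the example data this gives the same eight rows as the expected output. For a very large dataframe the vectorized version is likely faster, but the loop is easier to adapt if the rule for when to write 'out' changes.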
