Insert a new row in a Pandas dataframe missing values found in other rows and columns

Question

    import pandas as pd

    df_test = pd.DataFrame(data=None, columns=['file', 'quarter', 'status'])
    df_test.file = ['file_1', 'file_1', 'file_2', 'file_2', 'file_3']
    df_test.quarter = ['2022q4', '2023q2', '2022q3', '2022q4', '2023q1']
    df_test.status = ['in', 'in', 'in', 'in', 'in']
    quarters = ['2022q3', '2022q4', '2023q1', '2023q2']

    # Collect the missing rows in a list (DataFrame.append was removed in pandas 2.0)
    new_rows = []

    # Iterate through unique file names
    for file in df_test['file'].unique():
        # Quarters in which the current file was used
        used = set(df_test.loc[df_test['file'] == file, 'quarter'])
        # Walk through consecutive quarters and add an 'out' row only for the
        # first unused quarter that directly follows a used one
        for prev_quarter, quarter in zip(quarters, quarters[1:]):
            if prev_quarter in used and quarter not in used:
                new_rows.append({'file': file, 'quarter': quarter, 'status': 'out'})

    # Concatenate the original data and the new rows
    expanded_df = pd.DataFrame(new_rows, columns=['file', 'quarter', 'status'])
    result_df = pd.concat([df_test, expanded_df], ignore_index=True)

    # Sort the result DataFrame by file and quarter
    result_df = result_df.sort_values(by=['file', 'quarter']).reset_index(drop=True)

    # Display the expected output
    print(result_df)

I have a very large but somewhat incomplete pandas DataFrame in Python, and I need to insert the rows that are missing, based on values found in other rows and columns. Here is an example:

    import pandas as pd

    df_test = pd.DataFrame(data=None, columns=['file', 'quarter', 'status'])
    df_test.file = ['file_1', 'file_1', 'file_2', 'file_2', 'file_3']
    df_test.quarter = ['2022q4', '2023q2', '2022q3', '2022q4', '2023q1']
    df_test.status = ['in', 'in', 'in', 'in', 'in']

What I have are different files that were used during different quarters, and if they were used, the 'status' for that file and quarter is set to 'in'. What I want to do in this dataframe is insert rows for when the file was not used and set the status to 'out' for the correct quarter. If the file was not used for two quarters or more in a row, only a single new entry that reads 'out' is needed for the first quarter when it was not used.

In this example, it means that for 'file_1', a new row should be added for quarter = '2023q1' with status = 'out'. For 'file_2' a new row for quarter '2023q1' with status 'out' should be added, but nothing new for '2023q2' is needed. For 'file_3', just '2023q2' and 'out' is needed.

I suppose that I should use a list like the one below and check if a unique file name has all entries in the list or not and from there create new rows, but I'm not sure how to do it. Any help at all is really appreciated, thanks!

    quarters = ['2022q3', '2022q4', '2023q1', '2023q2']

Expected output:

         file quarter status
    0  file_1  2022q4     in
    1  file_1  2023q1    out
    2  file_1  2023q2     in
    3  file_2  2022q3     in
    4  file_2  2022q4     in
    5  file_2  2023q1    out
    6  file_3  2023q1     in
    7  file_3  2023q2    out

Answer 1

Score: 1

One classic approach to filling missing values is to unstack and then stack again (or pivot and melt).

Here you can do that and use ffill with limit=1, so that only the first missing value that follows an "in" is filled:

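    # Wide table: one row per file, one column per quarter, NaN where the file was not used
    tmp = df_test.set_index(['file', 'quarter']).unstack()
    # Turn each 'in' into 'out', carry it one quarter to the right, and use it to fill the gaps
    out = tmp.fillna(tmp.replace('in', 'out').ffill(axis=1, limit=1)).stack().reset_index()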

Output:

         file quarter status
    0  file_1  2022q4     in
    1  file_1  2023q1    out
    2  file_1  2023q2     in
    3  file_2  2022q3     in
    4  file_2  2022q4     in
    5  file_2  2023q1    out
    6  file_3  2023q1     in
    7  file_3  2023q2    out

Intermediate:

    tmp.fillna(tmp.replace('in', 'out').ffill(axis=1, limit=1))

            status
    quarter 2022q3 2022q4 2023q1 2023q2
    file
    file_1     NaN     in    out     in
    file_2      in     in    out    NaN
    file_3     NaN    NaN     in    out
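
One thing to keep in mind with the unstack approach: it only creates columns for quarters that occur at least once in the data, and it relies on the quarter labels sorting chronologically. If a quarter could be absent for every file, a possible alternative is to build every (file, quarter) pair from the explicit quarters list in the question and work in long format. The following is only a sketch under those assumptions, not part of the original answer; the names full_index, full, used and prev_used are made up for illustration, and quarters is assumed to be complete and in chronological order.

```python
import pandas as pd

df_test = pd.DataFrame({
    'file': ['file_1', 'file_1', 'file_2', 'file_2', 'file_3'],
    'quarter': ['2022q4', '2023q2', '2022q3', '2022q4', '2023q1'],
    'status': ['in', 'in', 'in', 'in', 'in'],
})
# Assumption: this list is complete and in chronological order.
quarters = ['2022q3', '2022q4', '2023q1', '2023q2']

# Every (file, quarter) combination, whether or not it appears in the data.
full_index = pd.MultiIndex.from_product(
    [df_test['file'].unique(), quarters], names=['file', 'quarter']
)
# Missing combinations get NaN as their status.
full = df_test.set_index(['file', 'quarter']).reindex(full_index).reset_index()

# True where the file was used in that quarter.
used = full['status'].eq('in')
# True where the same file was used in the previous quarter of the list.
prev_used = used.groupby(full['file']).shift(fill_value=False).astype(bool)

# Mark only the first unused quarter after a used one, then drop the
# combinations that still have no status.
full.loc[~used & prev_used, 'status'] = 'out'
out = full.dropna(subset=['status']).reset_index(drop=True)
print(out)
```

On the sample data this should give the same eight rows as the expected output. Because the comparison is done per file with groupby, an "in" in the last quarter of one file never leaks into the first quarter of the next file.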

huangapple
  • Posted on June 26, 2023 at 19:07:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76556109.html