Updating an old dataframe with conditional matching of different columns and adding new rows in pandas

Question


I have an old dataframe with the following columns and a lot of rows; it looks like this:

>old_df
date/time    Name   detect_ID   category  ID
12/1/2023    XXX    1           B        1400
12/1/2023    XXY    1,3,7       B        1402
12/1/2023    XXY    4           A        1403
12/1/2023    XXY    4           B        1407
.....

I have some information in new_df, which has the same columns, and based on it I want to update old_df. The new dataframe is:

>new_df
date/time    Name   detect_ID   category  ID
13/1/2023    XXX    1           B        1400
14/1/2023    XXY    1,3,8       B        1402
14/1/2023    XXY    1           B        1405
.....

For updating, I want the following conditions:

  1. I want to iterate through the rows of old_df while checking every row of new_df and its information, but only check those rows of old_df where the category column value is 'B'.
  2. The program will first take the ID value of the first row of new_df (and go through the remaining rows of new_df in consecutive iterations) while iterating through the rows of old_df. If the ID of this row of new_df does not match any ID in old_df, the program takes that full row from new_df and adds it as a new row of old_df, creating a new column in old_df named identify and assigning it the value new.

If the ID of this new_df row does match an ID in old_df, the program inspects that specific row's detect_ID column value. Now a few things can happen:

A. If that specific detect_ID value of old_df matches the new_df row's detect_ID value, the program takes that row of new_df and replaces the matched row of old_df, and the newly created identify column gets the value updated. Also, as you can see, detect_ID can hold multiple values such as 1,3,7; I want to check each of them separately, and some of these values may already be integers, so basically they need to be split on , and converted to integers.

B. If that detect_ID value of old_df does not match the new_df row's detect_ID value, the program takes the full row from new_df and adds it as a new row of old_df, assigning the value new in the identify column.

  3. Rows of old_df whose ID did not match any row of new_df, or whose ID matched but whose detect_ID did not, remain unchanged in old_df and get the value unchanged in the identify column.

I want this to iterate through all rows of old_df until every row of new_df has been applied to old_df.

For the given example, I want the output dataframe to look like the following:

>output
date/time    Name   detect_ID   category  ID     identify
13/1/2023    XXX    1           B         1400   updated    [Case A]
14/1/2023    XXY    1           B         1402   updated    [Case A with multiple detect_ID]
14/1/2023    XXY    3           B         1402   updated
12/1/2023    XXY    7           B         1402   unchanged  [Step 3: ID matches but detect_ID does not]
14/1/2023    XXY    8           B         1402   new        [Case B]
12/1/2023    XXY    4           A         1403   unchanged
12/1/2023    XXY    4           B         1407   unchanged  [Step 3: ID not found in new_df]

I am using the following code, but it does not work the way I want: it produces a lot of duplicates and also fails to iterate through many rows of old_df.

import pandas as pd

old_df = pd.read_csv('old.csv')
new_df = pd.read_csv('new.csv')

# Create a set of tuples representing the unique (ID, detect_ID) pairs in the old dataframe
unique_pairs = set()
for _, row in old_df.iterrows():
    detector_ids = [int(x) for x in str(row['detect_ID']).split(',')]
    for detector_id in detector_ids:
        unique_pairs.add((row['ID'], detector_id))

# Iterate over the rows in the new dataframe and check if their (ID, detect_ID) pair is in the set of unique pairs
new_rows = []
updated_rows = []
for _, row in new_df.iterrows():
    detector_ids = [int(x) for x in str(row['detect_ID']).split(',')]
    for detector_id in detector_ids:
        if (row['ID'], detector_id) in unique_pairs:
            old_row = old_df.loc[(old_df['ID'] == row['ID']) & (old_df['detect_ID'].astype(str).str.contains(str(detector_id)))]
            if not old_row.empty:
                old_row = old_row.iloc[0]
                old_row['date/time'] = row['date/time']
                old_df.loc[(old_df['ID'] == row['ID']) & (old_df['detect_ID'].astype(str).str.contains(str(detector_id))), 'date/time'] = old_row['date/time']
                updated_rows.append(old_row)
        else:
            row['Identify'] = 'new'
            new_rows.append(row)
            unique_pairs.add((row['ID'], detector_id))

# Append the new rows to the old dataframe and write the updated dataframe to a new file
old_df = pd.concat([old_df, pd.DataFrame(new_rows)], ignore_index=True)
for row in updated_rows:
    row['Identify'] = 'updated'
old_df = pd.concat([old_df, pd.DataFrame(updated_rows)], ignore_index=True)
old_df.to_csv('updated.csv', index=False)

Answer 1

Score: 1


Relying on df.iterrows nearly always implies a suboptimal approach to manipulations in pandas (see e.g. this SO post). Here's an approach using df.merge for the important part.
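
For a self-contained run, the two sample frames from the question can be constructed directly; this is just a convenience sketch reproducing the sample data shown above:

import pandas as pd

old_df = pd.DataFrame({
    'date/time': ['12/1/2023', '12/1/2023', '12/1/2023', '12/1/2023'],
    'Name': ['XXX', 'XXY', 'XXY', 'XXY'],
    'detect_ID': ['1', '1,3,7', '4', '4'],
    'category': ['B', 'B', 'A', 'B'],
    'ID': [1400, 1402, 1403, 1407],
})

new_df = pd.DataFrame({
    'date/time': ['13/1/2023', '14/1/2023', '14/1/2023'],
    'Name': ['XXX', 'XXY', 'XXY'],
    'detect_ID': ['1', '1,3,8', '1'],
    'category': ['B', 'B', 'B'],
    'ID': [1400, 1402, 1405],
})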

Step 1

Make sure that all the values in column detect_ID are strings by applying Series.astype(str). Now, use Series.str.split and df.explode to get entries like 1,3,7 into separate rows. Apply to both dfs. At this stage, we can change the type (back) to int for all values in detect_ID (assuming that your data indeed consists of numeric characters).

Since we only want to check on rows that have value B in column category, filter out any non-B values from new_df with Series.eq (though, in your current example no such value exists).

old_df['detect_ID'] = old_df['detect_ID'].astype(str).str.split(',')
old_df = old_df.explode('detect_ID', ignore_index=False)
old_df['detect_ID'] = old_df['detect_ID'].astype(int)

new_df['detect_ID'] = new_df['detect_ID'].astype(str).str.split(',')
new_df = new_df.explode('detect_ID', ignore_index=False)
new_df['detect_ID'] = new_df['detect_ID'].astype(int)
new_df = new_df[new_df['category'].eq('B')]

# both dfs now look as follows
old_df

   date/time Name detect_ID category    ID
0  12/1/2023  XXX         1        B  1400
1  12/1/2023  XXY         1        B  1402
1  12/1/2023  XXY         3        B  1402
1  12/1/2023  XXY         7        B  1402
2  12/1/2023  XXY         4        A  1403
3  12/1/2023  XXY         4        B  1407
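
For comparison, new_df after the same treatment (running the same lines on the sample data above) looks like this:

   date/time Name detect_ID category    ID
0  13/1/2023  XXX         1        B  1400
1  14/1/2023  XXY         1        B  1402
1  14/1/2023  XXY         3        B  1402
1  14/1/2023  XXY         8        B  1402
2  14/1/2023  XXY         1        B  1405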

Step 2

Apply df.merge. We want to merge on ['Name','detect_ID','category', 'ID'], keep all the entries from both sides (so: how='outer'), and also add an indicator column (calling it identify), which will tell us the source of each row. Adding custom suffixes (e.g. '_old' instead of default '_x') for clarity's sake.

res = old_df.merge(new_df, on=['Name','detect_ID','category', 'ID'], 
                   how='outer', indicator='identify', suffixes=('_old','_new'))

res

  date/time_old Name detect_ID category    ID date/time_new    identify
0     12/1/2023  XXX         1        B  1400     13/1/2023        both
1     12/1/2023  XXY         1        B  1402     14/1/2023        both
2     12/1/2023  XXY         3        B  1402     14/1/2023        both
3     12/1/2023  XXY         7        B  1402           NaN   left_only
4     12/1/2023  XXY         4        A  1403           NaN   left_only
5     12/1/2023  XXY         4        B  1407           NaN   left_only
6           NaN  XXY         8        B  1402     14/1/2023  right_only
7           NaN  XXY         1        B  1405     14/1/2023  right_only

Step 3

At this stage, we want to decide which value to keep for column date/time. We need the values from date/time_new for all rows that (1) exist in both dfs and (2) exist only in new_df. Since left_only entries will have NaN values in column date/time_new, we can rely on Series.where to accomplish this:

res['date/time'] = res['date/time_new'].where(res['date/time_new'].notna(),
                                              res['date/time_old'])
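
As an equivalent alternative, Series.fillna collapses this into one call, since the only NaNs in date/time_new belong to the left_only rows:

res['date/time'] = res['date/time_new'].fillna(res['date/time_old'])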

Step 4

The following remains to be done:

  • Updating the values for column identify. We can use Series.map for this.
  • Selecting the correct columns from res in the correct order. Let's use the column names from old_df plus identify inside df.loc, chain df.sort_values to sort, and finish with df.reset_index:
mapper = {'both': 'updated',
          'left_only': 'unchanged',
          'right_only': 'new'}

res['identify'] = res['identify'].map(mapper)
res = (res.loc[:, list(old_df.columns) + ['identify']]
       .sort_values(['ID', 'detect_ID'])
       .reset_index(drop=True))

res

   date/time Name  detect_ID category    ID   identify
0  13/1/2023  XXX          1        B  1400    updated
1  14/1/2023  XXY          1        B  1402    updated
2  14/1/2023  XXY          3        B  1402    updated
3  12/1/2023  XXY          7        B  1402  unchanged
4  14/1/2023  XXY          8        B  1402        new
5  12/1/2023  XXY          4        A  1403  unchanged
6  14/1/2023  XXY          1        B  1405        new
7  12/1/2023  XXY          4        B  1407  unchanged

N.B. As mentioned by @Ashyam in the comments above, your desired result doesn't have the row for ID 1405, which exists only in new_df. I'm here assuming that you do want this entry in the result. If not, you can get rid of it as follows:

res = res[res['ID'].isin(old_df['ID'])].reset_index(drop=True)

Of course, this operation could then in fact already be applied to new_df. Cf. the filter for value B in column category above.
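
Finally, to mirror the last step of the code in the question, the result can be written out the same way (reusing the question's filename):

res.to_csv('updated.csv', index=False)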
