Updating an old dataframe with conditional matching of different columns and adding new rows in pandas
Question
I have an old dataframe with the following columns and a lot of rows, and it looks like this:
>old_df
date/time Name detect_ID category ID
12/1/2023 XXX 1 B 1400
12/1/2023 XXY 1,3,7 B 1402
12/1/2023 XXY 4 A 1403
12/1/2023 XXY 4 B 1407
.....
I have some information in new_df, which has similar columns, and based on this I want to update old_df. The new dataframe is:
>new_df
date/time Name detect_ID category ID
13/1/2023 XXX 1 B 1400
14/1/2023 XXY 1,3,8 B 1402
14/1/2023 XXY 1 B 1405
.....
For updating, I want the following conditions:
- I want to iterate through the rows of old_df while checking every row of new_df and its information. But I want to check only those rows of old_df where the category column value is 'B'.
- First, the program will keep in mind the first row's ID value of new_df (and go through all rows of new_df on consecutive iterations) while iterating through the rows of old_df. If the ID of this first row of new_df doesn't match any ID of old_df, it will take that full row from new_df and add it as a new row of old_df, while creating a new column in old_df named identify and assigning the value new.
If the ID of this new_df row matches any ID of old_df, it will go through that specific row's detect_ID column value. Now a few things can happen:
A. If that specific detect_ID value of old_df matches that first row's detect_ID value of new_df, it will take that specific row of new_df and replace the matched row of old_df, while the newly created column identify gets the value updated. Also, as you can see, detect_ID can hold multiple values, e.g. 1,3,7; I want to check each of them separately, and some of them may already be integers, so basically split them on ',' and convert them to integers.
B. If that detect_ID value of old_df doesn't match that first row's detect_ID value of new_df, it will take that full row from new_df and add it as a new row of old_df, assigning the value new in the identify column.
- Rows of old_df whose ID does not match any row of new_df, or whose ID matches but whose detect_ID does not, will remain unchanged in old_df and get the value unchanged in the identify column.
I want this to iterate through all rows of old_df until every row of new_df is reflected in old_df.
For the given example, I want an output dataframe like the following:
>output
date/time Name detect_ID category ID identify
13/1/2023 XXX 1 B 1400 updated [Case A]
14/1/2023 XXY 1 B 1402 updated [Case A with multiple detect_ID]
14/1/2023 XXY 3 B 1402 updated
12/1/2023 XXY 7 B 1402 unchanged [Step 3, ID matches but detect_ID does not]
14/1/2023 XXY 8 B 1402 new [Case B]
12/1/2023 XXY 4 A 1403 unchanged
12/1/2023 XXY 4 B 1407 unchanged [Step 3, ID not found in new_df]
I am using the following code, but it doesn't work the way I want: it produces a lot of duplicates and also fails to iterate through many rows of old_df.
import pandas as pd

old_df = pd.read_csv('old.csv')
new_df = pd.read_csv('new.csv')

# Create a set of tuples representing the unique (ID, Detector Id) pairs in the old dataframe
unique_pairs = set()
for _, row in old_df.iterrows():
    detector_ids = [int(x) for x in str(row['Detect_ID']).split(',')]
    for detector_id in detector_ids:
        unique_pairs.add((row['ID'], detector_id))

# Iterate over the rows in the new dataframe and check if their (ID, Detector Id) pair is in the set of unique pairs
new_rows = []
updated_rows = []
for _, row in new_df.iterrows():
    detector_ids = [int(x) for x in str(row['Detect_ID']).split(',')]
    for detector_id in detector_ids:
        if (row['ID'], detector_id) in unique_pairs:
            old_row = old_df.loc[(old_df['ID'] == row['ID']) & (old_df['Detect_ID'].str.contains(str(detector_id)))]
            if not old_row.empty:
                old_row = old_row.iloc[0]
                old_row['Date/Time'] = row['date/time']
                old_df.loc[(old_df['ID'] == row['ID']) & (old_df['Detector_ID'].str.contains(str(detector_id))), 'date/time'] = old_row['date/time']
                updated_rows.append(old_row)
        else:
            row['Identify'] = 'new'
            new_rows.append(row)
            unique_pairs.add((row['ID'], detector_id))

# Append the new rows to the old dataframe and write the updated dataframe to a new file
old_df = old_df.append(new_rows, ignore_index=True)
for row in updated_rows:
    row['Identify'] = 'updated'
old_df = old_df.append(updated_rows, ignore_index=True)
old_df.to_csv('updated.csv', index=False)
Answer 1
Score: 1
Relying on `df.iterrows` nearly always implies a suboptimal approach to manipulations in pandas (see e.g. this SO post). Here's an approach using `df.merge` for the important part.
Step 1
Make sure that all the values in column `detect_ID` are strings by applying `Series.astype(str)`. Now, use `Series.str.split` and `df.explode` to get entries like `1,3,7` into separate rows. Apply to both `dfs`. At this stage, we can change the type (back) to `int` for all values in `detect_ID` (assuming that your data indeed consists of numeric characters).

Since we only want to check on rows that have value `B` in column `category`, filter out any non-`B` values from `new_df` with `Series.eq` (though in your current example no such value exists).
old_df['detect_ID'] = old_df['detect_ID'].astype(str).str.split(',')
old_df = old_df.explode('detect_ID', ignore_index=False)
old_df['detect_ID'] = old_df['detect_ID'].astype(int)
new_df['detect_ID'] = new_df['detect_ID'].astype(str).str.split(',')
new_df = new_df.explode('detect_ID', ignore_index=False)
new_df['detect_ID'] = new_df['detect_ID'].astype(int)
new_df = new_df[new_df['category'].eq('B')]
# `dfs` now as follows
old_df
date/time Name detect_ID category ID
0 12/1/2023 XXX 1 B 1400
1 12/1/2023 XXY 1 B 1402
1 12/1/2023 XXY 3 B 1402
1 12/1/2023 XXY 7 B 1402
2 12/1/2023 XXY 4 A 1403
3 12/1/2023 XXY 4 B 1407
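For reference, `new_df` after the same treatment (derived by hand from the sample above) looks like this:

new_df
date/time Name detect_ID category ID
0 13/1/2023 XXX 1 B 1400
1 14/1/2023 XXY 1 B 1402
1 14/1/2023 XXY 3 B 1402
1 14/1/2023 XXY 8 B 1402
2 14/1/2023 XXY 1 B 1405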
Step 2
Apply `df.merge`. We want to merge on `['Name','detect_ID','category', 'ID']`, keep all the entries from both sides (so: `how='outer'`), and also add an `indicator` column (calling it `identify`), which will tell us the source of each row. Adding custom suffixes (e.g. `'_old'` instead of the default `'_x'`) for clarity's sake.
res = old_df.merge(new_df, on=['Name','detect_ID','category', 'ID'],
                   how='outer', indicator='identify', suffixes=('_old','_new'))
res
date/time_old Name detect_ID category ID date/time_new identify
0 12/1/2023 XXX 1 B 1400 13/1/2023 both
1 12/1/2023 XXY 1 B 1402 14/1/2023 both
2 12/1/2023 XXY 3 B 1402 14/1/2023 both
3 12/1/2023 XXY 7 B 1402 NaN left_only
4 12/1/2023 XXY 4 A 1403 NaN left_only
5 12/1/2023 XXY 4 B 1407 NaN left_only
6 NaN XXY 8 B 1402 14/1/2023 right_only
7 NaN XXY 1 B 1405 14/1/2023 right_only
Step 3
At this stage, we want to decide which value we need to keep for column `date/time`. We need the values from `_new` for all rows that (1) exist in both `dfs`, and (2) exist only in `new_df`. Since `left_only` entries will have `NaN` values in column `date/time_new`, we can rely on `Series.where` to accomplish this:
res['date/time'] = res['date/time_new'].where(res['date/time_new'].notna(),
                                              res['date/time_old'])
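An equivalent, slightly shorter spelling (a sketch relying on the fact that `Series.fillna` accepts another `Series` and fills by index alignment) would be:

res['date/time'] = res['date/time_new'].fillna(res['date/time_old'])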
Step 4
The following remains to be done:

- Updating the values for column `identify`. We can use `Series.map` for this.
- Selecting the correct columns from `res` in the correct order. Let's use the column names from `old_df` plus `identify` inside `df.loc`, and also chain `df.sort_values` for this. Let's also here use `df.reset_index`.
mapper = {'both': 'updated',
          'left_only': 'unchanged',
          'right_only': 'new'}
res['identify'] = res['identify'].map(mapper)
res = (res.loc[:, list(old_df.columns) + ['identify']]
          .sort_values(['ID', 'detect_ID'])
          .reset_index(drop=True))
res
date/time Name detect_ID category ID identify
0 13/1/2023 XXX 1 B 1400 updated
1 14/1/2023 XXY 1 B 1402 updated
2 14/1/2023 XXY 3 B 1402 updated
3 12/1/2023 XXY 7 B 1402 unchanged
4 14/1/2023 XXY 8 B 1402 new
5 12/1/2023 XXY 4 A 1403 unchanged
6 14/1/2023 XXY 1 B 1405 new
7 12/1/2023 XXY 4 B 1407 unchanged
N.B. As mentioned by @Ashyam in the comments above, your desired result doesn't have the row for ID `1405`, which exists only in `new_df`. I'm here assuming that you do want this entry in the new `df`. If not, you can get rid of it as follows:
res = res[res['ID'].isin(old_df['ID'])].reset_index(drop=True)
Of course, this operation could then in fact already be applied to `new_df`. Cf. the filter for value `B` in column `category` above.