What's a more efficient way to match values in a dataframe?
Question
I have a large dataset (about 1M rows) and I need to match each value against values from two other datasets, based on several conditions. I currently do this with for loops and DataFrame.iterrows(), but it will become slow as we add more rows, so I want to find another way to do the matching. Example tables below:
A:
Name    | Gender | Responsible | Function | DateLabel | Store
Bob D   | M      | Mike M      | Worker   | Jun-20    | 122
Mike L  | M      | Josh J      | Manager  | Apr-21    | 133
Diana V | F      | Christine   | Manager  | Apr-23    | 133
B:
Resp Name | Dpt
Mike M | Ops
Josh J | Logistics
Christine | Legal
C:
Resp Name | Store | Date
Mike M    | 122   | Jun-20
Mike M    | 122   | Jun-21
Mike M    | 122   | Apr-22
Christine | 133   | Apr-23
Christine | 133   | Apr-21
Task: I need to match each Name in A with the Dpt in B where the Responsible in A is the same as the Resp Name in B. If that responsible from B is also in C, I need to match the DateLabel in A with the Date in C and the Store in A with the Store in C, then get the Dpt from B or C and add it to an extended DataFrame that contains all the data from A plus the Dpt.
My code:
elist = []
for i, col in A.iterrows():
    for ix, c in B.iterrows():
        if col["Responsible"] == c["Resp Name"] and col["Name"] not in elist and c["Resp Name"] not in C:
            # "..." stands for the remaining columns of A, elided here
            elist.append([col["Name"], col["Gender"], ..., c["Dpt"]])
        elif col["Responsible"] == c["Resp Name"] and col["Name"] in elist and c["Resp Name"] in C:
            for n, x in C.iterrows():
                if c["Resp Name"] == x["Resp Name"] and col["Store"] == x["Store"] and col["DateLabel"] == x["Date"]:
                    elist.append([col["Name"], col["Gender"], ..., c["Dpt"]])
Answer 1
Score: 1
I can't test this on a million rows, but try the following. merge is generally much faster than looping with iterrows().
import pandas as pd
# Let's say you have your DataFrames A, B and C.
# First, merge A and B. The key is called 'Responsible' in A and 'Resp Name' in B, so 'left_on' and 'right_on' tell pandas which column to match in each DataFrame.
# The 'how' parameter defines the type of merge. 'left' keeps all rows from A and adds only the matching rows from B. If there is no match, the result is NaN.
# B's key column is dropped afterwards because it only repeats 'Responsible'.
merged_AB = pd.merge(A, B, left_on='Responsible', right_on='Resp Name', how='left').drop(columns='Resp Name')
# Next, merge the result (merged_AB) with C. This time rows are matched on three column pairs: 'Responsible'/'Resp Name', 'Store'/'Store' and 'DateLabel'/'Date'.
# Again, how='left' keeps all rows from merged_AB and only the matching rows from C. If there is no match, the result is NaN.
final_merged = pd.merge(merged_AB, C, left_on=['Responsible', 'Store', 'DateLabel'], right_on=['Resp Name', 'Store', 'Date'], how='left')
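To make the merge approach concrete, here is a minimal sketch run on the sample tables from the question. It assumes the column names exactly as they appear in those tables ('Responsible', 'Resp Name', 'DateLabel', 'Date', 'Store'). The helper columns in_C and matched_C and the final filter are hypothetical: they reflect one possible reading of the task (keep a row if its responsible is not in C at all, or if the full responsible/store/date triple is found in C), so adjust them if your rule differs.

```python
import pandas as pd

# Sample data copied from the question's tables.
A = pd.DataFrame({
    "Name": ["Bob D", "Mike L", "Diana V"],
    "Gender": ["M", "M", "F"],
    "Responsible": ["Mike M", "Josh J", "Christine"],
    "Function": ["Worker", "Manager", "Manager"],
    "DateLabel": ["Jun-20", "Apr-21", "Apr-23"],
    "Store": [122, 133, 133],
})
B = pd.DataFrame({
    "Resp Name": ["Mike M", "Josh J", "Christine"],
    "Dpt": ["Ops", "Logistics", "Legal"],
})
C = pd.DataFrame({
    "Resp Name": ["Mike M", "Mike M", "Mike M", "Christine", "Christine"],
    "Store": [122, 122, 122, 133, 133],
    "Date": ["Jun-20", "Jun-21", "Apr-22", "Apr-23", "Apr-21"],
})

# Step 1: attach Dpt to every row of A via the responsible's name.
# B's key column is dropped because it only repeats 'Responsible'.
merged_AB = A.merge(
    B, left_on="Responsible", right_on="Resp Name", how="left"
).drop(columns="Resp Name")

# Step 2: check which (responsible, store, date) triples exist in C.
# Renaming C's columns lets us merge with a plain on=[...];
# drop_duplicates() keeps repeated triples in C from multiplying rows.
keys_C = C.drop_duplicates().rename(
    columns={"Resp Name": "Responsible", "Date": "DateLabel"}
)
final_merged = merged_AB.merge(
    keys_C, on=["Responsible", "Store", "DateLabel"],
    how="left", indicator=True,
)

# Hypothetical helper columns (names are not from the question):
#   in_C      -> the responsible appears somewhere in C
#   matched_C -> the exact (responsible, store, date) triple appears in C
final_merged["in_C"] = final_merged["Responsible"].isin(C["Resp Name"])
final_merged["matched_C"] = final_merged["_merge"] == "both"
final_merged = final_merged.drop(columns="_merge")

# One reading of the task: keep rows whose responsible is absent from C,
# or whose full triple matches a row of C.
result = final_merged[~final_merged["in_C"] | final_merged["matched_C"]]
print(result)
```

Both merges are vectorized joins, so the work no longer grows with the product of the row counts the way the nested iterrows() loops do. The indicator=True flag is what exposes whether a row found a partner in C, which a plain left merge would only show indirectly through NaN values.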