英文:
What's a more efficient way to match values in a dataframe?
问题
I will provide you with the translated code portion:
elist = []
for i, col in A.iterrows():
    for ix, c in B.iterrows():
        if col["Responsible"] == c["Resp Name"] and col["Name"] not in elist and c["Resp Name"] not in C:
            elist.append([col["Name"], col["Gender"].... c["Dpt"]])
        elif col["Responsible"] == c["Resp Name"] and col["Name"] in elist and c["Resp Name"] in C:
            for n, x in C.iterrows():
                if c["Resp Name"] == x["Resp Name"] and col["Store"] == x["Store"] and col["DateLabel"] == x["Date"]:
                    elist.append([col["Name"], col["Gender"].... c["Dpt"]])
Please note that I have replaced the HTML entities like " with the actual characters for better code readability.
英文:
I have a large dataset (like 1m lines) and i need to match each value with another value from other 2 datasets, based on several conditions. This problem is solved with for loops and dataframe.iterrows(), but, in the near future, wil become slow, once we'll add more lines, so i want to find another method for value matching. Example of the tables below:
A:
Name        | Gender         | Responsible         | Function | DateLabel | Store
Bob   D     | M              | Mike  M             | Worker   | Jun-20    | 122
Mike   L    | M              | Josh  J             |Manager   | Apr-21    | 133
Diana V     | F              |Christine            |Manager   | Apr-23    |133
B:
Resp Name      | Dpt
Mike M         | Ops
Josh J         | Logistics
Christine      | Legal
C:
Resp Name         | Store    | Date
Mike   M          |122       | Jun-20
Mike   M          |122       | Jun-21
Mike   M          |122       | Apr-22
Christine         |133       | Apr-23
Christine         |133       | Apr-21
Task: I need to match the name in A, with the Dpt in B if the responsible in A is the same with B. If the responsible from B it's also in C, i need to match the date in A, with the Date in C and the store in A with the store in C and get the dpt from B or C and add it to an extended dataframe, that contains all data from A + the dpt.
My code:
elist=[]
for i, col in A.iterrows():
  for ix, c in B.iterrows():
      if col["Resp"] == c["Resp. Name"] and col["Name"] not in elist and c["Resp name"] not in C:
        elist.append([col["Name"], col["Gender"].... c["Dpt"])
       elif col["Resp"] == c["Resp. Name"] and col["Name"] in elist and c["Resp name"] in C:
          for n, x in c.iterrows():
              if c["resp name"] == x["Resp name"] and col["store"] == x["Store"] and col["DateLabel"] == x["date"]:
                    elist.append([col["Name"], col["Gender"].... c["Dpt"])
答案1
得分: 1
我无法测试百万行数据,但仍然可以尝试这样做。合并操作可以比使用for循环更快。
假设你有三个数据框 A、B 和 C
这里,我们将数据框 A 和 B 合并在一起。"on" 参数告诉 pandas 使用哪一列来匹配这两个数据框的行。在我们的情况下,它使用 "Resp_Name" 列。
"how" 参数定义要执行的合并类型。"left" 表示将包括来自 A 的所有行和仅来自 B 的匹配行。如果没有匹配项,结果将是 NaN。
merged_AB = pd.merge(A, B, on='Resp_Name', how='left')
接下来,我们将上一次合并的结果 (merged_AB) 与数据框 C 合并。这次我们将根据三列进行行匹配:"Resp_Name"、"Store" 和 "Date"
同样地,"how" 参数设置为 "left",这意味着我们将保留 merged_AB 中的所有行和来自 C 的匹配行。如果没有匹配项,结果将是 NaN。
final_merged = pd.merge(merged_AB, C, on=['Resp_Name', 'Store', 'Date'], how='left')
英文:
I cannot test with million rows, but still. Try this. merge can do faster that for loops
# Lets say you have your dfs A, B and C
# Here, we are merging dfs A and B together. The 'on' parameter tells pandas which column to use to match rows across the two dataframes. In our cas it is using the 'Resp_Name' column. 
# The 'how' parameter defines the type of merge to be performed. 'left' means that all the rows from A and only the matching rows from B will be included. If there is no match then the result is nan
merged_AB = pd.merge(A, B, on='Resp_Name', how='left')
# Next, we are taking the result from the previous merge (merged_AB) and merging it with dataframe C. This time we're matching rows based on three columns: 'Resp_Name', 'Store', and 'Date'
# Again, 'how' parameter is set to 'left', it means that we will keep all rows from merged_AB and only the matching rows from C. If there is no match, the result is nan
final_merged = pd.merge(merged_AB, C, on=['Resp_Name', 'Store', 'Date'], how='left')
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。


评论