更高效的方式匹配数据框中的数值是什么?

huangapple go评论66阅读模式
英文:

What's a more efficient way to match values in a dataframe?

问题

I will provide you with the translated code portion:

elist = []
for i, col in A.iterrows():
    for ix, c in B.iterrows():
        if col["Responsible"] == c["Resp Name"] and col["Name"] not in elist and c["Resp Name"] not in C:
            elist.append([col["Name"], col["Gender"].... c["Dpt"]])
        elif col["Responsible"] == c["Resp Name"] and col["Name"] in elist and c["Resp Name"] in C:
            for n, x in C.iterrows():
                if c["Resp Name"] == x["Resp Name"] and col["Store"] == x["Store"] and col["DateLabel"] == x["Date"]:
                    elist.append([col["Name"], col["Gender"].... c["Dpt"]])

Please note that I have replaced the HTML entities like " with the actual characters for better code readability.

英文:

I have a large dataset (like 1m lines) and i need to match each value with another value from other 2 datasets, based on several conditions. This problem is solved with for loops and dataframe.iterrows(), but, in the near future, wil become slow, once we'll add more lines, so i want to find another method for value matching. Example of the tables below:
A:

Name        | Gender         | Responsible         | Function | DateLabel | Store
Bob   D     | M              | Mike  M             | Worker   | Jun-20    | 122
Mike   L    | M              | Josh  J             |Manager   | Apr-21    | 133
Diana V     | F              |Christine            |Manager   | Apr-23    |133

B:

Resp Name      | Dpt
Mike M         | Ops
Josh J         | Logistics
Christine      | Legal

C:

Resp Name         | Store    | Date
Mike   M          |122       | Jun-20
Mike   M          |122       | Jun-21
Mike   M          |122       | Apr-22
Christine         |133       | Apr-23
Christine         |133       | Apr-21

Task: I need to match the name in A, with the Dpt in B if the responsible in A is the same with B. If the responsible from B it's also in C, i need to match the date in A, with the Date in C and the store in A with the store in C and get the dpt from B or C and add it to an extended dataframe, that contains all data from A + the dpt.

My code:

elist=[]
for i, col in A.iterrows():
  for ix, c in B.iterrows():
      if col["Resp"] == c["Resp. Name"] and col["Name"] not in elist and c["Resp name"] not in C:
        elist.append([col["Name"], col["Gender"].... c["Dpt"])
       elif col["Resp"] == c["Resp. Name"] and col["Name"] in elist and c["Resp name"] in C:
          for n, x in c.iterrows():
              if c["resp name"] == x["Resp name"] and col["store"] == x["Store"] and col["DateLabel"] == x["date"]:
                    elist.append([col["Name"], col["Gender"].... c["Dpt"])

答案1

得分: 1

我无法测试百万行数据,但仍然可以尝试这样做。合并操作可以比使用for循环更快。

假设你有三个数据框 A、B 和 C

这里,我们将数据框 A 和 B 合并在一起。"on" 参数告诉 pandas 使用哪一列来匹配这两个数据框的行。在我们的情况下,它使用 "Resp_Name" 列。

"how" 参数定义要执行的合并类型。"left" 表示将包括来自 A 的所有行和仅来自 B 的匹配行。如果没有匹配项,结果将是 NaN。

merged_AB = pd.merge(A, B, on='Resp_Name', how='left')

接下来,我们将上一次合并的结果 (merged_AB) 与数据框 C 合并。这次我们将根据三列进行行匹配:"Resp_Name"、"Store" 和 "Date"

同样地,"how" 参数设置为 "left",这意味着我们将保留 merged_AB 中的所有行和来自 C 的匹配行。如果没有匹配项,结果将是 NaN。

final_merged = pd.merge(merged_AB, C, on=['Resp_Name', 'Store', 'Date'], how='left')

英文:

I cannot test with million rows, but still. Try this. merge can do faster that for loops

# Lets say you have your dfs A, B and C

# Here, we are merging dfs A and B together. The 'on' parameter tells pandas which column to use to match rows across the two dataframes. In our cas it is using the 'Resp_Name' column. 
# The 'how' parameter defines the type of merge to be performed. 'left' means that all the rows from A and only the matching rows from B will be included. If there is no match then the result is nan
merged_AB = pd.merge(A, B, on='Resp_Name', how='left')

# Next, we are taking the result from the previous merge (merged_AB) and merging it with dataframe C. This time we're matching rows based on three columns: 'Resp_Name', 'Store', and 'Date'
# Again, 'how' parameter is set to 'left', it means that we will keep all rows from merged_AB and only the matching rows from C. If there is no match, the result is nan
final_merged = pd.merge(merged_AB, C, on=['Resp_Name', 'Store', 'Date'], how='left')

huangapple
  • 本文由 发表于 2023年6月29日 23:35:54
  • 转载请务必保留本文链接:https://go.coder-hub.com/76582576.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定