June 29, 2023, 23:35:54

What's a more efficient way to match values in a dataframe?

Question


I have a large dataset (around 1M rows) and I need to match each value with a value from two other datasets, based on several conditions. Right now this is solved with for loops and DataFrame.iterrows(), but it will become slow in the near future once we add more rows, so I want to find another method for matching the values. Example tables below:
A:

Name    | Gender | Responsible | Function | DateLabel | Store
Bob D   | M      | Mike M      | Worker   | Jun-20    | 122
Mike L  | M      | Josh J      | Manager  | Apr-21    | 133
Diana V | F      | Christine   | Manager  | Apr-23    | 133

B:

Resp Name | Dpt
Mike M    | Ops
Josh J    | Logistics
Christine | Legal

C:

Resp Name | Store | Date
Mike M    | 122   | Jun-20
Mike M    | 122   | Jun-21
Mike M    | 122   | Apr-22
Christine | 133   | Apr-23
Christine | 133   | Apr-21

Task: I need to match the name in A with the Dpt in B when the responsible in A is the same as the one in B. If the responsible from B is also in C, I need to match the date in A with the Date in C and the store in A with the Store in C, get the Dpt from B or C, and add it to an extended dataframe that contains all the data from A plus the Dpt.

My code:

elist = []
for i, col in A.iterrows():
    for ix, c in B.iterrows():
        # Intended: the responsible is in B but not in C, so take the Dpt from B directly
        if col["Responsible"] == c["Resp Name"] and col["Name"] not in elist and c["Resp Name"] not in C:
            # "..." stands for the remaining columns of A, elided in the original post
            elist.append([col["Name"], col["Gender"], ..., c["Dpt"]])
        # Intended: the responsible is also in C, so Store and Date must match as well
        elif col["Responsible"] == c["Resp Name"] and col["Name"] in elist and c["Resp Name"] in C:
            for n, x in C.iterrows():
                if c["Resp Name"] == x["Resp Name"] and col["Store"] == x["Store"] and col["DateLabel"] == x["Date"]:
                    elist.append([col["Name"], col["Gender"], ..., c["Dpt"]])

Answer 1

Score: 1


I can't test this on a million rows, but try the following. A merge is usually much faster than nested for loops.

import pandas as pd

# Let's say you have your DataFrames A, B and C.

# First, merge A and B. The key column has a different name in each table, so
# left_on / right_on tell pandas which columns to match: 'Responsible' in A and 'Resp Name' in B.
# how='left' keeps all rows from A and only the matching rows from B. Where there is no match, the B columns are NaN.
merged_AB = pd.merge(A, B, left_on='Responsible', right_on='Resp Name', how='left')

# Next, merge the result (merged_AB) with C. This time rows are matched on three columns:
# the responsible's name, the store and the date ('DateLabel' in A, 'Date' in C).
# Again how='left' keeps all rows from merged_AB and only the matching rows from C. Where there is no match, the C columns are NaN.
final_merged = pd.merge(merged_AB, C, left_on=['Responsible', 'Store', 'DateLabel'], right_on=['Resp Name', 'Store', 'Date'], how='left')
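
To make the two merges above concrete, here is a minimal sketch run against the sample tables from the question. It assumes the column names shown in those tables ('Responsible' and 'DateLabel' in A, 'Resp Name' in B and C, 'Date' in C) and that each responsible appears only once in B. It also assumes, following the loop version, that a row of A is dropped when its responsible is listed in C but no (Store, Date) pair matches. The helper name match_department is hypothetical, and none of this has been timed on a million rows.

```python
import pandas as pd

# Sample tables from the question (extra spaces inside names removed).
A = pd.DataFrame({
    "Name": ["Bob D", "Mike L", "Diana V"],
    "Gender": ["M", "M", "F"],
    "Responsible": ["Mike M", "Josh J", "Christine"],
    "Function": ["Worker", "Manager", "Manager"],
    "DateLabel": ["Jun-20", "Apr-21", "Apr-23"],
    "Store": [122, 133, 133],
})
B = pd.DataFrame({
    "Resp Name": ["Mike M", "Josh J", "Christine"],
    "Dpt": ["Ops", "Logistics", "Legal"],
})
C = pd.DataFrame({
    "Resp Name": ["Mike M", "Mike M", "Mike M", "Christine", "Christine"],
    "Store": [122, 122, 122, 133, 133],
    "Date": ["Jun-20", "Jun-21", "Apr-22", "Apr-23", "Apr-21"],
})


def match_department(A, B, C):
    """Return A plus a 'Dpt' column, applying the B and C conditions (hypothetical helper)."""
    # Step 1: attach Dpt from B. validate raises if a responsible is duplicated in B,
    # which would otherwise silently multiply the rows of A.
    ab = A.merge(
        B, left_on="Responsible", right_on="Resp Name",
        how="left", validate="many_to_one",
    ).drop(columns="Resp Name")

    # Step 2: flag rows whose (responsible, store, date) triple exists in C.
    # Dropping duplicate keys first keeps the row count of `ab` unchanged.
    keys = C[["Resp Name", "Store", "Date"]].drop_duplicates()
    abc = ab.merge(
        keys,
        left_on=["Responsible", "Store", "DateLabel"],
        right_on=["Resp Name", "Store", "Date"],
        how="left", indicator=True,
    )

    in_b = abc["Responsible"].isin(B["Resp Name"])   # responsible known in B
    in_c = abc["Responsible"].isin(C["Resp Name"])   # responsible also listed in C
    matched = abc["_merge"].eq("both")                # store and date match in C

    # Keep a row if its responsible is in B and either is absent from C
    # or has a matching (Store, Date) entry in C.
    keep = in_b & (~in_c | matched)
    return abc.loc[keep, list(A.columns) + ["Dpt"]].reset_index(drop=True)


print(match_department(A, B, C))
```

Using indicator=True on the second merge avoids carrying C's columns into the result just to test for a match, and isin replaces the membership checks on C that the loop version attempted. If 'Store' is stored as text in one table and as a number in the other, the keys will not match until both are converted to the same dtype.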
