2023年4月19日 14:46:53go评论60阅读模式

英文:

Perform inner_join() of R in Python

问题

I have a Pandas database Network with a network structure like this:

{'Sup': {0: 1002000157, 1: 1002000157, ...},
 'Cust': {0: 1002438313, 1: 8039296054, ...}}

I would like to reproduce this R command (possibly without kernel interrupting) in Python:

NodesSharingSupplier <- inner_join(Network, Network, by=c('Sup'='Sup'))

This is an SQL-style inner join, thus I fear that it cannot be performed simply with an inner merge on Sup in Python.

How do I reproduce it in Python?

英文:

I have a Pandas database Network with a network structure like this:

{&#39;Sup&#39;: {0: 1002000157,
  1: 1002000157,
  2: 1002000157,
  3: 1002000157,
  4: 1002000157,
  5: 1002000157,
  6: 1002000157,
  7: 1002000157,
  8: 1002000157,
  9: 1002000157,
  10: 1002000157,
  11: 1002000157,
  12: 1002000157,
  13: 1002000382,
  14: 1002000382,
  15: 1002000382,
  16: 1002000382,
  17: 1002000382,
  18: 1002000382,
  19: 1002000382,
  20: 1002000382,
  21: 1002000382,
  22: 1002000382,
  23: 1002000382,
  24: 1002000382,
  25: 1002000382,
  26: 1002000382,
  27: 1002000382,
  28: 1002000382,
  29: 1002000382},
 &#39;Cust&#39;: {0: 1002438313,
  1: 8039296054,
  2: 9003188096,
  3: 14900070991,
  4: 17005234747,
  5: 18006860724,
  6: 28000286091,
  7: 29009623382,
  8: 39000007702,
  9: 39004420023,
  10: 46000088397,
  11: 50000063751,
  12: 7000090017,
  13: 1900120936,
  14: 1900779883,
  15: 2000013994,
  16: 2001222824,
  17: 2003032125,
  18: 2900121723,
  19: 2900197555,
  20: 2902742641,
  21: 3000101113,
  22: 3000195031,
  23: 3000318054,
  24: 3900091301,
  25: 3911084436,
  26: 4900112325,
  27: 5900720933,
  28: 7000001703,
  29: 8000004881}}

I would like to reproduce this R command (possibly without kernel interrupting) in Python:

NodesSharingSupplier &lt;- inner_join(Network, Network,  by=c(&#39;Sup&#39;=&#39;Sup&#39;))

This is an SQL-style inner join, thus I fear that it cannot be performed simply with an inner merge on Sup in Python.

How do I reproduce it in Python?

答案1

得分: 2

Sure, here is the translated part:

"IIUC, you are looking for merge:"
"您好像在寻找merge："

You can remove case where Cust_x == Cust_y by appending .query('Cust_x != Cust_y') after .merge(...):
您可以在.merge(...)之后添加.query('Cust_x != Cust_y')来删除Cust_x == Cust_y的情况。

Input:
输入：

More information: Pandas Merging 101
更多信息：Pandas Merging 101

Update
更新

> The kernel dies after it. Consider that the Network database has 5 million observations
> 内核在此之后中断了。请考虑网络数据库有500万条观测数据

Use dask:
使用dask：

import dask.pandas as dd

NetworkDD = dd.from_pandas(Network, npartitions=Network[&#39;Sup&#39;].nunique())
NodesSharingSupplier = dd.merge(NetworkDD, NetworkDD, on=&#39;Sup&#39;).compute()

英文:

IIUC, you are looking for merge:

NodesSharingSupplier = Network.merge(Network, on=&#39;Sup&#39;, how=&#39;inner&#39;)
print(NodesSharingSupplier)

# Output
            Sup      Cust_x       Cust_y
0    1002000157  1002438313   1002438313
1    1002000157  1002438313   8039296054
2    1002000157  1002438313   9003188096
3    1002000157  1002438313  14900070991
4    1002000157  1002438313  17005234747
..          ...         ...          ...
453  1002000382  8000004881   3911084436
454  1002000382  8000004881   4900112325
455  1002000382  8000004881   5900720933
456  1002000382  8000004881   7000001703
457  1002000382  8000004881   8000004881

[458 rows x 3 columns]

You can remove case where Cust_x == Cust_y by appending .query('Cust_x != Cust_y') after .merge(...).

Input:

data = {&#39;Sup&#39;: {0: 1002000157, 1: 1002000157, 2: 1002000157, 3: 1002000157, 4: 1002000157, 5: 1002000157, 6: 1002000157, 7: 1002000157, 8: 1002000157, 9: 1002000157, 10: 1002000157, 11: 1002000157, 12: 1002000157, 13: 1002000382, 14: 1002000382, 15: 1002000382, 16: 1002000382, 17: 1002000382, 18: 1002000382, 19: 1002000382, 20: 1002000382, 21: 1002000382, 22: 1002000382, 23: 1002000382, 24: 1002000382, 25: 1002000382, 26: 1002000382, 27: 1002000382, 28: 1002000382, 29: 1002000382},
        &#39;Cust&#39;: {0: 1002438313, 1: 8039296054, 2: 9003188096, 3: 14900070991, 4: 17005234747, 5: 18006860724, 6: 28000286091, 7: 29009623382, 8: 39000007702, 9: 39004420023, 10: 46000088397, 11: 50000063751, 12: 7000090017, 13: 1900120936, 14: 1900779883, 15: 2000013994, 16: 2001222824, 17: 2003032125, 18: 2900121723, 19: 2900197555, 20: 2902742641, 21: 3000101113, 22: 3000195031, 23: 3000318054, 24: 3900091301, 25: 3911084436, 26: 4900112325, 27: 5900720933, 28: 7000001703, 29: 8000004881}}
Network = pd.DataFrame(data)

More information: Pandas Merging 101

Update

> The kernel dies after it. Consider that the Network database has 5 million observations

Use dask:

import dask.pandas as dd

NetworkDD = dd.from_pandas(Network, npartitions=Network[&#39;Sup&#39;].nunique())
NodesSharingSupplier = dd.merge(NetworkDD, NetworkDD, on=&#39;Sup&#39;).compute()

答案2

得分: 0

Inner join: merge(df1, df2) 在这些示例中可行，因为R会自动根据共同的变量名连接这些框架，但您很可能希望指定 merge(df1, df2, by = "CustomerId")，以确保您只匹配您想要的字段。如果不同数据框中的匹配变量具有不同的名称，您还可以使用 by.x 和 by.y 参数。

英文:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Perform inner_join() of R in Python.

问题

答案1

答案2

在Pandas中如何删除包含第三个字母为W的行？

在init内部给当前类添加类型提示。

Python Thread : TypeError: main.generate_num() argument after * must be an iterable, not Event

在Google Colab中使用pdftotext

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论