Perform inner_join() of R in Python.

huangapple go评论60阅读模式
英文:

Perform inner_join() of R in Python

问题

I have a Pandas database Network with a network structure like this:

{'Sup': {0: 1002000157, 1: 1002000157, ...},
 'Cust': {0: 1002438313, 1: 8039296054, ...}}

I would like to reproduce this R command (possibly without kernel interrupting) in Python:

NodesSharingSupplier <- inner_join(Network, Network, by=c('Sup'='Sup'))

This is an SQL-style inner join, thus I fear that it cannot be performed simply with an inner merge on Sup in Python.

How do I reproduce it in Python?

英文:

I have a Pandas database Network with a network structure like this:

{&#39;Sup&#39;: {0: 1002000157,
  1: 1002000157,
  2: 1002000157,
  3: 1002000157,
  4: 1002000157,
  5: 1002000157,
  6: 1002000157,
  7: 1002000157,
  8: 1002000157,
  9: 1002000157,
  10: 1002000157,
  11: 1002000157,
  12: 1002000157,
  13: 1002000382,
  14: 1002000382,
  15: 1002000382,
  16: 1002000382,
  17: 1002000382,
  18: 1002000382,
  19: 1002000382,
  20: 1002000382,
  21: 1002000382,
  22: 1002000382,
  23: 1002000382,
  24: 1002000382,
  25: 1002000382,
  26: 1002000382,
  27: 1002000382,
  28: 1002000382,
  29: 1002000382},
 &#39;Cust&#39;: {0: 1002438313,
  1: 8039296054,
  2: 9003188096,
  3: 14900070991,
  4: 17005234747,
  5: 18006860724,
  6: 28000286091,
  7: 29009623382,
  8: 39000007702,
  9: 39004420023,
  10: 46000088397,
  11: 50000063751,
  12: 7000090017,
  13: 1900120936,
  14: 1900779883,
  15: 2000013994,
  16: 2001222824,
  17: 2003032125,
  18: 2900121723,
  19: 2900197555,
  20: 2902742641,
  21: 3000101113,
  22: 3000195031,
  23: 3000318054,
  24: 3900091301,
  25: 3911084436,
  26: 4900112325,
  27: 5900720933,
  28: 7000001703,
  29: 8000004881}}

I would like to reproduce this R command (possibly without kernel interrupting) in Python:

NodesSharingSupplier &lt;- inner_join(Network, Network,  by=c(&#39;Sup&#39;=&#39;Sup&#39;))

This is an SQL-style inner join, thus I fear that it cannot be performed simply with an inner merge on Sup in Python.

How do I reproduce it in Python?

答案1

得分: 2

Sure, here is the translated part:

"IIUC, you are looking for merge:"
"您好像在寻找merge:"

You can remove case where Cust_x == Cust_y by appending .query(&#39;Cust_x != Cust_y&#39;) after .merge(...):
您可以在.merge(...)之后添加.query(&#39;Cust_x != Cust_y&#39;)来删除Cust_x == Cust_y的情况。

Input:
输入:

More information: Pandas Merging 101
更多信息:Pandas Merging 101

Update
更新

> The kernel dies after it. Consider that the Network database has 5 million observations
> 内核在此之后中断了。请考虑网络数据库有500万条观测数据

Use dask:
使用dask

import dask.pandas as dd

NetworkDD = dd.from_pandas(Network, npartitions=Network[&#39;Sup&#39;].nunique())
NodesSharingSupplier = dd.merge(NetworkDD, NetworkDD, on=&#39;Sup&#39;).compute()
英文:

IIUC, you are looking for merge:

NodesSharingSupplier = Network.merge(Network, on=&#39;Sup&#39;, how=&#39;inner&#39;)
print(NodesSharingSupplier)

# Output
            Sup      Cust_x       Cust_y
0    1002000157  1002438313   1002438313
1    1002000157  1002438313   8039296054
2    1002000157  1002438313   9003188096
3    1002000157  1002438313  14900070991
4    1002000157  1002438313  17005234747
..          ...         ...          ...
453  1002000382  8000004881   3911084436
454  1002000382  8000004881   4900112325
455  1002000382  8000004881   5900720933
456  1002000382  8000004881   7000001703
457  1002000382  8000004881   8000004881

[458 rows x 3 columns]

You can remove case where Cust_x == Cust_y by appending .query(&#39;Cust_x != Cust_y&#39;) after .merge(...).

Input:

data = {&#39;Sup&#39;: {0: 1002000157, 1: 1002000157, 2: 1002000157, 3: 1002000157, 4: 1002000157, 5: 1002000157, 6: 1002000157, 7: 1002000157, 8: 1002000157, 9: 1002000157, 10: 1002000157, 11: 1002000157, 12: 1002000157, 13: 1002000382, 14: 1002000382, 15: 1002000382, 16: 1002000382, 17: 1002000382, 18: 1002000382, 19: 1002000382, 20: 1002000382, 21: 1002000382, 22: 1002000382, 23: 1002000382, 24: 1002000382, 25: 1002000382, 26: 1002000382, 27: 1002000382, 28: 1002000382, 29: 1002000382},
        &#39;Cust&#39;: {0: 1002438313, 1: 8039296054, 2: 9003188096, 3: 14900070991, 4: 17005234747, 5: 18006860724, 6: 28000286091, 7: 29009623382, 8: 39000007702, 9: 39004420023, 10: 46000088397, 11: 50000063751, 12: 7000090017, 13: 1900120936, 14: 1900779883, 15: 2000013994, 16: 2001222824, 17: 2003032125, 18: 2900121723, 19: 2900197555, 20: 2902742641, 21: 3000101113, 22: 3000195031, 23: 3000318054, 24: 3900091301, 25: 3911084436, 26: 4900112325, 27: 5900720933, 28: 7000001703, 29: 8000004881}}
Network = pd.DataFrame(data)

More information: Pandas Merging 101

Update

> The kernel dies after it. Consider that the Network database has 5 million observations

Use dask:

import dask.pandas as dd

NetworkDD = dd.from_pandas(Network, npartitions=Network[&#39;Sup&#39;].nunique())
NodesSharingSupplier = dd.merge(NetworkDD, NetworkDD, on=&#39;Sup&#39;).compute()

答案2

得分: 0

Inner join: merge(df1, df2) 在这些示例中可行,因为R会自动根据共同的变量名连接这些框架,但您很可能希望指定 merge(df1, df2, by = "CustomerId"),以确保您只匹配您想要的字段。如果不同数据框中的匹配变量具有不同的名称,您还可以使用 by.xby.y 参数。

英文:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = &quot;CustomerId&quot;) to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

huangapple
  • 本文由 发表于 2023年4月19日 14:46:53
  • 转载请务必保留本文链接:https://go.coder-hub.com/76051470.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定