英文:
Perform inner_join() of R in Python
问题
I have a Pandas database Network with a network structure like this:
{'Sup': {0: 1002000157, 1: 1002000157, ...},
'Cust': {0: 1002438313, 1: 8039296054, ...}}
I would like to reproduce this R command (possibly without kernel interrupting) in Python:
NodesSharingSupplier <- inner_join(Network, Network, by=c('Sup'='Sup'))
This is an SQL-style inner join, thus I fear that it cannot be performed simply with an inner merge on Sup in Python.
How do I reproduce it in Python?
英文:
I have a Pandas database Network with a network structure like this:
{'Sup': {0: 1002000157,
1: 1002000157,
2: 1002000157,
3: 1002000157,
4: 1002000157,
5: 1002000157,
6: 1002000157,
7: 1002000157,
8: 1002000157,
9: 1002000157,
10: 1002000157,
11: 1002000157,
12: 1002000157,
13: 1002000382,
14: 1002000382,
15: 1002000382,
16: 1002000382,
17: 1002000382,
18: 1002000382,
19: 1002000382,
20: 1002000382,
21: 1002000382,
22: 1002000382,
23: 1002000382,
24: 1002000382,
25: 1002000382,
26: 1002000382,
27: 1002000382,
28: 1002000382,
29: 1002000382},
'Cust': {0: 1002438313,
1: 8039296054,
2: 9003188096,
3: 14900070991,
4: 17005234747,
5: 18006860724,
6: 28000286091,
7: 29009623382,
8: 39000007702,
9: 39004420023,
10: 46000088397,
11: 50000063751,
12: 7000090017,
13: 1900120936,
14: 1900779883,
15: 2000013994,
16: 2001222824,
17: 2003032125,
18: 2900121723,
19: 2900197555,
20: 2902742641,
21: 3000101113,
22: 3000195031,
23: 3000318054,
24: 3900091301,
25: 3911084436,
26: 4900112325,
27: 5900720933,
28: 7000001703,
29: 8000004881}}
I would like to reproduce this R command (possibly without kernel interrupting) in Python:
NodesSharingSupplier <- inner_join(Network, Network, by=c('Sup'='Sup'))
This is an SQL-style inner join, thus I fear that it cannot be performed simply with an inner merge on Sup in Python.
How do I reproduce it in Python?
答案1
得分: 2
Sure, here is the translated part:
"IIUC, you are looking for merge
:"
"您好像在寻找merge
:"
You can remove case where Cust_x == Cust_y
by appending .query('Cust_x != Cust_y')
after .merge(...)
:
您可以在.merge(...)
之后添加.query('Cust_x != Cust_y')
来删除Cust_x == Cust_y
的情况。
Input:
输入:
More information: Pandas Merging 101
更多信息:Pandas Merging 101
Update
更新
> The kernel dies after it. Consider that the Network database has 5 million observations
> 内核在此之后中断了。请考虑网络数据库有500万条观测数据
Use dask
:
使用dask
:
import dask.pandas as dd
NetworkDD = dd.from_pandas(Network, npartitions=Network['Sup'].nunique())
NodesSharingSupplier = dd.merge(NetworkDD, NetworkDD, on='Sup').compute()
英文:
IIUC, you are looking for merge
:
NodesSharingSupplier = Network.merge(Network, on='Sup', how='inner')
print(NodesSharingSupplier)
# Output
Sup Cust_x Cust_y
0 1002000157 1002438313 1002438313
1 1002000157 1002438313 8039296054
2 1002000157 1002438313 9003188096
3 1002000157 1002438313 14900070991
4 1002000157 1002438313 17005234747
.. ... ... ...
453 1002000382 8000004881 3911084436
454 1002000382 8000004881 4900112325
455 1002000382 8000004881 5900720933
456 1002000382 8000004881 7000001703
457 1002000382 8000004881 8000004881
[458 rows x 3 columns]
You can remove case where Cust_x == Cust_y
by appending .query('Cust_x != Cust_y')
after .merge(...)
.
Input:
data = {'Sup': {0: 1002000157, 1: 1002000157, 2: 1002000157, 3: 1002000157, 4: 1002000157, 5: 1002000157, 6: 1002000157, 7: 1002000157, 8: 1002000157, 9: 1002000157, 10: 1002000157, 11: 1002000157, 12: 1002000157, 13: 1002000382, 14: 1002000382, 15: 1002000382, 16: 1002000382, 17: 1002000382, 18: 1002000382, 19: 1002000382, 20: 1002000382, 21: 1002000382, 22: 1002000382, 23: 1002000382, 24: 1002000382, 25: 1002000382, 26: 1002000382, 27: 1002000382, 28: 1002000382, 29: 1002000382},
'Cust': {0: 1002438313, 1: 8039296054, 2: 9003188096, 3: 14900070991, 4: 17005234747, 5: 18006860724, 6: 28000286091, 7: 29009623382, 8: 39000007702, 9: 39004420023, 10: 46000088397, 11: 50000063751, 12: 7000090017, 13: 1900120936, 14: 1900779883, 15: 2000013994, 16: 2001222824, 17: 2003032125, 18: 2900121723, 19: 2900197555, 20: 2902742641, 21: 3000101113, 22: 3000195031, 23: 3000318054, 24: 3900091301, 25: 3911084436, 26: 4900112325, 27: 5900720933, 28: 7000001703, 29: 8000004881}}
Network = pd.DataFrame(data)
More information: Pandas Merging 101
Update
> The kernel dies after it. Consider that the Network database has 5 million observations
Use dask
:
import dask.pandas as dd
NetworkDD = dd.from_pandas(Network, npartitions=Network['Sup'].nunique())
NodesSharingSupplier = dd.merge(NetworkDD, NetworkDD, on='Sup').compute()
答案2
得分: 0
Inner join: merge(df1, df2)
在这些示例中可行,因为R会自动根据共同的变量名连接这些框架,但您很可能希望指定 merge(df1, df2, by = "CustomerId")
,以确保您只匹配您想要的字段。如果不同数据框中的匹配变量具有不同的名称,您还可以使用 by.x
和 by.y
参数。
英文:
Inner join: merge(df1, df2)
will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId")
to make sure that you were matching on only the fields you desired. You can also use the by.x
and by.y
parameters if the matching variables have different names in the different data frames.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论