[pandas]: most performant way to create foreign keys from one table to another?
Question
I'm working on a very large dataset (4M rows), transforming CSV files to SQL. I need to create joins between several dataframes, using the index of one table as a foreign key in another. I've found a solution, but it's quite slow (1800 rows/s, see below), and I'm looking for a faster, more performant approach.
What's the fastest way to create a join between the two dataframes below? (Replace df1.street with df2.index where df1.street == df2.street and df1.number == df2.number.)
I'm open to any solution, including multiprocessing (tried it, with no luck), as long as it speeds things up while using as little memory as possible. I've also tried df.merge() and other functions, but with no luck so far.
Thanks!
dataframes
df1
import pandas as pd
df1 = pd.DataFrame({
'street': {'qr16ef3677886a44f8b9c5bc37dd660688a': 'quai de la Tournelle', 'qr112e28085e3c84c41b6b6a5e13ecf15ac': 'r. AlexandreDumas', 'qr1a213d2a5cbf64718892b3dbb3a9024f1': 'pass. Brunoy', 'qr1fb0760cd0fab4c71a4297d006ec3d119': 'Montmartre', 'qr167fce4c23d5c4b879ca6423cea15e742': 'Martel'}
, 'number': {'qr16ef3677886a44f8b9c5bc37dd660688a': '33', 'qr112e28085e3c84c41b6b6a5e13ecf15ac': '99', 'qr1a213d2a5cbf64718892b3dbb3a9024f1': '18', 'qr1fb0760cd0fab4c71a4297d006ec3d119': '123', 'qr167fce4c23d5c4b879ca6423cea15e742': '4'}
, 'date': {'qr16ef3677886a44f8b9c5bc37dd660688a': 1914, 'qr112e28085e3c84c41b6b6a5e13ecf15ac': 1900, 'qr1a213d2a5cbf64718892b3dbb3a9024f1': 1921, 'qr1fb0760cd0fab4c71a4297d006ec3d119': 1858, 'qr167fce4c23d5c4b879ca6423cea15e742': 1896}
})
street number date
qr16ef3677886a44f8b9c5bc37dd660688a quai de la Tournelle 33 1914
qr112e28085e3c84c41b6b6a5e13ecf15ac r. AlexandreDumas 99 1900
qr1a213d2a5cbf64718892b3dbb3a9024f1 pass. Brunoy 18 1921
qr1fb0760cd0fab4c71a4297d006ec3d119 Montmartre 123 1858
qr167fce4c23d5c4b879ca6423cea15e742 Martel 4 1896
df2
df2 = pd.DataFrame({
'number': {'qr152f8de48daa64cf098f44fb3d9e7e145': '123', 'qr18ae0099b6afb48a78d466e5ed6871bec': '18', 'qr183daee61fb98489ebd05556968027a0d': '18', 'qr1e0ee6ec37dbd4e799905db721592ba48': '33', 'qr148505eca183c4fb38f844c35130b92f0': '4'}
, 'street': {'qr152f8de48daa64cf098f44fb3d9e7e145': 'Montmartre', 'qr18ae0099b6afb48a78d466e5ed6871bec': 'Montmartre', 'qr183daee61fb98489ebd05556968027a0d': 'pass. Brunoy', 'qr1e0ee6ec37dbd4e799905db721592ba48': 'quai de la Tournelle', 'qr148505eca183c4fb38f844c35130b92f0': 'Martel'}
, 'date': {'qr152f8de48daa64cf098f44fb3d9e7e145': ['1858', '1858'], 'qr18ae0099b6afb48a78d466e5ed6871bec': ['1876', '1881'], 'qr183daee61fb98489ebd05556968027a0d': ['1921', '1921'], 'qr1e0ee6ec37dbd4e799905db721592ba48': ['1914', '1914'], 'qr148505eca183c4fb38f844c35130b92f0': ['1896', '1896']}
})
number street date
qr152f8de48daa64cf098f44fb3d9e7e145 123 Montmartre [1858, 1858]
qr18ae0099b6afb48a78d466e5ed6871bec 18 Montmartre [1876, 1881]
qr183daee61fb98489ebd05556968027a0d 18 pass. Brunoy [1921, 1921]
qr1e0ee6ec37dbd4e799905db721592ba48 33 quai de la Tournelle [1914, 1914]
qr148505eca183c4fb38f844c35130b92f0 4 Martel [1896, 1896]
my solution
My solution relies on a function (foreignkey()) called inside df.apply(). However, df.apply() is known to be slower than vectorized methods, and my solution comes dangerously close to a double loop over df1 and df2.
from tqdm import tqdm
tqdm.pandas()  # enables df.progress_apply()

def foreignkey(ro: pd.Series) -> pd.Series:
    """
    replace the address in `ro` of `df1` by a foreign key
    pointing to `df2` (the key replaces `ro.street`)
    """
    ro.street = df2.loc[
        (df2.street == ro.street)    # address has the same street full name
        & (df2.number == ro.number)  # address has the same street number
    ].index[0]                       # raises IndexError if there is no match
    return ro

df1 = df1.progress_apply(foreignkey, axis=1)
Answer 1
Score: 1
What you described is a classic merge operation (called a JOIN in the SQL world).
Try this to see if it speeds up your code:
tmp = (
    df1.reset_index(names="df1_id")  # `names=` requires pandas >= 1.5
    .merge(
        df2[["street", "number"]].reset_index(names="df2_id"),
        how="left",
        on=["street", "number"],
    )
    .drop_duplicates("df1_id")
    .set_index("df1_id")
)
df1["street_id"] = tmp["df2_id"]
The drop_duplicates is there to keep only the first match when one row in df1 matches multiple rows in df2.
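If memory is the bottleneck on 4M rows, a plain-dictionary lookup keyed on (street, number) pairs achieves the same mapping without materializing an intermediate merged frame. A minimal sketch on stand-in data (the index labels and frame contents below are illustrative, not the question's real IDs):

```python
import pandas as pd

# Stand-in frames with the same column layout as the question's.
df1 = pd.DataFrame(
    {"street": ["quai de la Tournelle", "Montmartre"], "number": ["33", "123"]},
    index=["addr_a", "addr_b"],
)
df2 = pd.DataFrame(
    {"street": ["Montmartre", "quai de la Tournelle"], "number": ["123", "33"]},
    index=["fk_1", "fk_2"],
)

# Build a (street, number) -> df2-index lookup; setdefault keeps the
# first match, mirroring the drop_duplicates step in the merge above.
lookup = {}
for idx, street, number in zip(df2.index, df2["street"], df2["number"]):
    lookup.setdefault((street, number), idx)

# Map each df1 row to its foreign key; .get returns None when there is
# no match, like the merge's how="left".
df1["street_id"] = [lookup.get(key) for key in zip(df1["street"], df1["number"])]
print(df1["street_id"].tolist())  # ['fk_2', 'fk_1']
```

This trades the merge's internal hash join for an explicit Python dict: slower per row than a vectorized merge, but the peak memory is just the dict plus one output column.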