[pandas]: 从一张表创建外键到另一张表的最高性能方式?

huangapple go评论66阅读模式
英文:

[pandas]: most performant way to create foreign keys from one table to another?

问题

我正在处理一个非常庞大的数据集(400万行),将CSV文件转换为SQL。我需要在几个数据帧之间创建连接,使用一张表的索引作为另一张表的外键。我已经找到了一种解决方案,但速度很慢(1800行/秒,见下文),我正在寻找一种更快的方法。

创建两个以下数据帧之间连接的最快方法是什么?(用df1.street替换为df2.index,其中df1.street == df2.street并且df1.number == df2.number

我对任何解决方案都持开放态度,包括多进程(已尝试过,但没有运气),只要能加快速度,尽量使用尽可能少的内存。我也尝试过df.merge()和其他函数,但到目前为止都没有运气。

谢谢!!!


数据框

df1

import pandas as pd

df1 = pd.DataFrame({
    'street': {'qr16ef3677886a44f8b9c5bc37dd660688a': 'quai de la Tournelle', 'qr112e28085e3c84c41b6b6a5e13ecf15ac': 'r. AlexandreDumas', 'qr1a213d2a5cbf64718892b3dbb3a9024f1': 'pass. Brunoy', 'qr1fb0760cd0fab4c71a4297d006ec3d119': 'Montmartre', 'qr167fce4c23d5c4b879ca6423cea15e742': 'Martel'},
    'number': {'qr16ef3677886a44f8b9c5bc37dd660688a': '33', 'qr112e28085e3c84c41b6b6a5e13ecf15ac': '99', 'qr1a213d2a5cbf64718892b3dbb3a9024f1': '18', 'qr1fb0760cd0fab4c71a4297d006ec3d119': '123', 'qr167fce4c23d5c4b879ca6423cea15e742': '4'},
    'date': {'qr16ef3677886a44f8b9c5bc37dd660688a': 1914, 'qr112e28085e3c84c41b6b6a5e13ecf15ac': 1900, 'qr1a213d2a5cbf64718892b3dbb3a9024f1': 1921, 'qr1fb0760cd0fab4c71a4297d006ec3d119': 1858, 'qr167fce4c23d5c4b879ca6423cea15e742': 1896}
})

df2

df2 = pd.DataFrame({
    'number': {'qr152f8de48daa64cf098f44fb3d9e7e145': '123', 'qr18ae0099b6afb48a78d466e5ed6871bec': '18', 'qr183daee61fb98489ebd05556968027a0d': '18', 'qr1e0ee6ec37dbd4e799905db721592ba48': '33', 'qr148505eca183c4fb38f844c35130b92f0': '4'},
    'street': {'qr152f8de48daa64cf098f44fb3d9e7e145': 'Montmartre', 'qr18ae0099b6afb48a78d466e5ed6871bec': 'Montmartre', 'qr183daee61fb98489ebd05556968027a0d': 'pass. Brunoy', 'qr1e0ee6ec37dbd4e799905db721592ba48': 'quai de la Tournelle', 'qr148505eca183c4fb38f844c35130b92f0': 'Martel'},
    'date': {'qr152f8de48daa64cf098f44fb3d9e7e145': ['1858', '1858'], 'qr18ae0099b6afb48a78d466e5ed6871bec': ['1876', '1881'], 'qr183daee61fb98489ebd05556968027a0d': ['1921', '1921'], 'qr1e0ee6ec37dbd4e799905db721592ba48': ['1914', '1914'], 'qr148505eca183c4fb38f844c35130b92f0': ['1896', '1896']}
})

我的解决方案

我的解决方案依赖于使用一个函数 (foreignkey()),在 df.apply() 中调用。然而,df.apply() 据说比其他方法慢,而我的解决方案危险地接近于对 df1df2 进行双重循环。

def foreignkey(ro: pd.core.series.Series) -> pd.core.series.Series:
    """
    用指向`df2`的外键替换`df1`中的地址
    外键插入到`ro.name`中
    """
    ro.street = df2.loc[
        ( df2.street == ro.street )    # 地址具有相同的街道全名
        & ( df2.number == ro.number )  # 地址具有相同的街道号码
    ].index[0]
    return ro

df1 = df1.progress_apply( lambda x: foreignkey(x), axis=1 )
英文:

i'm working on a very large dataset (4M rows) to transform CSV files to SQL. i need to create joins between several dataframes, using the index of a table as the foreign key to another table. i've found a solution which is quite slow (1800 rows/s, see below), and i'm looking for a faster (more performant) way.

what's the fastest way to create joins between the two following dataframes? (replace df1.street by df2.index where df1.street == df2.street and df1.number == df2.number)

i'm open to any solutions, including multiprocessing (tried it, but no luck at it) as long as it makes things go faster, using as little memory as possible. i've also tried df.merge() and other functions, but no luck so far.

thanks !!!


dataframes

df1

import pandas as pd

df1 = pd.DataFrame({
    'street': {'qr16ef3677886a44f8b9c5bc37dd660688a': 'quai de la Tournelle', 'qr112e28085e3c84c41b6b6a5e13ecf15ac': 'r. AlexandreDumas', 'qr1a213d2a5cbf64718892b3dbb3a9024f1': 'pass. Brunoy', 'qr1fb0760cd0fab4c71a4297d006ec3d119': 'Montmartre', 'qr167fce4c23d5c4b879ca6423cea15e742': 'Martel'}
    , 'number': {'qr16ef3677886a44f8b9c5bc37dd660688a': '33', 'qr112e28085e3c84c41b6b6a5e13ecf15ac': '99', 'qr1a213d2a5cbf64718892b3dbb3a9024f1': '18', 'qr1fb0760cd0fab4c71a4297d006ec3d119': '123', 'qr167fce4c23d5c4b879ca6423cea15e742': '4'}
    , 'date': {'qr16ef3677886a44f8b9c5bc37dd660688a': 1914, 'qr112e28085e3c84c41b6b6a5e13ecf15ac': 1900, 'qr1a213d2a5cbf64718892b3dbb3a9024f1': 1921, 'qr1fb0760cd0fab4c71a4297d006ec3d119': 1858, 'qr167fce4c23d5c4b879ca6423cea15e742': 1896}
})
                                                    street number  date
qr16ef3677886a44f8b9c5bc37dd660688a  quai de la Tournelle     33  1914
qr112e28085e3c84c41b6b6a5e13ecf15ac     r. AlexandreDumas     99  1900
qr1a213d2a5cbf64718892b3dbb3a9024f1          pass. Brunoy     18  1921
qr1fb0760cd0fab4c71a4297d006ec3d119            Montmartre    123  1858
qr167fce4c23d5c4b879ca6423cea15e742                Martel      4  1896 

df2

df2 = pd.DataFrame({
    'number': {'qr152f8de48daa64cf098f44fb3d9e7e145': '123', 'qr18ae0099b6afb48a78d466e5ed6871bec': '18', 'qr183daee61fb98489ebd05556968027a0d': '18', 'qr1e0ee6ec37dbd4e799905db721592ba48': '33', 'qr148505eca183c4fb38f844c35130b92f0': '4'}
    , 'street': {'qr152f8de48daa64cf098f44fb3d9e7e145': 'Montmartre', 'qr18ae0099b6afb48a78d466e5ed6871bec': 'Montmartre', 'qr183daee61fb98489ebd05556968027a0d': 'pass. Brunoy', 'qr1e0ee6ec37dbd4e799905db721592ba48': 'quai de la Tournelle', 'qr148505eca183c4fb38f844c35130b92f0': 'Martel'}
    , 'date': {'qr152f8de48daa64cf098f44fb3d9e7e145': ['1858', '1858'], 'qr18ae0099b6afb48a78d466e5ed6871bec': ['1876', '1881'], 'qr183daee61fb98489ebd05556968027a0d': ['1921', '1921'], 'qr1e0ee6ec37dbd4e799905db721592ba48': ['1914', '1914'], 'qr148505eca183c4fb38f844c35130b92f0': ['1896', '1896']}
}) 

                                     number                street          date
qr152f8de48daa64cf098f44fb3d9e7e145    123            Montmartre  [1858, 1858]
qr18ae0099b6afb48a78d466e5ed6871bec     18            Montmartre  [1876, 1881]
qr183daee61fb98489ebd05556968027a0d     18          pass. Brunoy  [1921, 1921]
qr1e0ee6ec37dbd4e799905db721592ba48     33  quai de la Tournelle  [1914, 1914]
qr148505eca183c4fb38f844c35130b92f0      4                Martel  [1896, 1896]

my solution

my solution relies on using a function (foreignkey()) called inside a df.apply(). however, df.apply() is supposed to be slower than other methods, and my solution comes dangerously close to doing a double loop, on df1 and df2


def foreignkey(ro: pd.core.series.Series) -> pd.core.series.Series:
    """
    replace the address in `ro` of `df1` by a foreign key 
    pointing to `df2`. the key is inserted in `ro.name`
    """
    ro.street = df2.loc[
        ( df2.street == ro.street )    # address has the same street full name
        & ( df2.number == ro.number )  # address has the same street number
    ].index[0]
    return ro

df1 = df1.progress_apply( lambda x: foreignkey(x), axis=1 )

答案1

得分: 1

您描述的是一个经典的merge操作(在SQL世界中称为JOIN)。

尝试以下代码以查看是否加速了您的代码:

tmp = (
    df1.reset_index(names="df1_id")
    .merge(
        df2[["street", "number"]].reset_index(names="df2_id"),
        how="left",
        on=["street", "number"],
    )
    .drop_duplicates("df1_id")
    .set_index("df1_id")
)
df1["street_id"] = tmp["df2_id"]

drop_duplicates用于在df1中的一行匹配多行df2时仅保留第一个匹配项。

英文:

What you described is a classic merge operation (it's called JOIN in the SQL world).

Try this to see if it speeds up your code:

tmp = (
    df1.reset_index(names="df1_id")
    .merge(
        df2[["street", "number"]].reset_index(names="df2_id"),
        how="left",
        on=["street", "number"],
    )
    .drop_duplicates("df1_id")
    .set_index("df1_id")
)
df1["street_id"] = tmp["df2_id"]

The drop_duplicates is there to keep only the first match when one row in df1 matches multiple rows in df2.

huangapple
  • 本文由 发表于 2023年7月7日 01:08:53
  • 转载请务必保留本文链接:https://go.coder-hub.com/76631115.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定