更快的方法来搜索另一个数据框中的列对。

huangapple go评论72阅读模式
英文:

faster way to search column pairs in another dataframe

问题

I have a big dataframe called df around 45 million rows like below.

0       PIGA  ATF7IP1 -0.047236
1       PIGB  ATF7IP2 -0.047236
2       PIGC  ATF7IP3 -0.047236
3       PIGD  ATF7IP4 -0.047236
4       PIGE  ATF7IP5 -0.047236

and I have a small dataframe called terms, size is around 3k rows.

id                                gene_set
1                                 {HDAC4, BCL6}
2                                 {HDAC5, BCL6}
3                                 {HDAC7, BCL6}
4                 {NCOA3, KAT2B, EP300, CREBBP}
5            {NCAPD2, NCAPH, NCAPG, SMC4, SMC2}
...
2912                              {FOXO1, ESR1}
2913                               {APP, FOXO3}
2914                               {APP, FOXO1}
2915                               {APP, FOXO4}
2916    {MAP3K20, MAPK14, AKAP13, MAP2K3, PKN1}

For each row, I check the presence of gene1, gene2 pairs in the terms dataset.

My code works fine, but I would like to ask if there is any faster idea for that. I have tried a couple of codes, but the runtime is approximately the same.

def search(g1, g2):
    # Search for the gene pair in the go terms
    return sum(terms.gene_set.map(set([g1, g2]).issubset))

Code example 1:

np.sum(np.vectorize(search)(df.gene1, df.gene2))

Code example 2:

[search(g1, g2) for g1, g2 in zip(df.gene1, df.gene2)]

Code example 3:

df[['gene1', 'gene2']].apply(lambda x: search(x.gene1, x.gene2), axis=1)

Link to the dataframe

英文:

I have a big dataframe called df around 45 million row like below. download

        gene1   gene2     score
0       PIGA  ATF7IP1 -0.047236
1       PIGB  ATF7IP2 -0.047236
2       PIGC  ATF7IP3 -0.047236
3       PIGD  ATF7IP4 -0.047236
4       PIGE  ATF7IP5 -0.047236

and I have a small dataframe called terms, size is around 3k row.

id                                gene_set
1                                 {HDAC4, BCL6}
2                                 {HDAC5, BCL6}
3                                 {HDAC7, BCL6}
4                 {NCOA3, KAT2B, EP300, CREBBP}
5            {NCAPD2, NCAPH, NCAPG, SMC4, SMC2}
                         ...                   
2912                              {FOXO1, ESR1}
2913                               {APP, FOXO3}
2914                               {APP, FOXO1}
2915                               {APP, FOXO4}
2916    {MAP3K20, MAPK14, AKAP13, MAP2K3, PKN1}

and for each row I check the presence of gene1,gene2 pairs in the terms dataset.

my code works fine, but I would like to ask is there any faster idea to that?
I have tried couple of codes but the run time approximately is the same.

def search(g1,g2):
    # search gene pair in the go terms
    return sum(terms.gene_set.map(set([g1,g2]).issubset))

code example 1

np.sum(np.vectorize(search)(df.gene1,df.gene2))

code example 2

[search(g1, g2) for g1, g2 in zip(df.gene1,df.gene2)] 

code example 3

df[['gene1','gene2']].apply(lambda x: search(x.gene1,x.gene2), axis=1 )

答案1

得分: 1

我看到一种明显的方法可以稍微加快您的search函数,方法是将set([g1,g2])修改为{g1,g2}.issubset。这样做可以避免将列表转换为集合的工作。

这里是一个简单的测试用例

In [4]: %timeit set([1,2]).issubset({1,2,3})

184 ns ± 2.08 ns 每循环一次(均值 ± 7 次运行的标准偏差,每次循环 10,000,000)

In [5]: %timeit {1,2}.issubset({1,2,3})
120 ns ± 0.607 ns 每循环一次(均值 ± 7 次运行的标准偏差,每次循环 10,000,000)

您也可以将terms数据框传递到函数中,这样解释器就不需要在函数范围外查找它,也许您可以使用pd.Series.sum()函数而不是调用Python的函数(这也可能加快速度)。

def search(g1, g2, terms):
    # 在go terms中搜索基因对
    return terms.gene_set.map({g1,g2}.issubset).sum()

除此之外,您可以尝试使用df.apply()方法,使用PolarsDask进行更大幅度的加速。

英文:

I see a clear way to speedup a bit your search function by modifying set([g1,g2]) into {g1,g2}.issubset. You avoid quite a bit of work since there is no need to convert the list to a set anymore.

Here a simple test case

In [4]: %timeit set([1,2]).issubset({1,2,3})

184 ns ± 2.08 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: %timeit {1,2}.issubset({1,2,3})
120 ns ± 0.607 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

You can also pass the terms dataframe in the function, so that the interpreter doesn't need to look for it out of the function scope, and maybe you can use the pd.Series.sum() function instead of calling the Python one (this also should speed things up).

def search(g1, g2, terms):
    # search gene pair in the go terms
    return terms.gene_set.map({g1,g2}.issubset).sum()

Aside that, you could try the df.apply() method using Polars or Dask for more substantial speed-ups.

答案2

得分: 1

如果你使用.explode函数来展开较小的术语数据框,以消除集合。

long_terms = terms.explode('gene_set')

你可以使用.isin函数。

same_row = (
   long_terms.gene_set.isin(df.gene1).groupby(level=0).any()
   &
   long_terms.gene_set.isin(df.gene2).groupby(level=0).any()
)

found = long_terms.loc[same_row]

要找到相应的匹配:

df.gene1.isin(found.gene_set) & df.gene2.isin(found.gene_set)

用于样本的数据:

df
terms
terms = pd.DataFrame({
   "id": [1, 2, 3, 4, 5, 6, 7, 8], 
   "gene_set": [
      {"HDAC4", "BCL6"},
      {"HDAC5", "BCL6"},
      {"HDAC7", "BCL6"},
      {"NCOA3", "KAT2B", "EP300", "CREBBP"},
      {"NCAPD2", "NCAPH", "NCAPG", "SMC4", "SMC2"},
      {"A", "ATF7IP2"},
      {"B", "PIGB"},
      {"C", "ATF7IP2", "PIGB"},
   ]
})
英文:

What if you .explode the smaller terms dataframe to get rid of the sets.

long_terms = terms.explode('gene_set')
>>> long_terms
   id gene_set
0   1     BCL6
0   1    HDAC4
1   2     BCL6
1   2    HDAC5
2   3    HDAC7
2   3     BCL6
3   4    KAT2B
3   4   CREBBP
3   4    NCOA3
3   4    EP300
4   5     SMC2
4   5    NCAPH
4   5     SMC4
4   5   NCAPD2
4   5    NCAPG
5   6  ATF7IP2
5   6        A
6   7     PIGB
6   7        B
7   8     PIGB
7   8        C
7   8  ATF7IP2

You can then use .isin

same_row = (
   long_terms.gene_set.isin(df.gene1).groupby(level=0).any()
   &
   long_terms.gene_set.isin(df.gene2).groupby(level=0).any()
)

found = long_terms.loc[same_row]
>>> found
   id gene_set
7   8     PIGB
7   8        C
7   8  ATF7IP2

To find the corresponding match:

>>> df.gene1.isin(found.gene_set) & df.gene2.isin(found.gene_set)
0    False
1     True
2    False
3    False
4    False
dtype: bool
>>> df[df.gene1.isin(found.gene_set) & df.gene2.isin(found.gene_set)]
  gene1    gene2     score
1  PIGB  ATF7IP2 -0.047236

Sample used:

>>> df
  gene1    gene2     score
0  PIGA  ATF7IP1 -0.047236
1  PIGB  ATF7IP2 -0.047236
2  PIGC  ATF7IP3 -0.047236
3  PIGD  ATF7IP4 -0.047236
4  PIGE  ATF7IP5 -0.047236
>>> terms
   id                            gene_set
0   1                       {BCL6, HDAC4}
1   2                       {BCL6, HDAC5}
2   3                       {HDAC7, BCL6}
3   4       {KAT2B, CREBBP, NCOA3, EP300}
4   5  {SMC2, NCAPH, SMC4, NCAPD2, NCAPG}
5   6                        {ATF7IP2, A}
6   7                           {PIGB, B}
7   8                  {PIGB, C, ATF7IP2}
terms = pd.DataFrame({
   "id": [1, 2, 3, 4, 5, 6, 7, 8], 
   "gene_set": [
      {"HDAC4", "BCL6"},
      {"HDAC5", "BCL6"},
      {"HDAC7", "BCL6"},
      {"NCOA3", "KAT2B", "EP300", "CREBBP"},
      {"NCAPD2", "NCAPH", "NCAPG", "SMC4", "SMC2"},
      {"A", "ATF7IP2"},
      {"B", "PIGB"},
      {"C", "ATF7IP2", "PIGB"},
   ]
})

huangapple
  • 本文由 发表于 2023年4月6日 23:35:58
  • 转载请务必保留本文链接:https://go.coder-hub.com/75951304.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定