2023-04-06 23:35:58

faster way to search column pairs in another dataframe

Question

I have a large dataframe called df, with around 45 million rows, like the one below.

        gene1   gene2     score
0       PIGA  ATF7IP1 -0.047236
1       PIGB  ATF7IP2 -0.047236
2       PIGC  ATF7IP3 -0.047236
3       PIGD  ATF7IP4 -0.047236
4       PIGE  ATF7IP5 -0.047236

I also have a small dataframe called terms, with around 3k rows.

id                                gene_set
1                                 {HDAC4, BCL6}
2                                 {HDAC5, BCL6}
3                                 {HDAC7, BCL6}
4                 {NCOA3, KAT2B, EP300, CREBBP}
5            {NCAPD2, NCAPH, NCAPG, SMC4, SMC2}
                         ...                   
2912                              {FOXO1, ESR1}
2913                               {APP, FOXO3}
2914                               {APP, FOXO1}
2915                               {APP, FOXO4}
2916    {MAP3K20, MAPK14, AKAP13, MAP2K3, PKN1}

For each row of df, I check whether the gene1, gene2 pair is present in the terms dataset.

My code works fine, but is there a faster way to do this? I have tried a couple of approaches, but the run time is approximately the same.

def search(g1,g2):
    # search gene pair in the go terms
    return sum(terms.gene_set.map(set([g1,g2]).issubset))

Code example 1:

np.sum(np.vectorize(search)(df.gene1, df.gene2))

Code example 2:

[search(g1, g2) for g1, g2 in zip(df.gene1, df.gene2)]

Code example 3:

df[['gene1', 'gene2']].apply(lambda x: search(x.gene1, x.gene2), axis=1)

Answer 1

Score: 1

I see a clear way to speed up your search function a bit: replace set([g1,g2]) with the set literal {g1,g2}. This avoids building a list and then converting it to a set.

Here is a simple test case:

In [4]: %timeit set([1,2]).issubset({1,2,3})
184 ns ± 2.08 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [5]: %timeit {1,2}.issubset({1,2,3})
120 ns ± 0.607 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
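
Outside IPython, the same comparison can be sketched with the standard-library timeit module. This is only an illustrative re-run of the test above; the variable names and the loop count are arbitrary choices, and the absolute numbers will differ by machine and Python version.

```python
import timeit

# Same subset test, two ways of building the left-hand set.
LOOPS = 200_000  # arbitrary loop count, chosen only to keep the run short

via_list = timeit.timeit("set([1, 2]).issubset({1, 2, 3})", number=LOOPS)
via_literal = timeit.timeit("{1, 2}.issubset({1, 2, 3})", number=LOOPS)

print(f"set([1, 2]) : {via_list / LOOPS * 1e9:.0f} ns per call")
print(f"{{1, 2}}      : {via_literal / LOOPS * 1e9:.0f} ns per call")
```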

You can also pass the terms dataframe into the function as an argument, so that the name resolves as a local variable instead of a global lookup, and you can use the pd.Series.sum() method instead of Python's built-in sum() (this should also speed things up).

def search(g1, g2, terms):
    # search gene pair in the go terms
    return terms.gene_set.map({g1,g2}.issubset).sum()
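
As a quick usage check, the function can be run against a tiny terms frame. The three gene sets below are made up for illustration and are not the asker's data; the function body is the one given above.

```python
import pandas as pd

# Toy stand-in for the real terms dataframe (illustrative values only).
terms = pd.DataFrame({
    "gene_set": [
        {"HDAC4", "BCL6"},
        {"A", "ATF7IP2"},
        {"C", "ATF7IP2", "PIGB"},
    ]
})

def search(g1, g2, terms):
    # Count the gene sets that contain both g1 and g2.
    return terms.gene_set.map({g1, g2}.issubset).sum()

hits = search("PIGB", "ATF7IP2", terms)    # both genes appear in the third set
misses = search("PIGA", "ATF7IP1", terms)  # neither gene appears anywhere
print(hits, misses)
```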

Aside from that, you could try the df.apply() approach with Polars or Dask for more substantial speed-ups.

Answer 2

Score: 1

What if you .explode the smaller terms dataframe to get rid of the sets?

long_terms = terms.explode('gene_set')

>>> long_terms
   id gene_set
0   1     BCL6
0   1    HDAC4
1   2     BCL6
1   2    HDAC5
2   3    HDAC7
2   3     BCL6
3   4    KAT2B
3   4   CREBBP
3   4    NCOA3
3   4    EP300
4   5     SMC2
4   5    NCAPH
4   5     SMC4
4   5   NCAPD2
4   5    NCAPG
5   6  ATF7IP2
5   6        A
6   7     PIGB
6   7        B
7   8     PIGB
7   8        C
7   8  ATF7IP2

You can then use .isin:

same_row = (
   long_terms.gene_set.isin(df.gene1).groupby(level=0).any()
   &
   long_terms.gene_set.isin(df.gene2).groupby(level=0).any()
)
found = long_terms.loc[same_row]

>>> found
   id gene_set
7   8     PIGB
7   8        C
7   8  ATF7IP2

To find the corresponding match:

>>> df.gene1.isin(found.gene_set) & df.gene2.isin(found.gene_set)
0    False
1     True
2    False
3    False
4    False
dtype: bool

>>> df[df.gene1.isin(found.gene_set) & df.gene2.isin(found.gene_set)]
  gene1    gene2     score
1  PIGB  ATF7IP2 -0.047236

Sample used:

>>> df
  gene1    gene2     score
0  PIGA  ATF7IP1 -0.047236
1  PIGB  ATF7IP2 -0.047236
2  PIGC  ATF7IP3 -0.047236
3  PIGD  ATF7IP4 -0.047236
4  PIGE  ATF7IP5 -0.047236

>>> terms
   id                            gene_set
0   1                       {BCL6, HDAC4}
1   2                       {BCL6, HDAC5}
2   3                       {HDAC7, BCL6}
3   4       {KAT2B, CREBBP, NCOA3, EP300}
4   5  {SMC2, NCAPH, SMC4, NCAPD2, NCAPG}
5   6                        {ATF7IP2, A}
6   7                           {PIGB, B}
7   8                  {PIGB, C, ATF7IP2}

terms = pd.DataFrame({
   "id": [1, 2, 3, 4, 5, 6, 7, 8],
   "gene_set": [
      {"HDAC4", "BCL6"},
      {"HDAC5", "BCL6"},
      {"HDAC7", "BCL6"},
      {"NCOA3", "KAT2B", "EP300", "CREBBP"},
      {"NCAPD2", "NCAPH", "NCAPG", "SMC4", "SMC2"},
      {"A", "ATF7IP2"},
      {"B", "PIGB"},
      {"C", "ATF7IP2", "PIGB"},
   ]
})
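
One caveat with the .isin filter: it tests each column against the pooled genes of every matching term, so on larger data it may flag a row whose two genes come from different gene sets. It also yields a boolean mask rather than the per-row count that search returns. A minimal sketch of an exact per-row count follows. It is not part of either answer; it swaps in a pair lookup table, assumes gene1 and gene2 differ within a row and that the gene sets are small, and the names pair_counts and n_terms are made up for illustration.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Toy data modeled on the sample above (illustrative only).
df = pd.DataFrame({
    "gene1": ["PIGA", "PIGB", "PIGC"],
    "gene2": ["ATF7IP1", "ATF7IP2", "ATF7IP3"],
})
terms = pd.DataFrame({
    "gene_set": [
        {"HDAC4", "BCL6"},
        {"A", "ATF7IP2"},
        {"B", "PIGB"},
        {"C", "ATF7IP2", "PIGB"},
    ]
})

# For every unordered gene pair, count how many gene sets contain both genes.
# frozenset makes (g1, g2) and (g2, g1) the same key.
pair_counts = Counter(
    frozenset(pair)
    for gene_set in terms["gene_set"]
    for pair in combinations(gene_set, 2)
)

# One dictionary lookup per row of df; a Counter returns 0 for unseen pairs.
df["n_terms"] = [
    pair_counts[frozenset((g1, g2))]
    for g1, g2 in zip(df["gene1"], df["gene2"])
]
print(df)
```

The table is built once from the small terms frame, so the work per row of df drops from a scan over all gene sets to a single hash lookup. The table grows quadratically with gene-set size, which is why the small-set assumption matters.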
