Faster way to search column pairs in another dataframe

Question
I have a big dataframe called df, around 45 million rows, like below.
  gene1    gene2     score
0  PIGA  ATF7IP1 -0.047236
1  PIGB  ATF7IP2 -0.047236
2  PIGC  ATF7IP3 -0.047236
3  PIGD  ATF7IP4 -0.047236
4  PIGE  ATF7IP5 -0.047236
I also have a small dataframe called terms, around 3k rows.
id gene_set
1 {HDAC4, BCL6}
2 {HDAC5, BCL6}
3 {HDAC7, BCL6}
4 {NCOA3, KAT2B, EP300, CREBBP}
5 {NCAPD2, NCAPH, NCAPG, SMC4, SMC2}
...
2912 {FOXO1, ESR1}
2913 {APP, FOXO3}
2914 {APP, FOXO1}
2915 {APP, FOXO4}
2916 {MAP3K20, MAPK14, AKAP13, MAP2K3, PKN1}
For each row, I check for the presence of the gene1, gene2 pair in the terms dataset.
My code works fine, but I would like to ask whether there is a faster approach. I have tried a few variants, but the runtime is approximately the same.
def search(g1, g2):
    # Search for the gene pair in the GO terms
    return sum(terms.gene_set.map(set([g1, g2]).issubset))
Code example 1:
np.sum(np.vectorize(search)(df.gene1, df.gene2))
Code example 2:
[search(g1, g2) for g1, g2 in zip(df.gene1, df.gene2)]
Code example 3:
df[['gene1', 'gene2']].apply(lambda x: search(x.gene1, x.gene2), axis=1)
Answer 1
Score: 1
I see a clear way to speed up your search function a bit: change set([g1, g2]) to the set literal {g1, g2}. You avoid quite a bit of work, since there is no longer a list to build and then convert to a set.
Here is a simple test case:
In [4]: %timeit set([1,2]).issubset({1,2,3})
184 ns ± 2.08 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [5]: %timeit {1,2}.issubset({1,2,3})
120 ns ± 0.607 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
You can also pass the terms dataframe into the function, so that the interpreter doesn't have to look it up outside the function scope, and you can use the pd.Series.sum() method instead of the built-in Python sum() (this should also speed things up).
def search(g1, g2, terms):
    # Search for the gene pair in the GO terms
    return terms.gene_set.map({g1, g2}.issubset).sum()
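For instance, the revised function can be driven with the same zip loop as code example 2 in the question (a sketch, assuming the df and terms frames defined above):

# Total number of (pair, term) hits across all rows; df and terms as in the question
total = sum(search(g1, g2, terms) for g1, g2 in zip(df.gene1, df.gene2))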
Aside from that, you could try Polars or Dask in place of the pandas df.apply() approach for more substantial speed-ups.
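For example, here is a minimal Dask sketch of that idea, assuming the df and terms frames from the question; search_pair, the gene_sets list, and the partition count are illustrative choices, not part of the original answer:

import dask.dataframe as dd

def search_pair(g1, g2, gene_sets):
    # gene_sets is a plain Python list of sets taken from terms.gene_set
    pair = {g1, g2}
    return sum(pair.issubset(s) for s in gene_sets)

gene_sets = list(terms.gene_set)

ddf = dd.from_pandas(df, npartitions=16)  # tune roughly to your core count
counts = ddf.map_partitions(
    lambda part: part.apply(
        lambda row: search_pair(row.gene1, row.gene2, gene_sets), axis=1
    ),
    meta=("counts", "int64"),
)
total = counts.sum().compute()  # evaluates the partitions in parallel

Keeping gene_sets as a plain list rather than a pandas Series keeps the inner membership test cheap and ships easily to each worker.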
Answer 2
Score: 1
What if you .explode the smaller terms dataframe to get rid of the sets?
long_terms = terms.explode('gene_set')
>>> long_terms
id gene_set
0 1 BCL6
0 1 HDAC4
1 2 BCL6
1 2 HDAC5
2 3 HDAC7
2 3 BCL6
3 4 KAT2B
3 4 CREBBP
3 4 NCOA3
3 4 EP300
4 5 SMC2
4 5 NCAPH
4 5 SMC4
4 5 NCAPD2
4 5 NCAPG
5 6 ATF7IP2
5 6 A
6 7 PIGB
6 7 B
7 8 PIGB
7 8 C
7 8 ATF7IP2
You can then use .isin:
same_row = (
long_terms.gene_set.isin(df.gene1).groupby(level=0).any()
&
long_terms.gene_set.isin(df.gene2).groupby(level=0).any()
)
found = long_terms.loc[same_row]
>>> found
id gene_set
7 8 PIGB
7 8 C
7 8 ATF7IP2
To find the corresponding match:
>>> df.gene1.isin(found.gene_set) & df.gene2.isin(found.gene_set)
0 False
1 True
2 False
3 False
4 False
dtype: bool
>>> df[df.gene1.isin(found.gene_set) & df.gene2.isin(found.gene_set)]
gene1 gene2 score
1 PIGB ATF7IP2 -0.047236
Sample used:
>>> df
gene1 gene2 score
0 PIGA ATF7IP1 -0.047236
1 PIGB ATF7IP2 -0.047236
2 PIGC ATF7IP3 -0.047236
3 PIGD ATF7IP4 -0.047236
4 PIGE ATF7IP5 -0.047236
>>> terms
id gene_set
0 1 {BCL6, HDAC4}
1 2 {BCL6, HDAC5}
2 3 {HDAC7, BCL6}
3 4 {KAT2B, CREBBP, NCOA3, EP300}
4 5 {SMC2, NCAPH, SMC4, NCAPD2, NCAPG}
5 6 {ATF7IP2, A}
6 7 {PIGB, B}
7 8 {PIGB, C, ATF7IP2}
terms = pd.DataFrame({
"id": [1, 2, 3, 4, 5, 6, 7, 8],
"gene_set": [
{"HDAC4", "BCL6"},
{"HDAC5", "BCL6"},
{"HDAC7", "BCL6"},
{"NCOA3", "KAT2B", "EP300", "CREBBP"},
{"NCAPD2", "NCAPH", "NCAPG", "SMC4", "SMC2"},
{"A", "ATF7IP2"},
{"B", "PIGB"},
{"C", "ATF7IP2", "PIGB"},
]
})