Is there a way of generating combinations while increasing the values?
Question
I am trying to write a Python program that fetches FASTA sequences for a list of UniProt IDs given by the user. This is the code snippet I use for the first task (this part works):

```python
import urllib.request

def fetch_sequence(uniprot_id):
    # Download the FASTA record and join its sequence lines, skipping the '>' header
    url = f'https://www.uniprot.org/uniprot/{uniprot_id}.fasta'
    response = urllib.request.urlopen(url)
    sequence = ''
    for line in response:
        line = line.decode('utf-8').strip()
        if not line.startswith('>'):
            sequence += line
    return sequence
```

What I am trying to do is generate FASTA files containing different combinations of the fetched protein sequences, separated by a ':', up to a maximum of 5000 amino acids. For example:

```
> 2xprotein1, 3xprotein2, 1xprotein3
sequence1:sequence1:sequence2:sequence2:sequence2:sequence3
```

Important details:

- Each of the proteins in the initial list provided by the user [UniProt1, Uniprot2, etc.] should be present at least once in each of the generated FASTA files.
- If I already have a FASTA file, for example protein1:protein2:protein3, I do not want to generate its permutations, like protein2:protein1:protein3.

I tried using the itertools combinations function, but it does not generate the required combinations because it only creates combinations of the initial list. I wanted to try something like this:

```
# i.e. for 4 proteins
[1, 1, 1, 1]
[2, 1, 1, 1]
[1, 2, 1, 1]
[1, 1, 2, 1]
...
```

By trying all of these count vectors, I was hoping to create every combination. My concern is that if the code stops as soon as it hits the threshold, it will skip combinations that would have come later and still been below the 5000 amino acid limit. Also, this would be a recursive approach, which I expect to slow the code down significantly.
# Answer 1
**Score**: 2
You could write a recursive generator function that progressively builds the sequence counts based on the remaining available size, taking into account the minimum size still needed for the proteins that follow. This only produces combinations that meet your criteria.

```python
def genSequences(proteins,maxSize):
    if not proteins:
        yield []
        return
    # every remaining protein must appear at least once
    minSize = sum(map(len,proteins))
    count   = 0
    while maxSize >= minSize:
        # use one more copy of the first protein, then recurse on the rest
        count   += 1
        maxSize -= len(proteins[0])
        for seq in genSequences(proteins[1:],maxSize):
            yield [count]+seq
```
Output:

```python
maxSize = 30
proteins = ["ABCD","12345","XYZ","MPPPQQQ"]
for sq in genSequences(proteins,maxSize):
    print(sq,"size:",sum(c*len(p) for c,p in zip(sq,proteins)),
          "".join(p*c for c,p in zip(sq,proteins)))
```

```
[1, 1, 1, 1] size: 19 ABCD12345XYZMPPPQQQ
[1, 1, 1, 2] size: 26 ABCD12345XYZMPPPQQQMPPPQQQ
[1, 1, 2, 1] size: 22 ABCD12345XYZXYZMPPPQQQ
[1, 1, 2, 2] size: 29 ABCD12345XYZXYZMPPPQQQMPPPQQQ
[1, 1, 3, 1] size: 25 ABCD12345XYZXYZXYZMPPPQQQ
[1, 1, 4, 1] size: 28 ABCD12345XYZXYZXYZXYZMPPPQQQ
[1, 2, 1, 1] size: 24 ABCD1234512345XYZMPPPQQQ
[1, 2, 2, 1] size: 27 ABCD1234512345XYZXYZMPPPQQQ
[1, 2, 3, 1] size: 30 ABCD1234512345XYZXYZXYZMPPPQQQ
[1, 3, 1, 1] size: 29 ABCD123451234512345XYZMPPPQQQ
[2, 1, 1, 1] size: 23 ABCDABCD12345XYZMPPPQQQ
[2, 1, 1, 2] size: 30 ABCDABCD12345XYZMPPPQQQMPPPQQQ
[2, 1, 2, 1] size: 26 ABCDABCD12345XYZXYZMPPPQQQ
[2, 1, 3, 1] size: 29 ABCDABCD12345XYZXYZXYZMPPPQQQ
[2, 2, 1, 1] size: 28 ABCDABCD1234512345XYZMPPPQQQ
[3, 1, 1, 1] size: 27 ABCDABCDABCD12345XYZMPPPQQQ
[3, 1, 2, 1] size: 30 ABCDABCDABCD12345XYZXYZMPPPQQQ
```
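To get from one of these count vectors to the two-line layout shown in the question, a small formatter could be used. This is only a sketch: `format_record` and its `names` argument are hypothetical, and it assumes the ':' separators do not count toward the amino acid limit.

```python
def format_record(counts, names, proteins):
    """Build a two-line FASTA-style record from one count vector.

    Hypothetical helper: the header lists 'NxName' for each protein and
    the body joins the repeated sequences with ':'.
    """
    header = "> " + ", ".join(f"{c}x{n}" for c, n in zip(counts, names))
    # repeat each sequence c times, keeping the original protein order
    copies = [p for c, p in zip(counts, proteins) for _ in range(c)]
    return header + "\n" + ":".join(copies)


names = ["protein1", "protein2", "protein3"]
proteins = ["ABCD", "12345", "XYZ"]
print(format_record([2, 3, 1], names, proteins))
# > 2xprotein1, 3xprotein2, 1xprotein3
# ABCD:ABCD:12345:12345:12345:XYZ
```

Because the counts always follow the order of the input list, no permutation of the same record is ever produced.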
If you don't need the counts, you could make the generator yield the combined sequences directly:
```python
def genSequences(proteins,maxSize,seq=""):
    if not proteins:
        yield seq
        return
    minSize = sum(map(len,proteins))
    while maxSize >= minSize:
        seq += proteins[0]
        maxSize -= len(proteins[0])
        yield from genSequences(proteins[1:],maxSize,seq)

...
for sq in genSequences(proteins,maxSize):
    print(len(sq),sq)
```

```
19 ABCD12345XYZMPPPQQQ
26 ABCD12345XYZMPPPQQQMPPPQQQ
22 ABCD12345XYZXYZMPPPQQQ
29 ABCD12345XYZXYZMPPPQQQMPPPQQQ
25 ABCD12345XYZXYZXYZMPPPQQQ
28 ABCD12345XYZXYZXYZXYZMPPPQQQ
24 ABCD1234512345XYZMPPPQQQ
27 ABCD1234512345XYZXYZMPPPQQQ
30 ABCD1234512345XYZXYZXYZMPPPQQQ
29 ABCD123451234512345XYZMPPPQQQ
23 ABCDABCD12345XYZMPPPQQQ
30 ABCDABCD12345XYZMPPPQQQMPPPQQQ
26 ABCDABCD12345XYZXYZMPPPQQQ
29 ABCDABCD12345XYZXYZXYZMPPPQQQ
28 ABCDABCD1234512345XYZMPPPQQQ
27 ABCDABCDABCD12345XYZMPPPQQQ
30 ABCDABCDABCD12345XYZXYZMPPPQQQ
```
If the performance of recursive functions is a concern, you can use an iterative approach for the generator:
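```python
def sequences(proteins,maxSize):
    *sizes, = map(len,proteins)
    counts  = [1]*len(proteins)
    # note: the first tuple is yielded without checking that it fits in maxSize
    yield tuple(counts)
    p    = 0
    size = sum(map(len,proteins))
    while p < len(proteins):
        if size+sizes[p] > maxSize:
            # no room for another copy at position p:
            # reset positions 0..p to one copy and carry to the next position
            p += 1
            counts[:p] = [1]*p
            size = sum(c*s for c,s in zip(counts,sizes))
            continue
        counts[p] += 1
        size += sizes[p]
        yield tuple(counts)
        p = 0
```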
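One way to sanity-check either generator on small inputs is to compare it against a plain brute-force enumeration. The sketch below is not part of the answer's method; `brute_force_counts` is a hypothetical helper, and it assumes every sequence length is positive.

```python
import itertools

def brute_force_counts(sizes, max_size):
    """Enumerate every count vector (each count >= 1) whose total size fits."""
    if sum(sizes) > max_size:
        return []
    # Room left once every protein appears once; it bounds the extra copies
    # any single protein can take.
    spare = max_size - sum(sizes)
    ranges = [range(1, 2 + spare // s) for s in sizes]
    return [c for c in itertools.product(*ranges)
            if sum(n * s for n, s in zip(c, sizes)) <= max_size]


sizes = [4, 5, 3, 7]   # lengths of "ABCD", "12345", "XYZ", "MPPPQQQ"
expected = brute_force_counts(sizes, 30)
print(len(expected))   # 17, the same number of lines as in the output listed earlier
```

Sorting `expected` and comparing it with the sorted, tuple-converted output of `genSequences` on the same input is then a quick equality check.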
# Answer 2
**Score**: 1
Edit: Thanks for clarifying! I used <strike>itertools.combinations_with_replacement</strike> itertools.product to generate the number of repetitions of each sequence instead.
```python
import itertools
import urllib.request
import numpy as np

# Specify the UniProt IDs here (protein1 ... protein4 are placeholders)
uids = [protein1, protein2, protein3, protein4]
# Fetch sequences from UniProt
seqs = [fetch_sequence(uid) for uid in uids]
# minimum no. of instances = 1
# maximum no. is infinite, or until total seq length reaches 5000
# If each sequence was represented at least once, the max no. of times is given by (5000/smallest seq length)
smallest_seq_length = min([len(s) for s in seqs])
# In this instance we have 5000 as the sequence length limit. So max times a sequence can be represented is
max_rep = int(np.ceil(5000/smallest_seq_length))
times = list(range(1,max_rep))
times = [times]*len(seqs)
seqs_combinations = []
for c in itertools.product(*times):
    # To deal with time complexity, we can safely eliminate combinations whose sum is greater than max_rep. So,
    if sum(c) <= max_rep:
        temp = []
        for ndx, i in enumerate(c):
            temp += [seqs[ndx]]*i
        seqs_combinations.append(":".join(temp))
# Eliminate sequences above threshold length
# seqs_combinations = ...  (filter expression missing here)
```
`%%timeit` gives the following result for the above code using 4 protein sequences:

```
13.7 s ± 1.16 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
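The expression for the final "eliminate sequences above threshold length" step is cut off in the listing. A minimal sketch of what such a step could look like follows, assuming the ':' separators are not counted and the limit is inclusive; the original may have differed on both points, and `within_limit` is a hypothetical name.

```python
def within_limit(combos, max_residues):
    """Keep only joined strings whose residue count (ignoring ':') fits the limit."""
    return [s for s in combos if len(s) - s.count(":") <= max_residues]


combos = ["THERE:WERE:DAYS", "THERE:WERE:DAYS:DAYS:DAYS:DAYS:DAYS:DAYS"]
print(within_limit(combos, 30))   # keeps only the first entry (13 residues; the second has 33)
```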
Demo with short sequences (threshold length of 30):

```python
seqs = ['THERE','WERE','DAYS']
smallest_seq_length = min([len(s) for s in seqs])
max_rep = int(np.ceil(50/smallest_seq_length))
times = list(range(1,max_rep))
times = [times]*len(seqs)
seqs_combinations = []
for c in itertools.product(*times):
    print(c)
    if sum(c) <= max_rep:
        temp = []
        for ndx, i in enumerate(c):
            temp += [seqs[ndx]]*i
        seqs_combinations.append(":".join(temp))
# seqs_combinations = ...  (filter expression missing here)
```
Here's the result:

```
['THERE:WERE:DAYS',
 'THERE:WERE:DAYS:DAYS',
 'THERE:WERE:DAYS:DAYS:DAYS',
 'THERE:WERE:DAYS:DAYS:DAYS:DAYS',
 'THERE:WERE:DAYS:DAYS:DAYS:DAYS:DAYS',
 'THERE:WERE:WERE:DAYS',
 'THERE:WERE:WERE:DAYS:DAYS',
 'THERE:WERE:WERE:DAYS:DAYS:DAYS',
 'THERE:WERE:WERE:DAYS:DAYS:DAYS:DAYS',
 'THERE:WERE:WERE:WERE:DAYS',
 'THERE:WERE:WERE:WERE:DAYS:DAYS',
 'THERE:WERE:WERE:WERE:DAYS:DAYS:DAYS',
 'THERE:WERE:WERE:WERE:WERE:DAYS',
 'THERE:WERE:WERE:WERE:WERE:DAYS:DAYS',
 'THERE:WERE:WERE:WERE:WERE:WERE:DAYS',
 'THERE:THERE:WERE:DAYS',
 'THERE:THERE:WERE:DAYS:DAYS',
 'THERE:THERE:WERE:DAYS:DAYS:DAYS',
 'THERE:THERE:WERE:WERE:DAYS',
 'THERE:THERE:WERE:WERE:DAYS:DAYS',
 'THERE:THERE:WERE:WERE:WERE:DAYS',
 'THERE:THERE:THERE:WERE:DAYS',
 'THERE:THERE:THERE:WERE:DAYS:DAYS',
 'THERE:THERE:THERE:WERE:WERE:DAYS',
 'THERE:THERE:THERE:THERE:WERE:DAYS']
```
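The run time reported above is plausible given how many tuples `itertools.product` has to walk before any filtering happens. The rough sketch below counts them by mirroring the `max_rep` formula from the listing; `product_candidates` is a hypothetical helper, and the four lengths in the second call are made up for illustration.

```python
import math

def product_candidates(seq_lengths, limit):
    """Number of tuples itertools.product(*times) yields for the setup above."""
    max_rep = math.ceil(limit / min(seq_lengths))
    per_sequence = max(max_rep - 1, 0)      # len(range(1, max_rep))
    return per_sequence ** len(seq_lengths)


print(product_candidates([5, 4, 4], 50))               # 12**3 = 1728 for the demo
print(product_candidates([100, 250, 400, 800], 5000))  # 49**4 = 5764801 (hypothetical lengths)
```

The count grows exponentially with the number of proteins, so this approach is best suited to short lists.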