Is there a recommended algorithm for compressing multiple specific substrings in a string that resemble DNA?

Question

I am currently in search of a solution that would help me minimize the storage space occupied by specific sets of strings. These individual strings are essentially parts of larger strings.
For example, consider the following strings:

b bc b bcd b b bb abc 

These strings are substrings of the larger strings below:

bcde
bbde
abce

I am seeking a solution to encode these specific strings in a manner that would consume minimal memory resources.

Answer 1

Score: 1

The data structure you are looking for is a trie (prefix tree). Based on the strings you provided:

b bc b bcd b b bb abc

The outputs should be:

bb
abc
bcd

A very naive implementation of the tree data structure looks like this:

class Tree:
    def __init__(self):
        # Each node is a plain dict mapping a character to its child node.
        self.firstletter = {}

    def insert(self, word):
        # Walk down the trie one character at a time, creating nodes as needed.
        current = self.firstletter
        for ch in word:
            current.setdefault(ch, {})
            current = current[ch]

newtree = Tree()
instr = ['b', 'bc', 'b', 'bcd', 'b', 'b', 'bb', 'abc']
for word in instr:
    newtree.insert(word)

And you can get all the 'words' out with a depth-first search:

def get_words(trie, strname):
    # A leaf (empty dict) marks the end of a stored word.
    if not trie:
        print(strname)
        return
    for ch, child in trie.items():
        get_words(child, strname + ch)

for ch, child in newtree.firstletter.items():
    get_words(child, ch)

which gives you the outputs I listed above.

A well-implemented trie will compress the data further and make searches faster; mature trie implementations exist in many languages. Depending on the task, you may also be interested in prefix/suffix arrays and FM-indexes.
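
To give a feel for how such a trie saves space, here is a minimal path-compression sketch (my addition, not part of the original answer): runs of single-child nodes in the naive dict trie above are merged into one edge labelled with a multi-character string, which is the core idea behind radix trees:

def compress(node):
    # Merge chains of single-child nodes into one multi-character edge label.
    compressed = {}
    for label, child in node.items():
        while len(child) == 1:
            # Exactly one child: absorb it into the current edge label.
            (ch, grandchild), = child.items()
            label += ch
            child = grandchild
        compressed[label] = compress(child)
    return compressed

print(compress(newtree.firstletter))
# prints: {'b': {'cd': {}, 'b': {}}, 'abc': {}}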

Answer 2

Score: 0

I'm not sure if this is what you're looking for, but assuming there are several duplicate substrings, you could keep track of their counts in a dictionary.

ss = ['b', 'bc', 'b', 'bcd', 'b', 'b', 'bb', 'abc']
# Count how many times each distinct substring occurs in the list.
substrings = {k: ss.count(k) for k in set(ss)}
print(substrings)

would give you (key order may vary between runs):

{'bb': 1, 'bc': 1, 'bcd': 1, 'b': 4, 'abc': 1}
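
As a side note (my addition, not part of the original answer), collections.Counter builds the same mapping in a single pass, whereas the comprehension above rescans the list once per distinct key:

from collections import Counter

ss = ['b', 'bc', 'b', 'bcd', 'b', 'b', 'bb', 'abc']
# Counter walks the list once instead of calling ss.count() per key.
substrings = dict(Counter(ss))
print(substrings)
# prints: {'b': 4, 'bc': 1, 'bcd': 1, 'bb': 1, 'abc': 1}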

Answer 3

Score: 0

You could have a look at the Python documentation on data compression and archiving.

Most of these modules have a .compress() function that returns a bytes object, and a matching .decompress(). They are really easy to use.

So, for example, if I decide to give lzma compression a go, I could try something like this:

from sys import getsizeof
import lzma

dna_string = ("gggtaatggttgctatcccagtaatacccaatgattcaagctgatgagtgaaatgggggg"
              "tttcaacacactggtagctgagatgaagtcattaggaaatttaatggtggataaccgagc"
              "ccatgcgatgtggatcaattggctgtacggaggtagtttattcctgcggtctagacgatc"
              "tatgtacttcgaagctaagtaacgctgaattcgcttcaaatcgttccctaacgcgtagat"
              "catgttgggacgttgtttcagcgccgggttcgagaaaatgaggatacatgattcccgtcc"
              "acgtctacgtacacttagtccgacagagactgtgtgtacgtcgactgcgaacgtggattc"
              "ggtgtaaaataggctattgtgtcttagaacacgaaaaagaagagggttctggtcggcgtc"
              "tagccgctgctctgctggcgtgtgtctctggctattgaaactgactactgccgttaacct"
              "agtgattaccgtatttaaaagcctgcggccttgatacgaactatatatgggatttaggca"
              "gtcccccagaaatcagttctactcacccggaaactgagtattcctcgccaccccttattc"
              "ctcagactagaaaagatgctgggggcgtgctgcatagcgaaattcaggtataatgcaggc"
              "acctgtgatatcccggcattcctaggcttaagaccgttataatcaaatgcatcttcccct"
              "gagtagggagaccccaactcggtccggactacgagctatcgtattatggtaccttattat"
              "gggcatcgggcattcttcggtgcttaacgccacgggaagtgaagaggtcgcgacaggaac"
              "ctattattgatgaattgagtgaattgatttgctcgatgcatagggctctggggtcataac"
              "aacaaagtatgtcggttatcaaacccaatgtagtgctgtcagttgcacatggccgagtgg"
              )

# Note: getsizeof() includes Python object overhead, not just the payload;
# len() would compare the raw byte counts more directly.
print(getsizeof(dna_string))

compressed_dna_string = lzma.compress(dna_string.encode("ascii"))
print(getsizeof(compressed_dna_string))

# Round-trip and verify that nothing was lost; bool * str repeats the
# message 0 or 1 times, so it only prints on success.
decompressed_dna_string = lzma.decompress(compressed_dna_string).decode("ascii")
print((decompressed_dna_string == dna_string)*"Decompressed result matches original")

and it prints:

1009
485
Decompressed result matches original

So it works, and it compresses the data a bit. I suspect the compression ratio would improve with larger amounts of data.

It shouldn't be too difficult to test what works best for your actual data by trying different algorithms out.
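
For instance, here is a minimal comparison sketch (my addition, not part of the original answer) that runs the same payload through three standard-library codecs; it reuses the dna_string variable defined above:

import bz2
import lzma
import zlib

data = dna_string.encode("ascii")
for name, module in [("zlib", zlib), ("bz2", bz2), ("lzma", lzma)]:
    compressed = module.compress(data)
    # len() compares the raw payloads, without Python object overhead.
    print(f"{name}: {len(data)} -> {len(compressed)} bytes")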
