Is there a recommended algorithm for compressing multiple specific substrings in a string that resemble DNA?

Question

I am currently in search of a solution that would help me minimize the storage space occupied by specific sets of strings. These individual strings are essentially parts of larger strings.
For example, consider the following strings:

b bc b bcd b b bb abc 

These strings are substrings of the larger strings below:

bcde
bbde
abce

I am seeking a solution to encode these specific strings in a manner that would consume minimal memory resources.

Answer 1

Score: 1

The data structure you are looking for is a trie (prefix tree). Based on the strings you provided:

b bc b bcd b b bb abc

The outputs should be:

bcd
bb
abc

A very naive implementation of the trie data structure looks like this:

class Tree:
    def __init__(self):
        # each node is a dict mapping a letter to its child node
        self.firstletter = {}

    def insert(self, word):
        # walk down the nested dicts, creating missing nodes along the way
        current = self.firstletter
        for letter in word:
            current.setdefault(letter, {})
            current = current[letter]

newtree = Tree()
instr = ['b', 'bc', 'b', 'bcd', 'b', 'b', 'bb', 'abc']
for word in instr:
    newtree.insert(word)

And you can get all the 'words' out with a depth-first search:

def get_words(trie, strname):
    # a node with no children ends a word: print the accumulated letters
    if not trie:
        print(strname)
        return
    for letter in trie:
        get_words(trie[letter], strname + letter)

for letter, subtrie in newtree.firstletter.items():
    get_words(subtrie, letter)

which gives you the outputs I listed above.

A well-implemented trie will compress the data further and make searches faster, and mature implementations exist in many languages. Depending on the task, you may also be interested in prefix/suffix arrays and FM-indexes.
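To illustrate what that further compression can look like, here is a hypothetical helper (not part of the original answer) that merges chains of single-child nodes in the naive trie above into multi-character edge labels, i.e. the path compression a radix trie performs. Since the naive trie has no end-of-word markers, words that are prefixes of other words (such as 'bc') stay implicit:

def compress(trie):
    # merge runs of single-child nodes into one multi-character edge
    out = {}
    for label, child in trie.items():
        while len(child) == 1:
            # exactly one child: absorb its label into the current edge
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = compress(child)
    return out

print(compress(newtree.firstletter))
# {'b': {'cd': {}, 'b': {}}, 'abc': {}}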

Answer 2

Score: 0

I'm not sure if this is what you're looking for, but assuming there are several duplicate substrings, you could keep track of their counts in a dictionary.

ss = ['b', 'bc', 'b', 'bcd', 'b', 'b', 'bb', 'abc']
# count the occurrences of each unique substring
substrings = {k: ss.count(k) for k in set(ss)}
print(substrings)

would give you:

{'bb': 1, 'bc': 1, 'bcd': 1, 'b': 4, 'abc': 1}
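As an aside, the standard library's collections.Counter builds the same tally in a single pass over the list, whereas the dict comprehension above rescans the list once per unique substring:

from collections import Counter

ss = ['b', 'bc', 'b', 'bcd', 'b', 'b', 'bb', 'abc']
substrings = Counter(ss)  # counts every element in one pass
print(substrings)
# Counter({'b': 4, 'bc': 1, 'bcd': 1, 'bb': 1, 'abc': 1})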

Answer 3

Score: 0

You could have a look at the Python documentation on the subject.

Most of the modules have something like a .compress() method that returns a bytes object, plus a matching .decompress(). They seem to be really easy to use.

So, for example, if I decide to give lzma compression a go, I could try something like this:

from sys import getsizeof
import lzma

dna_string = ("gggtaatggttgctatcccagtaatacccaatgattcaagctgatgagtgaaatgggggg"
              "tttcaacacactggtagctgagatgaagtcattaggaaatttaatggtggataaccgagc"
              "ccatgcgatgtggatcaattggctgtacggaggtagtttattcctgcggtctagacgatc"
              "tatgtacttcgaagctaagtaacgctgaattcgcttcaaatcgttccctaacgcgtagat"
              "catgttgggacgttgtttcagcgccgggttcgagaaaatgaggatacatgattcccgtcc"
              "acgtctacgtacacttagtccgacagagactgtgtgtacgtcgactgcgaacgtggattc"
              "ggtgtaaaataggctattgtgtcttagaacacgaaaaagaagagggttctggtcggcgtc"
              "tagccgctgctctgctggcgtgtgtctctggctattgaaactgactactgccgttaacct"
              "agtgattaccgtatttaaaagcctgcggccttgatacgaactatatatgggatttaggca"
              "gtcccccagaaatcagttctactcacccggaaactgagtattcctcgccaccccttattc"
              "ctcagactagaaaagatgctgggggcgtgctgcatagcgaaattcaggtataatgcaggc"
              "acctgtgatatcccggcattcctaggcttaagaccgttataatcaaatgcatcttcccct"
              "gagtagggagaccccaactcggtccggactacgagctatcgtattatggtaccttattat"
              "gggcatcgggcattcttcggtgcttaacgccacgggaagtgaagaggtcgcgacaggaac"
              "ctattattgatgaattgagtgaattgatttgctcgatgcatagggctctggggtcataac"
              "aacaaagtatgtcggttatcaaacccaatgtagtgctgtcagttgcacatggccgagtgg"
)
print(getsizeof(dna_string))  # in-memory size of the str object, per sys.getsizeof

compressed_dna_string = lzma.compress(dna_string.encode("ascii"))
print(getsizeof(compressed_dna_string))  # in-memory size of the compressed bytes object

decompressed_dna_string = lzma.decompress(compressed_dna_string).decode("ascii")
# prints the message only when the round trip reproduces the original
print((decompressed_dna_string == dna_string) * "Decompressed result matches original")

and it prints:

1009
485
Decompressed result matches original

So it works and it compresses the data a bit. I suspect the compression ratio could improve with bigger amounts of data.

It shouldn't be too difficult to test what works best for your actual data by trying different algorithms out.
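For instance, a minimal comparison sketch (reusing dna_string from the snippet above; exact byte counts will vary with your data and the compression settings) could look like this:

import bz2
import lzma
import zlib

data = dna_string.encode("ascii")
for name, module in [("zlib", zlib), ("bz2", bz2), ("lzma", lzma)]:
    # each of these stdlib modules exposes a one-shot compress() returning bytes
    compressed = module.compress(data)
    print(name, len(data), "->", len(compressed), "bytes")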
