如何使用Golang处理非常长的字符串以避免内存溢出。

huangapple go评论85阅读模式
英文:

how to manipulate very long string to avoid out of memory with golang

问题

我正在尝试提高个人技能,解决HackerRank挑战:

有一个由小写英文字母组成的字符串s,该字符串被无限重复。给定一个整数n,找到并打印出无限字符串的前n个字母中字母'a'的数量。

1<=s<=100 && 1<=n<=10^12

我天真地认为以下代码是可以的:

fs := strings.Repeat(s, int(n)) // 完整字符串
ss := fs[:n]                    // 子字符串
fmt.Println(strings.Count(ss, "a"))

显然,我耗尽了内存并得到了一个"out of memory"的错误。

我从未遇到过这种问题,对如何处理它一无所知。
如何操作非常长的字符串以避免"out of memory"错误?

英文:

I trying for personal skills improvement to solve the hacker rank challenge:

There is a string, s, of lowercase English letters that is repeated infinitely many times. Given an integer, n, find and print the number of letter a's in the first n letters of the infinite string.

1<=s<=100 && 1<=n<=10^12

Very naively I though this code will be fine:

fs := strings.Repeat(s, int(n)) // full string
ss := fs[:n]                    // sub string
fmt.Println(strings.Count(ss, &quot;a&quot;))

Obviously I explode the memory and got an: "out of memory".

I never faced this kind of issue, and I'm clueless on how to handle it.
How can I manipulate very long string to avoid out of memory ?

答案1

得分: 2

我希望这可以帮到你,你不需要通过遍历字符串来进行实际计数。
那是一种朴素的方法。你需要使用一些基本的算术来得到答案,而不会耗尽内存,希望注释有所帮助。

var answer int64

// 首先确定字符串 s 中有多少个字符"a"。
aCount := int64(strings.Count(s, "a"))

// 如果 s 必须达到长度 n,它将完整重复多少次
repeats := n / int64(len(s))
remainder := n % int64(len(s))

// 如果 n/len(s) 不能完全整除,意味着必须有一个余数,检查是否是这种情况。
// 如果 s 的长度为 5,n 的值为 22,那么 s 的前两个字符将多重复一次。
if remainder > 0{
aCountInRemainder := strings.Count(s[:remainder], "a")
answer = int64((aCount * repeats) + int64(aCountInRemainder))
} else{ 
answer = int64((aCount * repeats))
}
 
return answer

可能还有其他方法,但这是我想到的方法。

英文:

I hope this helps, you don't have to actually count by running through the string.
That is the naive approach. You need to use some basic arithmetic to get the answer without running out of memory, I hope the comments help.

var answer int64

// 1st figure out how many a&#39;s are present in s.
aCount := int64(strings.Count(s, &quot;a&quot;))

// How many times will s repeat in its entirety if it had to be of length n
repeats := n / int64(len(s))
remainder := n % int64(len(s))

// If n/len(s) is not perfectly divisible, it means there has to be a remainder, check if that&#39;s the case.
// If s is of length 5 and the value of n = 22, then the first 2 characters of s would repeat an extra time.
if remainder &gt; 0{
aCountInRemainder := strings.Count(s[:remainder], &quot;a&quot;)
answer = int64((aCount * repeats) + int64(aCountInRemainder))
} else{ 
answer = int64((aCount * repeats))
}
 
return answer

There might be other methods but this is what came to my mind.

答案2

得分: 1

正如你所发现的,如果实际生成字符串,你最终会在内存中拥有一个巨大的内存块。

表示“一系列传入字节”的一种常见方法是将其实现为io.Reader(可以将其视为字节流),然后让代码运行r.Read(buff)循环。


根据你提到的练习的具体情况(一个重复出现的固定字符串n次),特定字母的出现次数也可以直接从s中该字母的出现次数计算出来,再加上一些其他操作(我会让你自己弄清楚需要进行哪些乘法和计数)。


如何实现一个重复字符串的Reader,而不需要分配10^12倍的字符串内存?

请注意,在实现.Read()方法时,调用者已经分配了自己的缓冲区。你不需要在内存中重复字符串,只需要用正确的值填充缓冲区,例如通过将数据逐字节复制到缓冲区中。

以下是一种实现方式:

type RepeatReader struct {
	str   string
	count int
}

func (r *RepeatReader) Read(p []byte) (int, error) {
	if r.count == 0 {
		return 0, io.EOF
	}

    // 在每次迭代中,pos将保存到目前为止已复制的字节数
	var pos = 0
	for r.count > 0 && pos < len(p) {
        // 要复制切片,可以使用内置的'copy'方法
        // 在每次迭代中,你需要在已经复制的字节之后写入字节,
        // 因此使用'p[pos:]'
		n := copy(p[pos:], r.str)
        // 更新已复制字节的数量
		pos += n

		// 对于这个第一个示例,计算方式不正确:
		// 即使只有部分复制了str,我也会减少一个完整的count
		r.count--
	}

	return pos, nil
}

https://go.dev/play/p/QyFQ-3NzUDV

要完整、正确地实现,你还需要跟踪下次调用.Read()时需要从哪个偏移量开始:

type RepeatReader struct {
	str    string
	count  int
	offset int
}

func (r *RepeatReader) Read(p []byte) (int, error) {
	if r.count == 0 {
		return 0, io.EOF
	}

	var pos = 0
	for r.count > 0 && pos < len(p) {
        // 在复制到p时,你应该从r.offset开始:
		n := copy(p[pos:], r.str[r.offset:])
		pos += n

        // 更新r.offset:
		r.offset += n
		// 如果已经发出了一个完整的str的副本,减少'count'并将'offset'重置为0
		if r.offset == len(r.str) {
			r.count--
			r.offset = 0
		}
	}

	return pos, nil
}

https://go.dev/play/p/YapRuioQcOz


现在你可以在遍历这个Reader时计算出a的数量。

英文:

As you found out, if you actually generate the string you will end up having that huge memory block in RAM.

One common way to represent a "big sequence of incoming bytes" is to implement it as an io.Reader (which you can view as a stream of bytes), and have your code run a r.Read(buff) loop.


Given the specifics of the exercise you mention (a fixed string repeated n times), the number of occurrence of a specific letter can also be computed straight from the number of occurences of that letter in s, plus something more (I'll let you figure out what multiplications and counting should be done).


How to implement a Reader that repeats the string without allocating 10^12 times the string ?

Note that, when implementing the .Read() method, the caller has already allocated his buffer. You don't need to repeat your string in memory, you just need to fill the buffer with the correct values -- for example by copying byte by byte your data into the buffer.

Here is one way to do it :

type RepeatReader struct {
	str   string
	count int
}

func (r *RepeatReader) Read(p []byte) (int, error) {
	if r.count == 0 {
		return 0, io.EOF
	}

    // at each iteration, pos will hold the number of bytes copied so far
	var pos = 0
	for r.count &gt; 0 &amp;&amp; pos &lt; len(p) {
        // to copy slices over, you can use the built-in &#39;copy&#39; method
        // at each iteration, you need to write bytes *after* the ones you have already copied,
        // hence the &quot;p[pos:]&quot;
		n := copy(p[pos:], r.str)
        // update the amount of copied bytes
		pos += n

		// bad computation for this first example :
		// I decrement one complete count, even if str was only partially copied
		r.count--
	}

	return pos, nil
}

https://go.dev/play/p/QyFQ-3NzUDV

To have a complete, correct implementation, you also need to keep track of the offset you need to start from next time .Read() is called :

type RepeatReader struct {
	str    string
	count  int
	offset int
}

func (r *RepeatReader) Read(p []byte) (int, error) {
	if r.count == 0 {
		return 0, io.EOF
	}

	var pos = 0
	for r.count &gt; 0 &amp;&amp; pos &lt; len(p) {
        // when copying over to p, you should start at r.offset :
		n := copy(p[pos:], r.str[r.offset:])
		pos += n

        // update r.offset :
		r.offset += n
		// if one full copy of str has been issued, decrement &#39;count&#39; and reset &#39;offset&#39; to 0
		if r.offset == len(r.str) {
			r.count--
			r.offset = 0
		}
	}

	return pos, nil
}

https://go.dev/play/p/YapRuioQcOz


You can now count the as while iterating through this Reader.

huangapple
  • 本文由 发表于 2022年6月7日 12:51:57
  • 转载请务必保留本文链接:https://go.coder-hub.com/72526073.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定