Read random lines off a text file in Go
Question
I am using encoding/csv to read and parse a very large .csv file.
I need to randomly select lines and pass them through some test.
My current solution is to read the whole file like this:
reader := csv.NewReader(file)
lines, err := reader.ReadAll()
and then randomly select lines from lines.
The obvious problem is that reading the whole file takes a long time and needs a lot of memory.
Question:
encoding/csv gives me an io.Reader. Is there a way to use that to read random lines instead of loading the whole thing at once?
This is more a curiosity to learn about io.Reader than a practical question, since in the end it is very likely more efficient to read the file once and access it in memory than to keep seeking random lines on disk.
Answer 1
Score: 4
Apokalyptik's answer is the closest to what you want. Readers are streams, so you can't just hop to a random place (per se).
Naively choosing a probability against which you keep any given line as you read it in can lead to problems: you may reach the end of the file without holding enough lines of input, or you may be too quick to hold lines and not get a good sample. Either is much more likely than guessing correctly, since you don't know beforehand how many lines are in the file (unless you first iterate over it once to count them).
What you really need is reservoir sampling.
Basically, read the file line by line. For each line, you choose whether to hold it like so: the first line you read, you have a 1/1 chance of holding it. After you read the second line, you have a 1/2 chance of replacing what you're holding with this one. After the third line, you have a 1/3 chance of replacing what you're holding with it, which leaves each of the first two lines with a 1/2 * 2/3 = 1/3 chance of still being held. Thus you have a 1/N chance of holding onto any given line, where N is the number of lines you've read in. Here's a more detailed look at the algorithm (don't try to implement it just from what I've told you in this paragraph alone).
Answer 2
Score: 2
The simplest solution would be to decide, as you read each line, whether to test it or throw it away. Make the decision random so that you don't have to keep the entire file in RAM, then pass through the file once, running your tests as you go. You can do the same with non-random selection rules (e.g. after every X bytes or every X lines).
Answer 3
Score: 1
My suggestion would be to randomize the input file in advance, e.g. using shuf (http://en.wikipedia.org/wiki/Shuf).
Then you can simply read the first n lines as needed.
This doesn't help you learn more about io/readers, but it might solve your problem nevertheless.
Answer 4
Score: 1
I had a similar need: to randomly read (specific) lines from a massive text file. I wrote a package that I call ramcsv to do this.
It first reads through the entire file once and marks the byte offset of each line (it stores this information in memory, but does not store the full line).
When you request a line number, it will transparently seek to the correct offset and give you the csv-parsed line.
(Note that the csv.Reader parameter that is passed as the second argument to ramcsv.New is used only to copy the settings into a new reader.) This could no doubt be made more efficient, but it was sufficient for my needs and spared me from reading a ~20GB text file into memory.
Answer 5
Score: 0
encoding/csv does not give you an io.Reader; it gives you a csv.Reader (note the lack of package qualification on the definition of csv.NewReader [1], indicating that the Reader it returns belongs to the same package).
A csv.Reader implements only the methods you see there, so it looks like there is no way to do what you want short of writing your own CSV parser.
[1] http://golang.org/pkg/encoding/csv/#NewReader
Answer 6
Score: 0
Per this SO answer, there's a relatively memory-efficient way to read a single random line from a large file.
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"math/rand"
	"strconv"
	"time"
)

// words stands in for the contents of a large file.
var words []byte

func main() {
	prepareWordsVar()
	r := rand.New(rand.NewSource(time.Now().Unix()))
	var line string
	// Retry when the offset landed in the last line, so nothing follows it.
	for len(line) == 0 {
		line = getRandomLine(r)
	}
	fmt.Println(line)
}

// prepareWordsVar fills words with 2000 newline-terminated sample lines.
func prepareWordsVar() {
	base := []string{"some", "really", "file", "with", "many", "manyy", "manyyy", "manyyyy", "manyyyyy", "lines."}
	// Length 0 plus a capacity hint: make([]byte, n) followed by append
	// would leave n zero bytes in front of the data.
	words = make([]byte, 0, 200*len(base))
	for i := 0; i < 200; i++ {
		for _, s := range base {
			words = append(words, []byte(s+strconv.Itoa(i)+"\n")...)
		}
	}
}

// getRandomLine seeks to a random byte offset and returns the next complete
// line, or "" if there is none. The very first line can never be returned.
func getRandomLine(r *rand.Rand) string {
	wordsLen := int64(len(words))
	offset := r.Int63n(wordsLen)
	rd := bytes.NewReader(words)
	if _, err := rd.Seek(offset, io.SeekStart); err != nil {
		fmt.Printf("err: %s\n", err)
		return ""
	}
	scanner := bufio.NewScanner(rd)
	// discard - bound to be partial line
	if !scanner.Scan() {
		return ""
	}
	scanner.Scan()
	if err := scanner.Err(); err != nil {
		fmt.Printf("err: %s\n", err)
		return ""
	}
	// now we have a random line.
	return scanner.Text()
}
Couple of caveats:
- You should use crypto/rand if you need it to be cryptographically secure.
- Note bufio.Scanner's default MaxScanTokenSize, and adjust code accordingly.
- As per the original SO answer, this does introduce bias based on the length of the line.