Why isn't buffer size always an integer multiple of 4096 when reading file line by line?

Question

The sample code is:

// test.go
package main

import (
	"bufio"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		println("Usage:", os.Args[0], "")
		os.Exit(1)
	}
	fileName := os.Args[1]
	fp, err := os.Open(fileName)
	if err != nil {
		println(err.Error())
		os.Exit(2)
	}
	defer fp.Close()
	// Read the file line by line and keep every line in memory.
	r := bufio.NewScanner(fp)
	var lines []string
	for r.Scan() {
		lines = append(lines, r.Text())
	}
}

c:\>go build test.go

c:\>test.exe test.txt

I then monitored the process with Process Monitor while it ran. Part of the output:

test.exe  ReadFile  SUCCESS	     Offset: 4,692,375, Length: 8,056
test.exe  ReadFile  SUCCESS	     Offset: 4,700,431, Length: 7,198
test.exe  ReadFile  SUCCESS	     Offset: 4,707,629, Length: 8,134
test.exe  ReadFile  SUCCESS	     Offset: 4,715,763, Length: 7,361
test.exe  ReadFile  SUCCESS	     Offset: 4,723,124, Length: 8,056
test.exe  ReadFile  SUCCESS	     Offset: 4,731,180, Length: 4,322
test.exe  ReadFile  END OF FILE  Offset: 4,735,502, Length: 8,192

The equivalent Java code is:

//Test.java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class Test {
    public static void main(String[] args) {
        try {
            FileInputStream in = new FileInputStream("test.txt");
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine;
            // Read every line and discard it; only the I/O pattern matters here.
            while ((strLine = br.readLine()) != null) {
                ;
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

c:\>javac Test.java

c:\>java Test

Part of the monitoring output:

java.exe  ReadFile  SUCCESS	      Offset: 4,694,016, Length: 8,192
java.exe  ReadFile  SUCCESS	      Offset: 4,702,208, Length: 8,192
java.exe  ReadFile  SUCCESS	      Offset: 4,710,400, Length: 8,192
java.exe  ReadFile  SUCCESS	      Offset: 4,718,592, Length: 8,192
java.exe  ReadFile  SUCCESS	      Offset: 4,726,784, Length: 8,192
java.exe  ReadFile  SUCCESS	      Offset: 4,734,976, Length: 526
java.exe  ReadFile  END OF FILE	  Offset: 4,735,502, Length: 8,192

As you can see, the buffer size in Java is 8192 and it reads 8192 bytes each time. Why does the Length in Go change on every read?

I have also tried bufio.Reader's ReadString('\n') and ReadBytes('\n'), and both show the same behavior.

[Update]
I have tested the sample in C:

//test.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        FILE * fp;
        char * line = NULL;
        size_t len = 0;
        ssize_t read;
        fp = fopen("test.txt", "r");
        if (fp == NULL)
                exit(EXIT_FAILURE);
        /* getline returns the line length, or -1 at end of file or on error. */
        while ((read = getline(&line, &len, fp)) != -1) {
                /* read is a ssize_t, so it is printed with %zd rather than %zu. */
                printf("Retrieved line of length %zd :\n", read);
        }
        free(line);
        fclose(fp);
        return EXIT_SUCCESS;
}

The output is similar to that of the Java code (the buffer size is 65536 on my system). So why is Go so different here?

Answer 1

Score: 2

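Reading the source of bufio.Scanner's Scan method shows that, although the buffer starts at 4096 bytes, each read asks only for as many bytes as there is "empty" space left in it. Any unconsumed partial line stays in the buffer, so the request shrinks by that amount. Specifically, this call: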

n, err := s.r.Read(s.buf[s.end:len(s.buf)])

Performance-wise, I'm almost certain that whatever file system you're using is smart enough to read ahead and cache the data, so the buffer size shouldn't make much of a difference.

Answer 2

Score: 1

This may be the reason:

In all of the examples you cite, the Scan function output is determined by line-endings.

Go's default scan function splits by line (http://golang.org/pkg/bufio/#Scanner.Scan):

> the default split function breaks the input into lines with line termination stripped

bufio.Reader's ReadString('\n') and ReadBytes('\n') show the same behavior because they also split on the \n character.

Try removing all newlines from your test file and check whether the ReadFile records still show lengths that are not multiples of 4096.

As some have suggested, what you're seeing may actually be due to the IO strategy used by the bufio package.

Posted on 2014-07-06 23:00:06. Original link: https://go.coder-hub.com/24597157.html