July 6, 2014 23:00:06 · go

Why isn't buffer size always an integer multiple of 4096 when reading file line by line?

Question

The sample code is:

// test.go
package main

import (
	"bufio"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		println("Usage:", os.Args[0], "")
		os.Exit(1)
	}
	fileName := os.Args[1]
	fp, err := os.Open(fileName)
	if err != nil {
		println(err.Error())
		os.Exit(2)
	}
	defer fp.Close()
	r := bufio.NewScanner(fp)
	var lines []string
	for r.Scan() {
		lines = append(lines, r.Text())
	}
}

c:\>go build test.go

c:\>test.exe test.txt

I then monitored the process with Process Monitor while it was running. Part of the output is:

test.exe  ReadFile  SUCCESS	     Offset: 4,692,375, Length: 8,056
test.exe  ReadFile  SUCCESS	     Offset: 4,700,431, Length: 7,198
test.exe  ReadFile  SUCCESS	     Offset: 4,707,629, Length: 8,134
test.exe  ReadFile  SUCCESS	     Offset: 4,715,763, Length: 7,361
test.exe  ReadFile  SUCCESS	     Offset: 4,723,124, Length: 8,056
test.exe  ReadFile  SUCCESS	     Offset: 4,731,180, Length: 4,322
test.exe  ReadFile  END OF FILE  Offset: 4,735,502, Length: 8,192

The equivalent Java code is:

//Test.java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class Test {
    public static void main(String[] args) {
        try {
            FileInputStream in = new FileInputStream("test.txt");
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine;
            while ((strLine = br.readLine()) != null) {
                ;
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

c:\>javac Test.java

c:\>java Test

Part of the monitoring output is:

java.exe  ReadFile  SUCCESS	      Offset: 4,694,016, Length: 8,192
java.exe  ReadFile  SUCCESS	      Offset: 4,702,208, Length: 8,192
java.exe  ReadFile  SUCCESS	      Offset: 4,710,400, Length: 8,192
java.exe  ReadFile  SUCCESS	      Offset: 4,718,592, Length: 8,192
java.exe  ReadFile  SUCCESS	      Offset: 4,726,784, Length: 8,192
java.exe  ReadFile  SUCCESS	      Offset: 4,734,976, Length: 526
java.exe  ReadFile  END OF FILE	  Offset: 4,735,502, Length: 8,192

As you can see, the buffer size in Java is 8192 and it reads 8192 bytes each time. Why does the Length in Go change with every read of the file?

I have tried bufio.ReadString('\n') and bufio.ReadBytes('\n'), and both of them have the same problem.

[Update]
I have also tested the sample in C:

//test.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    FILE * fp;
    char * line = NULL;
    size_t len = 0;
    ssize_t read;
    fp = fopen("test.txt", "r");
    if (fp == NULL)
        exit(EXIT_FAILURE);
    while ((read = getline(&line, &len, fp)) != -1) {
        /* read is a ssize_t, so it is printed with %zd rather than %zu */
        printf("Retrieved line of length %zd :\n", read);
    }
    if (line)
        free(line);
    fclose(fp);
    return EXIT_SUCCESS;
}

The output is similar to that of the Java code (the buffer size is 65536 on my system). So why is Go so different here?

Answer 1

Score: 2

Reading the source of bufio.Scanner.Scan shows that while the initial buffer size is 4096, each read asks only for as much "empty" space as is left in the buffer, specifically this part:

n, err := s.r.Read(s.buf[s.end:len(s.buf)])

Performance-wise, I'm almost positive that whatever file system you're using will be smart enough to read ahead and cache the data, so the buffer size shouldn't make much of a difference.

Answer 2

Score: 1

This may be the reason:

In all of the examples you cite, the Scan function output is determined by line-endings.

Go's default scan function splits by line (http://golang.org/pkg/bufio/#Scanner.Scan):

> the default split function breaks the input into lines with line termination stripped

And bufio.ReadString('\n') and bufio.ReadBytes('\n') have the same problem due to the \n character.

Try removing all newlines from your test file and check whether the ReadFile records still show lengths that are not multiples of 4096.

As some have suggested, what you're seeing may actually be due to the IO strategy used by the bufio package.
