Question

I'm writing a small web app in Go that parses a file uploaded by the user. I'd like to auto-detect whether the file is gzipped and create the readers / scanners accordingly. One twist is that I can't read the whole file into memory; I can only operate on the stream. Here's what I've got:

import (
	"bufio"
	"compress/gzip"
	"io"
	"net/http"
	"strings"
)

func scannerFromFile(reader io.Reader) (*bufio.Scanner, error) {
	var scanner *bufio.Scanner
	// Create a bufio.Reader so we can 'peek' at the first few bytes.
	bReader := bufio.NewReader(reader)

	testBytes, err := bReader.Peek(64) // read a few bytes without consuming them
	if err != nil {
		return nil, err
	}
	// Detect if the content is gzipped.
	contentType := http.DetectContentType(testBytes)

	// If we detect gzip, make a gzip reader, then wrap it in a scanner.
	if strings.Contains(contentType, "x-gzip") {
		gzipReader, err := gzip.NewReader(bReader)
		if err != nil {
			return nil, err
		}
		scanner = bufio.NewScanner(gzipReader)
	} else {
		// Not gzipped, just make a scanner based on the reader.
		scanner = bufio.NewScanner(bReader)
	}

	return scanner, nil
}

This works fine for plain text, but gzipped data inflates incorrectly: after a few KB I inevitably get garbled text. Is there a simpler method out there? Any ideas why it decompresses incorrectly after a few thousand lines?

Answer 1

Score: 7

You can detect that a file is gzipped by checking whether its first 2 bytes are 0x1f 0x8b (I found that information here).

In the comments, someone mentioned that you should check these bytes separately: the first one is 0x1f and the second is 0x8b.

testBytes, err := bReader.Peek(2) // read 2 bytes
....
if testBytes[0] == 31 && testBytes[1] == 139 {
    // gzip
} else {
    ...
}

Hope that helps.

Answer 2

Score: 0

Thanks everyone - it turns out that twotwotwo and thundercat were correct: the stream was getting corrupted in a spot unrelated to the code I posted. Weirdly, it seems to be related to writing to the HTTP response while still reading from the request stream. I'm still investigating, but it seems the original question was misguided.


How to read from either gzip or plain text reader in golang?
