How to read from either gzip or plain text reader in golang?

Question

I'm writing a small web app in Go that parses a file uploaded by the user. I'd like to auto-detect whether the file is gzipped and create the readers / scanners accordingly. One twist is that I can't read the whole file into memory; I can only operate on the stream. Here's what I've got:

func scannerFromFile(reader io.Reader) (*bufio.Scanner, error) {
	var scanner *bufio.Scanner
	// Create a bufio.Reader so we can "peek" at the first few bytes.
	bReader := bufio.NewReader(reader)

	testBytes, err := bReader.Peek(64) // read a few bytes without consuming them
	if err != nil {
		return nil, err
	}
	// Detect if the content is gzipped.
	contentType := http.DetectContentType(testBytes)

	// If we detect gzip, make a gzip reader, then wrap it in a scanner.
	if strings.Contains(contentType, "x-gzip") {
		gzipReader, err := gzip.NewReader(bReader)
		if err != nil {
			return nil, err
		}

		scanner = bufio.NewScanner(gzipReader)
	} else {
		// Not gzipped, just make a scanner based on the reader.
		scanner = bufio.NewScanner(bReader)
	}

	return scanner, nil
}

This works fine for plain text, but gzipped data inflates incorrectly: after a few KB I inevitably get garbled text. Is there a simpler method out there? Any ideas why it decompresses incorrectly after a few thousand lines?

Answer 1

Score: 7

You can detect that a file is gzipped by checking whether its first 2 bytes are equal to 0x1f8b (I found that information here).

In the comments someone mentioned that you should check these bytes separately: the first one is 0x1f (31 in decimal) and the second is 0x8b (139 in decimal).

testBytes, err := bReader.Peek(2) // peek at the first 2 bytes without consuming them
....
if testBytes[0] == 0x1f && testBytes[1] == 0x8b { // 31 and 139 in decimal
    // gzip
} else {
    ...
}
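To make the idea above concrete, here is a minimal sketch of how the magic-byte check might be combined with the scanner setup from the question. The names newAutoScanner, scanLines and gzipBytes are hypothetical helpers invented for this illustration, not part of the original answer, and treating inputs shorter than two bytes as plain text is an assumption of the sketch.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// newAutoScanner (hypothetical helper) peeks at the first two bytes of r and
// returns a scanner over either the decompressed stream or the raw stream.
func newAutoScanner(r io.Reader) (*bufio.Scanner, error) {
	br := bufio.NewReader(r)

	// Peek does not consume input. If the stream holds fewer than two bytes,
	// Peek returns what it has plus io.EOF; this sketch treats that as plain text.
	magic, err := br.Peek(2)
	if err != nil && err != io.EOF {
		return nil, err
	}

	if len(magic) == 2 && magic[0] == 0x1f && magic[1] == 0x8b {
		// The gzip reader must wrap br (not r), because br already buffered
		// the bytes we peeked at.
		zr, err := gzip.NewReader(br)
		if err != nil {
			return nil, err
		}
		return bufio.NewScanner(zr), nil
	}
	return bufio.NewScanner(br), nil
}

// scanLines (hypothetical helper) drains a scanner built by newAutoScanner.
// It collects lines in memory only to keep the demo short.
func scanLines(data []byte) ([]string, error) {
	sc, err := newAutoScanner(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	var lines []string
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	return lines, sc.Err()
}

// gzipBytes (hypothetical helper) compresses s in memory to build demo input.
func gzipBytes(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(s)); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	inputs := [][]byte{[]byte("plain\ntext\n"), gzipBytes("gzipped\ntext\n")}
	for _, in := range inputs {
		lines, err := scanLines(in)
		fmt.Println(lines, err)
	}
}
```

In a real upload handler the scanner would be consumed line by line instead of being collected into a slice, so the file is never held in memory as a whole. Note also that bufio.Scanner has a default maximum token size, so very long lines may need a larger buffer via Scanner.Buffer.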

Hope that helps.

Answer 2

Score: 0

Thanks everyone - it turns out that twotwotwo and thundercat were correct: the stream was getting corrupted in a spot unrelated to the code I posted. Weirdly, it seems to be related to writing to the HTTP response while still reading from the request stream. I'm still investigating, but it seems the original question was misguided.
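If the cause really is writing the response while the request body is still being read, one cautious pattern is to drain the body first and reply only afterwards. The net/http documentation for ResponseWriter notes that, depending on the HTTP protocol version and the client, writing to the response may prevent further reads from Request.Body. The following is only a sketch under that assumption; countLinesHandler, postLines and the /upload path are hypothetical names made up for the illustration, not the poster's actual code.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// countLinesHandler (hypothetical) streams the request body through a scanner
// and does not touch the ResponseWriter until the body is fully consumed.
func countLinesHandler(w http.ResponseWriter, r *http.Request) {
	sc := bufio.NewScanner(r.Body)
	n := 0
	for sc.Scan() {
		n++ // process each line here; nothing is written to w yet
	}
	if err := sc.Err(); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// Only now, with the body fully read, start the response.
	fmt.Fprintf(w, "%d lines\n", n)
}

// postLines (hypothetical) runs the handler in memory using httptest.
func postLines(body string) (int, string) {
	req := httptest.NewRequest(http.MethodPost, "/upload", strings.NewReader(body))
	rec := httptest.NewRecorder()
	countLinesHandler(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, out := postLines("a\nb\nc\n")
	fmt.Print(code, " ", out)
}
```

The trade-off is that progress cannot be reported to the client while the upload is being parsed; results have to be accumulated (or summarized) and sent once reading is done.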

huangapple
  • Posted on February 4, 2015 at 06:28:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/28309988.html