Reading files with a BOM in Go

Question

I need to read Unicode files that may or may not contain a byte-order mark. I could of course check the first few bytes of the file myself, and discard a BOM if I find one. But before I do, is there any standard way of doing this, either in the core libraries or a third party?

Answer 1

Score: 11

No standard way, IIRC (and the standard library would really be a wrong layer to implement such a check in) so here are two examples of how you could deal with it yourself.

One is to use a buffered reader above your data stream:

package main

import (
    "bufio"
    "log"
    "os"
)

func main() {
    fd, err := os.Open("filename")
    if err != nil {
        log.Fatal(err)
    }
    defer fd.Close()
    br := bufio.NewReader(fd)
    r, _, err := br.ReadRune()
    if err != nil {
        log.Fatal(err) // Note: an empty file ends up here with io.EOF
    }
    if r != '\uFEFF' {
        br.UnreadRune() // Not a BOM -- put the rune back
    }
    // Now work with br as you would do with fd
    // ...
}

Another approach, which works with objects implementing the io.Seeker interface, is to read the first three bytes and, if they are not a BOM, call Seek() to go back to the beginning, like this:

package main

import (
    "io"
    "log"
    "os"
)

func main() {
    fd, err := os.Open("filename")
    if err != nil {
        log.Fatal(err)
    }
    defer fd.Close()
    bom := [3]byte{}
    _, err = io.ReadFull(fd, bom[:])
    if err != nil {
        log.Fatal(err) // Note: files shorter than three bytes end up here
    }
    if bom[0] != 0xef || bom[1] != 0xbb || bom[2] != 0xbf {
        _, err = fd.Seek(0, io.SeekStart) // Not a BOM -- seek back to the beginning
        if err != nil {
            log.Fatal(err)
        }
    }
    // The next read operation on fd will read real data
    // ...
}

This is possible since instances of *os.File (what os.Open() returns) support seeking and hence implement io.Seeker. Note that that's not the case for, say, the Body reader of an HTTP response, since you can't "rewind" it. bufio.Reader works around this limitation of non-seekable streams by performing some buffering (obviously), and that's what allows you to call UnreadRune() on it.

Note that both examples assume the file we're dealing with is encoded in UTF-8. If you need to deal with other (or unknown) encoding, things get more complicated.
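To give an idea of what "more complicated" involves, here is a hedged sketch that only identifies which BOM, if any, starts a byte slice. detectBOM is a hypothetical helper, not something from the standard library. It covers the UTF-8, UTF-16 and UTF-32 signatures and does not decode anything. Be aware that the UTF-32LE signature begins with the UTF-16LE one, so UTF-16LE text whose first character is U+0000 would be misreported as UTF-32LE.

```go
package main

import (
	"bytes"
	"fmt"
)

// detectBOM (hypothetical helper) reports which Unicode BOM, if any, starts
// b, and how many bytes that BOM occupies. It returns "" and 0 when no BOM
// is found.
func detectBOM(b []byte) (name string, size int) {
	// Order matters: the UTF-32LE signature starts with the UTF-16LE one,
	// so the longer signatures are checked first.
	boms := []struct {
		name string
		sig  []byte
	}{
		{"UTF-32BE", []byte{0x00, 0x00, 0xFE, 0xFF}},
		{"UTF-32LE", []byte{0xFF, 0xFE, 0x00, 0x00}},
		{"UTF-8", []byte{0xEF, 0xBB, 0xBF}},
		{"UTF-16BE", []byte{0xFE, 0xFF}},
		{"UTF-16LE", []byte{0xFF, 0xFE}},
	}
	for _, bom := range boms {
		if bytes.HasPrefix(b, bom.sig) {
			return bom.name, len(bom.sig)
		}
	}
	return "", 0
}

func main() {
	// "h" encoded as UTF-16LE, preceded by the UTF-16LE BOM.
	name, size := detectBOM([]byte{0xFF, 0xFE, 'h', 0x00})
	fmt.Println(name, size) // UTF-16LE 2
}
```

Knowing the encoding is only the first step: the remaining bytes still have to be decoded to UTF-8 before the usual string functions behave as expected.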

Answer 2

Score: 5

You can use the utfbom package. It wraps an io.Reader and detects and discards the BOM as necessary. It can also return the encoding detected from the BOM.

Answer 3

Score: 4

I thought I would add here the way to strip the Byte Order Mark sequence from a string -- rather than messing around with bytes directly (as shown above).

package main

import (
    "fmt"
    "strings"
)

func main() {
    s := "\uFEFF is a string that starts with a Byte Order Mark"
    fmt.Printf("before: '%v' (len=%v)\n", s, len(s))

    ByteOrderMarkAsString := string('\uFEFF')

    if strings.HasPrefix(s, ByteOrderMarkAsString) {
        fmt.Printf("Found leading Byte Order Mark sequence!\n")
        s = strings.TrimPrefix(s, ByteOrderMarkAsString)
    }
    fmt.Printf("after: '%v' (len=%v)\n", s, len(s))
}

Other "strings" functions should work as well.

And this is what prints out:

before: ' is a string that starts with a Byte Order Mark' (len=50)
Found leading Byte Order Mark sequence!
after: ' is a string that starts with a Byte Order Mark' (len=47)
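A shorter variant, offered as a sketch rather than a correction: strings.TrimPrefix already returns its input unchanged when the prefix is absent, so the HasPrefix check is only needed if you want to log the finding. bytes.TrimPrefix does the same for a []byte, for instance the result of os.ReadFile. The helper names stripBOM and stripBOMBytes are made up for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// bom is U+FEFF; as a Go string it is the three bytes EF BB BF.
const bom = "\uFEFF"

// stripBOM (hypothetical helper) removes one leading BOM from a string.
func stripBOM(s string) string {
	return strings.TrimPrefix(s, bom)
}

// stripBOMBytes (hypothetical helper) removes one leading UTF-8 BOM from a
// byte slice. The result shares memory with the input.
func stripBOMBytes(b []byte) []byte {
	return bytes.TrimPrefix(b, []byte(bom))
}

func main() {
	fmt.Println(len(stripBOM(bom + "hello")))              // 5
	fmt.Println(len(stripBOM("hello")))                    // 5, unchanged
	fmt.Println(len(stripBOMBytes([]byte(bom + "hello")))) // 5
}
```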

Cheers!

Answer 4

Score: 3

There's no standard way of doing this in the Go core packages. Follow the Unicode standard.

Unicode Byte Order Mark (BOM) FAQ

Answer 5

Score: 0

We used the transform package (https://pkg.go.dev/golang.org/x/text/transform) to read CSV files (which may have been saved from Excel as UTF-8, UTF-8 with BOM, or UTF-16) as follows:

import (
    "encoding/csv"
    "io"

    "golang.org/x/text/encoding"
    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

// BOMAwareCSVReader will detect a UTF BOM (Byte Order Mark) at the
// start of the data and transform to UTF8 accordingly.
// If there is no BOM, it will read the data without any transformation.
func BOMAwareCSVReader(reader io.Reader) *csv.Reader {
    var transformer = unicode.BOMOverride(encoding.Nop.NewDecoder())
    return csv.NewReader(transform.NewReader(reader, transformer))
}

We are using Go 1.18.
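For readers who want to see roughly what BOM-driven decoding involves without the extra dependency, here is a standard-library-only sketch. It is not how golang.org/x/text is implemented, and it works on a whole byte slice rather than a stream. decodeByBOM and decodeUTF16 are hypothetical helpers, and only UTF-8 and UTF-16 BOMs are considered.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeByBOM (hypothetical helper) returns the input as a UTF-8 string,
// using a leading BOM to choose between UTF-8 and UTF-16. Input without a
// BOM is returned unchanged.
func decodeByBOM(b []byte) string {
	switch {
	case bytes.HasPrefix(b, []byte{0xEF, 0xBB, 0xBF}):
		return string(b[3:])
	case bytes.HasPrefix(b, []byte{0xFF, 0xFE}):
		return decodeUTF16(b[2:], binary.LittleEndian)
	case bytes.HasPrefix(b, []byte{0xFE, 0xFF}):
		return decodeUTF16(b[2:], binary.BigEndian)
	}
	return string(b)
}

// decodeUTF16 converts UTF-16 bytes in the given byte order to a string.
// A trailing odd byte is ignored.
func decodeUTF16(b []byte, order binary.ByteOrder) string {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, order.Uint16(b[i:]))
	}
	return string(utf16.Decode(units))
}

func main() {
	// "a,b" encoded as UTF-16LE with a BOM.
	utf16le := []byte{0xFF, 0xFE, 'a', 0x00, ',', 0x00, 'b', 0x00}
	fmt.Println(decodeByBOM(utf16le)) // a,b
}
```

For large files, a streaming transformer such as the one used in the answer is likely the better fit, since this sketch needs the whole input in memory.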

huangapple
  • Posted on January 27, 2014, 09:37:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/21371673.html