January 27, 2014 09:37:05 · go

Reading files with a BOM in Go

Question

I need to read Unicode files that may or may not contain a byte-order mark. I could of course check the first few bytes of the file myself, and discard a BOM if I find one. But before I do, is there any standard way of doing this, either in the core libraries or a third party?

Answer 1

Score: 11

There is no standard way, IIRC (and the standard library would really be the wrong layer to implement such a check in), so here are two examples of how you could deal with it yourself.

One is to wrap your data stream in a buffered reader:

package main

import (
    "bufio"
    "io"
    "log"
    "os"
)

func main() {
    fd, err := os.Open("filename")
    if err != nil {
        log.Fatal(err)
    }
    defer fd.Close()
    br := bufio.NewReader(fd)
    r, _, err := br.ReadRune()
    // An empty file yields io.EOF here, which is not a real failure
    if err != nil && err != io.EOF {
        log.Fatal(err)
    }
    if err == nil && r != '\uFEFF' {
        br.UnreadRune() // Not a BOM -- put the rune back
    }
    // Now work with br as you would do with fd
    // ...
}

Another approach, which works with objects implementing the io.Seeker interface, is to read the first three bytes and, if they're not a BOM, Seek() back to the beginning, like in:

package main

import (
    "io"
    "log"
    "os"
)

func main() {
    fd, err := os.Open("filename")
    if err != nil {
        log.Fatal(err)
    }
    defer fd.Close()
    var bom [3]byte
    _, err = io.ReadFull(fd, bom[:])
    // A file shorter than three bytes cannot start with a BOM,
    // so a short read is not treated as a failure
    if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
        log.Fatal(err)
    }
    if bom[0] != 0xef || bom[1] != 0xbb || bom[2] != 0xbf {
        _, err = fd.Seek(0, io.SeekStart) // Not a BOM -- seek back to the beginning
        if err != nil {
            log.Fatal(err)
        }
    }
    // The next read operation on fd will read real data
    // ...
}

This is possible since instances of *os.File (what os.Open() returns) support seeking and hence implement io.Seeker. Note that that's not the case for, say, the Body reader of HTTP responses, since you can't "rewind" it. bufio.Reader works around this limitation of non-seekable streams by performing some buffering (obviously); that's what allows you to call UnreadRune() on it.
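
The buffering point can be taken one step further: bufio.Reader also has Peek and Discard, so the first bytes can be inspected without consuming them, and this works on any io.Reader, seekable or not. The following is only a sketch under the same UTF-8 assumption; skipUTF8BOM and readAllWithoutBOM are hypothetical names made up for illustration, not part of any library.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// utf8BOM is the UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// skipUTF8BOM (hypothetical helper) wraps r in a bufio.Reader and discards
// a leading UTF-8 BOM if one is present. It never needs to seek.
func skipUTF8BOM(r io.Reader) (*bufio.Reader, error) {
	br := bufio.NewReader(r)
	// Peek returns io.EOF together with a short slice when the stream
	// holds fewer than three bytes; that just means "no BOM".
	head, err := br.Peek(len(utf8BOM))
	if err != nil && err != io.EOF {
		return nil, err
	}
	if bytes.Equal(head, utf8BOM) {
		if _, err := br.Discard(len(utf8BOM)); err != nil {
			return nil, err
		}
	}
	return br, nil
}

// readAllWithoutBOM (hypothetical helper) reads a whole string through
// skipUTF8BOM; it exists only to keep the demo short.
func readAllWithoutBOM(input string) (string, error) {
	br, err := skipUTF8BOM(strings.NewReader(input))
	if err != nil {
		return "", err
	}
	data, err := io.ReadAll(br)
	return string(data), err
}

func main() {
	for _, in := range []string{"\uFEFFhello", "hello", ""} {
		out, err := readAllWithoutBOM(in)
		fmt.Printf("%q -> %q (err=%v)\n", in, out, err)
	}
}
```

The returned *bufio.Reader is then used in place of the original reader, exactly as br is in the first example.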

Note that both examples assume the file we're dealing with is encoded in UTF-8. If you need to deal with another (or an unknown) encoding, things get more complicated.
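
To give a rough idea of what "more complicated" involves, here is a minimal sketch that only identifies which encoding a BOM announces, using the byte sequences listed in the Unicode BOM FAQ. The name detectBOM is hypothetical, and the sketch does no decoding; converting UTF-16 or UTF-32 content to UTF-8 is a separate step.

```go
package main

import (
	"bytes"
	"fmt"
)

// detectBOM (hypothetical helper) reports the encoding announced by a BOM
// at the start of b and the BOM's length in bytes. It returns "", 0 when
// b does not start with a recognised BOM.
func detectBOM(b []byte) (name string, size int) {
	switch {
	case bytes.HasPrefix(b, []byte{0xEF, 0xBB, 0xBF}):
		return "UTF-8", 3
	case bytes.HasPrefix(b, []byte{0x00, 0x00, 0xFE, 0xFF}):
		return "UTF-32BE", 4
	// The UTF-32LE check must come before UTF-16LE because FF FE 00 00
	// also starts with FF FE. This is a heuristic: the same bytes could
	// be UTF-16LE text whose first character is U+0000.
	case bytes.HasPrefix(b, []byte{0xFF, 0xFE, 0x00, 0x00}):
		return "UTF-32LE", 4
	case bytes.HasPrefix(b, []byte{0xFE, 0xFF}):
		return "UTF-16BE", 2
	case bytes.HasPrefix(b, []byte{0xFF, 0xFE}):
		return "UTF-16LE", 2
	}
	return "", 0
}

func main() {
	samples := [][]byte{
		{0xEF, 0xBB, 0xBF, 'h', 'i'},
		{0xFF, 0xFE, 'h', 0x00},
		{'h', 'i'},
	}
	for _, s := range samples {
		name, size := detectBOM(s)
		fmt.Printf("% x -> %q (BOM length %d)\n", s, name, size)
	}
}
```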

Answer 2

Score: 5

You can use the utfbom package. It wraps an io.Reader, and detects and discards a BOM as necessary. It can also return the encoding detected from the BOM.

Answer 3

Score: 4

I thought I would add here the way to strip the Byte Order Mark sequence from a string -- rather than messing around with bytes directly (as shown above).

package main

import (
    "fmt"
    "strings"
)

func main() {
    s := "\uFEFF is a string that starts with a Byte Order Mark"
    fmt.Printf("before: '%v' (len=%v)\n", s, len(s))

    ByteOrderMarkAsString := string('\uFEFF')

    if strings.HasPrefix(s, ByteOrderMarkAsString) {

        fmt.Printf("Found leading Byte Order Mark sequence!\n")

        s = strings.TrimPrefix(s, ByteOrderMarkAsString)
    }
    fmt.Printf("after: '%v' (len=%v)\n", s, len(s))
}

Other "strings" functions should work as well.

And this is what prints out (the BOM itself is invisible in the first line):

before: ' is a string that starts with a Byte Order Mark' (len=50)
Found leading Byte Order Mark sequence!
after: ' is a string that starts with a Byte Order Mark' (len=47)
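
The same idea carries over to a []byte, for example the contents of a file read in one go. This is a small sketch rather than anything from the original answer; stripBOM is a hypothetical name. bytes.TrimPrefix returns its input unchanged when the prefix is absent, so the separate HasPrefix check is only needed if you want to report that a BOM was found.

```go
package main

import (
	"bytes"
	"fmt"
)

// stripBOM (hypothetical helper) returns data without a leading UTF-8 BOM.
// []byte("\uFEFF") is the three-byte sequence EF BB BF.
func stripBOM(data []byte) []byte {
	return bytes.TrimPrefix(data, []byte("\uFEFF"))
}

func main() {
	data := []byte("\uFEFFname,age\n")
	fmt.Printf("before: len=%d\n", len(data))
	data = stripBOM(data)
	fmt.Printf("after:  len=%d\n", len(data))
}
```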

Cheers!

Answer 4

Score: 3

There's no standard way of doing this in the Go core packages. Follow the Unicode standard.

Unicode Byte Order Mark (BOM) FAQ

Answer 5

Score: 0

We used the transform package (https://pkg.go.dev/golang.org/x/text/transform) to read CSV files (which may have been saved from Excel as UTF-8, UTF-8 with BOM, or UTF-16) as follows:

import (
    "encoding/csv"
    "io"

    "golang.org/x/text/encoding"
    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

// BOMAwareCSVReader will detect a UTF BOM (Byte Order Mark) at the
// start of the data and transform to UTF8 accordingly.
// If there is no BOM, it will read the data without any transformation.
func BOMAwareCSVReader(reader io.Reader) *csv.Reader {
    var transformer = unicode.BOMOverride(encoding.Nop.NewDecoder())
    return csv.NewReader(transform.NewReader(reader, transformer))
}

We are using Go 1.18.
