Reading files with a BOM in Go
Question
I need to read Unicode files that may or may not contain a byte-order mark. I could of course check the first few bytes of the file myself, and discard a BOM if I find one. But before I do, is there any standard way of doing this, either in the core libraries or a third party?
Answer 1
Score: 11
There's no standard way, IIRC (and the standard library would really be the wrong layer to implement such a check in), so here are two examples of how you could deal with it yourself.
One is to use a buffered reader on top of your data stream:
package main

import (
	"bufio"
	"log"
	"os"
)

func main() {
	fd, err := os.Open("filename")
	if err != nil {
		log.Fatal(err)
	}
	defer fd.Close() // read-only file, so the Close error is ignored
	br := bufio.NewReader(fd)
	r, _, err := br.ReadRune()
	if err != nil {
		log.Fatal(err) // note: an empty file ends up here with io.EOF
	}
	if r != '\uFEFF' {
		// Not a BOM -- put the rune back
		if err := br.UnreadRune(); err != nil {
			log.Fatal(err)
		}
	}
	// Now work with br as you would do with fd
	// ...
}
Another approach, which works with objects implementing the io.Seeker interface, is to read the first three bytes and, if they're not a BOM, call Seek() to go back to the beginning, like in:
package main

import (
	"io"
	"log"
	"os"
)

func main() {
	fd, err := os.Open("filename")
	if err != nil {
		log.Fatal(err)
	}
	defer fd.Close() // read-only file, so the Close error is ignored
	var bom [3]byte
	_, err = io.ReadFull(fd, bom[:])
	if err != nil {
		log.Fatal(err) // note: files shorter than three bytes end up here
	}
	if bom[0] != 0xef || bom[1] != 0xbb || bom[2] != 0xbf {
		_, err = fd.Seek(0, io.SeekStart) // Not a BOM -- seek back to the beginning
		if err != nil {
			log.Fatal(err)
		}
	}
	// The next read operation on fd will read real data
	// ...
}
This is possible since instances of *os.File (what os.Open() returns) support seeking and hence implement io.Seeker. Note that that's not the case for, say, the Body reader of HTTP responses, since you can't "rewind" it. bufio.Reader works around this limitation of non-seekable streams by performing some buffering (obviously), which is what allows you to call UnreadRune() on it.
Note that both examples assume the file we're dealing with is encoded in UTF-8. If you need to deal with other (or unknown) encoding, things get more complicated.
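To give an idea of what "more complicated" means, here is a minimal sketch that only identifies which BOM (if any) a byte slice starts with. The table and the detectBOM helper are names made up for this illustration, not part of any library. Note that the check is inherently ambiguous: FF FE 00 00 is treated as UTF-32LE here, but it could also be a UTF-16LE BOM followed by a NUL character.

```go
package main

import (
	"bytes"
	"fmt"
)

// boms lists the byte-order marks of the Unicode encoding forms.
// The UTF-32 entries must come before the UTF-16 ones, because the
// UTF-32LE mark (FF FE 00 00) begins with the UTF-16LE mark (FF FE).
var boms = []struct {
	name string
	mark []byte
}{
	{"UTF-32BE", []byte{0x00, 0x00, 0xFE, 0xFF}},
	{"UTF-32LE", []byte{0xFF, 0xFE, 0x00, 0x00}},
	{"UTF-8", []byte{0xEF, 0xBB, 0xBF}},
	{"UTF-16BE", []byte{0xFE, 0xFF}},
	{"UTF-16LE", []byte{0xFF, 0xFE}},
}

// detectBOM (a hypothetical helper) reports which encoding the leading
// bytes of data announce and how many bytes the mark occupies.
// It returns "" and 0 when data does not start with a known BOM.
func detectBOM(data []byte) (name string, size int) {
	for _, b := range boms {
		if bytes.HasPrefix(data, b.mark) {
			return b.name, len(b.mark)
		}
	}
	return "", 0
}

func main() {
	data := []byte("\xEF\xBB\xBFhello")
	name, size := detectBOM(data)
	fmt.Printf("%s, payload %q\n", name, data[size:])
}
```

Knowing the encoding is only the first step: anything other than UTF-8 still has to be decoded before the usual string functions can be used on it.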
Answer 2
Score: 5
You can use the utfbom package. It wraps an io.Reader, detecting and discarding the BOM as necessary. It can also return the encoding detected from the BOM.
Answer 3
Score: 4
I thought I would add here the way to strip the Byte Order Mark sequence from a string -- rather than messing around with bytes directly (as shown above).
package main

import (
	"fmt"
	"strings"
)

func main() {
	s := "\uFEFF is a string that starts with a Byte Order Mark"
	fmt.Printf("before: '%v' (len=%v)\n", s, len(s))
	ByteOrderMarkAsString := string('\uFEFF')
	if strings.HasPrefix(s, ByteOrderMarkAsString) {
		fmt.Printf("Found leading Byte Order Mark sequence!\n")
		s = strings.TrimPrefix(s, ByteOrderMarkAsString)
	}
	fmt.Printf("after: '%v' (len=%v)\n", s, len(s))
}
Other "strings" functions should work as well.
And this is what it prints (the BOM itself is invisible in most terminals):
before: ' is a string that starts with a Byte Order Mark' (len=50)
Found leading Byte Order Mark sequence!
after: ' is a string that starts with a Byte Order Mark' (len=47)
Cheers!
Answer 4
Score: 3
There's no standard way of doing this in the Go core packages. Follow the Unicode standard.
Answer 5
Score: 0
We used the transform package (https://pkg.go.dev/golang.org/x/text/transform) to read CSV files (which may have been saved from Excel as UTF-8, UTF-8 with BOM, or UTF-16) as follows:
import (
	"encoding/csv"
	"io"

	"golang.org/x/text/encoding"
	"golang.org/x/text/encoding/unicode"
	"golang.org/x/text/transform"
)

// BOMAwareCSVReader will detect a UTF BOM (Byte Order Mark) at the
// start of the data and transform to UTF8 accordingly.
// If there is no BOM, it will read the data without any transformation.
func BOMAwareCSVReader(reader io.Reader) *csv.Reader {
	var transformer = unicode.BOMOverride(encoding.Nop.NewDecoder())
	return csv.NewReader(transform.NewReader(reader, transformer))
}
We are using Go 1.18.
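If pulling in golang.org/x/text is not an option, the UTF-16 part can be approximated with the standard library alone. This is only a sketch under a few assumptions: the whole input fits in memory, the data starts with a UTF-16 BOM, and decodeUTF16 is a name invented for this illustration. It is not a replacement for the transformer above, which also streams and handles the other cases.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16 (a hypothetical helper) converts UTF-16 data that starts
// with a BOM into a Go string. It returns ok=false when no UTF-16 BOM
// is present or the input has an odd number of bytes.
func decodeUTF16(data []byte) (s string, ok bool) {
	if len(data) < 2 || len(data)%2 != 0 {
		return "", false
	}
	var order binary.ByteOrder
	switch {
	case data[0] == 0xFF && data[1] == 0xFE:
		order = binary.LittleEndian
	case data[0] == 0xFE && data[1] == 0xFF:
		order = binary.BigEndian
	default:
		return "", false
	}
	payload := data[2:] // everything after the BOM
	units := make([]uint16, len(payload)/2)
	for i := range units {
		units[i] = order.Uint16(payload[2*i:])
	}
	// utf16.Decode also combines surrogate pairs into single runes.
	return string(utf16.Decode(units)), true
}

func main() {
	// "a,b" saved as UTF-16LE with a BOM.
	data := []byte{0xFF, 0xFE, 'a', 0, ',', 0, 'b', 0}
	s, ok := decodeUTF16(data)
	fmt.Printf("%q %v\n", s, ok) // prints "a,b" true
}
```

The resulting string could then be handed to csv.NewReader through strings.NewReader.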