How to read a UTF-16 text file into a string in Go?

Question

I can read the file into a byte array, but when I convert it to a string, the UTF-16 bytes are treated as ASCII.

How do I convert it correctly?

    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    func main() {
        // open the file and read its first line
        f, err := os.Open("test.txt")
        if err != nil {
            fmt.Printf("error opening file: %v\n", err)
            os.Exit(1)
        }
        r := bufio.NewReader(f)
        // s is the line, b is the isPrefix flag, e is the error
        var s, b, e = r.ReadLine()
        if e == nil {
            fmt.Println(b)
            fmt.Println(s)
            fmt.Println(string(s))
        }
    }

Output:

    false
    [255 254 91 0 83 0 99 0 114 0 105 0 112 0 116 0 32 0 73 0 110 0 102 0 111 0 93 0 13 0]
    S c r i p t I n f o ]


Update:

After testing the two examples, I now understand the exact problem.

On Windows, if the line ends with a line break (CR+LF), the CR is read as part of the line, because ReadLine cannot handle UTF-16 correctly ([0D 0A] = OK, [0D 00 0A 00] = not OK).

If ReadLine could recognize UTF-16, it would understand [0D 00 0A 00] and return []uint16 rather than []byte.

So I think I should not use bufio.NewReader, since it cannot read UTF-16. I don't see any parameter on bufio.Reader.ReadLine that acts as a flag to indicate whether the text is UTF-8, UTF-16LE/BE, or UTF-32. Is there a ReadLine-style function for Unicode text in the Go library?
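
To make the problem described in the update concrete, here is a minimal sketch that feeds the UTF-16LE bytes of "A\r\nB" to bufio.Reader.ReadLine. The helper name readRawLines and the sample bytes are assumptions made for illustration; they are not part of the original program. ReadLine splits on the single byte 0x0A, so the first line keeps the CR code unit (0D 00) and the second line starts with the stray high byte of the LF.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// readRawLines (hypothetical helper) returns every line that
// bufio.Reader.ReadLine yields for data. Each line is copied because
// ReadLine reuses its internal buffer. The isPrefix flag is ignored,
// which is only safe for lines shorter than the reader's buffer.
func readRawLines(data []byte) [][]byte {
	r := bufio.NewReader(bytes.NewReader(data))
	var lines [][]byte
	for {
		line, _, err := r.ReadLine()
		if err != nil {
			return lines // io.EOF ends the loop
		}
		lines = append(lines, append([]byte(nil), line...))
	}
}

func main() {
	// "A\r\nB" encoded as UTF-16LE, without a BOM.
	data := []byte{0x41, 0x00, 0x0D, 0x00, 0x0A, 0x00, 0x42, 0x00}
	for _, line := range readRawLines(data) {
		fmt.Printf("% X\n", line)
	}
	// Prints:
	// 41 00 0D 00
	// 00 42 00
}
```

Neither line is valid UTF-16 on its own any more, which is why the text has to be decoded (or split with UTF-16 awareness) before it is broken into lines.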

Answer 1

Score: 20

The latest version of golang.org/x/text/encoding/unicode makes it easier to do this because it includes unicode.BOMOverride, which will intelligently interpret the BOM.

Here is ReadFileUTF16(), which is like os.ReadFile() but decodes UTF-16.

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "log"
        "os"
        "strings"

        "golang.org/x/text/encoding/unicode"
        "golang.org/x/text/transform"
    )

    // ReadFileUTF16 is similar to os.ReadFile() but decodes UTF-16. Useful when
    // reading data from MS-Windows systems that generate UTF-16 files: input
    // without a BOM is decoded as big-endian, and a BOM, if present, overrides
    // that default.
    func ReadFileUTF16(filename string) ([]byte, error) {
        // Read the file into a []byte:
        raw, err := os.ReadFile(filename)
        if err != nil {
            return nil, err
        }

        // Make a transformer that converts big-endian UTF-16 to UTF-8:
        win16be := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
        // Make a transformer that is like win16be, but abides by the BOM:
        utf16bom := unicode.BOMOverride(win16be.NewDecoder())

        // Make a Reader that uses utf16bom:
        unicodeReader := transform.NewReader(bytes.NewReader(raw), utf16bom)

        // Decode and return the result:
        return io.ReadAll(unicodeReader)
    }

    func main() {
        data, err := ReadFileUTF16("inputfile.txt")
        if err != nil {
            log.Fatal(err)
        }
        // Normalize Windows line endings.
        final := strings.ReplaceAll(string(data), "\r\n", "\n")
        fmt.Println(final)
    }

Here is NewScannerUTF16, which is like os.Open() but returns a reader that decodes UTF-16 and can be wrapped in a bufio.Scanner.

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"

        "golang.org/x/text/encoding/unicode"
        "golang.org/x/text/transform"
    )

    type utfScanner interface {
        Read(p []byte) (n int, err error)
    }

    // NewScannerUTF16 opens a file like os.Open() but decodes it as UTF-16.
    // Useful when reading data from MS-Windows systems that generate UTF-16
    // files: input without a BOM is decoded as big-endian, and a BOM, if
    // present, overrides that default.
    func NewScannerUTF16(filename string) (utfScanner, error) {
        // Open the file:
        file, err := os.Open(filename)
        if err != nil {
            return nil, err
        }

        // Make a transformer that converts big-endian UTF-16 to UTF-8:
        win16be := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
        // Make a transformer that is like win16be, but abides by the BOM:
        utf16bom := unicode.BOMOverride(win16be.NewDecoder())

        // Make a Reader that uses utf16bom:
        unicodeReader := transform.NewReader(file, utf16bom)
        return unicodeReader, nil
    }

    func main() {
        s, err := NewScannerUTF16("inputfile.txt")
        if err != nil {
            log.Fatal(err)
        }

        scanner := bufio.NewScanner(s)
        for scanner.Scan() {
            fmt.Println(scanner.Text()) // Println will add back the final '\n'
        }
        if err := scanner.Err(); err != nil {
            fmt.Fprintln(os.Stderr, "reading inputfile:", err)
        }
    }

FYI: I have put these functions into an open source module and have made further improvements. See https://github.com/TomOnTime/utfutil/

Answer 2

Score: 14

UTF-16, UTF-8, and byte order marks are defined by the Unicode Consortium: see the UTF-16 FAQ, the UTF-8 FAQ, and the Byte Order Mark (BOM) FAQ.


> Issue 4802: bufio: reading lines is too cumbersome
>
> Reading lines from a file is too cumbersome in Go.
>
> People are often drawn to bufio.Reader.ReadLine because of its name,
> but it has a weird signature, returning (line []byte, isPrefix bool,
> err error), and requires a lot of work.
>
> ReadSlice and ReadString require a delimiter byte, which is almost
> always the obvious and unsightly '\n', and also can return both a line
> and an EOF


> Revision: f685026a2d38
>
> bufio: new Scanner interface
>
> Add a new, simple interface for scanning (probably textual) data,
> based on a new type called Scanner. It does its own internal
> buffering, so should be plausibly efficient even without injecting a
> bufio.Reader. The format of the input is defined by a "split
> function", by default splitting into lines.


> go1.1beta1 released
>
> You can download binary and source distributions from the usual place:
> https://code.google.com/p/go/downloads/list?q=go1.1beta1


Here's a program that uses the Unicode rules to convert the lines of a UTF-16 text file to Go UTF-8 encoded strings. The code has been revised to take advantage of the new bufio.Scanner interface in Go 1.1.

    package main

    import (
        "bufio"
        "bytes"
        "encoding/binary"
        "fmt"
        "os"
        "runtime"
        "unicode/utf16"
        "unicode/utf8"
    )

    // UTF16BytesToString converts UTF-16 encoded bytes, in big or little endian byte order,
    // to a UTF-8 encoded string.
    func UTF16BytesToString(b []byte, o binary.ByteOrder) string {
        utf := make([]uint16, (len(b)+(2-1))/2)
        for i := 0; i+(2-1) < len(b); i += 2 {
            utf[i/2] = o.Uint16(b[i:])
        }
        // An odd trailing byte becomes the Unicode replacement character.
        if len(b)/2 < len(utf) {
            utf[len(utf)-1] = utf8.RuneError
        }
        return string(utf16.Decode(utf))
    }

    // UTF-16 endian byte order
    const (
        unknownEndian = iota
        bigEndian
        littleEndian
    )

    // dropCREndian drops a terminal \r from the endian data.
    func dropCREndian(data []byte, t1, t2 byte) []byte {
        if len(data) > 1 {
            if data[len(data)-2] == t1 && data[len(data)-1] == t2 {
                return data[0 : len(data)-2]
            }
        }
        return data
    }

    // dropCRBE drops a terminal \r from the big endian data.
    func dropCRBE(data []byte) []byte {
        return dropCREndian(data, '\x00', '\r')
    }

    // dropCRLE drops a terminal \r from the little endian data.
    func dropCRLE(data []byte) []byte {
        return dropCREndian(data, '\r', '\x00')
    }

    // dropCR drops a terminal \r from the data and reports the byte order
    // that the \r revealed, if any.
    func dropCR(data []byte) ([]byte, int) {
        if d := dropCRLE(data); len(d) != len(data) {
            return d, littleEndian
        }
        if d := dropCRBE(data); len(d) != len(data) {
            return d, bigEndian
        }
        return data, unknownEndian
    }

    // ScanUTF16LinesFunc returns a split function for a Scanner that returns each
    // line of text, stripped of any trailing end-of-line marker, together with a
    // function that reports the byte order in use. The returned line may
    // be empty. The end-of-line marker is one optional carriage return followed
    // by one mandatory newline. In regular expression notation, it is `\r?\n`.
    // The last non-empty line of input will be returned even if it has no
    // newline.
    func ScanUTF16LinesFunc(byteOrder binary.ByteOrder) (bufio.SplitFunc, func() binary.ByteOrder) {

        // Function closure variables
        var endian = unknownEndian
        switch byteOrder {
        case binary.BigEndian:
            endian = bigEndian
        case binary.LittleEndian:
            endian = littleEndian
        }
        const bom = 0xFEFF
        var checkBOM bool = endian == unknownEndian

        // Scanner split function
        splitFunc := func(data []byte, atEOF bool) (advance int, token []byte, err error) {

            if atEOF && len(data) == 0 {
                return 0, nil, nil
            }

            if checkBOM {
                checkBOM = false
                if len(data) > 1 {
                    switch uint16(bom) {
                    case uint16(data[0])<<8 | uint16(data[1]):
                        endian = bigEndian
                        return 2, nil, nil
                    case uint16(data[1])<<8 | uint16(data[0]):
                        endian = littleEndian
                        return 2, nil, nil
                    }
                }
            }

            // Scan for newline-terminated lines.
            i := 0
            for {
                j := bytes.IndexByte(data[i:], '\n')
                if j < 0 {
                    break
                }
                i += j
                switch e := i % 2; e {
                case 1: // UTF-16BE
                    if endian != littleEndian {
                        if i > 1 {
                            if data[i-1] == '\x00' {
                                endian = bigEndian
                                // We have a full newline-terminated line.
                                return i + 1, dropCRBE(data[0 : i-1]), nil
                            }
                        }
                    }
                case 0: // UTF-16LE
                    if endian != bigEndian {
                        if i+1 < len(data) {
                            i++
                            if data[i] == '\x00' {
                                endian = littleEndian
                                // We have a full newline-terminated line.
                                return i + 1, dropCRLE(data[0 : i-1]), nil
                            }
                        }
                    }
                }
                i++
            }

            // If we're at EOF, we have a final, non-terminated line. Return it.
            if atEOF {
                // drop CR.
                advance = len(data)
                switch endian {
                case bigEndian:
                    data = dropCRBE(data)
                case littleEndian:
                    data = dropCRLE(data)
                default:
                    data, endian = dropCR(data)
                }
                if endian == unknownEndian {
                    if runtime.GOOS == "windows" {
                        endian = littleEndian
                    } else {
                        endian = bigEndian
                    }
                }
                return advance, data, nil
            }

            // Request more data.
            return 0, nil, nil
        }

        // Endian byte order function
        orderFunc := func() (byteOrder binary.ByteOrder) {
            switch endian {
            case bigEndian:
                byteOrder = binary.BigEndian
            case littleEndian:
                byteOrder = binary.LittleEndian
            }
            return byteOrder
        }

        return splitFunc, orderFunc
    }

    func main() {
        file, err := os.Open("utf16.le.txt")
        if err != nil {
            fmt.Println(err)
            os.Exit(1)
        }
        defer file.Close()
        fmt.Println(file.Name())

        rdr := bufio.NewReader(file)
        scanner := bufio.NewScanner(rdr)
        var bo binary.ByteOrder // unknown, infer from data
        // bo = binary.LittleEndian // windows
        splitFunc, orderFunc := ScanUTF16LinesFunc(bo)
        scanner.Split(splitFunc)

        for scanner.Scan() {
            b := scanner.Bytes()
            s := UTF16BytesToString(b, orderFunc())
            fmt.Println(len(s), s)
            fmt.Println(len(b), b)
        }
        fmt.Println(orderFunc())

        if err := scanner.Err(); err != nil {
            fmt.Println(err)
        }
    }

Output:

    utf16.le.txt
    15 "Hello, 世界"
    22 [34 0 72 0 101 0 108 0 108 0 111 0 44 0 32 0 22 78 76 117 34 0]
    0
    0 []
    15 "Hello, 世界"
    22 [34 0 72 0 101 0 108 0 108 0 111 0 44 0 32 0 22 78 76 117 34 0]
    LittleEndian

    utf16.be.txt
    15 "Hello, 世界"
    22 [0 34 0 72 0 101 0 108 0 108 0 111 0 44 0 32 78 22 117 76 0 34]
    0
    0 []
    15 "Hello, 世界"
    22 [0 34 0 72 0 101 0 108 0 108 0 111 0 44 0 32 78 22 117 76 0 34]
    BigEndian
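
The two numbers on each pair of lines in the output above can be checked by hand: len(s) counts the bytes of the decoded UTF-8 string, while len(b) counts the raw UTF-16 bytes of the same line. A small sketch, using a hypothetical helper named encodedSizes that is not part of the program above, reproduces both figures for the sample line:

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// encodedSizes (hypothetical helper) reports how many bytes s occupies
// when encoded as UTF-8 and as UTF-16.
func encodedSizes(s string) (utf8Bytes, utf16Bytes int) {
	units := utf16.Encode([]rune(s)) // one or two 16-bit units per rune
	return len(s), 2 * len(units)
}

func main() {
	u8, u16 := encodedSizes(`"Hello, 世界"`)
	fmt.Println(u8, u16) // 15 22
}
```

The line has 11 characters: the 9 ASCII ones take 1 byte each in UTF-8 and the 2 CJK ones take 3 bytes each (9 + 6 = 15), while every one of them is a single 2-byte code unit in UTF-16 (11 × 2 = 22).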

Answer 3

Score: 11

Here is the simplest way to read it:

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"

        "golang.org/x/text/encoding/unicode"
        "golang.org/x/text/transform"
    )

    func main() {
        file, err := os.Open("./text.txt")
        if err != nil {
            log.Fatal(err)
        }
        defer file.Close()

        // Decode UTF-16 to UTF-8: honor a BOM if present, otherwise assume little-endian.
        decoder := unicode.UTF16(unicode.LittleEndian, unicode.UseBOM).NewDecoder()
        scanner := bufio.NewScanner(transform.NewReader(file, decoder))
        for scanner.Scan() {
            fmt.Println(scanner.Text())
        }
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }

Since Windows uses little-endian byte order by default, we use the unicode.UseBOM policy to pick up the BOM from the text, with unicode.LittleEndian as the fallback.
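
For readers who cannot pull in golang.org/x/text, the same policy (honor the BOM, otherwise fall back to little-endian) can be sketched with the standard library alone. This is an illustration under stated assumptions, not how the x/text decoder is implemented: decodeUTF16 is a hypothetical helper, it works on a whole byte slice instead of a stream, and it silently ignores a trailing odd byte.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16 (hypothetical helper) honors a leading BOM and otherwise
// assumes little-endian. A trailing odd byte is ignored.
func decodeUTF16(b []byte) string {
	var order binary.ByteOrder = binary.LittleEndian // fallback
	if len(b) >= 2 {
		switch {
		case b[0] == 0xFF && b[1] == 0xFE: // little-endian BOM
			b = b[2:]
		case b[0] == 0xFE && b[1] == 0xFF: // big-endian BOM
			order = binary.BigEndian
			b = b[2:]
		}
	}
	// Assemble 16-bit code units, then let utf16.Decode handle surrogates.
	units := make([]uint16, len(b)/2)
	for i := range units {
		units[i] = order.Uint16(b[2*i:])
	}
	return string(utf16.Decode(units))
}

func main() {
	le := []byte{0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00}
	be := []byte{0xFE, 0xFF, 0x00, 0x48, 0x00, 0x69}
	noBOM := []byte{0x48, 0x00, 0x69, 0x00}
	fmt.Println(decodeUTF16(le), decodeUTF16(be), decodeUTF16(noBOM)) // Hi Hi Hi
}
```

Once the text is a normal Go string, it can be split into lines with bufio.Scanner or the strings package, both of which expect UTF-8 input.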

Answer 4

Score: 4

For example:

    package main

    import (
        "errors"
        "fmt"
        "log"
        "unicode/utf16"
    )

    func utf16toString(b []uint8) (string, error) {
        if len(b)&1 != 0 {
            return "", errors.New("len(b) must be even")
        }

        // Check BOM
        var bom int
        if len(b) >= 2 {
            switch n := int(b[0])<<8 | int(b[1]); n {
            case 0xfffe:
                bom = 1
                fallthrough
            case 0xfeff:
                b = b[2:]
            }
        }

        w := make([]uint16, len(b)/2)
        for i := range w {
            w[i] = uint16(b[2*i+bom&1])<<8 | uint16(b[2*i+(bom+1)&1])
        }
        return string(utf16.Decode(w)), nil
    }

    func main() {
        // Simulated data from e.g. a file
        b := []byte{255, 254, 91, 0, 83, 0, 99, 0, 114, 0, 105, 0, 112, 0, 116, 0, 32, 0, 73, 0, 110, 0, 102, 0, 111, 0, 93, 0, 13, 0}
        s, err := utf16toString(b)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%q", s)
    }

Output:

    "[Script Info]\r"

Answer 5

Score: 0

If you want to print any value as a string, you can use fmt.Sprint:

    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    func main() {
        // open the file and read its first line
        f, err := os.Open("test.txt")
        if err != nil {
            fmt.Printf("error opening file: %v\n", err)
            return
        }
        r := bufio.NewReader(f)
        var s, _, e = r.ReadLine()
        if e != nil {
            fmt.Println(e)
            return
        }
        fmt.Println(fmt.Sprint(string(s)))
    }

Posted by huangapple on April 3, 2013, at 17:38:34. When reposting, please keep the link to this article: https://go.coder-hub.com/15783830.html