Go: CSV NewReader not getting the correct number of fields

Question

How do I get the correct number of fields when using csv.NewReader?

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

func main() {
	parser := csv.NewReader(strings.NewReader(`||""FOO""||`))
	parser.Comma = '|'
	parser.LazyQuotes = true
	record, err := parser.Read()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("record length: %v\n", len(record))
}

https://go.dev/play/p/gg-KYRciWFH

It should return 5, but instead I'm getting 3:

record length: 3

Program exited.

EDIT

I'm actually working with a big CSV file containing many double quotes.

Answer 1

Score: 2

Note:

The encoding/csv package supports the CSV format described in RFC 4180. Input like yours is not an RFC 4180 compliant CSV file, so encoding/csv will not parse it the way you expect.
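To illustrate why this matters for a file with more than one line, here is a minimal sketch. The two-line input and the `readAll` helper are made up for illustration; the reader settings mirror the question. Because the `""FOO""` field is never closed, the parser keeps reading into the following line, so the whole input should come back as a single record with two fields.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// readAll parses pipe-delimited input with LazyQuotes enabled,
// mirroring the reader settings used in the question.
// (Hypothetical helper, not part of encoding/csv.)
func readAll(input string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = '|'
	r.LazyQuotes = true
	return r.ReadAll()
}

func main() {
	// Two lines of made-up data; the first one contains the
	// non-compliant ""FOO"" field from the question.
	input := "a|\"\"FOO\"\"|b\nc|d|e\n"

	records, err := readAll(input)
	if err != nil {
		log.Fatal(err)
	}
	// %q makes the field boundaries and embedded newlines visible.
	for i, rec := range records {
		fmt.Printf("record %d: %d fields: %q\n", i, len(rec), rec)
	}
}
```

On a large file this effect can silently merge many lines into one field, so it is worth checking record counts against line counts.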


You're misusing the quotes. A quoted field whose value is FOO looks like this:

parser := csv.NewReader(strings.NewReader(`||"FOO"||`))

If you want the field's value to be "FOO" including the quote characters, each quote inside a quoted field has to be escaped by doubling it, so it should be:

parser := csv.NewReader(strings.NewReader(`||"""FOO"""||`))

This will output 5.

What you have is this:

parser := csv.NewReader(strings.NewReader(`||""FOO""||`))

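The third field starts with a " character, so it is parsed as a quoted field. The second " character is not followed by a separator, so (with LazyQuotes enabled) it does not end the field and is kept as a literal quote. The "" after FOO is then read as an escaped quote rather than a closing quote, so the rest of the input, including the || separators, is processed as the content of the quoted field, which here only ends where the input ends.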

If you print the record:

fmt.Println(record)
fmt.Printf("%#v", record)

The output will be:

[  "FOO"||]
[]string{"", "", "\"FOO\"||"}
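The three quoting variants discussed above can be compared side by side. This is a small sketch, with `parseLine` as a hypothetical helper that uses the same reader settings as the question; it prints the field count and the fields for each input.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseLine reads a single pipe-delimited record with LazyQuotes enabled.
// (Hypothetical helper, not part of encoding/csv.)
func parseLine(input string) ([]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = '|'
	r.LazyQuotes = true
	return r.Read()
}

func main() {
	inputs := []string{
		`||"FOO"||`,     // quoted field, value FOO
		`||"""FOO"""||`, // quoted field with escaped quotes, value "FOO"
		`||""FOO""||`,   // the input from the question
	}
	for _, in := range inputs {
		rec, err := parseLine(in)
		if err != nil {
			fmt.Printf("%-16s error: %v\n", in, err)
			continue
		}
		// %q shows each field quoted, so empty fields stay visible.
		fmt.Printf("%-16s %d fields: %q\n", in, len(rec), rec)
	}
}
```

The first two inputs should yield 5 fields each, and the third should yield 3, matching the output shown above.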

Answer 2

Score: 2

After examining your code, I decided to modify it slightly and then print the results:

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

func main() {
	parser := csv.NewReader(strings.NewReader(`x||""FOO""|x|x\n`))
	parser.Comma = '|'
	parser.LazyQuotes = true
	record, err := parser.Read()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("record length: %v, Data: %v\n", len(record), strings.Join(record, ", "))
}

When you run this, the data is printed as x, , "FOO"|x|x\n (the \n is a literal backslash followed by n, because the input is a raw string literal). My thought is that when a field ends with two double quotes, the parser assumes the string is still being quoted and therefore lumps the rest of the line into the third entry. This looks like a bug in how lazy quoting works in the csv package. However, the documentation for LazyQuotes says:

> If LazyQuotes is true, a quote may appear in an unquoted field and a non-doubled quote may appear in a quoted field.

This doesn't say anything about doubled quotes inside a quoted field. To fix this, either remove the quotes altogether or replace each doubled double quote ("") with a single double quote (").
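The second option could be sketched as a preprocessing step. Both `collapseDoubledQuotes` and `parseCleaned` are hypothetical helpers, and the approach assumes that `""` in your data never stands for a legitimately escaped quote or an empty quoted field, since those would be corrupted by the replacement.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// collapseDoubledQuotes rewrites every "" pair as a single ".
// Assumption: "" never denotes an escaped quote or an empty quoted field
// in the data, otherwise this rewrite changes the meaning of the input.
func collapseDoubledQuotes(s string) string {
	return strings.ReplaceAll(s, `""`, `"`)
}

// parseCleaned collapses doubled quotes and then reads one
// pipe-delimited record from the cleaned input.
func parseCleaned(input string) ([]string, error) {
	r := csv.NewReader(strings.NewReader(collapseDoubledQuotes(input)))
	r.Comma = '|'
	return r.Read()
}

func main() {
	record, err := parseCleaned(`||""FOO""||`)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("record length: %d, fields: %q\n", len(record), record)
}
```

After the replacement the input is `||"FOO"||`, which is a regular quoted field, so LazyQuotes is not needed here. For a large file the same replacement would have to be applied while streaming rather than on one string.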

One other thing you might consider would be using the gocsv package. I've worked with this package in the past and it's reasonably stable. I'm not sure how it would respond to this specific issue, but it might be worth your time checking it out.

Answer 3

Score: 0

Quotes are part of the CSV format.

The problem is with how the quotes are escaped in the input given to encoding/csv. You can try something like this:

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

func main() {
	parser := csv.NewReader(strings.NewReader(`||FOO||`))
	parser.Comma = '|'
	parser.LazyQuotes = true
	record, err := parser.Read()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("record length: %v\n", len(record))
	fmt.Println(strings.Join(record, " /SEP/ "))
}

or like this:

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

func main() {
	parser := csv.NewReader(strings.NewReader(`||"""FOO"""||`))
	parser.Comma = '|'
	parser.LazyQuotes = true
	record, err := parser.Read()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("record length: %v\n", len(record))
	fmt.Println(strings.Join(record, " SEP "))
}

  • Posted by huangapple on May 19, 2022, 22:12:45
  • When reposting, please keep the link to this article: https://go.coder-hub.com/72306163.html