io.Reader and Line Break issue involving a CSV file

Question

I have an application that processes CSVs delivered via RabbitMQ from many different upstream applications, typically 5,000-15,000 rows per file. Most of the time it works great. However, a couple of these upstream applications are old (12-15 years) and the people who wrote them are long gone.

I'm unable to read CSV files from these older applications because of the line breaks. I find this a bit weird, as the line breaks seem to map to UTF-8 carriage returns (http://www.fileformat.info/info/unicode/char/000d/index.htm). Typically the app reads in only the headers from those older files and nothing else.

If I open one of these files in a text editor and save it with UTF-8 encoding, overwriting the existing file, then it works with no issues at all.

Things I've tried that I expected to work:

  • Using a Reader:

    ba := make([]byte, 262144000)
    if _, err := file.Read(ba); err != nil {
    	return nil, err
    }
    ba = bytes.Trim(ba, "\x00")
    bb := bytes.NewBuffer(ba)
    reader := csv.NewReader(bb)
    records, err := reader.ReadAll()
    if err != nil {
    	return nil, err
    }

  • Using the Scanner to read line by line (this fails with "bufio.Scanner: token too long"):

    scanner := bufio.NewScanner(file)
    var bb bytes.Buffer
    for scanner.Scan() {
    	bb.WriteString(fmt.Sprintf("%s\n", scanner.Text()))
    }

    // check for errors
    if err = scanner.Err(); err != nil {
    	return nil, err
    }

    reader := csv.NewReader(&bb)
    records, err := reader.ReadAll()
    if err != nil {
    	return nil, err
    }
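
A possible explanation for the "token too long" error, assuming the legacy files end lines with a bare \r: the default split function, bufio.ScanLines, only ends a line at \n, and the Scanner caps a token at 64 KiB (bufio.MaxScanTokenSize), so the whole file may be treated as one oversized line. The following is a minimal sketch of a custom split function that also ends a line at a lone \r. The names scanAnyLine and splitLines are hypothetical, and the 1 MiB limit is an arbitrary choice.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// scanAnyLine is a hypothetical bufio.SplitFunc that ends a line at
// \n, \r\n, or a lone \r, and strips the terminator from the token.
func scanAnyLine(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		if data[i] == '\n' {
			return i + 1, data[:i], nil
		}
		// data[i] is \r: look at the next byte to tell \r\n from a lone \r.
		if i+1 < len(data) {
			if data[i+1] == '\n' {
				return i + 2, data[:i], nil
			}
			return i + 1, data[:i], nil
		}
		if atEOF {
			return i + 1, data[:i], nil
		}
		// \r is the last buffered byte; ask the Scanner for more data.
		return 0, nil, nil
	}
	if atEOF {
		// Final line without a terminator.
		return len(data), data, nil
	}
	return 0, nil, nil
}

// splitLines (hypothetical helper) returns the lines of s using scanAnyLine.
func splitLines(s string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(s))
	// Allow lines up to 1 MiB instead of the default 64 KiB.
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	scanner.Split(scanAnyLine)

	var lines []string
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	return lines, scanner.Err()
}

func main() {
	lines, err := splitLines("a,b\r1,2\r\n3,4\n5,6")
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	for _, line := range lines {
		fmt.Printf("%q\n", line)
	}
}
```

Note that this treats every \r or \n as a line break, including any that appear inside quoted CSV fields, so it is only suitable when fields are known not to contain embedded line breaks.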

Things I tried that I expected not to work (and they didn't):

  • Writing file contents to a new file (.txt) and reading the file back in (including running dos2unix against the created txt file)
  • Reading file into a standard string (hoping Go's UTF-8 encoding would magically kick in, which of course it doesn't)
  • Reading file to Rune slice, then transforming to a string via byte slice

I'm aware of the https://godoc.org/golang.org/x/text/transform package but not too sure of a viable approach - it looks like the src encoding needs to be known to transform.

Am I stupidly overlooking something? Are there any suggestions for how to transform these files into UTF-8 or update the line endings without knowing the file encoding, whilst keeping the application working for all the other valid CSV files being delivered? Are there any options I've not considered that don't involve going byte by byte and doing a bytes.Replace? I'm hoping there's something really obvious I've overlooked.

Apologies - I can't share the CSV files for obvious reasons.

Answer 1

Score: 4

For anyone who's stumbled on this and wants an answer that doesn't involve strings.Replace, here's a method that wraps an io.Reader to replace solo carriage returns. It could probably be more efficient, but works better with huge files than a strings.Replace-based solution.

https://gist.github.com/b5/78edaae9e6a4248ea06b45d089c277d6

// ReplaceSoloCarriageReturns wraps an io.Reader; on every call to Read it
// checks for lonely \r bytes and replaces them with \r\n before returning
// the data to the caller. Lots of files in the wild come without "proper"
// line breaks, which irritates Go's standard csv package. This fixes it by
// wrapping the reader passed to csv.NewReader:
//    rdr := csv.NewReader(ReplaceSoloCarriageReturns(r))
//
func ReplaceSoloCarriageReturns(data io.Reader) io.Reader {
	return &crlfReplaceReader{
		rdr: bufio.NewReader(data),
	}
}

// crlfReplaceReader wraps a reader
type crlfReplaceReader struct {
	rdr       *bufio.Reader
	pendingLF bool // a \n is owed after a lonely \r but has not been written yet
}

// Read implements io.Reader for crlfReplaceReader
func (c *crlfReplaceReader) Read(p []byte) (n int, err error) {
	for n < len(p) {
		// write the \n owed for the lonely \r that was read last
		if c.pendingLF {
			p[n] = '\n'
			n++
			c.pendingLF = false
			continue
		}

		p[n], err = c.rdr.ReadByte()
		if err != nil {
			return
		}

		// any time we encounter \r, check to see if \n follows;
		// if the next char is not \n (or the input ends), add it in manually
		if p[n] == '\r' {
			pk, perr := c.rdr.Peek(1)
			if (perr == nil && pk[0] != '\n') || perr == io.EOF {
				c.pendingLF = true
			}
		}

		n++
	}
	return
}
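
The answer notes that the wrapper could probably be more efficient. One way to approach that, as a sketch under stated assumptions rather than the gist's code, is to read a whole chunk at a time and fix it up in place. The names crToLFReader and parseLegacyCSV are hypothetical. Unlike the wrapper above, this variant rewrites a lone \r as \n instead of inserting \r\n, so the byte count never changes and the destination buffer cannot overflow.

```go
package main

import (
	"bufio"
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// crToLFReader (hypothetical) rewrites every lone \r as \n while streaming.
// Existing \r\n pairs are left untouched; encoding/csv already accepts them.
type crToLFReader struct {
	rdr *bufio.Reader
}

// Read fills p from the underlying reader, then patches lone \r bytes in place.
func (c *crToLFReader) Read(p []byte) (int, error) {
	n, err := c.rdr.Read(p)
	for i := 0; i < n; i++ {
		if p[i] != '\r' {
			continue
		}
		if i+1 < n {
			// The following byte is already in this chunk.
			if p[i+1] != '\n' {
				p[i] = '\n'
			}
			continue
		}
		// \r is the last byte of this chunk: peek at the stream to decide.
		// A peek error (such as io.EOF) means no \n follows.
		if next, perr := c.rdr.Peek(1); perr != nil || next[0] != '\n' {
			p[i] = '\n'
		}
	}
	return n, err
}

// newCRToLFReader wraps r so that lone carriage returns become line feeds.
func newCRToLFReader(r io.Reader) io.Reader {
	return &crToLFReader{rdr: bufio.NewReader(r)}
}

// parseLegacyCSV (hypothetical helper) parses CSV text with mixed line endings.
func parseLegacyCSV(s string) ([][]string, error) {
	return csv.NewReader(newCRToLFReader(strings.NewReader(s))).ReadAll()
}

func main() {
	records, err := parseLegacyCSV("name,qty\rapple,3\r\npear,5\r")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, rec := range records {
		fmt.Println(rec)
	}
}
```

As with the original wrapper, Peek can block until the next byte arrives, which is harmless for files but worth keeping in mind for slow streams. A \r inside a quoted field is rewritten as well.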

Answer 2

Score: 1

Have you tried replacing all line endings (\r\n or \r) with \n?
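
A minimal sketch of that suggestion, assuming the whole file fits comfortably in memory (the question mentions 5,000-15,000 rows per file). The helper name normalizeNewlines and the sample data are made up for illustration. The \r\n pairs are replaced first so that they do not turn into two line feeds.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// normalizeNewlines (hypothetical helper) rewrites \r\n and lone \r as \n.
func normalizeNewlines(b []byte) []byte {
	// Handle \r\n first; otherwise each pair would become "\n\n".
	b = bytes.ReplaceAll(b, []byte("\r\n"), []byte("\n"))
	return bytes.ReplaceAll(b, []byte("\r"), []byte("\n"))
}

func main() {
	// Sample input mixing all three line-ending styles.
	raw := []byte("id,name\r1,alpha\r\n2,beta\n")

	records, err := csv.NewReader(bytes.NewReader(normalizeNewlines(raw))).ReadAll()
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, rec := range records {
		fmt.Println(rec)
	}
}
```

This works on raw bytes, so it does not need to know the file's encoding as long as \r and \n are single bytes in it, which holds for UTF-8 and most legacy single-byte encodings but not for UTF-16.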

huangapple
  • Posted on July 6, 2017 at 19:20:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/44947464.html