Golang: While processing CSV, reformat single line?

Question

My Go CSV processing routine is copied almost exactly from the encoding/csv package example:

func processCSV(path string) {
    file := utils.OpenFile(path)
    reader := csv.NewReader(file)
    reader.LazyQuotes = true

    cs := []*Collision{} // defined elsewhere

    for {
        line, err := reader.Read()

        // Kill processing if we're at EOF
        if err == io.EOF {
            break
        }

        c := get(line) // defined elsewhere
        cs = append(cs, c)
    }

    // Do other stuff...
}

The code works great until it encounters a malformed (?) line of CSV, which generally looks something like this:

item1,item2,"item3,"has odd quoting"","item4",item5

The reader.LazyQuotes = true option doesn't seem to offer enough tolerance to read this line the way I need.

My question is this: can I ask the csv reader for the original line so that I can "massage" it to pull out what I need? The files I'm working with are moderately large (~150 MB) and I'm not sure I want to reprocess them, especially as only a few lines per file have such problems.

Thanks for any tips!

Answer 1

Score: 0

Looking at the implementation of csv.Reader.Read(), you cannot do what you are looking for with the csv package. It uses an unexported method, parseRecord(), which does the hard work and does not expose the raw line.

I think what you need is either to write your own CSV reader that handles these cases, or simply to preprocess the file line by line so that malformed items are repaired before parsing, for example by doubling embedded quotes (" becomes ""), which is the escape form the csv package understands. Note that a backslash escape (\") is not recognized by encoding/csv.
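As a rough illustration of the line-by-line route, here is a minimal Go sketch. It assumes that no field contains an embedded newline (so one physical line is one record) and that the expected field count is known in advance. The names parseLine, parseLines, and FixFunc are made up for this example, and the actual repair logic depends on your data, so it is left as a callback.

```go
package main

import (
	"bufio"
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// FixFunc is a hypothetical, caller-supplied repair function that turns a
// raw malformed line into a record.
type FixFunc func(raw string) ([]string, error)

// parseLine parses one physical line as a single CSV record. If parsing
// fails or the field count is not wantFields, the raw line goes to fix.
func parseLine(raw string, wantFields int, fix FixFunc) ([]string, error) {
	cr := csv.NewReader(strings.NewReader(raw))
	cr.LazyQuotes = true
	rec, err := cr.Read()
	if err == nil && len(rec) == wantFields {
		return rec, nil
	}
	return fix(raw)
}

// parseLines applies parseLine to every non-empty line of r.
// Assumption: fields never contain newlines.
func parseLines(r io.Reader, wantFields int, fix FixFunc) ([][]string, error) {
	var records [][]string
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if sc.Text() == "" {
			continue
		}
		rec, err := parseLine(sc.Text(), wantFields, fix)
		if err != nil {
			return nil, err
		}
		records = append(records, rec)
	}
	return records, sc.Err()
}

func main() {
	input := "a,b,c,d,e\n" +
		`item1,item2,"item3,"has odd quoting"","item4",item5` + "\n"
	// Placeholder repair: a real one would rewrite the line for your data.
	fix := func(raw string) ([]string, error) {
		return nil, fmt.Errorf("needs manual repair: %s", raw)
	}
	recs, err := parseLines(strings.NewReader(input), 5, fix)
	fmt.Println(recs, err)
}
```

Creating a csv.Reader per line is wasteful for a 150 MB file, so in practice you might only fall back to this for lines that a cheap check flags as suspicious. Also note that bufio.Scanner rejects lines longer than 64 KiB unless you enlarge its buffer with Scanner.Buffer.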

Answer 2

Score: 0

As far as I can tell, encoding/csv doesn't provide any such functionality, so you can either look for a third-party CSV package that does, or implement a solution yourself.

If you want to go the DIY route, I can offer a tip; whether it's a good one to implement is up to you.

You could implement an io.Reader that wraps your file and tracks the last line read. Then, every time you encounter an error because of malformed CSV, you can use your reader to reread that line, massage it, add it to the results, and have the loop continue as if nothing happened.

Here's an example of how your processCSV would change:

func processCSV(path string) {
    file := utils.OpenFile(path)
    myreader := NewMyReader(file)
    reader := csv.NewReader(myreader)
    reader.LazyQuotes = true

    cs := []*Collision{} // defined elsewhere

    for {
        line, err := reader.Read()

        // Kill processing if we're at EOF
        if err == io.EOF {
            break
        }

        // malformed csv
        if err != nil {
            // Just reread the last line and on the next iteration of
            // the loop myreader.Read should continue returning bytes
            // that come after this malformed line to the csv.Reader.
            l, err := myreader.CurrentLine()
            if err != nil {
                panic(err)
            }

            // massage the malformed csv line
            line = fixcsv(l)
        }

        c := get(line) // defined elsewhere
        cs = append(cs, c)
    }

    // Do other stuff...
}
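One caveat with a wrapping io.Reader: csv.Reader buffers its input internally (it wraps the source in a bufio.Reader), so the wrapper sees large chunks rather than individual lines and would have to map positions back to lines itself. A different way to get at the raw line, which is not the wrapper approach described above, is (*csv.Reader).InputOffset, available from Go 1.19. The sketch below assumes the whole input is held in memory as a []byte; readWithRaw and the fix callback are hypothetical names for this example.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"errors"
	"fmt"
	"io"
)

// readWithRaw parses data as CSV. When a record produces a parse error
// (including a wrong field count), the raw bytes of that record are handed
// to fix, a hypothetical caller-supplied repair function.
// Assumption: Go 1.19 or newer, for (*csv.Reader).InputOffset.
func readWithRaw(data []byte, fix func(raw []byte) []string) ([][]string, error) {
	r := csv.NewReader(bytes.NewReader(data))
	r.LazyQuotes = true

	var records [][]string
	for {
		start := r.InputOffset() // offset where the next record begins
		rec, err := r.Read()
		if err == io.EOF {
			return records, nil
		}
		if err != nil {
			var perr *csv.ParseError
			if !errors.As(err, &perr) {
				return records, err // not a CSV format problem
			}
			// InputOffset now points just past the offending record.
			raw := data[start:r.InputOffset()]
			rec = fix(bytes.TrimRight(raw, "\r\n"))
		}
		records = append(records, rec)
	}
}

func main() {
	data := []byte("a,b,c,d,e\n" +
		`item1,item2,"item3,"has odd quoting"","item4",item5` + "\n" +
		"v,w,x,y,z\n")
	recs, err := readWithRaw(data, func(raw []byte) []string {
		fmt.Printf("needs repair: %s\n", raw)
		return []string{string(raw)} // placeholder: keep the raw text
	})
	fmt.Println(len(recs), err)
}
```

For files too large to hold in memory, the same offsets could be used with ReadAt on an *os.File instead of slicing a byte slice.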

Answer 3

Score: 0

I "solved" this problem using a hint from mkopriva and blatant copying from Go's CSV parsing code. If I read it right, Go's CSV parser is rather clever about what it considers a line. When I've written a naive CSV parser, I've split files on newlines and then processed each piece. Go's parser is smarter, and allows for the possibility that a quoted field might itself contain a newline. In those cases, my code would fail and theirs would work.

Feeding "lines" to Go's parser is a bit tricky, as it reads through a stream looking for line-beginning-and-ending patterns and extracting fields along the way. What I did was hijack the code and add a variable that captures the part of the stream that the code considers a line. My additions probably have problems, but seem to work for me. If it helps, here are the steps I took:

  1. Copy the CSV source and paste it into my project in its entirety.

  2. Add a new field to the Reader struct:

    type Reader struct {
        ...
        // The i'th field starts at offset fieldIndexes[i] in lineBuffer.
        fieldIndexes []int

        CurrentLine []byte // added: struct field to hold onto the line

        ...
    }

  3. In readRune(), capture bytes as they come in, like so:

    func (r *Reader) readRune() (rune, error) {
        r1, _, err := r.r.ReadRune()
        // added: stores the rune's UTF-8 bytes as processed
        // (byte(r1) would truncate multi-byte runes)
        r.CurrentLine = append(r.CurrentLine, string(r1)...)
        ...
    }

  4. In Read(), reset CurrentLine for each line, like so:

    func (r *Reader) Read() (record []string, err error) {
        r.CurrentLine = []byte{} // added: reset line capturing
        ...
    }

With these items added, I can then grab the current line when there's a parsing error, as per mkopriva's suggestion:

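...
if err != nil {
    // Repair the raw line, then fall through so the repaired
    // record is processed like any other (a continue here would discard it).
    line = fixCSV(csvReader.CurrentLine)
}
...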
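A note on capturing runes into a byte slice as in step 3: the rune's full UTF-8 encoding has to be appended, because a plain byte(r) conversion keeps only the low eight bits and silently corrupts anything outside ASCII. A tiny sketch of the difference (appendRune is a hypothetical helper name, not part of encoding/csv):

```go
package main

import "fmt"

// appendRune appends the UTF-8 encoding of r to buf.
func appendRune(buf []byte, r rune) []byte {
	return append(buf, string(r)...)
}

func main() {
	r := 'é' // U+00E9, two bytes in UTF-8

	truncated := append([]byte{}, byte(r)) // keeps only the low 8 bits
	full := appendRune(nil, r)             // keeps the whole encoding

	fmt.Println(len(truncated), len(full))
}
```

If the input can contain invalid UTF-8, capturing at the rune level is still not byte-exact, since ReadRune reports invalid bytes as the replacement character.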

Published by huangapple on May 17, 2017, 02:55:28
Original link: https://go.coder-hub.com/44009431.html