Processing A CSV With Every Line Wrapped In Double Quotes

Question

Oh boy, my first Stack Overflow question! One of our clients is sending us a CSV file to process, but the way they're sending it, every single line is wrapped in double quotes:

"example, header, values\r\n"
"example, first, line\r\n"
"example, second, line\r\n"
...
"etc, etc, etc\r\n"

This in turn is causing Ruby to parse every line as a single field, including the headers, which is causing this data ingestion script to crash.
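
The single-field behavior is easy to reproduce in isolation. Here is a minimal sketch, assuming the wrapping quotes are the only quotes on the line (the sample string is made up for illustration and is not taken from the client's file):

```ruby
require 'csv'

# A whole record wrapped in double quotes is, by CSV quoting rules, one quoted
# field, so the commas inside it are treated as data rather than separators.
wrapped = '"example, header, values"'
fields = CSV.parse_line(wrapped)

p fields       # a one-element array holding the entire line
p fields.size
```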

The code currently opens this as a File object, which then gets passed to a CSV.foreach enumerator with some configurable options:

CSV.foreach(<File Object>, <Options Hash>).with_index(1) do |line, index|
  # process a record
end

Is there a straightforward way to tell Ruby to just ignore these quotes so that it can correctly parse individual fields?

I've tried changing the quote_char in the CSV options to a single quote, but somehow that actually makes things worse. I could probably do all sorts of work to remove these quotes from the file before processing it, but that would require making a bunch of changes to legacy code, and I'd like to avoid it if I can. I've gone through some documentation about CSV options, but I'm not seeing any obvious silver bullet.

For reference, the CSV options are configured as follows:

{
 headers: true,
 skip_blanks: true,
 encoding: 'bom|utf-8',
 liberal_parsing: true,
 header_converters: lambda { |f| f.downcase.strip },
 row_sep: "\r\n",
 quote_char: "'"
}
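
One likely reason the single-quote setting seems to make things worse: once quote_char is ', the double quotes stop being special and stay in the data, so they end up attached to the first and last field of every row, header names included. A small sketch with a made-up line (an assumption for illustration, not the real file):

```ruby
require 'csv'

# With quote_char set to a single quote, the double quotes are ordinary
# characters, so the line splits on commas but keeps the stray quotes.
line = '"status,color,name"'
fields = CSV.parse_line(line, quote_char: "'")

p fields  # three fields; the first starts with " and the last ends with "
```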

Answer 1

Score: 2

You will have to do a little pre-processing on the file before parsing the CSV, like this:

# test.csv
"status,color,name\r\n"
"active,green,Norm\r\n"
"inactive,red,Herb"

# test.rb
require 'csv'

# Read the raw lines; they are not valid CSV yet.
not_csv = File.readlines('test.csv')

# Strip the literal \r\n text and the wrapping double quotes from each line.
real_csv = not_csv.map { |line| line.sub("\\r\\n", "").gsub('"', '') }.join

parsed_csv = CSV.parse(real_csv, headers: true)
puts parsed_csv[0]["status"] #=> active
puts parsed_csv[1]["name"]   #=> Herb

From the console, run ruby test.rb.
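
If rewriting the legacy loop is a concern, the same cleanup could in principle happen on the way in, leaving the per-record block untouched. The sketch below is one possible shape, not a drop-in fix. It rests on several assumptions: the helper name unwrapped_csv is hypothetical, the \r\n in the file is literal backslash text as in test.csv above, no field contains a legitimate double quote, and the file is small enough to clean in memory.

```ruby
require 'csv'
require 'stringio'
require 'tmpdir'

# Hypothetical helper: removes the literal \r\n text and the wrapping quotes
# from each line, then returns a CSV object over the cleaned text.
def unwrapped_csv(path, **options)
  cleaned = File.foreach(path).map { |line| line.sub('\r\n', '').delete('"') }.join
  CSV.new(StringIO.new(cleaned), **options)
end

# Write a small sample file with the same shape as test.csv above.
path = File.join(Dir.tmpdir, 'wrapped_test.csv')
File.write(path, <<~'DATA')
  "status,color,name\r\n"
  "active,green,Norm\r\n"
  "inactive,red,Herb"
DATA

# The loop keeps the same form as the original CSV.foreach call.
rows = []
unwrapped_csv(path, headers: true, skip_blanks: true).each.with_index(1) do |line, index|
  rows << [index, line['status'], line['name']]
end

p rows
```

Options that affect how the file is opened (the question's encoding: 'bom|utf-8', for example) may need to be applied when reading the file instead; this sketch does not handle that.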

huangapple
  • Published on June 1, 2023, 01:50:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76376120.html