Parsing a nested cell in a CSV file using the bash terminal or R

Question

I'm trying to build a scaled-down approach to reading a large (>150 GB) CSV file into an R script by chunking it into smaller pieces that can be read in sequentially.

The problem is that one of the columns is a kind of nested cell, similar to:

ID1,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]"
ID2,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]"
ID3,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]"

I've had some success getting just the nested cell values by running:

cat file.csv | cut -d'"' -f2

But this generates ALL the values in the final column. I would like to be able to pull out each occurrence in turn (e.g. everything between "[[ and ]]" for one ID) and flatten it into a row/vector of some kind in a single file via >>

I tried variations of this solution:
https://stackoverflow.com/questions/36934452/how-to-print-a-single-cell-in-a-csv-file-using-bash-script-or-awk

But it looks like there are some returns in there that are preventing it from being called correctly (as it all comes up with either the first line via head or just blanks).

I'm sure there's a sed, awk, or grep call that can handle this but I'm drawing a blank.

Edit:
It has been brought to my attention that it is unclear what I'm asking for. The short answer is that I want to extract everything between the two " characters for a single line/entry in the CSV.

That way I can pipe an output like:

ID3,var1,var2,var3,var4,var5,"[[intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6],[intvar6,intvar6,intvar6,intvar6]]"

to its own file, with a name containing a variation of ID3.

Answer 1

Score: 2

It's not clear what you're trying to do but here's a start using any awk:

$ awk -F'"' -v OFS='\t' '{
    gsub(/\[/,"")
    n = split($2,a,/][],]*/)
    for (i=1; i<n; i++) {
        print NR, a[i]
    }
}' file
1       intvar6,intvar6,intvar6
1       intvar6,intvar6,intvar6,intvar6,intvar6
1       intvar6,intvar6,intvar6,intvar6
2       intvar6,intvar6,intvar6
2       intvar6,intvar6,intvar6,intvar6,intvar6
2       intvar6,intvar6,intvar6,intvar6

Try to build on that and then ask a specific question if you get stuck.
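
As one possible way to build on that for the "one file per ID" goal in the question's edit, here is a minimal sketch. It assumes exactly one quoted field per row, no newlines or escaped quotes inside it, and unique IDs in the first column. The names sample.csv and out_<ID>.txt are made up for illustration and do not come from the question.

```shell
#!/bin/sh
# Sketch only: sample.csv and the out_<ID>.txt naming are hypothetical.
set -eu

# Two sample rows shaped like the ones in the question.
cat > sample.csv <<'EOF'
ID1,v1,v2,v3,v4,v5,"[[1,2,3],[4,5,6,7,8],[9,10,11,12]]"
ID2,v1,v2,v3,v4,v5,"[[13,14],[15,16,17]]"
EOF

# With '"' as the field separator, $1 is everything before the quoted
# cell and $2 is the cell itself.
awk -F'"' '{
    split($1, head, ",")      # head[1] is the ID column
    cell = $2
    gsub(/\[/, "", cell)      # drop every "["
    gsub(/\]/, "", cell)      # drop every "]", leaving one flat list
    out = "out_" head[1] ".txt"
    print cell > out          # one file per row, named after its ID
    close(out)                # keep the number of open files bounded
}' sample.csv

cat out_ID1.txt
cat out_ID2.txt
```

Closing each file right after writing matters on a very large input, because awk otherwise keeps every output file open. If IDs can repeat, `>` would overwrite the earlier row's file, so `>>` may be the better choice in that case.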

Answer 2

Score: 1

For non-trivial CSV data, including data with quoted fields, it's ultimately easier to use tools that actually understand the format, for example the utilities in the ever-handy csvkit package.

Example script that extracts the 7th column of your data and saves each entry in its own file:

#!/usr/bin/env bash

rm -f output*.txt
while IFS=$'\037' read -r -d $'\036' lineno record; do
    # Skip header line
    if [[ $lineno = line_number ]]; then
        continue
    fi
    printf "%s\n" "$record" > "output$lineno.txt"
done < <(csvcut -l -H -c 7 input.csv | csvformat -M $'\036' -D $'\037' -U3)

csvcut -l -H -c 7 outputs records consisting of a record number and the 7th column of the original CSV file. The csvformat step then reformats them so that an ASCII record-separator character, rather than a newline, sits between records (so newlines inside a record can't confuse line-oriented tools), a unit-separator character, rather than a comma, sits between fields, and nothing is quoted. The while loop then reads one record at a time using the new delimiters and saves it in its own sequentially numbered file for later processing.
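
To illustrate the delimiter idea on its own, the sketch below consumes the same kind of stream with awk instead of a bash read loop. It is an illustration under assumptions, not part of the answer's script: the printf line stands in for the csvcut | csvformat pipeline so the example runs without csvkit, and the rec_<n>.txt file names are hypothetical.

```shell
#!/bin/sh
# Sketch only: printf fakes the csvcut | csvformat output, and the
# rec_<n>.txt naming is hypothetical.
set -eu

# \037 = ASCII unit separator (between fields)
# \036 = ASCII record separator (after each record)
# The second record deliberately has a newline inside its field.
printf '1\037[[1,2],[3]]\0362\037[[4,\n5]]\036' |
awk -v RS='\036' -v FS='\037' '
    NF == 2 {
        out = "rec_" $1 ".txt"   # $1 is the record number
        print $2 > out           # $2 is the untouched cell contents
        close(out)
    }'

cat rec_1.txt
cat rec_2.txt
```

Because the record separator is no longer a newline, the embedded newline in the second record stays inside its field rather than splitting the record in two.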

huangapple
  • Posted on May 18, 2023 at 01:54:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76274943.html