Unix shell script select columns in csv file based on headers from another csv file

Question

I have a CSV file with 28,000 columns, and I want to select certain columns based on headers listed in another CSV file, using a Unix shell script. I cannot use tools like csvkit because I am working on a server and don't have admin rights to install new tools. I have read many posts on this but can't get any of them to work, possibly because the previous answers deal with tab-delimited or space-delimited text, not CSV. I am new to shell scripting (and won't have to use it much, as I usually work in R or Python).

The header file looks like this:

$ cat headers.csv
eid
ABCD001
ABCD005
ABCD021
ABCD022

... etc (I need about 50 columns, not in sequence)

The data file is a CSV file with data in a variety of formats (numeric, character). It has 28,000 columns, including all of the 50 columns I need, and a header row whose names match those in the header file.

I tried this post:
<https://www.unix.com/shell-programming-and-scripting/269610-find-columns-file-based-header-print-new-file.html>
and this code in the post:

$ awk 'NR==FNR{a[$0]=NR;next}{for (i in a) printf "%s ", $a[i];print ""}' headers.csv data_file.csv > selected_data_file.csv

But it doesn't work, probably as it's looking for tab delimited text or space delimited and I have a csv file. It produces a huge output file, so is not doing the job.
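The delimiter suspicion can be checked on a single line. The sketch below uses a made-up sample line (not the real data) to show that awk's default field separator is whitespace, so a comma-separated line counts as one field unless `-F,` is given. With the default separator, `$1` is then the entire row, which would plausibly explain the huge output file.

```shell
# Hypothetical one-line sample, not the real data.
line='eid,ABCD001,ABCD002'

# Default separator (whitespace): the whole line is a single field.
printf '%s\n' "$line" | awk '{print NF}'            # prints 1

# Comma separator: three fields, and $2 is the second header.
printf '%s\n' "$line" | awk -F, '{print NF, $2}'    # prints 3 ABCD001
```

Note that the quoted one-liner also uses each header's line number in headers.csv as the column number, so a comma separator alone would likely not be enough.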

I also read this post:
<https://stackoverflow.com/questions/28129463/create-csv-from-specific-columns-in-another-csv-using-shell-scripting>
But I can't use the column indices, I need to use the headers from the other file, as there are so many columns in the input data file.
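One possible way to keep an index-based tool without typing the indices by hand is to look them up from the header names first. The following is only a sketch under assumptions: `select_by_cut` is a hypothetical helper name, fields are assumed never to contain quoted commas, and both files are assumed to use Unix line endings.

```shell
# select_by_cut HEADERS DATA
# Looks up the position of each wanted header in the first line of DATA,
# then passes the resulting index list to cut.
# Assumes plain CSV (no quoted commas) and Unix line endings.
select_by_cut() {
    headers=$1
    data=$2
    # Pipeline: split the header row into one name per line, keep the lines
    # that exactly match a name in HEADERS (grep -n prefixes "N:"), keep only
    # the numbers, and join them with commas, e.g. "1,3,5".
    cols=$(head -n 1 "$data" |
        tr ',' '\n' |
        grep -n -x -F -f "$headers" |
        cut -d: -f1 |
        paste -s -d, -)
    cut -d, -f"$cols" "$data"
}
```

It would be called as `select_by_cut headers.csv data_file.csv > selected_data_file.csv`. Because `cut` always emits fields in file order, the output columns follow the data file's order, not the order in headers.csv.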

Any suggestions for how this code can be modified to produce the file of all rows of the data_file but just for the 50 columns I need would be really appreciated. Please note, I cannot use csvkit.

The output should be something like this:

$ cat selected_data_file.csv
eid,ABCD001,ABCD005,ABCD021,ABCD022
AB1, 1, 1, 0.5556, XXXX
AB2, 2, 2, 0.7687, YYYY
AB3, 1, 0, 0.5362, ZZZ

corresponding to all the rows for the columns whose headers I have selected in the headers.csv file.

I hope that makes sense, all help appreciated!

Answer 1

Score: 2

You're pretty close. After reading the headers file, scan the first line of the data file and record the column numbers whose names match the wanted headers. Also, whitespace is not a precious resource; it's OK to use more.

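awk '
    BEGIN {FS = OFS = ","}                  # read and write comma-separated fields
    NR == FNR {wanted[$0] = 1; next}        # first file: remember each wanted header
    FNR == 1 {                              # header row of the data file
        ncol = 0
        for (i = 1; i <= NF; i++)           # loop over the fields (NF), not the record count
            if ($i in wanted)
                columns[++ncol] = i
    }
    {
        # print the selected fields, ending the row with ORS instead of a trailing comma
        for (i = 1; i <= ncol; i++)
            printf "%s%s", $columns[i], (i < ncol ? OFS : ORS)
    }
' headers.csv data_file.csv > selected_data_file.csv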
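
A variant sketch, not part of the original answer: it emits the columns in the order listed in headers.csv rather than in data-file order, strips a trailing carriage return in case either file has Windows line endings, and warns on stderr about any requested header that is missing. The function name `select_columns` is hypothetical, and the sketch assumes fields never contain quoted commas.

```shell
# select_columns HEADERS DATA
# Prints the columns of DATA named in HEADERS, in the order HEADERS lists them.
# Assumptions (not stated in the original post): no quoted commas inside
# fields; either file may have CRLF line endings.
select_columns() {
    awk '
        BEGIN { FS = OFS = "," }
        { sub(/\r$/, "") }                  # drop a trailing CR; awk re-splits the fields
        NR == FNR {                         # first file: remember the requested order
            if ($0 != "") order[++nwanted] = $0
            next
        }
        FNR == 1 {                          # header row of the data file
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (k = 1; k <= nwanted; k++) {
                if (order[k] in pos)
                    columns[++ncol] = pos[order[k]]
                else                        # send the warning to stderr
                    print "warning: no column named " order[k] | "cat 1>&2"
            }
        }
        {
            for (k = 1; k <= ncol; k++)
                printf "%s%s", $columns[k], (k < ncol ? OFS : ORS)
        }
    ' "$1" "$2"
}
```

Usage would be `select_columns headers.csv data_file.csv > selected_data_file.csv`. The `NR == FNR` test relies on headers.csv being non-empty; with an empty header file the data file would be read as if it were the header list.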

huangapple
  • Published on 2023-06-22 19:34:02
  • When reposting, please keep the link to this post: https://go.coder-hub.com/76531447.html