2023-06-22 19:34:02

Unix shell script select columns in csv file based on headers from another csv file

Question

I have a csv file with 28,000 columns and I want to select certain columns based on headers in another csv file, using a unix shell script. I cannot use tools like csvkit as I am working on a server and don't have admin rights to install new tools. I have read many posts on this but can't get what I want to work, possibly as the previous answers use tab delimited or space delimited text, not csv. I am new to shell script (and won't have to use it much, as I usually work in R or Python).

The header file looks like this:

$ cat headers.csv
eid
ABCD001
ABCD005
ABCD021
ABCD022

... etc (I need about 50 columns, not in sequence)

The data file is a csv file with data in a variety of formats (numeric, character) and 28,000 columns, including all 50 of the columns I need. Its header row uses the same names as the header file.

I tried this post:
<https://www.unix.com/shell-programming-and-scripting/269610-find-columns-file-based-header-print-new-file.html>
and this code in the post:

$ awk 'NR==FNR{a[$0]=NR;next}{for (i in a) printf "%s ", $a[i];print ""}' headers.csv data_file.csv > selected_data_file.csv

But it doesn't work, probably as it's looking for tab delimited text or space delimited and I have a csv file. It produces a huge output file, so is not doing the job.
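The delimiter suspicion can be checked directly. A minimal sketch, using a made-up sample row: awk splits on whitespace by default, so a comma-separated row counts as a single field unless `-F,` is given.

```shell
line='eid,ABCD001,ABCD005'   # made-up sample row, not from the real data file

# Default splitting is on whitespace, so the whole row is a single field.
default_nf=$(printf '%s\n' "$line" | awk '{print NF}')

# With -F, the row is split on commas.
comma_nf=$(printf '%s\n' "$line" | awk -F, '{print NF}')

echo "default: $default_nf field(s), with -F,: $comma_nf field(s)"
```

With one field per row, `$1` is the entire row, which is consistent with the output file being about as large as the input.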

I also read this post:
<https://stackoverflow.com/questions/28129463/create-csv-from-specific-columns-in-another-csv-using-shell-scripting>
But I can't use the column indices, I need to use the headers from the other file, as there are so many columns in the input data file.

Any suggestions for how this code can be modified to produce the file of all rows of the data_file but just for the 50 columns I need would be really appreciated. Please note, I cannot use csvkit.

The output should be something like this:

$ cat selected_data_file.csv
eid,ABCD001,ABCD005,ABCD021,ABCD022
AB1, 1, 1, 0.5556, XXXX
AB2, 2, 2, 0.7687, YYYY
AB3, 1, 0, 0.5362, ZZZ

corresponding to all the rows for the columns whose headers I have selected in the headers.csv file.

I hope that makes sense, all help appreciated!

Answer 1

Score: 2

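You're pretty close. What you need to do after you've read the headers file is to scan the first line of the data file and select the column numbers that match the headers. Because the data is comma-separated, awk also has to be told to split on commas and to join the output with commas; by default it splits on whitespace. Also, whitespace is not a precious resource; it's OK to use more.

awk '
    # Split input on commas and join output with commas.
    BEGIN {FS = OFS = ","}
    # First file (headers.csv): remember each wanted header name.
    NR == FNR {wanted[$0] = 1; next}
    # Header row of the data file: record the positions of the wanted columns.
    FNR == 1 {
        ncol = 0
        for (i = 1; i <= NF; i++)
            if ($i in wanted)
                columns[++ncol] = i
    }
    # Every row of the data file, header included: print only those columns.
    {
        for (i = 1; i <= ncol; i++)
            printf "%s%s", $columns[i], (i < ncol ? OFS : ORS)
    }
' headers.csv data_file.csv > selected_data_file.csv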
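
The command above emits the selected columns in the order they appear in the data file. If the order listed in headers.csv matters instead, a variant along the following lines may help. This is only a sketch under stated assumptions: the sample rows, the `junk1`/`junk2` column names, and the scratch directory are made up for illustration, fields are assumed to contain no quoted commas, both files are assumed to use Unix line endings, and `mktemp -d` is assumed to be available on the server.

```shell
#!/bin/sh
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up sample inputs: headers.csv lists the wanted columns in a different
# order from data_file.csv, which also contains two unwanted columns.
printf '%s\n' eid ABCD005 ABCD001 > headers.csv
printf '%s\n' \
    'eid,ABCD001,junk1,ABCD005,junk2' \
    'AB1,1,x,0.5556,y' \
    'AB2,2,x,0.7687,y' > data_file.csv

awk '
    BEGIN {FS = OFS = ","}
    # First file: remember the wanted names in listed order, skipping blanks.
    NR == FNR {if ($0 != "") order[++nwanted] = $0; next}
    # Header row of the data file: map names to positions, then keep the
    # positions of the wanted names that actually exist, in headers.csv order.
    FNR == 1 {
        for (i = 1; i <= NF; i++)
            position[$i] = i
        ncol = 0
        for (i = 1; i <= nwanted; i++)
            if (order[i] in position)
                columns[++ncol] = position[order[i]]
    }
    # Every row, header included: print the selected columns.
    {
        for (i = 1; i <= ncol; i++)
            printf "%s%s", $columns[i], (i < ncol ? OFS : ORS)
    }
' headers.csv data_file.csv > selected_data_file.csv

selected=$(cat selected_data_file.csv)
printf '%s\n' "$selected"
```

On the sample input this prints the header row `eid,ABCD005,ABCD001` followed by `AB1,0.5556,1` and `AB2,0.7687,2`. Names in headers.csv that do not occur in the data file's header row are silently skipped rather than producing empty columns.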
