根据输入的列顺序重新排列 CSV 文件中的列/值。

huangapple go评论85阅读模式
英文:

Rearranging column/values in csv file based on input column order

问题

我需要你在Unix/AWK中的帮助,将CSV文件(逗号分隔文件)按照另一个文件中给定的顺序进行排序。我在这方面的知识有限。

示例:需要重新排列的文件。

PRODUCTCODE	SITE	BATCHID	LV1P_DESCRIPTION
PRD-0989	Static	BATCH01	Estract 
PRD-0990	Static	BATCH02	Xstract 
PRD-0991	Static	BATCH03	8stract 
PRD-0992	Static	BATCH04	errs
PRD-0993	Static	BATCH05	dgerrr
PRD-0994	Static	BATCH06	Narros
PRD-0995	Static	BATCH07	sef3r

如果将输入的列顺序传递为以下内容(可以存储在另一个文件中):

PRODUCTCODE,LV1P_DESCRIPTION,BATCHID,SITE

那么我的输出文件应该如下所示:

PRODUCTCODE	LV1P_DESCRIPTION	BATCHID	SITE
PRD-0989	Estract 	BATCH01	Static
PRD-0990	Xstract 	BATCH02	Static
PRD-0991	8stract 	BATCH03	Static
PRD-0992	errs	BATCH04	Static
PRD-0993	dgerrr	BATCH05	Static
PRD-0994	Narros	BATCH06	Static
PRD-0995	sef3r	BATCH07	Static
英文:

i need your help in unix/awk to order the data in csv file (comma separated file) to certain order given in another file. my knowledge is limited on this.

example: File to rearrange.

<!-- begin snippet: js hide: false console: true babel: false -->

<!-- language: lang-html -->

PRODUCTCODE	SITE	BATCHID	LV1P_DESCRIPTION
PRD-0989	Static	BATCH01	Estract 
PRD-0990	Static	BATCH02	Xstract 
PRD-0991	Static	BATCH03	8stract 
PRD-0992	Static	BATCH04	errs
PRD-0993	Static	BATCH05	dgerrr
PRD-0994	Static	BATCH06	Narros
PRD-0995	Static	BATCH07	sef3r

<!-- end snippet -->

and if pass input column order as [This could be stored in another file]

<!-- begin snippet: js hide: false console: true babel: false -->

<!-- language: lang-html -->

PRODUCTCODE,LV1P_DESCRIPTION,BATCHID,SITE

<!-- end snippet -->

then my output file should come as below

<!-- begin snippet: js hide: false console: true babel: false -->

<!-- language: lang-html -->

PRODUCTCODE	LV1P_DESCRIPTION	BATCHID	SITE
PRD-0989	Estract 	BATCH01	Static
PRD-0990	Xstract 	BATCH02	Static
PRD-0991	8stract 	BATCH03	Static
PRD-0992	errs	BATCH04	Static
PRD-0993	dgerrr	BATCH05	Static
PRD-0994	Narros	BATCH06	Static
PRD-0995	sef3r	BATCH07	Static

<!-- end snippet -->

答案1

得分: 0

TXR Lisp中的解决方案:

$ txr get.tl columns data
PRODUCTCODE LV1P_DESCRIPTION BATCHID SITE
PRD-0989 Estract BATCH01 Static
PRD-0990 Xstract BATCH02 Static
PRD-0991 8stract BATCH03 Static
PRD-0992 errs BATCH04 Static
PRD-0993 dgerrr BATCH05 Static
PRD-0994 Narros BATCH06 Static
PRD-0995 sef3r BATCH07 Static

文件内容如下:

$ cat columns
PRODUCTCODE,LV1P_DESCRIPTION,BATCHID,SITE
$ cat data
PRODUCTCODE SITE    BATCHID LV1P_DESCRIPTION
PRD-0989    Static  BATCH01 Estract
PRD-0990    Static  BATCH02 Xstract
PRD-0991    Static  BATCH03 8stract
PRD-0992    Static  BATCH04 errs
PRD-0993    Static  BATCH05 dgerrr
PRD-0994    Static  BATCH06 Narros
PRD-0995    Static  BATCH07 sef3r

get.tl中的代码如下:

(match (@colfile @datafile) *args*
  (awk
    (:inputs datafile)
    (:let col-indices)
    ((= nr 1) (let ((wanted-cols (flow colfile
                                   file-get-lines first (spl &quot;,&quot;)))
                    (col-index-hash (hash-zip f 0)))
                (set col-indices (mapcar col-index-hash wanted-cols))))
    (t [apply prn (mapcar f col-indices)])))

看起来数据可能是以制表符分隔的,而制表符被转换为空格,制表符宽度为四个空格。要输出以制表符分隔的内容,我们可以在awk宏中添加(:set ofs &quot;\t&quot;)子句。

英文:

Solution in TXR Lisp:

$ txr get.tl columns data
PRODUCTCODE LV1P_DESCRIPTION BATCHID SITE
PRD-0989 Estract BATCH01 Static
PRD-0990 Xstract BATCH02 Static
PRD-0991 8stract BATCH03 Static
PRD-0992 errs BATCH04 Static
PRD-0993 dgerrr BATCH05 Static
PRD-0994 Narros BATCH06 Static
PRD-0995 sef3r BATCH07 Static

Files look like:

$ cat columns
PRODUCTCODE,LV1P_DESCRIPTION,BATCHID,SITE
$ cat data
PRODUCTCODE SITE    BATCHID LV1P_DESCRIPTION
PRD-0989    Static  BATCH01 Estract
PRD-0990    Static  BATCH02 Xstract
PRD-0991    Static  BATCH03 8stract
PRD-0992    Static  BATCH04 errs
PRD-0993    Static  BATCH05 dgerrr
PRD-0994    Static  BATCH06 Narros
PRD-0995    Static  BATCH07 sef3r

Code in get.tl:

(match (@colfile @datafile) *args*
  (awk
    (:inputs datafile)
    (:let col-indices)
    ((= nr 1) (let ((wanted-cols (flow colfile
                                   file-get-lines first (spl &quot;,&quot;)))
                    (col-index-hash (hash-zip f 0)))
                (set col-indices (mapcar col-index-hash wanted-cols))))
    (t [apply prn (mapcar f col-indices)])))

It looks as if the data might have been tab separated, and the tabs got converted to spaces, with a tab width of four. To output tab-separated, we can add a (:set ofs &quot;\t&quot;) clause to the awk macro.

答案2

得分: 0

如果:

  • 你的CSV文件很简单(没有包含逗号的引用字段,每条记录占据一行...),
  • 每条记录的字段数完全相同,
  • 你用相同数量的字段指定了期望的输出顺序,

你可以尝试以下方法(使用任何POSIX的awk,已在GNU awk和macOS自带的awk上测试,输出以-|为前缀):

awk -F, -v OFS=, -v o='PRODUCTCODE,LV1P_DESCRIPTION,BATCHID,SITE' '
  BEGIN {n=split(o,h)}
  NR==1 {print o; for(i=1;i<=n;i++) a[$i]=i; next}
  {for(i=1;i<=n;i++) printf("%s%s", $(a[h[i]]), i==n?"\n":OFS)}
' foo.csv
-| PRODUCTCODE,LV1P_DESCRIPTION,BATCHID,SITE
-| PRD-0989,Estract,BATCH01,Static
-| PRD-0990,Xstract,BATCH02,Static
-| PRD-0991,8stract,BATCH03,Static
-| PRD-0992,errs,BATCH04,Static
-| PRD-0993,dgerrr,BATCH05,Static
-| PRD-0994,Narros,BATCH06,Static
-| PRD-0995,sef3r,BATCH07,Static
  1. 将逗号设置为输入和输出字段分隔符(-F, -v OFS=,)。
  2. 将期望的输出顺序作为变量o传递(-v o='...')。
  3. BEGIN块中,将变量o按输入字段分隔符拆分,将结果存储在数组h中,并将列数存储在变量n中。
  4. 对于第一行,在数组a中将字段作为键,字段号作为值进行存储(a[$i]=i)。然后跳到下一行(next)。
  5. 对于其他行,按照期望的顺序打印前n个字段($(a[h[i]])),字段之间用输出字段分隔符分隔,最后一个字段后面是换行符。

如果期望的输出顺序存储在另一个文件的第一行中,比如bar.csv

awk -F, -v OFS=, '
  NR==1 {o=$0; n=split(o,h); next}
  NR!= FNR && FNR==1 {print o; for(i=1;i<=n;i++) a[$i]=i; next}
  {for(i=1;i<=n;i++) printf("%s%s", $(a[h[i]]), i==n?"\n":OFS)}
' bar.csv foo.csv
  1. NR==1只对第一个文件的第一行成立。
  2. NR!= FNR && FNR==1只对第二个文件的第一行成立。
英文:

If:

  • your CSV is simple (no quoted fields containing commas, one line per record...),
  • and each record has exactly the same number of fields,
  • and you specify the desired output order with exactly the same number of fields,

you can try (with any POSIX awk, tested with GNU awk and the awk that comes with macOS, output prefixed with -| ):

awk -F, -v OFS=, -v o=&#39;PRODUCTCODE,LV1P_DESCRIPTION,BATCHID,SITE&#39; &#39;
  BEGIN {n=split(o,h)}
  NR==1 {print o; for(i=1;i&lt;=n;i++) a[$i]=i; next}
  {for(i=1;i&lt;=n;i++) printf(&quot;%s%s&quot;, $(a[h[i]]), i==n?&quot;\n&quot;:OFS)}
&#39; foo.csv
-| PRODUCTCODE,LV1P_DESCRIPTION,BATCHID,SITE
-| PRD-0989,Estract,BATCH01,Static
-| PRD-0990,Xstract,BATCH02,Static
-| PRD-0991,8stract,BATCH03,Static
-| PRD-0992,errs,BATCH04,Static
-| PRD-0993,dgerrr,BATCH05,Static
-| PRD-0994,Narros,BATCH06,Static
-| PRD-0995,sef3r,BATCH07,Static
  1. Set comma as input and output field separators (-F, -v OFS=,).
  2. Pass desired output order as variable o (-v o=&#39;...&#39;).
  3. In BEGIN block split variable o on input fields separator, store the result in array h and the number of columns in variable n.
  4. For the first line store in array a the fields as keys and the field numbers as values (a[$i]=i). Go to next line (next).
  5. For the other lines print the n first fields in desired order ($(a[h[i]])), followed by the output field separator except the last that is followed by a newline.

If the desired output order is stored in the first line of another file, say bar.csv:

awk -F, -v OFS=, &#39;
  NR==1 {o=$0; n=split(o,h); next}
  NR!= FNR &amp;&amp; FNR==1 {print o; for(i=1;i&lt;=n;i++) a[$i]=i; next}
  {for(i=1;i&lt;=n;i++) printf(&quot;%s%s&quot;, $(a[h[i]]), i==n?&quot;\n&quot;:OFS)}
&#39; bar.csv foo.csv
  1. NR==1 is true only for the first line of the first file.
  2. NR!= FNR &amp;&amp; FNR==1 is true only for the first line of the second file.

答案3

得分: 0

awk -F' *|,' '
    NR==FNR{
        for(i=1;i<=NF;i++) a[$i] = i
        next
    }
    FNR==1{
        for(i=1;i<=NF;i++) b[i] = a[$i]
    }
    {
        for (i=1;i<=length(b);i++) printf " %s", $(b[i])
        print ""
    }
 ' order_file input_file | column -t 

输出结果:

PRODUCTCODE  LV1P_DESCRIPTION  BATCHID  SITE
PRD-0989     Estract           BATCH01  Static
PRD-0990     Xstract           BATCH02  Static
PRD-0991     8stract           BATCH03  Static
PRD-0992     errs              BATCH04  Static
PRD-0993     dgerrr            BATCH05  Static
PRD-0994     Narros            BATCH06  Static
PRD-0995     sef3r             BATCH07  Static
英文:
$ head *file
==&gt; input_file &lt;==
PRODUCTCODE SITE    BATCHID LV1P_DESCRIPTION
PRD-0989    Static  BATCH01 Estract
PRD-0990    Static  BATCH02 Xstract
PRD-0991    Static  BATCH03 8stract
PRD-0992    Static  BATCH04 errs
PRD-0993    Static  BATCH05 dgerrr
PRD-0994    Static  BATCH06 Narros
PRD-0995    Static  BATCH07 sef3r

==&gt; order_file &lt;==
PRODUCTCODE,LV1P_DESCRIPTION,BATCHID,SITE

awk -F&#39; *|,&#39; &#39;
    NR==FNR{
        for(i=1;i&lt;=NF;i++) a[$i] = i
        next
    }
    FNR==1{
        for(i=1;i&lt;=NF;i++) b[i] = a[$i]
    }
    {
        for (i=1;i&lt;=length(b);i++) printf &quot; %s&quot;, $(b[i])
        print &quot;&quot;
    }
 &#39; order_file input_file | column -t 
PRODUCTCODE  LV1P_DESCRIPTION  BATCHID  SITE
PRD-0989     Estract           BATCH01  Static
PRD-0990     Xstract           BATCH02  Static
PRD-0991     8stract           BATCH03  Static
PRD-0992     errs              BATCH04  Static
PRD-0993     dgerrr            BATCH05  Static
PRD-0994     Narros            BATCH06  Static
PRD-0995     sef3r             BATCH07  Static

huangapple
  • 本文由 发表于 2023年8月9日 11:33:12
  • 转载请务必保留本文链接:https://go.coder-hub.com/76864396.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定