Changing column values in a tab-delimited file with awk without changing values in other columns

Question

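You can try restricting the substitution to the first column so that the other columns are not affected. A modified command, assuming GNU awk (gensub is a gawk extension) and a tab-delimited input file:

awk 'BEGIN { FS = OFS = "\t" } { $1 = gensub(/^1-0039\.1$/, "1", 1, $1); print }' 1-0039.gtf > 1-0039_modified.gtf

This command uses gensub to replace "1-0039.1" in the first field with "1". Because both the input and the output field separator are set to a tab, the record is rebuilt with the same delimiter it came in with, and the spaces inside the last column are kept as they are.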

My file looks like this:

1-0039.1        EMBL    transcript      1       1524    .       +       .       transcript_id "1-0039.1.2"; gene_id "1-0039.1.2"; gene_name "dnaA"
1-0039.1        EMBL    CDS     1       1524    .       +       0       transcript_id "1-0039.1.2"; gene_name "dnaA";
1-0039.1        EMBL    transcript      1646    1972    .       +       .       transcript_id "1-0039.1.5"; gene_id "1-0039.1.5"; gene_name "ORF0009"

I want to change all "1-0039.1" values in the first column to 1.

So I tried:

awk -vOFS='\t' '{$1="1"; print}' 1-0039.gtf > 1-0039_modified.gtf

And the output looks like this:

1       EMBL    transcript      1       1524    .       +       .       transcript_id   "1-0039.1.2";   gene_id "1-0039.1.2";   gene_name       "dnaA"
1       EMBL    CDS     1       1524    .       +       0       transcript_id   "1-0039.1.2";   gene_name       "dnaA";
1       EMBL    transcript      1646    1972    .       +       .       transcript_id   "1-0039.1.5";   gene_id "1-0039.1.5";   gene_name       "ORF0009"
1       EMBL    CDS     1646    1972    .       +       0       transcript_id   "1-0039.1.5";   gene_name       "ORF0009";
1       EMBL    transcript      2023    2940    .       +       .       transcript_id   "1-0039.1.7";   gene_id "1-0039.1.7";   gene_name       "ORF0586"
1       EMBL    CDS     2023    2940    .       +       0       transcript_id   "1-0039.1.7";   gene_name       "ORF0586";
1       EMBL    transcript      2897    3223    .       +       .       transcript_id   "1-0039.1.9";   gene_id "1-0039.1.9";   gene_name       "ORF0009"

As you can see, the values in the last column were space-separated, but now they are tab-separated. How do I change only the first column without messing up the other columns?

Answer 1

Score: 2

With awk:

awk 'BEGIN{ FS=OFS="\t" } $1=="1-0039.1"{ $1="1" } { print }' 1-0039.gtf > 1-0039_modified.gtf

Output:
<pre>
1       EMBL    transcript      1       1524    .       +       .       transcript_id "1-0039.1.2"; gene_id "1-0039.1.2"; gene_name "dnaA"
1       EMBL    CDS     1       1524    .       +       0       transcript_id "1-0039.1.2"; gene_name "dnaA";
1       EMBL    transcript      1646    1972    .       +       .       transcript_id "1-0039.1.5"; gene_id "1-0039.1.5"; gene_name "ORF0009"
</pre>

See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
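
Before relying on FS="\t", it may help to confirm that the file really is tab-delimited. The following is a minimal sketch, not part of the original answer: it builds a small made-up sample in a temporary file, counts fields per line with a tab as the separator (a tab-delimited GTF line should have 9), and then applies the command above. The sample rows and variable names are assumptions for illustration.

```shell
# Build a two-line, tab-delimited sample in a temporary file (made-up data).
sample=$(mktemp)
printf '1-0039.1\tEMBL\tCDS\t1\t1524\t.\t+\t0\ttranscript_id "1-0039.1.2"; gene_name "dnaA";\n' > "$sample"
printf 'other\tEMBL\tCDS\t5\t90\t.\t+\t0\ttranscript_id "x.1"; gene_name "abc";\n' >> "$sample"

# With FS set to a tab, a tab-delimited GTF line should report 9 fields.
field_counts=$(awk -F'\t' '{ print NF }' "$sample" | sort -u)
echo "distinct field counts: $field_counts"

# Only lines whose first field is exactly 1-0039.1 are changed.
result=$(awk 'BEGIN{ FS=OFS="\t" } $1=="1-0039.1"{ $1="1" } { print }' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

If the field count comes out as 1 instead of 9, the file is probably space-aligned rather than tab-delimited, and a different FS is needed.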

Answer 2

Score: 2

Addressing OP's issue with the spaces in the last field being converted to tabs ...

As currently coded:

  • no input field delimiter is defined so all white space is treated as input field delimiters
  • what OP thinks of as the 'last field' (eg, transcript_id "1-0039.1.2"; gene_name "dnaA";) will actually be treated as 4 separate space-delimited fields
  • the output field delimiter is defined as a tab so all contiguous groups of spaces (input) will be converted to tabs (output), hence the reason OP's 'last field' (which awk actually treats as 4 separate fields) is split apart with tabs

To maintain the spaces in the 'last field' OP needs to tell awk what the input field delimiter is.
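
The effect described in the bullets can be reproduced on a single made-up record. This is a small sketch under the assumption that columns are separated by tabs and that the last column contains single spaces; the record and the variable names are illustrative only.

```shell
# One record: tabs between columns, single spaces inside the last column.
line=$(printf 'a\tb\tk1 "v1"; k2 "v2";')

# Default FS: any run of blanks or tabs separates fields, so awk sees 6 fields.
nf_default=$(printf '%s\n' "$line" | awk '{ print NF }')

# FS set to a tab: the last column stays in one piece, so awk sees 3 fields.
nf_tab=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF }')

# Assigning to $1 rebuilds $0 with OFS between every field awk recognised.
rebuilt_default=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ $1="1"; print }')
rebuilt_tab=$(printf '%s\n' "$line" | awk 'BEGIN{ FS=OFS="\t" } { $1="1"; print }')

echo "fields with default FS: $nf_default, with tab FS: $nf_tab"
printf '%s\n' "$rebuilt_default" "$rebuilt_tab"
```

In the first rebuilt line every space in the last column has become a tab; in the second the spaces are intact, because awk never split on them.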

If the input field delimiter is a tab then one idea for tweaking OP's current code:

awk 'BEGIN { FS=OFS="\t"} {$1="1"; print}' 1-0039.gtf

If the input field delimiter is 2+ spaces then a couple alternatives:

awk 'BEGIN { FS="[ ]{2,}"; OFS="\t"} {$1="1"; print}' 1-0039.gtf

# or

awk 'BEGIN { FS="[ ][ ]+"; OFS="\t"} {$1="1"; print}' 1-0039.gtf
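
As a rough illustration of the two-or-more-spaces case, the sketch below runs the `[ ][ ]+` variant on one space-aligned line taken from the question. Note two things: the output columns are joined by single tabs (OFS), so the original space alignment is not kept, and some older awk implementations may not support the `{2,}` interval syntax, which is one reason to prefer `[ ][ ]+` when portability matters. The variable names are assumptions.

```shell
# Columns separated by runs of two or more spaces; single spaces inside the last column.
line='1-0039.1        EMBL    CDS     1       1524    .       +       0       transcript_id "1-0039.1.2"; gene_name "dnaA";'

# Split on 2+ spaces, replace the first field, and join the fields with tabs.
out=$(printf '%s\n' "$line" | awk 'BEGIN { FS="[ ][ ]+"; OFS="\t" } { $1="1"; print }')
printf '%s\n' "$out"
```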

Answer 3

Score: 1

awk '{sub(/^1-0039\.1/, "1"); print}' 1-0039.gtf > 1-0039_modified.gtf

But the sed solutions in the comments will do the same job faster.
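
The comments are not reproduced here, so the following is only a guess at what a sed equivalent might look like, not the commenters' actual command. It additionally requires whitespace right after the match, so a first column such as 1-0039.10 is left alone. The sample lines and variable names are made up.

```shell
# Two made-up records: one that should change and one that should not.
match_line=$(printf '1-0039.1\tEMBL\tCDS\t1\t1524')
other_line=$(printf '1-0039.10\tEMBL\tCDS\t1\t1524')

# Replace the leading ID only when it is followed by whitespace, and put that whitespace back.
fix_first_column() { sed 's/^1-0039\.1\([[:space:]]\)/1\1/'; }

match_out=$(printf '%s\n' "$match_line" | fix_first_column)
other_out=$(printf '%s\n' "$other_line" | fix_first_column)
printf '%s\n' "$match_out" "$other_out"
```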

Note:

Unfortunately, the question gives contradictory information:

  1. The sample shows fields separated by a varying number of spaces.
  2. The text talks about tabs between the fields and about keeping the spaces in the last column.

The same appearance can be produced by a single tab between fields when the tab width is 8.

So a solution has to cope with this ambiguity.

This is why my solution does not use awk's field splitting and only matches a pattern at the start of the line.

That way the solution does not depend on any assumption about the delimiter: it can be of any type and length, and the solution will still do the job.
In particular, it leaves the existing column delimiters unchanged.
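
To illustrate the point about delimiters, here is a small sketch that applies the same whole-line substitution (with the dots escaped) to a tab-separated and a space-separated version of a made-up record. The records and the helper name `fix` are assumptions for illustration.

```shell
# The same record once with tabs and once with runs of spaces between columns.
tab_line=$(printf '1-0039.1\tEMBL\tgene_name "dnaA";')
space_line='1-0039.1        EMBL    gene_name "dnaA";'

# sub() edits $0 directly; no field is assigned, so awk never rebuilds the record.
fix() { awk '{ sub(/^1-0039\.1/, "1"); print }'; }

tab_out=$(printf '%s\n' "$tab_line" | fix)
space_out=$(printf '%s\n' "$space_line" | fix)
printf '%s\n' "$tab_out" "$space_out"
```

Both outputs keep exactly the delimiters they started with; only the leading ID changes.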


Thanks for the comments below. They have a point, but my first priority was keeping the solution simple to understand.

So here is an alternative version that is more flexible about the first column:

awk '{sub(/^1-[^ \t]*/,1); print}'  1-0039.gtf > 1-0039_modified.gtf

Because that variant stops at the first space, which may not be a delimiter, the following version additionally accepts a single space directly after the leading "1-" as part of the first column:

awk '{sub(/^1- ?[^ \t]*/,1); print}'   1-0039.gtf > 1-0039_modified.gtf

huangapple
  • Posted on April 7, 2023, 05:47:58
  • Source: https://go.coder-hub.com/75954011.html