Changing column values in a tab-delimited file with awk without changing values in other columns
Question
My file looks like this:
1-0039.1 EMBL transcript 1 1524 . + . transcript_id "1-0039.1.2"; gene_id "1-0039.1.2"; gene_name "dnaA"
1-0039.1 EMBL CDS 1 1524 . + 0 transcript_id "1-0039.1.2"; gene_name "dnaA";
1-0039.1 EMBL transcript 1646 1972 . + . transcript_id "1-0039.1.5"; gene_id "1-0039.1.5"; gene_name "ORF0009"
I want to change every "1-0039.1" value in the first column to 1, so I tried:
awk -vOFS='\t' '{$1="1"; print}' 1-0039.gtf > 1-0039_modified.gtf
And the output looks like this:
1 EMBL transcript 1 1524 . + . transcript_id "1-0039.1.2"; gene_id "1-0039.1.2"; gene_name "dnaA"
1 EMBL CDS 1 1524 . + 0 transcript_id "1-0039.1.2"; gene_name "dnaA";
1 EMBL transcript 1646 1972 . + . transcript_id "1-0039.1.5"; gene_id "1-0039.1.5"; gene_name "ORF0009"
1 EMBL CDS 1646 1972 . + 0 transcript_id "1-0039.1.5"; gene_name "ORF0009";
1 EMBL transcript 2023 2940 . + . transcript_id "1-0039.1.7"; gene_id "1-0039.1.7"; gene_name "ORF0586"
1 EMBL CDS 2023 2940 . + 0 transcript_id "1-0039.1.7"; gene_name "ORF0586";
1 EMBL transcript 2897 3223 . + . transcript_id "1-0039.1.9"; gene_id "1-0039.1.9"; gene_name "ORF0009"
As you can see, the values in the last column were space-separated, but now they are tab-separated. How do I change only the first column without messing up the other columns?
Answer 1
Score: 2
With awk:
awk 'BEGIN{ FS=OFS="\t" } $1=="1-0039.1"{ $1="1" } { print }' 1-0039.gtf > 1-0039_modified.gtf
Output:
<pre>
1 EMBL transcript 1 1524 . + . transcript_id "1-0039.1.2"; gene_id "1-0039.1.2"; gene_name "dnaA"
1 EMBL CDS 1 1524 . + 0 transcript_id "1-0039.1.2"; gene_name "dnaA";
1 EMBL transcript 1646 1972 . + . transcript_id "1-0039.1.5"; gene_id "1-0039.1.5"; gene_name "ORF0009"
</pre>
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
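To check the behaviour on a small scale, here is a minimal sketch. It assumes the real file is tab-delimited, and the file names sample.gtf and sample_modified.gtf are made up for illustration. It builds two tab-delimited lines, runs the command above, and prints the first and ninth fields so the single spaces inside the ninth field can be inspected.

```shell
# Build a tiny tab-delimited sample; sample.gtf is a hypothetical file name.
printf '1-0039.1\tEMBL\tCDS\t1\t1524\t.\t+\t0\ttranscript_id "1-0039.1.2"; gene_name "dnaA";\n' > sample.gtf
printf 'other\tEMBL\tCDS\t1\t1524\t.\t+\t0\ttranscript_id "x"; gene_name "y";\n' >> sample.gtf

# Same command as above: split and join on tabs only, rewrite matching first fields.
awk 'BEGIN{ FS=OFS="\t" } $1=="1-0039.1"{ $1="1" } { print }' sample.gtf > sample_modified.gtf

# Show the first and ninth fields; the ninth should keep its single spaces.
awk -F'\t' '{ print $1 " | " $9 }' sample_modified.gtf
```

The second sample line has a different first field, so the `$1=="1-0039.1"` condition leaves it unchanged.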
Answer 2
Score: 2
Addressing OP's issue with the spaces in the last field being converted to tabs ...
As currently coded:
- no input field delimiter is defined so all white space is treated as input field delimiters
- what OP thinks of as the 'last field' (eg, transcript_id "1-0039.1.2"; gene_name "dnaA";) will actually be treated as 4 separate space-delimited fields
- the output field delimiter is defined as a tab so all contiguous groups of spaces (input) will be converted to tabs (output), hence the reason OP's 'last field' (which awk actually treats as 4 separate fields) is split apart with tabs
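The field counts behind these points can be checked with a quick sketch. The sample line is made up and assumed to be tab-delimited, with single spaces inside what OP calls the last field; the variable names are only for illustration.

```shell
# One tab-delimited line whose ninth field contains single spaces (made-up sample).
line=$(printf '1-0039.1\tEMBL\tCDS\t1\t1524\t.\t+\t0\ttranscript_id "1-0039.1.2"; gene_name "dnaA";')

# Default FS: any run of blanks or tabs separates fields.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# FS set to a tab: only tabs separate fields.
tab_nf=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF }')

echo "default FS: $default_nf fields, tab FS: $tab_nf fields"
```

With the default separator awk sees 12 fields (8 leading columns plus the 4 pieces of the 'last field'); with a tab separator it sees 9.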
To maintain the spaces in the 'last field', OP needs to tell awk what the input field delimiter is.
If the input field delimiter is a tab then one idea for tweaking OP's current code:
awk 'BEGIN { FS=OFS="\t"} {$1="1"; print}' 1-0039.gtf
If the input field delimiter is 2+ spaces then a couple alternatives:
awk 'BEGIN { FS="[ ]{2,}"; OFS="\t"} {$1="1"; print}' 1-0039.gtf
# or
awk 'BEGIN { FS="[ ][ ]+"; OFS="\t"} {$1="1"; print}' 1-0039.gtf
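As a rough illustration of the 2+ spaces case, the sketch below runs the second alternative on a made-up line whose columns are separated by runs of spaces. Note that the separators between columns are rewritten as tabs because OFS is a tab, while the single spaces inside the last field are kept.

```shell
# Made-up line whose columns are separated by two or more spaces.
line='1-0039.1   EMBL  CDS   transcript_id "1-0039.1.2"; gene_name "dnaA";'

# Runs of 2+ spaces split fields; single spaces stay inside a field.
out=$(printf '%s\n' "$line" | awk 'BEGIN { FS="[ ][ ]+"; OFS="\t" } { $1="1"; print }')

printf '%s\n' "$out"
```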
Answer 3
Score: 1
awk '{sub(/^1-0039\.1/,1); print}' 1-0039.gtf > 1-0039_modified.gtf
But the sed solutions in the comments will do the same job faster.
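The comments are not reproduced on this page, so the exact sed commands are unknown. A sed substitution in the same spirit might look like the sketch below; the sample file names are hypothetical, and escaping the dot is an assumption about the intended match.

```shell
# Hypothetical sample with tab-delimited columns.
printf '1-0039.1\tEMBL\tCDS\tgene_name "dnaA";\n' > sed_sample.gtf

# Replace the leading 1-0039.1 on each line; the rest of the line is left untouched.
sed 's/^1-0039\.1/1/' sed_sample.gtf > sed_sample_modified.gtf

cat sed_sample_modified.gtf
```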
Annotation:
Unfortunately the question gives contradictory information:
- The sample has space-separated fields with a varying number of spaces
- You write about tabs between the fields and want to keep the spaces in the last column.
The same appearance can be produced by one tab per field at a tab width of 8 spaces.
So the solution has to deal with this conflict.
This is why my solution does not use awk's field splitting and only looks at the pattern of the first column.
This way the solution does not rely on any assumption about the delimiter in order to work properly. The delimiter can be of any type and count and the solution will still do the job.
In particular, it will not change the existing column delimiter(s).
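A small sketch of this point, using two made-up lines (one delimited by tabs, one by runs of spaces) and illustrative variable names. Because sub() edits the record without assigning to any field, awk does not rebuild the line with OFS, and the delimiters come out as they went in.

```shell
# Two made-up lines: one delimited by tabs, one by runs of spaces.
tab_line=$(printf '1-0039.1\tEMBL\tgene_name "dnaA";')
space_line='1-0039.1   EMBL   gene_name "dnaA";'

# sub() edits $0 in place; no field is assigned, so the record is not rebuilt with OFS.
tab_out=$(printf '%s\n' "$tab_line" | awk '{ sub(/^1-0039\.1/, "1"); print }')
space_out=$(printf '%s\n' "$space_line" | awk '{ sub(/^1-0039\.1/, "1"); print }')

printf '%s\n%s\n' "$tab_out" "$space_out"
```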
Thanks for the comments below. They have a point, but keeping it simple to understand was my first priority.
So here is an alternative version that allows more flexibility in the first column:
awk '{sub(/^1-[^ \t]*/,1); print}' 1-0039.gtf > 1-0039_modified.gtf
Because this variant stops at the first space, which may not actually be a delimiter, the following version treats a single space directly after "1-" as part of the first column's content:
awk '{sub(/^1- ?[^ \t]*/,1); print}' 1-0039.gtf > 1-0039_modified.gtf