How do I remove the last 2 columns from a tsv file?
Question
I am working in bash in a terminal, inside a Docker container on my Mac. I am struggling to figure out how to remove the last 2 columns from my TSV file. It has 7 columns in total; the last 2 are not needed for my work and have to be removed.
Edit:
The first picture is the original data file. The second shows what the code is doing: it is deleting some seemingly random entries from the column. The third picture shows what the end result of this program should be. I am also struggling with the month and year columns, but I deleted that code and tried to simplify the data first.
I tried awk with NF = NF - 2, which does remove the last 2 columns but for some reason also deletes some of the data in my 5th column, which I need. So while I got the column deletion I needed, the code did a little extra. Here is the code:
preprocess() {
    input_file="$1"

    # Extract the base name of the input file
    base_name=$(basename "$input_file" .tsv)

    # Create the new output file name
    output_file="${base_name}_clean.tsv"

    awk -F'\t' 'BEGIN{OFS=FS}
    {
        NF = NF - 2

        print
    }' "$input_file" > "$output_file"
}
I have a few other lines, but they shouldn't cause any issues; they just check that the file exists, etc.
Answer 1
Score: 2
With awk, a script like this works.
Input data (separated by ; for clarity here, but it could just as well be tab-separated):
F1;F2;F3;F4
V11;V12;V13;V14
V21;V22;V23;V24
The program to convert it, commented so that everyone can follow, even those new to awk:
BEGIN {
    FS = ";"     # change to "\t" for tab-separated input
    OFS = ";"    # set as desired
    skipcolcount = 2
}
{
    # In each line, loop over the fields to keep
    for (i = 1; i <= NF - skipcolcount; i++) {
        printf "%s", $i                  # reference field by index variable; "%s" keeps a % in the data from being read as a format
        if (i < NF - skipcolcount) {     # no separator after the last field
            printf "%s", OFS
        }
    }
    printf "\n"    # newline after each line
}
Result:
F1;F2
V11;V12
V21;V22
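The loop above could be dropped into the asker's preprocess function more or less as is. The following is only a sketch, assuming tab-separated input; the variable names skip and line are illustrative and not part of the answer. It builds each output line from the fields to keep and never assigns to NF.

```shell
# Sketch: the question's preprocess() rewritten around a field loop.
# Assumes the input is tab-separated; "skip" and "line" are illustrative names.
preprocess() {
    input_file="$1"

    # Derive <name>_clean.tsv from <name>.tsv, as in the question
    output_file="$(basename "$input_file" .tsv)_clean.tsv"

    awk -v skip=2 'BEGIN { FS = OFS = "\t" }
    {
        line = ""
        # Concatenate every field except the last "skip" ones
        for (i = 1; i <= NF - skip; i++)
            line = line (i > 1 ? OFS : "") $i
        print line
    }' "$input_file" > "$output_file"
}
```

Calling preprocess data.tsv would write data_clean.tsv into the current directory.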
Answer 2
Score: 2
> I tried using awk and using NF = NF - 2 which does remove the last 2
> columns but for some reason deletes some of the data I have in my 5th
> column which I need.
This is unexpected to me: I ran your code with GNU Awk 5.1.0 and it works fine. However, you are using Docker, so maybe it forces the use of an erratic version of awk? Anyway, if your task is given as
> how to remove the last 2 columns on my TSV file. It has 7 total and
> the last 2 are not needed for my work and are required to be removed.
this might be simplified to: get the first 5 columns of a tab-separated file, which can be expressed in awk as
awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$3,$4,$5}' file.tsv
Please run it and report whether the output is as desired.
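Before switching approach, it may be worth checking whether every row really has 7 tab-separated fields. NF = NF - 2 removes the last two fields of whatever row it sees, so a row with only 6 fields would lose its 5th column. This is only a guess at the cause; nothing in the question confirms it. A minimal diagnostic sketch, where field_counts and odd_rows are hypothetical helper names:

```shell
# field_counts FILE: print each distinct field count and how many lines have it.
field_counts() {
    awk -F'\t' '{ count[NF]++ }
    END { for (n in count) print n, count[n] }' "$1" | sort -n
}

# odd_rows FILE [EXPECTED]: list lines whose field count differs from EXPECTED (default 7).
odd_rows() {
    awk -F'\t' -v want="${2:-7}" 'NF != want { print NR ": " NF " fields" }' "$1"
}
```

If field_counts prints more than one line for the input file, odd_rows shows which rows to inspect.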
Answer 3
Score: 1
The easiest is probably a rev/cut/rev combination:
$ rev inputfile | cut -f3- | rev > output.file
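Why this works: rev reverses each line character by character, so the last two columns become the first two; cut -f3- (tab is cut's default delimiter) drops them; the second rev restores the original order. A small sketch on made-up data, assuming rev (part of util-linux) is available in the container:

```shell
# Build a tiny 7-column TSV sample (made-up data, for illustration only).
sample=$(mktemp)
printf 'id\tname\tcity\tday\tnote\tmonth\tyear\n' >  "$sample"
printf '1\tAnn\tOslo\t12\tok\t05\t2023\n'        >> "$sample"

# Reverse each line, drop what are now the first two fields, reverse back.
result=$(rev "$sample" | cut -f3- | rev)
printf '%s\n' "$result"

rm -f "$sample"
```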
Answer 4
Score: 1
You are close.
Given the following TSV file:
cat file
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
You can do this in awk:
awk 'BEGIN{FS=OFS="\t"}
{NF=3} 1
' file
Prints:
1 2 3
6 7 8
11 12 13
Or Ruby:
ruby -ne 'puts $_.split("\t")[0..2].join("\t")' file
# same
Or Perl:
perl -nE 'say join("\t", (split "\t")[0..2])' file
# same
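Applied to the 7-column file from the question, the same idea would be NF=5, or NF minus 2 to stay independent of the column count. The sketch below makes some assumptions: drop_last_cols is a hypothetical helper name, and the guard for short lines is an addition that is not in the answer (gawk stops with a fatal error if NF is set to a negative value).

```shell
# drop_last_cols FILE [N]: print FILE without its last N tab-separated columns (default 2).
drop_last_cols() {
    awk -v n="${2:-2}" 'BEGIN { FS = OFS = "\t" }
    {
        # Lines with n fields or fewer become empty instead of making NF negative
        NF = (NF > n) ? NF - n : 0
        print
    }' "$1"
}
```

For example, drop_last_cols file.tsv > file_clean.tsv keeps the first 5 of 7 columns.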
Answer 5
Score: 1
One way could be to use sed:
sed 's/\t[^\t]*\t[^\t]*$//' "$input_file" > "$output_file"
Match explained:
\t - a tab character
[^\t]* - zero or more non-tab characters
\t - a tab character
[^\t]* - zero or more non-tab characters
$ - end of line anchor
Substitute (the // part) with an empty string.
An alternative:
sed -E 's/(\t[^\t]*){2}$//' "$input_file" > "$output_file"
Here the match \t[^\t]* is in a group (...) which is repeated twice {2}.
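Note that \t inside a sed expression is a GNU sed extension; other implementations (for example the BSD sed shipped with macOS) may treat it as a literal t. A sketch of a more portable variant, assuming a sed that accepts -E, puts a real tab character into the expression via printf. The helper name drop_last_two_sed is made up for illustration.

```shell
# drop_last_two_sed FILE: print FILE without its last two tab-separated fields.
drop_last_two_sed() {
    local tab
    tab=$(printf '\t')    # a literal tab, independent of sed's own \t support
    # Same pattern as above: (TAB followed by non-TABs), twice, at end of line
    sed -E "s/(${tab}[^${tab}]*){2}\$//" "$1"
}
```

Usage would be drop_last_two_sed "$input_file" > "$output_file".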