Remove preceding duplicate numbers in a file - bash

Question

In the text file below "BEFORE FILE", how would I remove the duplicate numbers to make it look like the "AFTER FILE" below? The "_PRODxxxx," where the x's are the numbers, will stay in that format.

BEFORE FILE

NET_SalesD_PROD1111,mexico
NET_Sales4_PROD22,newjersy
NET_SalesG_PROD333,bull

AFTER FILE

NET_SalesD_PROD1,mexico
NET_Sales4_PROD2,newjersy
NET_SalesG_PROD3,bull

I have tried using sed and a regex capture group like "PROD[1-9]{2,4}" but cannot get it to work.

Answer 1

Score: 5

Use a capture group to capture the first digit, and a back-reference to match repetitions of it. Then use the same back-reference in the replacement to produce just one of it.

sed -E 's/PROD([1-9])\1+,/PROD\1,/'
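
A quick way to check this kind of substitution against the sample data from the question is sketched below. The function name dedupe_prod and the variable result are hypothetical, introduced only for this illustration. Note that back-references inside -E patterns are an extension offered by GNU and BSD sed, not something POSIX guarantees.

```shell
# dedupe_prod is a hypothetical helper name, not part of the original answer.
# It collapses a run of one repeated digit between "PROD" and "," to a single digit.
dedupe_prod() {
  sed -E 's/PROD([1-9])\1+,/PROD\1,/'
}

# Pipe the three sample lines from the question through the helper.
result=$(printf '%s\n' \
  'NET_SalesD_PROD1111,mexico' \
  'NET_Sales4_PROD22,newjersy' \
  'NET_SalesG_PROD333,bull' | dedupe_prod)

printf '%s\n' "$result"
```

A line whose digit is not repeated (for example _PROD5,) does not match the pattern and passes through unchanged, which is still the desired result.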

Answer 2

Score: 2

1st solution: If you are OK with Perl, you can use a regex with capture groups, combining a greedy match with a lazy match, to get the required output. The back-reference \3* consumes the repeats of the captured digit.

perl -pe 's|^(.*_)(.*?)(\d)\3*(,.*)$|$1$2$3$4|'  Input_file

2nd solution: A simple substitution in Perl: a capture group plus a back-reference finds the repeated digit, and the match is replaced with a single copy of that digit followed by a comma.

perl -pe 's|([0-9])\1*,|$1,|'  Input_file

Answer 3

Score: 0

Assumptions:

  • all lines contain a string matching _PROD[0-9]+,
  • we (effectively) want to keep the first digit that comes after _PROD

One sed approach:

$ sed -E 's/(_PROD[0-9])[0-9]*/\1/' x
NET_SalesD_PROD1,mexico
NET_Sales4_PROD2,newjersy
NET_SalesG_PROD3,bull

Where:

  • (_PROD[0-9]) - (first) capture group matches on the string _PROD<single_digit> followed by ...
  • [0-9]* - zero or more digits
  • \1 - replace the match with the (first) capture group
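
The second assumption matters when the digits after _PROD are not all the same. The sketch below uses a made-up input line (NET_SalesX_PROD12,demo is not from the question) and hypothetical variable names. It suggests that this command keeps only the first digit, whereas a pattern with a back-reference, which matches only repeats of the same digit, leaves such a line unchanged.

```shell
# Made-up input line: the digits after _PROD differ (1 then 2).
line='NET_SalesX_PROD12,demo'

# Approach from this answer: keep the first digit, drop every digit after it.
first_digit=$(printf '%s\n' "$line" | sed -E 's/(_PROD[0-9])[0-9]*/\1/')

# Contrast: collapse only a run of the same digit, using a back-reference
# (\2 refers to the inner group holding the single digit).
same_digit=$(printf '%s\n' "$line" | sed -E 's/(_PROD([0-9]))\2*/\1/')

printf '%s\n' "$first_digit"   # NET_SalesX_PROD1,demo
printf '%s\n' "$same_digit"    # NET_SalesX_PROD12,demo
```

Which behavior is wanted depends on whether mixed digits can occur in the real data; for the sample in the question both variants give the same output.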

Answer 4

Score: 0

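A longer way, if you want to do it in awk (the program reads standard input; append a file name to read a file instead):

 awk -v c="PROD" '{

      split($1, h1, c)        # h1[1]: text before PROD, h1[2]: text after it
      split(h1[2], h2, ",")   # h2[1]: the digits, h2[2]: text after the comma

      print h1[1] c substr(h2[1], 1, 1) "," h2[2]   # keep only the first digit

 }'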
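
To try the awk approach without creating a file, the sample lines from the question can be piped in on standard input. This is only a sketch; the variable name out is introduced here for illustration.

```shell
# Feed the three sample lines from the question to the awk program on stdin.
out=$(printf '%s\n' \
  'NET_SalesD_PROD1111,mexico' \
  'NET_Sales4_PROD22,newjersy' \
  'NET_SalesG_PROD333,bull' |
  awk -v c="PROD" '{
      split($1, h1, c)        # h1[1]: text before PROD, h1[2]: text after it
      split(h1[2], h2, ",")   # h2[1]: the digits, h2[2]: text after the comma
      print h1[1] c substr(h2[1], 1, 1) "," h2[2]
  }')

printf '%s\n' "$out"
```

Like the sed command that keeps only the first digit, this takes the first character after PROD regardless of whether the following digits repeat it. It also assumes each line has exactly one comma and no whitespace, since it works on $1 and prints only the first field after the comma.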

huangapple
  • Posted on June 8, 2023, 06:25:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76427464.html