Extract a pattern from several hundred files

Question

I have several hundred `*.changelog` files generated by `rpm -q --changelog`, named `package01.changelog`, `package02.changelog`, `package03.changelog`, and so on.

I want to extract the CVE (Common Vulnerabilities and Exposures) numbers from each of them and create two separate files:

  1. One file that lists all CVEs for each `*.changelog` file, for example:
CVE-2014-3513 | CVE-2014-3567 | CVE-2014-3568

  2. A second file that concatenates all CVEs from all `*.changelog` files, for example:
CVE-2014-3513:package01 | CVE-2014-3567:package02 | CVE-2014-3568:package03

The CVEs in the `*.changelog` files do not appear in any particular order or follow any particular structure, for example:

  references (add CVE-2022-28356 bsc#1197391).
  (CVE-2022-1016 bsc#1197227).
  in error path (CVE-2022-28389 bsc#1198033).
- xprtrdma: fix incorrect header size calculations (CVE-2022-0812
  bsc#1196761 CVE-2022-0850).


(add CVE-2022-28356 bsc#1197391)
(CVE-2022-1016 bsc#1197227).
(CVE-2022-0812
CVE-2022-0850).
  (bsc#1196836 CVE-2022-26966).
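
As a quick illustration of the extraction step on lines like the ones above, here is a minimal sketch. The helper name `extract_cves` is hypothetical, and the pattern assumes identifiers of the form `CVE-<4-digit year>-<4 or more digits>`, which is stricter than a pattern built on `*` (that one would also accept `CVE--`).

```shell
# Hypothetical helper: print every CVE identifier found on stdin, one per line.
# Assumption: identifiers look like CVE-<4-digit year>-<4 or more digits>.
extract_cves() {
  grep -oE 'CVE-[0-9]{4}-[0-9]{4,}'
}

# Feed it a few of the sample lines, including one CVE split across two lines.
printf '%s\n' \
  '- xprtrdma: fix incorrect header size calculations (CVE-2022-0812' \
  '  bsc#1196761 CVE-2022-0850).' \
  '  (bsc#1196836 CVE-2022-26966).' | extract_cves
```

Because `grep -o` prints each match on its own line, it does not matter how many identifiers share a line or where they sit inside the parentheses.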

I am using a loop to go through each *.changelog file, extract the CVEs and write them to the two output files mentioned above, for example:

for chlg in /path/to/*.changelog
do
  # Derive the package name and the per-package output file name
  pkg_name="$(echo $(basename $chlg) | sed 's/.changelog//')"
  cve_file="$(echo $(basename $chlg) | sed 's/.changelog/.cve.html/')"

  # Extract the CVEs, deduplicate them, and write one HTML link per CVE to both output files
  grep -o 'CVE-[[:digit:]]*-[[:digit:]]*' $chlg | sort -f | uniq | awk -v pkg_name="$pkg_name" -v cve_file="/path/to/${cve_file}" -v all_cve_file="/path/to/cve.html" '
        {
          print "<a href=\"https://cve.mitre.org/cgi-bin/cvename.cgi?name="$1"\" target=\"_blank\" title=\""$1"\">"$1"</a> | " > cve_file
          print "<a href=\"https://cve.mitre.org/cgi-bin/cvename.cgi?name="$1"\" target=\"_blank\" title=\""$1"\">"$1"</a>:<span style=\"color:red; font-family:consolas;\">"pkg_name"</span> | " >> all_cve_file
        }'
done

I get the results I want, but it takes more than an hour to finish because I'm processing ~49000 files. Another issue (a discussion for a different topic) is that the resulting `cve.html` file is too big for a browser to load, and the browser eventually crashes.

My question is whether the approach I'm using to extract the CVE numbers from these ~49000 files is good enough, or whether there is a more "elegant" and efficient way.

Answer 1

Score: 2

You could deduplicate the values in GNU AWK itself and then pass them to `sort -f`, which might shorten the run time if the total number of CVEs is much greater than the number of unique CVEs. This can be done as follows. Let the content of `file.txt` be

  references (add CVE-2022-28356 bsc#1197391).
  (CVE-2022-1016 bsc#1197227).
  in error path (CVE-2022-28389 bsc#1198033).
- xprtrdma: fix incorrect header size calculations (CVE-2022-0812
  bsc#1196761 CVE-2022-0850).


(add CVE-2022-28356 bsc#1197391)
(CVE-2022-1016 bsc#1197227).
(CVE-2022-0812
CVE-2022-0850).
  (bsc#1196836 CVE-2022-26966).

then

awk 'BEGIN{FPAT="CVE-[[:digit:]]*-[[:digit:]]*"}$1&&!arr[$1]++{print $1}' file.txt | sort -f

gives output

CVE-2022-0812
CVE-2022-0850
CVE-2022-1016
CVE-2022-26966
CVE-2022-28356
CVE-2022-28389

Explanation: I tell GNU AWK that a field is `CVE-` followed by zero or more (`*`) digits, followed by `-`, followed by zero or more digits. I then print the first field if it exists and has not been seen before; whether a value has already been seen is recorded in the array `arr`.
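
One caveat: the command above tests only `$1`, so on a line that contains two or more CVEs only the first is reported. The sample data happens to have at most one CVE per line. The sketch below is a variation, not the command from this answer: it uses the portable `match()`/`substr()` functions instead of the GNU-only `FPAT`, and walks through every match on each line. The helper name `unique_cves` is hypothetical.

```shell
# Hypothetical helper: print each distinct CVE identifier in the given files
# (or on stdin) once, including identifiers that share a line with others.
unique_cves() {
  awk '
    {
      line = $0
      # Take the leftmost match, record it, then keep scanning the remainder.
      while (match(line, /CVE-[0-9]+-[0-9]+/)) {
        cve = substr(line, RSTART, RLENGTH)
        if (!seen[cve]++) print cve
        line = substr(line, RSTART + RLENGTH)
      }
    }
  ' "$@" | sort
}
```

It could be called as `unique_cves file.txt`. Deduplicating inside awk keeps the input to `sort` small, which is the same idea as in the command above.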

(tested in GNU Awk 5.0.1)
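
On the run time: the loop in the question starts several processes (`basename`, `sed`, `grep`, `sort`, `uniq`, `awk`) for every one of the ~49000 files, and that startup cost is likely a large part of the hour. A minimal sketch of a single-pass alternative follows. It is an illustration under stated assumptions, not a tested drop-in replacement: the function name `extract_all_cves`, the output names `<package>.cve` and `all.cve`, and the plain-text one-entry-per-line output (instead of HTML links) are all hypothetical choices.

```shell
# Hypothetical helper: scan every *.changelog file in src_dir and write
#   out_dir/<package>.cve : that package's distinct CVEs, one per line
#   out_dir/all.cve       : "CVE:package" pairs for all packages
# File names and the plain-text format are assumptions for illustration.
extract_all_cves() {
  src_dir=$1
  out_dir=$2
  mkdir -p "$out_dir" || return 1
  : > "$out_dir/all.cve"   # start with an empty combined file

  # find passes the file names to awk in large batches, so awk starts
  # a handful of times rather than once per file.
  find "$src_dir" -maxdepth 1 -type f -name '*.changelog' -exec awk -v out_dir="$out_dir" '
    FNR == 1 {
      # New input file: close the previous output and forget its CVEs.
      if (pkg_file != "") close(pkg_file)
      split("", seen)
      pkg = FILENAME
      sub(/^.*\//, "", pkg)          # strip the directory part
      sub(/\.changelog$/, "", pkg)   # strip the extension
      pkg_file = out_dir "/" pkg ".cve"
      all_file = out_dir "/all.cve"
    }
    {
      line = $0
      while (match(line, /CVE-[0-9]+-[0-9]+/)) {
        cve = substr(line, RSTART, RLENGTH)
        if (!seen[cve]++) {
          print cve > pkg_file
          print cve ":" pkg >> all_file
        }
        line = substr(line, RSTART + RLENGTH)
      }
    }
  ' {} +
}
```

It could be called as `extract_all_cves /path/to /path/to/out`. Entries appear in order of first occurrence, so pipe the files through `sort` if ordering matters. The HTML markup from the question can be restored by changing the two `print` statements.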

Posted by huangapple on February 24, 2023