Extract pattern from several hundred files
Question
I have several hundred `*.changelog` files generated by `rpm -q --changelog`, named `package01.changelog`, `package02.changelog`, `package03.changelog`, and so on.
I want to extract the CVE (Common Vulnerabilities and Exposures) numbers from each of them and create two separate files:
- One file that lists all CVEs for each `*.changelog` file, for example:
CVE-2014-3513 | CVE-2014-3567 | CVE-2014-3568
- A second file that concatenates all CVEs from all `*.changelog` files, for example:
CVE-2014-3513:package01 | CVE-2014-3567:package02 | CVE-2014-3568:package03
The CVEs listed in the `*.changelog` files are not placed in a specific order and do not follow a specific structure, for example:
references (add CVE-2022-28356 bsc#1197391).
(CVE-2022-1016 bsc#1197227).
in error path (CVE-2022-28389 bsc#1198033).
- xprtrdma: fix incorrect header size calculations (CVE-2022-0812
bsc#1196761 CVE-2022-0850).
(add CVE-2022-28356 bsc#1197391)
(CVE-2022-1016 bsc#1197227).
(CVE-2022-0812
CVE-2022-0850).
(bsc#1196836 CVE-2022-26966).
I am using a loop to go through each `*.changelog` file, extract the CVEs and write them to the two output files mentioned above, for example:
for chlg in /path/to/*.changelog
do
    pkg_name="$(echo $(basename $chlg) | sed 's/.changelog//')"
    cve_file="$(echo $(basename $chlg) | sed 's/.changelog/.cve.html/')"
    grep -o 'CVE-[[:digit:]]*-[[:digit:]]*' $chlg | sort -f | uniq | awk -v pkg_name="$pkg_name" -v cve_file="/path/to/${cve_file}" -v all_cve_file="/path/to/cve.html" '
    {
        print "<a href=\"https://cve.mitre.org/cgi-bin/cvename.cgi?name="$1"\" target=\"_blank\" title=\""$1"\">"$1"</a> | " > cve_file
        print "<a href=\"https://cve.mitre.org/cgi-bin/cvename.cgi?name="$1"\" target=\"_blank\" title=\""$1"\">"$1"</a>:<span style=\"color:red; font-family:consolas;\">"pkg_name"</span> | " >> all_cve_file
    }'
done
I get the results I want, but it takes more than an hour to finish, since I'm processing ~49000 files. Another issue (a discussion for a different topic) is that the resulting `cve.html` file is too big to be loaded by a browser, which eventually crashes.
My question is whether the approach I'm using to extract the CVE numbers from these ~49000 files is good enough, or whether there is a more "elegant" and efficient way.
Answer 1
Score: 2
You could deduplicate the values in GNU AWK itself and then `sort -f` them, which might save time if the total number of CVEs is much greater than the number of unique CVEs. This can be done as follows. Let the content of `file.txt` be
references (add CVE-2022-28356 bsc#1197391).
(CVE-2022-1016 bsc#1197227).
in error path (CVE-2022-28389 bsc#1198033).
- xprtrdma: fix incorrect header size calculations (CVE-2022-0812
bsc#1196761 CVE-2022-0850).
(add CVE-2022-28356 bsc#1197391)
(CVE-2022-1016 bsc#1197227).
(CVE-2022-0812
CVE-2022-0850).
(bsc#1196836 CVE-2022-26966).
then
awk 'BEGIN{FPAT="CVE-[[:digit:]]*-[[:digit:]]*"}$1&&!arr[$1]++{print $1}' file.txt | sort -f
gives output
CVE-2022-0812
CVE-2022-0850
CVE-2022-1016
CVE-2022-26966
CVE-2022-28356
CVE-2022-28389
Explanation: I inform GNU AWK that a field is `CVE-` followed by zero or more (`*`) digits, followed by `-`, followed by zero or more digits. Then I print the 1st field if there is a 1st field and it was not seen earlier; what has been seen is tracked in the array `arr`.
(tested in GNU Awk 5.0.1)
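For the original ~49000-file case, much of the runtime probably comes from starting several processes per file (`basename`, `sed`, `grep`, `sort`, `uniq`, `awk`) rather than from the matching itself. The following is only a sketch of one way to cut that down, not something taken from the answer above: it runs the extraction over all files with a few awk invocations, one `sort -u`, and one writer awk. It uses the portable `match()` function instead of `FPAT` so it does not require GNU AWK, it requires at least one digit in each part (`+` rather than `*`), and it picks up every CVE on a line, not only the first. The function name `extract_cves`, the output names `*.cve.txt` and `all.cve.txt`, and the plain-text output (no HTML markup) are assumptions made for the example.

```shell
#!/bin/sh
# Sketch: extract CVE IDs from many changelog files with few processes.
# extract_cves, *.cve.txt and all.cve.txt are hypothetical names.
# Usage: extract_cves /path/to /path/to/out
extract_cves() {
    in_dir=$1
    out_dir=$2

    # Stage 1: find batches the files into as few awk runs as possible;
    # awk emits one "package<TAB>CVE" line for every match it finds.
    find "$in_dir" -type f -name '*.changelog' -exec awk '
        FNR == 1 {
            # Derive the package name from the current file name.
            pkg = FILENAME
            sub(/.*\//, "", pkg)
            sub(/\.changelog$/, "", pkg)
        }
        {
            line = $0
            # Consume the line match by match so several CVEs per line are kept.
            while (match(line, /CVE-[0-9]+-[0-9]+/)) {
                print pkg "\t" substr(line, RSTART, RLENGTH)
                line = substr(line, RSTART + RLENGTH)
            }
        }
    ' {} + |
    # Stage 2: a single sort removes duplicates and groups lines by package.
    LC_ALL=C sort -u |
    # Stage 3: write one file per package plus the combined file.
    awk -F '\t' -v out_dir="$out_dir" '
        $1 != prev {
            # Close the previous per-package file to limit open descriptors.
            if (prev != "") close(out)
            prev = $1
            out = out_dir "/" $1 ".cve.txt"
        }
        {
            print $2 > out
            print $2 ":" $1 > (out_dir "/all.cve.txt")
        }
    '
}
```

Note that `find` descends into subdirectories here, and that package names containing a tab would break the intermediate format. The HTML wrapping from the question could be put back by changing the two `print` statements in the last awk program.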