Parsing HTML based on tr and td tag values
Question
I want to parse HTML with bash and find every tr element whose class is "error", like the one below, anywhere in the page.
<tr class="error">
<td>
<a href="https://exmple.com">Test failed for AAA</a>
</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0.0%</td>
<td>3.640 seconds</td>
</tr>
The expected output is "Test failed for AAA".
I tried a few things with sed, but they did not work as expected and returned NULL values.
Any input would be helpful.
Answer 1
Score: 3
As always, using line-based, regular-expression-oriented tools to work with tree-based documents like HTML and XML is the wrong approach. Use tools that understand the format: they are much easier to use, less error-prone, and simpler to maintain if the input data changes in the future.
For example, using xmllint and an XPath query:
$ xmllint --html --xpath '//tr[@class="error"]/td[1]/a/text()' input.html
Test failed for AAA
Or with W3C's HTML-XML Utils package and CSS selectors:
$ hxselect -c 'tr.error td:first-child a' < input.html
Test failed for AAA
(These might not print a trailing newline, which can be confusing when you run them interactively rather than capturing the result in a variable.)
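As a minimal sketch of the variable-capture case, assuming xmllint is installed: the sample file, the temporary directory, and the variable name failed are illustrative choices, not part of the original answer. Command substitution strips trailing newlines, so the captured value is the same whether or not the tool prints one.

```shell
#!/usr/bin/env bash
# Sample input modeled on the row in the question; the file location is arbitrary.
tmpdir=$(mktemp -d)
cat > "$tmpdir/input.html" <<'EOF'
<html><body><table>
<tr class="error">
<td>
<a href="https://exmple.com">Test failed for AAA</a>
</td>
<td>1</td>
</tr>
</table></body></html>
EOF

# Capture the text node. Parser warnings go to stderr and are discarded here.
failed=$(xmllint --html --xpath '//tr[@class="error"]/td[1]/a/text()' \
  "$tmpdir/input.html" 2>/dev/null)

# printf adds exactly one newline back for display.
printf '%s\n' "$failed"
```

Printing with printf '%s\n' keeps the interactive output tidy regardless of how the tool ends its output.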
Answer 2
Score: 1
It isn't the prettiest, but you can do this:
Find the lines between tr class="error and </tr>, and print only the lines that contain href:
awk '/tr class="error/{found=1};/<\/tr>/{found=0};found==1 && /href/' test.html
There are three rules in this awk program:
- Search for tr class="error and set the variable found to 1
- Search for </tr> and set the variable found to 0
- If found is 1, search for href and, if it matches, print $0 (the current line)
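The three rules can be laid out one per line to make the flag logic easier to follow. This is only a sketch on a made-up sample file; the extra "ok" row, the temporary directory, and the variable name row are assumptions added for illustration.

```shell
#!/usr/bin/env bash
# Sample file with one passing row and one error row (hypothetical content).
tmpdir=$(mktemp -d)
cat > "$tmpdir/test.html" <<'EOF'
<table>
<tr class="ok">
<td>
<a href="https://exmple.com">Test passed for BBB</a>
</td>
</tr>
<tr class="error">
<td>
<a href="https://exmple.com">Test failed for AAA</a>
</td>
<td>1</td>
</tr>
</table>
EOF

# Same rules as the one-liner, one per line.
row=$(awk '
  /tr class="error/ { found = 1 }   # entering an error row
  /<\/tr>/          { found = 0 }   # leaving any row
  found == 1 && /href/              # pattern with no action: print the line
' "$tmpdir/test.html")

printf '%s\n' "$row"
```

The href line in the "ok" row is skipped because found is still unset when awk reaches it.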
Print only the text between > and <:
grep -Po ">\K.*(?=<)"
Some things to note:
- -P enables Perl-compatible regular expressions, which \K and lookahead require
- -o prints only the matched part instead of the whole line
- \K behaves like a lookbehind: everything before it has to match but is left out of the reported match (strictly speaking, \K resets the start of the match, but the difference is small here)
- (?=<) is a lookahead: the (?=...) construct has to match but is not part of the reported match; here it looks for <
Putting it together:
awk '/tr class="error/{found=1};/<\/tr>/{found=0};found==1 && /href/{print $0}' test.html | grep -Po ">\K.*(?=<)"
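A small end-to-end sketch of the pipeline, assuming GNU grep built with -P support. The sample file and the variable names are illustrative. The second half shows a caveat: the greedy .* relies on the anchor sitting alone on its line. The [^<]+ variant is an alternative suggested here, not part of the original answer.

```shell
#!/usr/bin/env bash
tmpdir=$(mktemp -d)
cat > "$tmpdir/test.html" <<'EOF'
<table>
<tr class="error">
<td>
<a href="https://exmple.com">Test failed for AAA</a>
</td>
<td>1</td>
</tr>
</table>
EOF

# Step 1: awk keeps only the href line inside the error row.
# Step 2: grep -o prints what lies between the first > and the last <.
message=$(awk '/tr class="error/{found=1};/<\/tr>/{found=0};found==1 && /href/' \
  "$tmpdir/test.html" | grep -Po '>\K.*(?=<)')
printf '%s\n' "$message"

# If the cell is written on a single line, the greedy match spans the
# whole anchor element; a negated class stays inside one element.
one_line='<td><a href="https://exmple.com">Test failed for AAA</a></td>'
greedy=$(printf '%s\n' "$one_line" | grep -Po '>\K.*(?=<)')
narrow=$(printf '%s\n' "$one_line" | grep -Po '>\K[^<]+(?=<)')
printf 'greedy: %s\nnarrow: %s\n' "$greedy" "$narrow"
```

On the one-line input the greedy pattern returns the full a element, while the narrow pattern returns only the link text. This kind of layout sensitivity is the reason a format-aware tool is usually the safer choice.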
Another way would be to remove all newlines and use some regex in grep:
tr -d "\n" < test.html | grep -Po "class=\"error.*?href=.*?>\K.*?(?=<)"