Parsing the html based on tr and td tag values

Question

I want to parse an HTML page with bash and find every tr element whose class is "error", like the one below.

<tr class="error">
<td>
<a href="https://exmple.com">Test failed for AAA</a>
</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0.0%</td>
<td>3.640 seconds</td>
</tr>

The expected output is "Test failed for AAA".

I tried a few things with sed, but they did not work as expected and returned NULL values.

Any input would be helpful.

Answer 1

Score: 3

As always, line-based, regular-expression-oriented tools are the wrong approach for tree-structured documents like HTML and XML. Use tools that understand the format: they are easier to use, less error-prone, and simpler to maintain if the input data changes in the future.

For example, using xmllint and an XPath query:

 $ xmllint --html --xpath '//tr[@class="error"]/td[1]/a/text()' input.html
Test failed for AAA

Or with W3C's HTML-XML Utils package and CSS selectors:

 $ hxselect -c 'tr.error td:first-child a' < input.html
Test failed for AAA

(These might not print a trailing newline, which can be confusing when they are used interactively rather than to capture the result in a variable.)
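
To make the point about capturing the result concrete, here is a minimal sketch. It assumes xmllint (shipped with libxml2) is installed; the file name input.html and the variable name msg are illustrative only. Command substitution strips trailing newlines, so the captured value is the same whether or not the tool prints one.

```shell
# Sample input; the file name and surrounding markup are illustrative.
cat > input.html <<'EOF'
<html><body><table>
<tr class="error">
<td>
<a href="https://exmple.com">Test failed for AAA</a>
</td>
<td>1</td>
</tr>
</table></body></html>
EOF

msg=''
if command -v xmllint >/dev/null 2>&1; then
    # 2>/dev/null hides HTML parser warnings; $(...) drops trailing newlines.
    msg=$(xmllint --html --xpath '//tr[@class="error"]/td[1]/a/text()' input.html 2>/dev/null)
else
    echo "xmllint not found; install libxml2 utilities to run this sketch" >&2
fi

# Add the newline explicitly when printing the captured value.
printf '%s\n' "$msg"
```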

Answer 2

Score: 1

It isn't the prettiest solution, but you can do this:

Find the lines between tr class="error and </tr>, and print only the ones that contain href:

awk '/tr class="error/{found=1};/<\/tr>/{found=0};found==1 && /href/' test.html

There are three rules in this script:

  • Search for tr class="error and set the variable found to 1
  • Search for </tr> and set the variable found to 0
  • If found is 1, search for href and, if it matches, print $0 (the current line)
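
The three rules above can be tried on a small sample as follows. This is only a sketch: the file contents, including the extra "ok" row and its link, are made up for illustration, and the approach relies on each tag sitting on its own line, as in the question.

```shell
# Made-up sample page with one passing row and one error row.
cat > test.html <<'EOF'
<table>
<tr class="ok">
<td>
<a href="https://exmple.com/ok">Test passed for BBB</a>
</td>
</tr>
<tr class="error">
<td>
<a href="https://exmple.com">Test failed for AAA</a>
</td>
<td>1</td>
</tr>
</table>
EOF

lines=$(awk '
    /tr class="error/ { found = 1 }   # entering an error row
    /<\/tr>/          { found = 0 }   # leaving any row
    found && /href/                   # no action given, so the line is printed
' test.html)

printf '%s\n' "$lines"
```

Only the link line from the error row is printed; the link in the "ok" row is skipped because found is still 0 when awk reaches it.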

Print only the text between > and <:

grep -Po ">\K.*(?=<)"

Some things to note:

  • -P enables Perl-compatible regular expressions, which \K and lookahead require
  • -o prints only the matched part rather than the whole line
  • \K behaves much like a lookbehind: everything before it has to match but is left out of the reported match (strictly speaking, \K resets the start of the match, but the difference is small here)
  • (?=<) is a lookahead: the (?=...) construct has to match but is not included in the matched part; in this case it looks for <
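
The effect of \K and the lookahead can be checked on a single line. This sketch assumes GNU grep built with PCRE support (the -P option); the variable names are illustrative. Because .* is greedy, the first pattern runs up to the last < on the line, so a link with nested tags would yield more than the text. The second pattern, [^<]+, is offered as one possible stricter alternative and is not part of the original answer.

```shell
line='<a href="https://exmple.com">Test failed for AAA</a>'

# Original pattern: drop everything up to the first ">", stop before the last "<".
text=$(printf '%s\n' "$line" | grep -Po '>\K.*(?=<)')

# Alternative: after a ">", take the longest run of characters that are not "<".
text_strict=$(printf '%s\n' "$line" | grep -Po '>\K[^<]+')

printf '%s\n' "$text"
printf '%s\n' "$text_strict"
```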

Putting it together:

awk '/tr class="error/{found=1};/<\/tr>/{found=0};found==1 && /href/{print $0}' test.html | grep -Po ">\K.*(?=<)"

Another way would be to remove all newlines and use a regex in grep:

tr -d "\n" < test.html | grep -Po "class=\"error.*?href=.*?>\K.*?(?=<)"
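
A quick way to try the newline-stripping variant is sketched below, again assuming GNU grep with -P; the sample file contents and the variable name found_text are made up. One caveat: the lazy .*? can still cross row boundaries, so an error row without a link would pick up the link of a later row.

```shell
# Made-up sample page with one passing row and one error row.
cat > test.html <<'EOF'
<table>
<tr class="ok">
<td><a href="https://exmple.com/ok">Test passed for BBB</a></td>
</tr>
<tr class="error">
<td>
<a href="https://exmple.com">Test failed for AAA</a>
</td>
<td>1</td>
</tr>
</table>
EOF

# Join the file into one line, then keep only the link text that follows
# the first href after class="error.
found_text=$(tr -d '\n' < test.html | grep -Po 'class="error.*?href=.*?>\K.*?(?=<)')

printf '%s\n' "$found_text"
```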
