How do I use bs4 to parse the text description of an anchor tag, especially when the href link is broken?

Question

I'm practicing using BS4 to parse HTML files. I've run into an issue that I can't find a solution for anywhere. How would I parse the inside of an anchor tag? I've tried reading the "href" attribute, but the link contains some extra characters that break it.

For instance, I am trying to parse this link to one of my older questions:

<a href = "https://stackoverflow.com/questions/61925957/using-an-api-to-create-data-in-a-react-table" style=
=3D"color: #FFFFFF;font-size: 15px;"> >

But instead, the actual markup contains some characters that break the tag:

<a href = "https://stackoverflow.com/&amp=3D"questions/61925957"=3D"/using-an-api-to-create-data-in-a-react-table" style=
=3D"color: #FFFFFF;font-size: 15px;" >

How would I get the inside of this tag using bs4 so that I can trim it and get my final link? I also want to ignore the style, color, and font-size descriptors.

Answer 1

Score: 1

I can't reproduce the issue; this works just fine:

from bs4 import BeautifulSoup

html_sample = """<a href = "https://stackoverflow.com/questions/61925957/using-an-api-to-create-data-in-a-react-table" style=
=3D"color: #FFFFFF;font-size: 15px;"> >"""

# Parse the snippet, select the first anchor tag, and read its href attribute.
href = BeautifulSoup(html_sample, "lxml").select_one("a")["href"]
print(href)

Output:

https://stackoverflow.com/questions/61925957/using-an-api-to-create-data-in-a-react-table
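
For the broken variant in the question, the parser may cut the href value off at the first stray quote, so one option is to recover the link from the raw markup instead. The following is only a sketch and not a bs4 feature: `recover_href` is a hypothetical helper, and it assumes that the href is followed by a `style` attribute and that the only junk tokens are `&amp`, `=3D`, and stray double quotes, as in the sample. It would also mangle a URL that legitimately contains those tokens, such as one with a query string.

```python
import re
from typing import Optional

# Assumption: the href value runs up to the quote that precedes " style=".
HREF_SPAN = re.compile(r'href\s*=\s*"(.*?)"\s+style\s*=', re.DOTALL)

# Assumption: these are the only junk tokens injected into the link.
JUNK = re.compile(r'&amp;?|=3D|"')


def recover_href(raw_tag: str) -> Optional[str]:
    """Return the cleaned link from a raw anchor tag, or None if no href is found."""
    match = HREF_SPAN.search(raw_tag)
    if match is None:
        return None
    # Strip the junk tokens from the captured span.
    return JUNK.sub("", match.group(1))


broken_tag = """<a href = "https://stackoverflow.com/&amp=3D"questions/61925957"=3D"/using-an-api-to-create-data-in-a-react-table" style=
=3D"color: #FFFFFF;font-size: 15px;" >"""

print(recover_href(broken_tag))
```

For the broken sample this prints the same URL as the output above. Because only the href span is captured, the style, color, and font-size descriptors are ignored.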

huangapple
  • Posted on March 7, 2023, 09:08:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75657208.html