Posted: July 24, 2023, 15:05:43

BeautifulSoup - Find the <h1>, <h2>, and <h3> elements placed above an <a> tag

Question

How can I scrape the following structure to get only the h1, h2, and h3 elements above each <a> tag?

I would like to get the heading placed above each <a> tag by targeting the <a> tag in Beautiful Soup.

HTML Code:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Document</title>
</head>
<body>
    <h1>Heading H1</h1>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <a href="#">Button</a>

    <hr>

    <h2>Heading H2</h2>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <p>
        <a href="#">Button</a>
    </p>

    <hr>

    <h3>Heading H3</h3>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <p>
        <a href="#">Button</a>
    </p>

    <hr>
</body>
</html>

My Code:

from bs4 import BeautifulSoup
import requests

# Local test page served by a development server
website = 'http://127.0.0.1:5500/test.html'
result = requests.get(website)
content = result.text

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(content, 'html.parser')
# print(soup.prettify())

# Print the tag name and text of every <a> tag
href_tags = ["a"]
for tag in soup.find_all(href_tags):
    print(tag.name + ' -> ' + tag.text.strip())

I tried the code above, but it displays only the <a> tag text. I would also like to get the <h1>, <h2>, and <h3> tags that are placed above each <a> tag.

Answer 1

Score: 0

Here is one way of getting that information:

from bs4 import BeautifulSoup as bs
import pandas as pd

html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Document</title>
</head>
<body>
    <h1>Heading H1</h1>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <a href="#">Button</a>

    <hr>

    <h2>Heading H2</h2>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <p>
        <a href="#">Button</a>
    </p>

    <hr>

    <h3>Heading H3</h3>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <p>
        <a href="#">Button</a>
    </p>
    
    <hr>
</body>
</html>
'''
big_list = []
soup = bs(html, 'html.parser')

for link in soup.select('a'):
    link_text = link.get_text(strip=True)
    link_url = link.get('href')
    # find_all_previous() yields earlier elements nearest-first, so [0] is the closest heading above the link
    previous_header = [x.get_text(strip=True) for x in link.find_all_previous() if x.name in ['h1', 'h2', 'h3']][0]
    big_list.append((link_text, link_url, previous_header))
# Collect the rows into a DataFrame for display
df = pd.DataFrame(big_list, columns=['link_text', 'link_url', 'previous_header_text'])
print(df)

Result in terminal:

  link_text link_url previous_header_text
0    Button        #           Heading H1
1    Button        #           Heading H2
2    Button        #           Heading H3
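
The `[0]` index in the answer picks the nearest heading because `find_all_previous()` walks backwards from the link. The following is a minimal sketch of that ordering; `sample_html` and the variable names are made up for illustration and are shortened from the question's HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, shortened from the HTML in the question.
sample_html = """
<h1>Heading H1</h1>
<p>Text</p>
<a href="#">First</a>
<h2>Heading H2</h2>
<p><a href="#">Second</a></p>
"""

soup = BeautifulSoup(sample_html, "html.parser")
second_link = soup.find_all("a")[1]

# find_all_previous() returns earlier tags in reverse document order,
# so the closest preceding heading comes first in the filtered list.
preceding_headings = [
    tag.get_text(strip=True)
    for tag in second_link.find_all_previous()
    if tag.name in ("h1", "h2", "h3")
]
print(preceding_headings)  # ['Heading H2', 'Heading H1']
```

Note that the list is empty for a link with no heading above it, in which case indexing with `[0]` raises an `IndexError`.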

See the BeautifulSoup documentation for details.
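
As a possible alternative to filtering `find_all_previous()`, BeautifulSoup's `find_previous()` accepts a list of tag names and returns only the first match, or `None` when nothing matches. The sketch below assumes that approach; the helper `links_with_headings`, the constant `HEADING_TAGS`, and `sample_html` are hypothetical names and data, not part of the original answer, and pandas is left out to keep the example small.

```python
from bs4 import BeautifulSoup

# Tag names treated as headings (an assumption matching the question).
HEADING_TAGS = ["h1", "h2", "h3"]


def links_with_headings(html):
    """Return one dict per <a> tag with the closest heading above it."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.find_all("a"):
        # Closest earlier h1/h2/h3 in document order, or None if there is none.
        heading = link.find_previous(HEADING_TAGS)
        rows.append({
            "link_text": link.get_text(strip=True),
            "link_url": link.get("href"),
            "heading_tag": heading.name if heading else None,
            "heading_text": heading.get_text(strip=True) if heading else None,
        })
    return rows


# Hypothetical input: the first link has no heading above it.
sample_html = """
<a href="/top">Orphan</a>
<h1>Heading H1</h1>
<p>Text</p>
<a href="#">Button</a>
<h2>Heading H2</h2>
<p><a href="#">Button</a></p>
"""

for row in links_with_headings(sample_html):
    print(row)
```

Returning `None` for a missing heading avoids the `IndexError` that the list-indexing version would raise, and the resulting list of dicts can still be passed to `pd.DataFrame` if a table is wanted.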

