BeautifulSoup - Find the <h1>, <h2> & <h3> tag elements placed above an <a> tag

Question

How can I scrape the following structure to get only the <h1>, <h2>, and <h3> elements above each <a> tag?

I would like to get the heading placed above every <a> tag by targeting the <a> tags in Beautiful Soup.

HTML Code:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Document</title>
</head>
<body>
    <h1>Heading H1</h1>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <a href="#">Button</a>

    <hr>

    <h2>Heading H2</h2>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <p>
        <a href="#">Button</a>
    </p>

    <hr>

    <h3>Heading H3</h3>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <p>
        <a href="#">Button</a>
    </p>

    <hr>
</body>
</html>

My Code:

from bs4 import BeautifulSoup
import requests

website = 'http://127.0.0.1:5500/test.html'
result = requests.get(website)
content = result.text

# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(content, 'html.parser')
# print(soup.prettify())

# Prints only the <a> tags; the headings above them are not included
href_tags = ["a"]
for tag in soup.find_all(href_tags):
    print(tag.name + ' -> ' + tag.text.strip())

I tried the code above, but it only displays the text of the <a> tags. I would also like to get the <h1>, <h2>, and <h3> headings that are placed above each <a> tag.

Answer 1

Score: 0

Here is one way of getting that information:

from bs4 import BeautifulSoup as bs
import pandas as pd

html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Document</title>
</head>
<body>
    <h1>Heading H1</h1>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <a href="#">Button</a>

    <hr>

    <h2>Heading H2</h2>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <p>
        <a href="#">Button</a>
    </p>

    <hr>

    <h3>Heading H3</h3>
    <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
    <p>
        <a href="#">Button</a>
    </p>
    
    <hr>
</body>
</html>
'''
big_list = []
soup = bs(html, 'html.parser')

for link in soup.select('a'):
    link_text = link.get_text(strip=True)
    link_url = link.get('href')
    # find_all_previous() walks backwards from the link, so the first
    # h1/h2/h3 it yields is the nearest heading above the link
    previous_header = [x.get_text(strip=True) for x in link.find_all_previous() if x.name in ['h1', 'h2', 'h3']][0]
    big_list.append((link_text, link_url, previous_header))

# One row per link: its text, its URL, and the heading above it
df = pd.DataFrame(big_list, columns=['link_text', 'link_url', 'previous_header_text'])
print(df)

Result in terminal:

  link_text link_url previous_header_text
0    Button        #           Heading H1
1    Button        #           Heading H2
2    Button        #           Heading H3

See the BeautifulSoup documentation for details.
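As a variation on the answer above, the nearest heading can also be looked up with `find_previous`, which returns the first matching element before the link, or `None` when there is none. The following is only a sketch under a few assumptions: the helper name `headings_above_links`, the constant `HEADING_TAGS`, and the shortened `SAMPLE_HTML` are made up for illustration and are not part of the original question or answer. It skips pandas and also keeps the heading's tag name, which matches the `name -> text` output format used in the question.

```python
from bs4 import BeautifulSoup

# Heading levels to look for (hypothetical constant, adjust as needed).
HEADING_TAGS = ["h1", "h2", "h3"]

# Shortened, hypothetical sample modelled on the HTML in the question.
SAMPLE_HTML = """
<body>
    <h1>Heading H1</h1>
    <p>Lorem Ipsum</p>
    <a href="#">Button</a>
    <hr>
    <h2>Heading H2</h2>
    <p><a href="#">Button</a></p>
    <hr>
    <h3>Heading H3</h3>
    <p><a href="#">Button</a></p>
</body>
"""


def headings_above_links(html):
    """Return (heading_tag, heading_text, link_text) for every <a> tag."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.find_all("a"):
        link_text = link.get_text(strip=True)
        # find_previous walks backwards in document order and returns the
        # first h1/h2/h3 it meets, or None if no heading precedes the link.
        heading = link.find_previous(HEADING_TAGS)
        if heading is None:
            rows.append((None, None, link_text))
        else:
            rows.append((heading.name, heading.get_text(strip=True), link_text))
    return rows


if __name__ == "__main__":
    for tag_name, heading_text, link_text in headings_above_links(SAMPLE_HTML):
        print(f"{tag_name} -> {heading_text} | a -> {link_text}")
```

With the sample above, each "Button" link is paired with "Heading H1", "Heading H2", and "Heading H3" respectively. Unlike indexing the list with `[0]`, this version does not raise an `IndexError` for a link that has no heading before it; it records `None` instead.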


Posted by huangapple on July 24, 2023, 15:05:43
When reposting, please keep the link to this article: https://go.coder-hub.com/76752110.html