BeautifulSoup - Find <h1>, <h2> & <h3> elements placed above an <a> tag

Question
How can I scrape the following structure to get only the <h1>, <h2> & <h3> elements placed above each <a> tag? I would like to get the heading above each <a> tag by targeting the <a> tag in Beautiful Soup.
HTML Code:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Document</title>
</head>
<body>
<h1>Heading H1</h1>
<p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
<a href="#">Button</a>
<hr>
<h2>Heading H2</h2>
<p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
<p>
<a href="#">Button</a>
</p>
<hr>
<h3>Heading H3</h3>
<p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
<p>
<a href="#">Button</a>
</p>
<hr>
</body>
</html>
My Code:
from bs4 import BeautifulSoup
import requests

website = 'http://127.0.0.1:5500/test.html'
result = requests.get(website)
content = result.text

# specify the parser explicitly to avoid the GuessedAtParserWarning
soup = BeautifulSoup(content, 'html.parser')
# print(soup.prettify())

href_tags = ["a"]
for tags in soup.find_all(href_tags):
    print(tags.name + ' -> ' + tags.text.strip())
Tried with the above code, but it displays only the <a> tag text. I would also like to get the <h1>, <h2> & <h3> tags placed above each <a> tag.
Answer 1

Score: 0

Here is one way of getting that information:
from bs4 import BeautifulSoup as bs
import pandas as pd
html = '''
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Document</title>
</head>
<body>
<h1>Heading H1</h1>
<p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
<a href="#">Button</a>
<hr>
<h2>Heading H2</h2>
<p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
<p>
<a href="#">Button</a>
</p>
<hr>
<h3>Heading H3</h3>
<p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
<p>
<a href="#">Button</a>
</p>
<hr>
</body>
</html>
'''
big_list = []
soup = bs(html, 'html.parser')
for link in soup.select('a'):
    link_text = link.get_text(strip=True)
    link_url = link.get('href')
    # nearest h1/h2/h3 heading that appears before this link in the document
    previous_header = [x.get_text(strip=True) for x in link.find_all_previous() if x.name in ['h1', 'h2', 'h3']][0]
    big_list.append((link_text, link_url, previous_header))
df = pd.DataFrame(big_list, columns=['link_text', 'link_url', 'previous_header_text'])
print(df)
Result in terminal:
link_text link_url previous_header_text
0 Button # Heading H1
1 Button # Heading H2
2 Button # Heading H3
See the BeautifulSoup documentation here.
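As a variant (not part of the original answer), `find_previous` accepts a list of tag names and returns the nearest preceding match directly, so the backward scan and the `[0]` indexing can be avoided; a minimal sketch using a small inline snippet standing in for the page above:

```python
from bs4 import BeautifulSoup

# small inline snippet standing in for the page above
html = '''
<h1>Heading H1</h1>
<a href="#">Button</a>
<h2>Heading H2</h2>
<p><a href="#">Button</a></p>
'''

soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('a'):
    # nearest h1/h2/h3 that occurs before this link in document order
    heading = link.find_previous(['h1', 'h2', 'h3'])
    print(link.get_text(strip=True), '->',
          heading.get_text(strip=True) if heading else None)
```

Because `find_previous` returns None when no heading precedes the link, the guard avoids an AttributeError in that case, whereas the `[0]` indexing in the answer above would raise an IndexError.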