How to extract text from a tag containing pseudo-element

Question
I'm trying to scrape a website. I want to extract only the location (text) from a span tag containing a pseudo-element (::after), nested within parent div tags, like so:
import requests
from bs4 import BeautifulSoup
import re

url = 'https://some website'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

locations = soup.find_all("span", class_=re.compile("text$"))
for location in locations:
    print(location.text)
I also think that not all the div tags contain the location tag. The code gives no output and returns no error, but the expected output would be, for example, 'Lagos, Lekki' among others. Any method is appreciated.
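To illustrate why the selector itself is not necessarily the problem (a minimal sketch with made-up HTML; the real class names on the site are unknown): BeautifulSoup only sees the markup that requests actually downloaded, so text that the browser fills in later via JavaScript is simply absent from the parse tree, even though the same regex filter matches the span:

```python
from bs4 import BeautifulSoup
import re

# Hypothetical markup: what the browser shows after JavaScript runs ...
rendered = '<div><span class="advert-location-text">Lagos, Lekki</span></div>'
# ... versus the empty shell that requests receives before any script runs
raw = '<div><span class="advert-location-text"></span></div>'

for html in (rendered, raw):
    soup = BeautifulSoup(html, 'html.parser')
    spans = soup.find_all('span', class_=re.compile('text$'))
    # The class regex matches in both cases, but only the rendered
    # version contains any text to extract.
    print([s.text for s in spans])
```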
Answer 1
Score: 1
The content is loaded via an additional API call and then rendered by the browser, a behavior that is not supported directly by requests. Since the content is not present in the response data, BeautifulSoup cannot find it either.

To get the data, call the API directly and pick the information you need from the JSON response.
Example
import requests

# list to hold all results
data = []

# for multiple pages, iterate in range from - to
for i in range(0, 2):
    # increase page number with each iteration
    url = f'https://jiji.ng/api_web/v1/listing?slug=mobile-phones&init_page=true&page={i}'
    # extend list with all items per page
    data.extend(requests.get(url).json()['adverts_list']['adverts'])

for item in data:
    print(item.get('region_item_text'))
Output
Lagos, Lekki
Lagos, Lekki
Lagos, Ikeja
Lagos, Ikeja
Lagos, Ikeja
Oyo, Ibadan
Oyo, Ibadan
Abuja, Kubwa
Imo, Owerri
Lagos, Ikeja
Oyo, Oyo / Oyo State
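For reference, the nested JSON shape that the loop above indexes can be sketched with a made-up sample (only the keys used in the answer; the real response carries many more fields per advert). Using item.get('region_item_text') returns None instead of raising when an advert lacks the key, which matches the asker's observation that not every item carries a location:

```python
# Hypothetical sample mirroring the keys used in the answer; the real
# API response contains many more fields per advert.
sample = {
    'adverts_list': {
        'adverts': [
            {'title': 'Phone A', 'region_item_text': 'Lagos, Lekki'},
            {'title': 'Phone B', 'region_item_text': 'Oyo, Ibadan'},
            {'title': 'Phone C'},  # no location on this advert
        ]
    }
}

adverts = sample['adverts_list']['adverts']
# dict.get returns None for missing keys instead of raising KeyError
locations = [item.get('region_item_text') for item in adverts]
print(locations)  # ['Lagos, Lekki', 'Oyo, Ibadan', None]
```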
Comments