How to extract text from a tag containing pseudo-element

Question

I'm trying to scrape a website. I want to extract only the location (text) from a span tag that contains a pseudo-element (::after) and is nested inside other parent div tags, like this:

[Image: "How to extract text from a tag containing pseudo-element"]

import requests
from bs4 import BeautifulSoup
import re

url = 'https://some website'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# find every span whose class name ends with "text"
locations = soup.find_all("span", class_=re.compile("text$"))
for location in locations:
    print(location.text)

I also think that not all of the div tags contain the location tag. The code produces no output and raises no error, but the expected output would be, for example, 'Lagos, Lekki' among others. Any method is appreciated.

Answer 1

Score: 1

The content is loaded via an additional API call and then rendered by the browser, a behavior that requests does not support directly. Because the content is not in the response data, BeautifulSoup cannot find it either.
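To illustrate why the search comes back empty, here is a minimal offline sketch. Both HTML strings and the class name `region-text` are made up for the demonstration and are not taken from the real site. It uses only the standard library (`html.parser`) rather than BeautifulSoup, so it runs without extra packages, but it applies the same "class ends with text" match.

```python
import re
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of <span> tags whose class matches a pattern."""

    def __init__(self, class_pattern):
        super().__init__()
        self.class_pattern = re.compile(class_pattern)
        self.depth = 0   # > 0 while we are inside a matching span
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            # nested span inside a matching span: just track the nesting
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if any(self.class_pattern.search(c) for c in classes):
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def span_texts(html, class_pattern=r"text$"):
    parser = SpanTextCollector(class_pattern)
    parser.feed(html)
    parser.close()
    return [t.strip() for t in parser.texts]


# Hypothetical markup: what the server sends before any JavaScript runs
static_html = '<div id="app"></div><script src="/bundle.js"></script>'
# Hypothetical markup: what the browser shows after rendering
rendered_html = '<div class="advert"><span class="region-text">Lagos, Lekki</span></div>'

print(span_texts(static_html))    # []
print(span_texts(rendered_html))  # ['Lagos, Lekki']
```

The first call mirrors what requests receives: the spans simply are not there yet, so there is nothing to match and no error to raise.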

To get the data, call the API directly and pick the information you need from the JSON response.
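As a rough sketch of the "pick what you need" step, the helper below pulls the location text out of a payload with the `adverts_list` / `adverts` / `region_item_text` shape used in this answer. The sample data and the `title` key are invented for illustration, and `extract_locations` is a hypothetical helper name. It skips adverts that have no location, which covers the case mentioned in the question.

```python
import json


def extract_locations(payload):
    """Return the location text of each advert, skipping adverts without one."""
    adverts = payload.get("adverts_list", {}).get("adverts", [])
    return [a["region_item_text"] for a in adverts if a.get("region_item_text")]


# Hypothetical sample shaped like the JSON response used in this answer
sample = json.loads("""
{"adverts_list": {"adverts": [
    {"title": "Phone A", "region_item_text": "Lagos, Lekki"},
    {"title": "Phone B"},
    {"title": "Phone C", "region_item_text": "Oyo, Ibadan"}
]}}
""")

print(extract_locations(sample))  # ['Lagos, Lekki', 'Oyo, Ibadan']
```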

Example

import requests

# list to hold the adverts from all pages
data = []

# iterate over the page numbers to fetch (here: pages 0 and 1)
for page in range(0, 2):
    # the page number is passed in the query string
    url = f'https://jiji.ng/api_web/v1/listing?slug=mobile-phones&init_page=true&page={page}'
    response = requests.get(url)
    response.raise_for_status()  # stop early on an HTTP error
    # extend the list with all adverts on this page
    data.extend(response.json()['adverts_list']['adverts'])

# print the location text of each advert
for item in data:
    print(item.get('region_item_text'))

Output

Lagos, Lekki
Lagos, Lekki
Lagos, Ikeja
Lagos, Ikeja
Lagos, Ikeja
Oyo, Ibadan
Oyo, Ibadan
Abuja, Kubwa
Imo, Owerri
Lagos, Ikeja
Oyo, Oyo / Oyo State
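
If the number of pages is not known in advance, one possible variation is to keep fetching until a page comes back empty. This is only a sketch: that the API returns an empty `adverts` list past the last page is an assumption, and `collect_adverts`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names with made-up data standing in for the network call.

```python
def collect_adverts(fetch_page, max_pages):
    """Gather adverts page by page, stopping early at the first empty page."""
    adverts = []
    for page in range(max_pages):
        payload = fetch_page(page)
        batch = payload.get("adverts_list", {}).get("adverts", [])
        if not batch:
            break
        adverts.extend(batch)
    return adverts


# Stand-in for the network call, with made-up data
FAKE_PAGES = {
    0: {"adverts_list": {"adverts": [{"region_item_text": "Lagos, Ikeja"}]}},
    1: {"adverts_list": {"adverts": [{"region_item_text": "Abuja, Kubwa"}]}},
    2: {"adverts_list": {"adverts": []}},
}


def fake_fetch(page):
    return FAKE_PAGES.get(page, {"adverts_list": {"adverts": []}})


locations = [a.get("region_item_text") for a in collect_adverts(fake_fetch, max_pages=10)]
print(locations)  # ['Lagos, Ikeja', 'Abuja, Kubwa']
```

With requests, `fetch_page` could be a small function that builds the URL for the given page number and returns `requests.get(url).json()`.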

huangapple
  • Posted on February 27, 2023, 00:13:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75573299.html