How to extract text from a tag containing a pseudo-element

Question

I'm trying to scrape a website. I want to extract only the location text from span tags that contain a pseudo-element (::after) and sit inside parent div tags, like this:

import requests
from bs4 import BeautifulSoup
import re

url = 'https://some website'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
locations = soup.find_all("span", class_=re.compile("text$"))
for location in locations:
    print(location.text)

I also suspect that not all of the div tags contain the location tag. The code gives no output and raises no error, but the expected output would be values like 'Lagos, Lekki'. Any approach is appreciated.
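Note that even when a span is present in the static HTML, text generated by a ::after rule lives only in the stylesheet, not in the markup, so a parser cannot extract it. A minimal sketch with an inline HTML string (illustrative, not the real site) demonstrates this:

```python
from bs4 import BeautifulSoup

# The span's visible text would come from CSS (::after), not from the
# markup, so the parser sees an empty element.
html = """
<style>span.location-text::after { content: "Lagos, Lekki"; }</style>
<div><span class="location-text"></span></div>
"""
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="location-text")
print(repr(span.text))  # '' - the location exists only in the stylesheet
```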


Answer 1

Score: 1

The content is loaded via an additional API call and then rendered by the browser, a behavior that requests does not support directly. Since the content is not present in the response data, BeautifulSoup cannot find it either.

To get the data, call the API directly and use the JSON response to pick out the information you need.

Example


import requests

# list to hold results from all pages
data = []

# for multiple pages, iterate over the desired page range
for i in range(0, 2):
    # the page number increases with each iteration
    url = f'https://jiji.ng/api_web/v1/listing?slug=mobile-phones&init_page=true&page={i}'
    # extend the list with all items from this page
    data.extend(requests.get(url).json()['adverts_list']['adverts'])

for item in data:
    print(item.get('region_item_text'))

Output


Lagos, Lekki
Lagos, Lekki
Lagos, Ikeja
Lagos, Ikeja
Lagos, Ikeja
Oyo, Ibadan
Oyo, Ibadan
Abuja, Kubwa
Imo, Owerri
Lagos, Ikeja
Oyo, Oyo / Oyo State
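The JSON shape the loop above relies on can be checked offline. Assuming the response nests items under `adverts_list` → `adverts` and that each item may carry a `region_item_text` field, the extraction reduces to:

```python
# A minimal offline sketch of the JSON shape the code above assumes.
# The sample payload below is illustrative, not a real API response.
sample_response = {
    "adverts_list": {
        "adverts": [
            {"title": "Phone A", "region_item_text": "Lagos, Lekki"},
            {"title": "Phone B", "region_item_text": "Oyo, Ibadan"},
        ]
    }
}

adverts = sample_response["adverts_list"]["adverts"]
# .get() returns None instead of raising KeyError for adverts
# that lack a location field
locations = [item.get("region_item_text") for item in adverts]
print(locations)  # ['Lagos, Lekki', 'Oyo, Ibadan']
```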

huangapple
  • Published on 2023-02-27 00:13:50
  • Please keep this link when reposting: https://go.coder-hub.com/75573299.html