Scraping USA Today Using the Wayback Machine
Question
I am trying to scrape the news title and its URL from the USA Today website at:
https://web.archive.org/web/20220101001435/https://www.usatoday.com/
Excerpt of the soup it returned that I need:
<a class="section-helper-flex section-helper-row ten-column spacer-small p1-container" data-index="2" data-module-name="promo-story-thumb-small" href="https://www.usatoday.com/story/sports/nfl/2021/12/31/nfl-predictions-wrong-worst-preseason-mvp-super-bowl/9061437002/" onclick="firePromoAnalytics(event)"><div class="section-helper-flex section-helper-column p1-text-wrap"><div class="promo-premium-content-label-wrap"></div><div class="p1-title"><div class="p1-title-spacer">Revisiting our worst NFL predictions of 2021: What went wrong?</div></div><div class="p1-info-wrap"><span class="p1-label">NFL</span>
I don't know how to take specific excerpts like this out of the whole soup; I assume it should work with something like:
links = soup.find_all('a', href=True)
Code:
import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import requests

url = 'https://web.archive.org/web/20220101001435/https://www.usatoday.com/'
options = Options()
options.headless = False
driver = webdriver.Chrome(options=options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")
links = soup.find_all('a', href=True)
print(links)
I expect an output of a 2-D array with the first element being the title and the second element being the link, something like:
[...[Revisiting our worst NFL predictions of 2021: What went wrong?,https://www.usatoday.com/story/sports/nfl/2021/12/31/nfl-predictions-wrong-worst-preseason-mvp-super-bowl/9061437002/],...]
Answer 1
Score: 0
soup.find_all returns a list of bs4.element.Tag, not a list of title and href. Specifying href=True merely makes sure that the selected a elements have an href attribute.
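To make that concrete, here is a standalone sketch (my own, using a made-up two-link snippet rather than the USA Today page) showing that each element of the returned list is a Tag whose text and href you still have to extract yourself:

from bs4 import BeautifulSoup

# Tiny made-up document: one <a> with an href, one without.
demo_html = '<a href="/story/1">First story</a> <a name="anchor">No href here</a>'
demo_soup = BeautifulSoup(demo_html, "html.parser")

links = demo_soup.find_all("a", href=True)  # keeps only the <a> tags that carry an href
print(type(links[0]))                       # <class 'bs4.element.Tag'>
print(links[0].text, links[0]["href"])      # First story /story/1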
So here is the complete code that will do what you want:
import requests
from bs4 import BeautifulSoup

# 1. Fetch the HTML and parse it.
url = "https://web.archive.org/web/20220101001435/https://www.usatoday.com/"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")

# 2. Find only the <a> elements that wrap an article promo.
links = soup.find_all("a", class_="section-helper-flex", href=True)

results = []
for link in links:
    # 3. Pull the title text and the URL out of each <a>.
    title = link.find("div", class_="p1-title-spacer")
    href = link["href"]
    results.append([title.text, href])

print(results)
The process is as follows:
1. Get the HTML source. You don't need selenium for this; the requests library is enough. After that, parse it with bs4.
2. Find all the <a> elements. Your original filter was way too generous and would get every single link on the page, so I modified it so it will only yield the elements that contain articles.
3. Finally, extract the title and the URL from inside each a element (a more defensive variation is sketched after this list).
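One caveat worth noting (my own addition, not something the answer above claims): link.find returns None when an <a> does not contain a p1-title-spacer div, and title.text would then raise an AttributeError. A defensive variation of step 3, reusing the soup object built above:

results = []
for link in soup.find_all("a", class_="section-helper-flex", href=True):
    title = link.find("div", class_="p1-title-spacer")
    if title is None:
        # Skip promo links that don't wrap a headline div.
        continue
    results.append([title.text.strip(), link["href"]])
print(results)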
Answer 2
Score: 0
This does exactly what you want:
titles = [title.text for title in soup.select("div.p1-title > div.p1-title-spacer")]
urls = [link["href"] for link in soup.select("a.section-helper-flex.section-helper-row.ten-column.spacer-small.p1-container")]
output = list(zip(titles, urls))
print(output)
What you were doing wrong is that soup.find_all returns a list of bs4.element.Tags, not lists containing the text and href.
I used a list comprehension which sought out all the divs of class p1-title-spacer that are children of a div of class p1-title and extracted the text from each of them. Similarly, I used a CSS selector to extract the URLs too.
I recommend that you read the documentation for find_all.
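For comparison (my own sketch on a made-up one-element snippet, not code taken from the answer), a select call with a child combinator and a find_all call with an explicit parent check express the same query:

from bs4 import BeautifulSoup

demo = BeautifulSoup('<div class="p1-title"><div class="p1-title-spacer">Headline</div></div>', "html.parser")

# CSS child combinator: p1-title-spacer divs directly under a p1-title div.
via_select = [d.text for d in demo.select("div.p1-title > div.p1-title-spacer")]

# Roughly equivalent find_all lookup, checking the parent's class by hand.
via_find_all = [d.text for d in demo.find_all("div", class_="p1-title-spacer")
                if d.parent.get("class") == ["p1-title"]]

print(via_select, via_find_all)  # ['Headline'] ['Headline']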
Alternatively, you could do this without using BeautifulSoup at all, purely using Selenium:
from selenium.webdriver.common.by import By

titles = [title.text for title in driver.find_elements(By.CSS_SELECTOR, "div.p1-title > div.p1-title-spacer")]
urls = [link.get_attribute("href") for link in driver.find_elements(By.CSS_SELECTOR, "a.section-helper-flex.section-helper-row.ten-column.spacer-small.p1-container")]
output = list(zip(titles, urls))
print(output)
You could also look into WebDriverWait to ensure that you don't look for elements before they've loaded.
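A minimal sketch of what that wait could look like (my addition; it assumes the driver object from the code below, and the 15-second timeout and the promo-link selector are arbitrary choices, not part of the original answer):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 15 seconds until at least one promo <a> is present in the DOM.
WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "a.section-helper-flex.p1-container"))
)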
Here's the whole thing:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://web.archive.org/web/20220101001435/https://www.usatoday.com/'
options = Options()
options.headless = False
driver = webdriver.Chrome(options=options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")

titles = [title.text for title in soup.select("div.p1-title > div.p1-title-spacer")]
urls = [link["href"] for link in soup.select("a.section-helper-flex.section-helper-row.ten-column.spacer-small.p1-container")]
output = list(zip(titles, urls))
print(output)