
Scraping USA Today Using the Wayback Machine

Question


I am trying to scrape the news titles and their URLs from the USA Today website at:
https://web.archive.org/web/20220101001435/https://www.usatoday.com/

Here is an excerpt of the returned soup that contains what I need:

<a class="section-helper-flex section-helper-row ten-column spacer-small p1-container" data-index="2" data-module-name="promo-story-thumb-small" href="https://www.usatoday.com/story/sports/nfl/2021/12/31/nfl-predictions-wrong-worst-preseason-mvp-super-bowl/9061437002/" onclick="firePromoAnalytics(event)"><div class="section-helper-flex section-helper-column p1-text-wrap"><div class="promo-premium-content-label-wrap"></div><div class="p1-title"><div class="p1-title-spacer">Revisiting our worst NFL predictions of 2021: What went wrong?</div></div><div class="p1-info-wrap"><span class="p1-label">NFL</span>

I don't know how to pull specific excerpts like this out of the whole soup; I assume it should involve something like:

links = soup.find_all('a', href=True)

Code:

import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import requests

url = 'https://web.archive.org/web/20220101001435/https://www.usatoday.com/'
options = Options()
options.headless = False
driver = webdriver.Chrome(options=options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")

links = soup.find_all('a', href=True)
print(links)

I expect the output to be a 2-D array where the first element of each pair is the title and the second is the link, something like:

[...[Revisiting our worst NFL predictions of 2021: What went wrong?,https://www.usatoday.com/story/sports/nfl/2021/12/31/nfl-predictions-wrong-worst-preseason-mvp-super-bowl/9061437002/],...]

Answer 1

Score: 0


soup.find_all returns a list of bs4.element.Tag objects, not a list of titles and hrefs. The href=True argument merely ensures that the selected a elements have an href attribute.
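
For example, a quick sketch of what you actually get back (assuming soup has been built exactly as in your code):

first = soup.find_all('a', href=True)[0]   # a bs4.element.Tag, not a [title, url] pair
print(type(first))                  # <class 'bs4.element.Tag'>
print(first['href'])                # the link target
print(first.get_text(strip=True))   # all of the text inside the anchor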

So here is the complete code that will do what you want:

import requests
from bs4 import BeautifulSoup

# 1.
url = "https://web.archive.org/web/20220101001435/https://www.usatoday.com/"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
# 2.
links = soup.find_all("a", class_="section-helper-flex", href=True)

results = []
for link in links:
    # 3.
    title = link.find("div", class_="p1-title-spacer")
    href = link["href"]
    results.append([title.text, href])

print(results)

The process is as follows:

  1. Get the HTML source; you don't need Selenium for this, the requests library is enough. Then parse it with bs4.
  2. Find all the <a> elements. Your original filter was far too generous and would grab every single link on the page, so I narrowed it so that it only yields the elements that contain articles.
  3. Finally, extract the title and the URL from inside each a element (a slightly defensive variant of this step is sketched after this list).
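
If one of the matching a elements happens to have no p1-title-spacer div, link.find(...) returns None and title.text raises an AttributeError. A minimal defensive variant of the extraction loop (same selectors as above, just with a guard added):

results = []
for link in links:
    title = link.find("div", class_="p1-title-spacer")
    if title is None:   # skip anchors that carry no headline
        continue
    results.append([title.text, link["href"]])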

Answer 2

Score: 0


This does exactly what you want:

titles = [title.text for title in soup.select("div.p1-title > div.p1-title-spacer")]
urls = [link["href"] for link in soup.select("a.section-helper-flex.section-helper-row.ten-column.spacer-small.p1-container")]
output = list(zip(titles, urls))
print(output)

What you were doing wrong is that soup.find_all returns a list of bs4.element.Tag objects, not lists containing the text and href.

I used a list comprehension that finds all divs of class p1-title-spacer that are children of a div with class p1-title, and extracts the text from each of them. Similarly, I used a CSS selector to extract the URLs.

I recommend that you read the documentation for find_all.

Alternatively, you could do this without using BeautifulSoup at all, purely with Selenium:

from selenium.webdriver.common.by import By

titles = [title.text for title in driver.find_elements(By.CSS_SELECTOR, "div.p1-title > div.p1-title-spacer")]
urls = [link.get_attribute("href") for link in driver.find_elements(By.CSS_SELECTOR, "a.section-helper-flex.section-helper-row.ten-column.spacer-small.p1-container")]
output = list(zip(titles, urls))
print(output)

You could also look into WebDriverWait to ensure that you don't look for elements before they've loaded.
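
For example, a minimal sketch of that idea (the 10-second timeout is an arbitrary choice; the selector is the same one used above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

article_selector = "a.section-helper-flex.section-helper-row.ten-column.spacer-small.p1-container"
# Wait up to 10 seconds for at least one article anchor to be present in the DOM.
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, article_selector))
)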

Here's the whole thing:

import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import requests

url = 'https://web.archive.org/web/20220101001435/https://www.usatoday.com/'
options = Options()
options.headless = False
driver = webdriver.Chrome(options=options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")
titles = [title.text for title in soup.select("div.p1-title > div.p1-title-spacer")]
urls = [link["href"] for link in soup.select("a.section-helper-flex.section-helper-row.ten-column.spacer-small.p1-container")]
output = list(zip(titles, urls))
print(output)
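
One thing to keep in mind with this version: zip pairs the titles and URLs purely by position, so if some promo anchor has no p1-title-spacer div (or a title div sits outside a matched anchor), the pairs can drift out of alignment. A sketch that keeps each title with its own link by selecting within each anchor instead:

output = []
for link in soup.select("a.section-helper-flex.section-helper-row.ten-column.spacer-small.p1-container"):
    title = link.select_one("div.p1-title-spacer")
    if title is not None:   # only keep anchors that actually carry a headline
        output.append([title.text, link["href"]])
print(output)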
