
Scraping USA Today Using the Wayback Machine

Question


I am trying to scrape the news titles and their URLs from the USA Today website at:
https://web.archive.org/web/20220101001435/https://www.usatoday.com/

Here is an excerpt of the returned soup that contains what I need:

<a class="section-helper-flex section-helper-row ten-column spacer-small p1-container" data-index="2" data-module-name="promo-story-thumb-small" href="https://www.usatoday.com/story/sports/nfl/2021/12/31/nfl-predictions-wrong-worst-preseason-mvp-super-bowl/9061437002/" onclick="firePromoAnalytics(event)"><div class="section-helper-flex section-helper-column p1-text-wrap"><div class="promo-premium-content-label-wrap"></div><div class="p1-title"><div class="p1-title-spacer">Revisiting our worst NFL predictions of 2021: What went wrong?</div></div><div class="p1-info-wrap"><span class="p1-label">NFL</span>

I don't know how to pull specific excerpts like this out of the whole soup; I assume it should involve something like:

links = soup.find_all('a', href=True)

Code:

import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import requests

url = 'https://web.archive.org/web/20220101001435/https://www.usatoday.com/'
options = Options()
options.headless = False
driver = webdriver.Chrome(options=options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")

links = soup.find_all('a', href=True)
print(links)

I expect the output to be a 2-D array where the first element of each pair is the title and the second is the link, something like:

[...[Revisiting our worst NFL predictions of 2021: What went wrong?,https://www.usatoday.com/story/sports/nfl/2021/12/31/nfl-predictions-wrong-worst-preseason-mvp-super-bowl/9061437002/],...]

Answer 1

Score: 0


soup.find_all returns a list of bs4.element.Tag objects, not a list of titles and hrefs. The href=True argument merely ensures that the selected a elements have an href attribute.
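
For example, a quick sketch of what you actually get back (assuming soup has been built exactly as in your code):

first = soup.find_all('a', href=True)[0]   # a bs4.element.Tag, not a [title, url] pair
print(type(first))                  # <class 'bs4.element.Tag'>
print(first['href'])                # the link target
print(first.get_text(strip=True))   # all of the text inside the anchor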

So here is the complete code that will do what you want:

import requests
from bs4 import BeautifulSoup

# 1.
url = "https://web.archive.org/web/20220101001435/https://www.usatoday.com/"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
# 2.
links = soup.find_all("a", class_="section-helper-flex", href=True)

results = []
for link in links:
    # 3.
    title = link.find("div", class_="p1-title-spacer")
    href = link["href"]
    results.append([title.text, href])

print(results)

The process is as follows:

  1. Get the HTML source; you don't need Selenium for this, the requests library is enough. Then parse it with bs4.
  2. Find all the <a> elements. Your original filter was far too generous and would grab every single link on the page, so I narrowed it so that it only yields the elements that contain articles.
  3. Finally, extract the title and the URL from inside each a element (a slightly defensive variant of this step is sketched after this list).
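
If one of the matching a elements happens to have no p1-title-spacer div, link.find(...) returns None and title.text raises an AttributeError. A minimal defensive variant of the extraction loop (same selectors as above, just with a guard added):

results = []
for link in links:
    title = link.find("div", class_="p1-title-spacer")
    if title is None:   # skip anchors that carry no headline
        continue
    results.append([title.text, link["href"]])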

Answer 2

Score: 0


This does exactly what you want:

titles = [title.text for title in soup.select("div.p1-title > div.p1-title-spacer")]
urls = [link["href"] for link in soup.select("a.section-helper-flex.section-helper-row.ten-column.spacer-small.p1-container")]
output = list(zip(titles, urls))
print(output)

What you were doing wrong is that soup.find_all returns a list of bs4.element.Tag objects, not lists containing the text and href.

I used a list comprehension that finds all divs of class p1-title-spacer that are children of a div with class p1-title, and extracts the text from each of them. Similarly, I used a CSS selector to extract the URLs.

I recommend that you read the documentation for find_all.

Alternatively, you could do this without using BeautifulSoup at all, purely with Selenium:

from selenium.webdriver.common.by import By

titles = [title.text for title in driver.find_elements(By.CSS_SELECTOR, "div.p1-title > div.p1-title-spacer")]
urls = [link.get_attribute("href") for link in driver.find_elements(By.CSS_SELECTOR, "a.section-helper-flex.section-helper-row.ten-column.spacer-small.p1-container")]
output = list(zip(titles, urls))
print(output)

You could also look into WebDriverWait to ensure that you don't look for elements before they've loaded.
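
For example, a minimal sketch of that idea (the 10-second timeout is an arbitrary choice; the selector is the same one used above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

article_selector = "a.section-helper-flex.section-helper-row.ten-column.spacer-small.p1-container"
# Wait up to 10 seconds for at least one article anchor to be present in the DOM.
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, article_selector))
)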

Here's the whole thing:

import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import requests

url = 'https://web.archive.org/web/20220101001435/https://www.usatoday.com/'
options = Options()
options.headless = False
driver = webdriver.Chrome(options=options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")
titles = [title.text for title in soup.select("div.p1-title > div.p1-title-spacer")]
urls = [link["href"] for link in soup.select("a.section-helper-flex.section-helper-row.ten-column.spacer-small.p1-container")]
output = list(zip(titles, urls))
print(output)
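
One thing to keep in mind with this version: zip pairs the titles and URLs purely by position, so if some promo anchor has no p1-title-spacer div (or a title div sits outside a matched anchor), the pairs can drift out of alignment. A sketch that keeps each title with its own link by selecting within each anchor instead:

output = []
for link in soup.select("a.section-helper-flex.section-helper-row.ten-column.spacer-small.p1-container"):
    title = link.select_one("div.p1-title-spacer")
    if title is not None:   # only keep anchors that actually carry a headline
        output.append([title.text, link["href"]])
print(output)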
