How to scrape a table using BeautifulSoup when it only has summary and width?

Question

I am trying to scrape a table from this site: http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm

This table has no id or class and only contains summary and width attributes. Is there any way to scrape this table? Perhaps with XPath?

I heard that XPath is not compatible with BeautifulSoup and hope that is wrong.

<table width="100%" cellpadding="3" border="1" summary="Layout showing RecallTest table with 6 columns: Date,Brand Name,Product Description,Reason/Problem,Company,Details/Photo" style="margin-bottom:28px">
		  <thead>
			<tr>
					<th scope="col" data-type="numeric" data-toggle="true"> Date </th>
			</tr>
		  </thead>
		  <tbody>

Here is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

link = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
page = 15
pdf = []
for p in range(1, page+1):
    l = link + '?page=' + str(p)
    # Downloading contents of the web page
    data = requests.get(l).text
    # Creating BeautifulSoup object
    soup = BeautifulSoup(data, 'html.parser')
    tables = soup.find_all('table')
    table = soup.find('table', INSERT XPATH EXPRESSION)
    df = pd.DataFrame(columns=['date', 'brand', 'descr', 'reason', 'company'])
    for row in table.tbody.find_all('tr'):
        # Find all data for each column
        columns = row.find_all('td')
        if columns != []:
            date = columns[0].text.strip()
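
For reference, BeautifulSoup does not evaluate XPath expressions, but it can match a tag by arbitrary attributes, so the summary attribute alone is enough to locate this table; a minimal sketch of what could go in place of the placeholder above, assuming the summary value shown in the HTML snippet:

import requests
from bs4 import BeautifulSoup

link = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
soup = BeautifulSoup(requests.get(link).text, 'html.parser')

# Match the table on its summary attribute; no id or class is needed.
table = soup.find('table', attrs={'summary': lambda v: v and 'RecallTest' in v})
# Equivalent CSS attribute selector (substring match on summary):
# table = soup.select_one('table[summary*="RecallTest"]')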

Answer 1

Score: 1

For scraping tables it is best practice to use `pandas.read_html()`, which covers 95% of all cases. Simply iterate over the pages and concat the dataframes:

import pandas as pd

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

pd.concat(
    [pd.read_html(url+'?page='+str(i))[0] for i in range(1,16)],
    ignore_index=True
)

Note that you can also include the links via `extract_links='body'`; a short sketch follows the table below.

This will result in:

|    | Date       | Brand Name  | Product Description  | Reason/Problem  | Company  | Details/Photo |
|---:|:-----------|:------------|:--------------------|:----------------|:---------|---------------|
|  0 | 12/31/2015 | PharMEDium  | Norepinephrine Bitartrate added to Sodium Chloride  | Discoloration  | PharMEDium Services, LLC  | nan |
|  1 | 12/31/2015 | Thomas Produce  | Cucumbers  | Salmonella  | Thomas Produce Company  | nan |
|  2 | 12/28/2015 | Wegmans, Uoriki Fresh  | Octopus Salad  | Listeria monocytogenes  | Uoriki Fresh, Inc.  | nan |
| ...
| 433 | 01/05/2015 | Whole Foods Market  | Assorted cookie platters  | Undeclared tree nuts  | Whole Foods Market  | nan |
| 434 | 01/05/2015 | Eillien's, Blain's Farms and Fleet & more  | Walnut Pieces  | Salmonella contamination  | Eilliens Candies Inc.  | nan |
| 435 | 01/02/2015 | Full Tilt Ice Cream  | Ice Cream  | Listeria monocytogenes  | Full Tilt Ice Cream  | nan |
| 436 | 01/02/2015 | Zilks  | Hummus  | Undeclared peanuts  | Zilks Foods  | nan |
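
A minimal sketch of the `extract_links='body'` variant mentioned above, assuming pandas 1.5 or newer; the tuple handling at the end is illustrative:

import pandas as pd

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

# With extract_links='body' every body cell comes back as a (text, href) tuple,
# so the links behind e.g. the Details/Photo column survive the parse.
df = pd.concat(
    [pd.read_html(url + '?page=' + str(i), extract_links='body')[0] for i in range(1, 16)],
    ignore_index=True
)

# Pull the href part out of the tuples in the last column.
links = df[df.columns[-1]].map(lambda cell: cell[1] if isinstance(cell, tuple) else None)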

Based on your manual approach, simply select the first table, iterate over the rows and store the information in a list of dicts, which can then be converted into a dataframe:

import requests
from bs4 import BeautifulSoup

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

data = []

for i in range(1,16):
    soup = BeautifulSoup(requests.get(url+'?page='+str(i)).text, 'html.parser')
    # first table on the page; keep only rows that actually contain <td> cells
    for e in soup.table.select('tr:has(td)'):
        data.append({
            'date': e.td.text,
            'any other': 'column',
            'link': e.a.get('href')
        })

data
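
As a usage note, the list of dicts can then be turned into a dataframe exactly as described above; a minimal sketch:

import pandas as pd

# Convert the collected rows into a dataframe.
df = pd.DataFrame(data)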

Answer 2

Score: -2

There are just two lines of code needed to get the table from the website:

import requests
import pandas as pd
url = "http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm"
tables = pd.read_html(requests.get(url).text)
print(tables[0])

You have to use two modules: requests and pandas.

You can read more about the pandas.read_html function here.
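
If a page contained more than one table, read_html could also be pointed at this specific one; a minimal sketch using its match / attrs parameters, with filter values taken from the HTML shown in the question:

import requests
import pandas as pd

url = "http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm"
html = requests.get(url).text

# match= keeps only the tables whose text matches the given string/regex ...
recalls = pd.read_html(html, match="Brand Name")[0]

# ... while attrs= filters on the table's own HTML attributes.
recalls = pd.read_html(html, attrs={"width": "100%"})[0]

print(recalls.head())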
