How to scrape a table using BeautifulSoup when it only has summary and width?
Question
I am trying to scrape a table from this site: http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm
This table has no `id` or `class` and only contains `summary` and `width` attributes. Is there any way to scrape this table? Perhaps XPath? I heard that XPath is not compatible with BeautifulSoup and hope that is wrong.
```html
<table width="100%" cellpadding="3" border="1" summary="Layout showing RecallTest table with 6 columns: Date,Brand Name,Product Description,Reason/Problem,Company,Details/Photo" style="margin-bottom:28px">
  <thead>
    <tr>
      <th scope="col" data-type="numeric" data-toggle="true"> Date </th>
    </tr>
  </thead>
  <tbody>
```
Here is my code:
```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

link = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
page = 15
pdf = []
for p in range(1, page + 1):
    l = link + '?page=' + str(p)
    # Downloading contents of the web page
    data = requests.get(l).text
    # Creating BeautifulSoup object
    soup = BeautifulSoup(data, 'html.parser')
    tables = soup.find_all('table')
    table = soup.find('table', INSERT XPATH EXPRESSION)  # placeholder: how do I locate the table here?
    df = pd.DataFrame(columns=['date', 'brand', 'descr', 'reason', 'company'])
    for row in table.tbody.find_all('tr'):
        # Find all data for each column
        columns = row.find_all('td')
        if columns != []:
            date = columns[0].text.strip()
```
Answer 1

Score: 1
For scraping tables it is best practice to use `pandas.read_html()`, which covers 95% of all cases. Simply iterate over the pages and concat the dataframes:
```python
import pandas as pd

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

pd.concat(
    [pd.read_html(url + '?page=' + str(i))[0] for i in range(1, 16)],
    ignore_index=True
)
```
Note that you can also include links via `extract_links='body'`; a short sketch of that follows after the table below.

This will result in:
|     | Date       | Brand Name | Product Description | Reason/Problem | Company | Details/Photo |
|----:|:-----------|:-----------|:--------------------|:---------------|:--------|:--------------|
| 0   | 12/31/2015 | PharMEDium | Norepinephrine Bitartrate added to Sodium Chloride | Discoloration | PharMEDium Services, LLC | nan |
| 1   | 12/31/2015 | Thomas Produce | Cucumbers | Salmonella | Thomas Produce Company | nan |
| 2   | 12/28/2015 | Wegmans, Uoriki Fresh | Octopus Salad | Listeria monocytogenes | Uoriki Fresh, Inc. | nan |
| ... |            |            |                     |                |         |               |
| 433 | 01/05/2015 | Whole Foods Market | Assorted cookie platters | Undeclared tree nuts | Whole Foods Market | nan |
| 434 | 01/05/2015 | Eillien's, Blain's Farms and Fleet & more | Walnut Pieces | Salmonella contamination | Eillien’s Candies Inc. | nan |
| 435 | 01/02/2015 | Full Tilt Ice Cream | Ice Cream | Listeria monocytogenes | Full Tilt Ice Cream | nan |
| 436 | 01/02/2015 | Zilks | Hummus | Undeclared peanuts | Zilks Foods | nan |
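A minimal sketch of the `extract_links='body'` option mentioned above, assuming pandas ≥ 1.5 (where the parameter exists) and that the last column header reads exactly `Details/Photo` as in the table shown:

```python
import pandas as pd

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

# with extract_links='body' every body cell comes back as a (text, href) tuple
df = pd.read_html(url + '?page=1', extract_links='body')[0]

# pull the href out of the last column, then unwrap the text from every cell
links = df['Details/Photo'].map(lambda cell: cell[1])
df = df.apply(lambda col: col.map(lambda cell: cell[0]))
df['link'] = links

print(df.head())
```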
Based on your manual approach, simply select the first table, iterate over the rows and store the information in a list of dicts, which can then be converted into a dataframe:
```python
import requests
from bs4 import BeautifulSoup

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
data = []

for i in range(1, 16):
    soup = BeautifulSoup(requests.get(url + '?page=' + str(i)).text)
    for e in soup.table.select('tr:has(td)'):
        data.append({
            'date': e.td.text,
            'any other': 'column',
            'link': e.a.get('href')
        })

data
```
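The final conversion is then a one-liner; a sketch continuing from the loop above, so `data` is the list of dicts it built:

```python
import pandas as pd

# the list of dicts collected in the loop above maps straight onto a DataFrame
df = pd.DataFrame(data)
print(df.head())
```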
Answer 2

Score: -2
There are just two lines of code to get the table from the website:
```python
import requests
import pandas as pd

url = "http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm"
tables = pd.read_html(requests.get(url).text)
print(tables[0])
```
You have to use two modules, `requests` and `pandas`. You can read more about the `pandas.read_html` function here.
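If you ever need to pick out this specific table rather than just the first one, both tools can match on the `summary` attribute from the question, since neither needs an `id` or `class`. A minimal sketch; the regex and the `attrs=` value below are taken from the markup shown in the question and must match the real page exactly:

```python
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm"
html = requests.get(url).text

# BeautifulSoup: find() matches on any attribute, so summary works like id/class;
# here a substring of the summary from the question is matched via a regex
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"summary": re.compile("RecallTest")})

# pandas: read_html() can also filter tables by HTML attributes; attrs needs the
# exact attribute string, here the full summary value from the question's markup
summary = ("Layout showing RecallTest table with 6 columns: "
           "Date,Brand Name,Product Description,Reason/Problem,Company,Details/Photo")
tables = pd.read_html(html, attrs={"summary": summary})
```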