How to scrape a table with BeautifulSoup when it only has summary and width attributes?

Question

I am trying to scrape a table from this site: http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm

This table has no id or class attribute and only carries a summary and a width. Is there any way to scrape this table? Perhaps with XPath?

I heard that XPath is not compatible with BeautifulSoup and hope that is wrong.

<table width="100%" cellpadding="3" border="1" summary="Layout showing RecallTest table with 6 columns: Date,Brand Name,Product Description,Reason/Problem,Company,Details/Photo" style="margin-bottom:28px">
		  <thead>
			<tr>
					<th scope="col" data-type="numeric" data-toggle="true"> Date </th>
			</tr>
		  </thead>
		  <tbody>

Here is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

link = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
page = 15
pdf = []
for p in range(1, page + 1):
    l = link + '?page=' + str(p)
    # Downloading contents of the web page
    data = requests.get(l).text
    # Creating BeautifulSoup object
    soup = BeautifulSoup(data, 'html.parser')
    tables = soup.find_all('table')
    table = soup.find('table', INSERT XPATH EXPRESSION)
    df = pd.DataFrame(columns=['date', 'brand', 'descr', 'reason', 'company'])
    for row in table.tbody.find_all('tr'):
        # Find all data for each column
        columns = row.find_all('td')
        if columns != []:
            date = columns[0].text.strip()

Answer 1

Score: 1

When scraping tables it is best practice to use pandas.read_html(), which covers 95% of all cases. Simply iterate over the pages and concatenate the dataframes:

import pandas as pd

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

pd.concat(
    [pd.read_html(url + '?page=' + str(i))[0] for i in range(1, 16)],
    ignore_index=True
)
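
read_html returns a list of every table it finds on a page; the [0] picks the first one (the recall table here), and pd.concat stacks the 15 page-level frames into a single dataframe.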

Note that you can also include links via extract_links='body' (a sketch follows the sample output below).

This will result in:

|     | Date       | Brand Name                                | Product Description                                | Reason/Problem           | Company                  | Details/Photo |
|----:|:-----------|:------------------------------------------|:---------------------------------------------------|:-------------------------|:-------------------------|:--------------|
|   0 | 12/31/2015 | PharMEDium                                | Norepinephrine Bitartrate added to Sodium Chloride | Discoloration            | PharMEDium Services, LLC | nan           |
|   1 | 12/31/2015 | Thomas Produce                            | Cucumbers                                          | Salmonella               | Thomas Produce Company   | nan           |
|   2 | 12/28/2015 | Wegmans, Uoriki Fresh                     | Octopus Salad                                      | Listeria monocytogenes   | Uoriki Fresh, Inc.       | nan           |
| ... |            |                                           |                                                    |                          |                          |               |
| 433 | 01/05/2015 | Whole Foods Market                        | Assorted cookie platters                           | Undeclared tree nuts     | Whole Foods Market       | nan           |
| 434 | 01/05/2015 | Eillien's, Blain's Farms and Fleet & more | Walnut Pieces                                      | Salmonella contamination | Eillien's Candies Inc.   | nan           |
| 435 | 01/02/2015 | Full Tilt Ice Cream                       | Ice Cream                                          | Listeria monocytogenes   | Full Tilt Ice Cream      | nan           |
| 436 | 01/02/2015 | Zilks                                     | Hummus                                             | Undeclared peanuts       | Zilks Foods              | nan           |
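
For illustration, here is a minimal sketch of the extract_links variant. It assumes pandas >= 1.5, and unpacking the (text, href) tuples afterwards is just one way to handle them, not part of the original answer:

import pandas as pd

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

# extract_links='body' turns every body cell into a (text, href) tuple,
# with href set to None where the cell contains no link
df = pd.concat(
    [pd.read_html(url + '?page=' + str(i), extract_links='body')[0] for i in range(1, 16)],
    ignore_index=True
)

# keep the text part everywhere and move the href of the last column
# ("Details/Photo") into its own column
links = df[df.columns[-1]].str[1]
df = df.apply(lambda col: col.str[0])
df['link'] = links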

Based on your manual approach, simply select the first table, iterate over the rows and store the information in a list of dicts, which can easily be converted into a dataframe:

import requests
from bs4 import BeautifulSoup

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'

data = []

for i in range(1, 16):
    soup = BeautifulSoup(requests.get(url + '?page=' + str(i)).text)
    # only rows that contain data cells, i.e. skip the header row
    for e in soup.table.select('tr:has(td)'):
        data.append({
            'date': e.td.text,
            'any other': 'column',
            'link': e.a.get('href')
        })

data
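
If you want to match the table through its summary attribute (the only handle the question's HTML offers) rather than relying on soup.table picking the first table, BeautifulSoup can filter on any attribute. A minimal sketch, assuming the summary text from the question's snippet; the prefix match and the column names are only for illustration:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm'
soup = BeautifulSoup(requests.get(url + '?page=1').text, 'html.parser')

# filter on the summary attribute instead of taking the first table;
# a callable lets us match on a prefix rather than the full string
table = soup.find(
    'table',
    attrs={'summary': lambda v: v is not None and v.startswith('Layout showing RecallTest')}
)

# one dict per data row, then hand the list to pandas
rows = [{'date': tr.td.get_text(strip=True)} for tr in table.select('tr:has(td)')]
df = pd.DataFrame(rows)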

Answer 2

Score: -2

Just two lines of code are needed to get the table from the website:

import requests
import pandas as pd

url = "http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm"

# read_html parses every table in the fetched HTML; the first one is the recall table
tables = pd.read_html(requests.get(url).text)
print(tables[0])

You have to use two modules, requests and pandas.

You can read more about the pandas.read_html function in the pandas documentation.
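
A related option, in case the page ever contained more than one table: read_html accepts an attrs filter to pick a specific table. Attribute values are matched exactly, so the sketch below reuses the full summary string from the question's HTML; treat it as an assumption that the string is unchanged on every page:

import pandas as pd

url = "http://wayback.archive-it.org/7993/20170110233205/http://www.fda.gov/Safety/Recalls/ArchiveRecalls/2015/default.htm"

# attrs is compared against the <table> tag's attributes, value for value
summary = ("Layout showing RecallTest table with 6 columns: "
           "Date,Brand Name,Product Description,Reason/Problem,Company,Details/Photo")
tables = pd.read_html(url, attrs={"summary": summary})
print(tables[0])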
