Need help web scraping a table
Question
I'm new to web scraping and am trying to scrape the 16k rows of this table: https://www.levantineceramics.org/vessels. The table rows are inside a `tbody`, and standard scraping methods using pandas and Beautiful Soup don't work: they return either an empty DataFrame or `[<tbody></tbody>]`.
I looked at web scraping tutorials for pandas, Beautiful Soup, and Selenium without success. Is it even possible to scrape this table, and if so, could you point me in the right direction?
Here is the output of my code, followed by the code itself:
[<tbody>
</tbody>]
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.levantineceramics.org/vessels'
page = requests.get(url)
data = bs(page.text, "html.parser")
table = data.body.findAll('tbody')
print(table)
Answer 1
Score: 2
The data is loaded from an external URL via JavaScript, so BeautifulSoup doesn't see it in the static HTML. You can try to get the data from the site's Ajax API via `requests`:
```py
from io import StringIO

import pandas as pd
import requests

# JSON endpoint the page calls to fill the table
api_url = "https://www.levantineceramics.org/vessels/datatable.json"

# iDisplayStart is the row offset, iDisplayLength the number of rows requested
params = {
    "sEcho": "2",
    "iColumns": "12",
    "sColumns": ",,,,,,,,,,,",
    "iDisplayStart": "0",
    "iDisplayLength": "100",
}

data = requests.get(api_url, params=params).json()

# create <table> and parse it through pandas
columns = [
    "ID",
    "Vessel registration number",
    "Vessel photos",
    "Vessel drawings",
    "Shape",
    "Functional category",
    "Date BCE/CE",
    "Period",
    "Site Name",
    "Country/region",
    "Contributors",
    "Action",
]

table = ["<tr>" + "\n".join(f"<th>{cell}</th>" for cell in columns) + "</tr>"]
for row in data["aaData"]:
    table.append("<tr>" + "\n".join(f"<td>{cell}</td>" for cell in row) + "</tr>")

# wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO("<table>" + "\n".join(table) + "</table>"))[0]
print(df.head(10).to_markdown(index=False))
```

Prints:

ID | Vessel registration number | Vessel photos | Vessel drawings | Shape | Functional category | Date BCE/CE | Period | Site Name | Country/region | Contributors | Action |
---|---|---|---|---|---|---|---|---|---|---|---|
7 | Mizpe Yammim 19/165 | View | View | Juglet | Cosmetic/Toilette/Medicine | 5th-4th centuries BCE | Achaemenid Persian | Mizpe Yammim | Israel/Galilee | Andrea M. BerlinRafael Frankel | nan |
8 | Mizpe Yammim 22/452 | View | View | Juglet | Cosmetic/Toilette/Medicine | 5th-4th century BCE | Achaemenid Persian | Mizpe Yammim | Israel/Galilee | Andrea M. BerlinRafael Frankel | nan |
9 | Mizpe Yammim 8/398 | View | View | Juglet | Cosmetic/Toilette/Medicine | 5th-4th century BCE | Achaemenid Persian | Mizpe Yammim | Israel/Galilee | Rafael FrankelAndrea M. Berlin | nan |
10 | Qedesh K00P168 | View | nan | Cooking pot | Cooking/Food production | 3rd-mid-2nd c. BCE | Hellenistic | Qedesh | Israel/Galilee | Peter J. StoneAndrea M. Berlin | nan |
11 | Qedesh K00P058 | View | nan | Casserole/Lopas | Cooking/Food production | 2nd c. BCE | Middle Hellenistic | Qedesh | Israel/Galilee | Peter J. StoneAndrea M. Berlin | nan |
14 | Miqne INE.4.392/1 | View | nan | Bowl, large | Household/Utility | 1200BCE - 1150BCE | Iron Age I | Tel Miqne/Ekron | Israel/Shephelah | nan | nan |
16 | Qedesh K09P046 | View | nan | Saucer | Dining/Drinking/Serving | 300 BCE - 150 BCE | Middle Hellenistic | Qedesh | Israel/Galilee | Peter J. StoneAndrea M. Berlin | nan |
17 | Qedesh K00P157 | View | View | Plate | Dining/Drinking/Serving | 200 BCE - 140 BCE | Middle Hellenistic | Qedesh | Israel/Galilee | Peter J. Stone | nan |
18 | Tel Anafa PW 49/TA79P49 | View | View | Juglet | Cosmetic/Toilette/Medicine | 5th-4th c. BCE | Achaemenid Persian | Tel Anafa | Israel/Hula Valley | Andrea M. Berlin | nan |
19 | Mizpe Yammim 15/362 | View | View | Juglet | Cosmetic/Toilette/Medicine | 5th-4th century BCE | Achaemenid Persian | Mizpe Yammim | Israel/Galilee | Andrea M. BerlinRafael Frankel | nan |
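
The request above returns only the first 100 rows. To collect all 16k rows, one option is to page through the endpoint by advancing `iDisplayStart`. Below is a minimal sketch, assuming the endpoint honors `iDisplayStart`/`iDisplayLength` for arbitrary offsets and that a short or empty `aaData` page marks the end of the data. The names `fetch_page`, `iter_rows`, and `PAGE_SIZE` are hypothetical helpers introduced here, and the sketch has not been run against the live site.

```python
import time
from typing import Callable, Iterator

API_URL = "https://www.levantineceramics.org/vessels/datatable.json"
PAGE_SIZE = 100  # assumption: the server accepts this page size


def fetch_page(start: int, length: int) -> list:
    """Fetch one page of rows from the Ajax endpoint (same parameters as above)."""
    # imported here so the paging logic below can be exercised without requests
    import requests

    params = {
        "sEcho": "2",
        "iColumns": "12",
        "sColumns": ",,,,,,,,,,,",
        "iDisplayStart": str(start),
        "iDisplayLength": str(length),
    }
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["aaData"]


def iter_rows(
    fetch: Callable[[int, int], list],
    page_size: int = PAGE_SIZE,
    delay: float = 0.0,
) -> Iterator[list]:
    """Yield rows page by page until a page comes back short or empty."""
    start = 0
    while True:
        rows = fetch(start, page_size)
        yield from rows
        if len(rows) < page_size:  # assumption: a short page means no more data
            break
        start += page_size
        time.sleep(delay)  # pause between requests to go easy on the server


# Usage (performs real HTTP requests):
# all_rows = list(iter_rows(fetch_page, delay=0.5))
```

The collected rows can then be turned into a DataFrame the same way as in the answer's code, by building the `<table>` markup from `all_rows` instead of `data["aaData"]`. Passing the fetch function as an argument keeps the paging loop separate from the HTTP call, so it can be checked against a fake page source before hitting the site.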