Need help web scraping a table

Question

I'm new to web scraping and am trying to scrape the 16k rows of this table, https://www.levantineceramics.org/vessels. The table rows are inside a `tbody`, and the standard approaches using pandas and Beautiful Soup do not work: they return either an empty DataFrame or `[<tbody></tbody>]`.

I looked at web scraping tutorials for pandas, Beautiful Soup, and Selenium, but wasn't successful. Is it even possible to scrape this table and, if so, could you point me in the right direction?

Here is my code:

```py
from bs4 import BeautifulSoup as bs
import requests

url = 'https://www.levantineceramics.org/vessels'
page = requests.get(url)
data = bs(page.text, "html.parser")
table = data.body.findAll('tbody')
print(table)
```

It prints:

```
[<tbody>

</tbody>]
```

Answer 1

Score: 2

The data is loaded from an external URL via JavaScript, so BeautifulSoup doesn't see it in the HTML that `requests` downloads. You can try to get the data from the site's Ajax API via `requests` instead:

```py
from io import StringIO

import pandas as pd
import requests

api_url = "https://www.levantineceramics.org/vessels/datatable.json"

# Query parameters in the legacy DataTables server-side format.
params = {
    "sEcho": "2",
    "iColumns": "12",
    "sColumns": ",,,,,,,,,,,",
    "iDisplayStart": "0",  # offset of the first row to return
    "iDisplayLength": "100",  # number of rows to return
}

response = requests.get(api_url, params=params)
response.raise_for_status()  # fail early on an HTTP error
data = response.json()

# create <table> and parse it through pandas

columns = [
    "ID",
    "Vessel registration number",
    "Vessel photos",
    "Vessel drawings",
    "Shape",
    "Functional category",
    "Date BCE/CE",
    "Period",
    "Site Name",
    "Country/region",
    "Contributors",
    "Action",
]

table = ["<tr>" + "\n".join(f"<th>{cell}</th>" for cell in columns) + "</tr>"]
for row in data["aaData"]:
    table.append("<tr>" + "\n".join(f"<td>{cell}</td>" for cell in row) + "</tr>")

# Recent pandas versions deprecate passing a literal HTML string,
# so wrap it in a file-like object.
html = "<table>" + "\n".join(table) + "</table>"
df = pd.read_html(StringIO(html))[0]
print(df.head(10).to_markdown(index=False))  # to_markdown() needs the tabulate package
```

Prints:

| ID | Vessel registration number | Vessel photos | Vessel drawings | Shape | Functional category | Date BCE/CE | Period | Site Name | Country/region | Contributors | Action |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | Mizpe Yammim 19/165 | View | View | Juglet | Cosmetic/Toilette/Medicine | 5th-4th centuries BCE | Achaemenid Persian | Mizpe Yammim | Israel/Galilee | Andrea M. BerlinRafael Frankel | nan |
| 8 | Mizpe Yammim 22/452 | View | View | Juglet | Cosmetic/Toilette/Medicine | 5th-4th century BCE | Achaemenid Persian | Mizpe Yammim | Israel/Galilee | Andrea M. BerlinRafael Frankel | nan |
| 9 | Mizpe Yammim 8/398 | View | View | Juglet | Cosmetic/Toilette/Medicine | 5th-4th century BCE | Achaemenid Persian | Mizpe Yammim | Israel/Galilee | Rafael FrankelAndrea M. Berlin | nan |
| 10 | Qedesh K00P168 | View | nan | Cooking pot | Cooking/Food production | 3rd-mid-2nd c. BCE | Hellenistic | Qedesh | Israel/Galilee | Peter J. StoneAndrea M. Berlin | nan |
| 11 | Qedesh K00P058 | View | nan | Casserole/Lopas | Cooking/Food production | 2nd c. BCE | Middle Hellenistic | Qedesh | Israel/Galilee | Peter J. StoneAndrea M. Berlin | nan |
| 14 | Miqne INE.4.392/1 | View | nan | Bowl, large | Household/Utility | 1200BCE - 1150BCE | Iron Age I | Tel Miqne/Ekron | Israel/Shephelah | nan | nan |
| 16 | Qedesh K09P046 | View | nan | Saucer | Dining/Drinking/Serving | 300 BCE - 150 BCE | Middle Hellenistic | Qedesh | Israel/Galilee | Peter J. StoneAndrea M. Berlin | nan |
| 17 | Qedesh K00P157 | View | View | Plate | Dining/Drinking/Serving | 200 BCE - 140 BCE | Middle Hellenistic | Qedesh | Israel/Galilee | Peter J. Stone | nan |
| 18 | Tel Anafa PW 49/TA79P49 | View | View | Juglet | Cosmetic/Toilette/Medicine | 5th-4th c. BCE | Achaemenid Persian | Tel Anafa | Israel/Hula Valley | Andrea M. Berlin | nan |
| 19 | Mizpe Yammim 15/362 | View | View | Juglet | Cosmetic/Toilette/Medicine | 5th-4th century BCE | Achaemenid Persian | Mizpe Yammim | Israel/Galilee | Andrea M. BerlinRafael Frankel | nan |
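The request above returns only the first 100 rows, while the question asks for all 16k. One way to get the rest might be to page through the endpoint by increasing `iDisplayStart`. The following is a minimal sketch, not a tested scraper for this site: `fetch_page`, `fetch_all_rows`, `page_size`, and `max_rows` are hypothetical names introduced here, and the sketch assumes the server honors `iDisplayStart`/`iDisplayLength` and returns full pages until the last one.

```python
from typing import Callable

API_URL = "https://www.levantineceramics.org/vessels/datatable.json"


def fetch_page(start: int, length: int) -> dict:
    """Request one page of rows from the endpoint used in the answer."""
    # Imported lazily so the paging logic below can be exercised offline.
    import requests

    params = {
        "sEcho": "2",
        "iColumns": "12",
        "sColumns": ",,,,,,,,,,,",
        "iDisplayStart": str(start),    # offset of the first row
        "iDisplayLength": str(length),  # rows per page
    }
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def fetch_all_rows(
    get_page: Callable[[int, int], dict] = fetch_page,
    page_size: int = 100,
    max_rows: int = 100_000,
) -> list:
    """Collect rows page by page until a short (or empty) page is returned.

    max_rows is a safety cap (an assumption of this sketch) so the loop
    still ends if the server ignores the offset and keeps sending page one.
    """
    rows: list = []
    while len(rows) < max_rows:
        batch = get_page(len(rows), page_size).get("aaData", [])
        rows.extend(batch)
        if len(batch) < page_size:
            break  # last page reached
    return rows
```

Calling `rows = fetch_all_rows()` would issue roughly 160 requests for 16k rows at 100 rows per page; the result can then take the place of `data["aaData"]` in the loop that builds the table. Adding a short pause between requests would be considerate toward the site.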

huangapple
  • Posted on 2023-01-09 03:32:43
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75050698.html