How to extract a table from a website (URL) using Python

Question

The NIST dataset website contains some data on copper. How can I grab the table on the left (titled "HTML table format") from the website using a Python script, keep only the numbers in the second and third columns (as shown in the picture in the original post), and store all the data in a .csv file? I tried the code below, but it failed to get the table in the correct format.

import pandas as pd

# URL of the table
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"

# Read the table into a pandas dataframe
df = pd.read_html(url, header=0, index_col=0)[0]
# Save the processed table to a CSV file
df.to_csv("nist_table.csv", index=False)

Answer 1

Score: 2

You could use:

- `.droplevel([0,1], axis=1)` to remove the unwanted header rows
- `.dropna(axis=1, how='all')` to remove the empty columns
- `.iloc[:,1:]` to skip the first column and keep only the remaining three

#### Example

```python
import pandas as pd
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"

# Read four header rows, take the second table on the page, then trim header levels and columns
df = pd.read_html(url, header=[0,1,2,3])[1].droplevel([0,1], axis=1).dropna(axis=1, how='all').iloc[:,1:]
df
```
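To see what each step of that chain does without fetching the page, here is a minimal sketch on a small hand-built frame. The column labels and values are made up for illustration and are only assumed to resemble what `read_html` returns with four header rows; the real page may differ. The final `to_csv` call is an addition that covers the CSV part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical four-level header, standing in for the parsed table (toy labels).
columns = pd.MultiIndex.from_tuples([
    ("title", "note", "edge", "edge"),
    ("title", "note", "Energy", "(MeV)"),
    ("title", "note", "spacer", "spacer"),
    ("title", "note", "mu/rho", "(cm2/g)"),
    ("title", "note", "mu_en/rho", "(cm2/g)"),
])
# Toy values, not real NIST data.
raw = pd.DataFrame(
    [["", 1.0, np.nan, 10.0, 100.0],
     ["K", 2.0, np.nan, 20.0, 200.0]],
    columns=columns,
)

step1 = raw.droplevel([0, 1], axis=1)      # drop the two outer header levels
step2 = step1.dropna(axis=1, how="all")    # drop columns that contain only NaN
step3 = step2.iloc[:, 1:].copy()           # skip the first remaining column

# Flatten the two remaining header levels into single strings for the CSV header.
step3.columns = [" ".join(col) for col in step3.columns]

# With no path argument, to_csv returns the CSV text; pass a filename to write a file.
csv_text = step3.to_csv(index=False)
print(csv_text)
```

`index=False` is safe here because the frame keeps its default integer index. In the question's attempt, `index_col=0` moved the first column into the index, so `index=False` left that column out of the file.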

Answer 2

Score: 0

BeautifulSoup is a great Python package for parsing HTML documents; combined with the requests library, it lets you extract the data you want.

The code below should extract the desired data:

# import packages/libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

# define the URL, fetch the response and parse the HTML DOM contents
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')

# find the first table in the HTML DOM
table = soup.find('table')

# iterate over table rows (tr) and append table data (td) to the rows list
rows = []
for i, row in enumerate(table.find_all('tr')):
    # skip rows 0-3, i.e. the header rows ending with (MeV), (cm2/g), (cm2/g)
    if i > 3:
        rows.append([value.text.strip() for value in row.find_all('td')])

# create a DataFrame from the data appended to the rows list
df = pd.DataFrame(rows)

# export the data to a CSV file called datafile.csv
df.to_csv("datafile.csv")
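The `rows` list built by the loop above holds raw cell strings, so it can include blank cells, text labels, or empty rows. As a rough sketch, the standard-library `csv` module can write out only the numeric cells. The sample `rows`, the helper names `is_number` and `numeric_rows`, and the column names are hypothetical; they are not taken from the answer or from the NIST page.

```python
import csv
import io


def is_number(text):
    """Return True if text parses as a float, e.g. '1.0E-03'."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def numeric_rows(rows, width=3):
    """Keep the numeric cells of each row; skip rows without exactly `width` numbers."""
    cleaned = []
    for row in rows:
        numbers = [cell for cell in row if is_number(cell)]
        if len(numbers) == width:
            cleaned.append(numbers)
    return cleaned


# Hypothetical stand-in for the scraped `rows` list (toy values, not real data).
rows = [
    [],                                       # a row without any <td> cells
    ["", "1.0E-03", "2.0E+00", "3.0E+00"],    # blank leading cell
    ["K", "4.0E-03", "5.0E+00", "6.0E+00"],   # text label in the first cell
]

# An in-memory buffer keeps the sketch free of side effects;
# use open("datafile.csv", "w", newline="") instead to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["energy_MeV", "mu_over_rho", "mu_en_over_rho"])  # assumed names
writer.writerows(numeric_rows(rows))
csv_text = buffer.getvalue()
print(csv_text)
```

Filtering by "parses as a float" avoids hard-coding column positions, at the cost of also accepting strings such as `nan` or `inf`; a stricter check may be needed if those can appear in the cells.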

huangapple
  • Posted on February 8, 2023, 12:24:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75381367.html