How to extract a table from a website (URL) using Python

Question

The NIST dataset website contains some data on copper. How can I grab the table on the left (titled "HTML table format") from the website using a Python script, keep only the numbers in the second and third columns as shown in the picture below, and store all the data in a .csv file? I tried the code below, but it failed to get the table in the correct format.
```python
import pandas as pd

# URL of the table
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"

# Read the table into a pandas DataFrame
df = pd.read_html(url, header=0, index_col=0)[0]

# Save the processed table to a CSV file
df.to_csv("nist_table.csv", index=False)
```
Answer 1

Score: 2

You could use:

- `.droplevel([0,1])` to remove the unwanted header rows
- `.dropna(axis=1, how='all')` to remove the empty columns
- `.iloc[:,1:]` to select only the three columns of interest

#### Example
```python
import pandas as pd

url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"

df = (
    pd.read_html(url, header=[0, 1, 2, 3])[1]
    .droplevel([0, 1], axis=1)
    .dropna(axis=1, how="all")
    .iloc[:, 1:]
)
df
```
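To see what each step of that chain does without hitting the network, here is a minimal sketch on a synthetic frame with a four-level header; the column names and values are made up for illustration and are not the real NIST headers:

```python
import pandas as pd
import numpy as np

# Synthetic frame mimicking a table parsed with header=[0,1,2,3]:
# a 4-level column index, a leading label column, and one all-empty column.
cols = pd.MultiIndex.from_tuples([
    ("t", "t", "label", ""),
    ("t", "t", "Energy", "(MeV)"),
    ("t", "t", "mu/rho", "(cm2/g)"),
    ("t", "t", "empty", ""),
])
df = pd.DataFrame(
    [["K edge", 1.0e-3, 1.057e4, np.nan],
     ["", 1.5e-3, 4.418e3, np.nan]],
    columns=cols,
)

# Drop the two uninformative top header levels, remove the all-NaN
# column, then drop the leading label column.
out = (df.droplevel([0, 1], axis=1)
         .dropna(axis=1, how="all")
         .iloc[:, 1:])
```

After the chain, `out` holds only the two numeric columns, which can then be written out with `out.to_csv(...)`.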
Answer 2

Score: 0

BeautifulSoup is a great Python package for parsing HTML documents; combined with the requests library, you can extract the data you want.

The code below should extract the desired data:
```python
# import packages/libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

# define the URL variable, get the response, and parse the HTML DOM contents
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')

# use soup to find the table in the HTML DOM
table = soup.find('table')

# iterate over table rows (tr) and append table data (td) to the rows list
rows = []
for i, row in enumerate(table.find_all('tr')):
    # only append data after the 3rd row -> (MeV), (cm2/g), (cm2/g)
    if i > 3:
        rows.append([value.text.strip() for value in row.find_all('td')])

# create a DataFrame from the rows list
df = pd.DataFrame(rows)

# export the data to a CSV file called datafile.csv
df.to_csv("datafile.csv")
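The scraped cells are still strings in exponential notation, and on tables like this one, absorption-edge rows may carry an extra leading label cell. A minimal sketch of cleaning such rows (the sample values below are made up, not real scraped output) keeps only the trailing numeric cells and converts them with `pd.to_numeric`:

```python
import pandas as pd

# Sample rows as the scraping loop would collect them: every cell is a
# string, and the second row has a hypothetical extra edge-label cell.
rows = [
    ["1.00000E-03", "1.057E+04", "1.049E+04"],
    ["K", "1.50000E-03", "4.418E+03", "4.357E+03"],
]

# Keep only the last three cells of each row so edge labels are dropped,
# then convert the exponential-notation strings to floats.
df = pd.DataFrame(
    [r[-3:] for r in rows],
    columns=["Energy (MeV)", "mu/rho (cm2/g)", "mu_en/rho (cm2/g)"],
).apply(pd.to_numeric)

# Write the cleaned table to CSV.
df.to_csv("datafile.csv", index=False)
```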
Comments