February 8, 2023, 12:24:37

How to extract a table from a website (URL) using Python

Question

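The NIST dataset website contains some data for copper. How can I use a Python script to grab the table on the left of the page (titled "HTML table format"), keep only the numbers in the second and third columns as shown in the picture below, and store all the data in a .csv file? I tried the code below, but it failed to get the table in the correct format.

import pandas as pd
# URL of the table
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
# Read the table into a pandas DataFrame
df = pd.read_html(url, header=0, index_col=0)[0]
# Save the processed table to a CSV file
df.to_csv("nist_table.csv", index=False)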

Answer 1

Score: 2

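You could use:

- `.droplevel([0,1], axis=1)` to drop the two outermost levels of the column header
- `.dropna(axis=1, how='all')` to drop the columns that are entirely empty
- `.iloc[:,1:]` to drop the first column and keep the remaining three
#### Example
```python
import pandas as pd
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
# header=[0,1,2,3] treats the first four rows as a multi-level header;
# [1] selects the second table that read_html finds on the page
df = pd.read_html(url, header=[0,1,2,3])[1].droplevel([0,1], axis=1).dropna(axis=1, how='all').iloc[:,1:]
df
```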
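To see what each step of that chain does without fetching the page, the sketch below applies the same three calls to a small hand-built DataFrame. The column labels and numbers are made up for illustration and are not NIST data, and the real table's header text may differ. The last two statements are an addition, not part of the answer: they flatten the two remaining header levels and produce CSV text with `to_csv(index=False)`, which is one possible way to get the .csv output the question asks for.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the parsed table (labels and values are made up):
# four header levels, an edge-label column, and one all-NaN spacer column.
columns = pd.MultiIndex.from_tuples([
    ("Title", "Subtitle", "Edge", "Unnamed"),
    ("Title", "Subtitle", "Spacer", "Unnamed"),
    ("Title", "Subtitle", "Energy", "(MeV)"),
    ("Title", "Subtitle", "mu/rho", "(cm2/g)"),
    ("Title", "Subtitle", "mu_en/rho", "(cm2/g)"),
])
raw = pd.DataFrame(
    [
        [np.nan, np.nan, 0.001, 20.0, 10.0],
        ["K", np.nan, 0.002, 40.0, 30.0],
    ],
    columns=columns,
)

clean = (
    raw.droplevel([0, 1], axis=1)   # drop the two outermost header levels
    .dropna(axis=1, how="all")      # drop columns that hold only NaN
    .iloc[:, 1:]                    # drop the leading edge-label column
)

# Flatten the two remaining header levels into single column names
clean.columns = [" ".join(levels) for levels in clean.columns]

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = clean.to_csv(index=False)
print(csv_text)
```

Passing a file name such as `"nist_table.csv"` to `to_csv` would write the same text to disk instead of returning it.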

Answer 2

Score: 0

BeautifulSoup is a great Python package for parsing HTML documents. Combined with the requests library, it lets you extract the data you want.

The code below should extract the desired data:

# Import the required packages
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Fetch the page and parse its HTML
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.text, 'html.parser')
# Find the first <table> element in the document
table = soup.find('table')
# Iterate over the table rows (tr) and collect the text of each cell (td)
rows = []
for i, row in enumerate(table.find_all('tr')):
    # Keep only rows after index 3, i.e. after the units row: (MeV), (cm2/g), (cm2/g)
    if i > 3:
        rows.append([value.text.strip() for value in row.find_all('td')])
# Build a DataFrame from the collected rows
df = pd.DataFrame(rows)
# Export the data to a CSV file called datafile.csv
df.to_csv("datafile.csv")
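Skipping a fixed number of rows depends on the exact page layout. As a hedged alternative, the sketch below keeps a row only if it has exactly three cells that parse as numbers. It runs on a small inline HTML string (`SAMPLE_HTML`, a made-up miniature with invented values, not the real page), and the helper names `is_number` and `numeric_rows` and the CSV column names are hypothetical. The assumption that each data row holds exactly three numeric cells would need checking against the actual table.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up miniature of the page: a layout table wrapping the data table.
SAMPLE_HTML = """
<table><tr><td>
  <table>
    <tr><th>Edge</th><th>Energy</th><th>mu/rho</th><th>mu_en/rho</th></tr>
    <tr><td></td><td>(MeV)</td><td>(cm2/g)</td><td>(cm2/g)</td></tr>
    <tr><td></td><td>1.0E-03</td><td>2.0E+01</td><td>1.0E+01</td></tr>
    <tr><td>K</td><td>2.0E-03</td><td>4.0E+01</td><td>3.0E+01</td></tr>
  </table>
</td></tr></table>
"""


def is_number(text):
    """Return True if text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def numeric_rows(html):
    """Collect rows that contain exactly three numeric cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # recursive=False looks only at the row's own cells, not nested tables
        cells = [td.get_text(strip=True) for td in tr.find_all("td", recursive=False)]
        numbers = [cell for cell in cells if is_number(cell)]
        if len(numbers) == 3:
            rows.append(numbers)
    return rows


# Write the rows as CSV text; a real script would open a file instead
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["energy_MeV", "mu_over_rho_cm2_per_g", "mu_en_over_rho_cm2_per_g"])
writer.writerows(numeric_rows(SAMPLE_HTML))
csv_text = buffer.getvalue()
print(csv_text)
```

Filtering on content rather than row position drops header rows, unit rows, and the outer layout row without counting them, at the cost of silently skipping any data row that does not match the three-number pattern.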
