How to extract a table from a website (URL) using Python


Question

The NIST dataset website contains some data on copper. How can I grab the table on the left (titled "HTML table format") from that website using a Python script, keep only the numbers in the second and third columns as shown in the picture below, and store all the data in a .csv file? I tried the code below, but it failed to produce the table in the correct format.

```python
import pandas as pd

# URL of the table
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"

# Read the table into a pandas DataFrame
df = pd.read_html(url, header=0, index_col=0)[0]

# Save the processed table to a CSV file
df.to_csv("nist_table.csv", index=False)
```


Answer 1

Score: 2

You could use:

- `.droplevel([0,1])` to remove the unwanted header rows
- `.dropna(axis=1, how='all')` to remove the empty columns
- `.iloc[:,1:]` to select only the three columns of interest

#### Example

```python
import pandas as pd

url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
df = (
    pd.read_html(url, header=[0, 1, 2, 3])[1]
    .droplevel([0, 1], axis=1)
    .dropna(axis=1, how='all')
    .iloc[:, 1:]
)
df
```
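The same chain can be tried offline on a miniature table with a comparable stacked-header layout. The HTML below is hypothetical stand-in data for illustration only (the column names `Energy`, `mu/rho`, and `mu_en/rho` are assumptions, not values read from the NIST page):

```python
import io
import pandas as pd

# A tiny stand-in for the NIST page: two junk header rows sit above
# the real column names (hypothetical data, for illustration only).
html = """
<table>
  <tr><th colspan="3">Junk A</th></tr>
  <tr><th colspan="3">Junk B</th></tr>
  <tr><th>Energy</th><th>mu/rho</th><th>mu_en/rho</th></tr>
  <tr><td>1.0</td><td>0.215</td><td>0.021</td></tr>
  <tr><td>2.0</td><td>0.144</td><td>0.018</td></tr>
</table>
"""

# header=[0,1,2] builds a 3-level column MultiIndex; droplevel([0,1])
# discards the two junk levels, mirroring the chain used on the NIST table.
df = pd.read_html(io.StringIO(html), header=[0, 1, 2])[0].droplevel([0, 1], axis=1)

# Drop any all-empty columns, keep only the second and third columns,
# and write them out, as the question asks.
out = df.dropna(axis=1, how="all").iloc[:, 1:]
out.to_csv("nist_table.csv", index=False)
```

Wrapping the literal HTML in `io.StringIO` avoids the deprecation warning newer pandas versions emit for raw HTML strings.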

Answer 2

Score: 0

For parsing HTML documents, BeautifulSoup is a great Python package to use; together with the requests library you can extract the data you want.

The code below should extract the desired data:

```python
# import packages/libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

# define the URL variable, get the response, and parse the HTML DOM contents
url = "https://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z29.html"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')

# find the first table in the HTML DOM
table = soup.find('table')

# iterate over table rows (tr) and append table data (td) to the rows list
rows = []
for i, row in enumerate(table.find_all('tr')):
    # only append data if it comes after the 3rd row -> (MeV), (cm2/g), (cm2/g)
    if i > 3:
        rows.append([value.text.strip() for value in row.find_all('td')])

# create a DataFrame from the data appended to the rows list
df = pd.DataFrame(rows)

# export the data to a CSV file called datafile
df.to_csv("datafile.csv")
```
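The DataFrame produced by the loop above has no column names and still holds the cell values as strings. A small follow-up sketch shows one way to name the columns and keep only the second and third, as the question asks; the sample rows and column labels below are illustrative assumptions, not values read from the page:

```python
import pandas as pd

# A sample of what the scraping loop produces: one list of strings per
# <tr> (hypothetical values, for illustration only).
rows = [
    ["1.00000E-03", "1.057E+01", "1.049E+01"],
    ["1.50000E-03", "4.418E+00", "4.357E+00"],
]

# Give the columns descriptive names (assumed labels), then keep only
# the second and third columns, convert them to floats, and save.
df = pd.DataFrame(rows, columns=["Energy (MeV)", "mu/rho (cm2/g)", "mu_en/rho (cm2/g)"])
cols = df.iloc[:, 1:3].astype(float)
cols.to_csv("datafile.csv", index=False)
```

Converting with `.astype(float)` also turns the scientific-notation strings (e.g. `1.057E+01`) into proper numbers before the CSV is written.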

huangapple
  • Published on 2023-02-08 12:24:37
  • Please keep this link when reposting: https://go.coder-hub.com/75381367.html