How to scrape a dynamic HTML table with a different class name for each row and nested elements?

Question

I want to create a dataframe by scraping the table here, which has a different class name for each row and contains nested elements.

  from selenium.webdriver.common.by import By

  # `driver` is an already-initialized Selenium WebDriver
  table_rows = driver.find_elements(By.CLASS_NAME, "bgColor-white")
  for row in table_rows:
      print(row.text)

The code above prints each row as a single string, and I could not separate that string into the appropriate columns.

Answer 1

Score: 1

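Locate the table element and get its outerHTML, then pass that HTML to the pandas read_html() method to get the dataframe. read_html() works from the table's rows and cells, so the differing class names on the rows do not matter.

  import time
  from io import StringIO
  import pandas as pd
  from selenium.webdriver.common.by import By

  # `driver` is an already-initialized Selenium WebDriver
  driver.get("https://www.egp.gov.bt/resources/common/TenderListing.jsp?lang=en_US&langForMenu=en_US&h=t")
  time.sleep(3)  # give the dynamic table time to load
  table = driver.find_element(By.CSS_SELECTOR, "table#resultTable").get_attribute("outerHTML")
  df = pd.read_html(StringIO(table))[0]  # StringIO: newer pandas deprecates passing raw HTML strings
  print(df)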
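To illustrate why parsing by table structure sidesteps the per-row class names, here is a minimal browser-free sketch. It is not how pandas implements read_html(); it is a stand-in built on the standard library's html.parser, and the sample markup, the `bgColor-green` class, and the `TableCellParser` name are all made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical sample markup: each row has a different class and
# some cells wrap their text in nested elements.
SAMPLE_HTML = """
<table id="resultTable">
  <tr class="header"><th>Sl. No.</th><th>Tender ID</th><th>Status</th></tr>
  <tr class="bgColor-white"><td>1</td><td><a href="#">15183</a></td><td><span>Live</span></td></tr>
  <tr class="bgColor-green"><td>2</td><td><a href="#">15160</a></td><td><span>Cancelled</span></td></tr>
</table>
"""


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []  # the row's class attribute is ignored
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)  # text inside nested tags lands here too

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableCellParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)
print(records)
```

Each row comes back as a list of cell values instead of one flattened string, which is the separation that `row.text` in the question loses.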

Console output:

  Sl. No. Tender ID, Reference No, Public Status ... Type, Method Publishing Date & Time | Closing Date & Time
  0 1 15183, TSHA-6/Engineering/9/2022-2023/769, Live ... NCB, OTM 03-Mar-2023 15:00 | 14-Mar-2023 15:10
  1 2 15180, STCB/PD/TS/Samtse/2023/213, Live ... NCB, OTM 03-Mar-2023 10:00 | 14-Mar-2023 11:10
  2 3 15160, JNEC/Adm-33/2022-2023, Cancelled ... NCB, OTM 02-Mar-2023 22:00 | 10-Mar-2023 10:30
  3 4 15179, DAG/DEHSS(07)/2022-2023/148, Live ... NCB, OTM 02-Mar-2023 15:00 | 16-Mar-2023 09:00
  4 5 15181, DCHS/PRP-01/2022-2023/244, Amendment/Co... ... NCB, OTM 02-Mar-2023 09:00 | 13-Mar-2023 10:30
  5 6 15174, NBC/Adm/06/2022/1198, Live ... NCB, OTM 01-Mar-2023 09:00 | 20-Mar-2023 11:30
  6 7 15161, PDA/adm -35/2022-2023/, Live ... NCB, OTM 27-Feb-2023 16:00 | 10-Mar-2023 11:00
  7 8 15169, MD/Dz.EHSS-20/2022-2023/5179, Amendmen... ... NCB, OTM 27-Feb-2023 14:30 | 10-Mar-2023 14:00
  8 9 15157, nofp2, Live ... NCB, OTM 21-Feb-2023 09:00 | 08-Mar-2023 11:30
  9 10 15158, MD/DES-20/2022-2023/5095, Being processed ... NCB, OTM 21-Feb-2023 02:00 | 02-Mar-2023 10:00
  [10 rows x 6 columns]
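Some columns in this output still hold several values in one cell, for example the publishing and closing times separated by " | ". A possible follow-up, sketched here on two rows copied from the output above, is to split such a column with pandas string methods. The exact column name is an assumption read off the printed header, so check `df.columns` on the real dataframe first.

```python
import pandas as pd

# Two rows copied from the console output. COMBINED is an assumed column
# name based on the printed header, not confirmed against the live table.
COMBINED = "Publishing Date & Time | Closing Date & Time"
df = pd.DataFrame({
    "Sl. No.": [1, 2],
    COMBINED: [
        "03-Mar-2023 15:00 | 14-Mar-2023 15:10",
        "03-Mar-2023 10:00 | 14-Mar-2023 11:10",
    ],
})

# Split on the "|" separator (with optional surrounding whitespace).
# The pattern is a regular expression, so the pipe is escaped.
parts = df[COMBINED].str.split(r"\s*\|\s*", expand=True)

# Parse both halves as timestamps, e.g. "03-Mar-2023 15:00".
date_format = "%d-%b-%Y %H:%M"
df["Publishing Date & Time"] = pd.to_datetime(parts[0], format=date_format)
df["Closing Date & Time"] = pd.to_datetime(parts[1], format=date_format)
df = df.drop(columns=[COMBINED])
print(df)
```

The same pattern, splitting on the separator with `expand=True`, should also apply to the comma-separated columns such as "Type, Method", provided the separator never appears inside a value.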

huangapple
  • Published on March 4, 2023, 01:32:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75630195.html