How to scrape a dynamic HTML table with a different class name for each row containing nested elements?

Question

I want to create a DataFrame by scraping the table here, which has a different class name for each row and contains nested elements.

from selenium.webdriver.common.by import By

# Matches only the rows that carry the "bgColor-white" class
table_rows = driver.find_elements(By.CLASS_NAME, "bgColor-white")
for row in table_rows:
    print(row.text)

The code above prints each row as a single string, and I could not split that string into the appropriate columns.

Answer 1

Score: 1

Locate the table element and read its outerHTML attribute.
Pass that HTML to pandas' read_html(), which parses the <table> markup into a DataFrame regardless of the class on each row.
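
As a rough illustration of why the per-row class names stop mattering once the table's HTML is read by tag, here is a minimal standard-library sketch. It is not how pandas implements read_html(); TableTextParser, sample_html and the bgColor-Green class are made up for this example, and the markup is only loosely modelled on the table in the question.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the visible text of each cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # Only the tag name is checked; class attributes in `attrs` are ignored.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (<a>, <span>, ...) also arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical markup: two data rows with different classes and nested tags.
sample_html = """
<table id="resultTable">
  <tr><th>Sl. No.</th><th>Tender ID, Public Status</th></tr>
  <tr class="bgColor-white"><td>1</td><td><a href="#">15183</a>, <span>Live</span></td></tr>
  <tr class="bgColor-Green"><td>2</td><td><a href="#">15180</a>, <span>Live</span></td></tr>
</table>
"""

parser = TableTextParser()
parser.feed(sample_html)
for row in parser.rows:
    print(row)
```

Each row comes out as a list with one string per cell, which is the column separation that printing the text of a whole row does not give.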

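import time
import pandas as pd
from selenium.webdriver.common.by import By

# `driver` is an already-created Selenium WebDriver instance
driver.get("https://www.egp.gov.bt/resources/common/TenderListing.jsp?lang=en_US&langForMenu=en_US&h=t")
time.sleep(3)  # give the page time to render the table
table = driver.find_element(By.CSS_SELECTOR, "table#resultTable").get_attribute("outerHTML")
df = pd.read_html(table)[0]  # read_html returns a list of DataFrames
print(df)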

Console output:

   Sl. No.           Tender ID,  Reference No,  Public Status  ... Type,  Method Publishing Date & Time | Closing Date & Time
0        1    15183, TSHA-6/Engineering/9/2022-2023/769, Live  ...     NCB,  OTM        03-Mar-2023 15:00 | 14-Mar-2023 15:10
1        2            15180, STCB/PD/TS/Samtse/2023/213, Live  ...     NCB,  OTM        03-Mar-2023 10:00 | 14-Mar-2023 11:10
2        3            15160, JNEC/Adm-33/2022-2023, Cancelled  ...     NCB,  OTM        02-Mar-2023 22:00 | 10-Mar-2023 10:30
3        4          15179,  DAG/DEHSS(07)/2022-2023/148, Live  ...     NCB,  OTM        02-Mar-2023 15:00 | 16-Mar-2023 09:00
4        5  15181, DCHS/PRP-01/2022-2023/244, Amendment/Co...  ...     NCB,  OTM        02-Mar-2023 09:00 | 13-Mar-2023 10:30
5        6                  15174, NBC/Adm/06/2022/1198, Live  ...     NCB,  OTM        01-Mar-2023 09:00 | 20-Mar-2023 11:30
6        7                15161, PDA/adm -35/2022-2023/, Live  ...     NCB,  OTM        27-Feb-2023 16:00 | 10-Mar-2023 11:00
7        8  15169,  MD/Dz.EHSS-20/2022-2023/5179, Amendmen...  ...     NCB,  OTM        27-Feb-2023 14:30 | 10-Mar-2023 14:00
8        9                                 15157, nofp2, Live  ...     NCB,  OTM        21-Feb-2023 09:00 | 08-Mar-2023 11:30
9       10   15158, MD/DES-20/2022-2023/5095, Being processed  ...     NCB,  OTM        21-Feb-2023 02:00 | 02-Mar-2023 10:00

[10 rows x 6 columns]
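
Several cells in this output still pack more than one value into a single string, for example the ID, reference number and status, or the two dates. The sketch below shows one possible way to split such cells using only the standard library. The helper names split_tender and split_dates are hypothetical, the sample values are copied from the first row of the output, and the separators are an assumption based on how those values look.

```python
from datetime import datetime

# Cell values copied from the first row of the console output.
tender_cell = "15183, TSHA-6/Engineering/9/2022-2023/769, Live"
dates_cell = "03-Mar-2023 15:00 | 14-Mar-2023 15:10"


def split_tender(cell):
    """Split 'id, reference, status' into three stripped strings."""
    # The first and last commas are taken as the separators, assuming the
    # ID and the status never contain a comma themselves.
    tender_id, rest = cell.split(",", 1)
    reference, status = rest.rsplit(",", 1)
    return tender_id.strip(), reference.strip(), status.strip()


def split_dates(cell, fmt="%d-%b-%Y %H:%M"):
    """Split 'published | closing' into two datetime objects."""
    # %b matches English month abbreviations under the default C locale.
    published, closing = (part.strip() for part in cell.split("|"))
    return datetime.strptime(published, fmt), datetime.strptime(closing, fmt)


print(split_tender(tender_cell))
print(split_dates(dates_cell))
```

The same helpers could be applied to the matching DataFrame columns with Series.apply. The exact column labels depend on what read_html() returns for the page, so check df.columns first. The "..." in the printed cells is only pandas shortening the display; the values stored in the DataFrame are not truncated.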

huangapple
  • Published on 2023-03-04 01:32:55
  • Article link: https://go.coder-hub.com/75630195.html