How to find HTML elements by multiple tags with Selenium

Question

I need to scrape data from a web page with Selenium. I need to find the following elements:

```html
<div class="content-left">
    <ul></ul>
    <ul></ul>
    <p></p>
    <ul></ul>
    <p></p>
    <ul></ul>
    <p></p>
    <ul>
        <li></li>
        <li></li>
    </ul>
    <p></p>
</div>
```

As you can see, the `<p>` and `<ul>` tags have no classes, and I don't know how to get them in order.

I used Beautiful Soup before:

```python
allP = bs.find('div', attrs={"class":"content-left"})
txt = ""
for p in allP.find_all(['p', 'li']):
```

But it no longer works (requests now returns a 403 error), so I need to find these elements with Selenium.

HTML:

![This image](https://i.stack.imgur.com/lqfYm.png)




# Answer 1
**Score**: 0

To extract the text from only the `<p>` and `<li>` tags, you can use [**Beautiful Soup**](https://stackoverflow.com/a/47871704/7429447) as follows:

```python
from bs4 import BeautifulSoup

html_text = '''<div class="content-left">
    <ul>1</ul>
    <ul>2</ul>
    <p>3</p>
    <ul>4</ul>
    <p>5</p>
    <ul>6</ul>
    <p>7</p>
    <ul>
        <li>8</li>
        <li>9</li>
    </ul>
    <p>10</p>
</div>
'''
soup = BeautifulSoup(html_text, 'html.parser')
parent_element = soup.find("div", {"class": "content-left"})
for element in parent_element.find_all(['p', 'li']):
    print(element.text)
```

Console output:

```
3
5
7
8
9
10
```

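Using _Selenium_

With Selenium, you can use a list comprehension as follows (here `driver` is a WebDriver instance that has already loaded the page):

  • Using _CSS_SELECTOR_:

```python
from selenium.webdriver.common.by import By
print([my_elem.text for my_elem in driver.find_elements(By.CSS_SELECTOR, "div.content-left p, div.content-left li")])
```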
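
As a rough sketch of how the comma-separated selector above could be generalized to any list of tags, the helpers below build the selector string and collect the matching texts. The names `build_multi_tag_selector` and `texts_in_document_order` are hypothetical and not part of Selenium. The code assumes `driver` is a WebDriver that has already loaded the page, and it passes the plain string `"css selector"` (the value of `By.CSS_SELECTOR`) so the snippet itself needs no imports.

```python
def build_multi_tag_selector(container, tags):
    """Return a CSS selector list matching any of `tags` inside `container`."""
    return ", ".join(f"{container} {tag}" for tag in tags)


def texts_in_document_order(driver, container, tags):
    """Collect the text of every matching element via a Selenium-style driver."""
    selector = build_multi_tag_selector(container, tags)
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    elements = driver.find_elements("css selector", selector)
    return [element.text for element in elements]


print(build_multi_tag_selector("div.content-left", ["p", "li"]))
# div.content-left p, div.content-left li
```

With a live driver, the call would look like `texts_in_document_order(driver, "div.content-left", ["p", "li"])`. A CSS selector list is normally matched in document order rather than grouped by tag, so the result should line up with what `find_all(['p', 'li'])` returns in Beautiful Soup, though that is worth confirming against the actual page.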

huangapple
  • Published on June 13, 2023 at 03:00:18
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76459579.html