BeautifulSoup previous_sibling not working

Question

Some items have a title and some don't. Sample HTML:

<div id="content">
    <h5>Title1</h5>
    <div class="text">text 1</div>

    <h5>Title2</h5>
    <div class="text">text 2</div>

    <div class="text">text 3</div>

    <div class="text">text 4</div>
</div>

I am trying to get every `div` with class `text`, together with its `h5` title, if it has one.

`find_previous_sibling('h5')` finds the title, but the last two `text` divs also report a title (`Title2`) that does not belong to them.

I also tried `previous_sibling`, intending to check whether it is an `h5` or a `div` and to treat an `h5` as the title, but it seems to return nothing.

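from bs4 import BeautifulSoup

# `response` is the HTTP response for the page being scraped
html = BeautifulSoup(response.text, 'lxml')
content = html.find('div', {'id': 'content'})
paras = content.find_all('div', {'class': 'text'})

for para in paras:
    # Attempt 1: nearest preceding <h5> sibling
    title = para.find_previous_sibling('h5')
    if title:
        print(title.get_text())

    # Attempt 2: the node immediately before this div
    pr = para.previous_sibling
    if pr:
        print(pr)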

Answer 1

Score: 1

You could use `find_previous()` without any arguments to get the element that comes just before the `div`, then check its `.name` to see whether it is an `<h5>`:

```python3
from bs4 import BeautifulSoup

html = """
<div id="content">
    <h5>Title1</h5>
    <div class="text">text 1</div>

    <h5>Title2</h5>
    <div class="text">text 2</div>

    <div class="text">text 3</div>
    <div class="text">text 4</div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
content = soup.find('div', {'id': 'content'})
paras = content.find_all('div', {'class': 'text'})

for para in paras:
    print(para.text)
    # Element immediately before this div in document order
    prev = para.find_previous()
    # Only treat it as the title if it is an <h5>
    if prev and prev.name == 'h5':
        print(prev.text)
```

Gives:

text 1
Title1
text 2
Title2
text 3
text 4
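
As a side note on why `previous_sibling` looked empty in the question: the node directly before each `div` is the whitespace between the tags (a text node), so printing it shows only blank lines. A minimal sketch of an alternative that stays at the sibling level is to walk `previous_siblings` and stop at the first real tag. The names `SAMPLE`, `title_for`, and `collect` are made up for this illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Sample markup copied from the question
SAMPLE = """
<div id="content">
    <h5>Title1</h5>
    <div class="text">text 1</div>

    <h5>Title2</h5>
    <div class="text">text 2</div>

    <div class="text">text 3</div>

    <div class="text">text 4</div>
</div>
"""


def title_for(para):
    """Return the <h5> directly before `para`, or None (hypothetical helper)."""
    for sibling in para.previous_siblings:
        # Skip whitespace and other text nodes; only tags count.
        if isinstance(sibling, Tag):
            return sibling if sibling.name == "h5" else None
    return None


def collect(html):
    """Return a list of (title or None, text) pairs (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", {"id": "content"})
    result = []
    for para in content.find_all("div", {"class": "text"}):
        title = title_for(para)
        result.append((title.get_text() if title else None, para.get_text()))
    return result


pairs = collect(SAMPLE)
for title, text in pairs:
    print(title, text)
```

Compared with `find_previous()`, this only ever looks at siblings, so it should be unaffected by tags nested inside the `h5` or inside the preceding `div`. Whether that matters depends on the real page.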

huangapple
  • Posted on July 13, 2023, 18:04:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76678185.html