Extracting some text in a sentence from a website in Python

Question

I got stuck while trying to extract some text from a sentence on this [website](http://wiseafrican.isslserv.ng/index.php/category/nigerian-proverbs/yoruba-proverbs/page/5/).

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get('http://wiseafrican.isslserv.ng/index.php/category/nigerian-proverbs/yoruba-proverbs/page/5/')
soup4 = BeautifulSoup(res.content, 'html.parser')

soup4.findAll('div', 'excerpt')
```

Below is the output. I would like to extract only the sentence before **Translation:** in each HTML tag and then add those sentences to a pandas DataFrame.

```
[<div class="excerpt">
 <p>A ki i fi ara eni se oogun alokunna. Translation: One does not use oneself as an ingredient in a medicine requiring that the ingredients be pulverized. Meaning; Self-preservation is a compulsory project for all.</p>
 </div>, <div class="excerpt">
 <p>A ki i fi ai-mo-we mookun. Translation: One does not dive under water without knowing how to swim. Meaning: Never engage in a project for which you lack the requisite skills.</p>
 </div>, <div class="excerpt">
 <p>A ki i fi agba sile sin agba. Translation: One does not leave one elder sitting to walk another elder part of his way. meaning: One should not slight one person in order to humor another.</p>
 </div>, <div class="excerpt">
 <p>A ki i fa ori lehin olori. Translation: One does not shave a head in the absence of the owner. Meaning: One does not settle a matter in the absence of the person most concerned.</p>
 </div>, <div class="excerpt">
 <p>A ki i duni loye ka fona ile-e Baale hanni. Translation: One does not compete with another for a chieftaincy title and also show the way to the king’s house to the competitor. Meaning: A person should be treated either as an adversary or as an ally, not as both.</p>
 </div>, <div class="excerpt">
 <p>A ki i du ori olori ki awodi gbe teni lo. Translation: One does not fight to save another person’s head only to have a kite carry one’s own away. Meaning: One should not save other’s at the cost of one’s own safety.</p>
 </div>, <div class="excerpt">
 <p>A ki i da eru ikun pa ori. Translation: One does not weigh the head down with a load that belongs to the belly. Meaning: Responsibilities should rest where they belong.</p>
 </div>, <div class="excerpt">
 <p>A ki i da aro nisokun ala la nlo. Translation: One does not engage in a dyeing trade in (isokun) people there wear only white. Meaning Wherever one might be, one should respect the manners and habits of the place.</p>
 </div>, <div class="excerpt">
 <p>A ki bo sinu omi tan ka maa sa fun otutu. Translation: Does not enter into the water and then run from the cold. Meaning: Precautions are useful only before the event.</p>
 </div>, <div class="excerpt">
 <p>A fun o lobe o tami si; o gbon ju olobe lo. Translation: You are given some stew and you add water; you must be wiser than the cook. Meaning: Adding water is a means of stretching stew. A person who thus stretches the stew he or she is given would seem to know better than the person who served it how much would suffice for the meal.</p>
 </div>]
```

# Answer 1
**Score**: 0

One solution is to load the text into a DataFrame and then use `.str.extract()` to clean the data:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get('http://wiseafrican.isslserv.ng/index.php/category/nigerian-proverbs/yoruba-proverbs/page/5/')
soup4 = BeautifulSoup(res.content, 'html.parser')

# One row per <div class="excerpt">, holding the full excerpt text
df = pd.DataFrame([div.get_text(strip=True) for div in soup4.findAll('div', 'excerpt')], columns=['Proverb'])

# Keep only the part before "Translation" (raw string so \s is a regex escape)
df['Proverb'] = df['Proverb'].str.extract(r'^(.*)\s+Translation')
print(df)
```

Prints:

```
                                       Proverb
0         A ki i fi ara eni se oogun alokunna.
1                   A ki i fi ai-mo-we mookun.
2                A ki i fi agba sile sin agba.
3                   A ki i fa ori lehin olori.
4  A ki i duni loye ka fona ile-e Baale hanni.
5    A ki i du ori olori ki awodi gbe teni lo.
6                   A ki i da eru ikun pa ori.
7            A ki i da aro nisokun ala la nlo.
8   A ki  bo sinu omi tan ka maa sa fun otutu.
9  A fun o lobe o tami si; o gbon ju olobe lo.
```

Or clean the text with the `re` module before building the DataFrame:

```python
import re

# Replace the whole string with capture group 1 (the text before "Translation:")
df = pd.DataFrame([re.sub(r'^(.*)\s+Translation:.*', r'\1', div.get_text(strip=True)) for div in soup4.findAll('div', 'excerpt')], columns=['Proverb'])
print(df)
```

Link to pandas documentation
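
The regex step can be tried without fetching the page. The following is a minimal sketch using only the standard library; `extract_proverb` and `PROVERB_RE` are hypothetical names introduced here, and the sample strings are shortened excerpts from the question. It assumes the page text contains non-breaking spaces (U+00A0), which it converts to ordinary spaces before matching.

```python
import re

# Assumed pattern for this sketch: capture everything before the first "Translation:".
PROVERB_RE = re.compile(r'^(.*?)\s*Translation:', flags=re.DOTALL)


def extract_proverb(text):
    """Return the part of `text` before 'Translation:', or None if the marker is missing."""
    # Replace non-breaking spaces (U+00A0) with ordinary spaces first.
    cleaned = text.replace('\xa0', ' ').strip()
    match = PROVERB_RE.match(cleaned)
    return match.group(1) if match else None


samples = [
    'A ki\xa0i fi ai-mo-we\xa0mookun. Translation: One does not dive under water without knowing how to swim.',
    'No marker in this string.',
]
for sample in samples:
    print(repr(extract_proverb(sample)))
```

Returning `None` for text without the marker makes missing cases easy to spot, for example with `df['Proverb'].isna()` once the values are in a DataFrame.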

# Answer 2
**Score**: 0

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get('http://wiseafrican.isslserv.ng/index.php/category/nigerian-proverbs/yoruba-proverbs/page/5/')
soup4 = BeautifulSoup(res.content, 'html.parser')

data = soup4.findAll('div', 'excerpt')
for i in data:
    # Everything before "Translation:" is the proverb itself
    print(i.p.text.split('Translation:')[0])
```
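
The loop above only prints the proverbs, while the question asks for a DataFrame. Here is a minimal sketch of one way to collect them, assuming `pandas` and `beautifulsoup4` are installed. The `html` string holds two excerpts copied from the question so the example runs offline, and `str.partition` is swapped in for `split` so that an excerpt without the marker can be skipped.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Two excerpts copied from the question's output, inlined so the sketch runs offline.
html = """
<div class="excerpt">
<p>A ki i fi ai-mo-we mookun. Translation: One does not dive under water without knowing how to swim. Meaning: Never engage in a project for which you lack the requisite skills.</p>
</div>
<div class="excerpt">
<p>A ki i da eru ikun pa ori. Translation: One does not weigh the head down with a load that belongs to the belly. Meaning: Responsibilities should rest where they belong.</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

proverbs = []
for div in soup.find_all('div', class_='excerpt'):
    text = div.get_text(strip=True)
    # partition() splits at the first occurrence of the marker;
    # `marker` is an empty string when "Translation:" is absent.
    before, marker, _ = text.partition('Translation:')
    if marker:
        proverbs.append(before.strip())

df = pd.DataFrame({'Proverb': proverbs})
print(df)
```

To run this against the live page, replace `html` with `res.content` from the `requests.get(...)` call in the answer.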

huangapple
  • Published on January 7, 2020, 01:38:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59616571.html