Extracting some text in a sentence from a website in Python

Question

I got stuck while trying to extract some text from the sentences on this [website](http://wiseafrican.isslserv.ng/index.php/category/nigerian-proverbs/yoruba-proverbs/page/5/).

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get('http://wiseafrican.isslserv.ng/index.php/category/nigerian-proverbs/yoruba-proverbs/page/5/')
soup4 = BeautifulSoup(res.content, 'html.parser')

soup4.findAll('div', 'excerpt')
```

Below is the output. I would like to extract only the sentence before **Translation:** in each HTML tag, and then add those sentences to a pandas DataFrame.

```
[<div class="excerpt">
<p>A ki i fi ara eni se oogun alokunna. Translation: One does not use oneself as an ingredient in a medicine requiring that the ingredients be pulverized. Meaning; Self-preservation is a compulsory project for all.</p>
</div>, <div class="excerpt">
<p>A ki i fi ai-mo-we mookun. Translation: One does not dive under water without knowing how to swim. Meaning: Never engage in a project for which you lack the requisite skills.</p>
</div>, <div class="excerpt">
<p>A ki i fi agba sile sin agba. Translation: One does not leave one elder sitting to walk another elder part of his way. meaning: One should not slight one person in order to humor another.</p>
</div>, <div class="excerpt">
<p>A ki i fa ori lehin olori. Translation: One does not shave a head in the absence of the owner. Meaning: One does not settle a matter in the absence of the person most concerned.</p>
</div>, <div class="excerpt">
<p>A ki i duni loye ka fona ile-e Baale hanni. Translation: One does not compete with another for a chieftaincy title and also show the way to the king’s house to the competitor. Meaning: A person should be treated either as an adversary or as an ally, not as both.</p>
</div>, <div class="excerpt">
<p>A ki i du ori olori ki awodi gbe teni lo. Translation: One does not fight to save another person’s head only to have a kite carry one’s own away. Meaning: One should not save other’s at the cost of one’s own safety.</p>
</div>, <div class="excerpt">
<p>A ki i da eru ikun pa ori. Translation: One does not weigh the head down with a load that belongs to the belly. Meaning: Responsibilities should rest where they belong.</p>
</div>, <div class="excerpt">
<p>A ki i da aro nisokun ala la nlo. Translation: One does not engage in a dyeing trade in (isokun) people there wear only white. Meaning Wherever one might be, one should respect the manners and habits of the place.</p>
</div>, <div class="excerpt">
<p>A ki bo sinu omi tan ka maa sa fun otutu. Translation: Does not enter into the water and then run from the cold. Meaning: Precautions are useful only before the event.</p>
</div>, <div class="excerpt">
<p>A fun o lobe o tami si; o gbon ju olobe lo. Translation: You are given some stew and you add water; you must be wiser than the cook. Meaning: Adding water is a means of stretching stew. A person who thus stretches the stew he or she is given would seem to know better than the person who served it how much would suffice for the meal.</p>
</div>]
```
# Answer 1
**Score**: 0

One solution is to load the text into a DataFrame and then use `.str.extract()` to clean the data:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get('http://wiseafrican.isslserv.ng/index.php/category/nigerian-proverbs/yoruba-proverbs/page/5/')
soup4 = BeautifulSoup(res.content, 'html.parser')

# One row per <div class="excerpt">, holding its full text
df = pd.DataFrame([div.get_text(strip=True) for div in soup4.findAll('div', 'excerpt')], columns=['Proverb'])
# Keep only the part before "Translation"
df['Proverb'] = df['Proverb'].str.extract(r'^(.*)\s+Translation', expand=False)
print(df)
```

Prints:

```
   Proverb
0  A ki i fi ara eni se oogun alokunna.
1  A ki i fi ai-mo-we mookun.
2  A ki i fi agba sile sin agba.
3  A ki i fa ori lehin olori.
4  A ki i duni loye ka fona ile-e Baale hanni.
5  A ki i du ori olori ki awodi gbe teni lo.
6  A ki i da eru ikun pa ori.
7  A ki i da aro nisokun ala la nlo.
8  A ki bo sinu omi tan ka maa sa fun otutu.
9  A fun o lobe o tami si; o gbon ju olobe lo.
```

Or use the `re` module before building the DataFrame:

```python
import re

# \1 keeps the captured text before "Translation:" and drops the rest
df = pd.DataFrame([re.sub(r'^(.*)\s+Translation:.*', r'\1', div.get_text(strip=True)) for div in soup4.findAll('div', 'excerpt')], columns=['Proverb'])
print(df)
```

See the pandas documentation for `.str.extract()`.
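The regular expression above can be tried without fetching the page. The following is a minimal sketch using only the standard library, assuming the excerpt text is already available as a string; the helper name `proverb_before_translation` is hypothetical, and the sample string is copied from the question's output with non-breaking spaces (`\xa0`) added to mimic the scraped text. It uses a non-greedy group so that the match stops at the first `Translation:` marker.

```python
import re

# Hypothetical helper; the marker text comes from the page excerpts above.
_BEFORE_TRANSLATION = re.compile(r'^(.*?)\s*Translation:', flags=re.DOTALL)

def proverb_before_translation(text):
    """Return the part of `text` that precedes the first 'Translation:' marker."""
    # Scraped text may contain non-breaking spaces (U+00A0); turn them into
    # ordinary spaces and collapse runs of whitespace.
    cleaned = re.sub(r'\s+', ' ', text.replace('\xa0', ' ')).strip()
    match = _BEFORE_TRANSLATION.search(cleaned)
    # If the marker is missing, return the cleaned text unchanged.
    return match.group(1) if match else cleaned

sample = (
    'A ki\xa0i\xa0fi ai-mo-we\xa0mookun. Translation: One does not dive under '
    'water without knowing how to swim. Meaning: Never engage in a project '
    'for which you lack the requisite skills.'
)
print(proverb_before_translation(sample))  # A ki i fi ai-mo-we mookun.
```

The non-greedy `(.*?)` is a deliberate difference from the greedy `(.*)` in the answer: if an excerpt happened to contain the word "Translation" twice, the greedy form would capture everything up to the last occurrence.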


# Answer 2
**Score**: 0

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get('http://wiseafrican.isslserv.ng/index.php/category/nigerian-proverbs/yoruba-proverbs/page/5/')
soup4 = BeautifulSoup(res.content, 'html.parser')

data = soup4.findAll('div', 'excerpt')
for i in data:
    # print(i.p.text)  # full paragraph text
    # Keep only the part before "Translation:"
    print(i.p.text.split('Translation:')[0])
```
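The loop above prints each proverb, while the question asks for rows that can go into a DataFrame. A minimal sketch of that collection step follows, assuming the paragraph texts have already been pulled out of the page; the function name `collect_proverbs` and the `excerpts` list are hypothetical, with the sample strings copied from the question's output. It uses `str.partition` instead of `split` so that excerpts without the marker can be skipped, and strips the trailing space left before `Translation:`.

```python
def collect_proverbs(texts, marker='Translation:'):
    """Build one {'Proverb': ...} row per excerpt that contains `marker`."""
    rows = []
    for text in texts:
        # partition() splits at the first marker; `found` is '' when it is absent.
        head, found, _ = text.partition(marker)
        if found:
            rows.append({'Proverb': head.strip()})
    return rows

# Sample texts standing in for
# [div.p.text for div in soup4.findAll('div', 'excerpt')].
excerpts = [
    'A ki i fa ori lehin olori. Translation: One does not shave a head in '
    'the absence of the owner. Meaning: One does not settle a matter in the '
    'absence of the person most concerned.',
    'A ki i da eru ikun pa ori. Translation: One does not weigh the head '
    'down with a load that belongs to the belly. Meaning: Responsibilities '
    'should rest where they belong.',
]
rows = collect_proverbs(excerpts)
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, which should give a single `Proverb` column with one row per excerpt.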

huangapple
  • Posted on January 7, 2020, 01:38:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59616571.html