
Creating dataframe from beautifulsoup4 result does not work due to structure

Question

I get "ValueError: No tables found".

I try to scrape HTML from a website as follows:

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import lxml

def getHTMLdocument(url):

    response = requests.get(url) 

    return response.text

url_to_scrape = "https://website.com"
html_document = getHTMLdocument(url_to_scrape)

soup = BeautifulSoup(html_document, 'lxml')
table = soup.find_all('h3')

table

As output I get the following (which I am totally fine with):

<h3><a class="xyz-link" href="https//address1.com">address1</a></h3>,
<h3><a class="xyz-link" href="https//address2.com">address2</a></h3>,
...

After that I try using

df = pd.read_html(str(table))[0]
df

but I get "ValueError: No tables found". I think this is because of the structure of my BeautifulSoup result. I want to extract the addresses (e.g. https//address1.com) and the text that follows (address1) into a dataframe. Any ideas?
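For context, a minimal sketch of why this fails: pd.read_html scans the markup for &lt;table&gt; elements only, so HTML made up of just &lt;h3&gt;/&lt;a&gt; tags cannot produce a dataframe this way (the snippet below uses an inline HTML string in place of the real page):

```python
# pd.read_html looks only for <table> elements, so HTML that contains
# just <h3>/<a> tags raises exactly the error from the question.
from io import StringIO

import pandas as pd

html = '<h3><a class="xyz-link" href="https//address1.com">address1</a></h3>'

try:
    df = pd.read_html(StringIO(html))[0]
except ValueError as err:
    print(err)  # prints: No tables found
```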


Answer 1

Score: 2

You can try:

data = [{'Link': t['href'], 'Name': t.text} for t in soup.select('h3 > a')]
df = pd.DataFrame(data)
print(df)

# Output
                  Link      Name
0  https//address1.com  address1
1  https//address2.com  address2

Rather than extracting all &lt;a&gt; or &lt;h3&gt; tags, we extract all data where an &lt;h3&gt; tag is directly followed by an &lt;a&gt; tag, using the select method instead of find_all. The rest is a list comprehension that creates a dict with the href attribute and the text. Finally, you can create a Pandas DataFrame from the list of dicts.
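As a self-contained sketch of this approach, with an inline HTML snippet standing in for the fetched page (the real code would pass requests.get(url).text to BeautifulSoup instead):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in for the downloaded page; mirrors the <h3>/<a> structure above.
html_document = """
<h3><a class="xyz-link" href="https//address1.com">address1</a></h3>
<h3><a class="xyz-link" href="https//address2.com">address2</a></h3>
"""

soup = BeautifulSoup(html_document, "html.parser")  # 'lxml' also works if installed

# The CSS selector 'h3 > a' matches <a> tags that are direct children of an <h3>
data = [{"Link": t["href"], "Name": t.text} for t in soup.select("h3 > a")]
df = pd.DataFrame(data)
print(df)
```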


huangapple
  • Posted on 2023-03-08 15:17:17
  • Please keep this link when reposting: https://go.coder-hub.com/75670240.html