Creating dataframe from beautifulsoup4 result does not work due to structure

Question
I get "ValueError: No tables found".
I try to scrape HTML from a website as follows:

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import lxml

def getHTMLdocument(url):
    # Fetch the page and return its HTML as text
    response = requests.get(url)
    return response.text

url_to_scrape = "https://website.com"
html_document = getHTMLdocument(url_to_scrape)
soup = BeautifulSoup(html_document, 'lxml')
table = soup.find_all('h3')
table
As output I get the following (which I am fine with):
<h3><a class="xyz-link" href="https//address1.com">address1</a></h3>,
<h3><a class="xyz-link" href="https//address2.com">address2</a></h3>,
...
After that I try using

df = pd.read_html(str(table))[0]
df

but I get "ValueError: No tables found". I think this is because of the structure of my BeautifulSoup result. I want to extract the addresses (e.g. https//address1.com) and also the accompanying text (address1) into a dataframe. Any ideas?
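For context, pd.read_html only looks for <table> elements, which is why it raises "No tables found" on <h3> markup. A minimal stdlib-only sketch (no bs4 or pandas; the inline snippet below is assumed to mirror the markup from the question) shows the links are still easy to pull out directly:

```python
from html.parser import HTMLParser

# Collect (href, text) pairs from <a> tags without any third-party parser.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []        # list of (href, text) tuples
        self._href = None      # href of the <a> tag currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text seen while an <a> tag is open belongs to that link
        if self._href is not None:
            self.links.append((self._href, data))
            self._href = None

snippet = '<h3><a class="xyz-link" href="https//address1.com">address1</a></h3>'
collector = LinkCollector()
collector.feed(snippet)
print(collector.links)  # [('https//address1.com', 'address1')]
```

This is only a sketch of the underlying idea; the accepted answer below does the same thing more conveniently with BeautifulSoup.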
Answer 1 (score: 2)
You can try:
data = [{'Link': t['href'], 'Name': t.text} for t in soup.select('h3 > a')]
df = pd.DataFrame(data)
print(df)
# Output
                  Link      Name
0  https//address1.com  address1
1  https//address2.com  address2
Rather than extracting all <a> or <h3> tags, we extract only the <a> tags that are direct children of an <h3> tag, using the select method instead of find_all. The rest is a list comprehension that creates a dict from the href attribute and the link text. Finally, you can create a Pandas DataFrame from the list of dicts.
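To try this locally without fetching a page, the same comprehension can be run on an inline snippet (the markup below is assumed to mirror the question; bs4's built-in "html.parser" backend is used so lxml is not required):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline HTML standing in for the scraped page
snippet = (
    '<h3><a class="xyz-link" href="https//address1.com">address1</a></h3>'
    '<h3><a class="xyz-link" href="https//address2.com">address2</a></h3>'
)
soup = BeautifulSoup(snippet, "html.parser")

# "h3 > a" matches <a> tags that are direct children of <h3>
data = [{"Link": t["href"], "Name": t.text} for t in soup.select("h3 > a")]
df = pd.DataFrame(data)
print(df)
```

Note that t["href"] raises a KeyError if an <a> tag has no href attribute; t.get("href") would return None instead, should the real page contain such links.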