Question

I get "ValueError: No tables found".

I am trying to scrape HTML from a website as follows:

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import lxml

def getHTMLdocument(url):

    response = requests.get(url) 

    return response.text

url_to_scrape = "https://website.com"
html_document = getHTMLdocument(url_to_scrape)

soup = BeautifulSoup(html_document, 'lxml')
table = soup.find_all('h3')

table

As output I get the following (which I am fine with):

<h3><a class="xyz-link" href="https//address1.com">address1</a></h3>,
<h3><a class="xyz-link" href="https//address2.com">address2</a></h3>,
...

After that I try using

df = pd.read_html(str(table))[0]
df

but I get "ValueError: No tables found". I think this is because of the structure of my BeautifulSoup result. I want to extract the addresses (e.g. https//address1.com) and also the link text that follows (address1) into a DataFrame. Any ideas?


Answer 1

Score: 2

You can try:

data = [{'Link': t['href'], 'Name': t.text} for t in soup.select('h3 > a')]
df = pd.DataFrame(data)
print(df)

# Output
                  Link      Name
0  https//address1.com  address1
1  https//address2.com  address2

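pd.read_html only parses <table> elements, which is why it raises "No tables found" when given a list of <h3> tags. Rather than extracting every <a> or every <h3> tag, use the select method instead of find_all with the CSS selector 'h3 > a', which matches only <a> tags that are direct children of an <h3>. The list comprehension builds one dict per link from its href attribute and its text. Finally, pd.DataFrame turns that list of dicts into a DataFrame.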
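For anyone who wants to try this without hitting a live site, here is a minimal self-contained sketch of the same idea. The sample HTML is made up for illustration, and it uses Python's built-in html.parser rather than lxml so there is one less dependency; both choices are assumptions, not part of the original post. It also swaps in .get('href') and get_text(strip=True), which tolerate a missing href and stray whitespace.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample markup standing in for the real page (made up for illustration).
html = """
<h3><a class="xyz-link" href="https//address1.com">address1</a></h3>
<h3><a class="xyz-link" href="https//address2.com">address2</a></h3>
<h3>Heading without a link</h3>
<p><a href="https//elsewhere.com">not inside an h3</a></p>
"""

# html.parser ships with Python; the original post uses 'lxml' instead.
soup = BeautifulSoup(html, "html.parser")

# 'h3 > a' matches only <a> tags that are direct children of an <h3>,
# so the bare heading and the link inside <p> are both skipped.
rows = [
    {"Link": a.get("href"), "Name": a.get_text(strip=True)}
    for a in soup.select("h3 > a")
]

# Passing columns explicitly keeps the header even if no links are found.
df = pd.DataFrame(rows, columns=["Link", "Name"])
print(df)
```

If the page nests the link deeper (for example inside a <span> within the <h3>), the descendant selector 'h3 a' would be the one to try instead of 'h3 > a'.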
