2023-02-06 21:26:54

Combine multiple BeautifulSoup calls

Question

I want to iterate over a web page.
I use BeautifulSoup to find/select the tags in the HTML.
For now, I have two separate statements, but I'd like to do it in one so I don't have to iterate over the same page twice.
My code is the following:

import json

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
sapo = "https://casa.sapo.pt/comprar-apartamentos/ofertas-recentes/distrito.lisboa/?pn=1"
response = requests.get(sapo, headers=headers)  # assumed: the request itself was not shown in the post
soup = BeautifulSoup(response.text, 'html.parser')
# One parsed dict per JSON-LD <script> block on the page
data1 = [json.loads(x.string) for x in soup.find_all("script", type="application/ld+json")]
data2 = soup.select('div.property')
del data1[:2]  # the first two JSON-LD blocks are page-level overhead, not listings

There are 25 properties on the page. data1 returns 27 results, but the first two are just overhead, so I delete them. That leaves 25 results with 10 "columns".
Now I'd like to add data2 as an 11th column.

How could I achieve this?

Answer 1

Score: 1

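I am not sure why you want the whole HTML element, but here we go. Change your strategy for selecting elements and start with the containers. This relies on each div.property container holding its own script tag, so a single loop yields both the element and its JSON: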

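import pandas as pd

# json and soup come from the code in the question
data = []
for e in soup.select('div.property'):
    d = {'html': e}  # keep the whole container element
    d.update(json.loads(e.script.string))  # merge in the JSON-LD fields from this container's first <script>
    data.append(d)
pd.DataFrame(data)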
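
To see why starting from the containers works, here is a minimal self-contained sketch that runs against a small inline HTML string instead of the live page. The markup in SAMPLE_HTML is invented for illustration and only assumes what the loop above relies on: each div.property holds its own application/ld+json script. The helper name rows_from_containers is hypothetical.

```python
import json

from bs4 import BeautifulSoup

# Invented sample markup; the structure of the real page is an assumption.
SAMPLE_HTML = """
<div class="property">
  <a href="/detalhes/1">Flat 1</a>
  <script type="application/ld+json">{"name": "Flat 1", "price": 100}</script>
</div>
<div class="property">
  <a href="/detalhes/2">Flat 2</a>
  <script type="application/ld+json">{"name": "Flat 2", "price": 200}</script>
</div>
"""


def rows_from_containers(html):
    """Return one dict per div.property: the element plus its JSON-LD fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for e in soup.select("div.property"):
        row = {"html": e}                        # the container element itself
        row.update(json.loads(e.script.string))  # fields from its first <script>
        rows.append(row)
    return rows


rows = rows_from_containers(SAMPLE_HTML)
print([(r["name"], r["price"]) for r in rows])
```

Because the JSON is read from inside each container, the element and its data stay paired without any index bookkeeping such as deleting the first two results.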

EDIT

Based on your comment, extract the href via

d = {'link': 'https://casa.sapo.pt' + e.a.get('href')}

data = []
for e in soup.select('div.property'):
    # prepend the site root to the href of the container's first <a>
    d = {'link': 'https://casa.sapo.pt' + e.a.get('href')}
    d.update(json.loads(e.script.string))
    data.append(d)
pd.DataFrame(data)
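
The string concatenation above assumes every container has an a tag with a relative href. A slightly more defensive sketch, using urllib.parse.urljoin from the standard library, also copes with absolute hrefs and with containers that lack a link or a script. The sample markup and the helper name extract_row are invented for illustration; this is one possible variant, not part of the original answer.

```python
import json
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://casa.sapo.pt"

# Invented sample markup; the second container deliberately has no link.
SAMPLE_HTML = """
<div class="property">
  <a href="/detalhes/1">Flat 1</a>
  <script type="application/ld+json">{"name": "Flat 1"}</script>
</div>
<div class="property">
  <script type="application/ld+json">{"name": "No link"}</script>
</div>
"""


def extract_row(container, base_url=BASE_URL):
    """Build one row from a container, tolerating a missing link or script."""
    anchor = container.find("a", href=True)
    script = container.find("script", type="application/ld+json")
    # urljoin leaves absolute hrefs untouched and resolves relative ones
    row = {"link": urljoin(base_url, anchor["href"]) if anchor else None}
    if script and script.string:
        row.update(json.loads(script.string))
    return row


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
data = [extract_row(e) for e in soup.select("div.property")]
print(data)
```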
