Python BeautifulSoup scraping and collecting data

Question


From Zillow I want to get every item in the unordered list and go to the specific class:

"StyledPropertyCardDataWrapper-c11n-8-84-3__sc-1omp4c3-0 bKpguY property-card-data"

Where I can get the price, address, and link, creating 3 different lists, one for each of the wanted details.
I'm only getting empty results.

I tried this:

all_tags = soup.select(selector='body ul')
print(all_tags)

# note: the BeautifulSoup method is find_all(), not findall()
all_li_tags = [tags.find_all('li') for tags in all_tags]
print(all_li_tags)

and a lot of other variations.

I tried to create those 3 different lists containing the address, price, and link to the property.

Answer 1

Score: 1


Since the question is tagged with selenium, I assume you are getting the completely rendered source text, with which you can select your elements as follows (select your elements by a stable id or tag, and try to avoid dynamic identifiers such as class names):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

# selenium 4 expects the driver path to be wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager(version='114.0.5735.90').install()))

# url is the Zillow search results page you want to scrape
driver.get(url)

soup = BeautifulSoup(driver.page_source, 'html.parser')

data = []

for e in soup.select('#grid-search-results article'):
    data.append({
        'address': e.address.text,
        'price': e.span.text,
        'link': e.a.get('href') if e.a.get('href').startswith('https') else 'https://www.zillow.com' + e.a.get('href')
    })

data

Keep in mind that this will only pick up the information from the first page; you need to iterate over all following result pages if you want to scrape more than the first 41 items.
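A minimal sketch of how that page iteration could look, reusing the driver and the parsing logic from above; the a[title="Next page"] selector and the fixed sleep are assumptions about Zillow's markup and timing and will likely need adjusting:

import time
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup

data = []

while True:
    # collect the property cards of the currently rendered result page
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for e in soup.select('#grid-search-results article'):
        data.append({
            'address': e.address.text,
            'price': e.span.text,
            'link': e.a.get('href') if e.a.get('href').startswith('https') else 'https://www.zillow.com' + e.a.get('href')
        })
    try:
        # assumed selector for the "next page" control - verify it against the live page
        driver.find_element(By.CSS_SELECTOR, 'a[title="Next page"]:not([aria-disabled="true"])').click()
    except NoSuchElementException:
        break
    time.sleep(2)  # crude wait for the next page to render; an explicit WebDriverWait would be more robust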

Furthermore, instead of different lists, you should choose a more structured form of storage like a dict, which can easily be transformed into a DataFrame, etc. (a pandas sketch follows the example output below):

[{'address': '1190 Mission at Trinity Place | 1190 Mission St, San Francisco, CA',
  'price': '$2,199+ 1 bd',
  'link': 'https://www.zillow.com/apartments/san-francisco-ca/1190-mission-at-trinity-place/5XjVtb/'},
 {'address': '1188 Mission at Trinity Place | 1188 Mission St, San Francisco, CA',
  'price': '$2,099+ 1 bd',
  'link': 'https://www.zillow.com/apartments/san-francisco-ca/1188-mission-at-trinity-place/5XjN4q/'},
 {'address': 'Soma at 788 | 788 Harrison St, San Francisco, CA',
  'price': '$2,767+ 1 bd',
  'link': 'https://www.zillow.com/apartments/san-francisco-ca/soma-at-788/5XkGzw/'},...]
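For example, the list of dicts can be turned into a pandas DataFrame in one line, and the three separate lists asked for in the question fall out of it directly (assuming pandas is installed):

import pandas as pd

# one row per listing, columns taken from the dict keys
df = pd.DataFrame(data)
print(df.head())

# the three lists originally asked for
addresses = df['address'].tolist()
prices = df['price'].tolist()
links = df['link'].tolist()

# persist the results
df.to_csv('zillow_listings.csv', index=False)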

An alternative would be to access the JSON requests of the API; see: https://stackoverflow.com/q/76209111/14460824
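As a rough sketch of that approach: the request URL and its query parameters have to be copied from the browser's network tab (search_api_url below is only a placeholder, not a real endpoint), and a browser-like User-Agent header is typically required:

import requests

# placeholder: replace with the XHR URL copied from the browser's dev tools
search_api_url = 'https://www.zillow.com/...'

headers = {
    # without a browser-like User-Agent the request is usually blocked
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

response = requests.get(search_api_url, headers=headers)
response.raise_for_status()

results = response.json()
# the JSON structure depends on the endpoint - inspect the top-level keys first
print(list(results.keys()))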
