Python BeautifulSoup scraping and collecting data


Question

From Zillow I want to get every item in the unordered list of search results and drill down into the specific class:

"StyledPropertyCardDataWrapper-c11n-8-84-3__sc-1omp4c3-0 bKpguY property-card-data"

From there I can get the price, the address, and the link, creating 3 different lists, one for each of the wanted details.
However, I am only getting empty results.

I tried this:

all_tags = soup.select(selector='body ul')
print(all_tags)

# collect the <li> tags of every <ul> on the page
all_li_tags = [tags.find_all('li') for tags in all_tags]
print(all_li_tags)

and a lot of other variations.

I tried to create those 3 different lists containing the address, the price, and the link to each property.

Answer 1

Score: 1

Since the question is tagged with selenium, I assume that you are working with the completely rendered page source, in which case you can select your elements as follows (select your elements by a stable id or tag, and try to avoid dynamic identifiers such as those generated class names):

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

# install a chromedriver matching the local Chrome version
driver = webdriver.Chrome(ChromeDriverManager(version='114.0.5735.90').install())

# url is the Zillow search results page to scrape
driver.get(url)

# parse the fully rendered page source
soup = BeautifulSoup(driver.page_source, 'html.parser')

data = []

# each result card is an <article> inside the search results grid
for e in soup.select('#grid-search-results article'):
    data.append({
        'address': e.address.text,
        'price': e.span.text,
        'link': e.a.get('href') if e.a.get('href').startswith('https') else 'https://www.zillow.com' + e.a.get('href')
    })

data

Keep in mind that this will only pick up the information from the first page; if you want to scrape more than the first 41 items, you need to iterate over all following result pages (see the sketch below).
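
A minimal sketch of that pagination, reusing the driver and the BeautifulSoup import from above and replacing the single-page loop; note that the a[rel="next"] selector for the next-page control is an assumption and should be verified against Zillow's actual pagination markup:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time

data = []

while True:
    # re-parse whatever is currently rendered in the browser
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # same extraction as above: one dict per result card
    for e in soup.select('#grid-search-results article'):
        data.append({
            'address': e.address.text,
            'price': e.span.text,
            'link': e.a.get('href') if e.a.get('href').startswith('https') else 'https://www.zillow.com' + e.a.get('href')
        })

    try:
        # assumed selector for the "next page" control -- adjust after checking
        # Zillow's pagination markup in the browser's dev tools
        next_link = driver.find_element(By.CSS_SELECTOR, 'a[rel="next"]')
    except NoSuchElementException:
        break  # no pagination control found, stop

    if next_link.get_attribute('aria-disabled') == 'true':
        break  # last page reached

    next_link.click()
    time.sleep(2)  # crude wait for rendering; an explicit WebDriverWait is more robust

Driving the pagination through the browser keeps each page fully rendered before it is parsed.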

Furthermore, instead of separate lists, you should choose a more structured form of storage such as a list of dicts, which can easily be transformed into a dataframe, etc.:

[{'address': '1190 Mission at Trinity Place | 1190 Mission St, San Francisco, CA',
  'price': '$2,199+ 1 bd',
  'link': 'https://www.zillow.com/apartments/san-francisco-ca/1190-mission-at-trinity-place/5XjVtb/'},
 {'address': '1188 Mission at Trinity Place | 1188 Mission St, San Francisco, CA',
  'price': '$2,099+ 1 bd',
  'link': 'https://www.zillow.com/apartments/san-francisco-ca/1188-mission-at-trinity-place/5XjN4q/'},
 {'address': 'Soma at 788 | 788 Harrison St, San Francisco, CA',
  'price': '$2,767+ 1 bd',
  'link': 'https://www.zillow.com/apartments/san-francisco-ca/soma-at-788/5XkGzw/'},...]
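
As an illustration, such a list of dicts converts directly into a dataframe (assuming data is the list built above and pandas is installed):

import pandas as pd

# one row per result card, the dict keys become the column names
df = pd.DataFrame(data)
print(df.head())

# optionally persist the results
df.to_csv('zillow_results.csv', index=False)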

An alternative would be to access the JSON requests of the API directly; see: https://stackoverflow.com/q/76209111/14460824
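
If you go down that route, the general pattern is a plain requests call that returns JSON; the endpoint below is only a placeholder, and the real request (URL, parameters, headers) has to be reproduced from the XHR traffic in the browser's network tab, as described in the linked question:

import requests

# placeholder URL -- copy the real search endpoint and its parameters from the
# browser's network tab; this is not a documented public API
api_url = 'https://www.zillow.com/...'

headers = {
    'User-Agent': 'Mozilla/5.0',  # many sites reject requests without a browser-like user agent
    'Accept': 'application/json',
}

resp = requests.get(api_url, headers=headers)
resp.raise_for_status()
results = resp.json()  # already structured data, no HTML parsing needed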
