Python beautifulsoup scraping and collecting data
From Zillow I want to get every item in the unordered list of search results and drill down to the specific class:
"StyledPropertyCardDataWrapper-c11n-8-84-3__sc-1omp4c3-0 bKpguY property-card-data"
where I can get the price, address, and link, building 3 different lists, one for each of the wanted details.
I'm getting just empty results.
I tried this:
all_tags = soup.select(selector='body ul')
print(all_tags)
all_li_tags = [tags.findall('li') for tags in all_tags]
print(all_li_tags)
and a lot of other variations. I tried to create those 3 different lists containing the address, price, and link to each property.
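One thing worth noting about the snippet above: BeautifulSoup's method is spelled `find_all` (or the legacy `findAll`), not `findall` — attribute access like `tags.findall` is treated as a lookup for a child tag named `findall`, which yields `None`. A minimal sketch of the corrected list comprehension, run against a small static HTML string for illustration:

```python
from bs4 import BeautifulSoup

# Small static page standing in for the rendered Zillow markup (illustrative only).
html = """
<body>
  <ul>
    <li>card 1</li>
    <li>card 2</li>
  </ul>
</body>
"""

soup = BeautifulSoup(html, 'html.parser')
all_tags = soup.select('body ul')
# find_all (not findall) returns the <li> children of each matched <ul>
all_li_tags = [tag.find_all('li') for tag in all_tags]
print(all_li_tags)
```

Even with the spelling fixed, a plain HTTP download of Zillow's result page often will not contain the cards at all, since they are rendered client-side — which is what the answer below works around.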
Answer 1
Score: 1
Since this is tagged with selenium, I assume that you will get the completely rendered page source, with which you can select your elements as follows (select your elements by a stable id or tag name, and try to avoid dynamic identifiers such as generated class names):
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

url = 'https://www.zillow.com/san-francisco-ca/'  # example: the search results page to scrape

driver = webdriver.Chrome(ChromeDriverManager(version='114.0.5735.90').install())
driver.get(url)

# parse the fully rendered page source
soup = BeautifulSoup(driver.page_source, 'html.parser')

data = []
# each result card is an <article> inside the results grid
for e in soup.select('#grid-search-results article'):
    data.append({
        'address': e.address.text,
        'price': e.span.text,
        'link': e.a.get('href') if e.a.get('href').startswith('https') else 'https://www.zillow.com' + e.a.get('href')
    })

data
Keep in mind that this will only pick up the information from the first result page; if you want to scrape more than the first 41 items, you need to iterate over all following result pages.
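As a sketch of that iteration, assuming Zillow's paginated URLs follow a `<base>/2_p/`, `<base>/3_p/`, ... pattern (an assumption — verify it against the links in the site's pagination bar), the page URLs could be generated like this:

```python
# Build the URLs of the first n result pages, assuming the
# "/2_p/", "/3_p/", ... suffix pattern (hypothetical; verify on the site).
def result_page_urls(base_url: str, n_pages: int) -> list[str]:
    base = base_url.rstrip('/')
    urls = [base + '/']  # page 1 has no suffix
    for page in range(2, n_pages + 1):
        urls.append(f'{base}/{page}_p/')
    return urls

print(result_page_urls('https://www.zillow.com/san-francisco-ca/', 3))
```

Each URL would then be fed to `driver.get(...)` and parsed with the same selector as above.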
Furthermore, instead of three different lists, you should choose a more structured form of storage like a dict, which can easily be transformed into a dataframe, ...:
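For instance, a list of dicts like the one collected above converts directly into a pandas DataFrame, and the three separate lists the question asked for fall out as columns (pandas is assumed to be installed; the two sample records here are shortened stand-ins for the scraped data):

```python
import pandas as pd

# list-of-dicts records, as collected by the scraping loop above
data = [
    {'address': '1190 Mission St, San Francisco, CA',
     'price': '$2,199+ 1 bd',
     'link': 'https://www.zillow.com/apartments/san-francisco-ca/1190-mission-at-trinity-place/5XjVtb/'},
    {'address': '788 Harrison St, San Francisco, CA',
     'price': '$2,767+ 1 bd',
     'link': 'https://www.zillow.com/apartments/san-francisco-ca/soma-at-788/5XkGzw/'},
]

df = pd.DataFrame(data)

# the three lists, if you still want them separately
addresses = df['address'].tolist()
prices = df['price'].tolist()
links = df['link'].tolist()
```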
[{'address': '1190 Mission at Trinity Place | 1190 Mission St, San Francisco, CA',
'price': '$2,199+ 1 bd',
'link': 'https://www.zillow.com/apartments/san-francisco-ca/1190-mission-at-trinity-place/5XjVtb/'},
{'address': '1188 Mission at Trinity Place | 1188 Mission St, San Francisco, CA',
'price': '$2,099+ 1 bd',
'link': 'https://www.zillow.com/apartments/san-francisco-ca/1188-mission-at-trinity-place/5XjN4q/'},
{'address': 'Soma at 788 | 788 Harrison St, San Francisco, CA',
'price': '$2,767+ 1 bd',
'link': 'https://www.zillow.com/apartments/san-francisco-ca/soma-at-788/5XkGzw/'},...]
An alternative would be to access the JSON requests of the API; see: https://stackoverflow.com/q/76209111/14460824