(Python) Dictionary Within a list gives "IndexError: list index out of range" error
Question
<details>
<summary>English:</summary>
[{'url': '/apartments/san-francisco-ca/923-folsom/5Yy6Np/'}, {'url': 'https://www.zillow.com/apartments/san-francisco-ca/the-brady/C4XYzK/'}, {'url': 'https://www.zillow.com/apartments/san-francisco-ca/northpoint-apartments/5XjLPJ/'}, {'url': 'https://www.zillow.com/b/747-geary-street-oakland-ca-CYzGVt/'}, {'url': '/apartments/san-francisco-ca/edgewater/5XjVQc/'}, {'url': '/apartments/san-francisco-ca/1188-mission-at-trinity-place/5XjN4q/'}]
This is the output of self.results. It is a list, and len reports 6 elements. I want to access the 'url' value inside each dictionary.
import requests
from bs4 import BeautifulSoup
import json
from selenium import webdriver
import time
params={
'searchQueryState': '{"mapBounds":{"north":42.009517,"east":-114.131253,"south":32.528832,"west":-124.482045},"mapZoom":5,"isMapVisible":true,"filterState":{"price":{"max":872627},"beds":{"min":1},"fore":{"value":false},"mp":{"max":3000},"auc":{"value":false},"nc":{"value":false},"fr":{"value":true},"fsbo":{"value":false},"cmsn":{"value":false},"fsba":{"value":false}},"isListVisible":true,"regionSelection":[{"regionId":9,"regionType":2}],"usersSearchTerm":"California","schoolId":null}'
}
class ZillowScraper():
    results=[]
    headers= {
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'Accept-Encoding':'gzip, deflate, br',
        'Accept-Language':'tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7',
        'Cache-Control':'max-age=0',
'Cookie':'zguid=24|%249e975354-f675-419d-88e1-268640260ac4; _ga=GA1.2.124390097.1686931595; zjs_anonymous_id=%229e975354-f675-419d-88e1-268640260ac4%22; zg_anonymous_id=%2262a63b50-344d-4671-880c-04d453844549%22; _pxvid=c4d5254a-0c5f-11ee-9b64-15b9a4b973bd; _gcl_au=1.1.80251162.1686931598; __pdst=38dc54a14928440ea9e18bce80e1ec08; _fbp=fb.1.1686931597756.1157135805; _pin_unauth=dWlkPU9EVXhZelF4WmpZdE4yWXhNaTAwTTJReUxUaGhZVEl0TlRaak1tRXpZamxsTldWbA; g_state={"i_l":0}; userid=X|3|3dbd58da0e59b75f%7C2%7Cq9T-1RNvuHMHNFQyWSbr7sl7mTb0P-i9btbmtnsodtE%3D; loginmemento=1|2fd5fb94fc32e42f5344da80f8f13e75986fad0f163b2c5704da1ff0eae2afb6; zjs_user_id=%22X1-ZUrfq9bm176tc9_1zmzs%22; JSESSIONID=973ED705C2EDB91A7BF10D85CDFF5814; zgsession=1|3c6828c5-3e67-4639-9972-fd4736403a62; ZILLOW_SID=1|AAAAAVVbFRIBVVsVEjg1bj1MjO6GG9oxT7XzrfvIXY2RKMzKEvaz6DaBCZtWJSCGtWGKfElpHr957F4KqyLOxAtmIUA1BizCKg; _gid=GA1.2.1380352076.1688718740; pxcts=ca98aaaa-1ca0-11ee-b480-616f51557753; DoubleClickSession=true; _clck=1uqa94j|2|fd3|0|1262; tfpsi=dcb016eb-2d99-44e8-a777-9a8a6050caaf; _uetsid=333808901ca211eeab4a1f19a5284c94; _uetvid=c5ebbf500c5f11eeb6c3c9c655f631e0; _px3=9c5cade37141a8e60d6380f6a9cfb3c9330d4e390ed91edd6b011cb63e92933d:d7C2qYwntLRbpcZy/DTjqpGCduzy+AH7I9u+VkVNCzM4wMXrOL7OP2R//DfgEP1kRn1hrxo8uUkl8Opv4tdx3w==:1000:cS+n8Og6wNAKCZwozMRKoLCCrdTzCbVArgT+vhD5SwsRuMYxwoRCaJL7Y1Y8iIp2j1ARVrm9eNWUlJykb5sPuC1PxrsMJodXAyq9y7PFy4qFgv5GdyFWvmV0E2asllfSbIp15P9L4sGP8A+qTuFB5snLuOo5NWZidGglE+KYRQx0gieWD2BkI4BQjN3DnrU1bJRUa9+omJnH1zphmhIJ4w==; __gads=ID=0789e8e54f1dec08:T=1686931597:RT=1688720215:S=ALNI_MaqX2esiAR8pEZ2TAX0dGHUBNSrdA; __gpi=UID=00000c2fceeb5fe3:T=1686931597:RT=1688720215:S=ALNI_MaGvh-b1S5JJXngtBs3-TQBlDKz_w; _gat=1; _clsk=1gz2p94|1688720278842|12|0|s.clarity.ms/collect; AWSALB=MKF/saKP1r0geqVVfYxmvBt8BXaElPu2UvRc+unjgOvJ0lVSdn2+oATj6UI10oWkuLKOz9thWJZkdY+XiON9bnmZojeR0r3c11rEA6oSYhi1bOO/QS35UIMVD83K; 
AWSALBCORS=MKF/saKP1r0geqVVfYxmvBt8BXaElPu2UvRc+unjgOvJ0lVSdn2+oATj6UI10oWkuLKOz9thWJZkdY+XiON9bnmZojeR0r3c11rEA6oSYhi1bOO/QS35UIMVD83K; search=6|1691312278512%7Crect%3D42.009517%252C-114.131253%252C32.528832%252C-124.482045%26rid%3D9%26disp%3Dmap%26mdm%3Dauto%26p%3D1%26z%3D0%26listPriceActive%3D1%26beds%3D1-%26price%3D0-872627%26mp%3D0-3000%26fs%3D0%26fr%3D1%26mmm%3D0%26rs%3D0%26ah%3D0%26singlestory%3D0%26housing-connector%3D0%26abo%3D0%26garage%3D0%26pool%3D0%26ac%3D0%26waterfront%3D0%26finished%3D0%26unfinished%3D0%26cityview%3D0%26mountainview%3D0%26parkview%3D0%26waterview%3D0%26hoadata%3D1%26zillow-owned%3D0%263dhome%3D0%26featuredMultiFamilyBuilding%3D0%26excludeNullAvailabilityDates%3D0%26commuteMode%3Ddriving%26commuteTimeOfDay%3Dnow%09%09%09%7B%22isList%22%3Atrue%2C%22isMap%22%3Atrue%7D%09%09%09%09%09',
        'Sec-Ch-Ua':'"Not.A/Brand";v="8", "Chromium";v="114", "Google Chrome";v="114"',
        'Sec-Ch-Ua-Mobile':'?0',
        'Sec-Ch-Ua-Platform':'"Windows"',
        'Sec-Fetch-Dest':'document',
        'Sec-Fetch-Mode':'navigate',
        'Sec-Fetch-Site':'same-origin',
        'Sec-Fetch-User':'?1',
        'Upgrade-Insecure-Requests':'1',
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36}'
    }
    def fetch(self,url,params):
        response = requests.get(url,headers=self.headers, params=params)
        print(response)
        return response
    def parse(self, response):
        content = BeautifulSoup(response,features='html.parser')
        deck = content.find('ul', {'class': 'List-c11n-8-89-0__sc-1smrmqp-0 StyledSearchListWrapper-srp__sc-1ieen0c-0 kZyCWU fgiidE photo-cards'})
        for card in deck.contents[0]:
            script = card.find('script',{'type': 'application/ld+json'})
            if script:
                script_json = json.loads(script.contents)
                print(script_json)
                try:
                    self.results.append({
                        'latitude': script_json['geo']['latitude'],
                        'longtitude': script_json['json']['longtitude'],
                        'context': script_json['floorSize']['value'],
                        'url': script_json['url']
                    })
                except KeyError:
                    self.results.append({
                        'url': script_json['url']
                    })
                print(script_json['url'])
            else:
                article= card.find('article',{'role':'presentation'})
                if article is None:
                    continue
                else:
                    script=article.find('address',{'data-test':'property-card-addr'})
                    if script:
                        print(script.prettify())
                        pass
                    else:
                        return
            print(self.results)
            #print(type(self.results[0]))
            print(self.results)
            #if "http" not in self.results[0]{'url'}:
            #    self.results[-1]{'url'}=f"https://zillow.com{self.results[0]{'url'}}"
            print(self.results)
    def run(self):
        url="https://www.zillow.com/san-francisco-ca/rentals/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22mapBounds%22%3A%7B%22north%22%3A37.88007392999709%2C%22east%22%3A-122.09541072195475%2C%22south%22%3A37.59118324417655%2C%22west%22%3A-122.823254960236%7D%2C%22usersSearchTerm%22%3A%22California%22%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A20330%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22price%22%3A%7B%22max%22%3A872627%7D%2C%22beds%22%3A%7B%22min%22%3A1%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22mp%22%3A%7B%22max%22%3A3000%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%7D"
        params={
            'searchQueryState': '{"mapBounds":{"north":42.009517,"east":-114.131253,"south":32.528832,"west":-124.482045},"mapZoom":5,"isMapVisible":true,"filterState":{"price":{"max":872627},"beds":{"min":1},"fore":{"value":false},"mp":{"max":3000},"auc":{"value":false},"nc":{"value":false},"fr":{"value":true},"fsbo":{"value":false},"cmsn":{"value":false},"fsba":{"value":false}},"isListVisible":true,"regionSelection":[{"regionId":9,"regionType":2}],"usersSearchTerm":"California","schoolId":null}'
        }
        res= self.fetch(url,params)
        self.parse(res.text)
if __name__== '__main__':
    scraper = ZillowScraper()
    scraper.run()
I tried to reach into the dictionaries and extract the content of each 'url' element so I could compare it with the http values I want, but I couldn't index into the results, although the type is a list and it contains 6 elements.
Edit: I tried to post the full code, but indentation problems occur even though I press <kbd>Ctrl</kbd>+<kbd>K</kbd>.
Traceback:
Traceback (most recent call last):
  File "C:\Users\Kubilay Kaan\Documents\Dev_Projects\python_projects\Python_Sketchs\zillow_class_oriented.py", line 89, in <module>
    scraper.run()
  File "C:\Users\Kubilay Kaan\Documents\Dev_Projects\python_projects\Python_Sketchs\zillow_class_oriented.py", line 85, in run
    self.parse(res.text)
  File "C:\Users\Kubilay Kaan\Documents\Dev_Projects\python_projects\Python_Sketchs\zillow_class_oriented.py", line 73, in parse
    print(type(self.results[0]))
IndexError: list index out of range
</details>
# Answer 1
**Score**: 0
It looks like you're trying to access dictionary values within a list, but there may be a problem with how you index it. Based on your code, if you want to access the 'url' value of each dictionary stored in self.results, loop through self.results.
Below is an example of how you can access 'url' in each dictionary and modify it if it does not contain 'http':
```python
for result in self.results:  # 'results' is a list; each 'result' is a dict
    if "http" not in result['url']:
        result['url'] = f"https://www.zillow.com{result['url']}"
```
This loop goes through each dictionary in self.results and checks whether 'http' is part of the 'url' value. If not, it prepends "https://www.zillow.com" to the 'url'. Note that this modifies self.results in place, so the original values are overwritten. If you need the original data, make a copy before you change anything. I think you may have mistakenly used curly braces to access the key of the dict, as in self.results[0]{'url'}. Use square brackets instead: self.results[0]['url'].
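As a rough sketch of the "make a copy first" suggestion, the loop above could be wrapped in a helper that works on a deep copy. It also shows one way to guard positional access: self.results[0] raises "IndexError: list index out of range" whenever it is evaluated while the list is still empty. The names with_absolute_urls and first_url are hypothetical helpers, not part of the original code, and the sample data is taken from the output posted in the question.

```python
import copy

BASE_URL = "https://www.zillow.com"  # same prefix as in the loop above


def with_absolute_urls(results, base=BASE_URL):
    """Return a deep copy of `results` with relative 'url' values made absolute."""
    fixed = copy.deepcopy(results)  # leave the caller's list untouched
    for result in fixed:
        if "http" not in result["url"]:
            result["url"] = f"{base}{result['url']}"
    return fixed


def first_url(results):
    """Return the first 'url', or None when the list is still empty."""
    if not results:  # results[0] on an empty list raises IndexError
        return None
    return results[0]["url"]


# Two entries copied from the output shown in the question.
results = [
    {"url": "/apartments/san-francisco-ca/923-folsom/5Yy6Np/"},
    {"url": "https://www.zillow.com/apartments/san-francisco-ca/the-brady/C4XYzK/"},
]
fixed = with_absolute_urls(results)
print(first_url(fixed))  # absolute URL of the first listing
print(first_url([]))     # None instead of an IndexError
```

Because the helper returns a new list, the original relative paths in results stay available for comparison.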
You'll need to make sure the base and the path are joined with exactly one '/'. With the base "https://www.zillow.com" used above, a path that starts with '/' joins cleanly. A '//' would appear only if the base itself ended with '/', and a path without a leading '/' would be glued directly onto the domain name.
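One way to avoid reasoning about slashes by hand is urljoin from the standard library's urllib.parse module, which resolves a path against a base URL and leaves an already absolute URL unchanged. The sketch below assumes the base "https://www.zillow.com/" and uses a hypothetical helper name, absolute_url. The sample paths come from the output in the question.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.zillow.com/"  # assumed base; trailing slash is fine for urljoin


def absolute_url(url, base=BASE_URL):
    """Resolve `url` against `base`; absolute URLs are returned unchanged."""
    return urljoin(base, url)


relative = absolute_url("/apartments/san-francisco-ca/edgewater/5XjVQc/")
no_leading_slash = absolute_url("apartments/san-francisco-ca/edgewater/5XjVQc/")
already_absolute = absolute_url(
    "https://www.zillow.com/b/747-geary-street-oakland-ca-CYzGVt/"
)

print(relative)
print(no_leading_slash)
print(already_absolute)
```

In this sketch, the first two calls produce the same URL with a single slash after the domain, and the third is passed through as-is.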
I also sometimes use the .get() method to read items from a JSON object when a key may be missing. Unlike square-bracket indexing, .get() returns None (or a default you pass in) instead of raising KeyError. Example:

    car = {
        "brand": "Ford",
        "model": "Mustang",
        "year": 1964,
        "price": 10234
    }
    x = car.get("price")
    print(x)  # prints 10234, an int
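Applied to the nested JSON-LD data in the question, the same idea might look like the sketch below. The function name extract_listing and the sample values are hypothetical. The key names 'geo', 'latitude', 'floorSize', 'value' and 'url' are the ones the question's code reads, while 'longitude' is an assumed key.

```python
def extract_listing(script_json):
    """Build a result dict, using None for any field that is missing."""
    # `or {}` also covers the case where the key exists but holds None.
    geo = script_json.get("geo") or {}
    floor_size = script_json.get("floorSize") or {}
    return {
        "latitude": geo.get("latitude"),
        "longitude": geo.get("longitude"),  # assumed key name
        "floor_size": floor_size.get("value"),
        "url": script_json.get("url"),
    }


# Hypothetical sample data for illustration only.
full = {
    "url": "/apartments/san-francisco-ca/edgewater/5XjVQc/",
    "geo": {"latitude": 37.78, "longitude": -122.4},
    "floorSize": {"value": 650},
}
partial = {"url": "/apartments/san-francisco-ca/edgewater/5XjVQc/"}

print(extract_listing(full))
print(extract_listing(partial))  # missing fields come back as None
```

With this shape every result has the same keys, so a try/except KeyError around the append is no longer needed.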
Also, I've noticed you've struggled to format the code nicely. You can click the curly braces button in the Stack Overflow editor to format code as a code block (4 spaces are added before each line, which tells Stack Overflow to render everything as a code block for nicer display).