Python: cannot save all links to a dict in a JSON file, only the last one
Question
I wrote this code in Python to get all the links and put them into a JSON file, but for some reason I am only getting the last link (see the website and class in the code). Any ideas why it is not working properly?
```python
import requests
from bs4 import BeautifulSoup
import json

headers = {
    "Accept": "*/*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0Safari/537.36"
}

number = 0
for page_number in range(1, 2):
    url = f"https://www.sothebysrealty.com/eng/associates/int/{page_number}-pg"
    req = requests.get(url, headers=headers)
    src = req.text
    soup = BeautifulSoup(src, "lxml")
    name_link = soup.find_all("a", class_="Entities-card__cta btn u-text-uppercase u-color-sir-blue palm--hide")

    all_links_dict = {}
    for item in name_link:
        value_links = ("https://www.sothebysrealty.com" + item.get("href"))

    all_links_dict[number + 1] = value_links

    with open("all_links_dict.json", "w", encoding="utf-8-sig") as file:
        json.dump(all_links_dict, file, indent=4, ensure_ascii=False)
```
Answer 1
Score: 2
This is because `all_links_dict[number + 1] = value_links` is not inside your `for item in name_link` loop, so you only add to the dict once. You must also increment `number` in the loop:
```python
for item in name_link:
    value_links = ("https://www.sothebysrealty.com" + item.get("href"))
    all_links_dict[number] = value_links
    number += 1
```
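As a side note, the same 1-based numbering can be written without a manual counter by using `enumerate` with `start=1`. This sketch uses placeholder hrefs standing in for the values parsed from the page:

```python
# Placeholder hrefs standing in for item.get("href") values parsed from the page.
hrefs = ["/eng/associate/1", "/eng/associate/2", "/eng/associate/3"]

all_links_dict = {}
for number, href in enumerate(hrefs, start=1):
    all_links_dict[number] = "https://www.sothebysrealty.com" + href

print(all_links_dict[1])  # https://www.sothebysrealty.com/eng/associate/1
```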
Answer 2
Score: 1
There are a few things I notice here.

Firstly, your page numbers come from `range(1, 2)`. In Python the stop value is not included in the range, so the for loop will only run once, with a page number of 1.

Secondly, your `all_links_dict = {}` line resets the dictionary to an empty dict on every iteration.

Lastly, you are opening the file on each iteration of the loop in `'w'` mode and then JSON-dumping, which overwrites any previous contents.

I would advise adjusting your range, moving the dictionary initialisation out of the for loop, and dumping the dictionary to your file once at the end, outside the for loop.
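Putting those three changes together, the loop might look like the sketch below. `fetch_links` is a hypothetical stand-in for the `requests.get` + BeautifulSoup parsing step, since the real page range and markup are site-specific:

```python
import json

def fetch_links(page_number):
    # Hypothetical stand-in for requests.get + BeautifulSoup parsing;
    # in the real script this would return the hrefs found on the page.
    return [f"/eng/associate/{page_number}-{i}" for i in range(2)]

all_links_dict = {}  # initialised once, outside the page loop
number = 0
for page_number in range(1, 4):  # range stop is exclusive: pages 1, 2, 3
    for href in fetch_links(page_number):
        number += 1
        all_links_dict[number] = "https://www.sothebysrealty.com" + href

# dump once, after the loop, so nothing is overwritten
with open("all_links_dict.json", "w", encoding="utf-8-sig") as file:
    json.dump(all_links_dict, file, indent=4, ensure_ascii=False)
```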
Answer 3
Score: 0
There are several issues:
> ```py
> all_links_dict = {}
> for item in name_link:
>     value_links = ("https://www.sothebysrealty.com" + item.get("href"))
>
> all_links_dict[number + 1] = value_links
> ```
You're not updating `number` at any point, so only a value keyed to `1` gets saved on every loop. Either use some derivation of `page_number`, which updates itself on each iteration, or add a line to increment `number` and bring the assignment inside the inner loop:
```python
all_links_dict = {}
for item in name_link:
    value_links = ("https://www.sothebysrealty.com" + item.get("href"))
    number += 1
    all_links_dict[number] = value_links
```
> ```py
> with open("all_links_dict.json", "w", encoding="utf-8-sig") as file:
>     json.dump(all_links_dict, file, indent=4, ensure_ascii=False)
> ```
You should use `mode="a"` instead of `"w"` to append instead of overwriting on each iteration. However, you should be aware that the file will no longer be valid JSON (i.e., you can't decode it any more) after a second iteration. It might be better to have a list that you append to every time, and then write the list to JSON after (or at the end of) the loop.
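To see why append mode breaks the file, note that two concatenated JSON objects are no longer one valid JSON document:

```python
import json

first = json.dumps({"1": "link-a"})
second = json.dumps({"2": "link-b"})
appended = first + second  # what two dumps in mode="a" would leave in the file

try:
    json.loads(appended)
    decoded = True
except json.JSONDecodeError:
    decoded = False

print(decoded)  # False: the concatenation cannot be parsed as JSON
```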
There's also the fact that `for page_number in range(1, 2):` only leads to one iteration (where `page_number` is 1), so even with all of this, only one page's info will be saved unless the range is expanded to include more pages.
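The range behaviour is easy to check interactively; the stop value is never produced:

```python
print(list(range(1, 2)))  # [1] — a single iteration
print(list(range(1, 4)))  # [1, 2, 3] — pages 1 through 3
```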