Python: cannot save all links to a dict in a JSON file, only the last one

Question

I wrote this code in Python to get all the links and put them into a JSON file, but for some reason I am only getting the last link (see the website and CSS class in the code). Any ideas why it is not working properly?

import requests
from bs4 import BeautifulSoup
import json

headers = {
    "Accept": "*/*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0Safari/537.36"
}

number = 0

for page_number in range(1, 2):
    url = f"https://www.sothebysrealty.com/eng/associates/int/{page_number}-pg"
    req = requests.get(url, headers=headers)
    src = req.text
    soup = BeautifulSoup(src, "lxml")
    name_link = soup.find_all("a", class_="Entities-card__cta btn u-text-uppercase u-color-sir-blue palm--hide")

    all_links_dict = {}
    for item in name_link:
        value_links = ("https://www.sothebysrealty.com" + item.get("href"))

    all_links_dict[number + 1] = value_links

    with open("all_links_dict.json", "w", encoding="utf-8-sig") as file:
        json.dump(all_links_dict, file, indent=4, ensure_ascii=False)

Answer 1

Score: 2

This is because all_links_dict[number + 1] = value_links is not inside your for item in name_link loop, so you only add to the dict once.

You must also increment number in the loop:

for item in name_link:
    value_links = ("https://www.sothebysrealty.com" + item.get("href"))
    all_links_dict[number] = value_links
    number += 1

Answer 2

Score: 1

There are a few things I notice here.

Firstly, your page numbers are range(1, 2). In Python the stop value is not included in the range, so the for loop will only run once, with a page number of 1.

Secondly, your all_links_dict = {} line resets the dictionary to an empty dict on every iteration.

Lastly, you are opening the file in 'w' mode on each iteration of the loop and then JSON-dumping, which overwrites any previous contents.

I would advise adjusting your range, moving the dictionary initialisation out of the for loop, and dumping the dictionary to the file once at the end, outside the for loop; a sketch follows below.
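
A minimal sketch of that restructuring, assuming the same URL pattern and CSS class as in the question (the range(1, 4) stop value is only an illustrative choice):

    import requests
    from bs4 import BeautifulSoup
    import json

    headers = {
        "Accept": "*/*",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
    }

    all_links_dict = {}  # initialised once, outside the loop
    number = 0

    for page_number in range(1, 4):  # the stop is exclusive, so this covers pages 1-3
        url = f"https://www.sothebysrealty.com/eng/associates/int/{page_number}-pg"
        req = requests.get(url, headers=headers)
        soup = BeautifulSoup(req.text, "lxml")
        name_link = soup.find_all("a", class_="Entities-card__cta btn u-text-uppercase u-color-sir-blue palm--hide")
        for item in name_link:
            number += 1  # one key per link, across all pages
            all_links_dict[number] = "https://www.sothebysrealty.com" + item.get("href")

    # written once, after all pages have been scraped
    with open("all_links_dict.json", "w", encoding="utf-8-sig") as file:
        json.dump(all_links_dict, file, indent=4, ensure_ascii=False)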


Answer 3

Score: 0

There are several issues.

First, consider this part of your code:

    all_links_dict = {}
    for item in name_link:
        value_links = ("https://www.sothebysrealty.com" + item.get("href"))

    all_links_dict[number + 1] = value_links

You're not updating number at any point, so only a value keyed to 1 gets saved on every loop. Either use some derivation of page_number, which updates itself on each iteration, or add a line to increment number and bring the assignment inside the inner loop:

    all_links_dict = {}
    for item in name_link:
        value_links = ("https://www.sothebysrealty.com" + item.get("href"))
        number += 1
        all_links_dict[number] = value_links

Next, the file write:

    with open("all_links_dict.json", "w", encoding="utf-8-sig") as file:
        json.dump(all_links_dict, file, indent=4, ensure_ascii=False)

You should use mode="a" instead of "w" to append rather than overwrite on each iteration. However, be aware that after a second iteration the file will no longer be valid JSON (i.e., you can't decode it any more). It might be better to have a list that you append to every time and then write the list to JSON after (or at the end of) the loop, as in the sketch below.

There's also the fact that for page_number in range(1, 2): will only lead to one iteration (where page_number is 1), so even with all these changes, only one page's info will be saved unless the range is expanded to include more pages.
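
A minimal sketch of that list-based alternative (the name all_links is a hypothetical choice, range(1, 4) is only illustrative, and the headers dict is shortened here):

    import requests
    from bs4 import BeautifulSoup
    import json

    headers = {"User-Agent": "Mozilla/5.0"}  # reuse the full headers from the question
    all_links = []  # accumulate every link across all pages

    for page_number in range(1, 4):  # widen the range to cover more pages
        url = f"https://www.sothebysrealty.com/eng/associates/int/{page_number}-pg"
        req = requests.get(url, headers=headers)
        soup = BeautifulSoup(req.text, "lxml")
        for item in soup.find_all("a", class_="Entities-card__cta btn u-text-uppercase u-color-sir-blue palm--hide"):
            all_links.append("https://www.sothebysrealty.com" + item.get("href"))

    # a single dump at the end keeps the file valid JSON
    with open("all_links.json", "w", encoding="utf-8-sig") as file:
        json.dump(all_links, file, indent=4, ensure_ascii=False)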

