Python: cannot save all links to a dict in a JSON file, only the last one
Question
I wrote this code in Python to get all the links and put them into a JSON file, but for some reason I am only getting the last link (see the website and class in the code). Any ideas why it is not working properly?
```python
import requests
from bs4 import BeautifulSoup
import json

headers = {
    "Accept": "*/*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0Safari/537.36"
}

number = 0
for page_number in range(1, 2):
    url = f"https://www.sothebysrealty.com/eng/associates/int/{page_number}-pg"
    req = requests.get(url, headers=headers)
    src = req.text
    soup = BeautifulSoup(src, "lxml")
    name_link = soup.find_all("a", class_="Entities-card__cta btn u-text-uppercase u-color-sir-blue palm--hide")

    all_links_dict = {}
    for item in name_link:
        value_links = ("https://www.sothebysrealty.com" + item.get("href"))

    all_links_dict[number + 1] = value_links

    with open("all_links_dict.json", "w", encoding="utf-8-sig") as file:
        json.dump(all_links_dict, file, indent=4, ensure_ascii=False)
```
Answer 1
Score: 2
This is because `all_links_dict[number + 1] = value_links` is not inside your `for item in name_link` loop, so you only add to the dict once. You must also increment `number` in the loop:
```python
for item in name_link:
    value_links = ("https://www.sothebysrealty.com" + item.get("href"))
    all_links_dict[number] = value_links
    number += 1
```
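As a side note, the same 1-based numbering can be written without a manual counter by using `enumerate` with `start=1`. This sketch uses placeholder hrefs standing in for the values parsed from the page:

```python
# Placeholder hrefs standing in for item.get("href") values parsed from the page.
hrefs = ["/eng/associate/1", "/eng/associate/2", "/eng/associate/3"]

all_links_dict = {}
for number, href in enumerate(hrefs, start=1):
    all_links_dict[number] = "https://www.sothebysrealty.com" + href

print(all_links_dict[1])  # https://www.sothebysrealty.com/eng/associate/1
```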
Answer 2
Score: 1
There are a few things I notice here.

Firstly, your page numbers come from `range(1, 2)`. In Python the stop value is not included in the range, so the for loop will only run once, with a page number of 1.

Secondly, your `all_links_dict = {}` line resets the dictionary to an empty dict on every iteration.

Lastly, you are opening the file on each iteration of the loop in `'w'` mode and then JSON-dumping, which overwrites any previous contents.

I would advise adjusting your range, moving the dictionary initialisation out of the for loop, and dumping the dictionary to your file once at the end, outside the for loop.
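Putting those three changes together, the loop might look like the sketch below. `fetch_links` is a hypothetical stand-in for the `requests.get` + BeautifulSoup parsing step, since the real page range and markup are site-specific:

```python
import json

def fetch_links(page_number):
    # Hypothetical stand-in for requests.get + BeautifulSoup parsing;
    # in the real script this would return the hrefs found on the page.
    return [f"/eng/associate/{page_number}-{i}" for i in range(2)]

all_links_dict = {}  # initialised once, outside the page loop
number = 0
for page_number in range(1, 4):  # range stop is exclusive: pages 1, 2, 3
    for href in fetch_links(page_number):
        number += 1
        all_links_dict[number] = "https://www.sothebysrealty.com" + href

# dump once, after the loop, so nothing is overwritten
with open("all_links_dict.json", "w", encoding="utf-8-sig") as file:
    json.dump(all_links_dict, file, indent=4, ensure_ascii=False)
```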
Answer 3
Score: 0
There are several issues:
> ```py
> all_links_dict = {}
> for item in name_link:
>     value_links = ("https://www.sothebysrealty.com" + item.get("href"))
>
> all_links_dict[number + 1] = value_links
> ```
You're not updating `number` at any point, so only a value keyed to `1` gets saved on every loop. Either use some derivation of `page_number`, which updates itself on each iteration, or add a line to increment `number` and bring the assignment inside the inner loop:
```python
all_links_dict = {}
for item in name_link:
    value_links = ("https://www.sothebysrealty.com" + item.get("href"))
    number += 1
    all_links_dict[number] = value_links
```
> ```py
> with open("all_links_dict.json", "w", encoding="utf-8-sig") as file:
>     json.dump(all_links_dict, file, indent=4, ensure_ascii=False)
> ```
You should use `mode="a"` instead of `"w"` to append instead of overwriting on each iteration. However, you should be aware that the file will no longer be valid JSON (i.e., you can't decode it any more) after a second iteration. It might be better to have a list that you append to every time, and then write the list to JSON after (or at the end of) the loop.
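To see why append mode breaks the file, note that two concatenated JSON objects are no longer one valid JSON document:

```python
import json

first = json.dumps({"1": "link-a"})
second = json.dumps({"2": "link-b"})
appended = first + second  # what two dumps in mode="a" would leave in the file

try:
    json.loads(appended)
    decoded = True
except json.JSONDecodeError:
    decoded = False

print(decoded)  # False: the concatenation cannot be parsed as JSON
```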
There's also the fact that `for page_number in range(1, 2):` only leads to one iteration (where `page_number` is 1), so even with all of this, only one page's info will be saved unless the range is expanded to include more pages.
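The range behaviour is easy to check interactively; the stop value is never produced:

```python
print(list(range(1, 2)))  # [1] — a single iteration
print(list(range(1, 4)))  # [1, 2, 3] — pages 1 through 3
```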