July 12, 2023, 23:30:36

My web scraping program is pulling information that seemingly doesn't exist on the page it's crawling, and I can't figure out why

Question

from bs4 import BeautifulSoup as bs
import requests
url = "https://www.aia.org/firm-directory?filter%5Bcountry%5D=UNITED%20STATES&filter%5Bstate%5D=FL&page%5Bnumber%5D="
firmName = []
for i in range(1):
    page = requests.get(url + str(24))
    print(url + str(24))
    soup = bs(page.content, "html.parser")
    info = soup.find('table', class_='data-table')
    for b in soup.find('tbody').find_all('b'):
        firmName.append(b.get_text())
    print("Firm names done.")
    print(firmName)

I'm attempting to pull a list of AIA Firms located in Florida. This is a snippet from a slightly larger piece of code, but it still has the problem I'm encountering. Normally I have the range set to cover all 24 pages of AIA Firms located in Florida, but for this example I have it set to just loop once and crawl the last page (page 24). Typically I have 'i' instead of '24' in the page variable.

When I run this code, I am getting the correct URL, but it is pulling information that seemingly doesn't exist on the page.

https://www.aia.org/firm-directory?filter%5Bcountry%5D=UNITED%20STATES&filter%5Bstate%5D=FL&page%5Bnumber%5D=24

This is the page I'm pulling from and this is the list of firm names I'm getting:

['Architect David P. Godwin', 'Architect ECW LLC', 'Architect Roseland, P.L.', 'Architect Stergas & Associates', 'Architectonic Inc', 'Architects Design Collaborative LLC', 'Architects Design Group Inc', 'Architects Design Group, Inc.', 'Architects Design Group, Inc.', 'Architects International Inc.', 'Architects Stergas & Associates', 'Architects Unlimited', 'Architects: Lewis + Whitlock, PA', 'Architectura Group', 'Architectural Partnership, Inc.', 'Architectural Studio, Inc.', 'Architecture Artistica Inc.', 'Architecture by Design', 'Architecture Joyce Owens', 'Architecture Joyce Owens LLC', 'ArchitectureWorks, LLC', 'Architeknics, Inc.', 'Arcticstar Design, Inc.', 'Arcwerks, Incorporated', 'ARK1TEK', 'Arkidesign, Inc.', 'Aron Temkin, Architect', 'Arquitectonica', 'Arquitectonica International', 'Art Castellanos, Architect', "Artisan's Architecture", 'Asbacher Architecture', 'Aspen Group', 'Atelier AEC, Inc.', 'ATELIER305, LLC', 'Atkins', 'Atkins North America Inc dba Faithful+Gould', 'Atlantic-AE, LLC', 'Atlas Safety & Security Design Inc.', 'Aude Smith Architecture', 'Austin Fox Architecture', 'AW Architects', 'B. Anderson Strait Architect', 'B1 Architects LLC', 'Bacon Group, Inc.', 'Baker Barrios Architects', 'Baker Barrios Architects', 'Banov Architects, PA', 'Barnett Fronczak Barlowe & Shuler Architects', 'Barr Architectural Studio, Inc.']

I've checked all over the page, but can't find these names anywhere. I also previously encountered a problem where my code suddenly seemed to not work after making no changes. The error was related to my page variable, so I believe it was unable to crawl the page. I'm guessing I'm not crawling the page I think I'm crawling, but I don't exactly understand how to check that.

I'm losing my mind here because I just ran it again as I was typing this, this time printing the 'info' variable to see exactly what my program was pulling, and it was the correct information. The code didn't change except for a 'print(info)' line, and now it's pulling a completely different set of information. It's the right information, but unless I figure out why it's randomly pulling different things, this is a useless learning project for myself.

I've tried pulling from specific pages rather than looping through all 24. I've tried checking the information I'm pulling. I've tried adding a 5 second delay between page crawls in case there was some sort of rate limit.

I guess at this point, I'm trying to figure out why my program seemingly pulls random information? Is it related to the site I'm trying to crawl? Is there some sort of limit that I'm hitting? Should I add a 10 second timer between each page crawl? Truly at a loss here.
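
One way to check what a script actually received is to count the elements it expects in the raw HTML before parsing any further. The sketch below is only an illustration under stated assumptions: it uses the standard-library HTML parser instead of BeautifulSoup so that it runs without extra packages, the names TbodyBoldCounter and count_firm_rows are hypothetical, and the two sample strings are made-up stand-ins rather than real AIA markup.

```python
from html.parser import HTMLParser


class TbodyBoldCounter(HTMLParser):
    """Count <b> tags that appear inside a <tbody> element."""

    def __init__(self):
        super().__init__()
        self.tbody_depth = 0  # greater than 0 while the parser is inside a <tbody>
        self.bold_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.tbody_depth += 1
        elif tag == "b" and self.tbody_depth > 0:
            self.bold_count += 1

    def handle_endtag(self, tag):
        if tag == "tbody" and self.tbody_depth > 0:
            self.tbody_depth -= 1


def count_firm_rows(html):
    """Return how many <b> tags inside <tbody> the raw HTML contains."""
    counter = TbodyBoldCounter()
    counter.feed(html)
    return counter.bold_count


# Hypothetical samples: one page whose rows are already in the HTML, and one
# shell page whose table would only be filled in later by JavaScript.
static_page = "<table><tbody><tr><td><b>Firm A</b></td></tr></tbody></table>"
js_shell = "<table class='data-table'><tbody></tbody></table><script></script>"

print(count_firm_rows(static_page))
print(count_firm_rows(js_shell))
```

In the real script, the same check could be run on page.text, alongside printing page.status_code and page.url, to see whether the response is the page that was expected and whether it contains any rows at all.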

Answer 1

Score: 1

I'm not sure where you are getting that data from; when I run the script in your question, I get an exception.

The data you see on the page is loaded from an external URL (via JavaScript). So, to load the information about the firms, you can simulate this request:

import requests
import pandas as pd

# JSON endpoint that the firm-directory page queries via JavaScript
api_url = 'https://api.aia.org/firm-directory'
params = {
    "filter[country]": "UNITED STATES",
    "filter[state]": "FL",
    "page[number]": "1",
    "page[size]": "50",
    "q": "",
    "sort[criteria]": "firm_name",
    "sort[order]": "asc",
}
all_data = []
# The loop variable is params["page[number]"], so each pass updates the page number in place
for params["page[number]"] in range(1, 3):    # <-- increase number of pages here
    data = requests.get(api_url, params=params).json()
    # Each record keeps its fields under the 'attributes' key
    all_data.extend(d['attributes'] for d in data['data'])
df = pd.DataFrame(all_data)
print(df)

Prints:

                                                      firm_name                                  address_line_1 address_line_2                city state        country         zip                          firm_url
0                                            (allegedly) design                      3015 W Santiago St , Apt 2                              Tampa    FL  UNITED STATES  33629-8189           www.allegedlydesign.com
1                                                 2+ Architects                               260 Andalusia Ave                       Coral Gables    FL  UNITED STATES  33134-5902          www.2plus-architects.com
2                                          A BOHEME Design, LLC                   PO BOX 611328, 31 Main Street                     Rosemary Beach    FL  UNITED STATES  32461-1002             www.abohemedesign.com
3                                                 A Calist, LLC                                  6872 Caviro Ln                      Boynton Beach    FL  UNITED STATES  33437-3700                              None
4                                      A.T. Franco & Associates                                  500 Se 11Th Ct                    Fort Lauderdale    FL  UNITED STATES  33316-1146                  www.atfranco.com
5                                         AB Design Group, Inc.                       1441 N Ronald Reagan Blvd                           Longwood    FL  UNITED STATES  32750-3404             www.abdesigngroup.com
6                                         ACAI Associates, Inc.               2937 W Cypress Creek Rd , Ste 200                    Fort Lauderdale    FL  UNITED STATES  33309-1761                 www.acaiworld.com
7                                         Acme Architects, Inc.                                  3575 Linden Ln                              Miami    FL  UNITED STATES  33133-5614                       www.acme.ac
...and so on.
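
The loop above uses a hard-coded page range. A possible variation, sketched here under assumptions, keeps requesting pages until one comes back with no records. That the API returns an empty 'data' list past the last page is an assumption and has not been verified against the real service; collect_all_pages and fake_fetch are hypothetical names, and fake_fetch stands in for the real request so that the sketch runs offline.

```python
def collect_all_pages(fetch_page, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... until a page has no records."""
    records = []
    for page_number in range(1, max_pages + 1):
        payload = fetch_page(page_number)
        rows = payload.get("data", [])
        if not rows:  # assumed end-of-results signal (not verified)
            break
        records.extend(row["attributes"] for row in rows)
    return records


def fake_fetch(page_number):
    """Stand-in for the real request: two pages of data, then empty pages."""
    pages = {
        1: {"data": [{"attributes": {"firm_name": "Firm A"}}]},
        2: {"data": [{"attributes": {"firm_name": "Firm B"}}]},
    }
    return pages.get(page_number, {"data": []})


firms = collect_all_pages(fake_fetch)
print([firm["firm_name"] for firm in firms])
```

With the script above, fetch_page could be a small function that copies params, sets "page[number]" to the requested page, and returns requests.get(api_url, params=...).json(). The max_pages cap is there so the loop still stops if the empty-page assumption turns out to be wrong.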
