My web scraping program is pulling information that seemingly doesn't exist on the page it's crawling, and I can't figure out why
Question
    from bs4 import BeautifulSoup as bs
    import requests

    url = "https://www.aia.org/firm-directory?filter%5Bcountry%5D=UNITED%20STATES&filter%5Bstate%5D=FL&page%5Bnumber%5D="
    firmName = []

    for i in range(1):
        page = requests.get(url + str(24))
        print(url + str(24))
        soup = bs(page.content, "html.parser")
        info = soup.find('table', class_='data-table')

        for b in soup.find('tbody').find_all('b'):
            firmName.append(b.get_text())
        print("Frim names done.")

    print(firmName)
I'm attempting to pull a list of AIA Firms located in Florida. This is a snippet from a slightly larger piece of code, but it still has the problem I'm encountering. Normally I have the range set to cover all 24 pages of AIA Firms located in Florida, but for this example I have it set to just loop once and crawl the last page (page 24). Typically I have 'i' instead of '24' in the page variable.
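As a side note on the page loop described above, here is a minimal sketch of building the 24 page URLs. The helper name `page_url` is hypothetical, and the parameter names are copied from the URL in the question. It also shows one pitfall to rule out: if the loop runs over `range(24)`, the page numbers are 0 through 23 rather than 1 through 24.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.aia.org/firm-directory"


def page_url(page_number, state="FL"):
    """Build the directory URL for one results page (hypothetical helper)."""
    params = {
        "filter[country]": "UNITED STATES",
        "filter[state]": state,
        "page[number]": page_number,
    }
    # quote_via=quote encodes spaces as %20 instead of +,
    # which matches the hand-written URL in the question.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


# range(1, 25) yields 1..24; range(24) would yield 0..23 instead.
urls = [page_url(n) for n in range(1, 25)]
print(urls[-1])
```

The last URL printed should be the same string as `url + str(24)` in the snippet above, so this changes how the URLs are built, not which page is requested.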
When I run this code, I am getting the correct URL, but it is pulling information that seemingly doesn't exist on the page.
This is the page I'm pulling from and this is the list of firm names I'm getting:
['Architect David P. Godwin', 'Architect ECW LLC', 'Architect Roseland, P.L.', 'Architect Stergas & Associates', 'Architectonic Inc', 'Architects Design Collaborative LLC', 'Architects Design Group Inc', 'Architects Design Group, Inc.', 'Architects Design Group, Inc.', 'Architects International Inc.', 'Architects Stergas & Associates', 'Architects Unlimited', 'Architects: Lewis + Whitlock, PA', 'Architectura Group', 'Architectural Partnership, Inc.', 'Architectural Studio, Inc.', 'Architecture Artistica Inc.', 'Architecture by Design', 'Architecture Joyce Owens', 'Architecture Joyce Owens LLC', 'ArchitectureWorks, LLC', 'Architeknics, Inc.', 'Arcticstar Design, Inc.', 'Arcwerks, Incorporated', 'ARK1TEK', 'Arkidesign, Inc.', 'Aron Temkin, Architect', 'Arquitectonica', 'Arquitectonica International', 'Art Castellanos, Architect', "Artisan's Architecture", 'Asbacher Architecture', 'Aspen Group', 'Atelier AEC, Inc.', 'ATELIER305, LLC', 'Atkins', 'Atkins North America Inc dba Faithful+Gould', 'Atlantic-AE, LLC', 'Atlas Safety & Security Design Inc.', 'Aude Smith Architecture', 'Austin Fox Architecture', 'AW Architects', 'B. Anderson Strait Architect', 'B1 Architects LLC', 'Bacon Group, Inc.', 'Baker Barrios Architects', 'Baker Barrios Architects', 'Banov Architects, PA', 'Barnett Fronczak Barlowe & Shuler Architects', 'Barr Architectural Studio, Inc.']
I've checked all over the page, but can't find these names anywhere. I also previously encountered a problem where my code suddenly seemed to not work after making no changes. The error was related to my page variable, so I believe it was unable to crawl the page. I'm guessing I'm not crawling the page I think I'm crawling, but I don't exactly understand how to check that.
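One possible way to check which page was actually crawled is to look at the response itself before parsing it. The sketch below is only an illustration: `inspect_response` and `TagCounter` are hypothetical helpers, and the sample object is an offline stand-in that carries the same `status_code`, `url`, and `text` attributes as a `requests.Response`.

```python
from html.parser import HTMLParser
from types import SimpleNamespace


class TagCounter(HTMLParser):
    """Count how many times each tag is opened in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def inspect_response(response, expected_text):
    """Summarise what the scraper really received (hypothetical helper)."""
    counter = TagCounter()
    counter.feed(response.text)
    return {
        "status_code": response.status_code,
        "final_url": response.url,  # differs from the requested URL after a redirect
        "html_length": len(response.text),
        "tbody_tags": counter.counts.get("tbody", 0),
        "b_tags": counter.counts.get("b", 0),
        "contains_expected": expected_text in response.text,
    }


# Offline stand-in for a page whose table is filled in later by JavaScript.
sample = SimpleNamespace(
    status_code=200,
    url="https://example.com/firm-directory?page=24",
    text="<html><body><div id='app'></div><script src='app.js'></script></body></html>",
)
report = inspect_response(sample, "Some Firm Name")
print(report)
```

With a live request, the same function could be called as `inspect_response(requests.get(url), "a name you can see in the browser")`. If a name that is visible in the browser is missing from the raw HTML, the table is probably not part of the HTML that `requests` downloads.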
I'm losing my mind here: I just ran it again while typing this, this time printing the 'info' variable to see exactly what my program was pulling, and it was the correct information. The only change to the code was an added 'print(info)' line, yet it's now pulling a completely different set of information. It's the right information, but unless I figure out why it's randomly pulling different things, this is a useless learning project for me.
I've tried pulling from specific pages rather than looping through all 24. I've tried checking the information I'm pulling. I've tried adding a 5 second delay between page crawls in case there was some sort of rate limit.
I guess at this point, I'm trying to figure out why my program seemingly pulls random information? Is it related to the site I'm trying to crawl? Is there some sort of limit that I'm hitting? Should I add a 10 second timer between each page crawl? Truly at a loss here.
Answer 1
Score: 1
I'm not sure where you are getting that data from; when I run the script in your question, I get an exception.
The data you see on the page is loaded from an external URL (via JavaScript). So, to load the info about the companies, you can simulate this request:
    import requests
    import pandas as pd

    api_url = 'https://api.aia.org/firm-directory'

    # Query parameters the page sends to the API
    params = {
        "filter[country]": "UNITED STATES",
        "filter[state]": "FL",
        "page[number]": "1",
        "page[size]": "50",
        "q": "",
        "sort[criteria]": "firm_name",
        "sort[order]": "asc",
    }

    all_data = []
    # The loop variable is the dict entry itself, so each pass updates the page number.
    for params["page[number]"] in range(1, 3):  # <-- increase number of pages here
        data = requests.get(api_url, params=params).json()
        all_data.extend(d['attributes'] for d in data['data'])

    df = pd.DataFrame(all_data)
    print(df)
Prints:
   firm_name                 address_line_1             address_line_2    city             state  country        zip         firm_url
0  (allegedly) design        3015 W Santiago St         , Apt 2           Tampa            FL     UNITED STATES  33629-8189  www.allegedlydesign.com
1  2+ Architects             260 Andalusia Ave                            Coral Gables     FL     UNITED STATES  33134-5902  www.2plus-architects.com
2  A BOHEME Design, LLC      PO BOX 611328              , 31 Main Street  Rosemary Beach   FL     UNITED STATES  32461-1002  www.abohemedesign.com
3  A Calist, LLC             6872 Caviro Ln                               Boynton Beach    FL     UNITED STATES  33437-3700  None
4  A.T. Franco & Associates  500 Se 11Th Ct                               Fort Lauderdale  FL     UNITED STATES  33316-1146  www.atfranco.com
5  AB Design Group, Inc.     1441 N Ronald Reagan Blvd                    Longwood         FL     UNITED STATES  32750-3404  www.abdesigngroup.com
6  ACAI Associates, Inc.     2937 W Cypress Creek Rd    , Ste 200         Fort Lauderdale  FL     UNITED STATES  33309-1761  www.acaiworld.com
7  Acme Architects, Inc.     3575 Linden Ln                               Miami            FL     UNITED STATES  33133-5614  www.acme.ac
...and so on.
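If the number of pages is not known in advance, one option is to keep requesting pages until one comes back empty. The sketch below is a hedged illustration, not the API's documented behaviour: it assumes that a page past the end returns an empty `data` list, and `collect_attributes` and `fake_fetch` are hypothetical names. The fake pages use made-up firm names so the example runs offline.

```python
def collect_attributes(fetch_page, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... until a page has no records."""
    rows = []
    for page_number in range(1, max_pages + 1):
        payload = fetch_page(page_number)
        records = payload.get("data", [])
        if not records:  # assumption: an empty "data" list marks the end
            break
        rows.extend(record["attributes"] for record in records)
    return rows


# Offline stand-in for the API, using the same "data" / "attributes" layout.
fake_pages = {
    1: {"data": [{"attributes": {"firm_name": "Example Firm A"}},
                 {"attributes": {"firm_name": "Example Firm B"}}]},
    2: {"data": [{"attributes": {"firm_name": "Example Firm C"}}]},
}


def fake_fetch(page_number):
    """Return a canned payload; unknown pages come back empty."""
    return fake_pages.get(page_number, {"data": []})


rows = collect_attributes(fake_fetch)
print([row["firm_name"] for row in rows])
```

Against the real endpoint, `fetch_page` could be something like `lambda n: requests.get(api_url, params={**params, "page[number]": n}).json()`. The `max_pages` cap is there so the loop still stops if the empty-page assumption turns out to be wrong.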