I keep getting an IndexError that is affecting my web scraping script

Question
Hi there, I keep getting an IndexError in my web scraping script and I am not sure how to solve it, so any help would be very much appreciated. Below are the error message, the lines the traceback points to, and the full code sample.
Traceback (most recent call last):
results = getPageResults(postcode, page)
in getPageResults
address1.append(a.select("div.govuk-body")[0].text)
IndexError: list index out of range
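For reference, here is a minimal snippet that reproduces the same error in isolation: select() returns an empty list when nothing matches, and indexing [0] into an empty list raises exactly this error.

from bs4 import BeautifulSoup

# A grid row that has an <h2> but no div.govuk-body, like some rows on the results page
row = BeautifulSoup('<div class="govuk-grid-row"><h2>Provider</h2></div>', "html.parser")

matches = row.select("div.govuk-body")
print(matches)     # [] because select() found nothing
print(matches[0])  # IndexError: list index out of range

The full script is below.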
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
from time import sleep
from random import randint

name = []
address1 = []
address2 = []

def getPageResults(postcode, page):
    url = 'https://www.ukrlp.co.uk/ukrlp/ukrlp_provider.page_pls_searchProviders'
    url += '?pv_pc=' + postcode
    url += '&pn_pc_d=5'  # set up distance here: 0, 1, 5 or 10 miles
    url += '&pn_pageNo=' + str(page)
    url += '&pv_layout=SEARCH'

    page = urlopen(url)
    html = page.read().decode("utf-8")
    soup = BeautifulSoup(html, "html.parser")

    results = False
    for a in soup.findAll("div", attrs={"class": "govuk-grid-row"}):
        result = a.select("h2")
        if len(result) > 0:
            name.append(a.select("h2")[0].text)
            address1.append(a.select("div.govuk-body")[0].text)
            address2.append(a.select("div.govuk-body")[0].text)
            results = True
    return results

print("__________________________")
postcodes = [ # "BB0", "BB1", "BB10", "BB11", "BB12", "BB18", "BB2", "BB3", "BB4", "BB5", "BB6", "BB7",
# "BB8", "BB9", "BB94", "BD23", "BL0", "BL6","BL7", "BL8", "BL9", "BN1", "BN10", "BN2", "BN20", "BN21", "BN22", "BN23", "BN24", "BN25", "BN26", "BN27"
# "BN3",
# "BN4", "BN41", "BN42", "BN45", "BN50", "BN51", "BN52", "BN6", "BN7", "BN8", "BN88", "BN9", "BR1", "BR3"
# "DE1", "DE11", "DE12", "DE13", "DE14", "DE15", "DE2", "DE21", "DE22", "DE23", "DE24",
# "DE3", "DE4", "DE45", "DE5", "DE55", "DE56",
# "DE6", "DE65", "DE7", "DE72", "DE73", "DE74", "DE75", "DE99", "DN10", "DN11","DN21", "DN22", "DN9"
# "FY0", "FY1", "FY2", "FY3", "FY4", "FY5", "FY6", "FY7", "FY8"
# "HA0", "HA1",
#"HA3", "HA7", #"HA8", "HA9", "HU1", "HU11", "HU12", "HU13", "HU2", "HU3", "HU4", "HU5", "HU6", "HU7", "HU8", "HU9",
"L31", #"L33", "L37", "L39", "L40", "LA1", "LA2", "LA3", "LA4", "LA5", "LA6", "LA7", "LE12", "LE14", "LE6", "LE65", "LN1", "LN6"
# "N10", "N11", "N13", "N15", "N17", "N18", "N2", "N22", "N4", "N6", "N8", "N81", "NG1", "NG10", "NG11", "NG12", "NG13", "NG14", "NG15",
# "NG16", "NG17", "NG18", "NG19", "NG2", "NG20", "NG21", "NG22", "NG23", "NG24", "NG25", "NG3", "NG4", "NG5", "NG6", "NG7", "NG70", "NG8",
# "NG80", "NG9", "NG90", "NW10", "NW2", "NW26", "NW6", "NW8", "NW9", "OL12", "OL13", "OL14", "PR0", "PR1", "PR11", "PR2", "PR25", "PR26",
# "PR3", "PR4", "PR5", "PR6", "PR7", "PR8", "PR9", "RH15", "RH16", "RH17", "RH18", "RH19",
# "S1", "S11", "S12", "S17", "S18", "S19", "S21", "S26", "S30", "S31", "S32", "S33", "S40", "S41", "S42", "S43", "S44", "S45", "S49", "S8",
# "S80", "S81", "SE10", "SE12", "SE13", "SE14", "SE15", "SE16", "SE23", "SE26", "SE3", "SE4", "SE6", "SE8", "SE9", "SK12", "SK13", "SK14",
# "SK17", "SK22", "SK23", "ST14", "TN18", "TN19", "TN2", "TN20", "TN21", "TN22", "TN3", "TN31", "TN32", "TN33", "TN34", "TN35", "TN36",
# "TN37", "TN38", "TN39", "TN40", "TN5", "TN6", "TN7", "TN8", "UB6", "W10", "W9", "WA11", "WN5", "WN6", "WN8"
]
for postcode in postcodes:
    print(postcode)
    page = 1
    while True:
        results = getPageResults(postcode, page)
        sleep(randint(1, 3))
        if results == False:
            break
        page += 1
        print(page)

serve = pd.DataFrame({
    "name": name,
    "address1": address1,
    "address2": address2
})
df = pd.DataFrame(columns=["name", "address1", "address2"])
df = df.append(serve)
df.to_excel("Five_miles_9.xlsx", index=False)
Answer 1
Score: 1
When there are no elements left on a page you are not breaking out of the loop, and that is why you are getting the IndexError: a.select("div.govuk-body")[0] fails because select() returns an empty list when nothing matches, just as result = a.select("h2") does on an empty row.
for a in soup.findAll("div", attrs={"class": "govuk-grid-row"}):
    result = a.select("h2")
    if len(result) > 0:
        # Only append when length > 0
        name.append(result[0].text)
        address1.append(a.select("div.govuk-body")[0].text)
        address2.append(a.select("div.govuk-body")[0].text)
        results = True
    else:
        # No more results, so break out of the loop
        results = False
        break
return results
I have also updated the break-out logic in the paging loop below:
for postcode in postcodes:
    print(postcode)
    page = 1
    while True:
        print(page)
        results = getPageResults(postcode, page)
        sleep(randint(1, 3))
        page += 1
        if not results:
            break
Try this updated code and your Excel file should be generated:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
from time import sleep
from random import randint

name = []
address1 = []
address2 = []

def getPageResults(postcode, page):
    url = 'https://www.ukrlp.co.uk/ukrlp/ukrlp_provider.page_pls_searchProviders'
    url += '?pv_pc=' + postcode
    url += '&pn_pc_d=5'  # set up distance here: 0, 1, 5 or 10 miles
    url += '&pn_pageNo=' + str(page)
    url += '&pv_layout=SEARCH'

    page = urlopen(url)
    html = page.read().decode("utf-8")
    soup = BeautifulSoup(html, "html.parser")

    results = False
    for a in soup.findAll("div", attrs={"class": "govuk-grid-row"}):
        result = a.select("h2")
        if len(result) > 0:
            name.append(result[0].text)
            address1.append(a.select("div.govuk-body")[0].text)
            address2.append(a.select("div.govuk-body")[0].text)
            results = True
        else:
            results = False
            break
    return results

print("__________________________")
postcodes = [ # "BB0", "BB1", "BB10", "BB11", "BB12", "BB18", "BB2", "BB3", "BB4", "BB5", "BB6", "BB7",
# "BB8", "BB9", "BB94", "BD23", "BL0", "BL6","BL7", "BL8", "BL9", "BN1", "BN10", "BN2", "BN20", "BN21", "BN22", "BN23", "BN24", "BN25", "BN26", "BN27"
# "BN3",
# "BN4", "BN41", "BN42", "BN45", "BN50", "BN51", "BN52", "BN6", "BN7", "BN8", "BN88", "BN9", "BR1", "BR3"
# "DE1", "DE11", "DE12", "DE13", "DE14", "DE15", "DE2", "DE21", "DE22", "DE23", "DE24",
# "DE3", "DE4", "DE45", "DE5", "DE55", "DE56",
# "DE6", "DE65", "DE7", "DE72", "DE73", "DE74", "DE75", "DE99", "DN10", "DN11","DN21", "DN22", "DN9"
# "FY0", "FY1", "FY2", "FY3", "FY4", "FY5", "FY6", "FY7", "FY8"
# "HA0", "HA1",
# "HA3", "HA7", #"HA8", "HA9", "HU1", "HU11", "HU12", "HU13", "HU2", "HU3", "HU4", "HU5", "HU6", "HU7", "HU8", "HU9",
"L31",
# "L33", "L37", "L39", "L40", "LA1", "LA2", "LA3", "LA4", "LA5", "LA6", "LA7", "LE12", "LE14", "LE6", "LE65", "LN1", "LN6"
# "N10", "N11", "N13", "N15", "N17", "N18", "N2", "N22", "N4", "N6", "N8", "N81", "NG1", "NG10", "NG11", "NG12", "NG13", "NG14", "NG15",
# "NG16", "NG17", "NG18", "NG19", "NG2", "NG20", "NG21", "NG22", "NG23", "NG24", "NG25", "NG3", "NG4", "NG5", "NG6", "NG7", "NG70", "NG8",
# "NG80", "NG9", "NG90", "NW10", "NW2", "NW26", "NW6", "NW8", "NW9", "OL12", "OL13", "OL14", "PR0", "PR1", "PR11", "PR2", "PR25", "PR26",
# "PR3", "PR4", "PR5", "PR6", "PR7", "PR8", "PR9", "RH15", "RH16", "RH17", "RH18", "RH19",
# "S1", "S11", "S12", "S17", "S18", "S19", "S21", "S26", "S30", "S31", "S32", "S33", "S40", "S41", "S42", "S43", "S44", "S45", "S49", "S8",
# "S80", "S81", "SE10", "SE12", "SE13", "SE14", "SE15", "SE16", "SE23", "SE26", "SE3", "SE4", "SE6", "SE8", "SE9", "SK12", "SK13", "SK14",
# "SK17", "SK22", "SK23", "ST14", "TN18", "TN19", "TN2", "TN20", "TN21", "TN22", "TN3", "TN31", "TN32", "TN33", "TN34", "TN35", "TN36",
# "TN37", "TN38", "TN39", "TN40", "TN5", "TN6", "TN7", "TN8", "UB6", "W10", "W9", "WA11", "WN5", "WN6", "WN8"
]
for postcode in postcodes:
    print(postcode)
    page = 1
    while True:
        print(page)
        results = getPageResults(postcode, page)
        sleep(randint(1, 3))
        page += 1
        if not results:
            break

serve = pd.DataFrame({
    "name": name,
    "address1": address1,
    "address2": address2
})
df = pd.DataFrame(columns=["name", "address1", "address2"])
df = pd.concat([df, serve])
df.to_excel("Five_miles_9.xlsx", index=False)
Answer 2
Score: 0
It seems that within the website you have as the url, there is no govuk-body under govuk-grid-row. govuk-body is in a different div that is also a child of govuk-width-container. Perhaps what you meant to scrape is govuk-body-1. Since it can't find anything, a.select("div.govuk-body") returns an empty list, as @DarkKnight says in the comment above.
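One way to check that claim against the live page (a debugging sketch, not from the original answer) is to print the CSS classes that actually appear inside each grid row:

for a in soup.findAll("div", attrs={"class": "govuk-grid-row"}):
    classes = {c for tag in a.find_all(True) for c in tag.get("class", [])}
    print(classes)  # shows whether govuk-body (or govuk-body-1) really sits inside this row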
To make this easier to debug in the future, make sure the selection is not empty before indexing into it. You can do this by adding: assert len(a.select("div.govuk-body")) != 0. The assertion fails before anything is appended, pointing you at the problem row before the next step executes.