I keep getting an IndexError that is affecting my web-scraping script

Question

Hi there, I keep getting this IndexError in my code when web scraping and I am not quite sure how to solve it, so any help would be much appreciated. Below are the error message and a code sample; the traceback shows which lines of the script are affected.

Traceback (most recent call last):
    results = getPageResults(postcode, page)
  in getPageResults
    address1.append(a.select("div.govuk-body")[0].text)
IndexError: list index out of range
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
from time import sleep
from random import randint

name = []
address1 = []
address2 = []

def getPageResults(postcode, page):
    url = 'https://www.ukrlp.co.uk/ukrlp/ukrlp_provider.page_pls_searchProviders'
    url += '?pv_pc=' + postcode
    url += '&pn_pc_d=5'  # set up distance here: 0, 1, 5 or 10 miles
    url += '&pn_pageNo=' + str(page)
    url += '&pv_layout=SEARCH'

    page = urlopen(url)
    html = page.read().decode("utf-8")
    soup = BeautifulSoup(html, "html.parser")

    results = False

    for a in soup.findAll("div", attrs={"class": "govuk-grid-row"}):
        result = a.select("h2")
        if len(result) > 0:
            name.append(a.select("h2")[0].text)

        address1.append(a.select("div.govuk-body")[0].text)
        address2.append(a.select("div.govuk-body")[0].text)
        results = True

    return results

print("__________________________")

postcodes = [
    # "BB0", "BB1", "BB10", "BB11", "BB12", "BB18", "BB2", "BB3", "BB4", "BB5", "BB6", "BB7",
    # "BB8", "BB9", "BB94", "BD23", "BL0", "BL6", "BL7", "BL8", "BL9", "BN1", "BN10", "BN2", "BN20", "BN21", "BN22", "BN23", "BN24", "BN25", "BN26", "BN27"
    # "BN3",
    # "BN4", "BN41", "BN42", "BN45", "BN50", "BN51", "BN52", "BN6", "BN7", "BN8", "BN88", "BN9", "BR1", "BR3"
    # "DE1", "DE11", "DE12", "DE13", "DE14", "DE15", "DE2", "DE21", "DE22", "DE23", "DE24",
    # "DE3", "DE4", "DE45", "DE5", "DE55", "DE56",
    # "DE6", "DE65", "DE7", "DE72", "DE73", "DE74", "DE75", "DE99", "DN10", "DN11", "DN21", "DN22", "DN9"
    # "FY0", "FY1", "FY2", "FY3", "FY4", "FY5", "FY6", "FY7", "FY8"
    # "HA0", "HA1",
    # "HA3", "HA7", "HA8", "HA9", "HU1", "HU11", "HU12", "HU13", "HU2", "HU3", "HU4", "HU5", "HU6", "HU7", "HU8", "HU9",
    "L31",
    # "L33", "L37", "L39", "L40", "LA1", "LA2", "LA3", "LA4", "LA5", "LA6", "LA7", "LE12", "LE14", "LE6", "LE65", "LN1", "LN6"
    # "N10", "N11", "N13", "N15", "N17", "N18", "N2", "N22", "N4", "N6", "N8", "N81", "NG1", "NG10", "NG11", "NG12", "NG13", "NG14", "NG15",
    # "NG16", "NG17", "NG18", "NG19", "NG2", "NG20", "NG21", "NG22", "NG23", "NG24", "NG25", "NG3", "NG4", "NG5", "NG6", "NG7", "NG70", "NG8",
    # "NG80", "NG9", "NG90", "NW10", "NW2", "NW26", "NW6", "NW8", "NW9", "OL12", "OL13", "OL14", "PR0", "PR1", "PR11", "PR2", "PR25", "PR26",
    # "PR3", "PR4", "PR5", "PR6", "PR7", "PR8", "PR9", "RH15", "RH16", "RH17", "RH18", "RH19",
    # "S1", "S11", "S12", "S17", "S18", "S19", "S21", "S26", "S30", "S31", "S32", "S33", "S40", "S41", "S42", "S43", "S44", "S45", "S49", "S8",
    # "S80", "S81", "SE10", "SE12", "SE13", "SE14", "SE15", "SE16", "SE23", "SE26", "SE3", "SE4", "SE6", "SE8", "SE9", "SK12", "SK13", "SK14",
    # "SK17", "SK22", "SK23", "ST14", "TN18", "TN19", "TN2", "TN20", "TN21", "TN22", "TN3", "TN31", "TN32", "TN33", "TN34", "TN35", "TN36",
    # "TN37", "TN38", "TN39", "TN40", "TN5", "TN6", "TN7", "TN8", "UB6", "W10", "W9", "WA11", "WN5", "WN6", "WN8"
]

for postcode in postcodes:
    print(postcode)

    page = 1

    while True:
        results = getPageResults(postcode, page)
        sleep(randint(1, 3))

        if results == False:
            break
        page += 1
        print(page)

serve = pd.DataFrame({
    "name": name,
    "address1": address1,
    "address2": address2
})

df = pd.DataFrame(columns=["name", "address1", "address2"])

df = df.append(serve)

df.to_excel("Five_miles_9.xlsx", index=False)

Answer 1

Score: 1

When there are no elements on a page you are not breaking out of the loop, and that is why you get an IndexError at `a.select("div.govuk-body")[0]` when there are no elements, since `result = a.select("h2")` returns an empty list:

for a in soup.findAll("div", attrs={"class": "govuk-grid-row"}):
    result = a.select("h2")
    if len(result) > 0:
        # Only append when length > 0
        name.append(result[0].text)
        address1.append(a.select("div.govuk-body")[0].text)
        address2.append(a.select("div.govuk-body")[0].text)
        results = True
    else:
        # Since there are no more results, break out
        results = False
        break
return results

Also, below I have updated the logic to break out of the pagination loop:

for postcode in postcodes:
    print(postcode)

    page = 1
    while True:
        print(page)
        results = getPageResults(postcode, page)
        sleep(randint(1, 3))
        page += 1

        if not results:
            break

Try this updated code; your Excel file should be generated:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
from time import sleep
from random import randint

name = []
address1 = []
address2 = []


def getPageResults(postcode, page):
    url = 'https://www.ukrlp.co.uk/ukrlp/ukrlp_provider.page_pls_searchProviders'
    url += '?pv_pc=' + postcode
    url += '&pn_pc_d=5'  # set up distance here: 0, 1, 5 or 10 miles
    url += '&pn_pageNo=' + str(page)
    url += '&pv_layout=SEARCH'

    page = urlopen(url)
    html = page.read().decode("utf-8")
    soup = BeautifulSoup(html, "html.parser")

    results = False

    for a in soup.findAll("div", attrs={"class": "govuk-grid-row"}):
        result = a.select("h2")
        if len(result) > 0:
            name.append(result[0].text)
            address1.append(a.select("div.govuk-body")[0].text)
            address2.append(a.select("div.govuk-body")[0].text)
            results = True
        else:
            results = False
            break
    return results


print("__________________________")

postcodes = [
    # "BB0", "BB1", "BB10", "BB11", "BB12", "BB18", "BB2", "BB3", "BB4", "BB5", "BB6", "BB7",
    # "BB8", "BB9", "BB94", "BD23", "BL0", "BL6", "BL7", "BL8", "BL9", "BN1", "BN10", "BN2", "BN20", "BN21", "BN22", "BN23", "BN24", "BN25", "BN26", "BN27"
    # "BN3",
    # "BN4", "BN41", "BN42", "BN45", "BN50", "BN51", "BN52", "BN6", "BN7", "BN8", "BN88", "BN9", "BR1", "BR3"
    # "DE1", "DE11", "DE12", "DE13", "DE14", "DE15", "DE2", "DE21", "DE22", "DE23", "DE24",
    # "DE3", "DE4", "DE45", "DE5", "DE55", "DE56",
    # "DE6", "DE65", "DE7", "DE72", "DE73", "DE74", "DE75", "DE99", "DN10", "DN11", "DN21", "DN22", "DN9"
    # "FY0", "FY1", "FY2", "FY3", "FY4", "FY5", "FY6", "FY7", "FY8"
    # "HA0", "HA1",
    # "HA3", "HA7", "HA8", "HA9", "HU1", "HU11", "HU12", "HU13", "HU2", "HU3", "HU4", "HU5", "HU6", "HU7", "HU8", "HU9",
    "L31",
    # "L33", "L37", "L39", "L40", "LA1", "LA2", "LA3", "LA4", "LA5", "LA6", "LA7", "LE12", "LE14", "LE6", "LE65", "LN1", "LN6"
    # "N10", "N11", "N13", "N15", "N17", "N18", "N2", "N22", "N4", "N6", "N8", "N81", "NG1", "NG10", "NG11", "NG12", "NG13", "NG14", "NG15",
    # "NG16", "NG17", "NG18", "NG19", "NG2", "NG20", "NG21", "NG22", "NG23", "NG24", "NG25", "NG3", "NG4", "NG5", "NG6", "NG7", "NG70", "NG8",
    # "NG80", "NG9", "NG90", "NW10", "NW2", "NW26", "NW6", "NW8", "NW9", "OL12", "OL13", "OL14", "PR0", "PR1", "PR11", "PR2", "PR25", "PR26",
    # "PR3", "PR4", "PR5", "PR6", "PR7", "PR8", "PR9", "RH15", "RH16", "RH17", "RH18", "RH19",
    # "S1", "S11", "S12", "S17", "S18", "S19", "S21", "S26", "S30", "S31", "S32", "S33", "S40", "S41", "S42", "S43", "S44", "S45", "S49", "S8",
    # "S80", "S81", "SE10", "SE12", "SE13", "SE14", "SE15", "SE16", "SE23", "SE26", "SE3", "SE4", "SE6", "SE8", "SE9", "SK12", "SK13", "SK14",
    # "SK17", "SK22", "SK23", "ST14", "TN18", "TN19", "TN2", "TN20", "TN21", "TN22", "TN3", "TN31", "TN32", "TN33", "TN34", "TN35", "TN36",
    # "TN37", "TN38", "TN39", "TN40", "TN5", "TN6", "TN7", "TN8", "UB6", "W10", "W9", "WA11", "WN5", "WN6", "WN8"
]

for postcode in postcodes:
    print(postcode)

    page = 1
    while True:
        print(page)
        results = getPageResults(postcode, page)
        sleep(randint(1, 3))
        page += 1

        if not results:
            break

serve = pd.DataFrame({
    "name": name,
    "address1": address1,
    "address2": address2
})

df = pd.DataFrame(columns=["name", "address1", "address2"])

df = pd.concat([df, serve])

df.to_excel("Five_miles_9.xlsx", index=False)
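
One detail the updated code keeps from the original: `address1` and `address2` both copy the text of the same first `div.govuk-body` element. If each result row actually renders two separate address lines (an assumption; the live UKRLP markup is not quoted in this thread), a slightly more defensive version of the parsing loop could look like the sketch below, using `select_one`, which returns `None` instead of raising.

# A defensive variant of the row-parsing loop (a sketch, not verified
# against the live UKRLP markup). It assumes each result row may contain
# an <h2> and up to two div.govuk-body elements.
for a in soup.findAll("div", attrs={"class": "govuk-grid-row"}):
    heading = a.select_one("h2")  # select_one returns None rather than raising
    if heading is None:
        results = False
        break  # no more result rows on this page

    bodies = a.select("div.govuk-body")  # may be empty or hold several divs
    name.append(heading.text.strip())
    # Take the first two address lines when present, otherwise pad with ""
    # so the three lists stay the same length for the final DataFrame.
    address1.append(bodies[0].text.strip() if len(bodies) > 0 else "")
    address2.append(bodies[1].text.strip() if len(bodies) > 1 else "")
    results = True

Padding with empty strings keeps the three lists the same length, which the final pd.DataFrame(...) call requires.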

Answer 2

Score: 0

It seems that on the website you have as `url` there is no `govuk-body` under `govuk-grid-row`; `govuk-body` is in a different div that is also a child of `govuk-width-container`. Perhaps what you meant to scrape is `govuk-body-1`. Since the selector can't find anything, `a.select("div.govuk-body")` returns an empty list, as @DarkKnight says in the comment above.
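
If that guess about the class name is right (it is an assumption here, since the page markup is not quoted in the thread), the fix is confined to the selector, for example:

# Hypothetical: use this only if the address text really lives in
# div.govuk-body-1 rather than div.govuk-body on the result page.
bodies = a.select("div.govuk-body-1")
if bodies:  # guard against an empty selection
    address1.append(bodies[0].text)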

To make this easier to debug in the future, make sure the selection doesn't have a length of zero before indexing into it. You can do this by adding `assert len(a.select("div.govuk-body")) != 0`: the assertion will fail before anything is appended, telling you about the problem before the next step executes.
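
In context, the check would sit just before the appends; a minimal sketch using the variable names from the question:

for a in soup.findAll("div", attrs={"class": "govuk-grid-row"}):
    selection = a.select("div.govuk-body")
    # Fail fast with a readable message instead of a bare IndexError later on.
    assert len(selection) != 0, "no div.govuk-body inside this govuk-grid-row"
    address1.append(selection[0].text)
    address2.append(selection[0].text)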
