
My for loop is not iterating through a list of URLs, only executing for the first item

Question

I'm very much a beginner and after banging my head against the wall, am asking for any help at all with this. I want to scrape a list of urls but my for loop is only returning the first item on the list.


I have a list of urls and a function that scrapes the json data into a dictionary; I then convert the dictionary to a dataframe and export it to csv. Everything is working except the for loop, so only the first url on the list gets scraped:


url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
 'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
 'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

for url in url_list_str:
  url = url_list_str[0]
  response = req.get(url, headers = headers)
  pause(5)
  html = BeautifulSoup(response.content, 'html.parser')
  data = foodpanda_data(html)
  restaurant_name = data['Name']
  df = pd.DataFrame([data])

foodpanda_data() is a function above the for loop which scrapes the json and turns it into a dictionary. Here's a preview because it's pretty long:


def foodpanda_data(html):
  script_tag = html.find("script", {"data-testid": "restaurant-seo-schema"})
  json_text = script_tag.string
  json_dict = json.loads(json_text)
  
  extracted_data = {}
  keys_to_extract = ["Name", "streetAddress", "addressLocality", "postalCode", "latitude", "longitude", "url", "ratingValue", "ratingCount", "bestRating", "worstRating", "servesCuisine", "priceRange"]
  for key in keys_to_extract:
    if key.lower() == 'name':
      extracted_data[key] = json_dict.get('name', '') #... etc.

  return extracted_data

I also tried writing the for loop as:


for u in range(len(url_list_str)):
  url = url_list_str[u]

but that didn't work either. There must be something really obvious here that I'm not getting so thank you!



Answer 1

Score: 1

Because in every iteration you're picking the first URL from the list (url = url_list_str[0]). Simply remove that line.

url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
             'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
             'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

for url in url_list_str:
    response = req.get(url, headers = headers)
    pause(5)
    html = BeautifulSoup(response.content, 'html.parser')
    data = foodpanda_data(html)
    restaurant_name = data['Name']
    df = pd.DataFrame([data])
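
With that line removed the loop visits every URL, but note that df = pd.DataFrame([data]) is still reassigned on every pass, so after the loop only the last restaurant's dataframe is left. Since the end goal in the question is one table for the whole list, a minimal sketch is to collect the dicts and build the dataframe once; this assumes req, headers, pause and foodpanda_data are already defined as in the question:

import pandas as pd
from bs4 import BeautifulSoup

rows = []                                                  # one dict per restaurant
for url in url_list_str:
    response = req.get(url, headers=headers)               # req / headers as in the question
    pause(5)                                               # the question's delay helper
    html = BeautifulSoup(response.content, 'html.parser')
    rows.append(foodpanda_data(html))                      # accumulate instead of overwriting

df = pd.DataFrame(rows)                                    # one row per restaurant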

Answer 2

Score: 0

I guess you're trying to do something like this:

import json
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

def foodpanda_data(html):
    script_tag = html.find("script", {"data-testid": "restaurant-seo-schema"})
    json_text = script_tag.string
    json_dict = json.loads(json_text)
    extracted_data = {
        "name": json_dict['name'],
        "streetAddress": json_dict['address']['streetAddress'],
        "addressLocality": json_dict['address']['addressLocality'],
        "postalCode": json_dict['address']['postalCode'],
        "latitude": json_dict['geo']['latitude'],
        "longitude": json_dict['geo']['longitude'],
        "url": json_dict['url'],
        "ratingValue": json_dict['aggregateRating']['ratingValue'],
        "ratingCount": json_dict['aggregateRating']['ratingCount'],
        "bestRating": json_dict['aggregateRating']['bestRating'],
        "worstRating": json_dict['aggregateRating']['worstRating'],
        "servesCuisine": json_dict['servesCuisine'],
        "priceRange": json_dict['priceRange']
    }
    return extracted_data

url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
                'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
                'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

all_data = []
for url in url_list_str:
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    html = BeautifulSoup(response.content, 'html.parser')
    data = foodpanda_data(html)
    all_data.append(data)
    time.sleep(1)

df = pd.DataFrame(all_data)
print(df.head())

Output:

                                         name                                      streetAddress addressLocality postalCode   latitude   longitude                                                url  ratingValue  ratingCount  bestRating  worstRating                           servesCuisine priceRange
    0         Sicilian Roast - Legaspi Village  100 Don Carlos Palanca corner Dela Rosa Street...     Makati City       1229  14.556083  121.019540  https://www.foodpanda.ph/restaurant/vh2d/sicil...          4.4           29           5            1                 [Italian, Pizza, Pasta]         ₱₱
    1  Tokyo Milk Cheese Factory - Greenbelt 5  2nd Floor Greenbelt 5 Legazpi Street Legazpi V...     Makati City       1229  14.553329  121.022054  https://www.foodpanda.ph/restaurant/ns76/tokyo...          5.0           58           5            1    [Desserts, Fast Food, Snacks, Cakes]        ₱₱₱
    2                       PAUL - Greenbelt 5  Ground Floor Greenbelt 5 Legazpi Street Barang...     Makati City       1223  14.552704  121.020531  https://www.foodpanda.ph/restaurant/hksd/paul-...          4.7           12           5            1  [Sandwiches, American, Western, Bread]         ₱₱
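
The question also mentions exporting the result to csv; assuming the standard pandas API, the combined dataframe built above can be written out in one call (the restaurants.csv filename is just an example):

df.to_csv('restaurants.csv', index=False)  # example filename; index=False drops the 0/1/2 row index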
