My for loop is not iterating through a list of URLs, only executing for the first item
Question
I'm very much a beginner and, after banging my head against the wall, am asking for any help at all with this. I want to scrape a list of URLs, but my for loop only returns the first item in the list.
I have a list of URLs and a function that scrapes the JSON data into a dictionary; I then convert the dictionary to a DataFrame and export it to CSV. Everything is working except the for loop, so only the first URL in the list gets scraped:
url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
                'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
                'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

for url in url_list_str:
    url = url_list_str[0]
    response = req.get(url, headers = headers)
    pause(5)
    html = BeautifulSoup(response.content, 'html.parser')
    data = foodpanda_data(html)
    restaurant_name = data['Name']
    df = pd.DataFrame([data])
foodpanda_data() is a function above the for loop which scrapes the JSON and turns it into a dictionary. Here's a preview because it's pretty long:
def foodpanda_data(html):
    script_tag = html.find("script", {"data-testid": "restaurant-seo-schema"})
    json_text = script_tag.string
    json_dict = json.loads(json_text)

    extracted_data = {}
    keys_to_extract = ["Name", "streetAddress", "addressLocality", "postalCode", "latitude", "longitude", "url", "ratingValue", "ratingCount", "bestRating", "worstRating", "servesCuisine", "priceRange"]

    for key in keys_to_extract:
        if key.lower() == 'name':
            extracted_data[key] = json_dict.get('name', '')  # ... etc.

    return extracted_data
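One thing worth guarding against in this pattern (a defensive sketch, not part of the original function): html.find() returns None when no matching tag exists, so script_tag.string would raise an AttributeError on a page that lacks that script. The parsing step at the top of the function could be protected like this:

    script_tag = html.find("script", {"data-testid": "restaurant-seo-schema"})
    if script_tag is None or script_tag.string is None:
        # no embedded schema on this page; return an empty record instead of crashing
        return {}
    json_dict = json.loads(script_tag.string)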
I also tried writing the for loop as:
for u in range(len(url_list_str)):
    url = url_list_str[u]
but that didn't work either. There must be something really obvious here that I'm not getting, so thank you!
Answer 1
Score: 1
Because in every iteration you're picking the first URL from the list here (url = url_list_str[0]). Simply remove it:
url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
                'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
                'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

for url in url_list_str:
    response = req.get(url, headers = headers)
    pause(5)
    html = BeautifulSoup(response.content, 'html.parser')
    data = foodpanda_data(html)
    restaurant_name = data['Name']
    df = pd.DataFrame([data])
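Note that df is still reassigned on every pass, so after the loop it only holds the last restaurant. If the end goal is one CSV with every restaurant, one way to get there (a sketch that reuses req, headers, pause and foodpanda_data from the question; the output filename is just an example) is to collect one row per URL and combine them after the loop:

frames = []
for url in url_list_str:
    response = req.get(url, headers=headers)
    pause(5)
    html = BeautifulSoup(response.content, 'html.parser')
    data = foodpanda_data(html)
    frames.append(pd.DataFrame([data]))  # one single-row frame per restaurant

df = pd.concat(frames, ignore_index=True)  # stack the rows into one table
df.to_csv('restaurants.csv', index=False)  # example output filename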
Answer 2
Score: 0
I guess you're trying to do something like this:
import json
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


def foodpanda_data(html):
    script_tag = html.find("script", {"data-testid": "restaurant-seo-schema"})
    json_text = script_tag.string
    json_dict = json.loads(json_text)
    extracted_data = {
        "name": json_dict['name'],
        "streetAddress": json_dict['address']['streetAddress'],
        "addressLocality": json_dict['address']['addressLocality'],
        "postalCode": json_dict['address']['postalCode'],
        "latitude": json_dict['geo']['latitude'],
        "longitude": json_dict['geo']['longitude'],
        "url": json_dict['url'],
        "ratingValue": json_dict['aggregateRating']['ratingValue'],
        "ratingCount": json_dict['aggregateRating']['ratingCount'],
        "bestRating": json_dict['aggregateRating']['bestRating'],
        "worstRating": json_dict['aggregateRating']['worstRating'],
        "servesCuisine": json_dict['servesCuisine'],
        "priceRange": json_dict['priceRange']
    }
    return extracted_data


url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
                'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
                'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

all_data = []
for url in url_list_str:
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    html = BeautifulSoup(response.content, 'html.parser')
    data = foodpanda_data(html)
    all_data.append(data)
    time.sleep(1)

df = pd.DataFrame(all_data)
print(df.head())
Output:
name streetAddress addressLocality postalCode latitude longitude url ratingValue ratingCount bestRating worstRating servesCuisine priceRange
0 Sicilian Roast - Legaspi Village 100 Don Carlos Palanca corner Dela Rosa Street... Makati City 1229 14.556083 121.019540 https://www.foodpanda.ph/restaurant/vh2d/sicil... 4.4 29 5 1 [Italian, Pizza, Pasta] ₱₱
1 Tokyo Milk Cheese Factory - Greenbelt 5 2nd Floor Greenbelt 5 Legazpi Street Legazpi V... Makati City 1229 14.553329 121.022054 https://www.foodpanda.ph/restaurant/ns76/tokyo... 5.0 58 5 1 [Desserts, Fast Food, Snacks, Cakes] ₱₱₱
2 PAUL - Greenbelt 5 Ground Floor Greenbelt 5 Legazpi Street Barang... Makati City 1223 14.552704 121.020531 https://www.foodpanda.ph/restaurant/hksd/paul-... 4.7 12 5 1 [Sandwiches, American, Western, Bread] ₱₱
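From there, getting the CSV mentioned in the question is one more line; the filename is just an example:

df.to_csv('foodpanda_restaurants.csv', index=False)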