My for loop is not iterating through a list of URLs, only executing for the first item
Question
I'm very much a beginner and, after banging my head against the wall, am asking for any help at all with this. I want to scrape a list of URLs, but my for loop only returns the first item in the list.
I have a list of URLs and a function that scrapes the JSON data into a dictionary; I then convert the dictionary to a DataFrame and export it to CSV. Everything is working except the for loop, so only the first URL in the list gets scraped:
url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
                'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
                'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

for url in url_list_str:
    url = url_list_str[0]
    response = req.get(url, headers = headers)
    pause(5)
    html = BeautifulSoup(response.content, 'html.parser')
    data = foodpanda_data(html)
    restaurant_name = data['Name']
    df = pd.DataFrame([data])
foodpanda_data() is a function above the for loop which scrapes the JSON and turns it into a dictionary. Here's a preview because it's pretty long:
def foodpanda_data(html):
    script_tag = html.find("script", {"data-testid": "restaurant-seo-schema"})
    json_text = script_tag.string
    json_dict = json.loads(json_text)

    extracted_data = {}
    keys_to_extract = ["Name", "streetAddress", "addressLocality", "postalCode", "latitude", "longitude", "url", "ratingValue", "ratingCount", "bestRating", "worstRating", "servesCuisine", "priceRange"]

    for key in keys_to_extract:
        if key.lower() == 'name':
            extracted_data[key] = json_dict.get('name', '')  # ... etc.

    return extracted_data
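One thing worth guarding against in this pattern (a defensive sketch, not part of the original function): html.find() returns None when no matching tag exists, so script_tag.string would raise an AttributeError on a page that lacks that script. The parsing step at the top of the function could be protected like this:

    script_tag = html.find("script", {"data-testid": "restaurant-seo-schema"})
    if script_tag is None or script_tag.string is None:
        # no embedded schema on this page; return an empty record instead of crashing
        return {}
    json_dict = json.loads(script_tag.string)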
I also tried writing the for loop as:
for u in range(len(url_list_str)):
    url = url_list_str[u]
but that didn't work either. There must be something really obvious here that I'm not getting, so thank you!
Answer 1
Score: 1
Because in every iteration you're picking the first URL from the list here (url = url_list_str[0]). Simply remove it:
url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
                'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
                'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

for url in url_list_str:
    response = req.get(url, headers = headers)
    pause(5)
    html = BeautifulSoup(response.content, 'html.parser')
    data = foodpanda_data(html)
    restaurant_name = data['Name']
    df = pd.DataFrame([data])
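Note that df is still reassigned on every pass, so after the loop it only holds the last restaurant. If the end goal is one CSV with every restaurant, one way to get there (a sketch that reuses req, headers, pause and foodpanda_data from the question; the output filename is just an example) is to collect one row per URL and combine them after the loop:

frames = []
for url in url_list_str:
    response = req.get(url, headers=headers)
    pause(5)
    html = BeautifulSoup(response.content, 'html.parser')
    data = foodpanda_data(html)
    frames.append(pd.DataFrame([data]))  # one single-row frame per restaurant

df = pd.concat(frames, ignore_index=True)  # stack the rows into one table
df.to_csv('restaurants.csv', index=False)  # example output filename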
Answer 2
Score: 0
I guess you're trying to do something like this:
import json
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


def foodpanda_data(html):
    script_tag = html.find("script", {"data-testid": "restaurant-seo-schema"})
    json_text = script_tag.string
    json_dict = json.loads(json_text)
    extracted_data = {
        "name": json_dict['name'],
        "streetAddress": json_dict['address']['streetAddress'],
        "addressLocality": json_dict['address']['addressLocality'],
        "postalCode": json_dict['address']['postalCode'],
        "latitude": json_dict['geo']['latitude'],
        "longitude": json_dict['geo']['longitude'],
        "url": json_dict['url'],
        "ratingValue": json_dict['aggregateRating']['ratingValue'],
        "ratingCount": json_dict['aggregateRating']['ratingCount'],
        "bestRating": json_dict['aggregateRating']['bestRating'],
        "worstRating": json_dict['aggregateRating']['worstRating'],
        "servesCuisine": json_dict['servesCuisine'],
        "priceRange": json_dict['priceRange']
    }
    return extracted_data


url_list_str = ['https://www.foodpanda.ph/restaurant/vh2d/sicilian-roast-legaspi-village',
                'https://www.foodpanda.ph/restaurant/ns76/tokyo-milk-cheese-factory-greenbelt-5',
                'https://www.foodpanda.ph/restaurant/hksd/paul-greenbelt-5']

all_data = []
for url in url_list_str:
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    html = BeautifulSoup(response.content, 'html.parser')
    data = foodpanda_data(html)
    all_data.append(data)
    time.sleep(1)

df = pd.DataFrame(all_data)
print(df.head())
Output:
name streetAddress addressLocality postalCode latitude longitude url ratingValue ratingCount bestRating worstRating servesCuisine priceRange
0 Sicilian Roast - Legaspi Village 100 Don Carlos Palanca corner Dela Rosa Street... Makati City 1229 14.556083 121.019540 https://www.foodpanda.ph/restaurant/vh2d/sicil... 4.4 29 5 1 [Italian, Pizza, Pasta] ₱₱
1 Tokyo Milk Cheese Factory - Greenbelt 5 2nd Floor Greenbelt 5 Legazpi Street Legazpi V... Makati City 1229 14.553329 121.022054 https://www.foodpanda.ph/restaurant/ns76/tokyo... 5.0 58 5 1 [Desserts, Fast Food, Snacks, Cakes] ₱₱₱
2 PAUL - Greenbelt 5 Ground Floor Greenbelt 5 Legazpi Street Barang... Makati City 1223 14.552704 121.020531 https://www.foodpanda.ph/restaurant/hksd/paul-... 4.7 12 5 1 [Sandwiches, American, Western, Bread] ₱₱
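From there, getting the CSV mentioned in the question is one more line; the filename is just an example:

df.to_csv('foodpanda_restaurants.csv', index=False)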