Getting data out of a website: using BS4 and requests keeps failing, need another method
Question
I am trying to scrape the data from the site https://www.startupblink.com with BeautifulSoup, Python, and requests.
from bs4 import BeautifulSoup
import requests

url = "https://www.startupblink.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Collect every <a> tag and print its href attribute
links = soup.find_all("a")
for link in links:
    print(link.get("href"))
This finds all the <a> tags on the page and prints the values of their href attributes. My data extraction requirement is the following: I want to get all the data out of the site.
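To make concrete what that loop collects, here is a minimal sketch that runs on an inline HTML snippet instead of the live site. The snippet, the constant names, and the helper extract_links are made up for illustration. It also resolves relative href values against the base URL with urllib.parse.urljoin and skips <a> tags that have no href. One assumption worth checking: if the site builds its pages with JavaScript, the HTML returned by requests may contain far fewer links than a browser shows.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.startupblink.com"

# Hypothetical HTML standing in for response.content
SAMPLE_HTML = """
<html><body>
  <a href="/startups">Startups</a>
  <a href="https://www.startupblink.com/blog">Blog</a>
  <a>No href here</a>
  <a href="/startups">Startups again</a>
</body></html>
"""


def extract_links(html, base_url):
    """Return the unique absolute URLs of all <a href=...> tags, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = []
    # href=True skips anchors that have no href attribute at all
    for tag in soup.find_all("a", href=True):
        absolute = urljoin(base_url, tag["href"])
        if absolute not in seen:
            seen.append(absolute)
    return seen


links = extract_links(SAMPLE_HTML, BASE_URL)
for link in links:
    print(link)
```

With the real page, the same helper could be called as extract_links(response.content, url).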
By the way, would it be even easier with pandas?
Using the pandas library can make scraping and processing the data easier. pandas provides data manipulation and analysis tools, including the read_html() function for reading HTML tables directly from a URL. Here are some of my ideas on how to use pandas to scrape data from https://www.startupblink.com:
import pandas as pd
import requests
Send a GET request to the URL we want to scrape and store the response in a variable:
url = "https://www.startupblink.com"
response = requests.get(url)
Next, read the HTML tables using pandas: the read_html() function parses the HTML and extracts the tables present on the page. It returns a list of DataFrame objects, one per table found. Since we are interested in every table on the page, we can pass response.content to read_html():
tables = pd.read_html(response.content)
Process and use the data: once we have the DataFrame objects representing the tables, we can process and analyze them using pandas' built-in functions and methods.
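A small sketch of the read_html() step, run on an inline table rather than the live site. The sample HTML and the helper name read_tables are invented for the example, and read_html() needs an HTML parser such as lxml installed. Two details worth knowing: recent pandas versions prefer a file-like object over a raw HTML string, hence the StringIO wrapper, and read_html() raises ValueError when the HTML contains no <table> elements, which may well be the case for a page whose data is loaded by JavaScript.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML standing in for the text of a real response
SAMPLE_HTML = """
<table>
  <tr><th>city</th><th>rank</th></tr>
  <tr><td>London</td><td>3</td></tr>
  <tr><td>Beijing</td><td>6</td></tr>
</table>
"""


def read_tables(html):
    """Return a list of DataFrames, or an empty list if no table is found."""
    try:
        return pd.read_html(StringIO(html))
    except ValueError:
        # Raised by pandas when the HTML has no <table> elements.
        # Note: this sketch does not distinguish other ValueErrors.
        return []


tables = read_tables(SAMPLE_HTML)
first = tables[0]
print(first)

# A page without any <table> yields an empty list instead of an exception
print(read_tables("<p>no tables here</p>"))
```

With a real response, response.text could be passed to read_tables() in the same way.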
What do you think about these different approaches?
Answer 1
Score: 1
Not sure what info you are after, but perhaps the API will work for you?
import pandas as pd
import requests


def get_data() -> pd.DataFrame:
    url = "https://www.startupblink.com/api/leaderboards?leaderboard_type=Cities&industry=leaderboard&year=2022"
    with requests.Session() as session:
        response = session.get(url, timeout=10)
        # Raise an HTTPError for 4xx/5xx responses instead of printing None
        response.raise_for_status()
        # Decode the JSON body returned by the API
        data = response.json()
    return pd.DataFrame(data=data)


print(get_data())
Output (first few rows):
   global_rank  prev_global_rank  national_rank  prev_national_rank  population  country_id    country_name  city_id                      display_name  quantity_score  quality_score  business_score  quality_factor1  quality_factor  change_national  change_global  change
0            1                 1              1                   1   9666055.0           1   United States        5  San Francisco Bay, United States          36.186        510.423           3.660         550.2690         550.269                0              0     new
1            2                 2              2                   2  21045000.0           1   United States       15           New York, United States          18.339        195.003           3.660         217.0020         217.002                0              0     new
2            3                 5              1                   1   9176530.0           5  United Kingdom       11            London, United Kingdom          21.673        100.171           3.793         125.6370         125.637                0              2     new
3            4                 4              3                   3   3971883.0           1   United States       21   Los Angeles Area, United States          14.677         95.518           3.660         113.8550         113.855                0              0     new
4            5                 6              4                   4   4771936.0           1   United States       63        Boston Area, United States           8.663         95.727           3.660         108.0500          108.05                0              1     new
5            6                 3              1                   1  20383994.0          45           China      171                    Beijing, China           7.112         92.931           2.652         102.6950         102.695                0             -3     new
6            7                 7              2                   2  22315474.0          45           China      293                   Shanghai, China           5.097         62.868           2.652          70.6165         70.6165                0              0     new
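If other leaderboards are needed, the query string can be built from parameters instead of being hard-coded. The following is only a sketch: the helper names build_params, records_to_frame, and fetch_leaderboard are made up, and whether the endpoint accepts values other than the ones in the URL above (Cities, leaderboard, 2022) is an assumption that has to be tested against the site. It also assumes the API returns a JSON list of records, as the output above suggests. The demo at the bottom runs offline on two records copied from that output.

```python
import pandas as pd
import requests

API_URL = "https://www.startupblink.com/api/leaderboards"


def build_params(leaderboard_type="Cities", industry="leaderboard", year=2022):
    """Query parameters taken from the URL in the answer; other values are untested."""
    return {"leaderboard_type": leaderboard_type, "industry": industry, "year": year}


def records_to_frame(records, columns=None):
    """Turn a list of JSON records into a DataFrame, optionally keeping some columns."""
    frame = pd.DataFrame(records)
    if columns:
        frame = frame[columns]
    return frame


def fetch_leaderboard(**kwargs):
    """Fetch one leaderboard (network call, not executed in this sketch)."""
    response = requests.get(API_URL, params=build_params(**kwargs), timeout=10)
    response.raise_for_status()
    return records_to_frame(response.json())


# Offline demo: two records copied from the output shown above
sample = [
    {"global_rank": 1, "display_name": "San Francisco Bay, United States", "change_global": 0},
    {"global_rank": 3, "display_name": "London, United Kingdom", "change_global": 2},
]
frame = records_to_frame(sample, columns=["global_rank", "display_name"])
print(frame)

# Show the URL that requests would build, without sending anything
prepared = requests.Request("GET", API_URL, params=build_params()).prepare()
print(prepared.url)
```

Keeping the JSON-to-DataFrame step separate from the network call makes it possible to check the parsing logic without hitting the site.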