Getting data out of a website - using BS4 and requests fails permanently - need another method now

Question

I am trying to scrape the data from the site https://www.startupblink.com with BeautifulSoup, Python, and requests:

    from bs4 import BeautifulSoup
    import requests

    url = "https://www.startupblink.com"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Collect every <a> tag and print its href attribute.
    links = soup.find_all("a")
    for link in links:
        print(link.get("href"))

This finds all the <a> tags on the page and prints the values of their href attributes. My actual requirement is broader: I want to get all the data out of the site.

By the way, would this be even easier with pandas?

Using the pandas library could make scraping and processing the data easier. pandas provides powerful data manipulation and analysis tools, including convenient functions for reading HTML tables directly from a URL. Here is my idea of how to use pandas to scrape data from https://www.startupblink.com:

    import pandas as pd
    import requests

    # Send a GET request to the URL we want to scrape and store the response.
    url = "https://www.startupblink.com"
    response = requests.get(url)

Next, read the HTML tables with pandas: the read_html() function parses the HTML and extracts the tables present on the page. It returns a list of DataFrame objects, one per table found. Since we are interested in every table on the page, we can pass response.content to read_html():

    tables = pd.read_html(response.content)

Process and use the data: once we have the DataFrame objects representing the tables, we can process and analyze them with pandas' built-in functions and methods.
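
One caveat with the read_html() idea: pandas raises a ValueError when the HTML it receives contains no <table> elements, which is common for pages that fill in their content with JavaScript after loading. Whether that is the case for startupblink.com is not confirmed here. The sketch below is a quick pre-check using only the standard library; the TableCounter class, the count_tables() helper, and the two sample HTML strings are hypothetical illustrations, not part of the original code.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "table":
            self.tables += 1


def count_tables(html: str) -> int:
    """Return how many <table> elements appear in the given HTML text."""
    counter = TableCounter()
    counter.feed(html)
    return counter.tables


# Hypothetical sample pages: one with a static table, one that only loads a script.
static_page = "<html><body><table><tr><td>1</td></tr></table></body></html>"
script_page = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

print(count_tables(static_page))
print(count_tables(script_page))
```

If count_tables(response.text) returns 0 for the real page, read_html() has nothing to parse, and the data probably has to come from somewhere else, such as a JSON endpoint.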

What do you think about these different approaches?

Answer 1

Score: 1

Not sure what info you are after, but perhaps the API will work for you?

    import json

    import pandas as pd
    import requests


    def get_data() -> pd.DataFrame:
        url = "https://www.startupblink.com/api/leaderboards?leaderboard_type=Cities&industry=leaderboard&year=2022"
        with requests.Session() as request:
            response = request.get(url, timeout=10)
        # Raise an HTTPError on 4xx/5xx responses; the call returns None otherwise.
        response.raise_for_status()
        data = json.loads(response.text)
        return pd.DataFrame(data=data)


    print(get_data())

Output (first few rows):

       global_rank  prev_global_rank  national_rank  prev_national_rank  population  country_id    country_name  city_id                      display_name  quantity_score  quality_score  business_score  quality_factor1  quality_factor  change_national  change_global  change
    0            1                 1              1                   1   9666055.0           1   United States        5  San Francisco Bay, United States          36.186        510.423           3.660         550.2690         550.269                0              0     new
    1            2                 2              2                   2  21045000.0           1   United States       15           New York, United States          18.339        195.003           3.660         217.0020         217.002                0              0     new
    2            3                 5              1                   1   9176530.0           5  United Kingdom       11            London, United Kingdom          21.673        100.171           3.793         125.6370         125.637                0              2     new
    3            4                 4              3                   3   3971883.0           1   United States       21   Los Angeles Area, United States          14.677         95.518           3.660         113.8550         113.855                0              0     new
    4            5                 6              4                   4   4771936.0           1   United States       63        Boston Area, United States           8.663         95.727           3.660         108.0500          108.05                0              1     new
    5            6                 3              1                   1  20383994.0          45           China      171                    Beijing, China           7.112         92.931           2.652         102.6950         102.695                0             -3     new
    6            7                 7              2                   2  22315474.0          45           China      293                   Shanghai, China           5.097         62.868           2.652          70.6165         70.6165                0              0     new
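
The URL in get_data() carries three query parameters (leaderboard_type, industry, year). A minimal sketch of building that URL from separate values is shown below, so that other combinations can be tried. The build_leaderboard_url() helper and the BASE_URL constant are hypothetical names, and it is an assumption that the endpoint accepts values other than the ones in the answer; only the defaults are known to work from the output above.

```python
from urllib.parse import urlencode

# Endpoint path taken from the answer's URL.
BASE_URL = "https://www.startupblink.com/api/leaderboards"


def build_leaderboard_url(leaderboard_type="Cities", industry="leaderboard", year=2022):
    """Assemble the leaderboard URL from its query parameters.

    The defaults reproduce the URL used in the answer; other values are
    untested assumptions about what the API accepts.
    """
    query = urlencode(
        {"leaderboard_type": leaderboard_type, "industry": industry, "year": year}
    )
    return f"{BASE_URL}?{query}"


print(build_leaderboard_url())
```

With requests, the same effect is available by passing the dictionary through the params argument of get(), which encodes the query string itself.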

huangapple
  • Posted on May 25, 2023, 02:47:45
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76326571.html