Getting data out of a website: using BS4 and requests keeps failing, need another method

Question

I am trying to scrape data from the site https://www.startupblink.com with BeautifulSoup, Python, and requests:

from bs4 import BeautifulSoup
import requests

url = "https://www.startupblink.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

links = soup.find_all("a")

for link in links:
    print(link.get("href"))

This finds all the <a> tags on the page and prints the values of their href attributes. My actual requirement is broader: I want to get all the data out of the site.
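One caveat about the loop above: link.get("href") returns None for <a> tags without an href attribute, and many href values are relative paths or non-web links such as mailto:. A minimal sketch of cleaning such a list with the standard library only; the helper name normalize_links and the sample values are made up for illustration and are not taken from the site:

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Return absolute http(s) URLs from raw href values, de-duplicated in order."""
    seen = set()
    result = []
    for href in hrefs:
        # link.get("href") yields None when an <a> tag has no href attribute.
        if not href:
            continue
        absolute = urljoin(base_url, href.strip())
        # Skip mailto:, javascript:, tel: and similar non-web schemes.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical sample of href values, standing in for what the loop prints.
sample = [
    "/startups",
    None,
    "https://www.startupblink.com/blog",
    "mailto:info@example.com",
    "/startups",
]
links = normalize_links("https://www.startupblink.com", sample)
print(links)
```

In the original loop this would replace print(link.get("href")) with collecting the values into a list first and passing that list to the helper.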

By the way, would this be even easier with pandas?

Using the pandas library can make scraping and processing data easier. Pandas provides data manipulation and analysis tools, including the read_html() function for reading HTML tables directly from a URL. Here are my ideas on how to use pandas to scrape data from https://www.startupblink.com:

First, send a GET request to the URL we want to scrape and store the response in a variable:

import pandas as pd
import requests

url = "https://www.startupblink.com"
response = requests.get(url)

Then we read the HTML tables using pandas. The read_html() function parses the HTML and extracts the tables present on the page, returning a list of DataFrame objects, one per table found. Since we are interested in all tables on the page, we can pass response.content to read_html():

tables = pd.read_html(response.content)

Finally, process and use the data: once we have the DataFrame objects representing the tables, we can process and analyze the data using pandas' built-in functions and methods.
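To make the "process and use" step concrete, here is a minimal sketch of what could follow once read_html() has returned its list of DataFrames. The two small frames are invented stand-ins for real output, and describe_tables is a hypothetical helper name. Note that read_html() raises ValueError when the HTML contains no <table> elements, which can happen when a page builds its content with JavaScript.

```python
import pandas as pd


def describe_tables(tables):
    """Summarize each DataFrame in a list such as the one read_html() returns."""
    rows = []
    for index, table in enumerate(tables):
        rows.append(
            {
                "table": index,
                "rows": len(table),
                "columns": ", ".join(str(c) for c in table.columns),
            }
        )
    return pd.DataFrame(rows, columns=["table", "rows", "columns"])


# Stand-in data: two small frames in place of real read_html() output.
tables = [
    pd.DataFrame({"city": ["London", "Beijing"], "rank": [3, 6]}),
    pd.DataFrame({"country": ["China"], "cities": [2]}),
]

summary = describe_tables(tables)
print(summary)

# Pick the table with the most rows and continue with ordinary pandas operations.
largest = max(tables, key=len)
print(largest.sort_values("rank"))
```

Listing the shapes first is a cheap way to find which of the returned tables actually holds the data of interest before doing any further processing.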

What do you think about these different approaches?

Answer 1

Score: 1

Not sure what info you are after, but perhaps the API will work for you?

import pandas as pd
import requests


def get_data() -> pd.DataFrame:
    # Leaderboard API endpoint (cities, 2022).
    url = "https://www.startupblink.com/api/leaderboards?leaderboard_type=Cities&industry=leaderboard&year=2022"

    with requests.Session() as session:
        response = session.get(url, timeout=10)

    # Raise an HTTPError on 4xx/5xx responses; the original printed the
    # return value of raise_for_status(), which is None when nothing is raised.
    response.raise_for_status()

    # Decode the JSON body and build one DataFrame row per record.
    return pd.DataFrame(data=response.json())


print(get_data())

Output (first few rows):

     global_rank prev_global_rank  national_rank prev_national_rank  population  country_id            country_name  city_id                                display_name quantity_score quality_score business_score quality_factor1  quality_factor change_national change_global change
0              1                1              1                  1   9666055.0           1           United States        5            San Francisco Bay, United States         36.186       510.423          3.660        550.2690         550.269               0             0    new
1              2                2              2                  2  21045000.0           1           United States       15                     New York, United States         18.339       195.003          3.660        217.0020         217.002               0             0    new
2              3                5              1                  1   9176530.0           5          United Kingdom       11                      London, United Kingdom         21.673       100.171          3.793        125.6370         125.637               0             2    new
3              4                4              3                  3   3971883.0           1           United States       21             Los Angeles Area, United States         14.677        95.518          3.660        113.8550         113.855               0             0    new
4              5                6              4                  4   4771936.0           1           United States       63                  Boston Area, United States          8.663        95.727          3.660        108.0500          108.05               0             1    new
5              6                3              1                  1  20383994.0          45                   China      171                              Beijing, China          7.112        92.931          2.652        102.6950         102.695               0            -3    new
6              7                7              2                  2  22315474.0          45                   China      293                             Shanghai, China          5.097        62.868          2.652         70.6165         70.6165               0             0    new
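Once the DataFrame is available, the usual pandas operations apply. The following is a minimal sketch that works offline on four records copied from the output above and reduced to a few columns. The helper name top_cities is hypothetical, and the scores are written as strings on the assumption that the API may deliver them that way; pd.to_numeric leaves them unchanged if they are already numbers.

```python
import pandas as pd

# Four records copied from the output above, reduced to a few columns.
# Assumption: scores may arrive as strings, so they are stored as strings here.
records = [
    {"global_rank": 3, "display_name": "London, United Kingdom",
     "country_name": "United Kingdom", "quality_score": "100.171"},
    {"global_rank": 1, "display_name": "San Francisco Bay, United States",
     "country_name": "United States", "quality_score": "510.423"},
    {"global_rank": 6, "display_name": "Beijing, China",
     "country_name": "China", "quality_score": "92.931"},
    {"global_rank": 2, "display_name": "New York, United States",
     "country_name": "United States", "quality_score": "195.003"},
]
df = pd.DataFrame(records)


def top_cities(frame, n=3):
    """Return the n best-ranked cities with numeric scores and a short column list."""
    columns = ["global_rank", "display_name", "country_name", "quality_score"]
    out = frame[columns].copy()
    # Convert score strings to floats; unparseable values become NaN.
    out["quality_score"] = pd.to_numeric(out["quality_score"], errors="coerce")
    return out.sort_values("global_rank").head(n).reset_index(drop=True)


top = top_cities(df, n=2)
print(top)

# Number of listed cities per country in this small sample.
counts = df["country_name"].value_counts()
print(counts)
```

With the real data, df would simply be the return value of get_data() instead of the hand-built sample.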


huangapple
  • Posted on May 25, 2023, 02:47:45
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76326571.html