May 25, 2023 02:47:45


Getting data out of a website: BS4 and requests keep failing, need another method now

Question


I am trying to scrape the data from the site https://www.startupblink.com with BeautifulSoup, Python, and requests:

from bs4 import BeautifulSoup
import requests
url = "https://www.startupblink.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
links = soup.find_all("a")
for link in links:
    print(link.get("href"))

This finds all the <a> tags on the page and prints the values of their href attributes. My actual requirement is broader: I want to get all the data out of the site.
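The hrefs printed by a loop like this are often relative paths, so following them beyond the landing page means resolving each one against the base URL first. Here is a minimal sketch of that step. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without third-party packages. The sample HTML and the names `LinkCollector` and `extract_links` are made up for illustration and are not taken from the site:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of every <a> tag."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href and avoid duplicates.
        if href:
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:
                self.links.append(absolute)


def extract_links(html: str, base_url: str) -> list[str]:
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical sample markup, not copied from the real page.
SAMPLE_HTML = """
<a href="/startups">Startups</a>
<a href="https://example.com/about">About</a>
<a href="/startups">Startups again</a>
<a>No href</a>
"""

links = extract_links(SAMPLE_HTML, "https://www.startupblink.com")
for link in links:
    print(link)
```

If the page builds most of its content with JavaScript, the HTML returned by requests may contain only a few of these links, so the link list alone may not lead to all the data.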

By the way, would this be even easier with pandas?

Using the pandas library can make scraping and processing data easier. pandas provides data manipulation and analysis tools, including the read_html() function for reading HTML tables directly from a URL or from HTML content. Here is my idea for using pandas to scrape data from https://www.startupblink.com:

import pandas as pd
import requests

# Send a GET request to the URL we want to scrape and store the response in a variable.
url = "https://www.startupblink.com"
response = requests.get(url)

Next, read the HTML tables with pandas. The read_html() function parses the HTML and extracts the tables present on the page, returning a list of DataFrame objects, one per table found. Since I'm interested in every table on the page, I pass response.content to read_html():

tables = pd.read_html(response.content)

Process and use the data: once we have the DataFrame objects representing the tables, we can process and analyze them with pandas' built-in functions and methods.
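One thing worth checking before relying on this approach: read_html() only parses <table> elements and raises ValueError when it finds none, so it is of little use for a page whose data is not in HTML tables. Here is a rough pre-check, sketched with the standard library's html.parser so it runs without pandas installed. The names `TableCounter` and `count_tables` and the sample strings are hypothetical:

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count the <table> start tags in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.count += 1


def count_tables(html: str) -> int:
    counter = TableCounter()
    counter.feed(html)
    return counter.count


# Hypothetical inputs: one page with a table, one without.
with_table = "<table><tr><th>city</th></tr><tr><td>London</td></tr></table>"
without_table = "<div id='app'></div><script src='bundle.js'></script>"

print(count_tables(with_table))
print(count_tables(without_table))
```

With the code from the question this would be called as count_tables(response.text). A result of 0 means pd.read_html(response.content) would raise ValueError rather than return an empty list.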

What do you think about these different approaches?

Answer 1

Score: 1

Not sure what info you are after, but perhaps the API will work for you?

import pandas as pd
import requests


def get_data() -> pd.DataFrame:
    # Leaderboard API endpoint (cities, 2022).
    url = "https://www.startupblink.com/api/leaderboards?leaderboard_type=Cities&industry=leaderboard&year=2022"
    with requests.Session() as session:
        response = session.get(url, timeout=10)
    # Raise requests.HTTPError on a 4xx/5xx response instead of printing None.
    response.raise_for_status()
    # Decode the JSON body and load the records into a DataFrame.
    return pd.DataFrame(data=response.json())


print(get_data())

Output (first few rows):

     global_rank prev_global_rank  national_rank prev_national_rank  population  country_id            country_name  city_id                                display_name quantity_score quality_score business_score quality_factor1  quality_factor change_national change_global change
0              1                1              1                  1   9666055.0           1           United States        5            San Francisco Bay, United States         36.186       510.423          3.660        550.2690         550.269               0             0    new
1              2                2              2                  2  21045000.0           1           United States       15                     New York, United States         18.339       195.003          3.660        217.0020         217.002               0             0    new
2              3                5              1                  1   9176530.0           5          United Kingdom       11                      London, United Kingdom         21.673       100.171          3.793        125.6370         125.637               0             2    new
3              4                4              3                  3   3971883.0           1           United States       21             Los Angeles Area, United States         14.677        95.518          3.660        113.8550         113.855               0             0    new
4              5                6              4                  4   4771936.0           1           United States       63                  Boston Area, United States          8.663        95.727          3.660        108.0500          108.05               0             1    new
5              6                3              1                  1  20383994.0          45                   China      171                              Beijing, China          7.112        92.931          2.652        102.6950         102.695               0            -3    new
6              7                7              2                  2  22315474.0          45                   China      293                             Shanghai, China          5.097        62.868          2.652         70.6165         70.6165               0             0    new
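The query string of the API URL carries three parameters, so other leaderboards might be reachable by varying them. Here is a small sketch that builds the URL with urllib.parse.urlencode. Only the values shown in the answer (Cities, leaderboard, 2022) are known to work; the helper name `build_leaderboard_url` is hypothetical, and any other value passed to it is an assumption to verify against the site:

```python
from urllib.parse import urlencode

# Endpoint taken from the URL used in get_data().
BASE_URL = "https://www.startupblink.com/api/leaderboards"


def build_leaderboard_url(
    leaderboard_type: str = "Cities",
    industry: str = "leaderboard",
    year: int = 2022,
) -> str:
    """Build the leaderboard API URL from its three query parameters."""
    query = urlencode(
        {"leaderboard_type": leaderboard_type, "industry": industry, "year": year}
    )
    return f"{BASE_URL}?{query}"


url = build_leaderboard_url()
print(url)
```

The result can be passed to the get call in get_data() in place of the hard-coded string.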
