Question

I am trying to scrape Basketball Reference, and I am using proxy networks to avoid error 429, but it keeps giving an error.
Here's an example of what I am trying to do, a code that works but does not rotate proxies:
link = 'https://www.basketball-reference.com/teams/NOP/2023.html#advanced'
dfs = pd.read_html(link)
stats = dfs[1].dropna()
d = stats.to_dict('records')
This is me trying to rotate proxies, but it doesn't work:
link = 'https://www.basketball-reference.com/teams/NOP/2023.html#advanced'
page = requests.get(link, proxies={'http://': temp, 'https://': temp}, timeout=5).text
dfs = pd.read_csv(StringIO(page), error_bad_lines=False)
Any advice or quick fixes here?
Answer 1
Score: 3
It's not immediately clear from your code what problem you're having.
I quickly threw together this working example. Maybe it'll help?
Notes:
- I'm splitting each proxy string by ":" to extract host, port, username, and password from each and creating an HTTPProxyAuth object from them, but you can skip passing auth if your proxies don't need it.
- I couldn't get results with dfs = pd.read_csv(page) so I swapped it out for read_html instead.
import requests
from requests.auth import HTTPProxyAuth
import pandas as pd

proxies = [
    'proxy1:port:username:password',
    'proxy2:port:username:password',
    'proxy3:port:username:password',
    # add more proxies as needed
]

link = 'https://www.basketball-reference.com/teams/NOP/2023.html#advanced'

success = False
for proxy_str in proxies:
    try:
        host, port, username, password = proxy_str.split(':')
        proxy = {
            'http': f'http://{username}:{password}@{host}:{port}',
            'https': f'https://{username}:{password}@{host}:{port}'
        }
        # Use this instead if proxies don't need auth:
        # host, port = proxy_str.split(':')
        # proxy = {
        #     'http': f'http://{host}:{port}',
        #     'https': f'https://{host}:{port}'
        # }
        auth = HTTPProxyAuth(username, password)  # remove if proxies don't need auth
        page = requests.get(link, proxies=proxy, timeout=5, auth=auth).text
        dfs = pd.read_html(page)
        stats = dfs[1].dropna()
        d = stats.to_dict('records')
        print(d)  # show results
        success = True
        break
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")

if not success:
    print("All proxies failed.")
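One likely culprit in the question's non-working snippet: requests looks up entries in the proxies mapping by URL scheme, so the keys must be 'http' and 'https'. Entries keyed 'http://' or 'https://' are silently ignored, and the request goes out with no proxy at all. A minimal sketch (the helper name build_proxies is mine):

```python
def build_proxies(proxy_url):
    """Route both schemes through one proxy URL.

    requests selects a proxy by scheme name, so the dict keys must be
    'http' and 'https' -- keys like 'http://' are simply ignored.
    """
    return {'http': proxy_url, 'https': proxy_url}

# e.g. requests.get(link, proxies=build_proxies(temp), timeout=5)
```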
This gets me 15 players' stats (printing the variable 'd') like this, which I assume is what you want.
[
{
'Rk': 1,
'Player': 'CJ McCollum',
'Age': 31,
'G': 62,
'GS': 62,
'MP': 35.0,
'FG': 7.9,
'FGA': 18.2,
'FG%': 0.434,
'3P': 2.8,
'3PA': 7.4,
'3P%': 0.379,
'2P': 5.1,
'2PA': 10.8,
'2P%': 0.472,
'eFG%': 0.511,
'FT': 2.5,
'FTA': 3.2,
'FT%': 0.789,
'ORB': 0.8,
'DRB': 3.5,
'TRB': 4.4,
'AST': 6.0,
'STL': 0.9,
'BLK': 0.5,
'TOV': 2.6,
'PF': 2.0,
'PTS': 21.1
},
#...14 more
]
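A side note on the read_html call above: newer pandas versions deprecate passing a literal HTML string and expect a file-like object, so wrapping the page text in StringIO keeps the snippet future-proof. A small sketch with a toy table (not real Basketball Reference markup):

```python
from io import StringIO
import pandas as pd

html = "<table><tr><th>Player</th><th>PTS</th></tr><tr><td>CJ McCollum</td><td>21.1</td></tr></table>"
dfs = pd.read_html(StringIO(html))  # file-like input avoids the literal-string deprecation
print(dfs[0].to_dict('records'))
```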
For the proxies themselves I'm assuming you have your own; just know that premium residential proxies will give you more consistent results than free ones from geonode or similar.
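Finally, since the underlying problem is HTTP 429 rate limiting, it can also help to honor the server's Retry-After header before rotating to the next proxy, rather than hammering it immediately. A hedged sketch of just the wait computation (the helper name and fallback values are mine):

```python
def retry_wait(headers, attempt):
    """Seconds to wait after a 429 response.

    Honor the server's Retry-After header when present; otherwise
    fall back to exponential backoff (1s, 2s, 4s, ...).
    """
    return int(headers.get('Retry-After', 2 ** attempt))

# inside the proxy loop, before falling through to the next proxy:
#     if resp.status_code == 429:
#         time.sleep(retry_wait(resp.headers, attempt))
#         continue
```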