I am trying to webscrape basketball reference, and I am using proxy networks to avoid error 429, but it keeps giving an error


Here's an example of what I am trying to do, a code that works but does not rotate proxies:

link = 'https://www.basketball-reference.com/teams/NOP/2023.html#advanced'
dfs = pd.read_html(link)
stats = dfs[1].dropna()
d = stats.to_dict('records')

This is me trying to rotate proxies, but it doesn't work:

link = 'https://www.basketball-reference.com/teams/NOP/2023.html#advanced'
page = requests.get(link, proxies = {'http://': temp, 'https://':temp}, timeout = 5).text
dfs = pd.read_csv(StringIO(page),error_bad_lines=False)

Any advice or quick fixes here?
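For what it's worth, two things stand out in the failing snippet: requests looks up proxies by scheme name, so the keys should be 'http' and 'https' (not 'http://'/'https://'), and the page is HTML rather than CSV, so pd.read_csv will choke on it where pd.read_html would not. A minimal sketch of the corrected mapping (the proxy URL here is a made-up placeholder, not a real proxy):

```python
temp = "http://user:pass@203.0.113.7:8080"  # hypothetical proxy URL

# requests matches proxies by scheme key: 'http' / 'https', not 'http://'
proxies = {"http": temp, "https": temp}

# The page is HTML, so parse its tables with read_html, not read_csv, e.g.:
# import requests
# import pandas as pd
# from io import StringIO
# page = requests.get(link, proxies=proxies, timeout=5).text
# dfs = pd.read_html(StringIO(page))
```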

Answer 1

Score: 3


It's not immediately clear from your code what problem you're having.

I quickly threw together this working example. Maybe it'll help?

Notes:

  1. I'm splitting each proxy string by ":" to extract the host, port,
    username, and password, and creating an HTTPProxyAuth object
    from them, but you can skip passing auth if your proxies don't need
    it.
  2. I couldn't get results with dfs = pd.read_csv(page), so I swapped it
    out for read_html instead.
import requests
from requests.auth import HTTPProxyAuth
from io import StringIO
import pandas as pd

proxies = [  'proxy1:port:username:password',
             'proxy2:port:username:password',
             'proxy3:port:username:password',
             # add more proxies as needed
           ]

link = 'https://www.basketball-reference.com/teams/NOP/2023.html#advanced'
success = False
for proxy_str in proxies:
    try:
        host, port, username, password = proxy_str.split(':')
        proxy = {
            'http': f'http://{username}:{password}@{host}:{port}',
            'https': f'https://{username}:{password}@{host}:{port}'
        }
        # Use this instead if proxies don't need auth
        # host, port = proxy_str.split(':')
        # proxy = {
        #    'http': f'http://{host}:{port}',
        #    'https': f'https://{host}:{port}'
        # }

        auth = HTTPProxyAuth(username, password)  # remove if proxies don't need auth

        page = requests.get(link, proxies=proxy, timeout=5, auth=auth).text
        dfs = pd.read_html(StringIO(page))  # wrap in StringIO; passing a raw string is deprecated in newer pandas
        stats = dfs[1].dropna()
        d = stats.to_dict('records')
        print(d)  # show results
        success = True
        break
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
if not success:
    print("All proxies failed.")
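
The proxy-string handling in the loop above can also be factored into a small helper that covers both the auth and no-auth layouts in one place — a sketch, assuming the same `host:port[:username:password]` format as the list above:

```python
def parse_proxy(proxy_str):
    """Turn 'host:port' or 'host:port:username:password' into a requests
    proxies mapping plus optional (username, password) credentials."""
    parts = proxy_str.split(":")
    if len(parts) == 4:
        host, port, user, password = parts
        url = f"http://{user}:{password}@{host}:{port}"
        creds = (user, password)
    else:
        host, port = parts
        url = f"http://{host}:{port}"
        creds = None
    # requests routes both schemes through the same proxy URL
    return {"http": url, "https": url}, creds
```

With this, the loop body shrinks to `proxy, creds = parse_proxy(proxy_str)` followed by `auth = HTTPProxyAuth(*creds) if creds else None`.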

This gets me 15 players' stats (printing the variable 'd') like this, which I assume is what you want.

[
  {
    'Rk': 1,
    'Player': 'CJ McCollum',
    'Age': 31,
    'G': 62,
    'GS': 62,
    'MP': 35.0,
    'FG': 7.9,
    'FGA': 18.2,
    'FG%': 0.434,
    '3P': 2.8,
    '3PA': 7.4,
    '3P%': 0.379,
    '2P': 5.1,
    '2PA': 10.8,
    '2P%': 0.472,
    'eFG%': 0.511,
    'FT': 2.5,
    'FTA': 3.2,
    'FT%': 0.789,
    'ORB': 0.8,
    'DRB': 3.5,
    'TRB': 4.4,
    'AST': 6.0,
    'STL': 0.9,
    'BLK': 0.5,
    'TOV': 2.6,
    'PF': 2.0,
    'PTS': 21.1
  },
  # ...14 more
]

For the proxies themselves I'm assuming you have your own, just know that using premium residential proxies like these will give you consistent results over free ones from geonode or similar.
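
Rotating proxies aside, HTTP 429 is the server asking you to slow down, and the response may carry a Retry-After header saying how long to wait. A sketch of honoring that before retrying (get_with_backoff and retry_delay are made-up helper names, not part of requests):

```python
import time
import requests

def retry_delay(headers, attempt, base_delay=5):
    """Seconds to wait before the next attempt: honor Retry-After when the
    server sends it, otherwise back off linearly with the attempt number."""
    try:
        return int(headers.get("Retry-After", ""))
    except (TypeError, ValueError):
        return base_delay * (attempt + 1)

def get_with_backoff(url, retries=3, **kwargs):
    """GET a URL, sleeping and retrying whenever the server answers 429."""
    resp = None
    for attempt in range(retries):
        resp = requests.get(url, timeout=5, **kwargs)
        if resp.status_code != 429:
            break
        time.sleep(retry_delay(resp.headers, attempt))
    return resp
```

Combined with a modest delay between page fetches, this is often enough to stay under the rate limit without any proxies at all.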

huangapple
  • Posted on 2023-02-24 02:47:21
  • Please keep this link when reposting: https://go.coder-hub.com/75549089.html