Question

I am trying to scrape Basketball Reference, and I am using proxy networks to avoid error 429, but it keeps giving an error.
Here's an example of what I am trying to do, a code that works but does not rotate proxies:
link = 'https://www.basketball-reference.com/teams/NOP/2023.html#advanced'
dfs = pd.read_html(link)
stats = dfs[1].dropna()
d = stats.to_dict('records')
This is me trying to rotate proxies, but it doesn't work:
link = 'https://www.basketball-reference.com/teams/NOP/2023.html#advanced'
page = requests.get(link, proxies={'http://': temp, 'https://': temp}, timeout=5).text
dfs = pd.read_csv(StringIO(page), error_bad_lines=False)
Any advice or quick fixes here?
Answer 1
Score: 3
It's not immediately clear from your code what problem you're having.
I quickly threw together this working example. Maybe it'll help?
Notes:
- I'm splitting each proxy string by ":" to extract host, port, username, and password from each and creating an HTTPProxyAuth object from them, but you can skip passing auth if your proxies don't need it.
- I couldn't get results with dfs = pd.read_csv(page) so I swapped it out for read_html instead.
import requests
from requests.auth import HTTPProxyAuth
import pandas as pd

proxies = [
    'proxy1:port:username:password',
    'proxy2:port:username:password',
    'proxy3:port:username:password',
    # add more proxies as needed
]

link = 'https://www.basketball-reference.com/teams/NOP/2023.html#advanced'

success = False
for proxy_str in proxies:
    try:
        host, port, username, password = proxy_str.split(':')
        proxy = {
            'http': f'http://{username}:{password}@{host}:{port}',
            'https': f'https://{username}:{password}@{host}:{port}'
        }
        # Use this instead if proxies don't need auth:
        # host, port = proxy_str.split(':')
        # proxy = {
        #     'http': f'http://{host}:{port}',
        #     'https': f'https://{host}:{port}'
        # }
        auth = HTTPProxyAuth(username, password)  # remove if proxies don't need auth
        page = requests.get(link, proxies=proxy, timeout=5, auth=auth).text
        dfs = pd.read_html(page)
        stats = dfs[1].dropna()
        d = stats.to_dict('records')
        print(d)  # show results
        success = True
        break
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")

if not success:
    print("All proxies failed.")
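One likely culprit in the question's non-working snippet: requests looks up entries in the proxies mapping by URL scheme, so the keys must be 'http' and 'https'. Entries keyed 'http://' or 'https://' are silently ignored, and the request goes out with no proxy at all. A minimal sketch (the helper name build_proxies is mine):

```python
def build_proxies(proxy_url):
    """Route both schemes through one proxy URL.

    requests selects a proxy by scheme name, so the dict keys must be
    'http' and 'https' -- keys like 'http://' are simply ignored.
    """
    return {'http': proxy_url, 'https': proxy_url}

# e.g. requests.get(link, proxies=build_proxies(temp), timeout=5)
```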
This gets me 15 players' stats (printing the variable 'd') like this, which I assume is what you want.
[
{
'Rk': 1,
'Player': 'CJ McCollum',
'Age': 31,
'G': 62,
'GS': 62,
'MP': 35.0,
'FG': 7.9,
'FGA': 18.2,
'FG%': 0.434,
'3P': 2.8,
'3PA': 7.4,
'3P%': 0.379,
'2P': 5.1,
'2PA': 10.8,
'2P%': 0.472,
'eFG%': 0.511,
'FT': 2.5,
'FTA': 3.2,
'FT%': 0.789,
'ORB': 0.8,
'DRB': 3.5,
'TRB': 4.4,
'AST': 6.0,
'STL': 0.9,
'BLK': 0.5,
'TOV': 2.6,
'PF': 2.0,
'PTS': 21.1
},
#...14 more
]
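A side note on the read_html call above: newer pandas versions deprecate passing a literal HTML string and expect a file-like object, so wrapping the page text in StringIO keeps the snippet future-proof. A small sketch with a toy table (not real Basketball Reference markup):

```python
from io import StringIO
import pandas as pd

html = "<table><tr><th>Player</th><th>PTS</th></tr><tr><td>CJ McCollum</td><td>21.1</td></tr></table>"
dfs = pd.read_html(StringIO(html))  # file-like input avoids the literal-string deprecation
print(dfs[0].to_dict('records'))
```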
For the proxies themselves I'm assuming you have your own; just know that premium residential proxies will give you more consistent results than free ones from geonode or similar.
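Finally, since the underlying problem is HTTP 429 rate limiting, it can also help to honor the server's Retry-After header before rotating to the next proxy, rather than hammering it immediately. A hedged sketch of just the wait computation (the helper name and fallback values are mine):

```python
def retry_wait(headers, attempt):
    """Seconds to wait after a 429 response.

    Honor the server's Retry-After header when present; otherwise
    fall back to exponential backoff (1s, 2s, 4s, ...).
    """
    return int(headers.get('Retry-After', 2 ** attempt))

# inside the proxy loop, before falling through to the next proxy:
#     if resp.status_code == 429:
#         time.sleep(retry_wait(resp.headers, attempt))
#         continue
```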