Using the Python requests library to log into Reddit
Question
I'm trying to scrape HTML data from Reddit while logged in, because the information I need appears only on the logged-in page, not on the page served when I am logged out (this comes from https://stackoverflow.com/questions/76843989/find-elements-by-xpath-does-not-work-and-returns-an-empty-list).
I am using the following code to request login, assuming the login URL is https://www.reddit.com/login/.
import requests

username = "myuser"
password = "password"
payload = {
    'loginUsername': username,
    'loginPassword': password
}
# Create a session so that cookies persist across requests.
s = requests.Session()
headers = {'user-Agent': 'Mozilla/5.0'}
s.headers = headers
#login_url = f"https://www.reddit.com/user/{username}"
#print(login_url)
p = s.post("https://www.reddit.com/login/", data=payload)
# Print the returned HTML and the status code to see whether the login succeeded.
print(p.text)
print(p.status_code)
However, the status code returned is 404, and p.text contains the following:
<!DOCTYPE html>
<html lang="en-CA">
<head>
<title>
reddit.com: Not found
</title>
<link rel="shortcut icon" type="image/png" sizes="512x512" href="https://www.redditstatic.com/accountmanager/favicon/favicon-512x512.png">
<link rel="shortcut icon" type="image/png" sizes="192x192" href="https://www.redditstatic.com/accountmanager/favicon/favicon-192x192.png">
<link rel="shortcut icon" type="image/png" sizes="32x32" href="https://www.redditstatic.com/accountmanager/favicon/favicon-32x32.png">
<link rel="shortcut icon" type="image/png" sizes="16x16" href="https://www.redditstatic.com/accountmanager/favicon/favicon-16x16.png">
<link rel="apple-touch-icon" sizes="180x180" href="https://www.redditstatic.com/accountmanager/favicon/apple-touch-icon-180x180.png">
<link rel="mask-icon" href="https://www.redditstatic.com/accountmanager/favicon/safari-pinned-tab.svg" color="#5bbad5">
<meta name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover">
<meta name="msapplication-TileColor" content="#ffffff"/>
<meta name="msapplication-TileImage" content="https://www.redditstatic.com/accountmanager/favicon/mstile-310x310.png"/>
<meta name="msapplication-TileImage" content="https://www.redditstatic.com/accountmanager/favicon/mstile-310x150.png"/>
<meta name="theme-color" content="#ffffff">
<link rel="stylesheet" href="https://www.redditstatic.com/accountmanager/vendor.4edfac426c2c4357e34e.css">
<link rel="stylesheet" href="https://www.redditstatic.com/accountmanager/theme.02a88d7effc337a0c765.css">
</head>
<body>
<div class="Container m-desktop">
<div class="PageColumns">
<div class="PageColumn PageColumn__left">
<div class="Art"></div>
</div>
<div class="PageColumn PageColumn__right">
<div class="ColumnContainer">
<div class="SnooIcon"></div>
<h1 class="Title">404&mdash;Not found</h1>
<p>
The page you are looking for does not exist.
</p>
</div>
</div>
</div>
</div>
<script>
//<![CDATA
window.___r = {"config": {"tracker_endpoint": "https://events.reddit.com/v2", "tracker_key": "AccountManager3", "tracker_secret": "V2FpZ2FlMlZpZTJ3aWVyMWFpc2hhaGhvaHNoZWl3"}};
//]]>
</script>
<script type="text/javascript" src="https://www.redditstatic.com/accountmanager/vendor.33ac2d92b89a211b0483.js"></script>
<script type="text/javascript" src="https://www.redditstatic.com/accountmanager/theme.5333e8893b6d5b30d258.js"></script>
<script type="text/javascript" src="https://www.redditstatic.com/accountmanager/sentry.d25b8843def9b86b36ac.js"></script>
</body>
</html>
I tried setting the login URL to login_url = f"https://www.reddit.com/user/{username}", but it still does not work.
I also tried https://www.reddit.com/login, without the trailing slash; the status code is then 400 and p.text is empty.
I believe the username and password I entered are correct. Should the login URL be something different?
I noticed that at https://www.reddit.com/login the form's action is as follows:
<form class="AnimatedForm" action="/login" method="post">
Answer 1
Score: 1
Gathering information
If you inspect the network calls in your browser, you'll see that the login request expects the following form data:
login_data = {
    "csrf_token": "<RANDOM_VALUE>",
    "otp": "",
    "password": "PASSWORD",
    "dest": "https://www.reddit.com",
    "username": "USERNAME",
}
The problem is that the csrf_token is dynamic and changes for every request. So, what do we do?
Finding the csrf_token
The csrf_token is available in the HTML returned by a GET request to the login page. So, you can use a library such as BeautifulSoup to extract the token.
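To illustrate the extraction step on its own, here is a minimal sketch that pulls a csrf_token value out of a login form. It swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages. The class name CsrfTokenParser, the function extract_csrf_token, and the sample markup are all hypothetical; the real Reddit page may be laid out differently.

```python
from html.parser import HTMLParser


class CsrfTokenParser(HTMLParser):
    """Collects the value of the first <input name="csrf_token"> it sees."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first matching <input> element.
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "csrf_token":
            self.token = attributes.get("value")


def extract_csrf_token(html):
    """Return the csrf_token value, or raise if the page contains none."""
    parser = CsrfTokenParser()
    parser.feed(html)
    if parser.token is None:
        raise ValueError("no csrf_token input found; the page layout may have changed")
    return parser.token


# Hypothetical sample markup for illustration, not copied from Reddit.
SAMPLE_HTML = """
<form class="AnimatedForm" action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
</form>
"""

print(extract_csrf_token(SAMPLE_HTML))  # prints: abc123
```

Raising an explicit error when the input is missing gives a clearer message than the TypeError you would get from indexing a failed lookup.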
Notes
I found that you need to set the content-type header to application/x-www-form-urlencoded.
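For reference, this is roughly what a body of type application/x-www-form-urlencoded looks like for the fields listed above. The sketch uses only the standard library and sends nothing over the network; the field values are placeholders. As far as I know, requests applies the same encoding when a dict is passed through data=.

```python
from urllib.parse import parse_qsl, urlencode

# Placeholder values; the real csrf_token comes from the login page.
login_data = {
    "csrf_token": "<RANDOM_VALUE>",
    "otp": "",
    "password": "PASSWORD",
    "dest": "https://www.reddit.com",
    "username": "USERNAME",
}

# Encode the fields as key=value pairs joined by "&", with reserved
# characters percent-escaped.
body = urlencode(login_data)
print(body)

# Decoding the body gives back the original fields (blank values included).
decoded = dict(parse_qsl(body, keep_blank_values=True))
print(decoded == login_data)
```

Note that the empty otp field is still sent, as otp= with nothing after the equals sign.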
Code example
import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://www.reddit.com/login"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    "content-type": "application/x-www-form-urlencoded",
}

login_data = {
    "otp": "",
    "password": "PASSWORD",  # Replace with your Reddit password
    "dest": "https://www.reddit.com",
    "username": "USERNAME",  # Replace with your Reddit username
}

with requests.Session() as session:
    session.headers.update(headers)

    # Get the CSRF token
    response = session.get(LOGIN_URL)
    soup = BeautifulSoup(response.content, "html.parser")
    csrf_token = soup.find("input", {"name": "csrf_token"})["value"]
    login_data["csrf_token"] = csrf_token

    # Perform login
    with session.post(LOGIN_URL, data=login_data) as response:
        print(response)
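Printing the response object shows only something like <Response [200]>. To make status codes such as the 400 and 404 from the question easier to read, a small helper along these lines may be useful. The name describe_status is hypothetical, and a 2xx status alone does not prove that the login succeeded, so treat this as a first check only.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a readable summary of an HTTP status code."""
    # HTTPStatus raises ValueError for codes it does not know.
    phrase = HTTPStatus(status_code).phrase
    if 200 <= status_code < 300:
        kind = "success"
    elif 300 <= status_code < 400:
        kind = "redirect"
    elif 400 <= status_code < 500:
        kind = "client error"
    else:
        kind = "server error or other"
    return f"{status_code} {phrase} ({kind})"


for code in (200, 400, 404):
    print(describe_status(code))
```

With a requests response you would call it as describe_status(response.status_code).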