August 9, 2023, 01:11:35

Using the Python requests library to log in to Reddit

Question

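I'm trying to scrape HTML data from Reddit while logged in, because the information I need appears only on the logged-in page, not on the page served when I am logged out (see https://stackoverflow.com/questions/76843989/find-elements-by-xpath-does-not-work-and-returns-an-empty-list).

I'm using the following code to request a login, assuming the login URL is https://www.reddit.com/login/.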

import requests

username = "myuser"
password = "password"
payload = {
    'loginUsername': username,
    'loginPassword': password
}

# Create a session (a 'with' block would ensure it is closed after use).
s = requests.Session()
headers = {'user-Agent': 'Mozilla/5.0'}

s.headers = headers
p = s.post("https://www.reddit.com/login/", data=payload)
# Print the returned HTML (or something more informative) to see whether the login succeeded.
print(p.text)
print(p.status_code)

However, the returned status code is 404, and p.text contains the following:

<!DOCTYPE html>
<html lang="en-CA">
    <head>
        <title>
            
                reddit.com: Not found
            
        </title>

        <link rel="shortcut icon" type="image/png" sizes="512x512" href="https://www.redditstatic.com/accountmanager/favicon/favicon-512x512.png">
        <link rel="shortcut icon" type="image/png" sizes="192x192" href="https://www.redditstatic.com/accountmanager/favicon/favicon-192x192.png">
        <link rel="shortcut icon" type="image/png" sizes="32x32" href="https://www.redditstatic.com/accountmanager/favicon/favicon-32x32.png">
        <link rel="shortcut icon" type="image/png" sizes="16x16" href="https://www.redditstatic.com/accountmanager/favicon/favicon-16x16.png">
        <link rel="apple-touch-icon" sizes="180x180" href="https://www.redditstatic.com/accountmanager/favicon/apple-touch-icon-180x180.png">
        <link rel="mask-icon" href="https://www.redditstatic.com/accountmanager/favicon/safari-pinned-tab.svg" color="#5bbad5">
        
        <meta name="viewport" content="width=device-width, initial-scale=1, viewport-fit=cover">
        <meta name="msapplication-TileColor" content="#ffffff"/>
        <meta name="msapplication-TileImage" content="https://www.redditstatic.com/accountmanager/favicon/mstile-310x310.png"/>
        <meta name="msapplication-TileImage" content="https://www.redditstatic.com/accountmanager/favicon/mstile-310x150.png"/>
        <meta name="theme-color" content="#ffffff"/>
        
        

  <link rel="stylesheet" href="https://www.redditstatic.com/accountmanager/vendor.4edfac426c2c4357e34e.css">

  <link rel="stylesheet" href="https://www.redditstatic.com/accountmanager/theme.02a88d7effc337a0c765.css">


    </head>
    <body>
        




  
  <div class="Container m-desktop">
    <div class="PageColumns">
        
          
          <div class="PageColumn PageColumn__left">
          
            
<div class="Art"></div>

          </div>
        
        <div class="PageColumn PageColumn__right">
          
<div class="ColumnContainer">
  <div class="SnooIcon"></div>
  <h1 class="Title">404—Not found</h1>
  <p>
    The page you are looking for does not exist.
  </p>
</div>

        </div>
    </div>
</div>


        <script>
            //<![CDATA
                
                window.___r = {"config": {"tracker_endpoint": "https://events.reddit.com/v2", "tracker_key": "AccountManager3", "tracker_secret": "V2FpZ2FlMlZpZTJ3aWVyMWFpc2hhaGhvaHNoZWl3"}};
            //]]>
        </script>
        

  <script type="text/javascript" src="https://www.redditstatic.com/accountmanager/vendor.33ac2d92b89a211b0483.js"></script>

  <script type="text/javascript" src="https://www.redditstatic.com/accountmanager/theme.5333e8893b6d5b30d258.js"></script>

  <script type="text/javascript" src="https://www.redditstatic.com/accountmanager/sentry.d25b8843def9b86b36ac.js"></script>


    </body>
</html>

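I tried setting the login URL to login_url = f"https://www.reddit.com/user/{username}", but that still does not work.
I also tried https://www.reddit.com/login without the trailing slash; the status code is 400 and p.text is empty.
I believe the username and password I entered are correct. Should the login URL be something different?

I noticed that on https://www.reddit.com/login the form's action is as follows: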

<form class="AnimatedForm" action="/login" method="post">
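A relative action such as "/login" is resolved against the URL of the page that contains the form. The following is a minimal sketch of that resolution using only the standard library; the helper names FormActionParser and resolve_form_target are made up for illustration, and the HTML is just the single form tag quoted above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormActionParser(HTMLParser):
    """Collect the (action, method) pair of every <form> start tag."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            attr_map = dict(attrs)
            action = attr_map.get("action") or ""
            method = (attr_map.get("method") or "get").lower()
            self.forms.append((action, method))


def resolve_form_target(page_url, html):
    """Return (absolute_url, method) for the first form found in html."""
    parser = FormActionParser()
    parser.feed(html)
    action, method = parser.forms[0]
    # urljoin resolves the relative action against the page URL.
    return urljoin(page_url, action), method


form_html = '<form class="AnimatedForm" action="/login" method="post">'
target = resolve_form_target("https://www.reddit.com/login", form_html)
print(target)  # ('https://www.reddit.com/login', 'post')
```

Under these assumptions the form posts to https://www.reddit.com/login, without a trailing slash.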


Answer 1

Score: 1

Gathering information

If you inspect the network calls, you'll see that the login request expects the following data:

login_data = {
    "csrf_token": "<RANDOM_VALUE>",
    "otp": "",
    "password": "PASSWORD",
    "dest": "https://www.reddit.com",
    "username": "USERNAME"
}

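The problem is that the csrf_token is dynamic and changes with every request. So what do we do?

Finding the `csrf_token`

The csrf_token is available in the page returned by a GET request to the login URL, so you can use a library such as BeautifulSoup to extract it.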
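The extraction step can be tried offline before touching the network. The sketch below swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages; SAMPLE_HTML is a made-up stand-in for the login page, and HiddenInputParser and extract_csrf_token are hypothetical names. It assumes only that the page carries an input named csrf_token, as described above.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Record the name and value of every named <input> tag."""

    def __init__(self):
        super().__init__()
        self.inputs = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attr_map = dict(attrs)
            name = attr_map.get("name")
            if name:
                self.inputs[name] = attr_map.get("value") or ""


def extract_csrf_token(html):
    """Return the csrf_token value, or None if the input is missing."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.inputs.get("csrf_token")


# Made-up sample; the real login page is assumed to look roughly like this.
SAMPLE_HTML = """
<form class="AnimatedForm" action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
</form>
"""

print(extract_csrf_token(SAMPLE_HTML))            # abc123
print(extract_csrf_token("<p>no form here</p>"))  # None
```

Returning None when the input is absent lets the caller fail with a clear message if the page layout differs from this assumption.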

Notes

I found that you need to set the content-type header to application/x-www-form-urlencoded.
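For reference, application/x-www-form-urlencoded means the fields are sent as percent-encoded key=value pairs joined by "&". The sketch below builds such a body with urllib.parse.urlencode; the csrf_token value is a made-up placeholder. As far as I know, requests applies the same encoding and sets this header itself when a dict is passed via data=, so setting it explicitly may be redundant but should be harmless.

```python
from urllib.parse import parse_qs, urlencode

login_data = {
    "csrf_token": "abc123",  # made-up placeholder value
    "otp": "",
    "password": "PASSWORD",
    "dest": "https://www.reddit.com",
    "username": "USERNAME",
}

# Encode the dict the way an HTML form submission would.
body = urlencode(login_data)
print(body)
# csrf_token=abc123&otp=&password=PASSWORD&dest=https%3A%2F%2Fwww.reddit.com&username=USERNAME

# Decoding the body recovers the original fields, including the empty otp.
decoded = parse_qs(body, keep_blank_values=True)
print(decoded["dest"])  # ['https://www.reddit.com']
```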

Code example

import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://www.reddit.com/login"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
    "content-type": "application/x-www-form-urlencoded",
}

login_data = {
    "otp": "",
    "password": "PASSWORD",  # Replace with your Reddit password
    "dest": "https://www.reddit.com",
    "username": "USERNAME",  # Replace with your Reddit username
}

with requests.Session() as session:
    session.headers.update(headers)

    # Get the CSRF token from the hidden input on the login page
    response = session.get(LOGIN_URL)
    soup = BeautifulSoup(response.content, "html.parser")
    token_input = soup.find("input", {"name": "csrf_token"})
    if token_input is None:
        raise RuntimeError("csrf_token input not found on the login page")
    login_data["csrf_token"] = token_input["value"]

    # Perform the login; the same session carries the cookies from the GET
    response = session.post(LOGIN_URL, data=login_data)
    print(response)

References

https://stackoverflow.com/questions/51351443/get-csrf-token-using-python-requests

