2023-05-11 06:13:49

requests.get().content doesn't return the same code consistently

Question

I have a function that grabs the source code of YouTube clips and then tries to find: startTimeMs, endTimeMs, and videoId.

This is the videoId block:

import requests

class className():
    def __init__(self, link):
        # make the request
        self.r = requests.get(link)

    def originalVideoID(self):
        # get the source code
        source = str(self.r.content)
        # these are the endpoints in which the videoID is enclosed
        start = "\"videoDetails\":{\"videoId\":\""
        end = '\"'
        # gets everything right of videoDetails
        videoID = source.split(start)[1]
        # gets everything left of the quote
        videoID = videoID.split(end)[0]
        return videoID

Expected Outcome:

If given a YouTube Clip URL like: https://www.youtube.com/clip/UgkxU2HSeGL_NvmDJ-nQJrlLwllwMDBdGZFs,

videoID should consistently be NiXD4xVJM5Y.

Actual Outcome:

Sometimes, the expected outcome occurs.
Other times, I get an IndexError from the source.split(start)[1] line.

When debugging this:

I added a start in source check just before the split; it returns False whenever the IndexError is thrown.
I also printed str(self.r.content), and in those cases the source code is completely different.

What am I doing wrong?
Is this a case for another tool like Selenium, am I using requests wrong, or should I approach this differently?

EDIT: This is the traceback on the error

Traceback (most recent call last):
  File "PATHTOPROJECT\FILENAME.py", line 383, in <module>
    main()
  File "PATHTOPROJECT\FILENAME.py", line 165, in download_video
    downloadLink = className(link).originalVideoID()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATHTOPROJECT\FILENAME.py", line 67, in originalVideoID
    videoID = source.split(start)[1]
              ~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

The data that I am seeking in the source code is within this script:

<script nonce="601b9hyYx1NEaPf0pQewqA">
    var ytInitialPlayerResponse = 
    {
        ...
        "videoDetails":
        {
            "videoId":"NiXD4xVJM5Y", ...
        },
        ...
        "clipConfig":
        {
            "postId": ... ,"startTimeMs":"0","endTimeMs":"15000"
        }
    }
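
For illustration, here is a minimal sketch of reading these fields from a script like the excerpt above by parsing the object as JSON rather than splitting on quotes. The helper name extract_clip_info and the sample string are hypothetical, and the sketch assumes the object assigned to ytInitialPlayerResponse is valid JSON. It returns None instead of raising when the marker is missing, which is what would happen when a different page is served.

```python
import json

# Hypothetical helper, not part of the original post.
MARKER = "var ytInitialPlayerResponse"


def extract_clip_info(html):
    """Return (videoId, startTimeMs, endTimeMs), or None if the data is absent."""
    marker_pos = html.find(MARKER)
    if marker_pos == -1:
        return None  # e.g. a different page was served
    brace_pos = html.find("{", marker_pos)
    if brace_pos == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it
        data, _ = json.JSONDecoder().raw_decode(html, brace_pos)
    except json.JSONDecodeError:
        return None
    video_id = data.get("videoDetails", {}).get("videoId")
    clip = data.get("clipConfig", {})
    return video_id, clip.get("startTimeMs"), clip.get("endTimeMs")


# Made-up sample modelled on the excerpt above (postId is a placeholder)
sample = (
    '<script nonce="x">var ytInitialPlayerResponse = '
    '{"videoDetails":{"videoId":"NiXD4xVJM5Y"},'
    '"clipConfig":{"postId":"placeholder","startTimeMs":"0","endTimeMs":"15000"}};'
    "</script>"
)
print(extract_clip_info(sample))  # ('NiXD4xVJM5Y', '0', '15000')
print(extract_clip_info("<html>some other page</html>"))  # None
```

Checking for None before using the result makes the "different source code" case visible without a traceback.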

Answer 1

Score: 1

Download the chromedriver that matches your version of Chrome (https://chromedriver.chromium.org/downloads) and unzip the file. Change path_to_chromedriver in the following script, which accepts the cookie policy, waits for the page to be fully loaded, and only then parses the page content with your split logic:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

path_to_chromedriver = "chromedriver/chromedriver"
video_url = 'https://www.youtube.com/clip/UgkxU2HSeGL_NvmDJ-nQJrlLwllwMDBdGZFs'

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
# Selenium 4 takes the driver path through a Service object; the
# executable_path keyword used originally is not accepted by recent releases
driver = webdriver.Chrome(service=Service(path_to_chromedriver), options=options)
driver.get(video_url)

# find the "accept all" button and submit...
b = [b for b in driver.find_elements(by=By.TAG_NAME, value="button") if b.accessible_name and b.accessible_name.lower() == 'accept all'][0]
b.submit()

# https://stackoverflow.com/a/26567563/12693728: wait for the page to be loaded.
# Retrieving the video id sometimes fails, presumably because async resources
# are not loaded in a deterministic order/time. Assume that when the video
# container is ready, the page is fully loaded.
timeout = 3
try:
    element_present = EC.presence_of_element_located((By.CLASS_NAME, 'html5-video-container'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")

video_id = driver.page_source.split('"videoDetails":{"videoId":"')[1]
video_id = video_id.split('"')[0]
print(video_id)
driver.quit()

output:

NiXD4xVJM5Y

There may be a way to have Chrome run in headless mode; I will leave that to you.
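
Note that the split logic still raises IndexError whenever the marker is missing from the page source, for example if the wait times out. As a rough sketch, a small helper built on str.partition could return None instead; find_between and the sample strings below are hypothetical names and data for illustration, not part of the script above.

```python
def find_between(source, start, end):
    """Return the text between start and end, or None if either is missing."""
    _, found_start, rest = source.partition(start)
    if not found_start:
        return None  # start marker not in the page
    value, found_end, _ = rest.partition(end)
    return value if found_end else None


marker = '"videoDetails":{"videoId":"'
# Made-up page fragments for illustration
good_page = 'xx"videoDetails":{"videoId":"NiXD4xVJM5Y","title":"..."}'
other_page = "<html>some other page</html>"

print(find_between(good_page, marker, '"'))   # NiXD4xVJM5Y
print(find_between(other_page, marker, '"'))  # None
```

With the Selenium script, this would be called as find_between(driver.page_source, marker, '"'), and a None result can be handled explicitly (retry, log, or skip).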
