requests.get().content doesn't return the same code consistently

Question

I have a function that grabs the source code of YouTube clips and then tries to find: startTimeMs, endTimeMs, and videoId.

This is the videoId block:

class className():
    def __init__(self, link):
        # make the request
        self.r = requests.get(link)

    def originalVideoID(self):
        # get the source code
        source = str(self.r.content)

        # these are the endpoints in which the videoID is enclosed
        start = '"videoDetails":{"videoId":"'
        end = '"'

        # gets everything right of videoDetails
        videoID = source.split(start)[1]

        # gets everything left of the quote
        videoID = videoID.split(end)[0]

        return videoID

Expected Outcome:

If given a YouTube Clip URL like: https://www.youtube.com/clip/UgkxU2HSeGL_NvmDJ-nQJrlLwllwMDBdGZFs,

videoID should consistently be NiXD4xVJM5Y.

Actual Outcome:

  • Sometimes, the expected outcome occurs.
  • Other times, I get an IndexError from line 15 (videoID = source.split(start)[1]).

When debugging this:

I added a start in source check at line 14, and it returns False whenever the IndexError is thrown.
I have also printed str(self.r.content), which is where I can see that the source code is completely different.
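
The check described above can be folded into the extraction itself, so that a page without the marker yields None instead of an IndexError. This is only a minimal sketch: find_between is a hypothetical helper name, and the two HTML strings are made-up samples rather than real YouTube responses.

```python
from typing import Optional


def find_between(source: str, start: str, end: str) -> Optional[str]:
    """Return the text between start and end, or None if either marker is missing."""
    # str.partition never raises: if start is absent, found_start is "".
    _, found_start, rest = source.partition(start)
    if not found_start:
        return None
    value, found_end, _ = rest.partition(end)
    return value if found_end else None


# Marker taken from the question; both pages below are invented samples.
START = '"videoDetails":{"videoId":"'
good_page = 'var x = {"videoDetails":{"videoId":"NiXD4xVJM5Y","title":"t"}};'
other_page = "<html><body>some different page</body></html>"

print(find_between(good_page, START, '"'))   # NiXD4xVJM5Y
print(find_between(other_page, START, '"'))  # None
```

A None result does not explain why the page differs, but it lets the caller log the unexpected response or retry instead of crashing.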

What am I doing wrong?
Is this a case for another tool like Selenium, am I using requests wrong, or should I approach this differently?

EDIT: This is the traceback on the error

Traceback (most recent call last):
  File "PATHTOPROJECT\FILENAME.py", line 383, in <module>
    main()
  File "PATHTOPROJECT\FILENAME.py", line 165, in download_video
    downloadLink = className(link).originalVideoID()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATHTOPROJECT\FILENAME.py", line 67, in originalVideoID
    videoID = source.split(start)[1]
              ~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

The data that I am seeking in the source code is within this script:

<script nonce="601b9hyYx1NEaPf0pQewqA">
    var ytInitialPlayerResponse = 
    {
        ...
        "videoDetails":
        {
            "videoId":"NiXD4xVJM5Y", ...
        },
        ...
        "clipConfig":
        {
            "postId": ... ,"startTimeMs":"0","endTimeMs":"15000"
        }
    }
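
Since the values sit inside a JSON object assigned to ytInitialPlayerResponse, one option is to decode that object rather than split the raw text. The following is a rough sketch under that assumption: parse_player_response is a hypothetical helper, the sample page is invented to mirror the keys shown above, and a real page may be laid out differently.

```python
import json
from typing import Optional

MARKER = "var ytInitialPlayerResponse = "


def parse_player_response(html: str) -> Optional[dict]:
    """Decode the object assigned to ytInitialPlayerResponse, or return None."""
    _, found, rest = html.partition(MARKER)
    if not found:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it
        data, _ = json.JSONDecoder().raw_decode(rest.lstrip())
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


# Made-up sample page mirroring the keys shown in the question.
sample = (
    '<script nonce="abc">var ytInitialPlayerResponse = '
    '{"videoDetails":{"videoId":"NiXD4xVJM5Y"},'
    '"clipConfig":{"postId":"p","startTimeMs":"0","endTimeMs":"15000"}};'
    "</script>"
)

data = parse_player_response(sample)
if data is not None:
    print(data["videoDetails"]["videoId"])
    print(data["clipConfig"]["startTimeMs"], data["clipConfig"]["endTimeMs"])
```

Returning None when the marker is absent keeps the "different page" case explicit, so the caller can decide whether to retry or fall back to a browser-based approach.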

Answer 1

Score: 1

Download the chromedriver that matches your version of Chrome (https://chromedriver.chromium.org/downloads) and unzip the file. Change path_to_chromedriver in the following script, which accepts the cookie policy, waits for the page to be fully loaded, and only THEN parses the page content (using your split logic):

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

path_to_chromedriver = "chromedriver/chromedriver"
video_url = "https://www.youtube.com/clip/UgkxU2HSeGL_NvmDJ-nQJrlLwllwMDBdGZFs"

options = webdriver.ChromeOptions()
options.add_argument("--ignore-certificate-errors")
options.add_argument("--test-type")
# Selenium 4 takes the driver path through a Service object;
# the executable_path keyword was deprecated and later removed
driver = webdriver.Chrome(service=Service(path_to_chromedriver), options=options)

driver.get(video_url)

# find the "accept all" button and submit it; guard against the case
# where no button matches (for example, if the cookie dialog is not shown)
accept_buttons = [
    b for b in driver.find_elements(by=By.TAG_NAME, value="button")
    if b.accessible_name and b.accessible_name.lower() == "accept all"
]
if accept_buttons:
    accept_buttons[0].submit()

# https://stackoverflow.com/a/26567563/12693728: wait for the page to be loaded.
# Retrieving the video id sometimes fails, presumably because async resources
# are not loaded in a deterministic order; assume that once the video
# container is present, the page is fully loaded.
timeout = 3
try:
    element_present = EC.presence_of_element_located((By.CLASS_NAME, "html5-video-container"))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")

video_id = driver.page_source.split('"videoDetails":{"videoId":"')[1]
video_id = video_id.split('"')[0]
print(video_id)

driver.quit()

output:

NiXD4xVJM5Y

=> There may be a way to have Chrome run in headless mode; I'll leave that to you.
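
On the headless point, a possible starting place is sketched below. It assumes Selenium 4 and a recent Chrome that accepts the --headless=new switch (older releases use plain --headless); chrome_arguments and make_driver are hypothetical helper names, and this has not been checked against the YouTube consent flow.

```python
from typing import List


def chrome_arguments(headless: bool = True) -> List[str]:
    """Command-line switches to pass to Chrome (hypothetical helper)."""
    args = ["--ignore-certificate-errors", "--test-type"]
    if headless:
        # "--headless=new" is the headless switch in recent Chrome releases;
        # older releases use plain "--headless".
        args.append("--headless=new")
    return args


def make_driver(path_to_chromedriver: str, headless: bool = True):
    """Build a Chrome driver with the switches above.

    selenium is imported inside the function so that chrome_arguments
    can be used on its own without selenium installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(service=Service(path_to_chromedriver), options=options)
```

With this in place, driver = make_driver("chromedriver/chromedriver") would stand in for the driver setup in the script above, and passing headless=False restores a visible browser window for debugging.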

Posted on May 11, 2023, 06:13:49