requests.get().content doesn't return the same code consistently

Question

I have a function that grabs the source code of YouTube clips and then tries to find: startTimeMs, endTimeMs, and videoId.

This is the videoId block:

    import requests

    class className():
        def __init__(self, link):
            # make the request
            self.r = requests.get(link)

        def originalVideoID(self):
            # get the source code
            source = str(self.r.content)
            # these are the endpoints in which the videoID is enclosed
            start = "\"videoDetails\":{\"videoId\":\""
            end = '\"'
            # gets everything right of videoDetails
            videoID = source.split(start)[1]
            # gets everything left of the quote
            videoID = videoID.split(end)[0]
            # hand the ID back to the caller (the traceback shows the result being assigned)
            return videoID

Expected Outcome:

If given a YouTube Clip URL like: https://www.youtube.com/clip/UgkxU2HSeGL_NvmDJ-nQJrlLwllwMDBdGZFs,

videoID should consistently be NiXD4xVJM5Y.

Actual Outcome:

  • Sometimes, the expected outcome occurs.
  • Other times, I get an IndexError from the source.split(start)[1] line.

When debugging this:

I added a start in source check just before the split; it returns False whenever the IndexError is raised.
I also printed str(self.r.content), and in those cases the source code is completely different.

What am I doing wrong?
Is this a case for another tool such as selenium, am I using requests incorrectly, or should I approach this differently?
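
The membership check described above can be folded into the extraction itself, so that a missing marker yields None plus a short summary of the response instead of an IndexError. This is a minimal sketch, assuming the same marker text as in the question. The names extract_video_id, describe_response and fetch_and_report are hypothetical, and the idea that a different page (for example a consent or bot-check page) is sometimes served is an assumption, not something the question confirms.

```python
import re
from typing import Optional

# Same marker the question splits on, written as a regular expression.
# \s* tolerates optional whitespace around the punctuation.
VIDEO_ID_RE = re.compile(r'"videoDetails"\s*:\s*\{\s*"videoId"\s*:\s*"([^"]+)"')


def extract_video_id(html: str) -> Optional[str]:
    """Return the videoId if the marker is present, otherwise None."""
    match = VIDEO_ID_RE.search(html)
    return match.group(1) if match else None


def describe_response(final_url: str, status_code: int, html: str) -> str:
    """Summarise a response so that differing pages are easy to tell apart."""
    found = extract_video_id(html) is not None
    return (f"url={final_url} status={status_code} "
            f"length={len(html)} marker_found={found}")


def fetch_and_report(url: str) -> Optional[str]:
    """Fetch the page, print a summary, and return the videoId or None."""
    import requests  # third-party; imported here so the helpers above stay dependency-free

    r = requests.get(url, timeout=10)
    # r.url is the final URL after any redirects; r.text is the decoded body
    print(describe_response(r.url, r.status_code, r.text))
    return extract_video_id(r.text)
```

Calling fetch_and_report(link) a few times and comparing the printed URL and length for the failing and the successful runs should show whether the request was redirected to a different page.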

EDIT: This is the traceback on the error

    Traceback (most recent call last):
      File "PATHTOPROJECT\FILENAME.py", line 383, in <module>
        main()
      File "PATHTOPROJECT\FILENAME.py", line 165, in download_video
        downloadLink = className(link).originalVideoID()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "PATHTOPROJECT\FILENAME.py", line 67, in originalVideoID
        videoID = source.split(start)[1]
                  ~~~~~~~~~~~~~~~~~~~^^^
    IndexError: list index out of range

The data that I am seeking in the source code is within this script:

    <script nonce="601b9hyYx1NEaPf0pQewqA">
    var ytInitialPlayerResponse =
    {
        ...
        "videoDetails":
        {
            "videoId":"NiXD4xVJM5Y", ...
        },
        ...
        "clipConfig":
        {
            "postId": ... ,"startTimeMs":"0","endTimeMs":"15000"
        }
    }
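
Because the data sits in an object assigned to ytInitialPlayerResponse, one alternative to string splitting is to decode that object and read the fields by key. This is a minimal sketch, assuming the object is valid JSON and begins directly after the assignment. The helper names load_player_response, find_key and clip_info are hypothetical. The exact nesting of clipConfig is elided in the excerpt above, so the sketch searches for that key recursively rather than assuming a path.

```python
import json
import re
from typing import Any, Optional

# Position just before the opening brace of the assigned object.
ASSIGNMENT_RE = re.compile(r"ytInitialPlayerResponse\s*=\s*(?=\{)")


def load_player_response(html: str) -> Optional[dict]:
    """Decode the object assigned to ytInitialPlayerResponse, if present."""
    match = ASSIGNMENT_RE.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it
        data, _ = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def find_key(node: Any, key: str) -> Any:
    """Depth-first search for the first non-None value stored under key."""
    if isinstance(node, dict):
        if node.get(key) is not None:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        result = find_key(child, key)
        if result is not None:
            return result
    return None


def clip_info(html: str) -> Optional[dict]:
    """Return videoId, startTimeMs and endTimeMs, or None if the data is missing."""
    data = load_player_response(html)
    if data is None:
        return None
    details = data.get("videoDetails")
    clip = find_key(data, "clipConfig")
    details = details if isinstance(details, dict) else {}
    clip = clip if isinstance(clip, dict) else {}
    return {
        "videoId": details.get("videoId"),
        "startTimeMs": clip.get("startTimeMs"),
        "endTimeMs": clip.get("endTimeMs"),
    }
```

A None result from clip_info means the page did not contain the assignment at all, which matches the case in the question where the marker was absent from the source.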

Answer 1

Score: 1

Download the chromedriver build that matches your version of Chrome (https://chromedriver.chromium.org/downloads) and unzip the file. Change path_to_chromedriver in the following script, which accepts the cookie policy, waits for the page to be fully loaded, and THEN parses the page content (using your split logic):

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    path_to_chromedriver = "chromedriver/chromedriver"
    video_url = 'https://www.youtube.com/clip/UgkxU2HSeGL_NvmDJ-nQJrlLwllwMDBdGZFs'

    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument("--test-type")
    # Selenium 4 takes the driver path through a Service object;
    # newer 4.x releases no longer accept the executable_path keyword
    driver = webdriver.Chrome(service=Service(path_to_chromedriver), options=options)
    driver.get(video_url)

    # find "accept all" button and submit...
    b = [b for b in driver.find_elements(by=By.TAG_NAME, value="button") if b.accessible_name and b.accessible_name.lower() == 'accept all'][0]
    b.submit()

    # https://stackoverflow.com/a/26567563/12693728: wait for the page to be loaded.
    # Retrieving the video id sometimes fails, presumably because async resources
    # are not loaded in a deterministic order or time. Assume that when the video
    # container is ready, the page is fully loaded.
    timeout = 3
    try:
        element_present = EC.presence_of_element_located((By.CLASS_NAME, 'html5-video-container'))
        WebDriverWait(driver, timeout).until(element_present)
    except TimeoutException:
        print("Timed out waiting for page to load")

    video_id = driver.page_source.split('"videoDetails":{"videoId":"')[1]
    video_id = video_id.split('"')[0]
    print(video_id)
    driver.quit()

output:

    NiXD4xVJM5Y

=> There may be a way to have Chrome run in headless mode; I will leave that to you.
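
For the headless mode mentioned above, one possible starting point is to add a headless switch to the Chrome options. This is a rough sketch, assuming a recent Chrome that understands --headless=new and Selenium 4.6 or later, which can locate a matching driver on its own. The helper names chrome_arguments and make_driver are hypothetical, and whether the cookie dialog behaves the same way in headless mode has not been checked here.

```python
from typing import List


def chrome_arguments(headless: bool = True) -> List[str]:
    """Command-line switches for Chrome; the first two come from the answer."""
    args = ["--ignore-certificate-errors", "--test-type"]
    if headless:
        # Assumption: a recent Chrome that supports the newer headless mode
        args.append("--headless=new")
    return args


def make_driver(headless: bool = True):
    """Create a Chrome driver configured with the switches above."""
    # third-party (pip install selenium); imported lazily so that
    # chrome_arguments can be used without selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    # Assumption: Selenium 4.6+ resolves the chromedriver binary itself
    return webdriver.Chrome(options=options)
```

The driver returned by make_driver() could stand in for the driver created in the script above, with the remaining steps unchanged.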


huangapple
  • Posted on May 11, 2023, 06:13:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76222905.html