requests.get().content doesn't return the same code consistently
Question
I have a function that grabs the source code of YouTube clips and then tries to find: startTimeMs, endTimeMs, and videoId.
This is the videoId block:
import requests

class className():
    def __init__(self, link):
        # make the request
        self.r = requests.get(link)

    def originalVideoID(self):
        # get the source code
        source = str(self.r.content)
        # these are the endpoints in which the videoID is enclosed
        start = "\"videoDetails\":{\"videoId\":\""
        end = '\"'
        # gets everything right of videoDetails
        videoID = source.split(start)[1]
        # gets everything left of the quote
        videoID = videoID.split(end)[0]
        return videoID
Expected Outcome:
If given a YouTube Clip URL like: https://www.youtube.com/clip/UgkxU2HSeGL_NvmDJ-nQJrlLwllwMDBdGZFs,
videoID should consistently be NiXD4xVJM5Y.
Actual Outcome:
- Sometimes, the expected outcome occurs.
- Other times, I get an IndexError from the line videoID = source.split(start)[1].
When debugging this:
I added a check, start in source, just before that line; it returns False whenever the IndexError is thrown.
I have also printed str(self.r.content), which is where I can see that the source code is completely different.
What am I doing wrong?
Is this a case for another tool like Selenium, am I using requests wrong, or should I approach this differently?
EDIT: This is the traceback on the error
Traceback (most recent call last):
  File "PATHTOPROJECT\FILENAME.py", line 383, in <module>
    main()
  File "PATHTOPROJECT\FILENAME.py", line 165, in download_video
    downloadLink = className(link).originalVideoID()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATHTOPROJECT\FILENAME.py", line 67, in originalVideoID
    videoID = source.split(start)[1]
              ~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
The data that I am seeking in the source code is within this script:
<script nonce="601b9hyYx1NEaPf0pQewqA">
var ytInitialPlayerResponse =
{
    ...
    "videoDetails":
    {
        "videoId":"NiXD4xVJM5Y", ...
    },
    ...
    "clipConfig":
    {
        "postId": ... ,"startTimeMs":"0","endTimeMs":"15000"
    }
}
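For illustration only, here is a minimal sketch of how the three fields in the structure above might be pulled out with regular expressions that tolerate whitespace and return None when a field is missing, instead of raising IndexError. The helper name extract_clip_fields, the PATTERNS table, and the sample string are hypothetical and not part of the original code; the sample only imitates the snippet above with the elided parts left out.

```python
import re

# Hypothetical helper (not from the original post): one pattern per field,
# each allowing optional whitespace around ':' and '{'.
PATTERNS = {
    "videoId": re.compile(r'"videoDetails"\s*:\s*\{\s*"videoId"\s*:\s*"([^"]+)"'),
    "startTimeMs": re.compile(r'"startTimeMs"\s*:\s*"(\d+)"'),
    "endTimeMs": re.compile(r'"endTimeMs"\s*:\s*"(\d+)"'),
}


def extract_clip_fields(source):
    """Return a dict of the fields found; a missing field maps to None."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(source)
        result[name] = match.group(1) if match else None
    return result


# Sample modeled on the snippet above (assumption: elided parts are omitted).
sample = '''
var ytInitialPlayerResponse = {
  "videoDetails": {"videoId":"NiXD4xVJM5Y"},
  "clipConfig": {"startTimeMs":"0","endTimeMs":"15000"}
};
'''

print(extract_clip_fields(sample))
print(extract_clip_fields("<html>some other page</html>"))
```

With this shape, the caller can check for None and decide whether to retry the request or report that a different page came back.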
Answer 1
Score: 1
Download the chromedriver build that matches your version of Chrome (https://chromedriver.chromium.org/downloads) and unzip the file. Change path_to_chromedriver in the following script, which accepts the cookie policy, waits for the page to be fully loaded, and only then parses the page content (using your split logic):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

path_to_chromedriver = "chromedriver/chromedriver"
video_url = 'https://www.youtube.com/clip/UgkxU2HSeGL_NvmDJ-nQJrlLwllwMDBdGZFs'

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
# recent Selenium 4 releases no longer accept executable_path; pass the path via Service
driver = webdriver.Chrome(service=Service(path_to_chromedriver), options=options)
driver.get(video_url)

# find the "accept all" button and submit it; skip this step if no such button is shown
buttons = [b for b in driver.find_elements(by=By.TAG_NAME, value="button") if b.accessible_name and b.accessible_name.lower() == 'accept all']
if buttons:
    buttons[0].submit()

# https://stackoverflow.com/a/26567563/12693728: wait for the page to be loaded.
# Retrieving the video id sometimes fails, presumably because async resources are
# not loaded in a deterministic order/time. Assume that once the video container
# is present, the page is fully loaded.
timeout = 3
try:
    element_present = EC.presence_of_element_located((By.CLASS_NAME, 'html5-video-container'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")

video_id = driver.page_source.split('"videoDetails":{"videoId":"')[1]
video_id = video_id.split('"')[0]
print(video_id)
driver.quit()
Output:
NiXD4xVJM5Y
There may be a way to have Chrome run in headless mode; I will leave that to you.
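The split logic in the script above still raises IndexError if the marker is missing from the page source. As a hedged alternative, the sketch below parses the ytInitialPlayerResponse object with the standard json module and returns None when it is absent or malformed. It assumes the object is valid JSON that starts at the first "{" after "var ytInitialPlayerResponse", which matches the snippet in the question but is not guaranteed by YouTube. The names MARKER, parse_player_response, and the sample page are hypothetical.

```python
import json

# Assumption: the player data is assigned to this variable, as in the question.
MARKER = "var ytInitialPlayerResponse"


def parse_player_response(page_source):
    """Return the ytInitialPlayerResponse object as a dict, or None if not found."""
    start = page_source.find(MARKER)
    if start == -1:
        return None  # a different page came back, without the player data
    brace = page_source.find("{", start + len(MARKER))
    if brace == -1:
        return None
    try:
        # raw_decode parses one JSON value starting at `brace` and ignores the rest
        data, _ = json.JSONDecoder().raw_decode(page_source, brace)
    except json.JSONDecodeError:
        return None
    return data


# Hypothetical sample standing in for driver.page_source or response.text.
page = (
    '<script>var ytInitialPlayerResponse = {"videoDetails": {"videoId": "NiXD4xVJM5Y"}, '
    '"clipConfig": {"startTimeMs": "0", "endTimeMs": "15000"}};</script>'
)

data = parse_player_response(page)
if data is not None:
    print(data["videoDetails"]["videoId"])
    print(data["clipConfig"]["startTimeMs"], data["clipConfig"]["endTimeMs"])
```

Parsing the whole object gives videoId, startTimeMs, and endTimeMs from one pass, and the None return lets the caller retry or fall back rather than crash.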