YouTube URL scraping using Python
Question
URL: https://www.youtube.com/@PW-Foundation/videos
Write a Python program to extract the video URL of the first five videos.
import requests
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import logging
youtube_search = "https://www.youtube.com/@PW-Foundation/videos"
url_search = urlopen(youtube_search)
youtube_page = url_search.read()
youtube_html = bs(youtube_page, "html.parser")
youtube_html.findAll('div', {'id':'contents'})
When I execute this, it shows an empty list.
I want the HTML source where I can find the URLs of the first five videos.
Answer 1
Score: 1
- The data is present as a JSON string within a script tag of the HTML, which you can extract and parse with just BeautifulSoup.
- By default, that JSON string holds the data for up to 30 videos, with all the information for each one.
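The extraction idea can be checked offline first, on a tiny synthetic page (the markup and data below are stand-ins, not real YouTube output, and the regex is simplified accordingly):

```python
import json
import re

# A stand-in page: real YouTube pages embed a much larger ytInitialData
# object inside a <script> tag, which is what the regex digs out.
fake_html = (
    '<html><body>'
    '<script nonce="abc123">var ytInitialData = '
    '{"contents": {"videos": ["a1", "b2"]}};</script>'
    '</body></html>'
)

# Non-greedy match between the assignment and the closing ";</script>"
match = re.search(r'var ytInitialData = (.+?);</script>', fake_html)
data = json.loads(match[1])
print(data["contents"]["videos"])  # ['a1', 'b2']
```

The same capture-group-then-json.loads pattern is what the full answer below applies to the real page.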
Here's the way to extract the JSON data and process the video URLs:
import re
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import json
youtube_search = "https://www.youtube.com/@PW-Foundation/videos"
# Open the URL and read the content of the page
url_search = urlopen(youtube_search)
youtube_page = url_search.read()
# Parse the HTML content of the page using BeautifulSoup
youtube_html = bs(youtube_page, "html.parser")
# Define a regular expression pattern to extract the JSON data from the script tag
pattern = r'<script nonce="[-\w]+">\n\s+var ytInitialData = (.+)'
script_data = re.search(pattern=pattern, string=youtube_html.prettify())[1].replace(';', '')
# Load the JSON data into a Python dictionary
json_data = json.loads(script_data)
# Extract the list of videos from the JSON data and store it in the 'videos_container' variable
videos_container = json_data['contents']['twoColumnBrowseResultsRenderer']['tabs'][1]['tabRenderer']['content']['richGridRenderer']['contents']
print(f"Total videos: {len(videos_container)-1}")
# Loop through the video list and print the URLs of the videos
for video in videos_container[:-1]:
    # print(video)
    video_id = video['richItemRenderer']['content']['videoRenderer']['videoId']
    video_url = f"https://www.youtube.com/watch?v={video_id}"
    print(video_url)
Output:
Total videos: 30
- All the details pertaining to a video are available under the video variable in the code above and can be parsed/extracted in the same manner as we extracted the video_url.
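Since the question asks for only the first five videos, slicing the container before the loop is enough. A minimal sketch on synthetic data that mirrors the richItemRenderer path used above (the video IDs here are made up):

```python
# Synthetic stand-in for the 'richGridRenderer' contents list; real pages
# carry roughly 30 of these plus a trailing continuation item.
videos_container = [
    {'richItemRenderer': {'content': {'videoRenderer': {'videoId': f'vid{i}'}}}}
    for i in range(30)
]

# Take only the first five entries before building the watch URLs
first_five = [
    f"https://www.youtube.com/watch?v={v['richItemRenderer']['content']['videoRenderer']['videoId']}"
    for v in videos_container[:5]
]
print(first_five[0])    # https://www.youtube.com/watch?v=vid0
print(len(first_five))  # 5
```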
Answer 2
Score: 0
I would try an approach with Selenium, since YouTube renders these pages with JS and I don't think it's possible to scrape the URLs with requests and bs4.
You can use something like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Firefox()
# driver = webdriver.Chrome()  # If you would prefer to use Chrome

video_urls = []

def accept_cookies():
    try:
        elem = driver.find_element(By.XPATH, "/html/body/c-wiz/div/div/div/div[2]/div[1]/div[3]/div[1]/form[1]/div/div/button/span")
        elem.click()
        return True
    except NoSuchElementException:
        return False

def find_videos():
    print("test")
    try:
        # CODE THAT COPIES THE URLS
        return True
    except NoSuchElementException:
        return False

def activate_game():
    try:
        elem = driver.find_element(By.CLASS_NAME, "btn")
        elem.click()
        return True
    except NoSuchElementException:
        return False

def activate_scraping():
    driver.get("https://www.youtube.com/@NetworkChuck/videos")
    step = 0
    tries = 0
    while step < 2:
        if tries <= 5:  # up to 6 attempts per step
            tries += 1
            success = False
            match step:
                case 0:
                    success = accept_cookies()
                case 1:
                    success = find_videos()
            if success:
                step += 1
                tries = 0
            else:
                time.sleep(2)  # wait 2 secs before retrying the current step
        else:
            return False
    assert "No results found." not in driver.page_source
    driver.close()
    return True

activate_scraping()
Note that I wrote the code so that it retries finding an element and doesn't crash if your connection is slow; you can also easily add steps.
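Stripped of Selenium, that retry loop is just a small step machine, and the pattern can be sketched (and tested) offline with plain functions standing in for the page interactions:

```python
import time

def run_steps(steps, max_tries=5, delay=0):
    """Run each step in order, retrying a failed step up to max_tries times."""
    for step in steps:
        tries = 0
        while not step():
            tries += 1
            if tries >= max_tries:
                return False
            time.sleep(delay)  # back off before retrying the current step
    return True

# Stand-ins for accept_cookies() / find_videos(): the second one succeeds
# only on its third attempt, the way a flaky element lookup might.
attempts = {'count': 0}

def always_ok():
    return True

def flaky():
    attempts['count'] += 1
    return attempts['count'] >= 3

ok = run_steps([always_ok, flaky])
print(ok)                 # True
print(attempts['count'])  # 3
```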
You still need to write the part that copies the links, but if you dive a little into the Selenium docs you should be able to manage that:
https://selenium-python.readthedocs.io/locating-elements.html
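As for that missing URL-copying step: with Selenium you would typically collect the anchors' href attributes, e.g. `[a.get_attribute("href") for a in driver.find_elements(By.ID, "video-title-link")]` (the element ID is an assumption about YouTube's current markup), and then keep only the first five watch links. That filtering part is plain Python and can be sketched offline:

```python
def first_watch_urls(hrefs, limit=5):
    """Keep only /watch?v= links, in page order, up to `limit`."""
    urls = []
    for href in hrefs:
        if href and "/watch?v=" in href:
            urls.append(href)
        if len(urls) == limit:
            break
    return urls

# hrefs as they might come back from elem.get_attribute("href");
# the video IDs are made up for illustration.
sample = [
    "https://www.youtube.com/watch?v=abc123",
    None,                                      # element without an href
    "https://www.youtube.com/@PW-Foundation",  # non-video link
    "https://www.youtube.com/watch?v=def456",
]
print(first_watch_urls(sample))
# ['https://www.youtube.com/watch?v=abc123', 'https://www.youtube.com/watch?v=def456']
```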