YouTube URL scraping using Python

Question

URL: https://www.youtube.com/@PW-Foundation/videos
Write a Python program to extract the video URLs of the first five videos.

import requests
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import logging

youtube_search = "https://www.youtube.com/@PW-Foundation/videos"
url_search = urlopen(youtube_search)
youtube_page = url_search.read()
youtube_html = bs(youtube_page, "html.parser")
youtube_html.findAll('div', {'id':'contents'})

When I execute this, it shows an empty list.

I want an HTML source where I can find the URL of the first five videos.

Answer 1

Score: 1

  • The data is present as a JSON string within a script tag of the HTML, which you can extract and parse with just BeautifulSoup.
  • By default, that JSON string holds the data for up to 30 YouTube videos, with all the information for each video.

Here's the way to extract the JSON data and process the video URLs:

import re
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import json

youtube_search = "https://www.youtube.com/@PW-Foundation/videos"

# Open the URL and read the content of the page
url_search = urlopen(youtube_search)
youtube_page = url_search.read()

# Parse the HTML content of the page using BeautifulSoup
youtube_html = bs(youtube_page, "html.parser")

# Define a regular expression pattern to extract the JSON data from the script tag
# (the trailing ';' of the assignment is matched explicitly so the capture group
# contains only the JSON)
pattern = r'<script nonce="[-\w]+">\n\s+var ytInitialData = (.+);'
script_data = re.search(pattern, youtube_html.prettify())[1]

# Load the JSON data into a Python dictionary
json_data = json.loads(script_data)

# Extract the list of videos from the JSON data and store it in the 'videos_container' variable
videos_container = json_data['contents']['twoColumnBrowseResultsRenderer']['tabs'][1]['tabRenderer']['content']['richGridRenderer']['contents']

print(f"Total videos: {len(videos_container)-1}")

# Loop through the video list and print the URLs of the videos
for video in videos_container[:-1]:
    # print(video)
    video_id = video['richItemRenderer']['content']['videoRenderer']['videoId']
    video_url = f"https://www.youtube.com/watch?v={video_id}"
    print(video_url)

Output:

Total videos: 30

  • All the details pertaining to a video are available under the variable video in the code above and can be parsed/extracted in the same manner as the video_url.
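To illustrate that last point, here is a minimal, hypothetical sketch of pulling one extra field (the title) out of a video entry. The nested dict below is hand-built to mirror the richItemRenderer shape the code above indexes into; the 'title' → 'runs' path is an assumption about the payload, not something confirmed by the answer. Note also that slicing with videos_container[:5] instead of [:-1] would restrict the same loop to the first five videos the question asks for.

```python
# Hypothetical sample entry mirroring the ytInitialData structure used above;
# only 'videoId' comes from the answer, the 'title'/'runs' keys are assumed.
sample_video = {
    "richItemRenderer": {
        "content": {
            "videoRenderer": {
                "videoId": "abc123XYZ",
                "title": {"runs": [{"text": "Sample lecture"}]},
            }
        }
    }
}

# Same drilling-down pattern as the answer's loop body
renderer = sample_video["richItemRenderer"]["content"]["videoRenderer"]
video_url = f"https://www.youtube.com/watch?v={renderer['videoId']}"
title = renderer["title"]["runs"][0]["text"]
print(f"{title}: {video_url}")
# prints: Sample lecture: https://www.youtube.com/watch?v=abc123XYZ
```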

Answer 2

Score: 0

I would try an approach with Selenium, since YouTube renders these pages with JS and I don't think it's possible to scrape the URLs with requests and bs4.

You can use something like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
# driver = webdriver.Chrome()  # If you would prefer to use Chrome


video_urls = []


def accept_cookies():
    try:
        elem = driver.find_element(By.XPATH, "/html/body/c-wiz/div/div/div/div[2]/div[1]/div[3]/div[1]/form[1]/div/div/button/span")
        elem.click()
        return True
    except NoSuchElementException:
        return False


def find_videos():
    print("test")
    try:
        # CODE THAT COPIES THE URLS

        return True
    except NoSuchElementException:
        return False


def activate_game():
    try:
        elem = driver.find_element(By.CLASS_NAME, "btn")
        elem.click()
        return True
    except NoSuchElementException:
        return False


def activate_scraping():
    driver.get("https://www.youtube.com/@NetworkChuck/videos")
    step = 0
    tries = 0
    while step < 2:
        if tries <= 5:  # 5 tries to accomplish the task
            tries += 1
            success = False
            match step:
                case 0:
                    success = accept_cookies()
                case 1:
                    success = find_videos()

            if success:
                step += 1
                tries = 0
            else:
                driver.implicitly_wait(2)  # wait 2 secs before retrying the current step
        else:
            return False
    assert "No results found." not in driver.page_source
    driver.close()
    return True


activate_scraping()

Note that I wrote the code so that it retries finding each element instead of crashing if your connection is slow; you can also easily add steps.

You still need to write the part that copies the links, but if you dive a little into the Selenium docs, you can manage that.
https://selenium-python.readthedocs.io/locating-elements.html
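As a rough sketch of what the missing URL-copying step could look like, here is a hypothetical helper that reads href off the first five link elements. The By.ID locator "video-title-link" in the comment is an assumption about YouTube's current markup (not taken from the answer), and a tiny stub stands in for Selenium's WebElement so the helper's logic runs without a browser:

```python
def collect_video_urls(link_elements, limit=5):
    """Return the hrefs of the first `limit` link elements."""
    urls = []
    for elem in link_elements[:limit]:
        href = elem.get_attribute("href")
        if href:
            urls.append(href)
    return urls

# With a live driver this might be called as (locator is an assumption):
#   links = driver.find_elements(By.ID, "video-title-link")
#   video_urls = collect_video_urls(links)

# Stub standing in for a Selenium WebElement, for demonstration only
class StubElement:
    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None

stubs = [StubElement(f"https://www.youtube.com/watch?v=vid{i}") for i in range(8)]
print(collect_video_urls(stubs))  # only the first five hrefs
```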

huangapple
  • Posted on 2023-07-27 15:47:04
  • Please retain this link when republishing: https://go.coder-hub.com/76777547.html