YouTube URL scraping using Python

Question

URL: https://www.youtube.com/@PW-Foundation/videos
Write a Python program to extract the video URLs of the first five videos.

import requests
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import logging

youtube_search = "https://www.youtube.com/@PW-Foundation/videos"
url_search = urlopen(youtube_search)
youtube_page = url_search.read()
youtube_html = bs(youtube_page, "html.parser")
youtube_html.findAll('div', {'id':'contents'})

When I execute this, it shows an empty list.

I want an HTML source where I can find the URL of the first five videos.

Answer 1

Score: 1

  • The data is present as a JSON string within a script tag of the HTML, which you can extract and parse with just BeautifulSoup.
  • By default, that JSON string holds the data for up to 30 YouTube videos, with all the information for each video.

Here's the way to extract the JSON data and process the video URLs:

import re
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import json

youtube_search = "https://www.youtube.com/@PW-Foundation/videos"

# Open the URL and read the content of the page
url_search = urlopen(youtube_search)
youtube_page = url_search.read()

# Parse the HTML content of the page using BeautifulSoup
youtube_html = bs(youtube_page, "html.parser")

# Define a regular expression pattern to extract the JSON data from the script tag
# (the trailing ';' of the assignment is matched explicitly so the capture group
# contains only the JSON)
pattern = r'<script nonce="[-\w]+">\n\s+var ytInitialData = (.+);'
script_data = re.search(pattern, youtube_html.prettify())[1]

# Load the JSON data into a Python dictionary
json_data = json.loads(script_data)

# Extract the list of videos from the JSON data and store it in the 'videos_container' variable
videos_container = json_data['contents']['twoColumnBrowseResultsRenderer']['tabs'][1]['tabRenderer']['content']['richGridRenderer']['contents']

print(f"Total videos: {len(videos_container)-1}")

# Loop through the video list and print the URLs of the videos
for video in videos_container[:-1]:
    # print(video)
    video_id = video['richItemRenderer']['content']['videoRenderer']['videoId']
    video_url = f"https://www.youtube.com/watch?v={video_id}"
    print(video_url)

Output:

Total videos: 30

  • All the details pertaining to a video are available under the variable video in the code above and can be parsed/extracted in the same manner as the video_url.
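To illustrate that last point, here is a minimal, hypothetical sketch of pulling one extra field (the title) out of a video entry. The nested dict below is hand-built to mirror the richItemRenderer shape the code above indexes into; the 'title' → 'runs' path is an assumption about the payload, not something confirmed by the answer. Note also that slicing with videos_container[:5] instead of [:-1] would restrict the same loop to the first five videos the question asks for.

```python
# Hypothetical sample entry mirroring the ytInitialData structure used above;
# only 'videoId' comes from the answer, the 'title'/'runs' keys are assumed.
sample_video = {
    "richItemRenderer": {
        "content": {
            "videoRenderer": {
                "videoId": "abc123XYZ",
                "title": {"runs": [{"text": "Sample lecture"}]},
            }
        }
    }
}

# Same drilling-down pattern as the answer's loop body
renderer = sample_video["richItemRenderer"]["content"]["videoRenderer"]
video_url = f"https://www.youtube.com/watch?v={renderer['videoId']}"
title = renderer["title"]["runs"][0]["text"]
print(f"{title}: {video_url}")
# prints: Sample lecture: https://www.youtube.com/watch?v=abc123XYZ
```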

Answer 2

Score: 0

I would try an approach with Selenium, since YouTube renders these pages with JS and I don't think it's possible to scrape the URLs with requests and bs4.

You can use something like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
# driver = webdriver.Chrome()  # If you would prefer to use Chrome


video_urls = []


def accept_cookies():
    try:
        elem = driver.find_element(By.XPATH, "/html/body/c-wiz/div/div/div/div[2]/div[1]/div[3]/div[1]/form[1]/div/div/button/span")
        elem.click()
        return True
    except NoSuchElementException:
        return False


def find_videos():
    print("test")
    try:
        # CODE THAT COPIES THE URLS

        return True
    except NoSuchElementException:
        return False


def activate_game():
    try:
        elem = driver.find_element(By.CLASS_NAME, "btn")
        elem.click()
        return True
    except NoSuchElementException:
        return False


def activate_scraping():
    driver.get("https://www.youtube.com/@NetworkChuck/videos")
    step = 0
    tries = 0
    while step < 2:
        if tries <= 5:  # 5 tries to accomplish the task
            tries += 1
            success = False
            match step:
                case 0:
                    success = accept_cookies()
                case 1:
                    success = find_videos()

            if success:
                step += 1
                tries = 0
            else:
                driver.implicitly_wait(2)  # wait 2 secs before retrying the current step
        else:
            return False
    assert "No results found." not in driver.page_source
    driver.close()
    return True


activate_scraping()

Note that I wrote the code so that it retries finding each element instead of crashing if your connection is slow; you can also easily add steps.

You still need to write the part that copies the links, but if you dive a little into the Selenium docs, you can manage that.
https://selenium-python.readthedocs.io/locating-elements.html
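As a rough sketch of what the missing URL-copying step could look like, here is a hypothetical helper that reads href off the first five link elements. The By.ID locator "video-title-link" in the comment is an assumption about YouTube's current markup (not taken from the answer), and a tiny stub stands in for Selenium's WebElement so the helper's logic runs without a browser:

```python
def collect_video_urls(link_elements, limit=5):
    """Return the hrefs of the first `limit` link elements."""
    urls = []
    for elem in link_elements[:limit]:
        href = elem.get_attribute("href")
        if href:
            urls.append(href)
    return urls

# With a live driver this might be called as (locator is an assumption):
#   links = driver.find_elements(By.ID, "video-title-link")
#   video_urls = collect_video_urls(links)

# Stub standing in for a Selenium WebElement, for demonstration only
class StubElement:
    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None

stubs = [StubElement(f"https://www.youtube.com/watch?v=vid{i}") for i in range(8)]
print(collect_video_urls(stubs))  # only the first five hrefs
```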

huangapple
  • Posted on 2023-07-27 15:47:04
  • Please retain this link when republishing: https://go.coder-hub.com/76777547.html