Question

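I use the code below to extract subtitles from YouTube videos, but it only works for videos in English. I also have some videos in Spanish. How can I modify the code to extract Spanish subtitles as well?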

    from pytube import YouTube
    from youtube_transcript_api import YouTubeTranscriptApi

    # Define the video URL or ID of the YouTube video you want to extract text from
    video_url = 'https://www.youtube.com/watch?v=xYgoNiSo-kY'

    # Download the video using pytube
    youtube = YouTube(video_url)
    video = youtube.streams.get_highest_resolution()
    video.download()

    # Get the downloaded video file path
    video_path = video.default_filename

    # Get the video ID from the URL
    video_id = video_url.split('v=')[-1]

    # Get the transcript for the specified video ID
    transcript = YouTubeTranscriptApi.get_transcript(video_id)

    # Extract the text from the transcript
    captions_text = ''
    for segment in transcript:
        caption = segment['text']
        captions_text += caption + ' '

    # Print the extracted text
    print(captions_text)


Answer 1

Score: 1

Use `list_transcripts` to get the transcripts, and therefore the languages, available for the video:

Example:

    from youtube_transcript_api import YouTubeTranscriptApi

    video_id = 'xYgoNiSo-kY'
    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

Then loop over `transcript_list` to see which languages are available:

Example:

    for tr in transcript_list:
        print(tr.language_code)

In this case, the result is:

> es
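The language code alone does not say whether a transcript was written by hand or generated automatically. The sketch below summarises each entry, assuming the transcript objects expose `language_code`, `language` and `is_generated` attributes as described in the library's documentation. `describe_transcripts` is a hypothetical helper name, and the stand-in objects (with made-up values) only mimic those attributes so the example runs without network access.

```python
from types import SimpleNamespace


def describe_transcripts(transcripts):
    """Return one summary string per transcript-like object."""
    lines = []
    for tr in transcripts:
        kind = "auto-generated" if tr.is_generated else "manual"
        lines.append(f"{tr.language_code} ({tr.language}, {kind})")
    return lines


# Stand-ins for the objects yielded by iterating over transcript_list
# (assumption: the real objects carry the same three attributes).
fake_list = [
    SimpleNamespace(language_code="es", language="Spanish", is_generated=True),
]

for line in describe_transcripts(fake_list):
    print(line)
```

With a real `transcript_list`, the same helper could be called as `describe_transcripts(transcript_list)`.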

Modify your code to loop over the languages available for the video and download the captions for each one:

Example:

    # Collects one entry per available language.
    all_captions = []

    # Loop over every transcript available for this video and download its captions.
    for tr in transcript_list:
        print("Downloading captions in " + tr.language + "...")
        segments = transcript_list.find_transcript([tr.language_code]).fetch()
        captions_text = ' '.join(segment['text'] for segment in segments)
        all_captions.append({"language": tr.language_code + " - " + tr.language, "captions": captions_text})
        print("=" * 20)
    print("Done")

Afterwards, `all_captions` holds the captions and the language of each transcript retrieved for the given `video_id`.
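If only Spanish is needed, the loop over every language can be skipped. The following is a minimal sketch, assuming the same version of `youtube_transcript_api` as the code above, where `get_transcript` accepts a `languages` list in priority order and returns a list of dicts with a `text` key. `extract_video_id`, `segments_to_text` and `fetch_text` are hypothetical helper names. The URL parsing is an addition: it tolerates extra query parameters such as `&t=10s`, which `video_url.split('v=')[-1]` would include in the ID.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url_or_id):
    """Return the video ID from a watch URL, a youtu.be URL, or a bare ID."""
    parsed = urlparse(url_or_id)
    if parsed.hostname is None:
        return url_or_id  # no host part, so treat the input as a bare ID
    if parsed.hostname.endswith("youtu.be"):
        return parsed.path.lstrip("/")
    # Standard watch URL: the ID is the "v" query parameter.
    return parse_qs(parsed.query)["v"][0]


def segments_to_text(segments):
    """Join the 'text' field of every transcript segment with spaces."""
    return " ".join(segment["text"] for segment in segments)


def fetch_text(video_url, languages=("es", "en")):
    """Download the transcript in the first available preferred language."""
    # Imported here so the two helpers above work without the package installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    video_id = extract_video_id(video_url)
    segments = YouTubeTranscriptApi.get_transcript(video_id, languages=list(languages))
    return segments_to_text(segments)
```

Calling `fetch_text('https://www.youtube.com/watch?v=xYgoNiSo-kY')` would then request Spanish first and fall back to English if no Spanish transcript exists.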

