How to extract subtitles from YouTube videos in different languages

Question

I used the code below to extract subtitles from YouTube videos, but it only works for videos in English. I also have some videos in Spanish. How can I modify the code to extract Spanish subtitles as well?

    from pytube import YouTube
    from youtube_transcript_api import YouTubeTranscriptApi

    # Define the video URL or ID of the YouTube video you want to extract text from
    video_url = 'https://www.youtube.com/watch?v=xYgoNiSo-kY'

    # Download the video using pytube
    youtube = YouTube(video_url)
    video = youtube.streams.get_highest_resolution()
    video.download()

    # Get the downloaded video file path
    video_path = video.default_filename

    # Get the video ID from the URL
    video_id = video_url.split('v=')[-1]

    # Get the transcript for the specified video ID
    transcript = YouTubeTranscriptApi.get_transcript(video_id)

    # Extract the text from the transcript
    captions_text = ''
    for segment in transcript:
        caption = segment['text']
        captions_text += caption + ' '

    # Print the extracted text
    print(captions_text)

Answer 1

Score: 1

Use `list_transcripts` to get the list of languages available for the video:

Example:

    video_id = 'xYgoNiSo-kY'
    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
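
The ID is hard-coded here, while the question derives it with `video_url.split('v=')[-1]`, which would also pick up any extra query parameters such as `&t=10s`. A minimal sketch of a more tolerant alternative using only the standard library; `extract_video_id` is a hypothetical helper name, and it assumes only `watch?v=` and `youtu.be` style URLs need to be handled:

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url_or_id):
    """Return the video ID from a watch or youtu.be URL, or the input itself if it is not a URL."""
    parsed = urlparse(url_or_id)
    if not parsed.scheme:
        # No scheme: assume a bare video ID was passed in.
        return url_or_id
    if parsed.hostname == "youtu.be":
        # Short links carry the ID in the path.
        return parsed.path.lstrip("/")
    query = parse_qs(parsed.query)
    if "v" in query:
        return query["v"][0]
    raise ValueError(f"No video ID found in {url_or_id!r}")


print(extract_video_id("https://www.youtube.com/watch?v=xYgoNiSo-kY&t=10s"))
```

The returned string can then be passed as `video_id` to the calls shown in this answer.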

Then loop over `transcript_list` to see which languages are available:

Example:

    for tr in transcript_list:
        print(tr.language_code)

In this case, the result is:

> es
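
Since the only transcript listed is `es`, the `get_transcript` call from the question can also be pointed at Spanish directly. The sketch below assumes a release of `youtube_transcript_api` that still provides the static `get_transcript` method used in the question, where the `languages` argument is a priority-ordered list that defaults to English only. `join_segments` and `fetch_caption_text` are hypothetical helper names, and the sample segments are made up for an offline check:

```python
def join_segments(segments):
    """Join the 'text' field of each transcript segment into one string."""
    return " ".join(segment["text"] for segment in segments)


def fetch_caption_text(video_id, languages=("es", "en")):
    """Fetch the first transcript available among `languages` as plain text."""
    # Imported lazily so join_segments can be used without the package installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    # Assumption: static API as used in the question; languages are tried in order.
    segments = YouTubeTranscriptApi.get_transcript(video_id, languages=list(languages))
    return join_segments(segments)


# Offline check with made-up segments shaped like the dicts the question iterates over.
sample = [
    {"text": "Hola", "start": 0.0, "duration": 1.2},
    {"text": "mundo", "start": 1.2, "duration": 0.8},
]
print(join_segments(sample))
```

With network access and the package installed, `fetch_caption_text('xYgoNiSo-kY')` would request Spanish first and fall back to English.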

Modify your code to loop over the languages available for the video and download the captions for each one:

Example:

    # Variables for storing the downloaded captions:
    all_captions = []
    captions_text = ''

    # Loop over all languages available for this video and download the captions:
    for tr in transcript_list:
        print("Downloading captions in " + tr.language + "...")
        transcript_obtained_in_language = transcript_list.find_transcript([tr.language_code]).fetch()
        for segment in transcript_obtained_in_language:
            captions_text += segment['text'] + ' '
        all_captions.append({"language": tr.language_code + " - " + tr.language, "captions": captions_text})
        # Reset the buffer before the next language
        captions_text = ''
        print("=" * 20)
    print("Done")

The `all_captions` variable then holds the language and the captions of every transcript found for the given video ID.
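
The same loop could also be packaged as a function, which makes it easy to try without network access. This is only a sketch: `collect_captions` is a hypothetical name, it assumes each transcript object exposes `language`, `language_code` and a `fetch()` method returning a list of dicts with a `text` key (as the loop above relies on), and the `SimpleNamespace` object is a made-up stand-in for a real transcript:

```python
from types import SimpleNamespace


def collect_captions(transcripts):
    """Return one {'language', 'captions'} dict per transcript-like object."""
    results = []
    for tr in transcripts:
        # Each transcript can be fetched directly; no second lookup is needed.
        segments = tr.fetch()
        text = " ".join(segment["text"] for segment in segments)
        results.append({
            "language": f"{tr.language_code} - {tr.language}",
            "captions": text,
        })
    return results


# Made-up stand-in for a transcript object, so the example runs offline.
fake_transcript = SimpleNamespace(
    language_code="es",
    language="Spanish",
    fetch=lambda: [{"text": "Hola"}, {"text": "mundo"}],
)
all_captions = collect_captions([fake_transcript])
print(all_captions)
```

In real use, `collect_captions(transcript_list)` would take the place of the explicit loop.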

huangapple
  • Posted on May 29, 2023 at 21:35:04
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76357844.html