How to extract subtitles from YouTube videos in various languages
Question
I used the code below to extract subtitles from YouTube videos, but it only works for videos in English. I also have some videos in Spanish. How can I modify the code to extract Spanish subtitles as well?
<!-- language: py -->

    from pytube import YouTube
    from youtube_transcript_api import YouTubeTranscriptApi
    # Define the video URL or ID of the YouTube video you want to extract text from
    video_url = 'https://www.youtube.com/watch?v=xYgoNiSo-kY'
    # Download the video using pytube
    youtube = YouTube(video_url)
    video = youtube.streams.get_highest_resolution()
    video.download()
    # Get the downloaded video file path
    video_path = video.default_filename
    # Get the video ID from the URL
    video_id = video_url.split('v=')[-1]
    # Get the transcript for the specified video ID
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    # Extract the text from the transcript
    captions_text = ''
    for segment in transcript:
        caption = segment['text']
        captions_text += caption + ' '
    # Print the extracted text
    print(captions_text)
Answer 1
Score: 1
Use `list_transcripts` to get the list of available languages:
Example:
<!-- language-all: py -->

    from youtube_transcript_api import YouTubeTranscriptApi
    video_id = 'xYgoNiSo-kY'
    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
Then loop over the `transcript_list` variable to see which languages are available:
Example:

    for tr in transcript_list:
        print(tr.language_code)
In this case, the result is:
> es
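If only some languages are of interest, the codes printed above can be checked against a preference list before anything is downloaded. The following is a minimal sketch of that selection step in plain Python; `pick_language` and the sample code lists are hypothetical names introduced here for illustration and are not part of `youtube_transcript_api`.

```python
from typing import Iterable, Optional, Sequence


def pick_language(available: Iterable[str], preferred: Sequence[str]) -> Optional[str]:
    """Return the first code in `preferred` that also appears in `available`.

    Returns None when none of the preferred codes is available.
    """
    available_set = set(available)
    for code in preferred:
        if code in available_set:
            return code
    return None


# The video in this example only offers Spanish, so "es" is chosen
# even though English is listed first in the preference order.
print(pick_language(["es"], ["en", "es"]))  # prints: es
```

The preference order matters: with both `en` and `es` available, the same call would return `en`.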
Modify your code to loop over the languages available for the video and download the generated captions:
Example:

    # Variables to store the downloaded captions:
    all_captions = []
    caption = None
    captions_text = ''
    # Loop over all languages available for this video and download the generated captions:
    for tr in transcript_list:
        print("Downloading captions in " + tr.language + "...")
        transcript_obtained_in_language = transcript_list.find_transcript([tr.language_code]).fetch()
        for segment in transcript_obtained_in_language:
            caption = segment['text']
            captions_text += caption + ' '
        all_captions.append({"language": tr.language_code + " - " + tr.language, "captions": captions_text})
        # Reset the accumulators before processing the next language
        caption = None
        captions_text = ''
        print("="*20)
    print("Done")
The `all_captions` variable will then hold the captions and the language of every transcript obtained for the given `VIDEO_ID`.
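When only one language is needed, such as Spanish in the question, the loop may be unnecessary. The sketch below assumes a release of `youtube_transcript_api` that still provides the static `get_transcript` call used in the question, with its `languages` parameter and a return value of dictionaries carrying a `'text'` key; other releases may expose a different interface. The helper names `join_segments` and `fetch_caption_text` are hypothetical.

```python
def join_segments(segments):
    """Concatenate the 'text' field of each transcript segment into one string."""
    return " ".join(segment["text"] for segment in segments)


def fetch_caption_text(video_id, languages=("es", "en")):
    """Fetch the first transcript available among `languages` and return its text.

    Assumption: the installed youtube_transcript_api release offers the static
    get_transcript(video_id, languages=[...]) call and returns a list of dicts.
    """
    # Imported inside the function so join_segments stays usable
    # even when the package is not installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    # Language codes are tried in order; the first one that exists is used.
    segments = YouTubeTranscriptApi.get_transcript(video_id, languages=list(languages))
    return join_segments(segments)
```

Calling `fetch_caption_text('xYgoNiSo-kY')` would then request Spanish first and fall back to English. It needs network access and raises an error from the library if neither language is available.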