How to get subtitles for YouTube videos using Python
Sometimes we need the transcript or subtitles of a YouTube video, and the usual way to get them is to open the video on YouTube and generate the transcript manually. The Python package youtube_transcript_api automates this step and returns a transcript that we can turn into plain text.
First, let us install this package by running:
pip install youtube_transcript_api
Next, we need the id of the YouTube video whose transcript we want to generate. In the URL below, the video id is the value of the v parameter (the text after v=):
https://www.youtube.com/watch?v=Y8Tko2YC5hA
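If you would rather not copy the id by hand, it can be pulled out of the URL with the standard library. This is a minimal sketch that assumes a URL of the watch?v=... form shown above; the helper name extract_video_id is made up for this example and is not part of youtube_transcript_api.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the value of the `v` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None


print(extract_video_id("https://www.youtube.com/watch?v=Y8Tko2YC5hA"))
# prints: Y8Tko2YC5hA
```

Other URL shapes, such as shortened share links, are not handled here and would need their own parsing.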
Code
Now, let’s see the code:
from youtube_transcript_api import YouTubeTranscriptApi

def generate_transcript(id):
    transcript = YouTubeTranscriptApi.get_transcript(id)
    script = ""

    for text in transcript:
        t = text["text"]
        if t != '[Music]':
            script += t + " "

    return script, len(script.split())

id = 'Y8Tko2YC5hA'
transcript, no_of_words = generate_transcript(id)
print(transcript)
Explanation
- In line 1, we import the required package.
- In line 3, we create the generate_transcript() function, which accepts the video id as a parameter and returns the transcript as well as the number of words in the transcript.
- In line 4, we use the get_transcript() method of our package, which gets the transcript for the id provided as a parameter. This method returns a list of dictionaries, so we need to do some processing to convert it to a single string.
- In line 7, we run a loop to iterate over all the dictionaries and fetch the text for each time interval. Then, we combine the pieces into a string.
- In line 9, we add a filter to skip [Music] entries so that, if there is any music in the video, its placeholder does not end up in our final transcript string.
- Finally, in line 12, we return the values.
- In line 15, we call our function by passing the video id.
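The string-building step described above can be tried without any network access. The sketch below runs the same idea on a small hand-written list of dictionaries. The sample entries are made up, the "start" and "duration" keys are an assumption about what each dictionary carries besides "text", and join_transcript is a hypothetical helper name.

```python
# Hand-written sample in the list-of-dictionaries shape described above;
# these entries are made up and were not fetched from any video.
sample_transcript = [
    {"text": "[Music]", "start": 0.0, "duration": 4.0},
    {"text": "hello and welcome", "start": 4.0, "duration": 2.5},
    {"text": "to this tutorial", "start": 6.5, "duration": 2.0},
]


def join_transcript(entries, skip=("[Music]",)):
    """Join the "text" fields into one string, skipping placeholder entries."""
    parts = [entry["text"] for entry in entries if entry["text"] not in skip]
    script = " ".join(parts)
    return script, len(script.split())


script, no_of_words = join_transcript(sample_transcript)
print(script)       # hello and welcome to this tutorial
print(no_of_words)  # 6
```

Using " ".join() instead of repeated += gives the same text without a trailing space, and the word count is unchanged.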
This package raises an exception if the YouTube video whose id you passed has no subtitles.
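One way to keep a script running when a video has no subtitles is to wrap the call in try/except. The sketch below is only an illustration: transcript_or_none and failing_fetch are hypothetical names, the fetching function is passed in as a parameter so the example runs offline, and it catches the generic Exception because the package's specific exception classes are not covered here.

```python
def transcript_or_none(video_id, fetch):
    """Call fetch(video_id); return None instead of raising if it fails."""
    try:
        return fetch(video_id)
    except Exception as exc:
        # Broad on purpose for this sketch; in real code, catch the
        # package's own exception classes instead.
        print(f"No transcript for {video_id}: {exc}")
        return None


# Hypothetical stand-in fetcher that always fails, so the sketch runs offline.
def failing_fetch(video_id):
    raise RuntimeError("subtitles are not available")


print(transcript_or_none("VIDEO_ID", failing_fetch))
```

In real use, the generate_transcript function from the code above could be passed as fetch, and the caller would check for None before using the result.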