Sound to text: transcribe & translate audio from YouTube videos
Why did I start this project?
As someone who learned programming by watching and following many tutorial videos on YouTube, I wanted to give the same benefit to my parents, who also want to keep learning new things. While I can understand English reasonably well thanks to my parents, who supported my studies in the USA, they are not as comfortable with English, and many of the most informative videos are made by English speakers. When good resources for learning new skills exist, a language barrier shouldn’t stand in the way. So I decided to use the power of the Python programming language to solve this problem.
Tech stack I used
- Python 3+
- Python virtual environment (virtualenv)
- OpenAI Whisper API
- Google Cloud Translation API
How did I build this project? (March 23rd ~ 31st, 2023)
Step 1: Install the virtual environment and activate it
When I started this project, I expected many dependencies would need to be installed, so I decided to use a Python virtual environment. I installed the tool with pip3 install virtualenv and created an environment named ‘venv’ with virtualenv venv. Since I was using macOS, I then ran source venv/bin/activate to activate it.
Step 2: Create a function to test in the terminal
# app.py
def main():
    print("testing")

if __name__ == "__main__":
    main()
In the terminal (zsh in my case), I checked that I could run python3 app.py; it should print “testing.”
Step 3: Install the required dependencies
pip install python-dotenv   # to save and load credentials
pip install pytube          # to download YouTube audio streams
pip install openai          # to call the Whisper API
Step 4: Create two different functions
from pytube import YouTube

def extract_audio(url: str):
    yt = YouTube(url)
    title = yt.title
    length = yt.length  # length in seconds

    # Print video length in hrs, mins, secs
    hours = length // 3600
    minutes = (length % 3600) // 60
    seconds = length % 60
    print(f"Video title: {title} \n")
    print(f"Video length: {hours} hrs {minutes} mins {seconds} secs")

    # Get the highest-quality audio stream.
    # Type of audio_stream is <class 'pytube.streams.Stream'>.
    # This orders the audio-only streams by their audio bit rate (abr)
    # in descending order, then selects the first (highest bit rate) one.
    audio_stream = yt.streams.filter(only_audio=True).order_by('abr').desc().first()

    # Audio size in bytes, converted to MB (1 MB = 1,000,000 bytes)
    audio_size = audio_stream.filesize
    audio_size_MB = audio_size / 1000000
    print(f"Audio file size: {audio_size_MB} MB")

    return audio_stream, audio_size_MB
Using the ‘pytube’ library, this function extracts only the highest-quality audio stream from the YouTube URL passed to the main function via the terminal. The video length is printed in hours, minutes, and seconds, and the audio size in MB is returned so the program can check it against the Whisper API’s file-size limit. Filtering for an audio-only stream is necessary because a YouTube video exposes many different streams (video, audio, and combined) along with metadata.
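For example, calling the function looks like this (the URL below is just a placeholder):

# Example usage with a placeholder URL
audio_stream, audio_size_MB = extract_audio("https://www.youtube.com/watch?v=<video-id>")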
from dotenv import load_dotenv
import os
import openai

load_dotenv()  # load credentials (e.g., OPENAI_API_KEY) from the .env file

# Transcribe the audio into English text
def transcribe_to_english(audio_stream):
    openai.api_key = os.getenv("OPENAI_API_KEY")

    # Download the audio stream to ./audio/audio.wav
    media_file_path = audio_stream.download(
        output_path=os.path.join(os.getcwd(), 'audio'),
        filename='audio.wav',
    )

    # Whisper's translation endpoint always outputs English text,
    # so no target-language parameter is needed
    with open(media_file_path, 'rb') as media_file:
        response = openai.Audio.translate(
            model='whisper-1',
            file=media_file,
        )

    # Remove the downloaded audio file once transcription is done
    os.remove(media_file_path)

    return response.text
To run this code, it’s necessary to get an API key from OpenAI; the Whisper API model then produces an English transcript from the audio stream returned by the extract_audio function.
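The credentials are loaded from a .env file via python-dotenv. A minimal sketch of that file is below (OPENAI_API_KEY comes from the code above; storing GOOGLE_PROJECT_ID as an environment variable is my assumption, and both values are placeholders):

# .env
OPENAI_API_KEY=sk-...
GOOGLE_PROJECT_ID=your-google-cloud-project-id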
Step 5: Modify the main function to call other functions
import argparse
import os

MAX_SIZE = 25  # Whisper API file-size limit in MB
GOOGLE_PROJECT_ID = os.getenv("GOOGLE_PROJECT_ID")  # Google Cloud project ID from .env

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", "-i", type=str, required=True, help="YouTube video link")
    args = parser.parse_args()
    url = args.input

    # Extract the audio stream and its size in MB
    audio_stream, audio_size_MB = extract_audio(url)

    if audio_size_MB > MAX_SIZE:
        print("Audio size is more than 25MB, starting chunk files... ")
    else:
        # Transcribe the audio to English, then translate the English text
        ENG_TEXT = transcribe_to_english(audio_stream)
        translate_text(ENG_TEXT, GOOGLE_PROJECT_ID)
argparse was used to let users pass the YouTube video URL directly in the terminal, e.g. python3 app.py --input <video-url>.
The if-else statement was needed because, at the time, OpenAI’s Whisper API had limitations such as a maximum upload size: if the audio file is larger than 25MB, it must be split into multiple chunks before transcription or translation.
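The translate_text function called from main isn’t shown above. Here is a minimal sketch of how it could look with the Google Cloud Translation API v3 client (the google-cloud-translate package and the Korean target language are my assumptions; the post doesn’t show the actual implementation):

from google.cloud import translate

# A sketch of translate_text, assuming the Cloud Translation API v3 client.
# The target language defaults to Korean ("ko") here as an assumption;
# the original implementation isn't shown in this post.
def translate_text(text: str, project_id: str, target_language: str = "ko"):
    client = translate.TranslationServiceClient()
    parent = f"projects/{project_id}/locations/global"

    response = client.translate_text(
        request={
            "parent": parent,
            "contents": [text],
            "mime_type": "text/plain",
            "source_language_code": "en",
            "target_language_code": target_language,
        }
    )

    # Print each translated segment
    for translation in response.translations:
        print(translation.translated_text)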
Challenges I faced
- Parsing the exact chunk of the sound
- Imperfect transcription of other languages, like Korean
- Accessibility of the project for my parents
- The high price of the better Google Translation API models
Since humans don’t speak in fixed-length intervals the way a robot would, splitting the extracted sound into several chunks often cut the audio in the middle of a sentence. Getting the whole transcript was fine, but extracting a specific interval of it was challenging. In addition, the ‘whisper’ model from OpenAI was trained mainly on English; applied to other languages, it still showed weaker transcription quality.
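To illustrate the problem, here is a naive fixed-interval splitter (a hypothetical sketch; pydub is my assumption, since the post doesn’t show the chunking code). It cuts purely by time, which is exactly why chunks end up split mid-sentence:

from pydub import AudioSegment

# Hypothetical fixed-interval chunker (pydub assumed; the original chunking
# code isn't shown). Cutting every N minutes ignores sentence boundaries,
# which is why a chunk can end in the middle of a sentence.
def split_audio(path: str, chunk_minutes: int = 10):
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000  # pydub slices audio by milliseconds
    chunk_paths = []
    for start in range(0, len(audio), chunk_ms):
        chunk = audio[start:start + chunk_ms]
        chunk_path = f"chunk_{start // chunk_ms}.wav"
        chunk.export(chunk_path, format="wav")
        chunk_paths.append(chunk_path)
    return chunk_paths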
The project also only ran on my local computer, in my iTerm2 terminal. So unless I walk my parents through running the program, they can’t use it.
What did I learn?
- Built-in Python libraries like argparse and os
- Parsing chunks of audio
At first, accepting the URL in the terminal and reading files was challenging, but it was exciting that the built-in libraries let the programmer handle this process. Due to the APIs’ limitations, parsing only the required chunk of sound was not easy; I may also lack the sound-engineering knowledge to process audio properly. Still, with the power of artificial intelligence, building this project was easier than I had imagined.
Conclusion
Since the APIs I used were early beta versions, they may improve later. To make this project work for non-programmers, I need to learn to build an API (for example, with FastAPI) and pick up some frontend skills to provide a better UI/UX.
Source code: https://github.com/Seokhyeon315/Project2-SoundToText