Sound to Text: Transcribing & Translating Audio from YouTube Videos

· By Seokhyeon Byun

Why did I start this project?

As someone who learned programming by following many tutorial videos on YouTube, I wanted to give the same kind of benefit to my parents, who also want to keep learning new things. Thanks to my parents, who supported my studies in the USA, I can understand English reasonably well, but they are not as comfortable with it. Meanwhile, many of the most informative videos on new topics are made by English speakers. Good learning resources shouldn’t be out of reach because of a language barrier, so I wanted to use the Python programming language to solve this problem.

Tech stack I used

  • Python 3
  • pytube (extracting audio from YouTube videos)
  • OpenAI Whisper API (speech-to-text)
  • Google Cloud Translation API (text translation)
  • python-dotenv and argparse (credentials and terminal input)

How did I build this project? (March 23rd ~ 31st, 2023)

Step 1: Install the virtual environment and activate it

When I started this project, I expected to install many dependencies, so I decided to use a Python virtual environment. I installed the tool with pip3 install virtualenv and created an environment named ‘venv’.

Since I was using macOS, I ran source venv/bin/activate to activate the Python virtual environment, where ‘venv’ is the virtual environment’s name.
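
Putting those terminal steps together (the environment name ‘venv’ matches the one above; adjust as needed):

pip3 install virtualenv     # install the virtualenv tool
virtualenv venv             # create a virtual environment named 'venv'
source venv/bin/activate    # activate it (macOS/Linux)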

Step 2: Create a function to test in the terminal

# app.py
def main():
	print("testing")

if __name__=="__main__":
	main()

In the terminal (zsh in my case), I checked that running python3 app.py prints “testing.”

Step 3: Install the required dependencies

  • pip install python-dotenv: to save and load credentials
  • pip install pytube: to extract audio from YouTube videos
  • pip install openai: to use the Whisper API
  • pip install google-cloud-translate: to use the Google Cloud Translation API (called later from the main function)

Step 4: Create two different functions

from pytube import YouTube

def extract_audio(url: str):
    yt = YouTube(url)
    title = yt.title
    length = yt.length  # length in seconds

    # Print video length in hrs, mins, secs
    hours = length // 3600
    minutes = (length % 3600) // 60
    seconds = length % 60
    print(f"Video title: {title}\n")
    print(f"Video length: {hours} hrs {minutes} mins {seconds} secs")

    # Get the highest-quality audio stream (type: <class 'pytube.streams.Stream'>).
    # Filter to audio-only streams, order them by audio bit rate (abr) in
    # descending order, and select the first (i.e., highest bit rate) one.
    audio_stream = yt.streams.filter(only_audio=True).order_by('abr').desc().first()

    # Audio size in bytes
    audio_size = audio_stream.filesize
    # Audio size in MB (1 MB = 1,000,000 bytes)
    audio_size_MB = audio_size / 1000000
    print(f"Audio file size: {audio_size_MB} MB")
    return audio_stream, audio_size_MB

Using the ‘pytube’ library, this function extracts only the highest-quality audio stream from the YouTube URL passed in through the main function via the terminal. The video length is converted into hours, minutes, and seconds as a quick check, and the audio file size is computed to compare against the Whisper API limit. Filtering down to an audio-only stream is necessary because a YouTube video exposes several kinds of data, such as video streams, audio streams, and metadata, and only the audio is needed for transcription.
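
As a quick sanity check, the returned pytube Stream object can be inspected like this (the URL below is a placeholder, and the printed values vary per video):

# Hypothetical usage; substitute any real YouTube URL
stream, size_mb = extract_audio("https://www.youtube.com/watch?v=<VIDEO_ID>")
print(stream.mime_type)  # e.g. "audio/mp4"
print(stream.abr)        # e.g. "128kbps"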

from dotenv import load_dotenv
import os
import openai

# Transcribe the audio into English text
def transcribe_to_english(audio_stream):
    load_dotenv()  # load OPENAI_API_KEY from the .env file
    openai.api_key = os.getenv("OPENAI_API_KEY")

    # Download the audio stream to a local file
    media_file_path = audio_stream.download(
        output_path=os.path.join(os.getcwd(), 'audio'),
        filename='audio.mp4',  # pytube audio-only streams are typically MP4/WebM
    )

    # Translate the audio into English text. The translation endpoint
    # always outputs English, so no target-language parameter is needed.
    with open(media_file_path, 'rb') as media_file:
        response = openai.Audio.translate(
            model='whisper-1',
            file=media_file,
        )

    # Remove the downloaded audio file
    os.remove(media_file_path)
    return response.text

To run this code, it’s necessary to get an API key from OpenAI; the function then uses the Whisper model (whisper-1) to produce an English transcript from the audio stream returned by the extract_audio function.
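
For reference, the same SDK also exposes a transcription endpoint that keeps the source language instead of translating to English. A minimal sketch (the language='ko' hint and the file path are my assumptions, not part of the original project):

import openai

# Sketch: same-language transcription (e.g., Korean stays Korean)
with open('audio/audio.mp4', 'rb') as media_file:
    response = openai.Audio.transcribe(
        model='whisper-1',
        file=media_file,
        language='ko',  # optional ISO-639-1 hint for the source language
    )
print(response.text)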

Step 5: Modify the main function to call other functions

import argparse
import os

MAX_SIZE = 25  # Whisper API upload limit in MB
GOOGLE_PROJECT_ID = os.getenv("GOOGLE_PROJECT_ID")  # assumption: project ID kept in the .env file

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", "-i", type=str, required=True, help="YouTube video link")
    args = parser.parse_args()
    url = args.input

    # Extract the audio stream and its size in MB
    audio_stream, audio_size_MB = extract_audio(url)

    if audio_size_MB > MAX_SIZE:
        print("Audio size is more than 25MB, starting to chunk files... ")
    else:
        # Transcribe the audio to English, then translate the result
        ENG_TEXT = transcribe_to_english(audio_stream)
        translate_text(ENG_TEXT, GOOGLE_PROJECT_ID)

argparse was used so that users can pass the YouTube video URL directly on the command line.
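
For example (placeholder URL):

python3 app.py --input "https://www.youtube.com/watch?v=<VIDEO_ID>"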

The if-else statement was needed because, at the time, OpenAI’s Whisper API had limitations such as a maximum upload size. If the audio file is larger than 25MB, it must be split into multiple chunks before they can be transcribed or translated.
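
The translate_text function itself isn’t shown in this write-up; below is a minimal sketch of what it might look like with the Google Cloud Translation API (v3 client from the google-cloud-translate package). The target language 'ko' (Korean) is my assumption based on the context:

from google.cloud import translate

def translate_text(text: str, project_id: str, target_language: str = "ko"):
    # Sketch only: assumes Google Cloud credentials are already configured
    client = translate.TranslationServiceClient()
    parent = f"projects/{project_id}/locations/global"

    response = client.translate_text(
        request={
            "parent": parent,
            "contents": [text],
            "mime_type": "text/plain",
            "source_language_code": "en",
            "target_language_code": target_language,  # assumption: Korean
        }
    )
    translated = response.translations[0].translated_text
    print(translated)
    return translated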

Challenges I faced

  • Parsing the exact chunk of the sound
  • Imperfect transcription of languages other than English, like Korean
  • Accessibility of the project for my parents
  • The high price of the more capable Google Translation API models

Since humans don’t speak in evenly spaced chunks of sound the way a robot would, when I tried to split the extracted audio into several chunks, the cuts landed in the middle of sentences. Getting the whole transcript was fine, but extracting a specific interval of the transcript was challenging. In addition, OpenAI’s Whisper model was trained mainly on English; applied to other languages, it still showed weaker transcription ability.
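
One way to avoid cutting mid-sentence is to split on silence rather than at fixed intervals. The project doesn’t do this, but here is a minimal sketch with the pydub library (the thresholds are assumptions to tune per video):

from pydub import AudioSegment
from pydub.silence import split_on_silence

# Sketch: split the downloaded audio at natural pauses instead of fixed intervals
audio = AudioSegment.from_file("audio/audio.mp4")
chunks = split_on_silence(
    audio,
    min_silence_len=700,   # assumption: a pause of at least 0.7s counts as a break
    silence_thresh=-40,    # assumption: anything quieter than -40 dBFS is silence
    keep_silence=200,      # keep a little padding so words aren't clipped
)
for i, chunk in enumerate(chunks):
    chunk.export(f"audio/chunk_{i}.mp3", format="mp3")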

The project only ran on my local computer, inside my iTerm2 terminal. So unless I teach my parents how to run the program, they cannot use it.

What did I learn?

  • Built-in Python libraries like argparse and os
  • Parsing sound into chunks

At first, accepting the URL in the terminal and reading files was challenging, but it was exciting to see that built-in libraries let the programmer handle this process. Because of the APIs’ limitations, parsing out only the required chunk of sound was not easy; I likely lack the sound-engineering background to process audio properly. Even so, with the power of Artificial Intelligence, building this project was easier than I had imagined.

Conclusion

Since the APIs I used were early beta versions, they may well improve later. To make this project work for non-programmers, I would need to learn to build an API, for example with FastAPI, and pick up some frontend skills to provide a better UI/UX.
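
As a rough idea of that direction, a minimal sketch of a FastAPI wrapper around the existing functions might look like this (the endpoint name and response shape are my assumptions, not part of the project):

from fastapi import FastAPI

app = FastAPI()

@app.get("/transcribe")
def transcribe_endpoint(url: str):
    # Reuses the functions above; chunking for >25MB files is omitted here
    audio_stream, audio_size_MB = extract_audio(url)
    english_text = transcribe_to_english(audio_stream)
    return {"text": english_text}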

Source code: https://github.com/Seokhyeon315/Project2-SoundToText