Extracting Relevant Information from Video Recordings - Act Phase

Solution Concept:

Between constant time constraints and internet accessibility issues, many students find themselves overwhelmed, unable to manage their tasks well, and eventually losing their work-life balance. That is why we came up with InfoMiner! InfoMiner is an AI-powered video transcription and analysis service that converts any lesson recording into accurate, searchable text and generates a full report of the key points: a comprehensible summary, a question-answering panel, the main keywords, and an analysis of the speaker's sentiment.
Have we also mentioned that InfoMiner generates this whole report in mere seconds? This makes it a very useful tool for students to organize their time better, access information faster, and easily create detailed summaries that help them prepare for exams and retain information.




  

Video Transcription:

As we previously mentioned, the video transcription process we initially set out to implement consisted of two essential steps:
  • Converting the video to audio with MoviePy: a rather simple step that takes only a few lines of code.
  • Recognizing speech from the audio file with the SpeechRecognition library: for this step we used the Google recognizer, thanks to its accurate, multilingual transcription (see the sketch after this list).
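A minimal sketch of this initial two-step approach; the file names are placeholders, and the clip is assumed to be short enough to stay under the request size limit:

```python
from moviepy.editor import VideoFileClip
import speech_recognition as sr

# Step 1: extract the audio track from the video with MoviePy.
clip = VideoFileClip("lecture.mp4")
clip.audio.write_audiofile("lecture.wav")

# Step 2: transcribe the audio with the Google recognizer from SpeechRecognition.
recognizer = sr.Recognizer()
with sr.AudioFile("lecture.wav") as source:
    audio = recognizer.record(source)  # reads the whole file into memory

# language="fr-FR" because our class recordings are in French.
text = recognizer.recognize_google(audio, language="fr-FR")
print(text)
```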
Although this proved to be a workable method, it has its own inconveniences: the recognizer limits every request to 10 MB, so we can only transcribe short or low-quality audio files. This is a major issue for our project, since our work consists of transcribing online class sessions that each last more than an hour.
That is why we dropped this method and switched to a pretrained model called Whisper, which covers both steps in one and provides other useful functionalities as well.


Whisper has around 769M parameters (the medium checkpoint) and is quite efficient at the video transcription task. It takes a video file as input and outputs a JSON file containing the text transcription, the URL of the video (when an online rather than a local video reference is used) and the video title. Here's an example of the output:
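A minimal sketch, assuming the openai-whisper package and the 769M-parameter medium checkpoint; the file name is a placeholder, and the url and title fields are assembled by our own wrapper rather than returned by Whisper itself:

```python
import whisper

# Whisper decodes the audio track of a video file directly through ffmpeg,
# so no separate audio-extraction step is needed.
model = whisper.load_model("medium")  # ~769M-parameter checkpoint
result = model.transcribe("lecture.mp4", language="fr")

# The url and title fields are filled in by our wrapper, not by Whisper.
output = {
    "text": result["text"],
    "url": "https://www.youtube.com/watch?v=...",  # only for online videos
    "title": "lecture.mp4",
}
print(output)
```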


Text Summarization:

After transcribing the video file to text, we can now work on the several functionalities that our project encompasses. Starting with text summarization: we feed the transcribed text into a pretrained summarization model called Pegasus, with over 568M parameters. The output is a sequence of tokens selected by the model using the beam search decoding algorithm. The selected sequence of tokens forms the final output summary, with a fixed maximum length of 128 tokens.

Here is a brief example of how the Pegasus Transformer works:
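A minimal sketch, assuming the Hugging Face transformers library and the publicly available google/pegasus-cnn_dailymail checkpoint; the transcript string is a placeholder:

```python
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# Assumed checkpoint: Pegasus weights already adapted to CNN/Daily Mail.
model_name = "google/pegasus-cnn_dailymail"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

transcript = "..."  # transcribed lecture text from the previous step
inputs = tokenizer(transcript, truncation=True, return_tensors="pt")

# Beam search decoding, capped at 128 output tokens as described above.
summary_ids = model.generate(**inputs, num_beams=4, max_length=128, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```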





For the fine-tuning part, we will use ROUGE to evaluate the Pegasus model performance after fine-tuning it on the CNN/Daily Mail dataset.
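For illustration, a minimal sketch of computing ROUGE with the rouge-score package; both summaries here are placeholder strings rather than real CNN/Daily Mail examples:

```python
from rouge_score import rouge_scorer

# Placeholders; in practice the reference comes from the CNN/Daily Mail test
# split and the candidate from the fine-tuned Pegasus model.
reference = "The lecture introduces gradient descent and the role of the learning rate."
candidate = "The lecture introduces gradient descent."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure, scores["rougeL"].fmeasure)
```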


Question Answering:

For the second task, we will be dealing with the question-answering functionality, more precisely reading comprehension of the transcribed text. We input the text, pose a question to the model, and check whether it answers correctly. The model is based on the BERT architecture but pretrained on a large corpus of French text (138 GB), and it has about 110 million parameters. The output is the answer to the question asked.

For the fine-tuning part, we used the CamemBERT model fine-tuned on the FQuAD (French SQuAD) dataset and further fine-tuned it on the downstream task of open question answering.
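A minimal sketch using the Hugging Face question-answering pipeline; the checkpoint name (e.g. illuin/camembert-base-fquad) and the question are illustrative assumptions, not necessarily the exact weights we deploy:

```python
from transformers import pipeline

# Assumed checkpoint: a CamemBERT model fine-tuned for French extractive QA.
qa = pipeline("question-answering", model="illuin/camembert-base-fquad")

context = "..."  # transcribed lecture text
result = qa(question="Quel est le sujet principal du cours ?", context=context)
print(result["answer"], result["score"])  # extracted answer span and its confidence
```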



Sentiment Analysis:

Next up, we have the sentiment analysis part, which is used to predict the tone and motive of the speaker in the video; this helps provide more context to the situation as well as a better understanding of the scenario. As usual, we use the transcribed text as input, this time feeding it into BERT-ABS for aspect-based sentiment analysis, a model with over 140M parameters.

The output of BERT-ABS can be represented as a set of (aspect_term, polarity_label) pairs, where aspect_term is a string representing the aspect term and polarity_label is a string representing the predicted sentiment polarity label for that aspect. The polarity label can take one of three values: positive, negative, or neutral.
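Since BERT-ABS is not a single off-the-shelf checkpoint, here is only a hedged sketch of the sentence-pair formulation behind this kind of aspect-based classifier, built on a generic multilingual BERT; without the actual fine-tuned BERT-ABS weights the predicted labels are meaningless, so this only illustrates the data flow and the output format:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Generic multilingual BERT used only as a stand-in for the BERT-ABS weights.
MODEL_NAME = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)
LABELS = ["negative", "neutral", "positive"]

def classify_aspect(text: str, aspect_term: str):
    # Sentence-pair encoding: (transcript sentence, aspect term).
    inputs = tokenizer(text, aspect_term, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # With an untrained head this label is arbitrary; shown for format only.
    return aspect_term, LABELS[int(logits.argmax(dim=-1))]

pairs = [classify_aspect("The explanation of gradient descent was very clear.", "explanation")]
print(pairs)  # e.g. [("explanation", "positive")]
```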

For the fine-tuning part, we will fine-tune BERT-ABS on the MAMS (Multimodal Aspect-based Sentiment Analysis) dataset, which provides aspect-based sentiment annotations for YouTube video comments, and we will evaluate the final results using the F1-score.


Keyword Extraction:

Last but not least, we will be looking at the keyword extraction feature. We use this functionality to identify the specific topic and lesson being taught in the video, which helps categorize the videos (which course material is being taught, and which lesson exactly, based on the extracted keywords).

We use the fine-tuned KeyBERT model, with over 33M parameters, to extract the main keywords present in the transcript:
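A minimal sketch, assuming the keybert package with a multilingual sentence-transformers backbone; the backbone name and the parameter values shown are illustrative:

```python
from keybert import KeyBERT

# Multilingual backbone so French transcripts are supported.
kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")

transcript = "..."  # transcribed lecture text
keywords = kw_model.extract_keywords(
    transcript,
    keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
    stop_words=None,               # no English stop-word list on French text
    top_n=5,
)
print(keywords)  # list of (keyword, relevance score) pairs
```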




For the fine-tuning part, we will be using a fine-tuned KeyBERT, a transformer-based natural language processing model trained specifically for the task of keyword extraction from a given input text or document. It uses the pretrained BERT architecture and is further fine-tuned on a large corpus of text to learn the specific patterns and structures that are relevant for keyword extraction. The model is available as a Python package and can easily be integrated into natural language processing workflows.

Since KeyBERT is a multilingual model, trained on large multilingual datasets and able to extract keywords or key phrases from texts written in multiple languages, we will be using it on our French datasets.


Deployment Phase:

When it came to deploying InfoMiner, we created a web application, using React for the structure and design of the app while serving the different functionalities through the FastAPI framework.



We also used Pytube in our application, which allows videos to be downloaded from a given URL.
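A minimal sketch of how the backend endpoint could tie these pieces together, assuming the fastapi, pytube and openai-whisper packages; the route name, file names and returned fields are illustrative, not the exact ones in our app:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from pytube import YouTube
import whisper

app = FastAPI()
model = whisper.load_model("medium")  # loaded once at startup

class VideoRequest(BaseModel):
    url: str

@app.post("/report")
def generate_report(req: VideoRequest):
    # Download the audio track of the requested video with Pytube.
    yt = YouTube(req.url)
    path = yt.streams.filter(only_audio=True).first().download(filename="lecture.mp4")

    # Transcribe with Whisper; the summary, keywords, QA and sentiment fields
    # would be filled in by the models described in the previous sections.
    text = model.transcribe(path)["text"]
    return {"title": yt.title, "url": req.url, "transcript": text}
```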

The application's interface is essentially composed of two major segments:
  • A video link container where you paste the link of the video you would like transcribed.
  • A full report of the video, generated below it, containing all the relevant information (video text transcript, summary, extracted keywords, QA panel and sentiment analysis).



Conclusive Statement:

These methods are functional and ready to be implemented; however, we are still working on fine-tuning the models, or even replacing some of them with other state-of-the-art models if that yields better results. We will also be working on pre-finetuning to improve the predictive performance of these transformer models before moving on to the deployment phase and assembling and integrating all of these functionalities into a single workflow. In the meantime, we advise you to stay in the loop for any upcoming updates.



