AI in Action - we built our own RAG app, should you?

One advanced AI approach we use is Retrieval-Augmented Generation (RAG), which combines the strengths of retrieval-based and generation-based models.

Artificial Intelligence (AI) is revolutionizing industries by improving both the efficiency and the accuracy of our work. At Amsterdam Standard AI HUB, we have embraced AI models to streamline our processes and deliver higher-quality results. This blog post shares our practical experience in using AI models to improve our work, specifically through the analysis and sanitization of text data.

What is RAG in AI?

We built a RAG app that summarises and pulls out key points from meeting transcriptions.

A RAG, essentially, retrieves relevant information from a large dataset and uses it to generate more accurate and contextually relevant responses. This hybrid approach significantly improves the performance of AI in tasks requiring deep contextual understanding and accurate data generation, making it a crucial tool in our text analysis and sanitization process.


One of the ideas we're implementing is the analysis of all kinds of texts: meeting transcriptions, documents, documentation, and notes. These texts help us build context and a knowledge base, allowing us to draw interesting conclusions, detect risks, improve future meetings, and generate tasks. However, for the analysis to be effective, we need to ensure that the input data is clean and free of spelling or typographical errors. Clean data means better, more accurate analysis.

After analyzing the transcriptions of our meetings, we noticed that some words were not recorded correctly. This mainly affected key words: proper names, names used within our organization, project names, and company-specific and industry slang.

Before building a RAG app, know this.


First off, AI is predominantly trained on and built for English, which quickly became apparent when we saw how poorly our meetings were being transcribed. We usually speak in Polish and often mix in technical words in English. We concluded that one of the key pillars of our RAG would be sanitizing words to ensure the context is correct.


Challenges with Misspellings and Context


During our analysis of meeting transcriptions, we noticed that some key words were incorrectly transcribed. This was particularly true for some names, project names, and company-specific slang or jargon.


For example, our CEO Leopold, commonly referred to as "Leo," was often transcribed as "Rio." Similarly, the project name "MrWork" was variously transcribed as "Star Warka," "Mister murkiem," and "MrWok." These errors highlight the difficulty of accurately transcribing words that mix English and Polish or are specific to our company.


To effectively correct these misspellings, it's crucial to have a broader context, as the same incorrect phrase might need different corrections depending on the company or project involved. For instance, "Star Warka" would be interpreted differently in the context of Amsterdam Standard compared to a company associated with George Lucas.


Implementing Text Sanitization


To clean and correct these errors, we decided to implement a context-based text sanitization process.

We leveraged our company's knowledge base, which includes information about our projects and employees, to provide the necessary context for corrections.


We experimented with several solutions, including Vector Similarity and natural language processing (NLP) models, but found the best results with Chat Completion.


The key here is constructing the right prompt. In our prompt, each role is responsible for something different:


- The system role specifies the behavior we expect from the model and lists, as sequential steps, the instructions we want it to perform.
- The assistant role carries the context we are working with.
- The user role provides the text we want to correct, marked with a start and end tag.

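The three roles can be sketched as a Chat Completion message list. The instruction wording, tag format, and context string below are illustrative assumptions, not our exact production prompt:

```python
def build_sanitization_messages(context: str, text: str) -> list:
    """Build a Chat Completion message list: system = instructions,
    assistant = context, user = text to correct (wrapped in tags)."""
    return [
        # System: the behavior we expect, as sequential steps.
        {"role": "system", "content": (
            "You correct misspelled proper names in meeting transcriptions. "
            "Step 1: read the context. Step 2: find misspelled names. "
            "Step 3: return the corrected text between the same tags."
        )},
        # Assistant: the context we are working with.
        {"role": "assistant", "content": "Context: " + context},
        # User: the text to correct, marked with a start and end tag.
        {"role": "user", "content": "<text>" + text + "</text>"},
    ]

messages = build_sanitization_messages(
    context="CEO: Leopold, called 'Leo'; project: MrWork",
    text="Rio asked about progress on Star Warka.",
)
```

The resulting list can then be passed as the `messages` parameter of a Chat Completions call.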

That worked well, but we found a way to optimize it further.

[Update July] Our better prompt construction 


In our new approach, the System role now also delivers the context, the Assistant role demonstrates the expected outcome, and the User role remains largely the same.

When we construct the message this way, the model behaves the way we want and follows the example set by the assistant:

- System: Instruction and context.
- User: Text to correct.
- Assistant: Expected model output.
- System: Instruction and new context.
- User: New text to correct.
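A sketch of this updated construction, with the assistant turn serving as a worked example (the instruction wording is again an illustrative assumption):

```python
def build_few_shot_messages(context, example_text, example_output,
                            new_context, text):
    """Few-shot Chat Completion messages: a worked example first,
    then the real request with fresh context."""
    return [
        # System: instruction and context.
        {"role": "system", "content": "Correct misspelled names. Context: " + context},
        # User: text to correct.
        {"role": "user", "content": example_text},
        # Assistant: the output we expect the model to produce.
        {"role": "assistant", "content": example_output},
        # System: instruction and new context.
        {"role": "system", "content": "Correct misspelled names. Context: " + new_context},
        # User: new text to correct.
        {"role": "user", "content": text},
    ]

messages = build_few_shot_messages(
    context="CEO: Leopold ('Leo')",
    example_text="Rio joined the call.",
    example_output="Leo joined the call.",
    new_context="Project: MrWork",
    text="We discussed Star Warka.",
)
```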


When working with AI models, we must remember their limits on message window size, in other words, the number of tokens we can use in communication. These limits are currently large (GPT-4, for example, supports 128,000 tokens), but it is still recommended to send smaller messages containing fewer tokens to improve effectiveness. To achieve this, the processed text, in our case the transcription, should be divided into smaller chunks.
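A minimal chunking sketch, assuming the transcription is already split into sentences and estimating tokens with the rough rule of thumb of about four characters per token (a real tokenizer such as tiktoken would count exactly):

```python
def chunk_transcript(sentences, max_tokens, chars_per_token=4):
    """Greedily pack sentences into chunks that stay under an
    approximate token budget (token count estimated from characters)."""
    budget = max_tokens * chars_per_token
    chunks, current, size = [], [], 0
    for sentence in sentences:
        # Start a new chunk when the next sentence would overflow the budget.
        if current and size + len(sentence) + 1 > budget:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(sentence)
        size += len(sentence) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks

parts = chunk_transcript(
    ["Leo opened the meeting.", "We reviewed MrWork.", "Tasks were assigned."],
    max_tokens=12,  # ~48 characters per chunk
)
```

Each chunk is then sanitized with its own prompt, so no single request approaches the model's window limit.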


Llama 3 8B vs GPT-4 vs GPT-4o for transcriptions


We tested three different models: GPT-4, GPT-4o, and the open-source Llama 3 8B, set up on a local machine. The text to be sanitized initially looked as follows, with errors to be corrected highlighted in red:


[Screenshot: the raw transcription, with errors highlighted in red]

Our knowledge base includes a list of names and projects we carry out at Amsterdam Standard, as well as a list of words that often appear in our conversations but are not correctly transcribed. For example, "Rio" instead of "Leo".
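Such a knowledge base can be flattened into a context string for the prompt. The entries below are a small hypothetical sample, not our actual knowledge base:

```python
# Hypothetical knowledge-base entries (illustrative sample).
KNOWN_PEOPLE = ["Leopold (Leo)"]
KNOWN_PROJECTS = ["MrWork"]
KNOWN_MISHEARINGS = {"Rio": "Leo", "Star Warka": "MrWork", "MrWok": "MrWork"}

def build_context() -> str:
    """Flatten the knowledge base into one context string for the prompt."""
    pairs = "; ".join(f'"{bad}" means "{good}"'
                      for bad, good in KNOWN_MISHEARINGS.items())
    return ("People: " + ", ".join(KNOWN_PEOPLE)
            + ". Projects: " + ", ".join(KNOWN_PROJECTS)
            + ". Known mis-transcriptions: " + pairs + ".")

context = build_context()
```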
[Screenshot: the transcription sanitized with GPT-4o]

The best results in terms of effectiveness were achieved using GPT-4 and GPT-4o. Words were correctly conjugated and accurately fixed. Here's the sanitized text using GPT-4o.

 
AI models vs misspelled words


Llama 3 8B: 8 fixed, 10 misspellings remaining
GPT-4: 17 fixed, 2 misspellings remaining
GPT-4o: 18 fixed, 1 misspelling remaining


Time is another consideration. The sanitization was performed using three separate prompts: one each for correcting project names, person names, and known Polish-English mis-transcriptions. GPT-4o wins, being almost four times faster than GPT-4. The local Llama leaves much to be desired, but remember that it ran on an ordinary laptop without a good GPU.


Llama 3 8B: 18 minutes
GPT-4: 202 seconds
GPT-4o: 35 seconds


In conclusion, with the right prompt construction, capable models like GPT-4 and GPT-4o handled the task very well. This example shows only a snippet of the entire conversation, but by dividing the text into smaller parts, we can process any amount of text in the same manner.


We are eager to see what other knowledge we can gain from experimenting with and researching AI. We also hope that future AI models will be smart enough to understand multiple languages at once, removing the need for an additional sanitization layer.


We've taken another step toward understanding how AI works and how it can benefit software development, optimize workflows, and ultimately make us a bit greener.


Written by: Natalia, on July 15, 2024
Tags:
AI
Code Optimizing