In this homework, we'll evaluate the quality of our RAG system.
It's possible that your answers won't match exactly. If that's the case, select the closest one.
Solution:
- Video: TBA
- Notebook: solution.ipynb
Let's start by getting the dataset. We will use the data we generated in the module.
In particular, we'll evaluate the quality of our RAG system with gpt-4o-mini.
Read it:

import pandas as pd

github_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv'  # the results file generated in the module
url = f'{github_url}?raw=1'
df = pd.read_csv(url)

We will use only the first 300 documents:
df = df.iloc[:300]

Q1. Getting the embeddings model

Now, get the embeddings model multi-qa-mpnet-base-dot-v1 from the Sentence Transformer library.
Note: this is not the same model as in HW3
from sentence_transformers import SentenceTransformer

model_name = 'multi-qa-mpnet-base-dot-v1'
embedding_model = SentenceTransformer(model_name)

Create the embeddings for the first LLM answer:

answer_llm = df.iloc[0].answer_llm

What's the first value of the resulting vector?
- -0.42
- -0.22
- -0.02
- 0.21
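A minimal sketch of this step; encode() comes from Sentence Transformers and returns a NumPy array for a single string:

```python
# Embed the first LLM answer and look at the first component
v = embedding_model.encode(answer_llm)
print(v[0])
```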
Q2. Computing the dot product

Now, for each answer pair, let's create embeddings and compute the dot product between them.

We will put the results (scores) into an evaluations list.

What's the 75th percentile of the scores?
- 21.67
- 31.67
- 41.67
- 51.67
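A sketch of the loop, assuming the answer_llm and answer_orig columns from the module's data:

```python
import numpy as np

evaluations = []

# Embed both answers of each pair and take the dot product
for _, r in df.iterrows():
    v_llm = embedding_model.encode(r.answer_llm)
    v_orig = embedding_model.encode(r.answer_orig)
    evaluations.append(v_llm.dot(v_orig))

# 75th percentile of the scores
print(np.percentile(evaluations, 75))
```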
Q3. Computing the cosine

From Q2, we can see that the results are not within the [0, 1] range. That's because the vectors coming from this model are not normalized.

So we need to normalize them first.

To do it, we:
- Compute the norm of a vector
- Divide each element by this norm
So, for vector v, it'll be v / ||v||
In numpy, this is how you do it:

norm = np.sqrt((v * v).sum())
v_norm = v / norm

Let's put it into a function and then compute the dot product between normalized vectors. This will give us cosine similarity (see the sketch below the answer options).

What's the 75th percentile of the cosine scores?
- 0.63
- 0.73
- 0.83
- 0.93
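One way to put it into a function, a sketch (the name normalized_dot is ours):

```python
import numpy as np

def normalized_dot(u, v):
    # Normalize both vectors, then dot them: this is cosine similarity
    u = u / np.sqrt((u * u).sum())
    v = v / np.sqrt((v * v).sum())
    return u.dot(v)

cosine_scores = []
for _, r in df.iterrows():
    v_llm = embedding_model.encode(r.answer_llm)
    v_orig = embedding_model.encode(r.answer_orig)
    cosine_scores.append(normalized_dot(v_llm, v_orig))

print(np.percentile(cosine_scores, 75))
```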
Q4. Rouge

Now we will explore an alternative metric - the ROUGE score.

This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.

It can give a more nuanced view of text similarity than cosine similarity alone.

We don't need to implement it ourselves; there's a Python package for it:

pip install rouge

(The latest version at the moment of writing is 1.0.1)

Let's compute the ROUGE score between the answers at index 10 of our dataframe (doc_id=5170565b):
from rouge import Rouge

rouge_scorer = Rouge()

r = df.iloc[10]
scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
There are three scores: rouge-1, rouge-2 and rouge-l, with precision, recall and F1 score for each:

- rouge-1 - the overlap of unigrams
- rouge-2 - the overlap of bigrams
- rouge-l - the longest common subsequence
What's the F score for rouge-1?
- 0.35
- 0.45
- 0.55
- 0.65
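The rouge package returns a nested dict per pair, so the value can be read off like this:

```python
# scores is a dict like {'rouge-1': {'r': ..., 'p': ..., 'f': ...}, ...}
print(scores['rouge-1']['f'])
```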
Q5. Average rouge score

Let's compute the average of the F-scores across rouge-1, rouge-2 and rouge-l for the same record from Q4.
- 0.35
- 0.45
- 0.55
- 0.65
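A short sketch for the same record (the variable name rouge_avg is ours):

```python
# Average the F-scores of rouge-1, rouge-2 and rouge-l for this record
rouge_avg = (scores['rouge-1']['f'] + scores['rouge-2']['f'] + scores['rouge-l']['f']) / 3
print(rouge_avg)
```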
Q6. Average rouge score for all the data points

Now let's compute the F-scores for all the records and create a dataframe from them.

What's the average rouge-2 F-score across all the records?
- 0.10
- 0.20
- 0.30
- 0.40
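A sketch for scoring all records (the dataframe name df_rouge is ours):

```python
import pandas as pd

records = []

# Compute the three F-scores for every record
for _, r in df.iterrows():
    s = rouge_scorer.get_scores(r.answer_llm, r.answer_orig)[0]
    records.append({
        'rouge-1': s['rouge-1']['f'],
        'rouge-2': s['rouge-2']['f'],
        'rouge-l': s['rouge-l']['f'],
    })

df_rouge = pd.DataFrame(records)
print(df_rouge['rouge-2'].mean())
```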
- Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/hw4
- It's possible that your answers won't match exactly. If that's the case, select the closest one.