
Recipe for Success: Evaluating Tagging Quality with F1 Scores

Categories: introduction, code, analysis

Author: Matt

Published: March 17, 2025

What is Tagging and Why Does It Matter?

Tagging is the process of assigning labels or categories to data, helping to organize and retrieve information efficiently. Whether you’re dealing with a collection of images, documents, or product inventory, proper tagging makes it easier to find and analyze data.

For example, in an e-commerce store, products might be tagged as “electronics,” “clothing,” or “home goods.” In machine learning, tagging is essential for training models, such as categorizing emails as “spam” or “not spam.”

However, not all tagging is created equal. Some tags might be inaccurate, inconsistent, or incomplete. This is why evaluating tagging quality is crucial: incorrect tags can mislead users, reduce search accuracy, and undermine the effectiveness of automated systems.

How Do We Measure Tagging Quality?

One way to assess tagging quality is to compare the predicted tags against a known set of correct tags. This is commonly done by building a Confusion Matrix and then computing precision, recall, and the F1 score from it.

A Confusion Matrix is a table that summarizes classification performance by comparing actual values (ground truth) with predicted values. It helps us see where errors occur.

               Delivered (Yes)         Delivered (No)
Ordered (Yes)  True Positives (TP)     False Negatives (FN)
Ordered (No)   False Positives (FP)    True Negatives (TN)
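
To make the table concrete, here is a minimal sketch that counts the four cells from a pair of actual/predicted label lists (the y_true and y_pred values below are made up purely for illustration):

# Count confusion-matrix cells from paired actual/predicted binary labels
# (hypothetical labels, purely for illustration)
y_true = [1, 1, 1, 0, 0, 1, 0]  # actual labels (ground truth)
y_pred = [1, 0, 1, 1, 0, 1, 0]  # predicted labels

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # correct "yes"
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed "yes"
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # wrong "yes"
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correct "no"

print(TP, FN, FP, TN)  # 3 1 1 2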

Precision measures how many of the assigned tags are actually correct.

Recall measures how many of the correct tags were successfully assigned.

F1 Score balances precision and recall into a single number, making it a useful metric for evaluating tagging performance.

The formula for the F1 score is:

F1 = 2 * ((Precision * Recall) / (Precision + Recall))

Where:

  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • TP (True Positives) = # of Correctly Assigned Tags
  • FP (False Positives) = # of Incorrectly Assigned Tags
  • FN (False Negatives) = # of Correct Tags that were missed
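
As a quick sketch of how these pieces fit together (the counts passed in at the end are hypothetical, just to show the calculation), the three formulas can be wrapped in a small helper function:

# Precision, recall, and F1 computed from raw counts
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1

# Hypothetical counts, purely for illustration
print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)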

So it is possible to measure the effectiveness of tagging data with an LLM if:

  • we have a known set of true tags (often referred to as our “ground truth” tags)
  • we have determined the number of incorrect tags assigned by our model
  • we use the precision, recall, and F1 score calculations to quantify that effectiveness

Now, let’s bring this concept to life with a real-world example.

Ordering Groceries for Your Favorite Recipe

Imagine you place an online grocery order for the ingredients needed to bake chocolate chip cookies. Your grocery list (the ground truth) is:

  • Flour
  • Sugar
  • Butter
  • Eggs
  • Vanilla Extract
  • Baking Soda
  • Chocolate Chips

However, when your order arrives, you receive the following delivered items (predicted tags):

  • Flour
  • Sugar
  • Butter
  • Eggs
  • Salt
  • Chocolate Chips

Some items match (Flour, Sugar, Butter, Eggs, and Chocolate Chips), one is incorrect (Salt was added), and some are missing (Vanilla Extract and Baking Soda were not delivered).

Let’s map this into a confusion matrix:

               Delivered (Yes)                                        Delivered (No)
Ordered (Yes)  Flour, Sugar, Butter, Eggs, Chocolate Chips (TP = 5)   Vanilla Extract, Baking Soda (FN = 2)
Ordered (No)   Salt (FP = 1)                                          (TN not considered)

Why is True Negative (TN) Not Considered?

In this context, TN represents items that were neither ordered nor delivered, which is not useful for evaluating tagging accuracy. Since we only care about whether the right items were tagged (delivered) or missing (not delivered), TN is ignored in this scenario.

Let’s calculate the F1 score to evaluate the delivery accuracy.

Python Implementation

# Define the grocery lists

# Our Ground Truth Tags
ordered_items = {"flour", "sugar", "butter", "eggs", "vanilla extract", "baking soda", "chocolate chips"}

# Our Predicted Tags
delivered_items = {"flour", "sugar", "butter", "eggs", "salt", "chocolate chips"}

# Calculate TP (True Positives), FP (False Positives), FN (False Negatives)
TP = len(ordered_items & delivered_items)  # Correctly delivered
FP = len(delivered_items - ordered_items)  # Extra items
FN = len(ordered_items - delivered_items)  # Missing items

# Compute Precision and Recall
precision = TP / (TP + FP) if (TP + FP) > 0 else 0
recall = TP / (TP + FN) if (TP + FN) > 0 else 0

# Compute F1 Score
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

# Print results
print(f"Correctly Delivered (TP): {TP}")
print(f"Extra Items (FP): {FP}")
print(f"Missing Items (FN): {FN}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1_score:.2f}")

Expected Output

Correctly Delivered (TP): 5
Extra Items (FP): 1
Missing Items (FN): 2
Precision: 0.833
Recall: 0.714
F1 Score: 0.769
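
If you have scikit-learn available, one way to sanity-check these numbers (this is just a sketch, not part of the implementation above) is to turn both sets into binary indicator vectors over the union of items and reuse the library's metric functions:

# Cross-check with scikit-learn (assumes scikit-learn is installed)
from sklearn.metrics import precision_score, recall_score, f1_score

ordered_items = {"flour", "sugar", "butter", "eggs", "vanilla extract", "baking soda", "chocolate chips"}
delivered_items = {"flour", "sugar", "butter", "eggs", "salt", "chocolate chips"}

# One binary label per item in the union: 1 = in the set, 0 = not
all_items = sorted(ordered_items | delivered_items)
y_true = [1 if item in ordered_items else 0 for item in all_items]    # ground truth
y_pred = [1 if item in delivered_items else 0 for item in all_items]  # predictions

print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # 0.833
print(f"Recall: {recall_score(y_true, y_pred):.3f}")        # 0.714
print(f"F1 Score: {f1_score(y_true, y_pred):.3f}")          # 0.769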

Conclusion

In this example, our grocery order had an F1 score of 0.769, meaning the delivery was fairly accurate but not perfect. If this were a machine learning tagging system, we’d aim to improve this score by reducing incorrect tags (false positives) and ensuring all relevant tags are included (reducing false negatives).

F1 scores are widely used in evaluating search algorithms, recommendation systems, and automated classification models, making them a valuable tool for ensuring high-quality tagging.

Whether you’re analyzing machine learning performance or just checking if your online grocery delivery was accurate, the F1 score provides a clear and balanced way to measure success.

Do you have a real-world example where incorrect tagging caused issues? Share your experiences in the comments!