F1-Score: The Balanced Metric
The F1-Score combines Precision and Recall into a single value. It is particularly useful when you have an imbalanced dataset and need to find a good balance between "False Positives" and "False Negatives."
1. The Mathematical Formula
The F1-Score is the harmonic mean of Precision and Recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). Unlike a simple average, the harmonic mean punishes extreme values: if either Precision or Recall is very low, the F1-Score will also be low.
Why use the Harmonic Mean?
If we used a standard arithmetic average, a model with 1.0 Precision and 0.0 Recall would have a "decent" score of 0.5. However, such a model is useless. The harmonic mean ensures that if one metric is 0, the total score is 0.
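To see the difference concretely, here is a minimal sketch (plain Python, illustrative values only) comparing the two means for the extreme case described above:

```python
# Illustrative extreme case: perfect Precision, zero Recall.
precision, recall = 1.0, 0.0

# Arithmetic mean: looks "decent" even though the model finds no positives.
arithmetic_mean = (precision + recall) / 2

# Harmonic mean (the F1 formula); defined as 0 when both inputs are 0.
if precision + recall == 0:
    harmonic_mean = 0.0
else:
    harmonic_mean = 2 * precision * recall / (precision + recall)

print(f"Arithmetic mean:    {arithmetic_mean:.2f}")  # 0.50
print(f"Harmonic mean (F1): {harmonic_mean:.2f}")    # 0.00
```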
2. When to Use the F1-Score
F1-Score is the best choice when:
- Imbalanced Classes: You have a large number of "Negative" samples and only a few "Positive" ones (e.g., fraud detection); see the sketch after this list.
- Equal Importance: You care equally about minimizing False Positives (Precision) and False Negatives (Recall).
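As a quick illustration of the imbalanced-class point, the following sketch uses made-up labels (95 negatives, 5 positives) and a model that always predicts "negative." Accuracy looks excellent while the F1-Score exposes the failure:

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative, made-up labels: 95 negatives and only 5 positives.
y_true = [0] * 95 + [1] * 5

# A lazy model that predicts "negative" for every sample.
y_pred = [0] * 100

# zero_division=0 silences the warning raised when no positives are predicted.
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")             # 0.95
print(f"F1-Score: {f1_score(y_true, y_pred, zero_division=0):.2f}")  # 0.00
```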
3. Visualizing the Balance
Think of the F1-Score as a "balance scale." If you tilt too far toward catching every positive (maximizing Recall), your Precision drops. If you tilt too far toward flagging only the cases you are certain about (maximizing Precision), you miss real positives. The F1-Score is highest when the two are in equilibrium.
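The following sketch (made-up labels and probabilities, assuming scikit-learn and NumPy are installed) sweeps a decision threshold to make this trade-off visible. F1 peaks at the middle threshold, where Precision and Recall are equal:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels and predicted probabilities, purely for illustration.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.45, 0.55, 0.6, 0.7, 0.8, 0.9])

# Sweep the decision threshold: a low threshold favors Recall,
# a high threshold favors Precision, and F1 peaks in between.
for threshold in [0.25, 0.50, 0.75]:
    y_pred = (y_prob >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    f = f1_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```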
4. Implementation with Scikit-Learn
```python
from sklearn.metrics import f1_score

# Actual target values
y_true = [0, 1, 1, 0, 1, 1, 0]

# Model predictions
y_pred = [0, 1, 0, 0, 1, 1, 1]

# Calculate F1-Score
score = f1_score(y_true, y_pred)
print(f"F1-Score: {score:.2f}")
# Output: F1-Score: 0.75
```
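As an optional sanity check, Precision and Recall can be computed for the same arrays; both come out to 0.75, and their harmonic mean is the F1-Score printed above:

```python
from sklearn.metrics import precision_score, recall_score

# Same arrays as in the snippet above.
y_true = [0, 1, 1, 0, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1]

print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 0.75
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 0.75
```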
5. Summary Table: Which Metric to Trust?
| Scenario | Best Metric | Why? |
|---|---|---|
| Balanced Data | Accuracy | Simple and representative. |
| Spam Filter | Precision | False Positives (real mail in spam) are very bad. |
| Cancer Screening | Recall | False Negatives (missing a sick patient) can be fatal. |
| Fraud Detection | F1-Score | Need to catch thieves (Recall) without blocking everyone (Precision). |
6. Beyond Binary: Macro vs. Weighted F1
If you have more than two classes (multi-class classification), you will come across these averaging options, compared in the sketch after this list:
- Macro F1: Calculates F1 for each class and takes the unweighted average. Treats all classes as equal.
- Weighted F1: Calculates F1 for each class and weights each by the number of samples (support) in that class, so larger classes have more influence on the final score.
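A minimal multi-class sketch (made-up labels; the third class has only one sample) shows how the `average` parameter of `f1_score` selects between the two:

```python
from sklearn.metrics import f1_score

# Made-up three-class labels; class 2 has only one sample.
y_true = [0, 0, 0, 1, 1, 1, 2, 0, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 2, 0, 1, 1]

# Macro: unweighted mean of per-class F1 -- rare classes count as much as common ones.
print(f"Macro F1:    {f1_score(y_true, y_pred, average='macro'):.2f}")     # 0.78

# Weighted: per-class F1 weighted by class frequency (support).
print(f"Weighted F1: {f1_score(y_true, y_pred, average='weighted'):.2f}")  # 0.70
```

Here the rare class is predicted perfectly, so the macro average (which treats it as equal to the others) comes out higher than the weighted average.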
The F1-Score gives us a snapshot at a single threshold. But how do we evaluate a model's performance across ALL possible thresholds?