MAE, MSE, and RMSE

The difference between MAE (mean absolute error), MSE (mean squared error), and RMSE (root mean squared error) is subtle, and I’ve seen people new to machine learning choose RMSE without understanding its benefits. As a brief reminder, these metrics are just loss functions (i.e. a lower score is better) and are a way to measure predictive accuracy. Each one calculates a single number that summarizes loss when you have a data set of size $N$, with a continuous outcome $y_n$ and its associated prediction $\hat{y}_n$ (based on the model you choose).

MAE:

$$\text{MAE}=\frac{1}{N}\sum_{n}|y_n - \hat{y}_n|$$

MSE:

$$\text{MSE}=\frac{1}{N}\sum_{n}(y_n-\hat{y}_n)^2$$

RMSE:

$$\text{RMSE}=\sqrt{\text{MSE}}$$
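To make the three formulas concrete, here is a minimal NumPy sketch. The function names `mae`, `mse`, and `rmse` and the toy numbers are my own choices for illustration, not from any particular library.

```python
import numpy as np


def mae(y, y_hat):
    """Mean absolute error: average of |y_n - y_hat_n|."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(y - y_hat))


def mse(y, y_hat):
    """Mean squared error: average of (y_n - y_hat_n)^2."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)


def rmse(y, y_hat):
    """Root mean squared error: square root of the MSE."""
    return np.sqrt(mse(y, y_hat))


# Made-up outcomes and predictions; the errors are 1, 0, -2, -1.
y = [3.0, 5.0, 2.0, 7.0]
y_hat = [2.0, 5.0, 4.0, 8.0]

print(mae(y, y_hat))   # 1.0  (= (1 + 0 + 2 + 1) / 4)
print(mse(y, y_hat))   # 1.5  (= (1 + 0 + 4 + 1) / 4)
print(rmse(y, y_hat))  # sqrt(1.5), about 1.2247
```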

Here’s an advantage to using MAE:

You can interpret the metric in terms of the units that your data is measured in. For example, if your model is predicting minutes spent on a streaming platform, then the MAE of the model is the average absolute difference between the actual minutes and the predicted minutes spent on the platform.

Here are two advantages to using MSE:

Squaring the errors in MSE makes large errors count disproportionately more than they do under the absolute differences in MAE: an error twice as large contributes four times as much to MSE, but only twice as much to MAE. This is particularly useful when you want the model to be penalized more for generating larger errors. (One disadvantage, however, is that it makes your model sensitive to outliers.)

The derivative of MSE is well defined. On the other hand, the derivative of MAE is not defined when $y_n = \hat{y}_n$ (i.e. the absolute value function is not differentiable at 0). This is important since we use derivatives to minimize loss functions!
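Both points can be checked numerically. The sketch below uses made-up data and helper names of my own (`mse_grad`, `mae_subgrad`). It compares the two losses with and without one large error, then evaluates the derivative of each loss with respect to the predictions.

```python
import numpy as np


def mse_grad(y, y_hat):
    # d MSE / d y_hat_n = (2 / N) * (y_hat_n - y_n); defined for every y_hat_n.
    return 2.0 * (y_hat - y) / y.size


def mae_subgrad(y, y_hat):
    # d MAE / d y_hat_n = sign(y_hat_n - y_n) / N wherever y_hat_n != y_n.
    # At y_hat_n == y_n the derivative does not exist; np.sign returns 0
    # there, which is one valid subgradient but a choice, not "the" derivative.
    return np.sign(y_hat - y) / y.size


y = np.array([10.0, 10.0, 10.0, 10.0])
y_hat_small = np.array([11.0, 11.0, 11.0, 11.0])    # every error is 1
y_hat_outlier = np.array([11.0, 11.0, 11.0, 20.0])  # one error is 10

mae_small = np.mean(np.abs(y - y_hat_small))         # 1.0
mae_outlier = np.mean(np.abs(y - y_hat_outlier))     # 3.25
mse_small = np.mean((y - y_hat_small) ** 2)          # 1.0
mse_outlier = np.mean((y - y_hat_outlier) ** 2)      # 25.75

# The MSE gradient grows with the size of each error ...
g_mse = mse_grad(y, y_hat_outlier)     # [0.5, 0.5, 0.5, 5.0]
# ... while the MAE (sub)gradient has the same magnitude for every error.
g_mae = mae_subgrad(y, y_hat_outlier)  # [0.25, 0.25, 0.25, 0.25]
```

A single error of 10 raises MAE from 1 to 3.25 but raises MSE from 1 to 25.75, and it dominates the MSE gradient in the same way.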

In addition to the MSE points above, here’s another advantage to using RMSE:

Because we’re taking the square root, the RMSE can be interpreted in terms of the units that the data is actually measured in.
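One way to see the units argument is to rescale the data. In the sketch below (the numbers are invented for illustration), converting minutes to hours divides MAE and RMSE by 60, as you would expect of quantities measured in the data's units, while MSE is divided by 3600 because it is measured in squared units.

```python
import numpy as np


def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))


def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)


def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))


# Hypothetical minutes spent on a streaming platform; errors are -6, 6, -12, 0.
y_min = np.array([30.0, 60.0, 90.0, 120.0])
y_hat_min = np.array([36.0, 54.0, 102.0, 120.0])

# The same data expressed in hours.
y_hr = y_min / 60.0
y_hat_hr = y_hat_min / 60.0

mae_min, mae_hr = mae(y_min, y_hat_min), mae(y_hr, y_hat_hr)     # 6 minutes, about 0.1 hours
mse_min, mse_hr = mse(y_min, y_hat_min), mse(y_hr, y_hat_hr)     # 54 minutes^2, about 0.015 hours^2
rmse_min, rmse_hr = rmse(y_min, y_hat_min), rmse(y_hr, y_hat_hr) # about 7.35 minutes, about 0.12 hours

print(mae_min / mae_hr)    # about 60
print(rmse_min / rmse_hr)  # about 60
print(mse_min / mse_hr)    # about 3600
```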

So if you’re willing to let large errors be penalized disproportionately, then RMSE gives you both well-defined derivatives and interpretability.