There are plenty of great diagrams, explanations, and tedious calculations that explain the bias-variance trade-off, but I wanted to come up with a pithy explanation for statisticians who understand regression.

High Bias, Low Variance. Suppose you have some outcome data $y$ with moderate non-linearities, and you want to model it with some regression function $f$. Your first option is super basic: just predict the mean, $f = \bar{y}$. The mean can’t capture the non-linearities, so you’ll have high bias (predictions that are, on average, far from the true value). However, you’ll have low variance (predictions that barely change from one sample to the next), since $f$ (the mean in this case) is too rigid to follow the variability in the outcome variable.

Low Bias, High Variance. Now you start packing $f$ with interaction terms and high-order polynomials. The bias goes down, since the model is flexible enough to handle the complexities in the outcome variable. However, the variance goes up, since $f$ is probably modeling noise in addition to the signal, so the fitted curve changes a lot from one sample to the next.
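To see both regimes in numbers, here is a minimal simulation sketch. Everything specific in it is an assumption made for illustration: the true curve $\sin(2x)$, the Gaussian noise level, the fixed grid of $x$ values, and the helper name `bias_variance`. It refits a polynomial of a given degree to many noisy samples, then measures how far the average prediction sits from the truth (squared bias) and how much the predictions move between samples (variance).

```python
import numpy as np
from numpy.polynomial import Polynomial


def bias_variance(degree, n_sims=500, n_train=30, noise_sd=0.3, seed=0):
    """Monte Carlo estimate of (squared bias, variance) for a polynomial fit.

    Assumed setup (illustrative only): true curve sin(2x) on [0, 3],
    Gaussian noise, and a fixed grid of x values shared by every sample.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 3.0, n_train)
    truth = np.sin(2.0 * x)

    # One row of fitted values per simulated data set.
    preds = np.empty((n_sims, n_train))
    for i in range(n_sims):
        y = truth + rng.normal(0.0, noise_sd, n_train)
        fit = Polynomial.fit(x, y, degree)  # degree 0 is just the mean of y
        preds[i] = fit(x)

    # Squared bias: gap between the average prediction and the truth.
    bias_sq = np.mean((preds.mean(axis=0) - truth) ** 2)
    # Variance: spread of the predictions across simulated data sets.
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


for degree in (0, 3, 10):
    b, v = bias_variance(degree)
    print(f"degree {degree:2d}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

Under these assumptions, the degree-0 fit (the mean) should show the largest squared bias and the smallest variance, and the degree-10 fit the reverse.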

We can also think about this in terms of underfitting and overfitting the data. A simple $f$ with high bias and low variance will underfit the data, whereas a complex $f$ with low bias and high variance will overfit the data.

[Figure: simple f() (underfit data, high bias, low variance) versus complex f() (overfit data, low bias, high variance).]

Ideally, you’re looking for some sweet spot in the middle that balances the bias and the variance. That said, it all depends on the data you’re modeling (e.g. in some cases a simple $f$ is better than a more complex $f$, and vice versa).
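For reference, this balance is usually formalized as a decomposition of the expected squared prediction error at a point $x$. The notation is introduced here for illustration and is not part of the explanation above: assume $y = g(x) + \varepsilon$, where $g$ is the true curve, $\varepsilon$ is mean-zero noise with variance $\sigma^2$, and $\hat{f}$ is the fitted $f$, with expectations taken over repeated samples.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \big(\mathbb{E}[\hat{f}(x)] - g(x)\big)^2 + \mathrm{Var}\big(\hat{f}(x)\big) + \sigma^2
$$

The first term is the squared bias, the second is the variance, and $\sigma^2$ is noise that no choice of $f$ can remove, so the sweet spot is the complexity that minimizes the sum of the first two terms.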