Let’s review the decomposition of error, as it was covered in the first lecture.
We are trying to find a function f* that minimizes the expected risk. Here the risk is the expected loss, i.e. the average discrepancy between the predicted targets and the labeled targets over the distribution of (x, y).
But when we attempt to find f*, we are searching within a family of functions, F. We can consider every f in F as a candidate function. Within F, we can seek fn, the function that minimizes the empirical risk, i.e. the average loss over the n training samples (note that this is different from the expected risk).
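To make the distinction concrete, here is a minimal sketch of picking fn from a small family by empirical risk. Everything specific in it is an assumption for illustration: the squared loss, the family of lines through the origin on a grid of slopes, the synthetic data with true slope 2, and the names `empirical_risk`, `slopes`, `family` and `w_n`.

```python
import numpy as np

def empirical_risk(f, x, y):
    """Average squared loss of f over the n samples (x, y)."""
    return float(np.mean((f(x) - y) ** 2))

# Hypothetical family F: lines through the origin, f_w(x) = w * x,
# restricted to a grid of 61 slopes between -3 and 3.
slopes = np.linspace(-3.0, 3.0, 61)
family = [lambda x, w=w: w * x for w in slopes]

# Assumed data distribution: x uniform on [-1, 1], y = 2x plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + rng.normal(0.0, 0.1, size=50)

# fn is the candidate with the smallest empirical risk on this sample.
risks = [empirical_risk(f, x, y) for f in family]
best = int(np.argmin(risks))
w_n = float(slopes[best])
print("slope of fn:", w_n, "empirical risk:", risks[best])
```

The expected risk would average the same loss over the whole distribution rather than over these 50 points, which is why fn can change from one sample to the next.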
But the function fn is not necessarily optimal, and the optimal function f* is unlikely to lie within the family F. So now we define g* as the function that minimizes the expected risk among the candidate functions in F. And we assume that f*, fn and g* are all unique.
The error can then be written as follows:
e = e_approximation + e_statistical + e_optimization
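One way to see where the terms come from is to write each as a difference of expected risks, so that the sum telescopes. This is a sketch under assumptions: R(·) is introduced here as notation for the expected risk, and \tilde f_n is a hypothetical symbol for the function the optimizer actually returns (an approximate minimizer of the empirical risk). A fuller treatment would also take an expectation over the draw of the training sample.

```latex
\begin{aligned}
e &= R(\tilde f_n) - R(f^*) \\
  &= \underbrace{R(g^*) - R(f^*)}_{e_{\text{approximation}}}
   + \underbrace{R(f_n) - R(g^*)}_{e_{\text{statistical}}}
   + \underbrace{R(\tilde f_n) - R(f_n)}_{e_{\text{optimization}}}
\end{aligned}
```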
The approximation error measures how far the best of the candidate functions considered (g*) is from the true optimal function (f*). It comes from restricting the search to F. [bias]
The statistical (estimation) error comes from the finite sample of (x, y) pairs we have drawn. It is the effect of minimizing the empirical risk (which yields fn) instead of the expected risk (which yields g*). [variance]
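The optimization error is the error that results from solving the minimization only approximately: in practice the optimizer stops at an approximate minimizer of the empirical risk rather than at fn itself. It should not be confused with the irreducible noise that comes from the data distribution and feature representation, which is already contained in the risk of f*.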
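The terms can be computed explicitly on a toy problem. The sketch below is illustrative only and rests on assumptions not in the lecture: x is uniform on [-1, 1], y = x^3 plus Gaussian noise, the loss is squared error, F is the family of lines f_w(x) = w * x, and the optimization error is taken in the usual sense of an optimizer that stops short of fn (here, five steps of gradient descent). The names `expected_risk`, `w_star`, `w_n` and `w_tilde` are hypothetical.

```python
import numpy as np

SIGMA = 0.1  # assumed noise standard deviation

def expected_risk(w):
    """Exact expected squared loss of f_w(x) = w * x for x ~ U(-1, 1), y = x**3 + noise."""
    # For x ~ U(-1, 1): E[x^2] = 1/3, E[x^4] = 1/5, E[x^6] = 1/7.
    return w**2 / 3 - 2 * w / 5 + 1 / 7 + SIGMA**2

risk_f_star = SIGMA**2  # f*(x) = x**3 leaves only the noise
w_star = 3 / 5          # g*: slope minimizing expected_risk over the family

# Draw a small training sample.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)
y = x**3 + rng.normal(0.0, SIGMA, size=30)

# fn: exact minimizer of the empirical risk (closed-form least squares).
w_n = float(x @ y / (x @ x))

# Approximate minimizer: gradient descent on the empirical risk, stopped early.
w_tilde = 0.0
for _ in range(5):
    grad = 2 * float(np.mean((w_tilde * x - y) * x))
    w_tilde -= 0.5 * grad

e_app = expected_risk(w_star) - risk_f_star
e_est = expected_risk(w_n) - expected_risk(w_star)
e_opt = expected_risk(w_tilde) - expected_risk(w_n)
total = expected_risk(w_tilde) - risk_f_star

print("approximation:", e_app)
print("statistical:  ", e_est)
print("optimization: ", e_opt)
print("total:        ", total)
```

The approximation and statistical terms are non-negative by construction, because g* minimizes the expected risk over F. The optimization term has no such guarantee on a single sample: an early-stopped solution can occasionally have lower expected risk than fn.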