Structuring Machine Learning Models
Orthogonalization
What this means is: suppose we have a joystick that controls where the player moves. Moving along one axis makes the player jump, while the other axis makes the player shoot. Orthogonalization is about having knobs that each adjust only one dimension. That is, moving along one axis should affect one thing and nothing else; in our case, one axis should only make the player jump, never shoot.
How does this relate to ML?
Well, we want our system to be robust, so instead of dependent hyper-parameters we should try to make them independent: each knob should tune one property of the model without disturbing the others.
Single Number Evaluation Metric
Accuracy => Out of the whole dataset, how many predictions were correct? (TP + TN) / (TP + TN + FP + FN).
Precision => Out of the examples your model classified as true, how many are actually true? TP / (TP + FP).
Recall => Out of the examples that are actually true, how many were predicted true? TP / (TP + FN).
Since we cannot trust either one alone, we take the F1 score, the harmonic mean of precision and recall, as our single evaluation metric: F1 = 2 * (precision * recall) / (precision + recall).
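As a minimal sketch in plain Python (the confusion-matrix counts are made-up):

```python
# Toy confusion-matrix counts (made-up numbers for illustration).
tp, fp, fn, tn = 90, 10, 30, 870

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of predicted positives, how many are right?
recall = tp / (tp + fn)             # of actual positives, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```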
But we cannot always reduce everything to a single evaluation metric. Say we have many models: accuracy keeps increasing, but running time gets much longer. So running time and accuracy are two separate metrics. Running time is our satisficing metric (it just has to be good enough), while accuracy is our optimising metric (the one we actually care to optimise). Say our satisficing requirement is running time < 1000 ms. If one model runs in under 1000 ms with 95% accuracy, and another takes about 1200 ms with 98% accuracy, we choose the first rather than the latter, because the second violates the satisficing constraint.
With N possible metrics, it is reasonable to choose 1 as the optimising metric and the remaining N-1 as satisficing metrics.
Another example: say we are building a machine learning model for trigger word detection. Our optimising metric is accuracy. Our satisficing metric is false positives < 5 (a false positive is the word being detected without the user actually speaking it: actually false, but detected as true). Fewer than 5 here means we take whichever model has the highest accuracy among those with fewer than 5 false positives. Any model with higher accuracy but too many false positives (failing the satisficing metric) is ignored!
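A sketch of that selection rule, with hypothetical models and numbers:

```python
# Hypothetical candidate models: (name, accuracy, false positives).
candidates = [
    ("model_a", 0.95, 3),
    ("model_b", 0.98, 9),   # highest accuracy, but fails the satisficing metric
    ("model_c", 0.96, 4),
]

MAX_FALSE_POSITIVES = 5

# Keep only models that satisfy the constraint, then optimise accuracy.
feasible = [m for m in candidates if m[2] < MAX_FALSE_POSITIVES]
best = max(feasible, key=lambda m: m[1])
print("chosen:", best[0])  # model_c: best accuracy among feasible models
```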
Cross Validation / Test Set
The cross validation and test sets must come from the same distribution. We tune hyperparameters on the cross validation set, and if we then evaluate on a test set drawn from a different distribution, we are doing it wrong.
So we randomly shuffle the data and make sure the cross validation and test sets come from the SAME distribution.
Size?
Remember how you used to split data into 60% training, 20% cross validation, and 20% test sets? We don't do that for much larger datasets (say, millions of examples).
Instead we use something like 98% training, 1% test, and 1% cross validation: even 1% of a million examples is 10,000, which is plenty for evaluation.
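A minimal sketch of such a split, assuming the data is already loaded as a NumPy array:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

data = np.arange(1_000_000)   # stand-in for a million examples
rng.shuffle(data)             # shuffle so all splits share one distribution

n = len(data)
n_train = int(0.98 * n)
n_dev = int(0.01 * n)

train = data[:n_train]
dev = data[n_train:n_train + n_dev]      # cross validation set
test = data[n_train + n_dev:]
print(len(train), len(dev), len(test))   # 980000 10000 10000
```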
Why choose human-level performance as a metric?
After a deep learning model surpasses human-level performance, it has been found that the slope of its learning curve doesn't improve much. Every ML model approaches a saturation point of progress called the Bayes optimal error. No matter what you do, your model cannot surpass Bayes optimal error. Why am I telling you this? Because human-level error is usually not far from this error, so human-level performance makes a good baseline.
Avoidable Bias
Simple thing: you cannot get below this error unless you are overfitting. For example, when you compare the model with human-level performance, say human-level error is 2% and your training error is 3%. The gap you could still close without overfitting, 3% - 2% = 1%, is called avoidable bias.
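As a sketch, the bookkeeping might look like this (the error rates are made-up):

```python
# Made-up error rates for illustration.
human_error = 0.02      # proxy for Bayes error
train_error = 0.03
dev_error = 0.08

avoidable_bias = train_error - human_error   # 1%: room left to fit the data
variance = dev_error - train_error           # 5%: gap from training to dev

# Focus effort on whichever gap is larger.
focus = "reduce bias" if avoidable_bias > variance else "reduce variance"
print(f"avoidable bias={avoidable_bias:.2%}, variance={variance:.2%}, {focus}")
```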
Human Level Error
How do we define human-level error? Human-level error is used as a proxy for Bayes optimal error. So, if we are building a machine learning model for medical diagnostics, what should human-level error be?
Typical human => 10%
Doctor => 8%
Experienced doctor => 7%
Team of experienced doctors => 5%
Which human level should we choose? Since human-level error stands in for Bayes error, and no achievable error can be lower than Bayes error, we should take the lowest one: the team of experienced doctors, at 5%.
Different Distributions of Training and Test Set
Suppose we are building a classifier that takes a live image from a mobile phone and classifies whether it shows a football game or a cricket game. What do we need first? A trained model. The problem is that the images we train on are high-quality images downloaded from the internet, and a model trained on high-quality images doesn't work well on images taken from a mobile phone.
Say we have 100,000 examples downloaded from the internet and 2,500 mobile images. How do we handle this case?
Option 1.
We could combine them, shuffle, and then split. But this is a bad idea, because our test and cross validation sets would contain very few mobile images, even though mobile images are the distribution we actually care about.
Option 2.
Train on all 100,000 + 2,500 images, but cross validate and test on mobile images only. This way we constrain the model toward the distribution we actually care about.
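A minimal sketch of Option 2, assuming both image sets are already loaded as Python lists:

```python
import random

random.seed(0)

web_images = [f"web_{i}" for i in range(100_000)]      # high-quality internet images
mobile_images = [f"mobile_{i}" for i in range(2_500)]  # the distribution we care about

random.shuffle(mobile_images)

# Dev and test come from mobile images only; the rest join the training set.
dev = mobile_images[:625]
test = mobile_images[625:1_250]
train = web_images + mobile_images[1_250:]

print(len(train), len(dev), len(test))  # 101250 625 625
```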
Checking Data Mismatch
Suppose our training set comes from one distribution (high-quality images downloaded from the internet) while the dev and test sets come from mobile images. If we get 1% training error and 10% dev error, we cannot tell whether this is really a data mismatch or just overfitting. What we do in this case is carve out part of the training set (call it training-dev), say 10%, that the model never trains on, and check the error there. If training-dev error and training error are almost the same but dev error is large, we can say there is a data mismatch. If training error is low but both dev error and training-dev error are large, we say the model is overfitting.
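That reasoning as a sketch (the 2% gap threshold and the error numbers are assumptions, not fixed rules):

```python
def diagnose(train_err, train_dev_err, dev_err, gap=0.02):
    """Rough diagnosis from errors on the train / training-dev / dev sets."""
    if train_dev_err - train_err > gap:
        return "variance (overfitting the training set)"
    if dev_err - train_dev_err > gap:
        return "data mismatch between training and dev distributions"
    return "no large gap: look at avoidable bias instead"

print(diagnose(0.01, 0.015, 0.10))  # data mismatch
print(diagnose(0.01, 0.09, 0.10))   # variance
```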
When not to use transfer learning?
Transfer learning helps when you transfer from a task with a lot of data to a task with little data. When your target dataset is very large and you try to transfer-learn from a small dataset, it is a problem: the small source task has little to teach.
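For contrast, a minimal sketch of the case where transfer learning does help: features pre-trained on a large dataset (ImageNet) reused for a small target task. This uses Keras; NUM_CLASSES and the input shape are placeholders:

```python
import tensorflow as tf

NUM_CLASSES = 2  # placeholder: e.g. football vs cricket

# Feature extractor pre-trained on the large ImageNet dataset.
base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3),
)
base.trainable = False  # freeze: our target dataset is small

# Only the new output layer is trained on the small target dataset.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```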
Multitask Learning
Train a network to output multiple tasks simultaneously. One example is multi-label classification: in autonomous driving, we want to detect multiple objects in the same image (cars, pedestrians, traffic signs). Object detection is another example.
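A sketch of the usual multitask setup: one sigmoid output per label and binary cross-entropy summed over labels (NumPy, with made-up values):

```python
import numpy as np

def multitask_loss(logits, labels):
    """Sum of per-label binary cross-entropies, averaged over examples."""
    probs = 1.0 / (1.0 + np.exp(-logits))   # independent sigmoid per label
    bce = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return bce.sum(axis=1).mean()           # sum over labels, mean over batch

# One image can contain several objects at once (multi-hot labels):
# columns = [car, pedestrian, traffic sign]
labels = np.array([[1, 1, 0],
                   [0, 0, 1]], dtype=float)
logits = np.array([[2.0, 1.5, -1.0],
                   [-2.0, -1.0, 3.0]])
print(multitask_loss(logits, labels))
```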
End to End Learning
Learning X => Y directly is end-to-end learning. X => A => B => C ... => Y is not end-to-end learning. For example, face recognition can be solved end to end, mapping directly from the input image to the identity, or we can do something like: take input => detect face => recognise the face.
Another example is determining the age of a kid from an X-ray. We could do something like image => age directly. But the dataset may be very small, so image => bone size => age would be a better idea.