At the latest Conference on Neural Information Processing Systems (NIPS 2016), Andrew Ng shared some ideas about deep learning. Let me share them with you.
The first great advantage of deep learning is its scale. Andrew summarized it in the following chart:
Deep learning models perform better as the amount of data increases. Not only that, the larger the neural network, the better it exploits large datasets, unlike traditional models, whose performance plateaus at a certain level: adding data or model complexity does not necessarily lead to better results.
Another reason deep learning models are so powerful is their capacity to learn in an end-to-end fashion. Traditional models usually need significant feature engineering. For example, a model that transcribes a person's voice may need many intermediate steps on its inputs, e.g. finding the phonemes, chaining them correctly, and assigning a word to each chain.
Deep learning models do not usually need that kind of feature engineering. You train them end-to-end, i.e. by showing the model a large number of examples. However, the engineering effort, instead of being applied to transforming the features, goes into the architecture of the model. The data scientist will need to decide on and experiment with the neuron types they want, the number of layers, how to connect them, etc.
Challenges in model construction
Deep learning models have their own challenges. Many decisions have to be made during their construction in order to make the model successful. If a wrong path is taken, much time and money will be wasted. So how can data scientists make informed decisions about what to do next to improve their model? Andrew showed us his classical decision-making framework for developing models, but this time he extended it to other useful cases.
Let’s start with the basics: in a classification task (for example, making a diagnosis from a scan), we should have a good idea of the errors from:
- Human experts
- Training sets
- Cross-validation (CV) set (also called development or dev set)
Once we have these errors, a data scientist can follow a basic workflow to make valid decisions in the construction of the model. First, ask: is your training error high? If so, the model is not good enough; it may need to be richer (e.g. a larger neural network) with a different architecture, or it may need more training. Repeat the process until the bias is reduced.
Once the training set error is reduced, a low CV set error is needed. Otherwise, the variance is high, meaning more data, more regularization, or a new model architecture is required. Repeat until the model performs well on both the training and the CV set.
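This basic workflow can be sketched as a small decision function. The threshold and the error figures below are illustrative assumptions, not values from the talk:

```python
# A sketch of the basic bias/variance workflow described above.
# All numbers (tolerance, example errors) are invented for illustration.

def next_step(human_error, train_error, cv_error, tolerance=0.005):
    """Suggest the next action given error rates as fractions in [0, 1]."""
    bias = train_error - human_error
    variance = cv_error - train_error
    if bias > tolerance:
        # High bias: the model underfits the training set.
        return "train longer, or try a larger network / new architecture"
    if variance > tolerance:
        # High variance: the model does not generalize to the CV set.
        return "get more data, add regularization, or change the architecture"
    return "done: model performs well on both training and CV sets"

# Training error far above human level -> the high-bias branch fires first.
print(next_step(human_error=0.01, train_error=0.08, cv_error=0.10))
```

The function checks bias before variance, mirroring the order of the workflow: reduce training error first, then close the gap to the CV set.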
Nothing new there. However, deep learning is already changing this process. If your model is not good enough, there is always a "way out": increase your data or make your model larger. In traditional models, regularization is used to tune this trade-off, or new features are generated, which isn't always easy. But with deep learning we have better tools to reduce both errors.
Refining the bias/variance process for artificial data sets
But if access to a vast amount of data isn't always possible, the alternative is to build your own training data. A good example is the training of a speech recognition system, where artificial training samples can be created by adding noise to existing voice recordings. However, that does not mean the training set will have the same distribution as the real data. For these cases the bias/variance trade-off needs to be framed differently.
Imagine that, for a speech recognition model, we have 50,000 hours of generated data but only 100 hours of real data. In such a case, the best advice is to draw the CV set and the test set from the same distribution: the generated data becomes the training set, and the real data is split into CV and test sets. Otherwise the CV and test sets would follow different distributions, which would only be noticed once the model is "completed". The problem is specified by the CV set, so it should be as close to the real data as possible.
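As a toy illustration of this split (the array shapes, clip counts, and noise levels are assumptions for the sketch, not figures from the talk):

```python
import numpy as np

# Illustrative sketch: augment a small "real" audio set by adding noise,
# then follow the split recommended above: synthetic data for training,
# real data split into CV and test sets. Shapes and sizes are made up.

rng = np.random.default_rng(0)

real_clips = rng.standard_normal((100, 16000))       # 100 real "clips"
synthetic = np.concatenate([
    real_clips + rng.normal(scale=s, size=real_clips.shape)
    for s in (0.05, 0.1, 0.2)                        # three noise levels
])                                                   # 300 synthetic clips

train_set = synthetic                                # train on generated data
cv_set, test_set = real_clips[:50], real_clips[50:]  # CV/test share the real distribution

print(train_set.shape, cv_set.shape, test_set.shape)
```

Note that the real clips never enter the training set here; they are reserved so that the CV and test sets match the distribution the model will actually face.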
In practice, Andrew recommended splitting the artificial data into two parts: a large training set and a small held-out slice of it, which we will call the "train/CV set". With that, we will measure the following errors:

1. Human-level error
2. Training set error
3. Train/CV set error
4. CV set error
5. Test set error

So, the gap between (1) and (2) is the bias, the gap between (2) and (3) is the variance, the gap between (3) and (4) is due to the distribution mismatch, and the gap between (4) and (5) is due to overfitting.
With this in mind the previous workflow should be modified like this:
If the distribution-mismatch error is high, modify the training data distribution to make it as similar as possible to the test data. A proper understanding of the bias/variance problem allows faster progress in applying machine learning.
Knowing the performance level of humans is very important, as this will guide decisions. It turns out that once a model surpasses human performance, it usually becomes much harder to improve, because we are getting closer to the "perfect model", i.e. the point where no model can do better (the "Bayes rate"). This was not a problem with traditional models, where it was hard to perform at super-human levels, but it is becoming increasingly common in deep learning.
So, when building a model, take the error of the most expert group of humans as your reference; this will be a proxy for the "Bayes rate". For example, if a team of doctors does better than a single expert doctor, use the error measured by the team of doctors.
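As a minimal sketch of that choice, assuming some invented error rates for different groups of diagnosticians:

```python
# Hedged sketch: take the best available human-level error as a proxy for
# the Bayes rate. All error figures are invented for illustration.
human_errors = {
    "typical doctor": 0.030,
    "expert doctor": 0.010,
    "team of expert doctors": 0.005,
}

# The lowest human error is the tightest known bound on achievable error.
bayes_proxy = min(human_errors.values())
print(f"Bayes-rate proxy: {bayes_proxy:.3f}")
```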
How can I become a better data scientist?
Reading many papers and replicating results is the best and most reliable path towards becoming a better data scientist. It’s a pattern Andrew has seen across his students, and one that I personally believe in.
Even if almost all you do is "dirty work" (cleaning data, tuning parameters, debugging, optimizing the database, etc.), don't stop reading papers and replicating models, because replication eventually leads to original ideas.