
Introduction

Last month, after two years of painstaking work, I finished the Master’s in Data Science program at UC Berkeley.  With the sudden absence of deadlines, projects, and classes, my mind re-discovered a forgotten capacity to drift, wander, and reflect.  I found myself thinking about just how much I had learned over these two years, and how much I felt I had matured as a data scientist.   

I thought back to the first project I did on deep learning, and how wild and undisciplined my approach was.

I wondered how much better I could do now, armed with so much more knowledge and experience.  Then I realized that revisiting this project could be a perfect way to quantify just how much I had improved, and that sharing the differences between my first and second approaches could hopefully save others from repeating my mistakes.

TLDR:  I went from 58th place on my first attempt (which took 6 weeks), to 7th place on my second (which took 2 days). If you want to find out how, you’re going to have to keep reading, but I’ve also included a summary at the end of this post.

The Problem

The project was one of the options for the final project in the course “Applied Machine Learning,” and was pulled from a 2013 Kaggle Competition.  The problem came down to predicting 15 facial key-points on pictures of faces, such as the center of the eye, or edges of the mouth. 

Sample images from the competition.

The dataset is made up of ~7,000 training images with labels and ~1,800 test images.  Final predictions can be submitted to Kaggle and evaluated automatically.  You can compare the final results to the competition leaderboard, but it should be noted that the leaderboard stopped updating 4 years ago. You can think of the scores as a measure of how far, in pixels, the predictions were from the ground truth (so the lower, the better).

Competition leaderboard when it closed.

My Approach

In retrospect, going into this project I was definitely overconfident in both my understanding of machine learning and my ability to apply it.

I was vaguely familiar with all of the necessary concepts for the project like EDA, data cleaning, data augmentation, convolutional neural networks, and regression.  But I was severely limited by my lack of experience with implementing these concepts in code.  As a result, I found myself focusing on what I could do, instead of what I should do.

Rather than progressing systematically through the problem, I bounced around wildly and haphazardly strung together the things that I was able to figure out.  But, for the cohesion of this post I will discuss the steps in the order below:

  • Exploratory data analysis
  • Data cleaning
  • Algorithm selection & fine tuning
  • Data augmentation
  • Ensemble learning

Exploratory Data Analysis

-ATTEMPT 1-

All ML projects start with a problem and some data.  So naturally, the first step is to dig in and develop a deep and nuanced understanding of these two.

But the first time around, I really didn’t know what I was looking for.  I started with the basics, visualizing the first few images in the dataset along with their key points, and looking at some summary statistics associated with each.  But honestly, I was in a rush to get to modeling, so I didn’t go much deeper than this. As a result, I missed A TON of insights that I could have gleaned from a deeper dive into the dataset.

-ATTEMPT 2-

The second time around I had a fundamentally different approach.  Before I started looking at the data, I wrote out a list of questions I wanted to answer through my EDA.  Here are some of them:

  • What kinds of “flaws” are present in the data?
  • What does the diversity of faces look like? (Age, race, gender, facial hair, …)
  • How are the faces oriented?  (All forward-facing, turned, inverted, …)
  • What kind of photometric diversity is present? (Brightness, contrast, blur, …)
  • What kind of geometric diversity is present? (Scale, rotation, cropping, …)

Answering these questions was essential to guide the data cleaning, data augmentation, and modeling processes.  Since this was a small training set of 7,049 images, I decided to visualize all of them, quickly glance through each, and jot down my findings.

Many sample images.
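If you want to do something similar, here is a minimal sketch of that kind of review, assuming the competition's training.csv layout (30 coordinate columns plus an "Image" column of space-separated 96x96 grayscale pixel values); the grid size is just illustrative:

```python
# Plot a grid of training images with their key points overlaid.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("training.csv")  # assumed Kaggle file name

fig, axes = plt.subplots(4, 4, figsize=(10, 10))
for ax, (_, row) in zip(axes.ravel(), train.iterrows()):
    # The "Image" column holds 96x96 grayscale pixels as one space-separated string.
    img = np.array(row["Image"].split(), dtype=np.float32).reshape(96, 96)
    coords = row.drop("Image").values.astype(float)  # x1, y1, x2, y2, ...
    ax.imshow(img, cmap="gray")
    ax.scatter(coords[0::2], coords[1::2], s=10, c="red")
    ax.axis("off")
plt.tight_layout()
plt.show()
```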

While this seems daunting, it only took about 2.5 hours altogether.  Here are some of my findings:

  • Some images were horribly labeled.
  • There seem to be two fundamentally different labeling schemes at work.
  • Good gender and age diversity, but poor racial diversity.
  • Most faced forward, with a few exceptions.
  • Huge photometric variability.
  • Little geometric variability.

Examples of unusual images.

While I had worked on this problem before, a surprising amount of this information was new to me, and helped explain a lot about why some of my previous work had backfired.  Armed with this information, I was ready to move on to data cleaning.

Data Cleaning

-ATTEMPT 1-

Normally at this point the goal is to fix up all the problems found with the data during EDA.  But because I rushed through EDA, the only major thing I found was that ~70% of the dataset only had labels for 4 of the 15 points.

There were a few ways I knew of at the time to deal with this:

  1. Fill in the missing values from a previous example.
  2. Fill in the missing values based on the mean or median of that value.
  3. Drop everything with missing values.

Since I had looked at a few images, I knew they weren’t in any sort of order, so I knew that option 1 wouldn’t work well.  For option 2, I figured that there were probably many different orientations of faces, and if an imputed label ended up in the wrong spot, it would completely ruin what the model was learning.

Examples of poor imputation that would hurt model performance.

So, I decided to just go with option 3, and dropped all of the examples with missing fields.  This took my training dataset from 7,000 images down to 2,000 images, which should’ve been a huge red flag.
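For reference, here is roughly what those three options look like in pandas, assuming the same training DataFrame and column layout as the EDA sketch above:

```python
import pandas as pd

train = pd.read_csv("training.csv")
label_cols = [c for c in train.columns if c != "Image"]

# Option 1: carry values forward from a previous example (order-dependent, risky here).
opt1 = train.copy()
opt1[label_cols] = opt1[label_cols].ffill()

# Option 2: fill each missing coordinate with that column's median.
opt2 = train.copy()
opt2[label_cols] = opt2[label_cols].fillna(opt2[label_cols].median())

# Option 3 (what I did the first time): drop every row with any missing label,
# which shrinks the training set from ~7,000 rows to ~2,000.
opt3 = train.dropna(subset=label_cols)
```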

-ATTEMPT 2-

The second time around, I knew that I would have to deal with these missing values in a smarter way. Instead of just guessing, I decided to test each option against a baseline model.  I split my data, creating a small validation set from some of the fully labeled examples, then tested different ways of filling in the missing values and compared the model performance.

In addition to the three solutions listed above, I also tested a method based on K-Nearest Neighbors. Essentially, this looked at the points we did have for each image and tried to find the most similar examples in the rest of the dataset.  It then borrowed the corresponding points from those neighbors to fill in the missing values.

Simplified example of KNN imputation.

To take this a step further, I had it find the closest 1,000 images and let them vote on where the missing points should be, with each vote weighted by how similar that neighbor was to the image I was trying to label.  When this was ready, I evaluated all of the options against the baseline model, which used the median values.
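For the curious, scikit-learn’s KNNImputer captures the same idea in a few lines: it measures similarity using only the coordinates each row does have (a NaN-aware distance) and fills the gaps with a distance-weighted average of the neighbors.  My own implementation differed in its details, so treat this as a sketch; n_neighbors=1000 just mirrors the number above:

```python
import pandas as pd
from sklearn.impute import KNNImputer

train = pd.read_csv("training.csv")
label_cols = [c for c in train.columns if c != "Image"]

# Distance-weighted KNN imputation over the 30 coordinate columns only.
imputer = KNNImputer(n_neighbors=1000, weights="distance")
train[label_cols] = imputer.fit_transform(train[label_cols])
```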

As you can see, the KNN approach was the best, while dropping the examples, as I did in my first attempt, was by far the worst!

Before moving on, I also wanted to test and see what would happen if I removed some of the poorly labeled examples from the dataset that I had identified during EDA.  I removed 22 examples I had identified as the worst offenders and compared new model results to my baseline again.

Simply dropping these 22 examples caused a 21.4% reduction in my model error rate!  At this point, I felt that my data was in a good place, so I moved on to the modeling phase.

Algorithm Selection and Fine Tuning

-ATTEMPT 1-

This is where I spent ~80% of my time on the first attempt.  I was convinced that if I could find the perfect architecture and tune it well, it would solve all of my problems.  Not only was I wrong about this, but my approach to finding this “perfect architecture” was deeply flawed.

MAJOR MISTAKE 0: No baseline model.  When I started, I never established a baseline to evaluate my model performance against.  Instead of making a single change at a time and seeing if it helped or hurt, I made multiple changes at once, so when something did work I couldn’t identify why.

MAJOR MISTAKE 1: Training locally on CPU.  When I say I was a beginner, I mean it.  At the time I’d never used a GPU on Colab, and I didn’t realize that it was just a couple clicks away.  I didn’t figure it out until 2 weeks into working on modeling, which cost me extremely valuable time.

MAJOR MISTAKE 2: Building custom architectures.  While I had some promising results early on due purely to luck, I spent far too much time trying to build bigger and bigger custom networks. (I was completely unfamiliar with the vanishing gradient problem.) So of course they took longer and longer to train (especially on CPUs).

MAJOR MISTAKE 3:  Not applying transfer learning.  When I finally did think to try other established networks, I did not understand transfer learning yet.  While I used their architectures, I was training weights from scratch on a tiny dataset, which again cost me valuable time and model performance.

MAJOR MISTAKE 4: Relying on AutoML and grid search.  I lost multiple days using tools like AutoML and grid search across way too many hyperparameters.  Rather than intelligently reducing the search scope, I fed in ~8 options for every hyperparameter and then let it run for 3 days straight, and it still didn’t finish.

-ATTEMPT 2-

The second time around I essentially did the opposite of all of the above.   I started with a baseline and moved systematically through the things I wanted to test.  I did everything on a V100 GPU and implemented early stopping, so training was very fast.

In terms of transfer learning, I tested MobileNetV2, MobileNetV3, and EfficientNet. However, none of them performed particularly well.  I believe this is because my training images are grayscale and very low resolution at 96x96x1, whereas the pretrained weights for those models were learned on higher-resolution RGB images.

Ultimately, the best architecture was a custom ConvNet trained from scratch. Using a learning rate scheduler with exponential decay proved to be highly beneficial as well.
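As a rough illustration, here is the kind of small ConvNet, exponential learning rate decay, and early stopping I’m describing, written in Keras.  The layer sizes, decay settings, and patience are placeholders, not my exact final configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_outputs=30):
    """A small ConvNet that regresses 15 (x, y) key points from a 96x96x1 image."""
    return models.Sequential([
        layers.Input(shape=(96, 96, 1)),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(n_outputs),  # raw pixel coordinates, no activation
    ])

# Exponential learning rate decay (settings are illustrative).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.96)

model = build_model()
model.compile(optimizer=tf.keras.optimizers.Adam(lr_schedule), loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])

early_stop = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=200, batch_size=32, callbacks=[early_stop])
```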

Data Augmentation

-ATTEMPT 1-

When fine-tuning my algorithm didn’t give the results I’d hoped for, I turned to data augmentation.  This proved to be more difficult than expected because any geometric distortions would also have to be applied to the key points to maintain the accuracy of the labels.

An example of moving key points during augmentation.

Luckily, there were libraries to assist with this, but I spent so much energy on implementing these that I did not stop to reflect on which augmentations truly made the most sense.  I then over-augmented my dataset, with nearly a dozen different augmentations, finishing with a ratio of ~2:1 augmented vs original images in my training set.
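Libraries such as albumentations handle this bookkeeping by transforming the key points alongside the image.  Here is a minimal sketch of the idea, with placeholder transforms and stand-in data rather than the exact settings used in this project:

```python
import albumentations as A
import numpy as np

# Geometric and photometric transforms that also move the key points.
transform = A.Compose(
    [
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.7),
        A.RandomBrightnessContrast(p=0.5),
        A.GaussianBlur(p=0.3),
    ],
    keypoint_params=A.KeypointParams(format="xy", remove_invisible=False),
)

image = np.random.randint(0, 255, (96, 96, 1), dtype=np.uint8)  # stand-in image
keypoints = [(66.0, 39.0), (30.0, 36.0)]                         # stand-in (x, y) pairs

augmented = transform(image=image, keypoints=keypoints)
aug_image, aug_keypoints = augmented["image"], augmented["keypoints"]
```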

-ATTEMPT 2-

The second time around, I had done a much more complete EDA, so I knew exactly which types of augmentations I wanted to apply.  I began applying these systematically, testing the impact of each on my validation accuracy.

Once I established how they performed individually, I began combining the top performers and retesting to ensure that the trend still held.  Altogether I ended up with a ratio of ~1:1 augmented vs non-augmented images.

Ensemble Methods

-ATTEMPT 1-

As I wrapped up the project and the deadline approached, I was really disappointed with my overall model performance.  I had trained over 100 models, but because my approach was so haphazard, I had not made significant progress.  So as a final effort, I tried to combine multiple model predictions together to see if it would improve overall performance.

I did this in the most naïve way possible, by literally averaging the final predictions of multiple models and hoping for the best.  Surprisingly, it did significantly help, although I could not explain why it was working.
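In code, the naive version is just an element-wise average of the predicted coordinates (models and X_test here are assumed to already exist):

```python
import numpy as np

predictions = [m.predict(X_test) for m in models]   # each has shape (n_images, 30)
ensemble_prediction = np.mean(predictions, axis=0)  # average every coordinate
```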

-ATTEMPT 2-

The second time around, I had significantly more experience with ensembling, and decided to go for a hierarchical approach.  Specifically, I decided to train separate models to predict clusters of points, rather than use one model to predict all of them simultaneously.  I started with a 3-model approach: eyes, eyebrows, and nose + mouth.

This led to significantly better results, so I decided to take it further.  I continued testing more granular approaches until I finally trained a model for each individual key point, which led to the best results overall. 

Finally, I tried pre-training a model on all key points and then fine-tuning a model for each individual point.  This allowed me to harness the size of the full dataset to learn the features of the face, but then focus on only one key point at a time during fine-tuning.  This was the final breakthrough and ultimately what moved me into the top 10.
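Here is a sketch of that pre-train-then-specialize idea in Keras, reusing the build_model sketch from the modeling section.  The data handling at the bottom is an assumption about how the non-imputed rows might be selected, not my exact code:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# base = build_model(n_outputs=30)   # from the modeling sketch above
# base.fit(X_all, y_all, ...)        # pre-train on all 15 key points first

def fine_tune_for_keypoint(pretrained):
    """Turn a model pre-trained on all 30 outputs into a single key-point specialist."""
    # Clone so each specialist gets its own copy of the pre-trained weights.
    copy = tf.keras.models.clone_model(pretrained)
    copy.set_weights(pretrained.get_weights())
    backbone = models.Model(copy.input, copy.layers[-2].output)  # drop the 30-way head
    head = layers.Dense(2)(backbone.output)                      # one (x, y) prediction
    specialist = models.Model(backbone.input, head)
    specialist.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
    return specialist

# One specialist per key point, fine-tuned only on rows where that point was
# actually labeled (non-imputed labels), e.g.:
# for i in range(15):
#     mask = ~np.isnan(y_raw[:, 2 * i])
#     m = fine_tune_for_keypoint(base)
#     m.fit(X_all[mask], y_raw[mask, 2 * i:2 * i + 2], ...)
```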

Summary

If you skipped to the end, or you just need a recap, here’s a summary of the actions that helped improve model performance:

  • Deeper EDA with manual data cleaning.
  • Weighted K-nearest neighbors data imputation.
  • Custom ConvNet model architecture.
  • Reduced batch size.
  • Exponential learning rate decay.
  • Photometric augmentations: brightness, contrast, Gaussian blur.
  • Geometric augmentations: scale, translate, rotate.
  • Base model pre-training.
  • Fine-tuning individual models for each key point using non-imputed labels.

Conclusion

While it was fun to get a better score, what I took the most pride in during this second attempt was the new-found discipline of my approach. While I would’ve expected this to slow me down, it actually allowed me to work significantly faster because I made better choices about where and how to spend my time to maximize impact. And because I tested one thing at a time, it was explainable. Not only could I articulate what was working and why, but I’ll be able to apply these findings to other computer vision projects in the future.

To sum it all up, here are my final takeaways:

  • Make a plan: Understand what you want to test, and why it should make an impact.
  • Change one thing at a time: Experimentation without a plan is just guessing.  You should be aiming to measure the impact so that you, and others, can learn and improve from the findings.
  • Stick to what you should do, not what is easiest: If you’re running up against a barrier because you can’t figure something out, resist the urge to pivot.  You’re going to have to learn it eventually. Taking a shortcut now will not help you in the long run.
  • Don’t reinvent the wheel: If you’re studying ML you’re probably pretty smart, but wisdom is realizing that you are not smarter than the ML community.  Don’t try to start from scratch, do your research and understand what has been done, and why.  And then apply it.