What do self-driving cars, Google Translate, and Amazon recommendations have in common? They all use machine learning. But what is “machine learning”? This post hopes to demystify the term that is becoming increasingly relevant in tech.
Machine learning refers to a wide range of statistical techniques which use data to identify patterns and make decisions. It involves correctly defining the task at hand and choosing the right data features and models for the job.
It’s interesting to note that machine learning is not a new development. Many of the techniques have been around for decades. What has changed is scale: we now collect more data than ever before, and advances in computing power and data storage let us process bigger and more complex datasets faster than ever. This is why we have seen a resurgence in machine learning, with existing techniques being revisited and developed further.
House Sales Example and Terminology
Let’s start with a simple example of a machine learning task in order to introduce some terminology. Suppose you have a dataset about houses and the prices that they have sold for. The table below shows the first five entries of such a dataset (based on the Ames Housing Dataset of house sales between 2006 and 2010).
A possible machine learning task would investigate how the provided information, such as the number of bathrooms and bedrooms, affects the final sale price. We could use the resulting model to predict the sale price for new houses that go on the market.
|Index|Lot size in square feet|Overall quality|Overall condition|Year built|Number of bathrooms|Number of bedrooms|Month sold|Sale price ($)|
Each entry in the dataset is called an example. The variable we are predicting, in this case the sale price, is the label. The features are the characteristics that we choose to investigate and use as inputs for our model. In this dataset the features include lot size, overall quality, and number of bedrooms. Choosing the right features is often an art in itself.
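To make the terminology concrete, here is a minimal Python sketch of examples, features, and labels. The rows and their values are invented for illustration and are not taken from the Ames dataset; the field names and the `split_features_and_label` helper are likewise hypothetical.

```python
# Each dict is one "example". All values are made up for illustration.
houses = [
    {"lot_size_sqft": 8000, "bedrooms": 3, "bathrooms": 2, "sale_price": 200000},
    {"lot_size_sqft": 9500, "bedrooms": 4, "bathrooms": 2, "sale_price": 250000},
    {"lot_size_sqft": 6000, "bedrooms": 2, "bathrooms": 1, "sale_price": 140000},
]

# The features are the model inputs; the label is what we want to predict.
FEATURE_NAMES = ["lot_size_sqft", "bedrooms", "bathrooms"]
LABEL_NAME = "sale_price"


def split_features_and_label(examples, feature_names, label_name):
    """Return (X, y): one feature vector and one label per example."""
    X = [[example[name] for name in feature_names] for example in examples]
    y = [example[label_name] for example in examples]
    return X, y


X, y = split_features_and_label(houses, FEATURE_NAMES, LABEL_NAME)
print(X[0], y[0])
```

Separating the feature vectors `X` from the labels `y` in this way mirrors the input format that most machine learning libraries expect.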
Machine learning can be either supervised or unsupervised (or sometimes a hybrid of the two).
Supervised learning requires labelled data to “train” the model. Training is the process whereby the machine learning model “learns” from the data. The trained model can then be used to predict labels for unlabelled examples.
Most tasks in supervised learning are either classification or regression problems.
In classification, examples are assigned to “buckets” or classes, and the task is to predict which class new examples belong to.
- Example: Classifying emails as “spam” or “not spam” using a dataset of emails which are labelled as “spam” or “not spam”.
- Why it’s supervised learning: The model is being trained on a dataset of emails which are labelled as “spam” or “not spam.”
- Why it’s a classification problem: There are two possible labels, “spam”, and “not spam”.
- Example: Classifying images in the ImageNet dataset
- Why it’s supervised learning: The ImageNet dataset contains millions of images which have been labelled with nouns by hand.
- Why it’s a classification problem: There is a finite (but very large) set of possible classes an image can belong to which includes nouns such as “bicycle” or “squirrel.”
Classification models include support vector machines (SVMs), various types of neural networks, decision trees, and Naïve Bayes.
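As a rough illustration of the spam example, the sketch below implements a very small Naïve Bayes classifier in plain Python. The four training emails are invented, the function names are hypothetical, and a real system would use far more data and a tested library implementation rather than this toy version.

```python
import math
from collections import Counter


def train_naive_bayes(labelled_emails):
    """Count emails per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in labelled_emails:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_emails = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Start from the log prior: how common this class is overall.
        score = math.log(class_counts[label] / total_emails)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data: each email is labelled "spam" or "not spam".
training_emails = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lunch on monday with the team", "not spam"),
]

model = train_naive_bayes(training_emails)
print(predict(model, "free prize money"))
print(predict(model, "agenda for the team meeting"))
```

The training step only counts words per class, which is one reason Naïve Bayes is a common first baseline for text classification.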
In regression, examples are labelled with continuous values.
- Example: Predicting house sale price using a dataset of historical house sales.
- Why it’s supervised learning: The model is being trained on a dataset of examples which are labelled with house sale price.
- Why it’s a regression problem: The house sale price can take on any value, such as $150,000 or $1,000,000 or anything in between.
Regression models include support vector machines, neural networks, linear regression and generalised linear models.
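The house price example can be sketched with the simplest regression model, a straight line fitted by least squares to a single feature. The lot sizes and prices below are invented and deliberately lie on an exact line so the result is easy to check; `fit_line` and `predict_price` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(lot_size, slope, intercept):
    """Predict a continuous label (sale price) for a new example."""
    return slope * lot_size + intercept


# Invented training data: lot size in square feet and sale price in dollars.
lot_sizes = [6000, 8000, 10000, 12000]
sale_prices = [150000, 190000, 230000, 270000]

slope, intercept = fit_line(lot_sizes, sale_prices)
print(slope, intercept)
print(predict_price(9000, slope, intercept))
```

Real data never fits a line this cleanly, and a realistic model would use several features at once, but the training and prediction steps follow the same pattern.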
Supervised deep learning
Recent breakthroughs have made “deep learning” a popular buzzword, but what is it really? “Deep learning” is a field of machine learning which involves using “deep”, or multi-layered, neural networks on vast amounts of data. These artificial neural networks are inspired by biological neural networks and how information spreads between neurons in the brain. “Deep learning” models are particularly interesting because they discover features on their own and do not need to be programmed with any specific rules.
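To show what “multi-layered” means in practice, here is a minimal sketch of a forward pass through a network with one hidden layer. The weights are chosen by hand purely for illustration so that the network computes XOR, a function that a single linear layer cannot represent; in real deep learning the weights are learned from data during training, and the networks are far larger. All names here are hypothetical.

```python
def relu(value):
    """Rectified linear unit, a common activation function."""
    return max(0.0, value)


def identity(value):
    return value


def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: activation(weights . inputs + bias) per neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs


# Hand-picked weights (an assumption for this sketch, not learned values).
HIDDEN_WEIGHTS = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_BIASES = [0.0, -1.0]
OUTPUT_WEIGHTS = [[1.0, -2.0]]
OUTPUT_BIASES = [0.0]


def forward(inputs):
    """Pass the inputs through the hidden layer, then the output layer."""
    hidden = dense_layer(inputs, HIDDEN_WEIGHTS, HIDDEN_BIASES, relu)
    return dense_layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES, identity)[0]


for pair in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(pair, forward(pair))
```

The hidden layer turns the raw inputs into intermediate features, and the output layer combines them. Stacking many such layers is what makes a network “deep”.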
There have been numerous impressive applications of deep learning in recent years, particularly within the fields of speech and image recognition:
- Deep learning has been used to classify images in the ImageNet dataset, with some models reporting lower error rates than human annotators on the same benchmark.
- Self-driving cars use deep learning to “see” the world and navigate.
- Google developed the “Google Neural Machine Translation” system to improve the accuracy of Google Translate through deep learning.
- Apple’s Siri uses deep neural networks to interpret sounds and identify when a user has said the “Hey Siri” phrase.
All of these benefits come at a cost: deep learning generally requires vast amounts of labelled data and the resulting neural networks are black-box models. That said, making neural networks interpretable is now an active research area, and open-source tools such as LIME aim to explain the individual predictions that a model makes.
Deep learning does not always have to be supervised; it can also be unsupervised.
Unsupervised learning is learning from unlabelled data. The model is left to discover patterns in the data on its own.
- Example: Discovering market segments from a dataset of customer data using clustering.
- Why it’s unsupervised learning: The dataset has information about customers, but the customers have not been assigned to specific groups – it is up to the model to discover the market segments.
Some unsupervised learning tasks can be solved using clustering. Clustering involves grouping a set of examples so that similar examples end up in the same group, and clustering models include k-means and hierarchical clustering. Other unsupervised learning tasks include anomaly detection.
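As an illustration of clustering, the sketch below runs a bare-bones k-means on invented customer data (visits per month and average spend). It assumes a naive initialisation that uses the first k points as starting centroids; library implementations typically use more careful initialisation, and features on different scales would normally be standardised first. The function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=10):
    """Alternate between assigning points to centroids and moving centroids."""
    centroids = [list(point) for point in points[:k]]  # naive initialisation
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Invented, unlabelled customer data: (visits per month, average spend).
customers = [(1, 10), (2, 12), (1, 11), (10, 100), (11, 95), (12, 105)]

centroids, assignments = k_means(customers, k=2)
print(assignments)
print(centroids)
```

No labels are supplied at any point: the two groups emerge from the data alone, which is what distinguishes this from the supervised examples earlier.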
Note also that some neural networks are unsupervised. Within this space, generative adversarial networks (GANs) are particularly interesting – two neural networks work against each other to learn the distribution of data. These neural networks can be used to generate data that is like the data they were trained on.
Getting started with Machine Learning
The most popular programming languages for getting started with machine learning are Python and R. Python is a high-level programming language ideal for rapid prototyping and comes with a wealth of libraries such as pandas, scikit-learn, and TensorFlow. R is a statistical computing language, aimed primarily at statisticians and mathematicians. These tools are great for getting started with machine learning and for performing some exploratory data analysis.
However, actually turning a prototype into a model that scales is difficult. Running machine learning models can be expensive because they often have significant compute, memory, and storage requirements during training. Furthermore, machine learning models that are in production need to be checked and retrained on a regular basis in order to learn from new data.
Another big challenge in moving from prototypes to production models is handling dirty or unusual data (present in almost all systems), especially when new data is constantly arriving.
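One common defence is to validate each incoming row before it reaches the model. The sketch below is a minimal example of that idea, under the assumption that every feature should be a finite, non-negative number; the field names and the `clean_row` helper are hypothetical, and real pipelines usually need rules specific to each field.

```python
import math


def clean_row(row, required_fields):
    """Return a cleaned copy of the row, or None if it cannot be used."""
    cleaned = {}
    for field in required_fields:
        value = row.get(field)
        if value is None or value == "":
            return None  # missing value
        try:
            number = float(value)
        except (TypeError, ValueError):
            return None  # not numeric, e.g. "n/a"
        if not math.isfinite(number) or number < 0:
            return None  # NaN, infinity, or an impossible negative value
        cleaned[field] = number
    return cleaned


REQUIRED_FIELDS = ["lot_size_sqft", "bedrooms"]

# Invented incoming rows, some of them dirty.
incoming_rows = [
    {"lot_size_sqft": "8000", "bedrooms": 3},
    {"lot_size_sqft": "n/a", "bedrooms": 3},
    {"lot_size_sqft": 8000},
    {"lot_size_sqft": -5, "bedrooms": 2},
]

cleaned_rows = [clean_row(row, REQUIRED_FIELDS) for row in incoming_rows]
usable_rows = [row for row in cleaned_rows if row is not None]
print(usable_rows)
```

Whether to drop, repair, or flag a bad row is a design decision; this sketch simply drops it, which is the easiest behaviour to reason about but can hide upstream problems if rejected rows are not logged.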
Finally, once your models are helping to drive outcomes in your business, the last thing you want is for them to break or become unavailable. Robust and reliable hosting architectures are important, meaning you need developers and operations engineers as well as great data scientists.
The Filter and Machine Learning
The Filter uses machine learning extensively within our platform, with one example being our personalised recommendations. In addition, as part of a project part-funded through Innovate UK, we are further developing our journey optimisation capabilities using deep learning techniques. This will allow us to predict the outcome of a particular session ahead of time and will be a great benefit to our retail customers, enabling them to personalise and tailor users’ experiences based on the stage that they are at in their shopping “journey”.
If you wish to learn more about machine learning, I would recommend the book An Introduction to Statistical Learning, which also provides a good introduction to R.
On the other hand, if you already know a lot about this area, please note that we are currently recruiting Data Scientists with Machine Learning skills to join our team.