What is Classification in Predictive Modeling?
In machine learning, Classification refers to training a model on a labeled dataset so that it can assign data points, including new, unseen data points, to known classes. Classification is a method of predictive modeling.
A few examples of Classification
- Spam Filters: Spam filters use classification algorithms to recognize spam emails by learning their characteristics. Naïve-Bayes and Support Vector Machines are often used (see the sketch after this list).
- Image Classification: How does your phone know that you’re taking a picture of food? Through image classification. Convolutional Neural Networks (CNNs) are typically used to train these algorithms.
- Fraud Detection: Fraud detection falls under anomaly detection. Anomaly detection algorithms can be trained using a variety of classification algorithms, with Naïve-Bayes and k-Nearest Neighbors (kNN) being the most common. Fraud detection works by classifying activities as "normal" or "outlier". Outliers/anomalies trigger protective measures.
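To make the spam-filter example concrete, here is a minimal sketch using scikit-learn. The emails and labels are invented purely for illustration, and a real filter would need far more data.

```python
# A minimal spam-filter sketch: Naive Bayes trained on a handful of
# made-up emails (texts and labels invented for illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "Win a free prize now",   # spam
    "Meeting moved to 3pm",   # not spam
    "Claim your free reward", # spam
    "Lunch tomorrow?",        # not spam
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Turn raw text into word-count features, then fit the model.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

# Classify a new, unseen email.
new_email = vectorizer.transform(["Free prize waiting for you"])
print(model.predict(new_email))  # likely [1], i.e. flagged as spam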
Datasets for Classification
Classification modeling requires large datasets. More data = more accuracy, typically. In general, your model should have a training dataset that emphasizes quantity and relevance. The training dataset needs to sufficiently resemble the problem and have as many examples of each class as possible. For example, if your goal is to train an email spam filter, you need to train the classification model on formal emails, shorthand, forwarded emails, promotional emails, notifications, order confirmations, and spam emails. The dataset needs to sufficiently cover all of these and more to be accurate.
Class Label Notation
Class labels are usually string values that are mapped to numeric values. For example, "spam" = 1, "not spam" = 0. (Word to the wise: be careful with your notation and make sure you fully document your mapping.)
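A minimal sketch of one way to document that mapping in code; the raw labels here are just the spam example from above:

```python
# Explicit, documented label mapping: "spam" = 1, "not spam" = 0.
LABEL_TO_ID = {"not spam": 0, "spam": 1}
ID_TO_LABEL = {v: k for k, v in LABEL_TO_ID.items()}

raw_labels = ["spam", "not spam", "spam"]
y = [LABEL_TO_ID[label] for label in raw_labels]
print(y)                  # [1, 0, 1]
print(ID_TO_LABEL[y[0]])  # "spam" -- mapping back for reporting
```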
Which Algorithm Should You Use?
There is no single algorithm that works best for every classification problem. Like most things in machine learning, you're pulling from your toolbox of algorithms and will need to gauge which one fits your problem best. In many cases, you will end up trying multiple algorithms and comparing the results to figure out which one performs best (see the sketch after this list). Need somewhere to start?
- Random Forest: Easy to implement. Works well on datasets with a lot of variables.
- Naïve-Bayes: Fast and works well on small datasets.
- Logistic Regression: Simple logistic regression can work for classification. It isn't as powerful as other methods and natively works only for binary variables.
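In code, "trying multiple algorithms and comparing" often looks like a simple cross-validation loop. A minimal sketch, assuming scikit-learn is available and using a synthetic dataset as a stand-in for your real problem:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a real labeled dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Compare mean cross-validated accuracy across the candidates.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```

Whichever algorithm you land on, classification tasks generally fall into one of four types: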
- Binary Classification
- Multi-Class Classification
- Multi-Label Classification
- Imbalanced Classification
Binary Classification
Binary Classification refers to classification tasks that have two class labels. Typically, binary classification involves one class that is "normal" and another that is "abnormal".
Examples of Binary Classification
- Email spam filters (spam, not spam)
- Conversion/purchase prediction (buy, no buy)
- Fraud detection (fraud, not fraud)
- Lending decisions (lend to, don’t lend to)
Binary Class Labeling
Binary classes are typically labeled as 0 and 1: 0 is assigned to the "normal" classification and 1 is assigned to the "abnormal" classification. In our spam filter example, "not spam" = 0 and "spam" = 1.
Why are Binary Classes Labeled as 0 and 1?
The Bernoulli Distribution! (Shocker.) The Bernoulli Distribution is a discrete probability distribution for events with binary outcomes: it gives the probability that an event will happen or not happen, using 1 and 0 respectively. Hence why the notation is typically extended to binary class labeling as well.
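A quick sketch of the Bernoulli Distribution in code, using scipy; the 0.7 success probability is just an arbitrary example value:

```python
from scipy.stats import bernoulli

p = 0.7  # arbitrary probability that the event happens (outcome 1)

# Probability mass at each of the two binary outcomes.
print(bernoulli.pmf(1, p))  # 0.7 -> P(outcome = 1)
print(bernoulli.pmf(0, p))  # ~0.3 -> P(outcome = 0)

# Draw ten simulated binary outcomes.
print(bernoulli.rvs(p, size=10, random_state=0))
```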
Popular Algorithms in Binary Classification
Binary classification typically uses simpler, faster algorithms. Here are a few to consider in your modeling.
- Logistic Regression (only natively supports 2 classes)
- Decision Trees
- k-Nearest Neighbors (kNNs)
- Naïve-Bayes
- Support Vector Machine (only natively supports 2 classes)
Multi-Class Classification
Multi-Class Classification refers to classification tasks that have more than two class labels. Multi-class classification doesn't usually have a notion of "normal" or "abnormal". Rather, examples are classified as belonging to one class among a range of known classes.
Examples of Multi-Class Classification
- Color of flowers (red, blue, yellow)
- Animal type (cat, dog, fish)
- Bird species (finch, bluejay, hummingbird)
- Clothing sizes (small, medium, large)
Multi-Class Labeling
Staying organized is important in multi-class labeling, especially when you have a large number of classes. Multi-class labeling usually starts at 0 (for example, "red" = 0, "blue" = 1, "yellow" = 2).
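scikit-learn's LabelEncoder can build and document this mapping for you. A short sketch using the flower-color example; note that it assigns numbers in alphabetical order, so the result may differ from a hand-built mapping:

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "yellow", "red", "blue"]

encoder = LabelEncoder()
y = encoder.fit_transform(colors)

# classes_ records the mapping: position in the array = numeric label.
print(encoder.classes_)              # ['blue' 'red' 'yellow']
print(y)                             # [1 0 2 1 0]
print(encoder.inverse_transform(y))  # back to the original strings
```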
Categorical Distributions
A Categorical Distribution (also called a generalized Bernoulli distribution or Multinoulli distribution) is a discrete probability distribution that assigns probabilities to each of K possible classes. The probability of each class is specified separately by the distribution.
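A categorical distribution is easy to sketch with numpy; here K = 3 classes with arbitrary example probabilities:

```python
import numpy as np

classes = ["red", "blue", "yellow"]  # K = 3 possible classes
probs = [0.5, 0.3, 0.2]              # one probability per class, summing to 1

# Draw ten samples from the categorical (Multinoulli) distribution.
rng = np.random.default_rng(seed=0)
print(rng.choice(classes, size=10, p=probs))
```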
Popular Algorithms in Multi-Class Classification
Multi-class classification can use a wide array of potential algorithms. Here are a few, with a short sketch after the list.
- k-Nearest Neighbors (kNNs)
- Decision Trees
- Random Forest
- Naïve-Bayes
- Gradient Boosting
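A minimal multi-class sketch using one of these, k-Nearest Neighbors, on scikit-learn's bundled iris dataset (three flower species, so three class labels):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris has three classes (labeled 0, 1, 2), one per species.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out flowers
```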
Adapting Binary Algorithms for Multi-Class
If you're more familiar and comfortable with binary algorithms, like logistic regression or support vector machines, you may be in luck: these algorithms can be adapted for multi-class use via the one-vs-rest and one-vs-one strategies (sketched in the code after this list).
- One-vs-rest: fits each class vs. all other classes. (Is the flower red or anything but red?)
- One-vs-one: runs the binary algorithm once for each pair of classes. (Is the flower red or blue? Is the flower red or yellow? Is the flower blue or yellow?)
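scikit-learn ships both strategies as reusable wrappers. A minimal sketch, assuming scikit-learn is available, applying each wrapper to a linear SVM (which natively supports only 2 classes) on the three-class iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # three classes

# One-vs-rest: one binary model per class (red vs. not-red, etc.).
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

# One-vs-one: one binary model per pair of classes.
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)

print(len(ovr.estimators_))  # 3 models, one per class
print(len(ovo.estimators_))  # 3 models, one per pair (3 choose 2)
```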
Multi-Label Classification
Multi-Label Classification refers to classification tasks where each example can be assigned two or more class labels at once.
Examples of Multi-Label Classification
Multi-label classification is increasingly common as we aim to understand more about datasets using AI. The most practical applications are in computer vision.
- Items in a picture
- Movie genre based on a movie poster
- Disclaimers on videos (vulgar, graphic, offensive, profanity)
- Flagging content on social media
Popular Algorithms in Multi-Label Classification
Multi-label classification can't directly pull from standard binary or multi-class algorithms. Rather, specialized versions of those algorithms are utilized (a sketch follows the list).
- Multi-label Decision Trees
- Multi-label Random Forests
- Multi-label Gradient Boosting
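As one example, scikit-learn's random forest handles multi-label targets when y is a binary indicator matrix. A minimal sketch on a synthetic multi-label dataset:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

# y is a binary indicator matrix: one column per label, and each
# example can switch on several labels at once.
X, y = make_multilabel_classification(
    n_samples=200, n_classes=4, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X, y)
print(model.predict(X[:1]))  # e.g. [[0 1 1 0]] -> two labels active
```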
Multi-Label vs. Multi-Class: What's the Difference?
Multi-label and multi-class classification can be difficult to distinguish. Consider favorite fruit vs. liked fruits: if you have a dataset of 1,000 people who have been asked to rank apples, bananas, oranges, and grapes and you are seeking insight on only each person's favorite fruit, multi-class is appropriate. If you are seeking a model of all the fruits each person likes, say their top three, multi-label is appropriate. Another common example is in computer vision and image analysis. If you are analyzing a dataset of facial expressions/emotions to figure out the distribution between happy, sad, angry, and scared, multi-class is appropriate. If you are attempting to identify the crossover between sad and angry, multi-label is more appropriate.
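The difference shows up directly in the shape of the label data. A tiny sketch using the fruit example (the specific values are invented):

```python
# Multi-class: each person gets exactly one label (their favorite fruit).
favorite_fruit = ["apple", "grape", "banana"]  # one label per person

# Multi-label: each person gets a yes/no flag for every fruit they like.
#               apple  banana  orange  grape
liked_fruits = [
                [1,    0,      1,      1],  # person 1 likes three fruits
                [0,    1,      0,      0],  # person 2 likes one fruit
                [1,    1,      0,      1],  # person 3 likes three fruits
]
```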
Imbalanced Classification
Imbalanced Classification refers to classification where the examples in each class are unequally distributed.
Examples of Imbalanced Classification
Imbalanced classification usually refers to a dataset where the vast majority of data points belong to one class, with a few important outliers in the other.
- Medical diagnostic tests
- Claim prediction
- Finding outliers
- Fraud detection
Sampling Techniques
In imbalanced classification, you may need to revisit your dataset. You can change the composition of the sample by undersampling the majority class or oversampling the minority class.

Sampling Techniques for Imbalanced Classification

| Over-Sampling | Under-Sampling |
| --- | --- |
| Random Over-Sampling | Random Under-Sampling |
| Random Over-Sampling with imblearn | Random Under-Sampling with imblearn |
| Synthetic Minority Oversampling Technique (SMOTE) | Tomek Links |
| | NearMiss |
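The imblearn (imbalanced-learn) package mentioned in the table exposes these techniques as resamplers that share a fit_resample interface. A minimal sketch, assuming imbalanced-learn is installed and using a synthetic 9-to-1 imbalanced dataset:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# A roughly 9-to-1 imbalanced binary dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # e.g. roughly {0: 900, 1: 100}

# Oversample the minority class: randomly, or synthetically with SMOTE.
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)

# Or undersample the majority class instead.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)

print(Counter(y_ros), Counter(y_sm), Counter(y_rus))  # all balanced
```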
Popular Algorithms in Imbalanced Classification
Just like multi-label classification, the algorithms utilized in imbalanced classification need to be specialized to model correctly. You can use "cost-sensitive" algorithms that pay closer attention to minority classes. Logistic regression, decision trees, and support vector machine algorithms all have cost-sensitive specialized options.
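In scikit-learn, one common way to get that cost-sensitive behavior is the class_weight parameter, which makes mistakes on rare classes cost more during training. A minimal sketch covering the three algorithms just mentioned:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# class_weight="balanced" reweights each class inversely to its
# frequency, so errors on the minority class are penalized more.
models = [
    LogisticRegression(class_weight="balanced", max_iter=1000),
    DecisionTreeClassifier(class_weight="balanced"),
    SVC(class_weight="balanced"),
]
```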
Classification in Machine Learning
There are four main types of classification tasks in machine learning: binary, multi-class, multi-label, and imbalanced. We've given you a few examples and starter algorithms to use as you train your next model. Here at Elevate, we use multi-class and multi-label classification to train our computer vision edge-AI on a wide dataset of images.