What is SageMaker Autopilot

SageMaker Autopilot is a service provided by Amazon that allows you to automatically build and train machine learning models on your data, with relatively little data science knowledge.

It abstracts away the heavy lifting of building machine learning models. Given a labeled data tabular data set and a target column to predict, SageMaker will generate multiple candidate models you can deploy and enhance your platform with.

Additionally, it gives you great visibility into how it generates candidate models and allows you to tweak, modify and train your own variants.

What problems can you use it for?

SageMaker Autopilot generates models that either predict values (regression), or predict categories (classification).

Regression algorithms are used to predict the continuous values such as price, salary, age, etc. and classification algorithms are used to predict/classify discrete values such as Cat or Dog, Spam or Not Spam, etc. Autopilot supports both binary and multi-class classification.

Source: https://www.javatpoint.com/regression-vs-classification-in-machine-learning

How does it work?

There are three stages that Autopilot runs through

Analyzes Data

Produces two notebooks
- Data Exploration
- Candidate Definition Notebook
  - Identifies problem type to be solved
  - Data pre-processing
  - Candidate pipelines
    - Using ML model suitable for the problem type supplied or identified

Feature engineering

Creates training and validation datasets for each pipeline
Engineers new features like automatically extracting information from non-numeric columns, such as date and time information from timestamps.

Model tuning

Launches a hyperparameter optimization job for each candidate pipeline
- Runs many training jobs while iterating on combinations of hyperparameters
- Assesses and selects highest performing model

Untitled

Getting Started

This tutorial assumes basic familiarity with AWS services such as IAM, S3 and EC2

Our use case: Predict caloric intake based sleep, recovery and heart rate data

Model training

Create an Amazon Web Services account
Setup SageMaker Studio, this is the environment where you can create and run Autopilot experiments
Upload your labelled data to the S3 bucket your studio created

Screen Shot 2021-08-14 at 3.52.13 PM.png

Screen Shot 2021-08-14 at 3.52.52 PM.png

Autopilot requires you're data be in CSV format and should contain at least 500 rows

Launch and configure the experiment

Screen Shot 2021-08-14 at 4.12.33 PM.png

Experiment options

Experiment name:
- Should be unique within your account in an AWS Region
S3 Bucket name:
- Bucket you uploaded your source data to
- SageMaker must have read permissions to this bucket.
Dataset file name:
- Name of your source data file
Target: Specifies which column you want to predict
Output data location:
- Where Sagemaker stores trial artifacts
- Sagemaker must have write permissions on this location
Machine learning problem type: specifies the kind of data preprocessing and algorithm selection for Autopilot to use
- Auto: Autopilot will automatically infer the problem type
- Regression: Predicting continuous numeric values
- Binary classification: Classifying two categories i.e Cat or Not
- Multiclass classification: Classifying more 3 or more categories i.e Cat, Dog or Neither

Run your experiment and view the artefacts generated

Screen Shot 2021-08-15 at 3.43.51 PM.png

Once your experiment is completed you will get a list of candidate models with your best model highlighted.

You can look under the hood and see the data exploration and candidate generation notebook Autopilot used to create your models.

We wont be needing to go into these, but this is where Data Scientists can tweak and modify the approaches created and generate their own models.

Screen Shot 2021-08-15 at 4.55.32 PM.png

Model Deployment

There are two primary modes of model deployment within SageMaker service, Inference Endpoints and Batch Transform jobs.

Inference Endpoint

Real time predictions
Model deployed to a virtual server
- automatically handles concurrent requests

Batch Transform Jobs

Want to get predictions for an entire dataset
Don't need a persistent endpoint that applications (for example, web or mobile apps) can call to get predictions
Don't need the sub-second latency that SageMaker hosted endpoints provide

Screen Shot 2021-08-15 at 5.57.32 PM.png

Prediction

Invoking our endpoint within a Sagemaker Studio Notebook, we can see how we might interact with our API for real-time inference! Cool huh.

Screen Shot 2021-08-15 at 6.11.00 PM.png

And there you have it, your very first Sagemaker machine learning model.