What is SageMaker Autopilot?


SageMaker Autopilot is a service provided by Amazon that allows you to automatically build and train machine learning models on your data, with relatively little data science knowledge.

It abstracts away the heavy lifting of building machine learning models. Given a labeled tabular dataset and a target column to predict, Autopilot generates multiple candidate models that you can deploy to enhance your platform.

Additionally, it gives you great visibility into how it generates candidate models and allows you to tweak, modify and train your own variants.

What problems can you use it for?

SageMaker Autopilot generates models that either predict values (regression), or predict categories (classification).

Regression algorithms predict continuous values such as price, salary or age. Classification algorithms predict discrete categories such as Cat or Dog, or Spam or Not Spam. Autopilot supports both binary and multi-class classification.

[Figure: regression vs. classification in machine learning]

Source: https://www.javatpoint.com/regression-vs-classification-in-machine-learning

How does it work?

Autopilot runs through three stages:

Analyzes Data

  • Produces two notebooks
    • Data Exploration
    • Candidate Definition Notebook
      • Identifies problem type to be solved
      • Data pre-processing
      • Candidate pipelines
        • Using ML model suitable for the problem type supplied or identified

Feature engineering

  • Creates training and validation datasets for each pipeline
  • Engineers new features, for example by automatically extracting date and time information from timestamps and other non-numeric columns
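
To make the timestamp case concrete, here is a minimal sketch of that kind of feature extraction using only the Python standard library. The function name `timestamp_features` and the particular features are my own illustrative choices, not what Autopilot actually generates.

```python
from datetime import datetime


def timestamp_features(timestamp: str) -> dict:
    """Split an ISO 8601 timestamp into simple numeric features."""
    moment = datetime.fromisoformat(timestamp)
    return {
        "year": moment.year,
        "month": moment.month,
        "day_of_week": moment.weekday(),  # Monday is 0, Sunday is 6
        "hour": moment.hour,
        "is_weekend": int(moment.weekday() >= 5),
    }


print(timestamp_features("2021-08-14T15:23:36"))
```

A single text column becomes several numeric columns that a model can learn from directly.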

Model tuning

  • Launches a hyperparameter optimization job for each candidate pipeline
    • Runs many training jobs while iterating on combinations of hyperparameters
    • Assesses and selects highest performing model
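
The tuning loop can be pictured with a toy example. The sketch below uses plain random search, which is a simple stand-in and not necessarily the strategy Autopilot uses. The `validation_error` function is a made-up placeholder for a real training job, and the hyperparameter names and ranges are assumptions.

```python
import random


def validation_error(learning_rate: float, depth: int) -> float:
    """Stand-in for a training job: returns a made-up validation error."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 6) ** 2


def random_search(trials: int, seed: int = 0) -> dict:
    """Try random hyperparameter combinations and keep the lowest error."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        params = {
            "learning_rate": rng.uniform(0.01, 0.3),
            "depth": rng.randint(2, 10),
        }
        error = validation_error(**params)
        if best is None or error < best["error"]:
            best = {"params": params, "error": error}
    return best


best = random_search(trials=50)
print(best["params"], round(best["error"], 4))
```

Each loop iteration corresponds to one training job, and the best-scoring combination is what ends up as the selected model.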


Getting Started

This tutorial assumes basic familiarity with AWS services such as IAM, S3 and EC2.

Our use case: Predict caloric intake based on sleep, recovery and heart rate data

Model training

  1. Create an Amazon Web Services account
  2. Set up SageMaker Studio. This is the environment where you create and run Autopilot experiments.
  3. Upload your labeled data to the S3 bucket that Studio created


Autopilot requires your data to be in CSV format, and the file should contain at least 500 rows.
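
If you would rather check the file and upload it from code than through the console, something like the following should work. It is a sketch under assumptions: the function names are mine, and the upload step needs `boto3` installed and AWS credentials configured.

```python
import csv


def validate_training_csv(path: str, target: str, min_rows: int = 500) -> int:
    """Check that the CSV has the target column and enough data rows."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if target not in (reader.fieldnames or []):
            raise ValueError(f"target column {target!r} not found in header")
        row_count = sum(1 for _ in reader)
    if row_count < min_rows:
        raise ValueError(f"only {row_count} rows, need at least {min_rows}")
    return row_count


def upload_to_s3(path: str, bucket: str, key: str) -> None:
    """Upload the file with boto3 (needs AWS credentials configured)."""
    import boto3  # imported here so validation works without boto3 installed

    boto3.client("s3").upload_file(path, bucket, key)
```

You would call `validate_training_csv("health.csv", target="calories")` and then `upload_to_s3("health.csv", "my-studio-bucket", "autopilot/health.csv")`, where the file, column, bucket and key names are all hypothetical.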

  4. Launch and configure the experiment


Experiment options

  • Experiment name:
    • Should be unique within your account in an AWS Region
  • S3 Bucket name:
    • Bucket you uploaded your source data to
    • SageMaker must have read permissions to this bucket.
  • Dataset file name:
    • Name of your source data file
  • Target: Specifies which column you want to predict
  • Output data location:
    • Where SageMaker stores trial artifacts
    • SageMaker must have write permissions on this location
  • Machine learning problem type: specifies the kind of data preprocessing and algorithm selection for Autopilot to use
    • Auto: Autopilot will automatically infer the problem type
    • Regression: Predicting continuous numeric values
    • Binary classification: Classifying into two categories, e.g. Cat or Not
    • Multiclass classification: Classifying into three or more categories, e.g. Cat, Dog or Neither
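
The same options map fairly directly onto the `create_auto_ml_job` call in `boto3`, if you prefer to launch the experiment from code. The sketch below builds the request as a plain dictionary so you can inspect it first. The helper names are mine, the default metric per problem type reflects my reading of the Autopilot docs, and I am assuming the problem type and objective are supplied together, so treat those details as assumptions to verify.

```python
from typing import Optional

# Assumed default objective metric for each problem type.
DEFAULT_METRICS = {
    "Regression": "MSE",
    "BinaryClassification": "F1",
    "MulticlassClassification": "Accuracy",
}


def build_automl_request(
    job_name: str,
    input_uri: str,
    target: str,
    output_uri: str,
    role_arn: str,
    problem_type: Optional[str] = None,
) -> dict:
    """Translate the experiment options into a create_auto_ml_job request."""
    request = {
        "AutoMLJobName": job_name,
        "InputDataConfig": [
            {
                "DataSource": {
                    "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_uri}
                },
                "TargetAttributeName": target,
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_uri},
        "RoleArn": role_arn,
    }
    if problem_type is not None:  # leave both keys out for the "Auto" option
        request["ProblemType"] = problem_type
        request["AutoMLJobObjective"] = {"MetricName": DEFAULT_METRICS[problem_type]}
    return request


def start_experiment(request: dict) -> str:
    """Submit the request (needs boto3 and AWS credentials)."""
    import boto3

    response = boto3.client("sagemaker").create_auto_ml_job(**request)
    return response["AutoMLJobArn"]
```

For the caloric intake use case you would pass `problem_type="Regression"`, or leave it as `None` to let Autopilot infer the problem type.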
  5. Run your experiment and view the artifacts generated


Once your experiment is completed you will get a list of candidate models with your best model highlighted.
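
You can also read the best candidate programmatically. This sketch assumes the job has finished, since `BestCandidate` only appears in the `describe_auto_ml_job` response once a candidate exists. The helper names are hypothetical.

```python
def summarize_best_candidate(description: dict) -> dict:
    """Pull the best candidate's name and objective metric from a job description."""
    best = description["BestCandidate"]
    metric = best["FinalAutoMLJobObjectiveMetric"]
    return {
        "name": best["CandidateName"],
        "metric": metric["MetricName"],
        "value": metric["Value"],
    }


def fetch_best_candidate(job_name: str) -> dict:
    """Look up a finished Autopilot job (needs boto3 and AWS credentials)."""
    import boto3

    description = boto3.client("sagemaker").describe_auto_ml_job(
        AutoMLJobName=job_name
    )
    return summarize_best_candidate(description)
```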

You can look under the hood and see the data exploration and candidate definition notebooks Autopilot used to create your models.

We won't need to go into these, but this is where data scientists can tweak and modify the approaches created and generate their own models.


Model Deployment

There are two primary modes of model deployment within SageMaker: inference endpoints and batch transform jobs.

Inference Endpoint

  • Real time predictions
  • Model deployed to a virtual server
    • automatically handles concurrent requests
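
Deploying the best candidate to an endpoint from code takes three `boto3` calls: create a model, create an endpoint configuration, and create the endpoint. The sketch below is a rough outline. The helper names, the reuse of one name for all three resources, and the `ml.m5.large` instance type are assumptions. `best_candidate` is the `BestCandidate` entry from the job description.

```python
def build_endpoint_config(
    config_name: str, model_name: str, instance_type: str = "ml.m5.large"
) -> dict:
    """Describe a single-instance endpoint that serves one model."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InitialInstanceCount": 1,
                "InstanceType": instance_type,
            }
        ],
    }


def deploy_best_candidate(best_candidate: dict, name: str, role_arn: str) -> None:
    """Create the model, endpoint config and endpoint (needs boto3 and credentials)."""
    import boto3

    sagemaker = boto3.client("sagemaker")
    sagemaker.create_model(
        ModelName=name,
        Containers=best_candidate["InferenceContainers"],
        ExecutionRoleArn=role_arn,
    )
    sagemaker.create_endpoint_config(**build_endpoint_config(name, name))
    sagemaker.create_endpoint(EndpointName=name, EndpointConfigName=name)
```

Keep in mind that an endpoint keeps running, and billing, until you delete it.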

Batch Transform Jobs

  • Suited to getting predictions for an entire dataset at once
  • No persistent endpoint for applications (for example, web or mobile apps) to call
  • No need for the sub-second latency that SageMaker hosted endpoints provide
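
For completeness, a batch transform job can be started with `create_transform_job`. This is a sketch under assumptions: it presumes a SageMaker model named `model_name` already exists, that the input is CSV with one record per line, and the helper names and instance type are hypothetical.

```python
def build_transform_request(
    job_name: str,
    model_name: str,
    input_uri: str,
    output_uri: str,
    instance_type: str = "ml.m5.large",
) -> dict:
    """Describe a batch transform job over CSV files under an S3 prefix."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",  # treat each line as one record
        },
        "TransformOutput": {"S3OutputPath": output_uri},
        "TransformResources": {"InstanceType": instance_type, "InstanceCount": 1},
    }


def start_batch_transform(request: dict) -> None:
    """Submit the job (needs boto3 and AWS credentials)."""
    import boto3

    boto3.client("sagemaker").create_transform_job(**request)
```

The instances are provisioned for the job and released when it finishes, which is why no persistent endpoint is involved.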


Prediction

Invoking our endpoint from a SageMaker Studio notebook, we can see how we might interact with our API for real-time inference. Cool, huh?
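
In text form, a call to the endpoint might look roughly like this. It is a minimal sketch: the endpoint name and the feature order are hypothetical, and it assumes the model accepts a single CSV row without the target column.

```python
def to_csv_row(features: list) -> str:
    """Serialize one record as a CSV line, in training column order."""
    return ",".join(str(value) for value in features)


def predict(endpoint_name: str, features: list) -> str:
    """Send one record to the endpoint (needs boto3 and AWS credentials)."""
    import boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=to_csv_row(features),
    )
    return response["Body"].read().decode("utf-8").strip()
```

A call such as `predict("calories-endpoint", [7.5, 82, 61])` would send sleep, recovery and heart rate values and return the predicted caloric intake as text.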


And there you have it, your very first SageMaker machine learning model.