Automate Training, Build And Deployment Of Amazon SageMaker Models Using AWS Step Functions

Background

A few weeks back, I was tasked with automating the training, build and deployment of an Amazon SageMaker model. Initially, I thought that an AWS Lambda Function would be the best candidate for this; however, as I started experimenting, I quickly realised that I needed to look elsewhere.

After some research, I found articles that pointed me towards AWS Step Functions. As it happens, AWS has been making AWS Step Functions more Amazon SageMaker friendly, to the point that AWS Step Functions now natively supports most of the Amazon SageMaker APIs.

With the technology decided, I started figuring out how I would achieve what I had set out to do. I did find some good documentation and examples; however, they didn't entirely cover everything I was after.

After much research and experimentation, I finally created a solution that was able to automatically train, build and then deploy an Amazon SageMaker model.

In this blog, I will outline the steps I followed, with the hope that it benefits those wanting to do the same, saving them countless hours of frustration and experimentation.

High Level Architecture Diagram

Below is a high-level architecture diagram of the solution I used.

The steps (as denoted by the numbers in the diagram) are as follows:

  1. The first AWS Step Function state calls the Amazon SageMaker API to create a Training Job, passing it all the necessary parameters.
  2. Using the supplied parameters, Amazon SageMaker downloads the training and validation files from the Amazon S3 bucket, and then runs the training algorithm. The output is uploaded to the same Amazon S3 bucket that contains the training and validation files.
  3. The next AWS Step Function state calls the Amazon SageMaker API to create a model, using the artifacts from the Training Job.
  4. The next AWS Step Function state calls the Amazon SageMaker API to create an endpoint configuration, using the model that was created in the previous state.
  5. The next AWS Step Function state calls the Amazon SageMaker API to create a model endpoint, using the endpoint configuration that was created in the previous state.
  6. Using the endpoint configuration, Amazon SageMaker deploys the model using Amazon SageMaker Hosting Services, making it available to any client wanting to use it.

Let’s get started.

Implementation

For this blog, I will be using the data and training parameters described in the Amazon SageMaker tutorial at https://docs.aws.amazon.com/sagemaker/latest/dg/gs-console.html

1. Create an Amazon S3 bucket. Create a folder called data inside your Amazon S3 bucket, and within it create three subfolders called train, validation and test (technically these are not folders and subfolders, but keys; however, to keep things simple, I will refer to them as folders and subfolders).
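
If you prefer to script this step, a minimal boto3 sketch could look like the following (the bucket name and region are assumptions, substitute your own).

import boto3

# Assumed values - replace with your own bucket name and AWS Region.
bucket_name = "bucketName"
region = "ap-southeast-2"

s3 = boto3.client("s3", region_name=region)

# Create the bucket (outside us-east-1, a LocationConstraint is required).
s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={"LocationConstraint": region},
)

# "Folders" in Amazon S3 are just keys ending in "/", so create them as empty objects.
for prefix in ["data/train/", "data/validation/", "data/test/"]:
    s3.put_object(Bucket=bucket_name, Key=prefix)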

2. Follow Step 4 from the above-mentioned Amazon SageMaker tutorial (https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-preprocess-data.html) to download and transform the training, validation and test data, then upload the data to the respective subfolders in your Amazon S3 bucket (we won't be using the test data in this blog; however, you can use it to test the deployed model).
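
To upload the transformed files, a short boto3 sketch like the one below works. The local file names are assumptions based on what Step 4 of the tutorial produces, and the object keys match the folders created in Step 1.

import boto3

bucket_name = "bucketName"  # assumed - replace with the bucket created in Step 1
s3 = boto3.client("s3")

# Local file names are assumptions; use whatever names the tutorial's Step 4 produced.
s3.upload_file("train.csv", bucket_name, "data/train/examples")
s3.upload_file("validation.csv", bucket_name, "data/validation/examples")
s3.upload_file("test.csv", bucket_name, "data/test/examples")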

For the next three months, you can download the transformed training, validation and test data from my Amazon S3 bucket using the following URLs

https://niv-sagemaker.s3-ap-southeast-2.amazonaws.com/data/train/examples

https://niv-sagemaker.s3-ap-southeast-2.amazonaws.com/data/validation/examples

https://niv-sagemaker.s3-ap-southeast-2.amazonaws.com/data/test/examples

3. Create an AWS IAM role with the following permissions
AmazonSageMakerFullAccess (AWS Managed Policy)

and a custom policy to read, write and delete objects from the Amazon S3 bucket created in Step 1. The policy will look similar to the one below

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:ListBucket"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::bucketName"
      ]
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::bucketName/*"
      ]
    }
  ]
}

where bucketName is the name of the Amazon S3 bucket created in Step 1 above.

4. Open the AWS Step Functions console and change to the AWS Region where the Amazon SageMaker model endpoint will be deployed.

5. Create a new state machine, choose Author with code snippets and set the type to Standard.

6. Under Definition, delete everything and paste the following:
{
  "Comment": "An AWS Step Function State Machine to train, build and deploy an Amazon SageMaker model endpoint",
  "StartAt": "Create Training Job",

The above lines provide a comment describing the purpose of this AWS Step Function state machine and set the first state's name to Create Training Job.

For a full list of Amazon SageMaker APIs supported by AWS Step Functions, please refer to https://docs.aws.amazon.com/step-functions/latest/dg/connect-sagemaker.html

7. Use the following code to create the first state (these are the training parameters described in the above-mentioned Amazon SageMaker tutorial).

"States": {
"Create Training Job": {
"Type": "Task",
"Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
"Parameters": {
"TrainingJobName.$": "$$.Execution.Name",
"ResourceConfig": {
"InstanceCount": 1,
"InstanceType": "ml.m4.xlarge",
"VolumeSizeInGB": 5
},
"HyperParameters": {
"max_depth": "5",
"eta": "0.2",
"gamma": "4",
"min_child_weight": "6",
"silent": "0",
"objective": "multi:softmax",
"num_class": "10",
"num_round": "10"
},
"AlgorithmSpecification": {
"TrainingImage": "544295431143.dkr.ecr.ap-southeast-2.amazonaws.com/xgboost:1",
"TrainingInputMode": "File"
},
"OutputDataConfig": {
"S3OutputPath": "s3://bucketName/data/modelartifacts"
},
"StoppingCondition": {
"MaxRuntimeInSeconds": 86400
},
"RoleArn": "iam-role-arn",
"InputDataConfig": [
{
"ChannelName": "train",
"ContentType": "text/csv",
"DataSource": {
"S3DataSource": {
"S3DataType": "S3Prefix",
"S3Uri": "s3://bucketName/data/train",
"S3DataDistributionType": "FullyReplicated"
}
}
},
{
"ChannelName": "validation",
"ContentType": "text/csv",
"DataSource": {
"S3DataSource": {
"S3DataType": "S3Prefix",
"S3Uri": "s3://bucketName/data/validation",
"S3DataDistributionType": "FullyReplicated"
}
}
}
]
},
"Retry": [
{
"ErrorEquals": [
"SageMaker.AmazonSageMakerException"
],
"IntervalSeconds": 1,
"MaxAttempts": 1,
"BackoffRate": 1.1
},
{
"ErrorEquals": [
"SageMaker.ResourceLimitExceededException"
],
"IntervalSeconds": 60,
"MaxAttempts": 1,
"BackoffRate": 1
},
{
"ErrorEquals": [
"States.Timeout"
],
"IntervalSeconds": 1,
"MaxAttempts": 1,
"BackoffRate": 1
}
],
"Catch": [
{
"ErrorEquals": [
"States.ALL"
],
"ResultPath": "$.cause",
"Next": "Display Error"
}
],
"Next": "Create Model"
},

I would like to call out a few things from the above code.

"Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync" refers to the Amazon SageMaker API for creating a Training Job. When this state task runs, you will be able to see this Training Job in the Amazon SageMaker console.

TrainingJobName is the name given to the Training Job and it must be unique within the AWS Region, in the AWS account. In my code, I am setting this to the Execution Name (internally referred to as $$.Execution.Name), which is an optional parameter that can be supplied when executing the AWS Step Function state machine. By default, this is set to a unique random string, however to make the Training Job name more recognisable, provide a more meaningful unique value when executing the state machine. I tend to use the current time in the format <training-algorithm>-<year><month><date>-<hour><minute><second>
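
For example, a small boto3 sketch that starts an execution with a name in that format might look like this (the state machine ARN is a placeholder):

import boto3
from datetime import datetime

# Placeholder ARN - replace with the ARN of your state machine.
state_machine_arn = "arn:aws:states:ap-southeast-2:123456789012:stateMachine:sagemaker-train-build-deploy"

# Execution name in the format <training-algorithm>-<year><month><date>-<hour><minute><second>.
execution_name = "xgboost-" + datetime.now().strftime("%Y%m%d-%H%M%S")

sfn = boto3.client("stepfunctions")
response = sfn.start_execution(
    stateMachineArn=state_machine_arn,
    name=execution_name,  # becomes $$.Execution.Name, and therefore the Training Job name
    input="{}",
)
print(response["executionArn"])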

If you have ever used Jupyter notebooks to run an Amazon SageMaker Training Job, you would have used a line similar to the following:

        container = get_image_uri(boto3.Session().region_name, 'xgboost')

Yes, your guess is correct! Amazon SageMaker uses containers for running Training Jobs. The above retrieves the URI of the xgboost training algorithm container for the region that the Jupyter notebook is running in.

These containers are hosted in Amazon Elastic Container Registry (Amazon ECR) and maintained by AWS. For each training algorithm that Amazon SageMaker supports, there is a specific container. Details for these containers can be found at https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html.

When submitting a Training Job using AWS Step Functions, you must supply the correct container name, from the correct region (the region where you will be running Amazon SageMaker from). This information is passed using the parameter TrainingImage. To find the correct container path, use https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html.

Another method for getting the value for TrainingImage is to manually submit a Training Job using the Amazon SageMaker console, using the same training algorithm that you will be using in the AWS Step Function state machine. Once the job has started, open it and have a look under the section Algorithm. You will find the Training Image for that particular training algorithm, for that region, listed there. You can use this value for TrainingImage.
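
If you have the SageMaker Python SDK installed (version 1.x, where get_image_uri is available), you can also look this value up programmatically; a quick sketch:

import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri

# Look up the XGBoost training image for the current region (SageMaker Python SDK v1.x).
region = boto3.Session().region_name
print(get_image_uri(region, "xgboost"))
# For ap-southeast-2 this prints 544295431143.dkr.ecr.ap-southeast-2.amazonaws.com/xgboost:1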

S3OutputPath is the location where Amazon SageMaker will store the model artifacts after the Training Job has successfully finished.

RoleArn is the ARN of the AWS IAM Role that was created in Step 3 above.

S3Uri under ChannelName: train is the Amazon S3 bucket path to the folder where the training data is located.

S3Uri under ChannelName: validation is the Amazon S3 bucket path to the folder where the validation data is located.

DON’T FORGET TO CHANGE bucketName TO THE AMAZON S3 BUCKET THAT WAS CREATED IN STEP 1 ABOVE

In the next AWS Step Function state, the model will be created using the artifacts generated from the Training Job.

8. An AWS Step Function state receives input parameters, does its processing and then produces an output. If there is another state next in the path, the output is provided as an input to that state. This is an elegant way of passing dynamic information between states.

Here is the code for the next AWS Step Functions state.
"Create Model": {
"Parameters": {
"PrimaryContainer": {
"Image": "544295431143.dkr.ecr.ap-southeast-2.amazonaws.com/xgboost:1",
"Environment": {},
"ModelDataUrl.$": "$.ModelArtifacts.S3ModelArtifacts"
},
"ExecutionRoleArn": "iam-role-arn",
"ModelName.$": "$.TrainingJobName"
},
"Resource": "arn:aws:states:::sagemaker:createModel",
"Type": "Task",
"ResultPath":"$.taskresult",
"Next": "Create Endpoint Config"
},

Image refers to the same container that was used in the Create Training Job state.

ModelDataUrl refers to the location where the model artifacts that were created in the previous state are stored. This value is part of the output (input to this state) from the previous state. To reference it, use $.ModelArtifacts.S3ModelArtifacts

ExecutionRoleArn is the ARN of the AWS IAM Role that was created in Step 3 above.

"Resource": "arn:aws:states:::sagemaker:createModel" refers to the Amazon SageMaker API for creating a model.

To keep things simple, the name of the generated model will be set to the TrainingJobName. This value is part of the output (input to this state) from the previous state. To reference it, use $.TrainingJobName

After this state finishes execution, you will be able to see the model in the Amazon SageMaker console.

The next state is for creating an Endpoint Configuration using the model that was just created.

Before we continue, I want to point out an additional parameter that I am using: "ResultPath": "$.taskresult". Let me explain the reason for using this. In my next state, I must provide the name of the model that will be used to create the Endpoint Configuration. Unfortunately, this name is not part of the output of the current Create Model state, so I won't be able to reference it. However, as you might remember, for simplicity we set the model name to be the same as TrainingJobName, and guess what, this is part of the current state's input parameters! Now, if only there was a way to make the current state include its input parameters in its output. Oh wait! There is a way. Setting "ResultPath": "$.taskresult" instructs this AWS Step Function state to nest its task result under the taskresult key of its input, so the original input parameters (including TrainingJobName) are carried through as part of the output.
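
To make this concrete, here is a rough Python illustration (not Step Functions code, and the values are made up) of what "ResultPath": "$.taskresult" does to the data flowing through the state:

# Rough illustration only - this mimics the effect of "ResultPath": "$.taskresult".
state_input = {
    "TrainingJobName": "xgboost-20200101-101500",
    "ModelArtifacts": {"S3ModelArtifacts": "s3://bucketName/data/modelartifacts/model.tar.gz"},
}
task_result = {"ModelArn": "arn:aws:sagemaker:ap-southeast-2:123456789012:model/xgboost-20200101-101500"}

# The state's output is its input with the task result nested under "taskresult",
# so TrainingJobName is still available to the next state.
state_output = {**state_input, "taskresult": task_result}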

9. Here is the code for the AWS Step Function state to create an Endpoint Config.

"Create Endpoint Config": {
"Type": "Task",
"Resource": "arn:aws:states:::sagemaker:createEndpointConfig",
"Parameters":{
"EndpointConfigName.$": "$.TrainingJobName",
"ProductionVariants": [
{
"InitialInstanceCount": 1,
"InstanceType": "ml.t2.medium",
"ModelName.$": "$.TrainingJobName",
"VariantName": "AllTraffic"
}
]
},
"ResultPath":"$.taskresult",
"Next":"Create Endpoint"
},

This state is pretty straightforward.

"Resource": "arn:aws:states:::sagemaker:createEndpointConfig" refers to the Amazon SageMaker API to create an Endpoint Configuration.

For simplicity, we will set the Endpoint Configuration name to be the same as the TrainingJobName. The Endpoint will initially be deployed using one ml.t2.medium instance.

As in the previous state, we will use "ResultPath": "$.taskresult" so that the input parameters are carried through into the output of this state.

In the final state, I will instruct Amazon SageMaker to deploy the model endpoint.

10. Here is the code for the final AWS Step Function state.

"Create Endpoint":{
"Type":"Task",
"Resource":"arn:aws:states:::sagemaker:createEndpoint",
"Parameters":{
"EndpointConfigName.$": "$.TrainingJobName",
"EndpointName.$": "$.TrainingJobName"
},
"End": true
},

The Endpoint Configuration from the previous state is used to deploy the model endpoint using Amazon SageMaker Hosting Services.

"Resource": "arn:aws:states:::sagemaker:createEndpoint" refers to the Amazon SageMaker API for deploying an endpoint using Amazon SageMaker Hosting Services. After this state completes successfully, the endpoint is visible in the Amazon SageMaker console.

For simplicity, the name of the Endpoint is set to the same value as TrainingJobName.
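
Once the endpoint shows as InService, you can test it with the test data from Step 2. Below is a minimal boto3 sketch; the endpoint name and the local test file name are assumptions, and it assumes the first column of the transformed data is the label, so that column is dropped before sending the features.

import boto3

# Assumed values - the endpoint name is whatever execution name you used for the state machine.
endpoint_name = "xgboost-20200101-101500"

# Read one row from the locally saved test data and drop the label column.
with open("test.csv") as f:
    row = f.readline().strip()
payload = row.split(",", 1)[1]

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/csv",
    Body=payload,
)
print("Predicted class:", response["Body"].read().decode("utf-8"))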

To keep things tidy, it is nice to display an error when things don’t go as planned. There is an AWS Step Function state for that!

11. Here is the code for the state that displays the error message. This state only gets invoked if there is an error in the Create Training Job state.

"Display Error":{
"Type": "Pass",
"Result": "Finished with errors. Please check the individual steps for more information",
"End": true
}

The full AWS Step Function state machine code is available at  https://gist.github.com/nivleshc/a4a99a5c2bca1747b6da0d7da0e388c1

When creating the AWS Step Function state machine, you will be asked for an AWS IAM Role that will be used by the state machine to run the states. Unless you already have an AWS IAM Role that can carry out all the state tasks, choose the option to create a new AWS IAM Role.

To invoke the AWS Step Function state machine, just click on New execution and provide a name for the execution id. As each state runs, you will see visual feedback in the AWS Step Function schematic. You will be able to see the tasks in the Amazon SageMaker console as well.

To take the above one step further, you could invoke the AWS Step Function state machine whenever new training and validation data is available in the Amazon S3 bucket. The new model can then be used to update the existing model endpoint.
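
A hedged sketch of how that trigger could look, as an AWS Lambda function subscribed to the bucket's ObjectCreated events (the state machine ARN is assumed to be supplied via an environment variable):

import os
from datetime import datetime

import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    # Triggered whenever new objects land under the training/validation prefixes.
    # STATE_MACHINE_ARN is an assumed environment variable holding the state machine's ARN.
    execution_name = "xgboost-" + datetime.now().strftime("%Y%m%d-%H%M%S")
    response = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        name=execution_name,
        input="{}",
    )
    return {"executionArn": response["executionArn"]}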

That's it, folks! This is how you can automatically train, build and deploy an Amazon SageMaker model!

Once you are finished, don't forget to clean up, to avoid any unnecessary costs.

The following must be deleted using the Amazon SageMaker console

  • The model endpoint
  • The Endpoint Configuration
  • The model
  • Any Jupyter Notebook instances you might have provisioned and don’t need anymore
  • Any Jupyter notebooks that are not needed anymore

If you don’t have any use for the following, these can also be deleted.

  • The contents of the Amazon S3 bucket and the bucket itself
  • The AWS IAM Role that was created and the custom policy to access the Amazon S3 bucket and its contents
  • The AWS Step Function state machine

Till the next time, Enjoy!

Using Amazon Alexa to drive a radio-controlled car – Part 1

Introduction

Over the Easter holidays, while watching my son play with his radio-controlled toy car, a strange thought popped into my head. Instead of using the sticks on the remote control, wouldn't it be cool to control the car using just your voice? You could tell the car to move forward, backward, left or right. What if you could save all the moves you have asked the car to take so far and then, at a later time, get the car to replay all those moves?

Now, that would be a car I would love to play with!

In this blog, I will introduce the high-level design for accomplishing the above-mentioned goal. Then over the next few blogs I will take you through the steps to transform the high-level design into a working prototype.

Hardware Requirements

For this prototype, I settled on using the following hardware devices

  • Amazon Echo Dot – this will be used to process my voice commands
  • Raspberry Pi 3 with a GPIO expansion Breadboard
  • A set of four 5V relay board modules
  • A radio-controlled race car
  • A soldering iron, solder wire and a digital multimeter

Design considerations

To make the prototype work, I decided to create an Amazon Alexa Skill called race car. This will be used to process my voice commands.

Challenge #1: How would I control the radio-controlled car?

I found two options for this

1. Completely bypass the remote control and send the radio frequency instructions directly to the race car

2. Emulate the button presses on the remote control so that it “thinks” someone is pressing those buttons and then it sends the appropriate radio frequency instructions to the race car

Option Chosen: I chose option 2 because it required the least amount of work. For this option, the only thing I needed to figure out was what happened when a button was pressed. After some experimentation, I found the contacts on the printed circuit board (PCB) of the remote control that I could open and close to emulate the button presses.

Challenge #2: I will use a python script running on a Raspberry Pi 3 within my home network to emulate the button presses on the remote control. How will I get the Amazon Alexa Skill to connect to my Raspberry Pi 3 which is running on my internal home network?

Solution: I found a neat trick at https://developer.amazon.com/blogs/post/Tx14R0IYYGH3SKT/flask-ask-a-new-python-framework-for-rapid-alexa-skills-kit-development. To expose the python script on my internal Raspberry Pi 3 to the Amazon Alexa Skill, I will use ngrok (https://ngrok.com) to create a secure tunnel between my Raspberry Pi 3 and the ngrok service. This provides me with an HTTPS endpoint within ngrok's domain, which forwards any requests directed at it, through the secure tunnel, to the python script running on my internal Raspberry Pi 3.

High Level Design for the prototype

Using the above-mentioned design considerations, the below schematic was developed to create the prototype.

Let’s go through each of the steps (denoted by the numbers) to better understand the design.

1. The user will invoke the race car Amazon Alexa Skill and ask to either move the car in a certain direction, save all the movements that have been requested so far, or run a previously saved set of movements.

2. The Alexa device (Amazon Echo Dot) will record the audio from the user and send it to the Alexa Cloud for processing. Alexa Cloud converts the audio into JSON using Natural Language Processing (NLP). Based on the invocation name, it will pass the JSON file to the race car Amazon Alexa Skill.

3. The race car Amazon Alexa Skill will check to ensure that the intent supplied by the user is valid. Once confirmed, the race car Amazon Alexa Skill will pass the JSON to the endpoint defined for the skill. In our case, this is an endpoint that is hosted at ngrok (https://ngrok.com)

4. The ngrok endpoint will receive the JSON file from the race car Amazon Alexa Skill and then forward it using the secure tunnel to the python script running on the Raspberry Pi 3 within the home network. The python script will use the Flask-Ask framework to process the intents from the Alexa Skills Kit (a minimal sketch of such a script is shown after this list; more information about Flask-Ask is available at https://flask-ask.readthedocs.io/en/latest/).

5. If the user requested to save all the car movements carried out so far, then the python script will write the movements to a table within Amazon DynamoDB.

6. If the user requested to load a previously saved set of movements, then the python script will read the movements from the table within Amazon DynamoDB.

7. If the user requested to either load a previously saved set of movements or to move the car in a certain direction, the python script will emulate the appropriate button presses on the remote control.

8. The remote control will translate the emulated button presses into radio frequency instructions and send them to the car. The car will receive these instructions and move accordingly.
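
To make step 4 more concrete, below is a minimal Flask-Ask sketch of what the python script on the Raspberry Pi 3 could look like; the intent name, slot name and GPIO pin numbers are all assumptions for illustration, not the actual implementation.

from flask import Flask
from flask_ask import Ask, statement

app = Flask(__name__)
ask = Ask(app, "/")  # ngrok forwards the Alexa Skill's requests to this route

# Assumed mapping of directions to the GPIO pins wired to the relay board modules,
# which in turn emulate the button presses on the remote control.
DIRECTION_PINS = {"forward": 17, "backward": 18, "left": 22, "right": 23}

@ask.intent("MoveCarIntent", mapping={"direction": "Direction"})
def move_car(direction):
    pin = DIRECTION_PINS.get(direction.lower())
    if pin is None:
        return statement("Sorry, I don't know how to move {}".format(direction))
    # The real script would pulse the relay attached to this pin via RPi.GPIO here.
    return statement("Moving the car {}".format(direction))

if __name__ == "__main__":
    app.run(port=5000)  # this port is exposed through the ngrok secure tunnel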

To give you a sneak peek of the prototype, checkout the video at https://youtu.be/4SMYDhuri0Q (there are some minor bugs with the car movement which I intend on getting fixed as soon as possible).

In the next blog in this series, we will go through the process of “hacking” the remote control and also setting up the Raspberry Pi 3 ancillary hardware.

I hope to see you then.

Till then, enjoy!

Script to shutdown servers

I run a lot of Microsoft virtual machines in Azure and also locally on my MacBook Pro. These are my lab machines, which I use for testing.

One of the issues with having many virtual machines is shutting them down in an orderly fashion. It can be a pain to go through each of them and shut them down.

To circumvent this, I wrote a small PowerShell script, which does it all for me 🙂

<#
This script will shut down all servers that are in $serverlist and are active
Changelog
Date       Author    Comments
02/06/17   Nivlesh   Initial version
#>
$serverlist = @("Server01","Server02","Server03","Server04","Server05")
$server_domainname = "domain.local"

foreach ($server in $serverlist){
    # Build the server's fully qualified domain name.
    $server_fqdn = -join ($server, ".", $server_domainname)
    Write-Host "[$server_fqdn] " -NoNewline
    # Check that the server is online before attempting to shut it down.
    if (Test-Connection -ComputerName $server_fqdn -Count 1 -ErrorAction SilentlyContinue){
        Write-Host -ForegroundColor Green "Online. Initiating Shutdown " -NoNewline
        Stop-Computer -ComputerName $server_fqdn -Force
        Write-Host -ForegroundColor Green "Server Shutdown Successfully"
    }else{
        Write-Host -ForegroundColor Red "Server is Offline"
    }
}

# Finally, offer to shut down the computer this script is running on.
Read-Host -Prompt "Press Enter when ready to shutdown this host or CTRL+C to Skip"
Stop-Computer


The script requires the following:

$serverlist contains the hostnames of the servers that you want to shut down (in the order they need to be shut down).

$server_domainname is the domain name that the servers are part of.

The server name and $server_domainname are joined to form each server's FQDN, which is then used to shut down that server.

Run the script from a computer that can connect to the servers. Ensure you are logged on as an account that has permissions to shutdown the servers.

The script will go through the list of servers contained in $serverlist and check if they are online. If they are online, then it will try to shut them down.

Do note that these servers will be forced to shutdown, so anything open on those servers will be lost, if not saved.

Once all the online servers have been shut down, you will be asked if you want to shut down the computer you are running the script from. You can press Enter to continue or CTRL+C to skip shutting down the computer you are logged on to.

Hope this script comes in handy to others.