A few weeks back, I was tasked with automating the training, build and deployment of an Amazon SageMaker model. Initially, I thought that an AWS Lambda Function would be the best candidate for this, however as I started experimenting, I quickly realised that I needed to look elsewhere.
After some research, I found articles that pointed me towards AWS Step Functions. As it happens, AWS has been making AWS Step Functions more Amazon SageMaker friendly, to the point that AWS Step Functions now natively supports most of the Amazon SageMaker APIs.
With the technology decided, I started figuring out how to achieve what I had set out to do. I did find some good documentation and examples; however, they didn’t entirely cover everything I was after.
After much research and experimentation, I finally created a solution that was able to automatically train, build and then deploy an Amazon SageMaker model.
In this blog, I will outline the steps I followed, with the hope that it benefits those wanting to do the same, saving them countless hours of frustration and experimentation.
High Level Architecture Diagram
Below is a high-level architecture diagram of the solution I used.
The steps (as denoted by the numbers in the diagram) are as follows:
- The first AWS Step Function state calls the Amazon SageMaker API to create a Training Job, passing it all the necessary parameters.
- Using the supplied parameters, Amazon SageMaker downloads the training and validation files from the Amazon S3 bucket, and then runs the training algorithm. The output is uploaded to the same Amazon S3 bucket that contains the training and validation files.
- The next AWS Step Function state calls the Amazon SageMaker API to create a model, using the artifacts from the Training Job.
- The next AWS Step Function state calls the Amazon SageMaker API to create an endpoint configuration, using the model that was created in the previous state.
- The next AWS Step Function state calls the Amazon SageMaker API to create a model endpoint, using the endpoint configuration that was created in the previous state.
- Using the endpoint configuration, Amazon SageMaker deploys the model using Amazon SageMaker Hosting Services, making it available to any client wanting to use it.
Let’s get started.
For this blog, I will be using the data and training parameters described in the Amazon SageMaker tutorial at https://docs.aws.amazon.com/sagemaker/latest/dg/gs-console.html
1. Create an Amazon S3 bucket. Create a folder called data inside your Amazon S3 bucket, within which create three subfolders called train, validation and test (technically these are not folders and subfolders, but keys; however, to keep things simple, I will refer to them as folders and subfolders).
2. Follow Step 4 from the above-mentioned Amazon SageMaker tutorial (https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-preprocess-data.html) to download and transform the training, validation and test data. Then upload the data to the respective subfolders in your Amazon S3 bucket (we won’t be using the test data in this blog, however you can use it to test the deployed model).
3. Create an AWS IAM role with the following permissions:
AmazonSageMakerFullAccess (AWS Managed Policy)
and a custom policy to read, write and delete objects from the Amazon S3 bucket created in Step 1. The policy will look similar to the one below
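The exact statements will depend on your setup, but a minimal sketch of such a policy could look like this (bucketName is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::bucketName"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::bucketName/*"
    }
  ]
}
```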
where bucketName is the name of the Amazon S3 bucket created in Step 1 above.
4. Open the AWS Step Functions console and change to the AWS Region where the Amazon SageMaker model endpoint will be deployed.
5. Create a new state machine, choose Author with code snippets and set the type to Standard.
6. Under Definition, delete everything and paste the following:
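A minimal skeleton along these lines will do (the states themselves are filled in over the following steps):

```json
{
  "Comment": "An AWS Step Function state machine to automatically train, build and deploy an Amazon SageMaker model",
  "StartAt": "Create Training Job",
  "States": {
  }
}
```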
The above code provides a comment describing the purpose of this AWS Step Function state machine and sets the first state name to Create Training Job.
For a full list of Amazon SageMaker APIs supported by AWS Step Functions, please refer to https://docs.aws.amazon.com/step-functions/latest/dg/connect-sagemaker.html
7. Use the following code to create the first state (these are the training parameters described in the above-mentioned Amazon SageMaker tutorial).
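A sketch of that state is below. Treat it as a template rather than a definitive definition: `<trainingImage>` and `<roleArn>` are placeholders (finding the correct Training Image is covered in the notes that follow), bucketName must be replaced with your own bucket, and the instance type and hyperparameters are illustrative values along the lines of the tutorial’s example.

```json
"Create Training Job": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
  "Parameters": {
    "TrainingJobName.$": "$$.Execution.Name",
    "AlgorithmSpecification": {
      "TrainingImage": "<trainingImage>",
      "TrainingInputMode": "File"
    },
    "RoleArn": "<roleArn>",
    "OutputDataConfig": {
      "S3OutputPath": "s3://bucketName/data"
    },
    "ResourceConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.m4.xlarge",
      "VolumeSizeInGB": 10
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 86400
    },
    "HyperParameters": {
      "max_depth": "5",
      "eta": "0.2",
      "gamma": "4",
      "min_child_weight": "6",
      "objective": "multi:softmax",
      "num_class": "10",
      "num_round": "10"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "ContentType": "text/csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://bucketName/data/train",
            "S3DataDistributionType": "FullyReplicated"
          }
        }
      },
      {
        "ChannelName": "validation",
        "ContentType": "text/csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://bucketName/data/validation",
            "S3DataDistributionType": "FullyReplicated"
          }
        }
      }
    ]
  },
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "Display Error"
    }
  ],
  "Next": "Create Model"
}
```

The .sync suffix on the resource ARN makes this state wait for the Training Job to finish before moving on, and the Catch clause routes any failure to an error-handling state (here named Display Error).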
I would like to call out a few things from the above code:
"Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync" refers to the Amazon SageMaker API for creating a Training Job. When this state task runs, you will be able to see this Training Job in the Amazon SageMaker console.
TrainingJobName is the name given to the Training Job, and it must be unique within the AWS Region, in the AWS account. In my code, I am setting this to the Execution Name (referenced internally as $$.Execution.Name), which is an optional parameter that can be supplied when executing the AWS Step Function state machine. By default, this is set to a unique random string; however, to make the Training Job name more recognisable, provide a more meaningful unique value when executing the state machine. I tend to use the current time in the format <training-algorithm>-<year><month><date>-<hour><minute><second>
If you have ever used Jupyter notebooks to run an Amazon SageMaker Training Job, you will have used a line similar to the following:

container = get_image_uri(boto3.Session().region_name, 'xgboost')
Yes, your guess is correct! Amazon SageMaker uses containers for running Training Jobs. The above assigns the xgboost training algorithm container from the region that the Jupyter notebook is running in.
These containers are hosted in Amazon Elastic Container Registry (Amazon ECR) and maintained by AWS. For each training algorithm that Amazon SageMaker supports, there is a specific container. Details for these containers can be found at https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html.
When submitting a Training Job using AWS Step Functions, you must supply the correct container name, from the correct region (the region where you will be running Amazon SageMaker from). This information is passed using the parameter TrainingImage. To find the correct container path, use https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html.
Another method for getting the value for TrainingImage is to manually submit a Training Job using the Amazon SageMaker console, using the same training algorithm that you will be using in the AWS Step Function state machine. Once the job has started, open it and have a look under the section Algorithm. You will find the Training Image for that particular training algorithm, for that region, listed there. You can use this value for TrainingImage.
S3OutputPath is the location where Amazon SageMaker will store the model artifacts after the Training Job has successfully finished.
RoleArn is the ARN of the AWS IAM Role that was created in Step 3 above.
S3Uri under ChannelName: train is the Amazon S3 bucket path to the folder where the training data is located.

S3Uri under ChannelName: validation is the Amazon S3 bucket path to the folder where the validation data is located.
DON’T FORGET TO CHANGE bucketName TO THE AMAZON S3 BUCKET THAT WAS CREATED IN STEP 1 ABOVE
In the next AWS Step Function state, the model will be created using the artifacts generated from the Training Job.
8. An AWS Step Function state receives input parameters, does its processing and then produces an output. If there is another state next in the path, the output is provided as an input to that state. This is an elegant way of passing dynamic information between states.
Here is the code for the next AWS Step Functions state.
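A sketch of this state is below. As before, `<trainingImage>` and `<roleArn>` are placeholders you must substitute; the ResultPath parameter is explained a little further on.

```json
"Create Model": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sagemaker:createModel",
  "Parameters": {
    "ModelName.$": "$.TrainingJobName",
    "PrimaryContainer": {
      "Image": "<trainingImage>",
      "ModelDataUrl.$": "$.ModelArtifacts.S3ModelArtifacts"
    },
    "ExecutionRoleArn": "<roleArn>"
  },
  "ResultPath": "$.taskresult",
  "Next": "Create Endpoint Config"
}
```

Note the .$ suffix on ModelName and ModelDataUrl: it tells AWS Step Functions to resolve the value from the state's input using the given JSONPath, rather than treating it as a literal string.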
Image refers to the same container that was used in the Create Training Job state.
ModelDataUrl refers to the location where the model artifacts that were created in the previous state are stored. This value is part of the output (input to this state) from the previous state. To reference it, use $.ModelArtifacts.S3ModelArtifacts
ExecutionRoleArn is the ARN of the AWS IAM Role that was created in Step 3 above.
"Resource": "arn:aws:states:::sagemaker:createModel" refers to the Amazon SageMaker API for creating a model.
To keep things simple, the name of the generated model will be set to the TrainingJobName. This value is part of the output (input to this state) from the previous state. To reference it, use $.TrainingJobName
After this state finishes execution, you will be able to see the model in the Amazon SageMaker console.
The next state is for creating an Endpoint Configuration using the model that was just created.
Before we continue, I want to point out an additional parameter that I am using: "ResultPath": "$.taskresult". Let me explain the reason for using it. In my next state, I must provide the name of the model that will be used to create the Endpoint Configuration. Unfortunately, this name is not part of the output of the current state Create Model, so I won’t be able to reference it. However, as you might remember, for simplicity we set the model name to be the same as the TrainingJobName, and this is part of the current state’s input parameters! Now, if only there was a way to make the current state include its input parameters in its output. Oh wait! There is a way. Setting "ResultPath": "$.taskresult" instructs this AWS Step Function state to place its result under $.taskresult, so the rest of its input parameters pass through to its output.
9. Here is the code for the AWS Step Function state to create an Endpoint Config.
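A sketch of this state (the variant name AllTraffic is an illustrative choice):

```json
"Create Endpoint Config": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sagemaker:createEndpointConfig",
  "Parameters": {
    "EndpointConfigName.$": "$.TrainingJobName",
    "ProductionVariants": [
      {
        "ModelName.$": "$.TrainingJobName",
        "InitialInstanceCount": 1,
        "InstanceType": "ml.t2.medium",
        "VariantName": "AllTraffic"
      }
    ]
  },
  "ResultPath": "$.taskresult",
  "Next": "Create Endpoint"
}
```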
This state is pretty straightforward.
"Resource": "arn:aws:states:::sagemaker:createEndpointConfig" refers to the Amazon SageMaker API to create an Endpoint Configuration.
For simplicity, we will set the Endpoint Configuration name to be the same as the TrainingJobName. The Endpoint will be deployed initially using one ml.t2.medium instance.
As in the previous state, we will use "ResultPath": "$.taskresult" to circumvent the lack of parameters in the output of this state.
In the final state, I will instruct Amazon SageMaker to deploy the model endpoint.
10. Here is the code for the final AWS Step Function state.
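A sketch of the final state:

```json
"Create Endpoint": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sagemaker:createEndpoint",
  "Parameters": {
    "EndpointConfigName.$": "$.TrainingJobName",
    "EndpointName.$": "$.TrainingJobName"
  },
  "End": true
}
```

Since this is the last state in the happy path, it sets "End": true instead of a Next field.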
The Endpoint Configuration from the previous state is used to deploy the model endpoint using Amazon SageMaker Hosting Services.
"Resource": "arn:aws:states:::sagemaker:createEndpoint" refers to the Amazon SageMaker API for deploying an endpoint using Amazon SageMaker Hosting Services. After this state completes successfully, the endpoint is visible in the Amazon SageMaker console.
For simplicity, the name of the Endpoint is set to the same value as the TrainingJobName.
To keep things tidy, it is nice to display an error when things don’t go as planned. There is an AWS Step Function state for that!
11. Here is the code for the state that displays the error message. This state only gets invoked if there is an error in the Create Training Job state.
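One way to surface the error is a Fail state along these lines (the state name and the Cause and Error values here are illustrative):

```json
"Display Error": {
  "Type": "Fail",
  "Cause": "The Create Training Job state failed. Check the Amazon SageMaker console for details.",
  "Error": "CreateTrainingJobError"
}
```

This state is reached via a Catch clause on the Create Training Job state, which routes all errors (ErrorEquals: States.ALL) to it.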
The full AWS Step Function state machine code is available at https://gist.github.com/nivleshc/a4a99a5c2bca1747b6da0d7da0e388c1
When creating the AWS Step Function state machine, you will be asked for an AWS IAM Role that will be used by the state machine to run the states. Unless you already have an AWS IAM Role that can carry out all the state tasks, choose the option to create a new AWS IAM Role.
To invoke the AWS Step Function state machine, just click on new execution and provide a name for the execution id. As each state runs, you will see visual feedback in the AWS Step Function schematic. You will be able to see the tasks in the Amazon SageMaker console as well.
To take the above one step further, you could invoke the AWS Step Function state machine whenever new training and validation data is available in the Amazon S3 bucket. The new model can then be used to update the existing model endpoint.
That’s it, folks! This is how you can automatically train, build and deploy an Amazon SageMaker model!
Once you are finished, don’t forget to clean up, to avoid any unnecessary costs.
The following must be deleted using the Amazon SageMaker console:
- The model endpoint
- The Endpoint Configuration
- The model
- Any Jupyter Notebook instances you might have provisioned and don’t need anymore
- Any Jupyter notebooks that are not needed anymore
If you don’t have any use for the following, these can also be deleted:
- The contents of the Amazon S3 bucket and the bucket itself
- The AWS IAM Role that was created and the custom policy to access the Amazon S3 bucket and its contents
- The AWS Step Function state machine
Till the next time, Enjoy!