Using Serverless Framework and AWS to map Near-Realtime Positions of Trains

Background

A couple of months back, I found out about the Open Data initiative from Transport for New South Wales. This is an awesome undertaking that provides data to developers and other interested parties so that they can build great applications. For those interested, the Open Data Hub can be accessed at https://opendata.transport.nsw.gov.au.

I had been playing with data for a few months, and when I looked through the various APIs that I could access via the Open Data Hub, I became extremely interested.

In this blog, I will take you through one of my mini projects based on the data from the Open Data Hub. I will be using the Public Transport – Vehicle Positions API to plot the near-realtime positions of Sydney trains on a map. The API provides access to more than just train position data; however, to keep things simple, I will concentrate only on trains in this blog.

Solution Architecture

As I am a huge fan of serverless, I decided to architect my solution with as many serverless components as possible. The diagram below shows a high-level architecture of how the Transport Positioning System (this is what I will call my solution, TPS for short) will be created.

Let’s go through the steps (as marked in the diagram above)

  1. The lambda function will query the Open Data API every 5 minutes for the position of all trains
  2. After the data has been received, the lambda function will go through each record and assign a label to each train. To ensure the labels are consistent across each lambda invocation, the train to label association will be stored in an Amazon DynamoDB table. The lambda function will query the table to check if a train has already been allocated a label. If it has, then that label will be used. Otherwise, a new label will be created, and the Amazon DynamoDB table will be updated to store this new train to label association.
  3. I found Bing Maps to be much easier (and cheaper) to use for plotting items on a map. The only disadvantage is that it can show at most 100 points (called pushpins) on a map. The lambda function will go through the first 100 items returned from the Open Data API and, using the labels that were found or created in step 2, create a pushpin URL. This URL will then be sent to Bing Maps to generate a map showing the locations of the first 100 trains.
  4. The lambda function will then create a static webpage that displays the map showing the positions of the trains, along with a key that provides more information about each train's label (for example, label 1 could have a description of “19:10 Central Station to Penrith Station”). Each label's description is obtained from the results of the Open Data API query.

This is what I cooked up earlier

Be warned! This blog is quite lengthy as a lot was done to make this solution work. However, if you would rather see the final result before delving into the details, check it out at https://sls-tps-website-dev.s3-ap-southeast-2.amazonaws.com/vehiclelocation.html.

Screenshot of TPS

Above is a screenshot of TPS in action. It is a static webpage that is regenerated every 5 minutes, showing the positions of the trains. I will keep the lambda function running for at least three months, so you have plenty of time to check it out.

Okay, let’s get our hands dirty and start coding.

Prerequisites

Before we start, the following must be in place

  1. You must have an AWS account. If you don’t have one already, you can sign up for a free tier at https://aws.amazon.com/free/
  2. You must have Serverless Framework installed on your computer. If you don’t have it, follow the instructions at https://serverless.com/framework/docs/getting-started/
  3. Set up the AWS access key and secret access key that Serverless Framework will use to provision resources into your AWS account. Instructions to get this done can be obtained from https://serverless.com/framework/docs/providers/aws/guide/credentials/
  4. After items 1 – 3 have been completed, create a python runtime Serverless Framework service (my service is called tps)
  5. Within the tps service folder, install the following Serverless Framework plugins
    1. serverless-python-requirements – this plugin adds all the required python modules into a zip file containing our lambda function script, which then gets uploaded to AWS (the required python modules must be defined in the requirements.txt file)
    2. serverless-prune-plugin – this plugin ensures that only the specified number of lambda function versions exist.
  6. The serverless-python-requirements plugin reads the list of required modules from a file called requirements.txt. For this solution, create this file at the root of the service folder and put the following lines inside it
    requests==2.22.0
    gtfs-realtime-bindings==0.0.6
  7. Register an account with the Transport for New South Wales Open Data Hub (https://opendata.transport.nsw.gov.au/). This is free. Once registered, log in to the Open Data Hub portal, and under My Account, click on Applications and create an Application that has permissions to the Public Transport – Realtime Vehicle Positions API. Note down the API key that is generated as it will be used by the python script later.
  8. Create an account with Bing Maps and use the Website licence plan. This provides 125,000 billable map-generation transactions per calendar year at no charge, which is more than enough for generating a map every 5 minutes. Note down the API key that is provided. Details on creating a Bing Maps account are available at https://www.microsoft.com/en-us/maps/licensing/options

This project has two parts to it. The first is to create the AWS resources that will host our project. For this, we will be using Serverless Framework to create our AWS Lambda function, Amazon Simple Storage Service (S3) bucket, Amazon DynamoDB table, AWS CloudWatch Logs and AWS CloudWatch Events.

The second part is the API query and data ingestion from the Open Data API. The next sections will cover each of these parts.

AWS Resource Creation

As previously mentioned, we will use Serverless Framework to create our AWS resources. Serverless Framework uses serverless.yml to specify which resources need to be created. This file is created by default whenever a Serverless Framework service is created.

In this section, I will take you through the serverless.yml file I used for this project.

The file starts by defining the service name, the plugins and the variables that will be used throughout the file (notice the plugins serverless-python-requirements and serverless-prune-plugin).

The following default values have been configured in the above serverless.yml file.

  • the application name is set to tps
  • the environment is set to dev
  • the Amazon CloudWatch Logs retention is set to 14 days
  • the AWS region is set to ap-southeast-2 (Sydney)
  • the Amazon DynamoDB table name is set to transportPosition
  • the billing mode for the Amazon DynamoDB table is set to PAY_PER_REQUEST

The next section defines the details for the cloud provider where resources will be provisioned.

As previously mentioned, I am using AWS. The lambda function will use the python 3.7 runtime. The deployment bucket that will host all artefacts is also defined; this is an Amazon Simple Storage Service (S3) bucket. Ensure that this S3 bucket exists before deploying the serverless service.

The next section defines the IAM role that will be created for the lambda function.

The specified IAM role will allow the lambda function to carry out all required operations on the Amazon DynamoDB table (currently the IAM role permits all DynamoDB actions; however, if required, this can be tightened) and to upload objects to the Amazon S3 bucket that will host the static website.

The next section provides instructions on what to include and exclude when creating the serverless package.

Now we come to the important sections within serverless.yml. The next section defines the lambda function, its handler and the events that will trigger it.

For the tps lambda function, the handler is at src.tps_vehiclepos.run (this means that there is a subfolder within the service folder called src, within which there is a python file called tps_vehiclepos.py; inside this file is a function called run). The lambda function will run every 5 minutes. To achieve this, I am using a scheduled Amazon CloudWatch Events rule. Two environment variables are also passed to the lambda function (BUCKET and TRANSPORTPOSITION_TABLE).

The last section defines all the resources that will be created by the Serverless Framework.

The following resources will be created

  • Amazon S3 bucket. This bucket will store the Bing Maps images that show the train positions. It will also serve the website that is used to display the position maps.
  • Amazon DynamoDB table. The table will be used to store details for the trains found in the Open Data API query. Note that we are also using the DynamoDB TTL feature; since we don’t need to retain items older than a day, this allows us to easily prune the Amazon DynamoDB table and reduce costs.

The full serverless.yml file can be downloaded from https://gist.github.com/nivleshc/951ff2f235a89d58d4abec6de91ef738

Generating the map

In this section we will go through the script that does all the magic, which is querying the Open Data API, processing the data, generating the map and then displaying the results in a webpage.

The script is called tps_vehiclepos.py. I will discuss the script in parts below.

The first section lists the python modules that will be imported. It also defines the variables that will be used within the script.

Remember to replace <insert your opendata api key> and <insert your bing maps api key here> with your own Open Data and Bing Maps API keys. Do not include the ‘<‘ and ‘>’ characters in the script.
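
To give you an idea of what this first section could look like, below is a minimal sketch. The variable names, the Sydney Trains vehicle positions endpoint and the way the environment variables are read are my assumptions for illustration; refer to the full gist linked at the end of this post for the exact code.

import os

import boto3
import requests
from google.transit import gtfs_realtime_pb2

# API keys - replace the placeholders with your own keys (variable names are assumptions)
OPENDATA_API_KEY = 'your opendata api key'
BING_MAPS_API_KEY = 'your bing maps api key'

# Assumed endpoint for the Public Transport - Realtime Vehicle Positions API (Sydney Trains)
OPENDATA_VEHICLEPOS_URL = 'https://api.transport.nsw.gov.au/v1/gtfs/vehiclepos/sydneytrains'

# Values passed in from serverless.yml as environment variables
BUCKET = os.environ['BUCKET']
TRANSPORTPOSITION_TABLE = os.environ['TRANSPORTPOSITION_TABLE']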

Next, I will take you through the various functions that have been created to carry out specific tasks.

First up is the initialise global variables function. As you might be aware, AWS Lambda execution environments can get reused across invocations. During my experimentation I found this happening, and the issue was that my global variables were not being automatically re-initialised, which caused erroneous results. To circumvent this, I wrote an explicit function that initialises all global variables at the beginning of each lambda invocation.
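
Below is a hedged sketch of this idea; the global variable names are placeholders I have chosen for illustration, not necessarily the ones used in the actual script.

# Module-level state that survives between invocations when the Lambda
# execution environment is reused.
vehicle_labels = {}      # train to label associations loaded from DynamoDB
new_vehicle_labels = {}  # associations created during the current invocation
pushpin_url = ''         # Bing Maps pushpin parameters built during the current invocation

def initialise_global_variables():
    # Explicitly reset the globals so a reused execution environment does not
    # leak state from a previous invocation.
    global vehicle_labels, new_vehicle_labels, pushpin_url
    vehicle_labels = {}
    new_vehicle_labels = {}
    pushpin_url = ''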

When placing each train location on the map (this will be done using pushpins), a label will be used to identify each pushpin. Unfortunately, Bing Maps doesn’t allow more than three characters per label, which is not enough to be meaningful. The solution I devised was to use a consecutive numbering scheme for the labels and then provide a key on the website page. The key provides a description for each label. One complication with this approach is that I need to ensure the same label is used for the same train across lambda invocations. This is achieved by storing the label to train mapping in the Amazon DynamoDB table.

The next function downloads all label to train associations stored in Amazon DynamoDB. This ensures that labels remain consistent across lambda invocations.
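
As a rough sketch of what this could look like (the table attribute names vehicle_id and label are my assumptions), the function scans the table and builds an in-memory dictionary:

import os
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['TRANSPORTPOSITION_TABLE'])

def get_stored_labels():
    # Scan the whole table, paginating until every item has been read.
    labels = {}
    response = table.scan()
    while True:
        for item in response['Items']:
            labels[item['vehicle_id']] = item['label']
        if 'LastEvaluatedKey' not in response:
            break
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    return labels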

The next function just queries Open Data API for the train positions.
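
A minimal sketch of such a query is shown below; it assumes the GTFS-realtime vehicle positions endpoint for Sydney Trains and the apikey authorisation header used by the Open Data Hub.

import requests
from google.transit import gtfs_realtime_pb2

def get_vehicle_positions(api_key):
    # Request the GTFS-realtime feed and decode the protobuf payload.
    headers = {'Authorization': 'apikey ' + api_key}
    response = requests.get(
        'https://api.transport.nsw.gov.au/v1/gtfs/vehiclepos/sydneytrains',
        headers=headers)
    response.raise_for_status()
    feed = gtfs_realtime_pb2.FeedMessage()
    feed.ParseFromString(response.content)
    return feed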

The following function takes the data from the above function and processes it.

The function goes through each vehicle record that was returned and checks whether the vehicle already has a label associated with it. If it does, that label is used; otherwise a new one is created, and the Amazon DynamoDB table will be updated with this newly created label. The function uses the first 100 trains returned by the Open Data API to generate a pushpin URL, which will be used to generate the Bing Maps image showing the positions of the trains.
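
The sketch below captures this logic; the feed field names come from the GTFS-realtime bindings, while the label bookkeeping and the Bing Maps pushpin parameter format are my assumptions.

MAX_PUSHPINS = 100  # Bing Maps static imagery allows at most 100 pushpins

def process_vehicle_positions(feed, stored_labels, new_labels):
    # Assign a numeric label to each train and build the pushpin parameters
    # for the first 100 trains.
    pushpin_url = ''
    plotted = 0
    for entity in feed.entity:
        vehicle = entity.vehicle
        vehicle_id = vehicle.vehicle.id
        if vehicle_id in stored_labels:
            label = stored_labels[vehicle_id]
        else:
            label = len(stored_labels) + len(new_labels) + 1  # next free number
            new_labels[vehicle_id] = {
                'label': label,
                'description': vehicle.vehicle.label,  # e.g. "19:10 Central Station to Penrith Station"
            }
        if plotted < MAX_PUSHPINS:
            # assumed pushpin format: latitude,longitude;icon style;label
            pushpin_url += '&pp={},{};;{}'.format(
                vehicle.position.latitude, vehicle.position.longitude, label)
            plotted += 1
    return pushpin_url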

The following function uses the pushpin URL to create a map using Bing Maps.

The function provides the pushpin URL to Bing Maps, which returns a map showing the positions of the trains. The map is then uploaded to the Amazon S3 bucket that serves the website. The function then generates a description page showing the description for each label. As you can imagine, hundreds of labels are created each day. Not all of them will be displayed on the map, however they will all be listed in the key area. To provide quick access to the descriptions of trains that are currently displayed on the map, the corresponding entries are shown in blue. This ensures that people don’t go on a wild goose chase, trying to locate a train on the map that might not be displayed.
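
Here is a hedged sketch of the map generation and upload part. The Bing Maps Imagery REST parameters and the S3 object name are assumptions, and the description page generation is omitted for brevity.

import os
import boto3
import requests

s3 = boto3.client('s3')

def generate_map(pushpin_url, bing_maps_key):
    # Ask the Bing Maps Imagery REST API for a road map that fits the pushpins.
    base_url = 'https://dev.virtualearth.net/REST/v1/Imagery/Map/Road'
    url = '{}?mapSize=800,600&key={}{}'.format(base_url, bing_maps_key, pushpin_url)
    response = requests.get(url)
    response.raise_for_status()

    # Upload the returned image to the S3 bucket that serves the website.
    s3.put_object(
        Bucket=os.environ['BUCKET'],
        Key='map.png',                # assumed object name
        Body=response.content,
        ContentType='image/png')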

The next function updates the transport position Amazon DynamoDB table with any new labels that were created during this invocation. This ensures that the labels persist for all subsequent lambda invocations.
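
A sketch of this update is below; it includes a time-to-live attribute (named ttl here, an assumption) to support the DynamoDB TTL feature mentioned earlier, and the other attribute names match the sketches above.

import os
import time
import boto3

table = boto3.resource('dynamodb').Table(os.environ['TRANSPORTPOSITION_TABLE'])

def save_new_labels(new_labels):
    # Items expire after a day, so the TTL feature can prune the table for us.
    expires_at = int(time.time()) + 24 * 60 * 60
    with table.batch_writer() as batch:
        for vehicle_id, details in new_labels.items():
            batch.put_item(Item={
                'vehicle_id': vehicle_id,
                'label': details['label'],
                'description': details['description'],
                'ttl': expires_at,
            })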

Now that we know what all the functions do, let’s move on to the handler function that the tps lambda will call. The handler function coordinates all the other functions.

The handler function calls the respective functions to get the following tasks done, in the order listed below (a rough sketch of the handler follows the list).

  • initialise the global variables
  • download the previously created label associations from the Amazon DynamoDB table
  • call the Open Data API to get the latest positions of the trains
  • process the train data and generate the pushpin URL
  • use the pushpin URL to generate the map via Bing Maps
  • create a landing page; this webpage shows the map along with a key containing the description for each label
  • finally, upload all labels that were created within this lambda invocation to the Amazon DynamoDB table, so that subsequent invocations of the lambda function assign the same labels to the same trains
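
Tying the sketches above together, the run handler could look roughly like this; create_landing_page is a hypothetical helper standing in for the webpage generation step.

def run(event, context):
    # Reset globals, then load the existing train-to-label associations from DynamoDB.
    initialise_global_variables()
    stored_labels = get_stored_labels()
    new_labels = {}

    # Fetch the latest positions, build the pushpin URL and generate the map.
    feed = get_vehicle_positions(OPENDATA_API_KEY)
    pushpin_url = process_vehicle_positions(feed, stored_labels, new_labels)
    generate_map(pushpin_url, BING_MAPS_API_KEY)

    # create_landing_page(...) is a hypothetical helper that would build
    # vehiclelocation.html (the map plus the label key) and upload it to the website bucket.

    # Persist any labels created during this invocation for the next run.
    save_new_labels(new_labels)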

The full tps_vehiclepos.py file can be downloaded from https://gist.github.com/nivleshc/03cb06bd6ae9cb969192af0ee2a1a15b

That’s it! Now you have a good idea about how the AWS resources were generated and how the data was acquired, processed and then visualised.

Cost to run the solution

When I started developing this solution, to see what type of data was being provided by the Open Data API, I tried to ingest everything into DynamoDB. This was a VERY VERY bad idea as it cost me quite a lot. Since then I have modified my code to only ingest and store the fields that are required into the Amazon DynamoDB table. This has drastically reduced my running costs. Currently I am being charged approximately USD 0.05 or less per day. You can easily run this within your free tier without incurring any additional costs (just as a precaution, monitor your costs to ensure there are no surprises).

As mentioned at the beginning of this blog, the final result can be seen at https://sls-tps-website-dev.s3-ap-southeast-2.amazonaws.com/vehiclelocation.html. The webpage refreshes every minute; however, the map is only regenerated every five minutes. I will keep this project running for at least three months, so if you want to check it out, you have ample time.

I thoroughly enjoyed creating this project, and I hope you all enjoy it as well. Till the next time, Enjoy!


Display Raspberry Pi Metrics using AWS CloudWatch

Background

I am all for cloud computing, however there are some things, in my view, that still need an on-premises presence. One such thing is a device that allows you to securely connect to your home network. For this, I use a Raspberry Pi running OpenVPN server. OpenVPN is an awesome tool; apart from letting me securely connect to my home network, it also allows me to securely tunnel my network traffic via my home network when I am connected to an unsecured network.

However, over the last few days, I have been having issues with my vpn connections. They would intermittently disconnect, and at times I would have to try a few times before the connection was re-established. At first it was a nuisance, however lately it has become a bigger issue. Finally, I decided to fix it!

After spending some time on it, guess what the problem turned out to be? A few weeks back I had installed some software on my Raspberry Pi for testing purposes. I had forgotten to uninstall it, and for some reason it was now hogging the CPU! As I didn’t need this software, the quick fix was to simply uninstall it.

This got me thinking. There must be a better way to monitor the CPU, memory and disk space usage on my Raspberry Pi instead of logging onto it every now and then, or worse, only when things broke. I could install monitoring tools on it which could notify me when certain thresholds were breached. However, this meant adding more workloads to my Raspberry Pi, something I wasn’t too keen on doing. I finally decided to publish the metrics to AWS CloudWatch and create some alarms there.

In this blog, I will list the steps that I followed, to publish my Raspberry Pi metrics to AWS CloudWatch. Without taking much more time, let’s get started.

Create an AWS user for the AWS CloudWatch agent

The AWS CloudWatch agent that will run on our Raspberry Pi needs to be able to authenticate with our AWS account, before it can upload any metrics.

To enable this, create an IAM user with programmatic access and attach the CloudWatchAgentServerPolicy directly to it. When you are creating this IAM user, keep a note of the access key ID and secret access key that are displayed at the end of the user creation process. If you lose these keys, there is no way to recover them; the only option is to regenerate them.
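
If you prefer to script this step, below is a hedged boto3 sketch that does the same thing; the user name is an assumption, and you would run this with credentials that are allowed to manage IAM.

import boto3

iam = boto3.client('iam')
user_name = 'raspberrypi-cloudwatch'   # assumed user name

# Create the user and attach the AWS managed CloudWatchAgentServerPolicy to it.
iam.create_user(UserName=user_name)
iam.attach_user_policy(
    UserName=user_name,
    PolicyArn='arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy')

# Programmatic access keys - note the secret access key down now, it cannot be retrieved later.
access_key = iam.create_access_key(UserName=user_name)['AccessKey']
print(access_key['AccessKeyId'], access_key['SecretAccessKey'])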

For detailed instructions on creating this IAM user, visit https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-iam-roles-for-cloudwatch-agent-commandline.html

Downloading and installing the AWS CloudWatch agent

With the IAM user done, let’s proceed to installing the AWS CloudWatch agent on the Raspberry Pi. The available agents can be downloaded from https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-commandline-fleet.html

Knowing that my Raspberry Pi runs Raspbian, a variant of Debian, as its operating system, I proceeded to download the ARM64 version of the .deb file. This is when the fun started!

Running the command

sudo dpkg -i -E ./amazon-cloudwatch-agent.deb

gave me the following error

package architecture (arm64) does not match system (armhf)

Interesting. According to the error, my Raspberry Pi has an armhf (ARM hard float) architecture, which is not supported by the standard AWS CloudWatch agents. For those interested in the various Debian ports, this webpage lists all of them: https://www.debian.org/ports/#portlist-released

This latest discovery put my planning into a tailspin!

After spending some time searching, I came across https://github.com/awslabs/collectd-cloudwatch. This describes a plugin for collectd which would allow me to publish the Raspberry Pi metrics to AWS CloudWatch!

Hooray! I was back on track again! Below is a record of what I did next.

  1. On my Raspberry Pi, I installed collectd using
    sudo apt-get install collectd
  2. I then downloaded the installation script for the AWS CloudWatch plugin using the following command
    wget https://raw.githubusercontent.com/awslabs/collectd-cloudwatch/master/src/setup.py
  3. Once downloaded, I used chmod to make the script executable using the following command
    chmod u+x setup.py
  4. If you look through the script, you will notice that it tries to detect the linux distribution of the system it is running on, and then uses the respective installer commands to install the plugin. Digging a bit further, I found that it detects the linux distribution by looking through the files matching the pattern /etc/*-release.

    When I looked at all files fitting the pattern /etc/*-release, the only file I found was /etc/os-release, which was a symbolic link to /usr/lib/os-release.

    Opening the file /usr/lib/os-release, I noticed that the name of the distribution installed on my Raspberry Pi was “Raspbian GNU/Linux”.

    Comparing this to the script setup.py, I found that it wasn’t one of the supported distributions. Fear not, because this is easily remedied!

    So here is what you do.

    Open setup.py in your favorite editor and scroll down to where the following section is

    DISTRIBUTION_TO_INSTALLER = {
      "Ubuntu": APT_INSTALL_COMMAND,
      "Red Hat Enterprise Linux Server": YUM_INSTALL_COMMAND,
      "Amazon Linux AMI": YUM_INSTALL_COMMAND,
      "Amazon Linux": YUM_INSTALL_COMMAND,
      "CentOS Linux": YUM_INSTALL_COMMAND,
    }

    Add the line

    "Raspbian GNU": APT_INSTALL_COMMAND,

    after

    "CentOS Linux": YUM_INSTALL_COMMAND

    You should now have the following

    DISTRIBUTION_TO_INSTALLER = {
      "Ubuntu": APT_INSTALL_COMMAND,
      "Red Hat Enterprise Linux Server": YUM_INSTALL_COMMAND,
      "Amazon Linux AMI": YUM_INSTALL_COMMAND,
      "Amazon Linux": YUM_INSTALL_COMMAND,
      "CentOS Linux": YUM_INSTALL_COMMAND,
      "Raspbian GNU": APT_INSTALL_COMMAND,
    }
  5. Run setup.py. The script seems to be customised to run within an Amazon EC2 instance, because it tries to gather information by querying the instance metadata URLs. Ignore these errors and enter the information that is requested. Below are the questions that will be asked

    When asked for your region, enter your AWS region. For me, this is ap-southeast-2
    When asked, enter the hostname of the Raspberry Pi
    Next, you will be asked about the AWS credentials to connect to AWS CloudWatch. Enter the credentials for the user that was created above
    Unless you are using a proxy server, answer none to "Enter a proxy server name" and "Enter a proxy server port"
    At the next prompt for "Include the Auto-Scaling Group name as a metric dimension" choose No
    For "Include the Fixed Dimension as a metric dimension" prompt choose No
    At the next prompt for "Enable high resolution" choose No
    For "Enable flush internal", leave this at "Default 60s"
    The last question asks how to install the CloudWatch plugin. Choose "Add plugin to existing configuration"
  6. Now that all the questions have been answered, you must select the metrics that have to be published to AWS CloudWatch. To check which metrics can be published, open the file

    /opt/collectd-plugins/cloudwatch/config/blocked-metrics.

    From the above file, select the metrics that you want to be published to AWS CloudWatch, and copy them into the file

     /opt/collectd-plugins/cloudwatch/config/whitelist.conf
  7. After the whitelist has been populated, restart the collectd agent so that it can read the updated settings. To do this, issue the following command
    sudo service collectd restart
  8. That’s it! Give it approximately five minutes and the Raspberry Pi metrics should appear inside AWS CloudWatch. To check, log in to AWS CloudWatch and under Metrics, you should see a custom namespace for collectd. These are the metrics that were sent from your Raspberry Pi.

Here is a screenshot of the CPU metrics that my Raspberry Pi uploaded to AWS CloudWatch

AWS CloudWatch Raspberry Pi Metrics

 

If you want to be alerted when a certain metric reaches a particular threshold, you can create an alarm within AWS CloudWatch that notifies you when this happens.
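
As an example, below is a hedged boto3 sketch of such an alarm on one of the collectd CPU metrics; the metric name, dimension and SNS topic ARN are assumptions, so copy the exact values from what appears in your Metrics console.

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='ap-southeast-2')

cloudwatch.put_metric_alarm(
    AlarmName='raspberrypi-high-cpu',
    Namespace='collectd',                      # the custom namespace created by the plugin
    MetricName='cpu.percent.active',           # assumed - use the name shown in the console
    Dimensions=[{'Name': 'Host', 'Value': 'raspberrypi'}],   # assumed dimension
    Statistic='Average',
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:ap-southeast-2:123456789012:raspberrypi-alerts'],  # hypothetical SNS topic
)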

That’s it folks! Till the next time, Enjoy!