Monitor Amazon Elastic Kubernetes Service Clusters using Prometheus and Grafana

Introduction

In today’s dynamic and fast-paced technological landscape, businesses and organisations heavily rely on cloud-based infrastructures to cater to their computing needs efficiently. With the increasing adoption of cloud services, the ability to provision resources has become remarkably streamlined, offering unparalleled scalability and flexibility. However, this newfound ease of provisioning also brings about the necessity for comprehensive monitoring strategies.

Monitoring provisioned resources is a critical aspect of managing a robust and reliable cloud infrastructure. It enables organisations to gain valuable insights into the performance, health and utilisation of their resources. By tracking various metrics and data points, monitoring empowers stakeholders to make informed decisions, optimise resource allocation, and identify potential issues before they escalate into major problems.

In this blog, we will walk through a solution that I built to monitor an Amazon Elastic Kubernetes Service (EKS) cluster using Prometheus and Grafana. We will be extending the capabilities of one of my previous blogs, where we had deployed an Amazon EKS cluster using the Serverless Terraform pipeline. If you haven’t read it, or need a refresher, the blog is available at https://nivleshc.wordpress.com/2023/06/12/create-an-amazon-elastic-kubernetes-service-cluster-using-a-serverless-terraform-pipeline/

A Preview Of The Final Product

Before we continue, I think it is a good idea to get a sneak peek of what we will be building. This should give you enough motivation to stick around till the end of the blog. While there is a lot that will be provisioned, everything is created using infrastructure as code. This means that you won’t have to do any coding yourself.

The screenshot below shows one of the Grafana dashboard panels that will be created in this blog. As you might notice, it exposes a lot of important metrics about the Amazon EKS cluster in a clean and concise manner. Have I got you intrigued? Let’s continue.

Updates to the Serverless Terraform Pipeline

As you might have figured out, we will be using the Serverless Terraform Pipeline to deploy our Amazon EKS cluster, which will now include the monitoring layer.

To facilitate this additional functionality, the Serverless Terraform Pipeline code has been updated.

Here is a summary of the updates.

  • the IAM policy attached to the AWS CodeBuild service role has been updated. The role can now read, add and update AWS Systems Manager Parameter Store parameters. These permissions are used by the AWS CodeBuild project to store the Grafana dashboard administrator credentials in AWS Systems Manager Parameter Store. A rough sketch of the added permissions is shown below.
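
To give you a sense of what this grant looks like, here is a sketch of the added permissions expressed as an AWS CLI call, purely for illustration. In the pipeline repository the policy is defined in Terraform; the role name, policy name, action list and resource scope below are my assumptions, not the pipeline’s actual policy.

# Illustrative only - the actual policy is defined in Terraform in the pipeline repository.
# The role name, policy name, action list and resource scope are assumptions.
aws iam put-role-policy \
  --role-name <codebuild-service-role-name> \
  --policy-name allow-ssm-parameter-store-access \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "ssm:GetParameter",
          "ssm:GetParameters",
          "ssm:PutParameter",
          "ssm:AddTagsToResource"
        ],
        "Resource": "*"
      }
    ]
  }'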

High Level Architecture

The diagram below shows the high level architecture for provisioning the Amazon EKS cluster, which now includes the monitoring components (Prometheus and Grafana) using the Serverless Terraform pipeline.

Once the code is pushed to the infrastructure code repository, the Amazon EventBridge rule detects that a new commit has been pushed and it automatically triggers the Infrastructure pipeline. The pipeline creates a plan for the proposed changes and then sends an email to the approver, asking them to either approve or reject the changes.

After the change is approved, the Terraform pipeline proceeds to provision the Amazon Elastic Kubernetes Service cluster, along with the other resources that have been defined in the code that was just pushed.

In addition to the resources described in the above-mentioned blog, the following resources will also be created. These collectively make up the monitoring solution for our Amazon EKS cluster.

  • a Prometheus microservice will be provisioned inside the Amazon EKS cluster.
  • a Grafana microservice will be provisioned inside the Amazon EKS cluster.
  • a Grafana dashboard will be created, using Prometheus as the data source, to display the Amazon EKS cluster metrics.
  • the Grafana dashboard administrator credentials will be stored in AWS Systems Manager Parameter Store.
  • an ingress rule will be created to expose the Grafana dashboard to the internet using an Application Load Balancer.

Walkthrough of the Code

To get a better understanding of the monitoring solution, let’s go through the code that is used to deploy it.

Note

In this walkthrough, we will only focus on parts of the code that make up the monitoring solution. If you would like to understand the code that is used to provision the Amazon EKS cluster, please read through the previous blog at https://nivleshc.wordpress.com/2023/06/12/create-an-amazon-elastic-kubernetes-service-cluster-using-a-serverless-terraform-pipeline/.

  1. Clone my GitHub repository using the following command
git clone https://github.com/nivleshc/blog-tf-pipeline-eks-cluster.git

2. Open the folder named blog-tf-pipeline-eks-cluster. This folder contains all the files from my GitHub repository. Open the file called iam.tf. This file contains all the IAM resources that will be created. You will notice that three new blocks have been added at the bottom of this file.

The first resource block creates an IAM role, which will be used when creating the Amazon EKS EBS CSI driver addon. I will discuss the functionality of this addon further down in this blog. The second block is a data source that references the AWS managed policy "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy", and the third block attaches this policy to the newly created IAM role.

resource "aws_iam_role" "amazoneks_ebs_csi_driver_role" {
name = "${var.env}-${local.amazoneks_ebs_csi_controller.role_name_suffix}"
assume_role_policy = jsonencode({
Version : "2012-10-17",
Statement : [
{
Effect : "Allow",
Condition : {
StringEquals : {
"${replace(aws_eks_cluster.eks_cluster.identity[0].oidc[0].issuer, "https://", "")}:aud" : "sts.amazonaws.com",
"${replace(aws_eks_cluster.eks_cluster.identity[0].oidc[0].issuer, "https://", "")}:sub" : "system:serviceaccount:${local.amazoneks_ebs_csi_controller.namespace}:${local.amazoneks_ebs_csi_controller.service_account_name}"
}
},
Principal : {
Federated : "${aws_iam_openid_connect_provider.eks_cluster.arn}"
},
Action : "sts:AssumeRoleWithWebIdentity"
}
]
})
}
data "aws_iam_policy" "amazon_ebs_csi_driver_policy" {
arn = local.amazoneks_ebs_csi_controller.aws_managed_policy_arn
}
resource "aws_iam_role_policy_attachment" "amazoneks_ebs_csi_driver_role_policy_attach" {
role = aws_iam_role.amazoneks_ebs_csi_driver_role.id
policy_arn = data.aws_iam_policy.amazon_ebs_csi_driver_policy.arn
}
3. Next, open the file called eks.tf. A new resource block has been added to provision the EBS CSI driver addon for the Amazon EKS cluster. This addon will enable our microservices, in particular Prometheus, to use persistent storage (EBS volumes) to store their data.
resource "aws_eks_addon" "aws_ebs_csi_driver" {
cluster_name = aws_eks_cluster.eks_cluster.name
addon_name = local.amazoneks_ebs_csi_controller.addon_name
service_account_role_arn = aws_iam_role.amazoneks_ebs_csi_driver_role.arn
depends_on = [
aws_eks_cluster.eks_cluster
]
}

4. We will now look at the contents of the file called locals.tf. This file defines all the named values that are used in this repository’s Terraform code. You can think of locals as another type of variable. There are three local definitions that have been added to this file. These define the values that will be used when provisioning the EBS CSI Controller Amazon EKS addon, Prometheus and Grafana resources.

amazoneks_ebs_csi_controller = {
  service_account_name   = "ebs-csi-controller-sa"
  namespace              = "kube-system"
  role_name_suffix       = "AmazonEKS_EBS_CSI_DriverRole"
  aws_managed_policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
  addon_name             = "aws-ebs-csi-driver"
}

prometheus = {
  name      = "prometheus"
  namespace = "${var.env}-prometheus"
  helm = {
    repository = "https://prometheus-community.github.io/helm-charts"
    chart = {
      name    = "prometheus"
      version = "23.1.0"
    }
    values_filename = "values/prometheus.yaml"
  }
}

grafana = {
  name         = "grafana"
  namespace    = "${var.env}-grafana"
  service_port = 3000
  helm = {
    repository = "https://grafana.github.io/helm-charts"
    chart = {
      name    = "grafana"
      version = "6.58.4"
    }
    values_filename = "values/grafana.tftpl"
  }
  admin_credentials = {
    username_ssm_parameter_path = "/${var.env}/grafana/admin/username"
    password_ssm_parameter_path = "/${var.env}/grafana/admin/password"
  }
  ingress = {
    annotations = {
      scheme      = "internet-facing"
      target_type = "instance"
    }
    class_name = "alb"
    rule = {
      http = {
        path = {
          path      = "/"
          path_type = "Prefix"
        }
      }
    }
  }
}

5. Next, let’s look at the file that contains the Terraform code to provision the Prometheus resources. Open the file called prometheus.tf.

The first resource block creates a Kubernetes namespace. We will provision Prometheus inside this namespace.

resource "kubernetes_namespace" "prometheus" {
metadata {
name = local.prometheus.namespace
}
depends_on = [
aws_eks_cluster.eks_cluster
]
}

The second resource block is used to provision Prometheus using a Helm chart.

resource "helm_release" "prometheus" {
name = local.prometheus.name
repository = local.prometheus.helm.repository
chart = local.prometheus.helm.chart.name
version = local.prometheus.helm.chart.version
namespace = kubernetes_namespace.prometheus.id
values = ["${file(local.prometheus.helm.values_filename)}"]
depends_on = [
aws_eks_cluster.eks_cluster,
kubernetes_namespace.prometheus,
aws_eks_addon.aws_ebs_csi_driver
]
}

6. As you might have noticed, the above Helm chart is using a values file to override some of the default values. The values file is located inside the values subfolder and is named prometheus.yaml. The contents of the values file are displayed below.

server:
  persistentVolume:
    enabled: false
  service:
    servicePort: 9090
    type: ClusterIP
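
If you would like to confirm that Prometheus is running before moving on to Grafana, the following commands are a minimal sketch. They assume that your kubeconfig already points at the cluster and that the environment value is dev (so the namespace is dev-prometheus); adjust the namespace to match your ENV value.

# list the Prometheus pods (assumes the namespace is dev-prometheus)
kubectl get pods -n dev-prometheus

# temporarily forward the Prometheus server service to your machine and
# browse to http://localhost:9090 to view the Prometheus UI
kubectl port-forward svc/prometheus-server 9090:9090 -n dev-prometheus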

7. Next, open the file named grafana.tf. This file contains the resources that are used to provision Grafana.

The first resource block creates a Kubernetes namespace. All Grafana resources will be provisioned inside this namespace.

resource "kubernetes_namespace" "grafana" {
metadata {
name = local.grafana.namespace
}
depends_on = [
aws_eks_cluster.eks_cluster
]
}

The second resource block deploys Grafana using a Helm chart inside the grafana namespace.

resource "helm_release" "grafana" {
name = local.grafana.name
repository = local.grafana.helm.repository
chart = local.grafana.helm.chart.name
version = local.grafana.helm.chart.version
namespace = kubernetes_namespace.grafana.id
values = [
templatefile(local.grafana.helm.values_filename, { service_port = local.grafana.service_port, namespace = "${kubernetes_namespace.prometheus.id}", grafana_dashboards = fileset("${path.module}/grafana_dashboards/", "*.json"), module_path = "${path.module}" })
]
depends_on = [
aws_eks_cluster.eks_cluster,
kubernetes_namespace.grafana,
helm_release.prometheus
]
}

Some of the default values of the Grafana Helm chart are being overridden using a values file. This values file is actually a Terraform template file, located inside the subfolder called values and is named grafana.tftpl. Its contents are displayed below.

service:
  enabled: true
  type: NodePort
  port: ${service_port}
  targetPort: 3000
  annotations: {}
  labels: {}
  portName: service
  appProtocol: ""
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
    - name: prometheus
      type: prometheus
      url: http://prometheus-server.${namespace}.svc.cluster.local:9090
      access: proxy
      isDefault: true
# Provision grafana-dashboards-kubernetes
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
    - name: 'default'
      orgId: 1
      folder: ''
      type: file
      disableDeletion: true
      editable: true
      options:
        path: /var/lib/grafana/dashboards/default
dashboards:
  default:
%{ for dashboard in grafana_dashboards ~}
    ${indent(4, replace(replace(dashboard, ".json", ""), "./", ""))}:
      json: |
        ${indent(8, file("${module_path}/grafana_dashboards/${dashboard}"))}
%{ endfor }

The values file defines where the dashboard json files are located. In our case, there is a single dashboard file; it lives inside the folder called grafana_dashboards and is named kubernetes_nodes_views.json.

This dashboard is based on the prebuilt dashboard found at https://grafana.com/grafana/dashboards/15759-kubernetes-views-nodes/.

The next resource block inside grafana.tf provisions the ingress rule. This will expose the Grafana service to the internet.

resource "kubernetes_ingress_v1" "grafana" {
metadata {
name = local.grafana.name
namespace = kubernetes_namespace.grafana.id
annotations = {
"alb.ingress.kubernetes.io/scheme" = local.grafana.ingress.annotations.scheme
"alb.ingress.kubernetes.io/target-type" = local.grafana.ingress.annotations.target_type
}
labels = {
"app.kubernetes.io/name" = local.grafana.name
}
}
spec {
ingress_class_name = local.grafana.ingress.class_name
rule {
http {
path {
path = local.grafana.ingress.rule.http.path.path
backend {
service {
name = local.grafana.name
port {
number = local.grafana.service_port
}
}
}
path_type = local.grafana.ingress.rule.http.path.path_type
}
}
}
}
depends_on = [
aws_eks_cluster.eks_cluster,
kubernetes_namespace.grafana,
helm_release.grafana
]
}

The next block is actually a data source, not a resource. It reads the Kubernetes secret created by the Grafana Helm chart, which contains the Grafana dashboard administrator credentials.

data "kubernetes_secret_v1" "grafana_admin_credentials" {
metadata {
name = local.grafana.name
namespace = kubernetes_namespace.grafana.id
}
binary_data = {
"admin-password" = ""
"admin-user" = ""
"ldap-toml" = ""
}
depends_on = [
helm_release.grafana
]
}

The next two resource blocks are used to store the Grafana dashboard administrator’s username and password in AWS Systems Manager Parameter Store.

resource "aws_ssm_parameter" "grafana_admin_username" {
name = local.grafana.admin_credentials.username_ssm_parameter_path
type = "SecureString"
value = data.kubernetes_secret_v1.grafana_admin_credentials.binary_data.admin-user
depends_on = [
data.kubernetes_secret_v1.grafana_admin_credentials
]
}
resource "aws_ssm_parameter" "grafana_admin_password" {
name = local.grafana.admin_credentials.password_ssm_parameter_path
type = "SecureString"
value = data.kubernetes_secret_v1.grafana_admin_credentials.binary_data.admin-password
depends_on = [
data.kubernetes_secret_v1.grafana_admin_credentials
]
}

The last block is an output, which exposes the Grafana ingress load balancer URL. This is the URL that is used to access the Grafana dashboard from the internet.

output "grafana_service_ingress_http_hostname" {
description = "Grafana service ingress hostname"
value = kubernetes_ingress_v1.grafana.status[0].load_balancer[0].ingress[0].hostname
depends_on = [
kubernetes_ingress_v1.grafana
]
}
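
As an aside, once the solution has been deployed, the same hostname can also be read straight from the ingress resource using kubectl. This is a sketch only; it assumes your kubeconfig points at the cluster and that the environment value is dev (so the namespace is dev-grafana).

# print the ALB hostname assigned to the Grafana ingress (assumes the namespace is dev-grafana)
kubectl get ingress grafana -n dev-grafana \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'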

Deploying the solution

This solution assumes that the Serverless Terraform Pipeline exists. If it does not, then please use the instructions at https://nivleshc.wordpress.com/2023/03/28/use-aws-codepipeline-aws-codecommit-aws-codebuild-amazon-simple-storage-service-amazon-dynamodb-and-docker-to-create-a-pipeline-to-deploy-terraform-code/ to deploy it before continuing.

  1. Get the details of the Infrastructure CodeCommit repository clone URL. This was shown as an output of the TF_INFRA_APPLY CodeBuild project, when the Serverless Terraform Pipeline was deployed. It is also available from the AWS CodeCommit console.

2. Clone the infrastructure repository using the following command. You will get a warning that you are cloning an empty repository.

git clone <infrastructure repository clone url>

3. Next, copy the files contained inside the blog-tf-pipeline-eks-cluster folder into the folder where you cloned the infrastructure repository. Note – only copy the contents, and not the folder itself.

4. Open the file _provider.tf in your favourite IDE. In the provider “aws” block, under default_tags, update the Environment and Owner tags. For Environment, you could set this to “dev”. For Owner, you can provide your name. These values will be used to tag all resources that are provisioned for this solution.

5. Using Git, stage all the files, create a commit and push the changes to the AWS CodeCommit repository.
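
If you need a refresher on the Git commands, the following sketch shows one way to do this. The commit message is just an example, and the branch name may differ depending on your repository’s default branch.

# from inside the cloned infrastructure repository folder
git add .
git commit -m "Add Amazon EKS cluster with Prometheus and Grafana monitoring"
git push origin main    # replace main with your default branch name, if different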

6. Within a few minutes of the push, the Amazon EventBridge rule will detect that new content has been pushed to the repository, and it will trigger the Serverless Terraform pipeline (AWS CodePipeline).

7. The AWS CodePipeline pipeline will retrieve the artifacts from the AWS CodeCommit repository, store them in the Amazon S3 bucket and then proceed to the next stage, where it will create a Terraform plan for the changes.

8. The pipeline will then send an email to the Infrastructure change approver’s email address, asking them to either approve or reject the changes.

9. When you receive the email, approve the changes.

10. The AWS CodePipeline pipeline will then proceed to the next stage, where it will deploy the changes to your AWS Account. In this scenario, it will create the Amazon EKS cluster and all the other resources described in the Terraform resource blocks.

Note: If your Amazon EKS cluster already exists, only the monitoring resources will be provisioned.

Prometheus and Grafana microservices will be provisioned inside the Amazon EKS cluster. A Grafana dashboard to monitor the Amazon EKS cluster will be created, which will retrieve the metrics using Prometheus as the data source. The Grafana dashboard administrator credentials will be stored in AWS Systems Manager Parameter Store. An ingress rule will also be created, which will expose the Grafana dashboard to the internet using an Application Load Balancer.

11. The TF_INFRA_APPLY AWS CodeBuild project logs will contain the URL for accessing the Grafana dashboard using the ingress rule.

12. You can use the AWS Management Console to confirm that all resources have been provisioned successfully. If you prefer the command line, a quick spot-check is sketched below.
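
This is a minimal sketch; the cluster name, region and namespaces are placeholders and assumptions (they assume the environment value is dev), so adjust them to match your deployment.

# point kubectl at the cluster (replace the cluster name and region with your own)
aws eks update-kubeconfig --name <eks-cluster-name> --region <aws-region>

# check that the Prometheus and Grafana pods are running (assumes env is dev)
kubectl get pods -n dev-prometheus
kubectl get pods -n dev-grafana

# confirm the Grafana ingress has been assigned an ALB hostname
kubectl get ingress -n dev-grafana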

Testing the solution

Follow the instructions below to access the Grafana dashboard.

  1. After the Infrastructure AWS CodePipeline pipeline has successfully deployed the Amazon EKS cluster and the monitoring solution, check the TF_INFRA_APPLY AWS CodeBuild project logs for the grafana_service_ingress_http_hostname output. Its value contains the URL for accessing the Grafana dashboard.
  2. Next, using the AWS Management Console, access the AWS Systems Manager Portal and then browse to the Parameter Store section.
  3. Locate the two parameters with names similar to the ones below. These are the Grafana dashboard administrator credentials. {env} corresponds to the environment value (the value that was set to ENV when the Serverless Terraform Pipeline was deployed).
    • /{env}/grafana/admin/username
    • /{env}/grafana/admin/password
  4. The credentials stored in AWS Systems Manager Parameter Store are base64 encoded. Use the following command to decode them, where <value> refers to either the username or the password (alternatively, see the AWS CLI sketch after this list).
    • echo <value> | base64 --decode ; echo
  5. Use your internet browser to browse to the Grafana dashboard, using the URL from step 1 above.
  6. Log in to the Grafana dashboard using the decoded username and password.
  7. After you are successfully logged in, you should see the Grafana dashboard panel, as displayed in the preview section above.
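
As an alternative to steps 2 to 4, you can retrieve and decode the credentials in one go using the AWS CLI. This is a sketch that assumes the environment value is dev; adjust the parameter paths to match your ENV value.

# retrieve and decode the Grafana administrator username (assumes env is dev)
aws ssm get-parameter --name /dev/grafana/admin/username --with-decryption \
  --query 'Parameter.Value' --output text | base64 --decode ; echo

# retrieve and decode the Grafana administrator password (assumes env is dev)
aws ssm get-parameter --name /dev/grafana/admin/password --with-decryption \
  --query 'Parameter.Value' --output text | base64 --decode ; echo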

Cleaning up

After you have finished with the resources provisioned using this blog, it is important that you destroy them. This will ensure that you are not charged unnecessarily by AWS.

To destroy the resources, we will use the admin Terraform code included with the Serverless Terraform Pipeline repository. If you haven’t got a copy of this repository, clone it using the following command:

git clone https://github.com/nivleshc/blog-create-pipeline-to-deploy-terraform-code.git
  1. Open the folder containing the cloned copy of the Serverless Terraform Pipeline repository and browse into the admin subfolder.
  2. From inside the admin folder, open the file named Makefile.
  3. Update the values for the following variables inside this file. The values must match those that were used to provision the Serverless Terraform Pipeline.
    • <myenv> – change this to the environment name that was used
    • <myprojectname> – change this to the project name that was used
    • <mys3bucketname> – change this to the Amazon S3 bucket name that was used
  4. To ensure we are not left with any orphaned resources, which would require manual deletion, strictly follow the order below when destroying the resources.
    • first destroy all resources provisioned using the infrastructure pipeline
    • then destroy all resources for the Serverless Terraform pipeline
    • lastly, destroy all resources that were provisioned as part of the prerequisites for the Serverless Terraform Pipeline.
  5. Using a command line utility (for example Terminal on MacOS), browse to the admin folder.
  6. To destroy all resources provisioned using the Infrastructure pipeline, use the following commands.
    • make terraform_infra_init – this initialises the Terraform project
    • make terraform_infra_show – this displays all the resources provisioned using the Infrastructure pipeline
    • make terraform_infra_destroy – this destroys all resources provisioned using the infrastructure pipeline
  7. To destroy the Serverless Terraform Pipeline resources, use the following commands.
    • make terraform_pipeline_init – this initialises the Terraform project
    • make terraform_pipeline_show – this displays all the resources provisioned for the Serverless Terraform pipeline
    • make terraform_pipeline_destroy – this destroys all resources that make up the Serverless Terraform pipeline
  8. Lastly, to destroy the Serverless Terraform Pipeline prerequisite resources, run the following commands.
    • make terraform_prereq_init – this initialises the Terraform project
    • make terraform_prereq_show – this displays all the prerequisite resources that had been provisioned
    • make terraform_prereq_destroy – this destroys all the prerequisite resources

I hope this blog gave you great insights into how to set up a monitoring system for your Amazon EKS cluster. A good monitoring system will help you make informed decisions, optimise resource allocation, and identify potential issues before they escalate into major problems.

Till the next time, stay safe!