Prometheus is a widely used time-series database for monitoring the health and performance of AWS infrastructure. With its ecosystem of data collection, storage, alerting, and analysis capabilities, the open source tool set offers a complete monitoring package. Prometheus is ideal for scraping metrics from cloud-native services, storing the data for analysis, and monitoring the data with alerts.
In this article, we’ll take a look at the Prometheus ecosystem, offer some key considerations for setting up Prometheus to monitor AWS, highlight some of its shortcomings, and show how to solve them with Logz.io.
Prometheus has three core components: a scraper that pulls metrics from the endpoints exporters expose, a time series database, and an alerting system called Alertmanager.
Using this system, an exporter reads metrics from AWS infrastructure and exposes the data for Prometheus to scrape. For example, you can run a node exporter on EC2 and then configure Prometheus to pull metrics from your machines. A node exporter will collect all of your system information and then open a small server to expose these metrics.
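For instance, a minimal Prometheus scrape configuration pointing at a Node Exporter on an EC2 instance might look like the sketch below; the target address is a placeholder for your machine’s IP:

scrape_configs:
  - job_name: node
    static_configs:
      # Node Exporter serves metrics on port 9100 by default
      - targets: ['<ec2-instance-ip>:9100']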
While Prometheus scraping can be used to collect metrics from all kinds of infrastructure, it’s hugely popular thanks to its comparative ease of use in Kubernetes-based environments. Its auto discovery of new Kubernetes services has dramatically simplified Kubernetes monitoring. And we all know how popular Kubernetes is among today’s cloud developers.
Once data is scraped, Prometheus’ time-series database stores these metrics, while Alertmanager monitors them and pushes notifications to your desired endpoint.
Other tools in this ecosystem of course include Grafana, Trickster, Thanos, M3DB, Cortex, Pushgateway, and a number of other Prometheus exporters.
Trickster is a caching layer on top of Prometheus that can cache queries that are very frequent and/or large in scale; this can prove extremely useful in lowering the pressure on Prometheus itself.
Thanos, Cortex, and M3DB can be used to extend Prometheus with features including high availability, horizontal scaling, and historical backup. While Prometheus is a single-node solution, you can write the data to these time series databases to consolidate data from multiple servers for analysis.
Pushgateway enables push-based metrics in your Prometheus setup. By default, Prometheus can only read metrics from defined sources. You can simply push the metrics to Pushgateway, and Prometheus will then pull the metrics from there.
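As a quick sketch, a short-lived job could push a counter to a Pushgateway running on its default port 9091 with a plain HTTP request; the metric name and host below are illustrative placeholders:

# Push a single metric value under the job label "backup_job"
echo "backup_job_completions_total 1" | curl --data-binary @- http://<pushgateway-host>:9091/metrics/job/backup_job

Prometheus is then pointed at the Pushgateway endpoint like any other scrape target, typically with honor_labels: true so the pushed job labels are preserved.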
And while Prometheus is a powerful solution for collecting and storing metrics from cloud-native environments, its visualization capabilities are lacking.
As a result, most Prometheus users visualize their data with Grafana – an open source data visualization tool that easily connects to Prometheus. It has great support for Prometheus’ query language and is a highly capable and flexible metric visualization solution.
As mentioned, Prometheus runs on a single node, so it is inherently not designed for high availability. Since Prometheus stores metrics on a disk in a single machine, as the data grows, many users end up reducing the granularity or retention of their metrics to accommodate growing scale. In some cases, this comes at the expense of monitoring critical information.
To scale your system without reducing the cardinality of your metrics, however, you can implement tools like Thanos and Trickster to centralize your Prometheus metrics for storage and analysis.
But of course, adding additional components means invoking additional installations, adding infrastructure, creating more configurations, undertaking more upgrades, and increasing other maintenance tasks – all of which requires time. As a result, high availability Prometheus deployments can become increasingly difficult to manage as data volumes grow.
Finally, metrics are only one piece of the observability puzzle, and Prometheus isn’t purpose-built to collect and store logs or traces. For this reason, Prometheus users will inevitably end up isolating their metrics from their log and trace data – which can prove a recipe for observability tool sprawl. Those who want to unify their logs, metrics, and traces in one solution will need a different approach.
Key AWS Metrics to Monitor
Usage is the percentage of a resource that has been consumed. For example, if you’re storing 10 GB of data on a 100 GB disk, the usage percentage is 10%. There are different ways to monitor usage.
CPU usage is important to monitor because it helps you discover issues with, or high consumption of, CPU. This metric is available for AWS services like EC2 machines, load balancers, RDS, etc. The alert threshold, for example, can be when all your CPU cores hit 100% utilization.
Disk is the permanent storage (secondary storage) available to be consumed. This can be a critical metric to keep an eye on, since if there is no disk space left, all your software could stop working. Generally, the threshold for this is 90%; if you see 90% consumption, you should quickly extend the disk size. Services like RDS and EC2 have these metrics available.
Memory is the RAM used during any processing, with 100% memory utilization possibly triggering the OOM killer, terminating your process. The threshold here can be 80% utilization. Services like RDS, Elasticache, EC2, and ECS have these metrics.
Bandwidth is the network I/O being consumed by your services. You have to make sure that your network I/O doesn’t reach the limit of networking defined by AWS, which is 10 Gbps in most cases. You can monitor this in services like Managed NAT, EC2, Elasticache, and RDS.
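To make these thresholds concrete, here is a sketch of Prometheus alerting rules, assuming standard Node Exporter metrics are being scraped (the thresholds mirror the examples above and should be tuned for your environment):

groups:
  - name: usage-thresholds
    rules:
      - alert: HighCPUUsage
        # Fires when average CPU utilization across cores approaches 100%
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 10m
      - alert: DiskAlmostFull
        # Fires at the 90% disk consumption threshold
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 90
        for: 5m
      - alert: HighMemoryUsage
        # Fires at the 80% memory utilization threshold
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
        for: 5m
      - alert: HighNetworkOut
        # Fires as transmit throughput approaches 10 Gbps (1.25e9 bytes/sec)
        expr: rate(node_network_transmit_bytes_total[5m]) > 1.1e9
        for: 5m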
Request count helps you identify the usage of a given resource by telling you the number of times someone requests it. You have to watch for any anomaly here. Most AWS services have this metric, with the most important ones being load balancers, Elasticache, RDS, and EC2.
Error counts show whether errors are increasing or decreasing. Below are a few important error metrics that you should watch.
ELB Status Code
You should keep an eye on Elastic Load Balancer status codes as well. An increase in error status codes means that your application may not be performing well.
S3 Access Errors
This metric gives the number of requests that resulted in failed states either due to a permission error or “not found” error.
Healthy Host Count
The healthy host count metric is generally available for ELB and ALB. It is one of the most important metrics to monitor since it tells you how many healthy backends there are to serve requests. Any decline in this number can be a problem, so make sure to configure an alert for it.
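If you collect ELB metrics through the CloudWatch Exporter (covered later in this article), an alert on this metric could look roughly like the rule below; the metric name is an assumption based on the exporter’s aws_<namespace>_<metric>_<statistic> naming convention, so verify it against your own /metrics output:

- alert: NoHealthyBackends
  # HealthyHostCount as exposed by the CloudWatch Exporter
  expr: aws_elb_healthy_host_count_average < 1
  for: 5m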
AWS Performance Metrics
In the modern era of cloud computing, where latency can also be treated as an error, it is important to keep a watch on performance metrics. These will let you know if any scaling is required to run your application properly. Below are a few metrics that you should monitor in this space.
Latency numbers are very important. These can tell you a lot about your application saturation and how it can scale for further requests. If you see latency increase, there may be some problem with your application or you may need to increase the number of instances of your application.
Surge Queue Length
Surge queue length is the number of requests waiting to be served. This metric comes with ELB and ALB. You don’t want your requests to be in a queue, as this can dramatically increase response time.
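As with the healthy host count above, both latency and surge queue length can be alerted on once scraped via the CloudWatch Exporter; the metric names and thresholds below are illustrative assumptions to adapt to your setup:

- alert: HighELBLatency
  # Average backend latency above 1 second (example threshold)
  expr: aws_elb_latency_average > 1
  for: 10m
- alert: SurgeQueueBuildingUp
  # Requests queuing at the load balancer (example threshold)
  expr: aws_elb_surge_queue_length_maximum > 100
  for: 5m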
How Prometheus works
- Prometheus collects data in the form of time series. The time series are built through a pull model.
- The Prometheus server queries (scrapes) a list of data sources (sometimes called exporters) at a specific polling frequency.
- Prometheus data is stored in the form of metrics, with each metric having a name that is used for referencing and querying it.
- Prometheus stores data locally on disk, which makes for fast data storage and fast querying, and it also has the ability to ship metrics to remote storage.
- Each Prometheus server is standalone, not depending on network storage or other remote services.
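To illustrate the pull model, this is what a scraped endpoint actually serves: plain-text metrics in the Prometheus exposition format, each with a name, optional labels, and a value. The sample below is typical Node Exporter output:

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="user"} 2345.67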
This guide will be a step-by-step tutorial. To follow along, be sure to have an AWS account; you can create a new account on the AWS sign-up page.
Create a Linux EC2 instance
The first step in this tutorial is to create two Linux instances. The first instance you’ll create is for Prometheus. Log into the AWS console to launch an instance. Select the free tier-eligible Amazon Linux 2 AMI.
Next, choose t2.micro as the instance type.
Select the default VPC and subnet and leave the other settings at their defaults. You can add settings to your instance based on your own requirements, but for this tutorial, we’ll leave these settings at default.
For the security group, set the name as Prometheus-sg. We’ll open port 22 to be able to SSH into our Linux machine, port 9090 for Prometheus, port 9100 for Node Exporter, and port 9093 for Alertmanager.
Be sure to add a key pair so you can log in via SSH:
Back at the top, add a name tag. The name tag for this instance will be prometheus-server.
Review the settings one more time to confirm they are all correct, then click Launch.
Excellent! Our Linux server instance is up and running.
Create a second Linux EC2 instance
The first instance you created was for Prometheus. The second instance will be for Node Exporter. You can follow the same instructions above to create the second Linux instance, naming it linux-server.
Now that you have your infrastructure ready, we can continue the process. Use SSH to access linux-server and install Node Exporter on it. The key pair generated for this machine is Linux-machine. Depending on where your key is and how your local machine is configured, your SSH command could look similar to this:
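Here’s a sketch, assuming the key file is Linux-machine.pem and the default ec2-user account; substitute your instance’s public IP:

ssh -i Linux-machine.pem ec2-user@<linux-server-public-ip>

Now, we have successfully logged in to our Linux-server instance. We’ll visit prometheus.io to download node-exporter. For example, assuming the 1.6.1 release for 64-bit Linux (check the prometheus.io downloads page for the current version):

# Download and unpack the Node Exporter release
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvf node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
# Start Node Exporter; it exposes metrics on port 9100
./node_exporter &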
Congratulations! You’re done with installing a node-exporter on your Linux-server.
Integrating Prometheus with your AWS services
Using the CloudWatch Exporter to expose AWS metrics for Prometheus scraping is a popular way to monitor AWS. Let’s go through an example of implementing this exporter to collect EC2 metric data.
Integration of EC2 with Prometheus Using the CloudWatch Exporter
To integrate your EC2 machines with Prometheus, first start the CloudWatch Exporter using the following command:
java -jar target/cloudwatch_exporter-*-SNAPSHOT-jar-with-dependencies.jar 9106 example.yml
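The example.yml referenced in the command is the exporter’s configuration file; a minimal sketch for EC2 CPU metrics might look like this (adjust the region and the metrics list for your environment):

region: us-east-1
metrics:
  # Pull average CPU utilization per instance from the AWS/EC2 namespace
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Average]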
Next, configure your Prometheus server to start scraping metrics from these machines:
scrape_configs:
  - job_name: cloudwatch
    static_configs:
      - targets: ['<ip_of_ec2_machine>:<port>']
Now, configure the CloudWatch agent to specify which metrics to scrape from the machines.
- Install the CloudWatch agent. You can follow the AWS documentation to install it or use the command below:
sudo yum install amazon-cloudwatch-agent
- Update the Prometheus scrape config to identify the new metrics sources.
global:
  scrape_interval: 1m
  scrape_timeout: 10s
scrape_configs:
  - job_name: MY_JOB
    sample_limit: 10000
    ec2_sd_configs:
      - region: us-east-1
        port: 9404
        filters:
          - name: instance-id
            values:
              - i-98765432109876543
              - i-12345678901234567
You can get detailed instructions for the above steps in the AWS documentation.
Integration of CloudWatch Metrics with Prometheus
The easiest way to gather all of your metrics is to take them directly from CloudWatch, as most events are logged there. Simply install a CloudWatch exporter on one of your machines and run it:
java -jar target/cloudwatch_exporter-*-SNAPSHOT-jar-with-dependencies.jar 9106 example.yml
Input the proper configuration along with AWS credentials; these values can go in environment variables:
export AWS_ACCESS_KEY_ID="aws_key"
export AWS_SECRET_ACCESS_KEY="aws_secret"
Now, configure your Prometheus server to start scraping metrics from the CloudWatch exporter metric endpoints:
scrape_configs:
  - job_name: cloudwatch
    static_configs:
      - targets: ['<ip_of_cloud_watch_exporter_vm>:9106']
Further documentation on this is available from Logz.io, and you can also read about AWS Lambda integration with Prometheus.
Solving Prometheus Issues with Logz.io
As we’ve seen in the above discussion, scaling Prometheus can be a significant challenge and you may end up managing multiple components including Thanos, Trickster, Grafana, and underlying infrastructure. As an alternative, Logz.io can solve this problem for you, and very easily at that.
Using Logz.io, you can configure your existing Prometheus server to forward the metrics and thus offload the management complexity to the Logz.io Open 360™ observability platform.
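In practice, this is a remote_write entry in your Prometheus configuration; the listener URL and token below are placeholders, so check the Logz.io documentation for the exact endpoint and credentials for your account and region:

remote_write:
  - url: https://<logzio-listener-host>:8053
    authorization:
      type: Bearer
      credentials: <your-logzio-metrics-shipping-token>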