UW SSEC Lustre Statistics How-To

Introduction
This guide will take the user step-by-step through the Lustre Monitoring deployment that the Space Science and Engineering Center uses for monitoring all of its Lustre file systems. The author of this guide is Andrew Wagner (andrew.wagner@ssec.wisc.edu). Where possible, I have linked to our production configuration files for software to give readers a good idea of the possible settings they can or should use for their own setups.

Hardware Requirements
Any existing server can be used for a proof of concept version of this guide. The requirements for several thousand checks per minute are low; a small VM can easily handle the load.

Our production server easily handles ~150k checks per minute and, from a processing and disk I/O perspective, could handle much more. Here are the specs:


 * Dell PowerEdge R515
 * 2x 8-Core AMD Opteron 4386
 * 300GB RAID1 15K SAS
 * 200GB Enterprise SSD
 * 64GB RAM

Software Requirements

 * CentOS 6 x86_64
 * CentOS 6 EPEL Repository
 * Configuration Management System (Puppet, Ansible, Salt, Chef, etc.)

Notes on Scaling/Size of Metrics
Here at SSEC, we are collecting ~200k metrics per minute. The setup that we have could be scaled up with minimal effort; several million metrics per minute is not out of the question. However, if you collect tens of millions of metrics per minute, this approach will likely not work for you.

What will the final product look like?
Before embarking on deploying this infrastructure, take a look at some example dashboards that we generated with our Grafana instance. These are not Lustre-specific, but they show some finished products.

http://snapshot.raintank.io/dashboard/snapshot/lw0eZBCgUwHtZ2hEtF1At82c6bpl443l
 * MDF Switch for SSEC

http://snapshot.raintank.io/dashboard/snapshot/WXZG341nFdmWdcoEovHBPCWYmbEeumDv
 * Datacenter Coolers

http://snapshot.raintank.io/dashboard/snapshot/3SaSSlqEGyO9IjrfGru4V0nebn5TdiaD
 * Single Host

Setting up an OMD Monitoring Server
The first thing that we needed for our new monitoring deployment was a monitoring server. We were already using Check_MK with Nagios on our older monitoring server but the Open Monitoring Distribution nicely ties all of the components together. The distribution is available at http://omdistro.org/ and installs via RPM.

On a newly deployed CentOS 6 machine, I installed the OMD-1.20 RPM. This takes care of all of the work of installing Nagios, Check_MK, PNP4Nagios, etc.

After installation, I created the new OMD monitoring site with the omd create command (omd create sitename).

This creates a new site that runs its own stack of Apache, Nagios, Check_MK and everything else in the OMD distribution. Now we can start the site with omd start sitename.

You can now navigate to http://example.fqdn.com/sitename on your server, e.g. http://example.ssec.wisc.edu/ssec, and log in with the default OMD credentials.

We chose to set up LDAPS authentication against our Active Directory server to manage authentication. There is a good discussion of how to do this here: https://mathias-kettner.de/checkmk_multisite_ldap_integration.html

Additionally, we set up HTTPS for our web access to OMD: http://lists.mathias-kettner.de/pipermail/checkmk-en/2014-May/012225.html

At this point, you can start configuring your monitoring server to monitor hosts! Check_MK has a lot of configuration options, but it's a lot better than managing Nagios configurations by hand. Fortunately, Check_MK is widely used and well documented. The Check_MK documentation root is available at http://mathias-kettner.de/checkmk.html.

Deploying Agents to Lustre Hosts
On each host, the Check_MK agent runs as an xinetd service with a config file at /etc/xinetd.d/check_mk. The only_from parameter in that file lists the IP addresses allowed to access the agent. The OMD distribution comes with Check_MK agent RPMs. I rebuilt the RPM using rpmrebuild to include the IP addresses of our monitoring servers.

After rebuilding the RPM, push out the RPM to all hosts that will be monitored. We use a custom repository and Puppet for managing our existing software, so adding the RPM to the repo and pushing out via Puppet can be done with a simple module.

After deployment, we can verify the agents work by adding them to Check_MK via the GUI or configuration file and inventorying them. This will allow us to monitor a wide array of default metrics such as CPU Load, CPU Utilization, Memory use, and many others.
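One way to check an agent by hand before inventorying it is to read its raw output. The sketch below assumes the agent listens on its default TCP port 6556 and that the machine running the script has an address listed in only_from. The function names are made up for illustration.

```python
import socket


def fetch_agent_output(host, port=6556, timeout=10.0):
    """Read the plain-text dump that the agent sends as soon as we connect."""
    chunks = []
    sock = socket.create_connection((host, port), timeout)
    try:
        while True:
            data = sock.recv(4096)
            if not data:  # the agent closes the connection when it is done
                break
            chunks.append(data)
    finally:
        sock.close()
    return b"".join(chunks).decode("utf-8", "replace")


def section_names(agent_output):
    """Return the names found in the <<<section>>> header lines."""
    names = []
    for line in agent_output.splitlines():
        line = line.strip()
        if line.startswith("<<<") and line.endswith(">>>"):
            # Options such as ":sep(58)" may follow the name after a colon.
            names.append(line[3:-3].split(":")[0])
    return names
```

Calling section_names(fetch_agent_output("oss1.example.com")) against a hypothetical host should list sections such as check_mk and df. An empty result or a connection error usually points at the only_from setting or a firewall.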

Writing Local Checks to Run via Check_MK Agent
Now that the Check_MK agents are deployed to the Lustre servers, we can add Check_MK local agent checks to measure whatever we want. The documentation for local checks is here: http://mathias-kettner.de/checkmk_localchecks.html.

Each line of check output must contain four space-separated fields in this order: a Nagios status number, a name, performance data, and the check output text.

Check out the examples in the Check_MK documentation for formatting of output. You can use whatever language your server supports to execute the local check. At SSEC, Scott Nolin has implemented several Perl scripts to poll Lustre statistics and output in the Check_MK format. You can read more about the checks here: http://wiki.opensfs.org/Lustre_Statistics_Guide.
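As a rough illustration of that output format, here is a minimal local check sketch in Python. It is not one of the SSEC Perl scripts. The service name lustre_write_bytes, the read_write_bytes function, and its fixed return value are placeholders for whatever statistic you actually collect.

```python
#!/usr/bin/env python
"""Minimal Check_MK local check sketch with a placeholder metric source."""


def format_local_check(status, name, perfdata, text):
    """Build one output line: status number, name, perfdata, check output."""
    # Perfdata entries are name=value pairs joined by "|"; "-" means none.
    pairs = ["%s=%s" % (key, value) for key, value in sorted(perfdata.items())]
    perf = "|".join(pairs) or "-"
    return "%d %s %s %s" % (status, name, perf, text)


def read_write_bytes():
    """Placeholder: a real check would read Lustre statistics here."""
    return 123456


if __name__ == "__main__":
    value = read_write_bytes()
    print(format_local_check(0, "lustre_write_bytes",
                             {"write_bytes": value},
                             "OK - %d bytes written" % value))
```

The name field must not contain spaces, since spaces separate the fields. Status 0 is OK, 1 is WARNING, 2 is CRITICAL, and 3 is UNKNOWN, as in Nagios.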

Check_MK RRD Graphs
Once you start collecting this performance data, OMD automatically uses PNP4Nagios to create RRD graphs for each collected metric. Check_MK then displays these RRDs in the monitoring interface. This can be useful for small scale testing where you are only collecting a few tens of metrics. However, a thorough stat collection on large Lustre file systems can yield hundreds or even thousands of individual metrics. Check_MK and PNP4Nagios are thoroughly outclassed when asked to display such a large number of RRD graphs and respond poorly to high I/O situations.

Thus, we turn to the Graphite/Carbon metric storage system.

Deploying Graphite/Carbon
The Graphite/Carbon software package collects metrics and stores them in Whisper database files. Graphite is the web frontend and Carbon is the backend that controls the Whisper database files. Whisper files are similar to RRD files in that they have a defined size and fixed constraints on how the file manages time series data as time passes. However, Whisper has many key improvements as described here: http://graphite.readthedocs.org/en/latest/whisper.html
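Because a Whisper file has a defined size, you can estimate disk use per metric from its retention settings before collecting anything. The sketch below does that for a made-up retention string; 60s:30d,15m:1y is an example, not SSEC's setting. It assumes the Whisper layout of a 16-byte file header, 12 bytes of header per archive, and 12 bytes per data point, so treat the result as an estimate.

```python
# Seconds per unit suffix; a year is counted as 365 days.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400,
                "w": 604800, "y": 31536000}

FILE_HEADER_BYTES = 16     # assumed size of the file-level metadata
ARCHIVE_HEADER_BYTES = 12  # assumed size of each archive's header
POINT_BYTES = 12           # assumed size of one (timestamp, value) point


def to_seconds(token):
    """Convert a token such as '60s' or '30d' to seconds."""
    return int(token[:-1]) * UNIT_SECONDS[token[-1]]


def whisper_file_size(retentions):
    """Estimate the file size in bytes for a 'precision:duration' list."""
    total = FILE_HEADER_BYTES
    for archive in retentions.split(","):
        precision, duration = archive.strip().split(":")
        points = to_seconds(duration) // to_seconds(precision)
        total += ARCHIVE_HEADER_BYTES + points * POINT_BYTES
    return total


size = whisper_file_size("60s:30d,15m:1y")
```

Multiplying the per-file estimate by the number of metrics gives a first guess at the disk space the Carbon storage directory will need.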

The installation and basic setup of Graphite and Carbon is pretty easy. We used the version of Graphite found in EPEL.

Installing from EPEL provides both Graphite and Carbon. Graphite is a basic web frontend for visualizing data. The web configuration can be found at /etc/httpd/conf.d/graphite-web.conf. While the Graphite frontend works well enough, at SSEC we vastly prefer the usability of Grafana. The next section describes how that frontend is deployed and configured.

There are three Carbon services that need to be set to run on startup:


 * carbon-aggregator
 * carbon-cache
 * carbon-relay

The Carbon configuration files can be found at /etc/carbon. Below, I've linked to our settings for the various Carbon configuration files. I don't attest to the correctness of these settings, but if you have no idea where to start, these will at least get you up and running!


 * http://www.ssec.wisc.edu/~andreww/files/carbon.conf
 * http://www.ssec.wisc.edu/~andreww/files/storage-aggregation.conf
 * http://www.ssec.wisc.edu/~andreww/files/storage-schemas.conf

Once Carbon is running, you can use the Graphite/Carbon installation as is if you don't need dashboards and such. Graphite is well documented and you can read more about the software here: http://graphite.readthedocs.org/en/latest/
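Without a dashboard, stored data can still be pulled straight from Graphite over HTTP through its render endpoint. The sketch below only builds such a URL. The host name graphite.example.com and the metric name are hypothetical, and the sketch assumes Graphite is served from the web root.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2


def render_url(base_url, target, time_from="-1h", fmt="json"):
    """Build a render URL that returns the target's data points."""
    query = urlencode([("target", target),
                       ("from", time_from),
                       ("format", fmt)])
    return "%s/render?%s" % (base_url.rstrip("/"), query)


# Hypothetical server and metric name, used only for illustration.
url = render_url("http://graphite.example.com",
                 "servers.example-host.load.load1")
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, should return the last hour of data for the metric as JSON.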

Graphite Metric Namespace
Creating an appropriate namespace for Graphite metrics is difficult. We went through a dozen iterations at SSEC before arriving at one that is now largely satisfactory. The Graphite namespace refers to how you organize your metrics in the Graphite/Carbon system and how you will access them later.

Below is an example namespace for an SSEC Lustre OSS in our Delta filesystem:

The above namespace identifies the bytes written to OST0010 on the delta-1-21 server under the lustre.oss category. You can almost think of these as paths to the Whisper files in which Graphite stores metrics. Each field between the periods can vary. For example, I could change write_bytes to read_bytes to get that metric for OST0010 on delta-1-21, or change delta-1-21 to delta-3-11 and pick a different OST entirely. Each of those fields can be named anything logical and then accessed with Graphite or Grafana to create graphs of whatever you are interested in visualizing.

Here's another example:

The above namespace points to the memory buffer metrics for a server named r720-0-5 of the compute type in the SSEC HPC cluster Iris. Look at the chart below to get the best idea of how the namespace works within Graphite in practice. One of the great things about Graphite is that you can use wildcards to match all metrics in a given namespace field. See how you might use that in practice below:

In the above namespace examples, I used wildcards to select all metrics in the given field. These selection methods can be combined with mathematical functions to create powerful graphs. For example, the last entry in the chart would display all of the cpu_usage metrics for the compute nodes in the iris group of servers.

As in the above example, these Graphite namespaces show the power of wildcards. In this case, the last entry in the chart would display the write_bytes rate of all the OSTs in the delta filesystem. Graphite's built-in mathematical functions could sum the metrics to create a graph of the write rate of the entire filesystem.
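To make the wildcard idea concrete, the sketch below imitates the selection in plain Python: a * matches any value within one field, never across a period. The metric names and values are invented for illustration and are not SSEC's actual namespace. Graphite does this matching itself, so the code only demonstrates the behavior.

```python
from fnmatch import fnmatchcase


def matches(pattern, metric):
    """Compare field by field so that '*' cannot span a period."""
    pattern_fields = pattern.split(".")
    metric_fields = metric.split(".")
    if len(pattern_fields) != len(metric_fields):
        return False
    return all(fnmatchcase(field, pat)
               for pat, field in zip(pattern_fields, metric_fields))


def select(pattern, metrics):
    """Return the subset of metrics whose names match the pattern."""
    return {name: value for name, value in metrics.items()
            if matches(pattern, name)}


# Invented latest values, keyed by a hypothetical namespace.
latest = {
    "lustre.oss.delta-1-21.OST0010.write_bytes": 100.0,
    "lustre.oss.delta-1-21.OST0011.write_bytes": 250.0,
    "lustre.oss.delta-3-11.OST0020.write_bytes": 50.0,
    "lustre.oss.delta-1-21.OST0010.read_bytes": 75.0,
}

selected = select("lustre.oss.*.*.write_bytes", latest)
filesystem_write_rate = sum(selected.values())
```

In Graphite itself, the equivalent would be to wrap the wildcard target in a function such as sumSeries, which adds the matched series together point by point.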

Deploying Grafana
Grafana is the dashboard that SSEC prefers to use for data visualization.


 * To read about Grafana, check out this link: http://docs.grafana.org/
 * To try a test install of Grafana to get a feel for use, go here: http://play.grafana.org
 * To install Grafana, we used the RPM available here: http://docs.grafana.org/installation/rpm/

Building dashboards via the Grafana GUI is easy, and it has become the 'analyst tool of choice' for understanding data. These dashboards will serve many needs.

When these types of dashboards are not enough, or the workflow makes them too tedious to build, you can create *scripted dashboards* in Grafana. These are JavaScript programs and require some coding, so they are more work to create but potentially very powerful.

Using Graphios to Redirect Lustre Stats to Carbon
Given our decision to stick with a single monitoring infrastructure, we needed a way to direct the performance data coming in via Nagios into Graphite. Graphios takes a Nagios performance data file and parses it for a special set of data marked via special prefixes/postfixes. Graphios then pipes this data to the Carbon ingest port for storage and access via Graphite.
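The core of that translation can be sketched in a few lines. This is not Graphios code. It assumes Nagios-style performance data (label=value with optional unit and thresholds), labels without spaces, Carbon's plaintext protocol (one "path value timestamp" line per metric), and Carbon's default plaintext port 2003. The prefix and host are hypothetical.

```python
import re
import socket

# Optional quotes around the label, then "=", then a numeric value.
# Any unit suffix and the ;warn;crit;min;max fields are ignored.
PERF_RE = re.compile(r"^'?([^'=]+)'?=(-?[0-9.]+)")


def perfdata_to_carbon(prefix, perfdata, timestamp):
    """Turn a perfdata string into Carbon plaintext lines."""
    lines = []
    for item in perfdata.split():
        match = PERF_RE.match(item)
        if match is None:
            continue  # skip anything that is not label=value
        label, value = match.group(1), match.group(2)
        lines.append("%s.%s %s %d" % (prefix, label, value, timestamp))
    return lines


def send_to_carbon(lines, host="localhost", port=2003):
    """Send the lines to Carbon, one newline-terminated metric per line."""
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    sock = socket.create_connection((host, port), 10)
    try:
        sock.sendall(payload)
    finally:
        sock.close()
```

For example, perfdata_to_carbon("lustre.oss.delta-1-21.OST0010", "write_bytes=1024;;;0", 1400000000) yields a single line ending in "write_bytes 1024 1400000000". The prefix is what places the metric in the Graphite namespace.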

The Graphios GitHub page has lots of installation and configuration examples, some of which I've contributed. Detailed setup instructions for OMD are contained within the README at https://github.com/shawn-sterling/graphios.