UW SSEC Lustre Statistics How-To

From OpenSFS Wiki
Revision as of 10:42, 23 March 2015

Introduction

This guide will take the user step-by-step through the Lustre Monitoring deployment that the Space Science and Engineering Center uses for monitoring all of its Lustre file systems. The author of this guide is Andrew Wagner ([email protected]).

Hardware Requirements

Any existing server can be used for a proof-of-concept version of this guide. The requirements for several thousand checks per minute are low; a small VM can easily handle the load.

Our production server easily handles ~150k checks per minute and, from a processing and disk I/O perspective, could handle much more. Here are the specs:

Dell PowerEdge R515
2x8-Core AMD Opteron 4386
200GB Enterprise SSD
64GB RAM

Building the Lustre Monitoring Deployment

Setting up an OMD Monitoring Server

The first thing we needed for the new monitoring deployment was a monitoring server. We were already using Check_MK with Nagios on our older monitoring server, but the Open Monitoring Distribution (OMD) ties all of the components together nicely. The distribution is available at http://omdistro.org/ and installs via RPM.

On a newly deployed CentOS 6 machine, I installed the OMD-1.20 RPM, which takes care of installing Nagios, Check_MK, PNP4Nagios, and the other bundled components.

After installation, I created the new OMD monitoring site:

omd create ssec

This creates a new site that runs its own stack of Apache, Nagios, Check_MK and everything else in the OMD distribution. Now we can start the site:

omd start ssec
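The two commands above can be wrapped in a small script so the same steps are repeatable on a test VM. This is only a sketch: the DRY_RUN switch and the function names are assumptions of this example, not OMD features, and the omd commands must be run as root on a host where OMD is already installed.

```shell
#!/bin/sh
# Sketch: create, start, and check an OMD site.
# DRY_RUN=1 is a convention of this script only (not part of OMD);
# it prints each command instead of executing it.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

omd_bootstrap() {
    # $1 = OMD site name, e.g. ssec
    site="$1"
    run omd create "$site" &&
    run omd start "$site" &&
    run omd status "$site"
}

# Example (prints the commands without running them):
#   DRY_RUN=1; omd_bootstrap ssec
```

The final omd status call reports the state of each service in the site, which is a quick way to confirm that everything came up.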

You can now navigate to http://example.fqdn.com/sitename (substituting your server's FQDN and site name), e.g. http://example.ssec.wisc.edu/ssec, and log in with the default OMD credentials.
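Before opening a browser, it can be handy to confirm from the command line that the site's Apache instance is answering. The helpers below are a hypothetical sketch (the function names are made up for this example) and assume curl is installed.

```shell
#!/bin/sh
# Sketch: build the OMD site URL and probe it with curl.

site_url() {
    # $1 = server FQDN, $2 = OMD site name, $3 = scheme (defaults to http)
    printf '%s://%s/%s/\n' "${3:-http}" "$1" "$2"
}

site_http_status() {
    # Prints the HTTP status code returned for the site URL.
    # curl prints 000 when it cannot connect at all.
    curl -s -o /dev/null -w '%{http_code}\n' "$(site_url "$@")"
}

# Example:
#   site_url example.ssec.wisc.edu ssec
#   site_http_status example.ssec.wisc.edu ssec
```

Any status other than 000 means the web server responded; the exact code depends on how authentication is configured for the site.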

We chose to authenticate users over LDAPS against our Active Directory server. There is a good discussion of how to do this here: https://mathias-kettner.de/checkmk_multisite_ldap_integration.html

Additionally, we set up HTTPS for web access to OMD: http://lists.mathias-kettner.de/pipermail/checkmk-en/2014-May/012225.html

At this point, you can start configuring your monitoring server to monitor hosts! Check_MK has a lot of configuration options, but it's a lot better than managing Nagios configurations by hand. Fortunately, Check_MK is widely used and well documented.

Deploying Agents to Lustre Hosts

On each monitored host, check_mk_agent runs as an xinetd service configured in /etc/xinetd.d/check_mk. The only_from parameter in that file lists the IP addresses allowed to access the agent. I rebuilt the agent RPM with rpmrebuild so that it ships with our updated IP addresses.
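To illustrate what the only_from change looks like, the sketch below prints an xinetd service definition with the allowed addresses filled in. The field values follow the stock agent configuration as best I recall it (TCP port 6556, agent binary at /usr/bin/check_mk_agent); treat them as assumptions and compare against the file shipped in your agent RPM. The function name and the addresses in the example are hypothetical.

```shell
#!/bin/sh
# Sketch: render /etc/xinetd.d/check_mk with a custom only_from list.
# Writes to stdout so the result can be reviewed before it is installed.

render_check_mk_xinetd() {
    # All arguments are the IP addresses allowed to query the agent.
    allowed="$*"
    cat <<EOF
service check_mk
{
        type           = UNLISTED
        port           = 6556
        socket_type    = stream
        protocol       = tcp
        wait           = no
        user           = root
        server         = /usr/bin/check_mk_agent
        only_from      = $allowed
        disable        = no
}
EOF
}

# Example (placeholder addresses):
#   render_check_mk_xinetd 127.0.0.1 192.0.2.10 > check_mk.xinetd
```

Listing 127.0.0.1 alongside the monitoring server's address keeps local testing of the agent possible.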

After rebuilding the RPM, push it out to all hosts that will be monitored. We use a custom repository and Puppet to manage our existing software, so adding the RPM to the repo and deploying it takes only a simple Puppet module.
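Once the RPM is deployed, a quick check from the monitoring server confirms that each agent answers and that only_from admits the server. The agent replies with plain-text sections, the first of which is <<<check_mk>>> and contains a Version: line. The helper below is a hypothetical sketch that extracts that version from agent output on standard input; the host name in the example is a placeholder, and nc is assumed to be installed.

```shell
#!/bin/sh
# Sketch: pull the agent version out of check_mk_agent output on stdin.

agent_version() {
    awk '
        /^<<<check_mk>>>/ { in_section = 1; next }   # start of the header section
        /^<<</            { in_section = 0 }          # any other section ends it
        in_section && $1 == "Version:" { print $2; exit }
    '
}

# Example (run on the monitoring server; placeholder host name):
#   nc lustre-oss1.example.org 6556 | agent_version
```

An empty result usually means the connection was refused or the server's address is missing from only_from.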

Writing Local Checks to Run via Agents

Check_MK RRD Graphs

Deploying Graphite/Carbon

Deploying Grafana

Using Graphios to Redirect Lustre Stats to Carbon