Lustre Monitoring and Statistics Guide

DRAFT IN PROGRESS

Introduction

There are a variety of useful statistics and counters available on Lustre servers and clients. This is an attempt to detail some of these statistics and methods.

This does not include Lustre log analysis.

The presumed audience for this is system administrators attempting to better understand and monitor their Lustre file systems.

Lustre Versions

This information is based on working mostly with Lustre 2.4 and 2.5.

Reading /proc vs lctl

'cat /proc/fs/lustre...' vs 'lctl get_param': With newer Lustre versions, 'lctl get_param' is the standard and recommended way to get these stats. This is to ensure portability. I will use this method in all examples; as a bonus, the syntax is often a little shorter.

Data Formats

The format of the various statistics files varies (and I'm not sure if there is any reason for this). The format names used here are entirely *my invention*; they are not a Lustre standard.

It is useful to know the various formats of these files so you can parse the data and collect it for use in other tools.

Stats

What I consider "standard" stats files contain one multi-line record per target (for example, each OST or MDT): a line naming the target, and then just the data.

Example:

obdfilter.scratch-OST0001.stats=
snapshot_time             1409777887.590578 secs.usecs
read_bytes                27846475 samples [bytes] 4096 1048576 14421705314304
write_bytes               16230483 samples [bytes] 1 1048576 14761109479164
get_info                  3735777 samples [reqs]

snapshot_time = when the stats were written

For read_bytes and write_bytes:

First number = number of times (samples) the OST has handled a read or write.
Second number = the minimum read/write size.
Third number = the maximum read/write size.
Fourth number = the sum of all the read/write requests in bytes, i.e. the quantity of data read/written.
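As an illustration of parsing this format, here is a minimal Python sketch. The function name parse_stats and the shape of the returned dictionary are my own choices for this example, not anything defined by Lustre, and it assumes the output looks like the sample above.

```python
def parse_stats(text):
    """Parse 'lctl get_param' output in the "stats" format.

    Returns {target: {counter_name: {...}}}. Assumes each record starts
    with a "name=" header line, as in the sample above.
    """
    targets = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("="):
            # Header such as "obdfilter.scratch-OST0001.stats="
            current = targets.setdefault(line[:-1], {})
            continue
        if current is None:
            continue  # data before any header: ignore
        fields = line.split()
        name = fields[0]
        if name == "snapshot_time":
            current[name] = float(fields[1])
            continue
        # "<name> <count> samples [<unit>]" optionally followed by min max sum
        entry = {"samples": int(fields[1]), "unit": fields[3].strip("[]")}
        if len(fields) >= 7:
            entry["min"] = int(fields[4])
            entry["max"] = int(fields[5])
            entry["sum"] = int(fields[6])
        current[name] = entry
    return targets


sample = """obdfilter.scratch-OST0001.stats=
snapshot_time             1409777887.590578 secs.usecs
read_bytes                27846475 samples [bytes] 4096 1048576 14421705314304
write_bytes               16230483 samples [bytes] 1 1048576 14761109479164
get_info                  3735777 samples [reqs]
"""

stats = parse_stats(sample)
reads = stats["obdfilter.scratch-OST0001.stats"]["read_bytes"]
# Average read size in bytes = sum / samples
print(reads["sum"] / reads["samples"])
```

Dividing the sum by the number of samples gives the average request size, which is often more telling than either number alone.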

Jobstats

Jobstats are slightly more complex multi-line records. Each OST or MDT also has an entry for each jobid (or procname_uid perhaps), and then the data.

Example:

obdfilter.scratch-OST0000.job_stats=job_stats:
- job_id:          56744
  snapshot_time:   1409778251
  read:            { samples:       18722, unit: bytes, min:    4096, max: 1048576, sum:     17105657856 }
  write:           { samples:         478, unit: bytes, min:    1238, max: 1048576, sum:       412545938 }
  setattr:         { samples:           0, unit:  reqs }
  punch:           { samples:          95, unit:  reqs }
- job_id: . . . ETC

Notice this is very similar to 'stats' above. But there's a lot of extra: { bling: }! Why? Just because it got coded that way?
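The extra punctuation is not hard to deal with. Below is a rough Python sketch that pulls the jobstats records apart with regular expressions. The names parse_job_stats and parse_braces are made up for this example, and it assumes the layout shown in the sample above. It does not attempt to handle every variation different Lustre versions may produce.

```python
import re

JOB_RE = re.compile(r"^- job_id:\s*(\S+)")
TIME_RE = re.compile(r"^snapshot_time:\s*(\d+)")
FIELD_RE = re.compile(r"(\w+):\s*\{([^}]*)\}")


def parse_braces(body):
    """Turn 'samples: 18722, unit: bytes, ...' into a dict."""
    entry = {}
    for part in body.split(","):
        if not part.strip():
            continue
        key, _, value = part.partition(":")
        value = value.strip()
        entry[key.strip()] = int(value) if value.isdigit() else value
    return entry


def parse_job_stats(text):
    """Return {target: {job_id: {operation: {...}}}} from job_stats output."""
    targets = {}
    jobs = None
    job = None
    for raw in text.splitlines():
        line = raw.strip()
        if "=" in line and line.endswith("job_stats:"):
            # Header such as "obdfilter.scratch-OST0000.job_stats=job_stats:"
            jobs = targets.setdefault(line.split("=", 1)[0], {})
            job = None
            continue
        if jobs is None:
            continue
        match = JOB_RE.match(line)
        if match:
            # Keep the job id as a string; it may be a procname_uid
            job = jobs.setdefault(match.group(1), {})
            continue
        if job is None:
            continue
        match = TIME_RE.match(line)
        if match:
            job["snapshot_time"] = int(match.group(1))
            continue
        # One line may carry one or more "name: { ... }" entries
        for name, body in FIELD_RE.findall(line):
            job[name] = parse_braces(body)
    return targets
```

The job id is kept as a string on purpose, since it is not always a number.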

Single

These really boil down to just a single number in a file. But if you use "lctl get_param" you get an output that is nice for parsing. For example:

[COMMAND LINE]# lctl get_param osd-ldiskfs.*OST*.kbytesavail


osd-ldiskfs.scratch-OST0000.kbytesavail=10563714384
osd-ldiskfs.scratch-OST0001.kbytesavail=10457322540
osd-ldiskfs.scratch-OST0002.kbytesavail=10585374532
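Parsing this "name=value" output takes only a few lines. Here is a small Python sketch; parse_single is a name invented for this example, and non-numeric values are simply skipped.

```python
def parse_single(text):
    """Parse 'name=value' lines from lctl get_param into {name: int}."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if not sep:
            continue  # not a name=value line
        try:
            values[name] = int(value)
        except ValueError:
            continue  # skip anything that is not a plain integer
    return values


sample = """osd-ldiskfs.scratch-OST0000.kbytesavail=10563714384
osd-ldiskfs.scratch-OST0001.kbytesavail=10457322540
osd-ldiskfs.scratch-OST0002.kbytesavail=10585374532
"""

avail = parse_single(sample)
# Total space available across the listed OSTs, in kilobytes
total_kbytes = sum(avail.values())
print(total_kbytes)
```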

Histogram

Some stats are histograms; these types aren't covered here. Typically they're useful on their own without further parsing(?). Examples:

brw_stats
extent_stats

Interesting Statistics Files

This is a collection of various stats files that I have found useful. It is *not* complete or exhaustive. For example, you will notice these are mostly server stats. There is a wealth of client stats too that are not detailed here. Additions or corrections are welcome.

Host Type = MDS, OSS, client
Target = "lctl get_param target"
Format = data format discussed above

Host Type	Target	Format	Discussion
MDS	mdt.MDT.num_exports	single	number of exports per MDT - these are clients, including other lustre servers
MDS	mdt.*.job_stats	jobstats	Metadata jobstats. Note that with lustre DNE you may have more than one MDT, so even if you don't it may be wise to design any tools with that assumption.
OSS	obdfilter.*.job_stats	jobstats	the per OST jobstats.
MDS	mdt.*.md_stats	stats	Overall metadata stats per MDT
MDS	mdt.MDT.exports.@.stats	stats	Per-export metadata stats. Exports are clients; this also includes other lustre servers. The exports are named by interfaces, which can be unwieldy. See "lltop" for an example of a script that used this data well. The sum of all the export stats should provide the same data as md_stats, but it is still very convenient to have md_stats; "ltop" uses them, for example.
OSS	obdfilter.*.stats	stats	Operations per OST. Read and write data is particularly interesting
OSS	obdfilter.OST.exports.@.stats	stats	per-export OSS statistics
MDS	osd-.MDT*.filesfree or filestotal	single	available or total inodes
MDS	osd-.MDT*.kbytesfree or kbytestotal	single	available or total disk space
OSS	obdfilter.OST.kbytesfree or kbytestotal, filesfree, filestotal	single	inodes and disk space as in MDS version
OSS	ldlm.namespaces.filter-*.pool.stats	stats	lustre distributed lock manager (ldlm) stats. I do not fully understand all of these stats. It also appears that these same stats are duplicated as "single" files. Perhaps this is just a convenience.
OSS	ldlm.namespaces.filter-*.lock_count	single	lustre distributed lock manager (ldlm) locks
OSS	ldlm.namespaces.filter-*.pool.granted	single	lustre distributed lock manager (ldlm) granted locks - normally this matches lock_count. I am not sure of what the differences are, or what it means when they don't match.
OSS	ldlm.namespaces.filter-*.pool.grant_rate	single	ldlm lock grant rate aka 'GR'
OSS	ldlm.namespaces.filter-*.pool.grant_speed	single	ldlm lock grant speed = grant_rate - cancel_rate. You can use this to derive cancel_rate 'CR'. Or you can just get 'CR' from the stats file I assume.
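The grant_speed relation in the last row can be rearranged to cancel_rate = grant_rate - grant_speed. A small Python sketch of that arithmetic follows. The function name cancel_rates is my own, and the namespace name and numbers in the example are invented for illustration, not real measurements.

```python
def cancel_rates(values):
    """Derive the cancel rate per ldlm namespace.

    'values' maps parameter names to integers, e.g. as parsed from
    'lctl get_param' output for pool.grant_rate and pool.grant_speed.
    Uses grant_speed = grant_rate - cancel_rate, rearranged.
    """
    suffix = ".pool.grant_rate"
    result = {}
    for name, grant_rate in values.items():
        if not name.endswith(suffix):
            continue
        namespace = name[: -len(suffix)]
        grant_speed = values.get(namespace + ".pool.grant_speed")
        if grant_speed is not None:
            result[namespace] = grant_rate - grant_speed
    return result


# Hypothetical namespace name and values, for illustration only
example = {
    "ldlm.namespaces.filter-example.pool.grant_rate": 120,
    "ldlm.namespaces.filter-example.pool.grant_speed": 20,
}
print(cancel_rates(example))
```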

Working With the Data

Packages, tools, and techniques for working with Lustre statistics.

Open Source Monitoring Packages

LMT - provides 'top' style monitoring of server nodes, and historical data via mysql. https://github.com/chaos/lmt
lltop and xltop - monitoring with batch scheduler integration. Newer Lustre versions with jobstats likely provide similar data very conveniently, but these are still very good for examples of working with monitoring data. https://github.com/jhammond/lltop https://github.com/jhammond/xltop

Commercial Monitoring Packages

Terascala 'teraos'
DDN datablarker
Intel Enterprise Edition for Linux Managerator

Build it Yourself

Here are some basic steps and techniques for working with the Lustre statistics.

Gather the data on the hosts you are monitoring. Deal with the syntax and extract what you want.
Collect the data centrally - either pull or push it to your server, or collection of monitoring servers.
Process the data - this may be optional or minimal.
Alert on the data - optional but often useful.
Present the data - allow for visualization, analysis, etc.
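The first two steps above can be sketched in Python roughly as follows. This is only an outline under my own assumptions: the function names gather, extract and collect are invented, only "single" format values are handled, and the transport is left as a callback so you can plug in whatever push or pull mechanism you use.

```python
import subprocess
import time


def gather(pattern, run=None):
    """Gather: run 'lctl get_param <pattern>' and return its text output.

    'run' can be replaced with a stand-in on hosts without Lustre.
    """
    if run is None:
        def run(args):
            done = subprocess.run(args, capture_output=True, text=True, check=True)
            return done.stdout
    return run(["lctl", "get_param", pattern])


def extract(text, now=None):
    """Deal with the syntax: keep 'name=<integer>' lines only.

    Returns a list of (metric_name, value, timestamp) tuples. The
    timestamp is the local clock, since "single" files carry none.
    """
    now = int(time.time()) if now is None else now
    samples = []
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and value.isdigit():
            samples.append((name, int(value), now))
    return samples


def collect(samples, send):
    """Collect: hand each sample to a transport callback (push model)."""
    for sample in samples:
        send(sample)
```

On a server this might be called as collect(extract(gather("osd-ldiskfs.*OST*.kbytesavail")), print), with print swapped for something that ships the sample to your monitoring server.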

Some recent tools for working with metrics and time series data have made some of the more difficult parts of this task relatively easy, especially graphical presentation.

Here are details of some solutions tested or in use:

Collectl and Ganglia

Collectl supports Lustre stats. Note that there have recently been some changes: Lustre support in collectl is moving to plugins. See http://sourceforge.net/p/collectl/mailman/message/31992463 and https://github.com/pcpiela/collectl-lustre

This process is not based on the new versions, but they should work similarly.

collectl - does the gather by writing to a text file on the host being monitored.
ganglia - does the collect via gmond and the python script 'collectl.py', and the present via ganglia web pages. There is no alerting.

See https://wiki.rocksclusters.org/wiki/index.php/Roy_Dragseth#Integrating_collectl_and_ganglia

Perl and Graphite

Graphite is a very convenient tool for storing, working with, and rendering graphs of time-series data. At SSEC we did a quick prototype for collecting and sending MDS and OSS data using perl. The choice of perl is not particularly important; python or the tool of your choice is fine.

Software Used:

Graphite - http://graphite.readthedocs.org/en/latest/
Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts
lustrestats scripts (lost to the sands of time?) - these are simply run every minute via cron on the servers
Grafana - http://grafana.org - this is a dashboard and graph editor for graphite. It is not required, as graphite can be used directly, but it is very convenient. It not only makes creating dashboards easy, but also encourages rapid, interactive analysis of the data. Note that elasticsearch can be used to store dashboards for grafana, but it is not required.

check_mk and Graphite

Instead of sending data directly with perl, use the check_mk local agent and pnp4nagios. This means a reasonable infrastructure is already there, alerting is simple, and the graphios plugin can pass the data on to Graphite.

Collecting via perl allowed us to send the timestamp from the Lustre stats (when they exist!) directly to Carbon, Graphite's data collection tool. When using the check_mk method this timestamp is lost; timestamps are then based on when the local agent check runs. This will introduce some inaccuracy - a delay of up to your sample rate.
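To show what sending the Lustre timestamp to Carbon looks like, here is a Python sketch (not the perl we used). It relies on Carbon's plaintext protocol, which takes one "metric.path value timestamp" line per data point, by default on TCP port 2003. The function names, the metric prefix and the metric naming scheme are assumptions made up for this example.

```python
import socket


def carbon_lines(prefix, samples):
    """Format (name, value, timestamp) tuples for Carbon's plaintext protocol.

    Each data point becomes one "path value timestamp" line. The
    timestamp is truncated to whole seconds.
    """
    return "".join(
        "%s.%s %s %d\n" % (prefix, name, value, int(timestamp))
        for name, value, timestamp in samples
    )


def send_to_carbon(payload, host, port=2003):
    """Send already formatted lines to a Carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(payload.encode("ascii"))


# The timestamp here is the snapshot_time from the stats file itself,
# not the time this script happened to run. Prefix is hypothetical.
payload = carbon_lines(
    "lustre.oss1",
    [("scratch-OST0001.read_bytes.sum", 14421705314304, 1409777887.590578)],
)
print(payload, end="")
```

Calling send_to_carbon(payload, "your-graphite-host") would then deliver the data; the host name is whatever your Carbon server is called.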

Collecting via both methods allows you to see this. This graph shows all the "export" stats summed for each method, with derivative applied to create a rate of change. "CMK" is the check_mk data and "timestamped" was from the perl script. Plotting the raw counter data of course shows very little, but with this derived data you can see the difference.

This data was sampled once per minute.

For our uses at SSEC, this is acceptable. Sampling much more frequently will of course make the error smaller.
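The effect of the lost timestamp on a derived rate can be reproduced with a little arithmetic. The Python sketch below computes a per-second rate from counter samples, first with the times the values were actually read and then with times rounded to the check schedule. All numbers are invented for illustration, and rates is a name I made up; Graphite does the equivalent work itself when you apply a derivative function.

```python
def rates(points):
    """Per-second rate of change between consecutive (timestamp, counter) points."""
    result = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or v1 < v0:
            continue  # skip duplicate timestamps and counter resets
        result.append((t1, (v1 - v0) / elapsed))
    return result


# A counter growing at a steady 100 per second, read at t = 0, 58 and 121
true_times = [(0, 0), (58, 5800), (121, 12100)]

# The same values, but stamped with the nominal once-per-minute schedule
check_times = [(0, 0), (60, 5800), (120, 12100)]

print(rates(true_times))   # steady rate
print(rates(check_times))  # rate appears to wobble around the true value
```

With the true timestamps the rate is flat; with the schedule-based timestamps the same counter values produce a rate that dips and then overshoots, which is the kind of difference described above.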

Software Used:

Graphite - http://graphite.readthedocs.org/en/latest/
Lustrestats.pm - perl module to parse different types of lustre stats, used by lustrestats scripts
OMD - check_mk, nagios, pnp4nagios
check_mk local scripts - these are called via check_mk, at whatever rate is desired.
graphios
Grafana - http://grafana.org
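For reference, a check_mk local script just prints one line per service in the form "status service_name perfdata detail text", where status is 0, 1, 2 or 3 (OK, WARN, CRIT, UNKNOWN) and multiple perfdata values are joined with "|". Here is a Python sketch that builds such a line; the function name, the service name and the metric names are invented for this example, and our actual local scripts are not shown here.

```python
def local_check_line(service, metrics, status=0, detail="OK"):
    """Build one check_mk local check output line.

    'service' must not contain spaces. 'metrics' maps perfdata names
    to values; an empty mapping is written as "-".
    """
    perfdata = "|".join("%s=%s" % (key, value) for key, value in metrics.items())
    return "%d %s %s %s" % (status, service, perfdata or "-", detail)


# Hypothetical service and metric names, values from the stats example
line = local_check_line(
    "Lustre_scratch-OST0001",
    {"read_samples": 27846475, "write_samples": 16230483},
    detail="stats collected",
)
print(line)
```

The perfdata part is what pnp4nagios graphs and what graphios forwards on to Graphite.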

Screenshots of a few panels -

File:Meta-overview.PNG

File:Fs-dahsboard.png

Logstash, python, and Graphite

Brock Palen discusses this method: http://www.failureasaservice.com/2014/10/lustre-stats-with-graphite-and-logstash.html

NCI project

Note: I don't know if this will have source available. http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf

References and Links

Daniel Kobras, "Lustre - Finding the Lustre Filesystem Bottleneck", LAD2012. http://www.eofs.eu/fileadmin/lad2012/06_Daniel_Kobras_S_C_Lustre_FS_Bottleneck.pdf
Florent Thery, "Centralized Lustre Monitoring on Bull Platforms", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/11_Florent_Thery_LAD2013-lustre-bull-monitoring.pdf
Daniel Rodwell and Patrick Fitzhenry, "Fine-Grained File System Monitoring with Lustre Jobstat", LUG2014. http://www.opensfs.org/wp-content/uploads/2014/04/D3_S31_FineGrainedFileSystemMonitoringwithLustreJobstat.pdf
Gabriele Paciucci and Andrew Uselton, "Monitoring the Lustre* file system to maintain optimal performance", LAD2013. http://www.eofs.eu/fileadmin/lad2013/slides/15_Gabriele_Paciucci_LAD13_Monitoring_05.pdf
Christopher Morrone, "LMT Lustre Monitoring Tools", LUG2011. http://cdn.opensfs.org/wp-content/uploads/2012/12/400-430_Chris_Morrone_LMT_v2.pdf
