Lustre Monitoring and Statistics Guide

From OpenSFS Wiki
Revision as of 15:28, 4 November 2014

DRAFT IN PROGRESS

Introduction

There are a variety of useful statistics and counters available on Lustre servers and clients. This guide is an attempt to document some of them.

The intended audience is system administrators who want to better understand and monitor their Lustre file systems.

Lustre Versions

This information is based mostly on working with Lustre 2.4 and 2.5.

Reading /proc vs lctl

There are two ways to read these statistics: 'cat /proc/fs/lustre...' or 'lctl get_param'. With newer Lustre versions, 'lctl get_param' is the standard and recommended way to get these stats, because it helps ensure portability. I will use this method in all examples; as a bonus, the syntax is often a little shorter.
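For monitoring scripts, the 'lctl get_param' call can be wrapped in a few lines of Python. This is only a minimal sketch: the helper names get_param_cmd and read_param are my own invention, and it assumes 'lctl' is on the PATH of the node where it runs. The '-n' option tells 'lctl get_param' to print only the values, without the parameter names.

```python
import subprocess


def get_param_cmd(pattern, names=True):
    """Build the argument list for 'lctl get_param' (hypothetical helper)."""
    cmd = ["lctl", "get_param"]
    if not names:
        # -n prints values only, without the leading "param.name=" part
        cmd.append("-n")
    cmd.append(pattern)
    return cmd


def read_param(pattern, names=True, run=subprocess.run):
    """Run 'lctl get_param' and return its output as text.

    'run' is injectable so the function can be exercised on a machine
    without Lustre installed.
    """
    result = run(get_param_cmd(pattern, names),
                 capture_output=True, text=True, check=True)
    return result.stdout
```

On an OSS, read_param("obdfilter.*.stats") would then return the stats of every OST on that server as one block of text.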

Data Formats

The format of the various statistics files varies (and I'm not sure if there's any reason for this). The format names used here are entirely *my invention*; they are not a Lustre standard or anything.

Stats

What I consider a "standard" stats file contains, for each OST or MDT, one multi-line record holding just the data.

Example:

obdfilter.scratch-OST0001.stats=
snapshot_time             1409777887.590578 secs.usecs
read_bytes                27846475 samples [bytes] 4096 1048576 14421705314304
write_bytes               16230483 samples [bytes] 1 1048576 14761109479164
get_info                  3735777 samples [reqs]

snapshot_time = when the stats were written

For read_bytes and write_bytes:
First number = number of times (samples) the OST has handled a read or write.
Second number = the minimum read/write size.
Third number = the maximum read/write size.
Fourth number = sum of all the read/write requests in bytes, i.e. the quantity of data read/written.
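The field layout described above can be turned into a small parser. This is a rough sketch rather than an official tool: parse_stats is a name I made up, and it assumes each counter sits on its own line, as 'lctl get_param' prints it. The sample text is the example from this page.

```python
def parse_stats(text):
    """Parse a 'stats' record into {counter: {samples, unit, min, max, sum}}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Counter lines look like: name N samples [unit] [min max sum]
        # The "name=" header and snapshot_time do not match and are skipped.
        if len(fields) < 4 or fields[2] != "samples":
            continue
        entry = {"samples": int(fields[1]), "unit": fields[3].strip("[]")}
        if len(fields) >= 7:
            entry["min"] = int(fields[4])
            entry["max"] = int(fields[5])
            entry["sum"] = int(fields[6])
        stats[fields[0]] = entry
    return stats


SAMPLE_STATS = """\
obdfilter.scratch-OST0001.stats=
snapshot_time             1409777887.590578 secs.usecs
read_bytes                27846475 samples [bytes] 4096 1048576 14421705314304
write_bytes               16230483 samples [bytes] 1 1048576 14761109479164
get_info                  3735777 samples [reqs]
"""

ost_stats = parse_stats(SAMPLE_STATS)
# Average read size in bytes = sum / samples
avg_read_size = ost_stats["read_bytes"]["sum"] / ost_stats["read_bytes"]["samples"]
```

Dividing the sum by the sample count gives the average request size, which is a quick way to see whether an OST is handling mostly large or mostly small I/O.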

Jobstats

Jobstats are slightly more complex multi-line records. Each OST or MDT has an entry for each job ID (or perhaps procname_uid, depending on how jobs are identified), followed by the data.

Example:

obdfilter.scratch-OST0000.job_stats=job_stats:
- job_id:          56744
  snapshot_time:   1409778251
  read:            { samples:       18722, unit: bytes, min:    4096, max: 1048576, sum:     17105657856 }
  write:           { samples:         478, unit: bytes, min:    1238, max: 1048576, sum:       412545938 }
  setattr:         { samples:           0, unit:  reqs }
  punch:           { samples:          95, unit:  reqs }
- job_id: . . . ETC

Notice this is very similar to 'stats' above. But there's a lot of extra: { bling: }! Why? Just because it got coded that way?
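Whatever the reason for the extra punctuation, it is regular enough to parse with a couple of regular expressions. The following is a minimal sketch using only the Python standard library: parse_job_stats is a hypothetical helper, not part of Lustre, and it assumes the layout shown in the example above (a '- job_id:' line starting each record, then 'name: { key: value, ... }' entries).

```python
import re

# Matches entries such as: read: { samples: 18722, unit: bytes, ... }
ENTRY = re.compile(r"(\w+):\s*\{([^}]*)\}")
JOB_ID = re.compile(r"\s*-\s*job_id:\s*(\S+)")
SNAPSHOT = re.compile(r"\s*snapshot_time:\s*(\d+)")


def parse_job_stats(text):
    """Parse a 'job_stats' record into {job_id: {operation: {key: value}}}."""
    jobs = {}
    current = None
    for line in text.splitlines():
        match = JOB_ID.match(line)
        if match:
            # A new "- job_id:" line starts a new record
            current = jobs.setdefault(match.group(1), {})
            continue
        if current is None:
            continue  # still in the "name=job_stats:" header
        match = SNAPSHOT.match(line)
        if match:
            current["snapshot_time"] = int(match.group(1))
            continue
        # A line may hold one or several "{ ... }" entries
        for name, body in ENTRY.findall(line):
            fields = {}
            for pair in body.split(","):
                key, _, value = pair.partition(":")
                value = value.strip()
                fields[key.strip()] = int(value) if value.isdigit() else value
            current[name] = fields
    return jobs


SAMPLE_JOB_STATS = """\
obdfilter.scratch-OST0000.job_stats=job_stats:
- job_id:          56744
  snapshot_time:   1409778251
  read:            { samples:       18722, unit: bytes, min:    4096, max: 1048576, sum:     17105657856 }
  write:           { samples:         478, unit: bytes, min:    1238, max: 1048576, sum:       412545938 }
  setattr:         { samples:           0, unit:  reqs }  punch:           { samples:          95, unit:  reqs }
"""

job_stats = parse_job_stats(SAMPLE_JOB_STATS)
```

The result is keyed by job ID, so job_stats["56744"]["read"]["sum"] gives the bytes read by that job on this OST.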

Single