Lustre Monitoring and Statistics Guide

== DRAFT IN PROGRESS ==
This content has moved to the [http://wiki.lustre.org/Lustre_Monitoring_and_Statistics_Guide lustre.org Wiki].
 
 
== Introduction ==
 
A variety of useful statistics and counters are available on Lustre servers and clients. This guide attempts to document some of them.
 
The intended audience is system administrators who want to better understand and monitor their Lustre file systems.
 
== Lustre Versions ==
 
This information is based mostly on experience with Lustre 2.4 and 2.5.
 
== Reading /proc vs lctl ==
 
'cat /proc/fs/lustre...' vs 'lctl get_param'
With newer Lustre versions, 'lctl get_param' is the standard and recommended way to read these stats, because it ensures portability. I use this method in all examples; as a bonus, the syntax is often a little shorter.
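
As a rough illustration of how the two forms relate, the following Python sketch converts a /proc/fs/lustre path into the dotted name that 'lctl get_param' accepts, and wraps the command itself. The function names are hypothetical, and the slash-to-dot mapping is an assumption based on the parameter names shown in this guide; check it against your Lustre version.

```python
import subprocess

PROC_ROOT = "/proc/fs/lustre/"


def param_from_proc_path(path):
    """Convert a /proc/fs/lustre path to a dotted lctl parameter name.

    Assumption: the parameter name is the path below PROC_ROOT with
    slashes replaced by dots.
    """
    if not path.startswith(PROC_ROOT):
        raise ValueError("not a /proc/fs/lustre path: " + path)
    return path[len(PROC_ROOT):].strip("/").replace("/", ".")


def get_param(pattern):
    """Run 'lctl get_param <pattern>' and return its output as text.

    Requires lctl on the PATH; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        ["lctl", "get_param", pattern],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, param_from_proc_path("/proc/fs/lustre/obdfilter/scratch-OST0001/stats") gives "obdfilter.scratch-OST0001.stats", which can then be passed to get_param on a Lustre server.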
 
== Data Formats ==
The format of the various statistics files varies (and I'm not sure whether there is a reason for this). The format names used here are entirely *my invention*; they are not a Lustre standard.
 
It is useful to know these formats so that you can parse the data and collect it for use in other tools.
 
=== Stats ===
 
What I consider a "standard" stats file contains a multi-line record for each target (for example, each OST or MDT): the parameter name, followed by the data.
 
Example:
<pre>
obdfilter.scratch-OST0001.stats=
snapshot_time            1409777887.590578 secs.usecs
read_bytes                27846475 samples [bytes] 4096 1048576 14421705314304
write_bytes              16230483 samples [bytes] 1 1048576 14761109479164
get_info                  3735777 samples [reqs]
</pre>
 
* snapshot_time = when the stats were written

For read_bytes and write_bytes:
* First number = number of times (samples) the OST has handled a read or write
* Second number = minimum read/write size
* Third number = maximum read/write size
* Fourth number = sum of all the read/write requests in bytes, i.e. the quantity of data read or written
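
The layout described above can be parsed with a few lines of Python. This is a minimal sketch, not a reference parser: the function name parse_stats is hypothetical, and it assumes each record starts with a line ending in "=" and that counter lines look exactly like the example.

```python
def parse_stats(text):
    """Parse 'stats' style output into {target: {counter: fields}}."""
    targets = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("="):
            # Start of a new record, e.g. "obdfilter.scratch-OST0001.stats="
            current = targets.setdefault(line[:-1], {})
            continue
        parts = line.split()
        if current is None or len(parts) < 2:
            continue
        name = parts[0]
        if name == "snapshot_time":
            current[name] = float(parts[1])
            continue
        if len(parts) < 4:
            continue
        # "<name> <count> samples [<unit>]", optionally "<min> <max> <sum>"
        entry = {"samples": int(parts[1]), "unit": parts[3].strip("[]")}
        if len(parts) >= 7:
            entry["min"] = int(parts[4])
            entry["max"] = int(parts[5])
            entry["sum"] = int(parts[6])
        current[name] = entry
    return targets
```

Counters such as get_info carry only a sample count and a unit, so the min, max, and sum keys are simply absent for them.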
 
=== Jobstats  ===
 
Jobstats are slightly more complex multi-line records. Each OST or MDT has an entry for each job ID (or perhaps procname_uid), followed by the data.
 
Example:
<pre>
obdfilter.scratch-OST0000.job_stats=job_stats:
- job_id:          56744
  snapshot_time:  1409778251
  read:            { samples:      18722, unit: bytes, min:    4096, max: 1048576, sum:    17105657856 }
  write:          { samples:        478, unit: bytes, min:    1238, max: 1048576, sum:      412545938 }
  setattr:        { samples:          0, unit:  reqs }
  punch:          { samples:         95, unit:  reqs }
- job_id: . . . ETC
</pre>
 
Notice that this is very similar to 'stats' above, but with a lot of extra { bling: }. Why? Perhaps just because it got coded that way.
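
The extra punctuation is easy enough to strip with a regular expression. The Python sketch below is one possible approach, with hypothetical names (parse_job_stats, FIELD_RE); it assumes one operation per line and keeps job IDs as strings, since they are not always numeric.

```python
import re

# Matches "name: value" pairs inside the braces of an operation line.
FIELD_RE = re.compile(r"(\w+):\s*([^,}\s]+)")


def parse_job_stats(text):
    """Parse job_stats output into {target: {job_id: {key: value}}}."""
    targets = {}
    target = None
    job = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "job_stats:":
            continue
        if "=" in line and "{" not in line:
            # Record header, e.g. "obdfilter.scratch-OST0000.job_stats=job_stats:"
            target = targets.setdefault(line.split("=", 1)[0], {})
            job = None
            continue
        if line.startswith("- job_id:"):
            if target is not None:
                job = target.setdefault(line.split(":", 1)[1].strip(), {})
            continue
        if job is None:
            continue
        key, _, rest = line.partition(":")
        rest = rest.strip()
        if rest.startswith("{"):
            fields = {}
            for name, value in FIELD_RE.findall(rest):
                fields[name] = int(value) if value.isdigit() else value
            job[key.strip()] = fields
        else:
            job[key.strip()] = int(rest) if rest.isdigit() else rest
    return targets
```

With the example above, the read sum for job 56744 would be found at result["obdfilter.scratch-OST0000.job_stats"]["56744"]["read"]["sum"].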
 
=== Single ===
 
These really boil down to a single number in a file. If you use "lctl get_param", though, the output is convenient to parse. For example:
<pre>
[COMMAND LINE]# lctl get_param osd-ldiskfs.*OST*.kbytesavail
</pre>
osd-ldiskfs.scratch-OST0000.kbytesavail=10563714384
osd-ldiskfs.scratch-OST0001.kbytesavail=10457322540
osd-ldiskfs.scratch-OST0002.kbytesavail=10585374532
</pre>
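
Because each line is a simple name=value pair, parsing takes very little code. A minimal Python sketch, assuming integer values as in the example (parse_single is a hypothetical name):

```python
def parse_single(text):
    """Parse 'name=value' lines into {name: int}, skipping other lines."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if not sep:
            continue  # no '=' on this line, e.g. a shell prompt or blank line
        try:
            values[name] = int(value)
        except ValueError:
            continue  # not a plain integer; ignored in this sketch
    return values
```

Summing the returned values for the example above gives the total available space across the three OSTs, in kilobytes.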
 
=== Histogram ===
 
Some stats are histograms; these types aren't covered here. Typically they're useful on their own without further parsing(?). Examples:
 
 
* brw_stats
* extent_stats
 
== Scripts to Parse Data Formats ==
 
Here are some example Perl modules to help parse the various data formats. Better, faster, stronger scripts and methods are welcome.
 
== Interesting Statistics Files  ==
 
This is a collection of various stats files that I have found useful. It is *not* complete or exhaustive. Additions or corrections are welcome.
 
Each entry lists the host type, target, format, and a discussion, where:
 
* Host Type = MDS, OSS, or client
* Target = the parameter name to pass to "lctl get_param"
* Format = one of the data formats discussed above
 
{| class="wikitable"
|-
!Host Type !! Target !! Format !! Discussion
|-
| MDS || mdt.*.job_stats || jobstats || Metadata jobstats. Note that with Lustre DNE you may have more than one MDT, so even if you currently have only one, it is wise to design tools with multiple MDTs in mind.
|-
| OSS || obdfilter.*.job_stats || jobstats || Per-OST jobstats.
|-
| MDS || mdt.*.md_stats || stats || Overall metadata stats per MDT
|-
| MDS || mdt.*MDT*.exports.*@*.stats || stats || Per-export metadata stats. Exports are clients, which also includes other Lustre servers. The exports are named by interface, which can be unwieldy. See "lltop" for an example of a script that uses this data well. The sum of all the export stats should match md_stats, but md_stats is still very convenient to have; "ltop" uses it, for example.
|-
| OSS || obdfilter.*.stats || stats || Operations per OST. Read and write data is particularly interesting.
|-
| OSS || obdfilter.*OST*.exports.*@*.stats || stats || Per-export OSS statistics
|-
| MDS || osd-*.*MDT*.filesfree or filestotal || single || Available or total inodes
|-
| MDS || osd-*.*MDT*.kbytesfree or kbytestotal || single || Available or total disk space
|-
| OSS || obdfilter.*OST*.kbytesfree or kbytestotal, filesfree, filestotal || single || Inodes and disk space, as in the MDS entries above
|}
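
When monitoring read and write data, the rate between two samples is usually more useful than the raw totals. The Python sketch below assumes the byte sums in the stats files are cumulative counters, so a rate can be derived from two samples of (snapshot_time, sum); the function name bytes_per_second is hypothetical.

```python
def bytes_per_second(earlier, later):
    """Return the average byte rate between two (snapshot_time, byte_sum) samples.

    Returns None if no time has elapsed or the counter went backwards
    (for example, after the stats were cleared or the target restarted).
    """
    t0, sum0 = earlier
    t1, sum1 = later
    elapsed = t1 - t0
    if elapsed <= 0 or sum1 < sum0:
        return None
    return (sum1 - sum0) / elapsed
```

For instance, bytes_per_second((100.0, 1000), (110.0, 3000)) is 200.0, i.e. 2000 bytes over 10 seconds.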

Latest revision as of 15:14, 3 June 2015
