LWG Minutes 2015-11-04

Attendance:

Cray: Justin Miller, Ben Evans, Patrick Farrell, Chris Horn, Cory Spitz

ORNL: Sarp Oral, James Simmons

Indiana: Steve Simms

Intel: Joseph Gmitter, Peter Jones, Paul Sathis

Fermilab: Alex Kulyavtsev

Seagate: Kalpak Shah

Anyone else?

Board is likely to confirm Peter and Sarp at their next meeting. We’re happy and grateful to turn the reins over to Peter & Sarp.

2.8 release status/update

Peter: Really stressing DNE under failover. Finding some problems, working through patches and doing more debugging. Livermore has reported some oddities, like LU-6994, which is a blocker. Trying to identify what is left to fix, then code freeze will commence. Long pole is DNE.

Kalpak: Seagate has been doing some performance testing. Seeing metadata perf slowdown (especially opens) compared to Seagate’s 2.5.1 w/fixes backported. Will get an LU ticket opened soon. Probably not a blocker. Not testing DNE.

James: reports on several efforts
 * large scale tests with 2.8 clients (had problem with routers using mlx5 & OFED 3.12) LU-7351
 * mlx5 & OFED 3.12 (as above)
 * profiling on DNE2. Perf drops down for larger stripes
 * Power8 nodes: small I/O packets cause continuous reconnects and evictions
 * Still carrying DNE patches, LU-7039 and others, but not much DNE testing right now
 * debugging LU-7173: MDS oops w/2.8 client [2.5 server]

Next large-scale testing is a week out; depends on LNET router and SC15

Status/update of In-tree (Linux) client development

James: Flood of patches in-progress. libcfs looking like mostly cleanup. libcfs & lnet could still be ready by next LUG. By the time RHEL 8 comes out, we could have no need to install on routers.

James is going to ask Greg K-H to release libcfs & LNET first. James can help coordinate some effort there. Would be good to have a wiki page on how to contribute. We don’t have a page for the upstream effort right now. James will do a quick write-up on wiki.lustre.org for how to submit upstream.

Looking ahead to 2.9

First 5 things on the current projects list look like possible candidates. Anything else? James pointed out several smaller issues [that I wasn’t able to capture]: LU-5030 (sysfs), LU-5541, LNET lctl (avoid lctl on router nodes), LU-7140 large RPCs. mlx5 needs support.

Peter: we should have been scoping 2.9 already. Peter took an AI to come back to this group with a straw man proposal for what’s in the next release to be discussed on the next call.

Lustre release cadence

We haven’t been good about hitting our 6 month schedules. Cory proposed a 9 month cadence just to recognize reality. Certainly pros/cons to any scheme. Should be up for discussion. How/when to decide?

Now is the time to evaluate, with 2.8 wrapping up and 2.9 planning under way.

Peter: Worst slippage was on 2.4 that had a huge amount of change. 2.2 came in right on time. Seems to depend upon how much change we decide to insert.

Paul S. chimed in from a business/sales perspective. Need to put in writing what we want/expect and then stick to it. Not good if we say 9 months and it slips to 12.

Still relying on volunteers. Tree contract currently doesn’t cover all the work needed.

Feature branches are intertwined in all of this. We need a test plan executed to prove that features are ready to be integrated into master. Cory still plans to start a wiki page stub on feature branching.

Are there test plans for 2.9 candidates? There should be. We need a matrix of who is doing what. Peter will work stabilization into the 2.9 straw man and we’ll discuss on the next call.

Any other business?

Anyone at SC? Just Sarp & Peter from this call.

We decided to cancel the next scheduled call on 11/18 due to SC.

Next call on 12/2.