LWG Minutes 2014-08-27

From OpenSFS

Agenda

  • Lustre 2.7.0 status update
  • Contract development status
  • Discuss Lustre Quality Assurance improvement

Attendance

  • Chris Morrone (LLNL)
  • Sebastien Buisson (Bull)
  • Paul Sathis (Intel)
  • Cory Spitz (Cray)
  • Peter Jones (Intel)
  • Justin Miller (IU)
  • James Simmons (ORNL)
  • Steve Young (Seagate)
  • Andreas Dilger (Intel)

Minutes

Lustre 2.7.0

Not much news. Development underway, landings continue.

Contract Development Status

  • CLIO Implementation contract is well under way. SOW more or less complete pending input from lawyers.
  • Lustre Protocol Documentation contract starting up.

Discuss Lustre Quality Assurance Improvement

It was suggested that the Lustre Protocol Documentation project and code simplification and refactoring projects would be the place to start.

We can try to move towards stricter peer reviews.

We could have presentations and discussions about peer review expectations at Lustre developer meetings. This segued into a discussion about having a new Lustre developers meeting.

People expressed concern that travel budgets might not allow yet another face-to-face meeting in addition to LUG, LAD, and SC. It was suggested that we could add a day (or more) onto the end of LUG just for those involved in Lustre development.

General agreement seemed to be that we could also try to hold a new, separate developer meeting; January was suggested. The rationale is that January is somewhat after the holiday season, but also well distanced in time from LUG, LAD, and SC.

We could attempt to get people to meet face-to-face, and have video conferencing capabilities for those who cannot meet.

We would find out whether that new meeting is really useful by comparing its attendance with that of the proposed developer meeting adjacent to LUG.

It was asked whether Cray would open up their testing tool, as was suggested at the working group meetings at LUG. Cory explained that it would not be the testing tool/suite that was opened up, but perhaps some of the individual filesystem test cases could be opened, and/or they could post explanations of what some of the test cases do. It is taking longer than Cory had hoped.

Cory asked whether we currently know code coverage stats for existing tests.

Cory asked how we formalize testing improvement. It did sound like much was proposed in response.

Cray is looking into utilizing NRS to create delays and investigate new failure modes.

Andreas stated that everything that people do to improve quality is good. He suggested that there isn't going to be one thing that makes it all better.

Having quantitative measurements of quality is a good idea, but it can take a great deal of time to do. Metrics are nice in theory, but difficult to get right. The number of timeouts and evictions during testing might be a useful one. Bugs per line of code is really hard to quantify, and not of much value.

Questions relayed by Chris for conversation

  • Should OpenSFS have a contract to fund two full-time testers and a part-time project manager?

Cory - a little uncomfortable with that idea. Current vendors already have those resources. How would that work? Would they report to Board or what?

Andreas - Couldn't be someone random. Would have to be someone from an organization already involved in Lustre. Hard to show progress.

Andreas - what do people think of code documentation project?

Folks agreed it was good. General agreement seemed to be that code documentation should continue.

  • Should OpenSFS have a contract to fund an open source testing framework?

Cray - probably a good idea. Nervous that we would not do it the right way.

Andreas - ask if one of the existing frameworks could be open sourced without starting a new one. Not sure if Intel would be willing to do so or not. Not under Andreas' power to make that call.

Andreas also suggested that we had been down that path years ago, and the scope quickly ballooned out of reasonable proportions.

Too many people and organizations with different ideas of where they want to test.

Chris - It could easily be a very big and expensive project

Andreas - Also, there could potentially be a reduction of heterogeneity in our collective testing if we all adopt the same tests. If everyone is running the same thing, will we find enough bugs?

Different testing systems have different goals: the Cray system is about launching jobs and monitoring results; the Intel system is about booting nodes and testing patches.

End of the hour reached. Participants encouraged to continue conversation on the mailing list.