LWG Minutes 2015-04-15
- Monitoring effort
- Are we moving the development platform to lustre.org?
- Testing coordination, getting more communication
- Documentation effort, especially on the wiki at lustre.org
- Release cadence
We need more people to take projects on and lead them, become a champion
Maintain webpage on wiki, what work is going on and what is left to do
Regularly report back to the LWG
A lot of people are doing a lot on their own; it would be great to get together and not repeat work
LLNL has LMT - it is not dead, but we don't have DNE support and the Java client isn't developed
Robert: It collects the data for DNE but it doesn't display the data
Andreas: If the Java client is dead, it should be removed from the project and clearly marked as deprecated. People still use it because they don't know the status
Chris: We're the main guys driving development, not a lot of development from the outside
We'd be happy to collaborate on any other tools
Who will be the project lead on this?
Simms: Folks from Wisconsin are willing to offer up their work
SDSC: Any products they put out we'll deploy, but not volunteering to be the lead
James: What is everyone using now to monitor their sites?
Kluge: Logstash, Kibana
SDSC:
CEA: OpenTSDB, Kibana
Jason Hill: Homegrown scripts tied into Nagios. LMT, homegrown scripts, lltop for job monitoring. Looking to switch to Grafana
Chris: At least consider who at your sites could be a champion for this project
This ties into documentation. We need to understand what is in the proc files.
James: We need to know what to measure.
Chris: Friends at Wisconsin have created a wiki page documenting much of this
Andreas: The manual has a lot of information. A single page giving a how-to on monitoring would help.
Cory: I think what you're asking for, the BWG has made an effort on this.
Simms: There may have been an effort but nothing has been settled upon. So much effort is going into this, OpenSFS needs to get behind one or two tools to help new Lustre users
Fermi: We should focus efforts to standardize interfaces so people don't need to reproduce collection efforts
James: Ultimately we'll be moving to sysfs which will replace procfs
Chris: We'll also need to use debugfs because sysfs is one line per file
James: debugfs is sequence file based and could help
Fermi: We could take the approach with event model like what was done with changelogs, publish data with thresholds
James: I agree with you, and sysfs can push uevents for event handling.
SDSC: There are a lot of feature releases and changes; it is hard for new users to find out about and use these features. The wiki is a place to publish standard feature uses
Chris: Landing collateral needs to include this type of information
Andreas: The manual gets updated but there may be a need for a 5-min demo
Chris: Go back and think about what we can collaborate on
We should start working on a large list of landing collateral for new features, including appropriate documentation
Oleg: People hate new restrictions
Chris: Sure, but that is how we make it happen
Cory: We could set it up so a +1 review can come from a publication bot
SDSC: I just need the steps to get it running, it doesn't have to be exhaustive
Andreas: There is stuff in the manual, people just don't look at it
Chris: The manual is almost a lost cause. Mishmash, in DocBook. There are real barriers
Andreas: Updating the manual isn't what people are looking for, is it a README or a wiki page?
Vitaly: Is this about the manual or about design documents? New features should include design docs
Andreas: Design docs are on wiki, but not all PAC leaders publish
Chris: The design docs are pretty poor anyway, and too high level for administrators
Andreas: man pages do describe how to do a lot of this. If they don't look in the manual, or man page, where are they going to look?
Chris: The wiki should be the place to go. Ceph as an example. We should have the same for Lustre. We need editors to create high level starting pages for people who don't know exactly what they're looking for
Alex: Developer documentation is lacking
Chris: True, there is a lot of lacking documentation. We need to start somewhere. There is an OpenSFS contract for the protocol documentation.
Alex: These get outdated quickly.
Chris: All new features that change the protocol must update the protocol documentation with suggested changes. Landing criteria. New landing rules
Vitaly: It would be helpful to organize all the existing design documents into a central location.
Chris: The wiki is a great place for that.
Oleg: They should also be in the tree so they match the version you've checked out. Then when patches come in, we'll know they've changed some design and what needs to be updated in the tree.
Peter Jones: Once we have the protocol document, I like the idea of requiring updates. How do you catch every change though?
Chris: Peer review process
Peter Jones: I was thinking of the gatekeeper's role. Compared to the overall volume of changes this will be a small percentage
Cory: It should be on the inspectors and reviewers
Oleg: There are certain warning flags that would show need for protocol documentation
Chris: Protocol needs to not change with every change. We also need to document the on-disk protocol. If we keep the protocol documentation separate, it will help maintain compatibility
Andreas: There clearly have been ongoing efforts to maintain protocol compatibility. Some changes update the protocol but don't fundamentally change it. Need to be clear about when a feature was introduced.
Chris: Protocol documentation would make it clear if this feature flag is set, you must support this.
Simms: We're all over the place. There is a range of documentation here, high to low. IU has an editor, and when you're doing something technical, you just push it to them and they work it into a document or update an existing one.
Fermi: We need to lower the entry barrier, and allow developers to publish short notes. A tech editor can then work them into a more formal document.
Simms: It would help to have somebody not in this room who is an editor to create a consistent style and enforce it. Better documentation would increase adoption of Lustre.
Chris: Take a step back. We need to focus on how we're going to make progress. People have pet interests, start using the wiki, publish. Who are the main markets for documentation? Starting points for developers, new users, admins. We need a champion who can rally people to create this starting point.
SDSC: We should compare against GPFS and do better than that.
Chris: Google presence is improving, and good information will bubble to the top. We also need to focus on the packaging and installation in comparison to GPFS
Simms: Suggest brainstorming from people in the room
Fermilab: We need to publish these suggestions to the community so they can contribute too
Simms: I'm prepared to ask IU to use a slice of time from our editors to help with the initiative. We need ideas on what specifically we can help with.
Andreas: OpenSFS could be funding a technical editor to help with this consistent style issue.
Rich: Features are awesome, bug fixes are necessary, but this will have the biggest impact on adoption.
Cory: We brought down the promoter fee, but they're still committed to providing resources. We should ask the board to poll all members to donate a 1/4 FTE; it might be easier to get something done.
Brainstorming exercise about what needs to be done
Chris: We don't have any money, so we'll have to be creative to get something done.
Chris, CEA, Marc are willing to spend some of their time
Andreas: Developers are not the people to be updating this. Every university has a "how to use Lustre effectively". It's the consumers that have the most to contribute. At CFS there was a policy of not replying directly to inquiries. Update the FAQ and point at that.
James: We get the Mellanox question about once a month
Simms: I've proposed somebody follows lustre-discuss and captures knowledge and adds it to an appropriate venue
Andreas: Are there documentation engines we can leverage here?
Marc: Stack Exchange is openly available
Chris: This fractures the documentation. Do we want this and the wiki?
Oleg: It would help for specific questions
John Hammond: We could use the editor for the wiki, and the community would use Stack Exchange
Chris: We need to cut the Operations manual in half: 1/2 for operations and the other 1/2 for developers. LNET should be promoted to the top.
John Hammond: The manual looks like HTML 1.1 and modernization would help attract people to Lustre. Also reformat from one big page.
Andreas: It needs to be Google searchable as well.
Chris: We also need to have version releases of manual
Andreas: But this requires backporting changes
Oleg: At the start of some manuals they have a changelog that specifies at what version things changed. Changelog section seems to be very important.
Andreas: There are facilities for this in the manual already.
Chris: Not exactly what people want, they want it up front. Needs to be mainly flat text, but at the beginning explain what the differences are.
Fermilab: There are issues with formatting. Command line examples go over the edge.
Chris: We need responsible people to report bugs and an editor that goes in and makes the changes.
Jason Hill: Sent a note to their technical editors to sit down with a Lustre admin to begin fleshing this out.
Marc: I'm happy to review changes for technical accuracy.
Andreas: The information people want is there, it needs to be other people that are doing some of these things. This is the wrong audience for this discussion.
Peter Jones: Isn't Richard Friedman a technical writer?
Terri: He's paid by the hour and his allotment is already used.
Cory: Can Richard work on some of the ideas we've listed? Getting the manual indexable
Nirmal: The log messages are a whole other challenge. The knowledgeable should help with this.
Chris: This would be great for Stack Exchange
Jason Hill: This also relates to tutorial style content that Chris and Jason have discussed. We've talked about a tutorial day at LUG or the Lustre Ecosystem workshop
Simms: Terri had a good idea before. What if we were to have developers talk about subsystems? These are the pieces parts, videotape it and put it up on a site. Helps new developers. Like a podcast or a "Lustre hour"
SDSC: Large conferences would be a great place to capture content like this
Fermilab: Everything should at least be in a central location.
Andreas: Do we need to get rid of the source control and review process?
Alexander: It does make it much easier for people to contribute.
SDSC: I think you need both, wiki and manual are separate documents for different purposes.
Chris: We've been slow at updating the manual, but we can make the most progress with the wiki. Work on that first and circle back to the manual.
Simms: We need to identify what documentation is needed
- Log decoder
- Parameters and configuration
- Where to go when something breaks
- Tutorials
* Make it pretty
- FAQ
* LNET
* Disks
- Clients
* Servers
- Best practices
- PDFs in a container that gives details for each subsystem
- Design docs
- Protocol documentation
- Documentation inside the code
- man pages
- Style rubric and coding guidelines
- Configuration and Tuning
Robert Read: readthedocs.org
John Hammond: The code also needs to be cleaner and have more clarity, e.g. better variable names
Andreas: Instead of external design documents, code needs to be documented. Header and function blocks that clearly articulate what the code does.
What are we doing with developer resources?
Chris: People seem to think it is inevitable it will move to lustre.org but I don't necessarily agree. We're in too much of a limbo right now. Do people think we have to move them away?
James: Who is going to run this?
Simms: Need to prioritize what we're going to do, and identify how we're going to do it
SDSC: People that object to doing it under Intel need to pay for doing this
Andreas: The Intel system is open, and free for the community to use. This would be a duplication of effort. There is no clear benefit to wholesale changing this.
Chris: There is a vocal element that says the community needs to host this and get away from Intel. Personally I feel this is a duplication of effort and lots of other things to do.
Peter: Why make a bunch of complexity for ourselves? We could easily move the manual and issue tracker to OpenSFS and solve most of the complaints.
Andreas: A git.lustre.org URL redirection is a quick fix to this. downloads.lustre.org
Robert: I suggest moving the manual to GitHub as a first step towards this.
Jason: If there is no pressure from Intel, there is no pressing need to change this. Much larger issues to tackle.
CEA: Gerrit and git would be easiest to move.
Jason Hill: Big hill to get people to change their development work flow.
Peter Jones: Sounds like not this year.
Robert Read: Mirror the git tree on GitHub.
Andreas: I'll donate the lustre GitHub account
Kyr: Static analysis
Cory: This relates to testing efforts
Chris postponed topic until next LWG meeting
- Add Cppcheck automated testing to top of the agenda for next LWG meeting
- Richard - indexable manual, implement some of the documentation ideas, URL redirection