Contract SFS-DEV-004
Latest revision as of 14:43, 19 November 2015
Overview
The Lustre client implementation for the IO path (called CLIO) is responsible for issuing RPC commands for reading and writing data to the OSTs. CLIO was reconstructed in Lustre 2.0 for cross-platform portability. The CLIO implementation is more complex than current usage requires, which makes the code hard to understand and maintain. This project implements the tasks described during work for OpenSFS contract 003 (see below). This work includes:
cl_lock re-factoring
cl_lock is highly complex and difficult to maintain. As a result, enhancements to the client code are time consuming and a significant number of bugs have been traced to cl_lock portions of the code.
ioctl calls implementation
ioctl calls are inconsistently implemented in CLIO. By re-organizing these calls, the removal of the old OBD API becomes possible.
removal of obsolete OBD API call-backs
Remove unused code that misleads and confuses developers who are unfamiliar with Lustre client code.
removal of non-linux interfaces
Remove unused code that misleads and confuses developers who are unfamiliar with Lustre client code.
removal of stripe md access beyond LOV layer
Remove code that does not observe public interfaces as it misleads and confuses developers who are unfamiliar with Lustre client code.
For the contract statement of work, see SFS-DEV-004_SOW.pdf. The goal of the CLIO Simplification Implementation contract is the implementation in the Lustre source code of the CLIO Simplification Design that resulted from Project 2 of Contract SFS-DEV-003.
Contract Completion
The contract reached completion in July of 2015.
Key People
OpenSFS
- Sarp Oral - OpenSFS Contract Administrator
- Christopher Morrone - OpenSFS Technical Representative
Project Approval Committee (PAC)
- Christopher Morrone - PAC Chair
- Colin Faber
- Patrick Farrell
- Jason Hill
- James Simmons
- Cory Spitz
Intel
- Richard Henwood - Project Manager
- Andreas Dilger - Consulting Architect
- Jinshan Xiong - Lead Engineer
Important Dates
The official start date of work is agreed to be October 13, 2014.
The contract lists milestone target dates in weeks relative to the start date. With the start date agreed, the table below lists calendar dates instead, to keep things easy to understand.
Milestone task | Target Completion | Actual Completion |
---|---|---|
Implementation | Jan 26th 2015 | Mar 13th 2015 |
Test and fix | Apr 6th 2015 | Jun 16th 2015 |
Demonstration | May 4th 2015 | Omitted (see change request 004-002) |
Landing | Jun 1st 2015 | Jul 7th 2015 |
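Because the contract expresses targets in weeks from the start date, the week offsets and the slip between target and actual dates can be recomputed from the table above. The following is a minimal Python sketch, not part of the contract: the dictionary layout and the helper name `weeks_from_start` are illustrative assumptions, and the dates are copied from the table.

```python
from datetime import date

START = date(2014, 10, 13)  # agreed official start date of work

# (target, actual) dates copied from the milestone table.
# Demonstration was omitted (change request 004-002), so it has no actual date.
MILESTONES = {
    "Implementation": (date(2015, 1, 26), date(2015, 3, 13)),
    "Test and fix": (date(2015, 4, 6), date(2015, 6, 16)),
    "Demonstration": (date(2015, 5, 4), None),
    "Landing": (date(2015, 6, 1), date(2015, 7, 7)),
}


def weeks_from_start(day):
    """Whole weeks between the start date and `day` (hypothetical helper)."""
    return (day - START).days // 7


def slip_days(target, actual):
    """Days between target and actual completion, or None if omitted."""
    return None if actual is None else (actual - target).days


for name, (target, actual) in MILESTONES.items():
    slip = slip_days(target, actual)
    slip_text = "n/a" if slip is None else f"{slip} days"
    print(f"{name}: target week {weeks_from_start(target)}, slip {slip_text}")
```

Each target date falls on a whole number of weeks from the start date, which is consistent with the contract stating targets as week offsets.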
Meeting Minutes
- SFS-DEV-004 Minutes 2014-11-06
- SFS-DEV-004 Minutes 2014-12-12
- SFS-DEV-004 Minutes 2014-01-28
- SFS-DEV-004 Minutes 2015-02-26
- SFS-DEV-004 Minutes 2015-03-26
- SFS-DEV-004 Minutes 2015-05-28
- SFS-DEV-004 Minutes 2015-06-25
Work Completed
cl_lock re-factoring (simplified and cache-less) DONE
LU-3259 cl_lock re-factoring The cl_lock is necessary because it communicates the DLM lock for a specific IO. The current implementation is highly complex. This work will write a simplified cl_lock. The new lock will be cache-less and replace the current implementation.
Planning | Review | Landed |
---|---|---|
10858 LU-3259 clio: cl_lock simplification | ||
Removal of liblustre DONE
Planning | Review | Landed |
---|---|---|
10657 LU-2675 build: remove liblustre and libsysio | ||
11772 LU-2675 mgc: remove libmgc.c | ||
Completed as part of Removal of Dead Code project 10172 LU-2675 llite: remove liblustre includes | ||
Completed as part of Removal of Dead Code project 10195 LU-2675 lmv: remove liblustre includes | ||
Completed as part of Removal of Dead Code project 10196 LU-2675 lov: remove liblustre includes | ||
11423 LU-2675 build: remove Darwin "support" | ||
11385 LU-2675 build: remove WinNT "support" |
function calls implementation and cleanup obsolete OBD methods DONE
LU-5823 Replace some obsolete obd operations with CLIO ioctl interface OBD API operations for read, write, setattr, getattr, etc. became obsolete after the MDT, OFD and client restructuring was completed. This work removes these redundant operations: redundant code that remains in CLIO, and interfaces that are not referenced by any module, are targeted for removal.
Planning | Review | Landed |
---|---|---|
12452 LU-5823 clio: add coo_getstripe interface | ||
12494 LU-5823 clio: add cl_object_find_cbdata() | ||
13422 LU-5823 clio: use CIT_SETATTR for FSFILT_IOC_SETFLAGS | ||
12535 LU-5823 clio: add cl_object_fiemap() | ||
12638 LU-5823 clio: add coo_obd_info_get and coo_data_version | ||
12748 LU-5823 clio: remove IOC_LOV_GETINFO | ||
12639 LU-5823 clio: get rid of lov_stripe_md reference | ||
13426 LU-5814 obd: remove unused LSM parameters | ||
13514 LU-5823 llite: Remove access of stripe in ll_setattr_raw |
Function | Status |
---|---|
o_precreate | GONE |
o_create | Used by Echoclient |
o_create_async | GONE |
o_destroy | Used by Echoclient |
o_setattr | Used by Echoclient |
o_setattr_async | GONE |
o_getattr | Used by Echoclient |
o_getattr_async | GONE |
o_brw | GONE |
o_merge_lvb | GONE |
o_adjust_kms | GONE |
o_punch | GONE |
o_sync | GONE |
o_migrate | GONE |
o_copy | GONE |
o_preprw | Used by Echoclient and OST server-side code |
o_commitrw | Used by Echoclient and OST server-side code |
o_enqueue | GONE |
o_cancel | GONE |
o_change_cbdata | GONE |
o_find_cbdata | GONE |
o_change_cbdata | GONE |
o_extent_calc | GONE |
Remove lov_stripe_md (LSM) direct access beyond LOV layer DONE
LU-5814 encapsulate lov_stripe_md (LSM) to LOV layer The current CLIO implementation has a good interface to file layout operations. Legacy code still exists that does not use this interface. The code that does not use the file layout interface will be reviewed and targeted for removal or re-design to use the file layout interface.
Planning | Review | Landed |
---|---|---|
12442 LU-5814 lov: remove LL_IOC_RECREATE_{FID,OBJ} | ||
12446 LU-5814 echo: remove userspace LSM handling | ||
12445 LU-5814 lov: remove unused {get,set}_info handlers | ||
12447 LU-5418 echo: replace lov_stripe_md with lov_oinfo | ||
12618 LU-5814 llite: remove ll_objects_destroy() | ||
12581 LU-5814 lov: flatten struct lov_stripe_md | ||
13426 LU-5814 obd: remove unused LSM parameters | ||
13680 LU-5814 lov: add cl_object_layout_get() | ||
13690 LU-5814 llite: replace lli_has_smd with lli_layout_type | ||
13694 LU-5814 llite: add cl_object_maxbytes() | ||
13695 LU-5814 lov: use obd_get_info() to get def/max LOV EA sizes | ||
13722 LU-5814 lov: remove LSM from struct lustre_md | ||
13696 LU-5814 lov: move LSM to LOV layer | ||
13737 LU-5814 obd: rename obd_unpackmd() to md_unpackmd() |
NOTE: struct obd_info:oi_md cannot be removed now because of inter-dependencies between clean-up patches.
Remove non-linux interfaces DONE
Two parts
Remove some cfs_ prefixed functions. DONE
Planning | Review | Landed |
---|---|---|
NON-INTEL: 6956 LU-1346 libcfs: cleanup libcfs primitive (linux-prim.h) | ||
NON-INTEL: 11797 LU-3963 libcfs: remove last of cfs list wrappers | ||
NON-INTEL: 13070 LU-3963 libcfs: Use kernel's strncasecmp and remove cfs_get_blocked_sigs |
NOTE:
- cfs_snprintf() is still in use and is not equivalent to snprintf(), so it will remain.
- cfs_hlist* are needed for Linux kernel compatibility and will remain.
Remove ccc_ layer DONE
LU-5971 removal of ccc_ layer With the removal of liblustre, the ccc_ layer is redundant and complex. The remaining useful functions will be merged into vfs vm posix layer and the ccc_ layer will be removed.
Planning | Review | Landed |
---|---|---|
12592 LU-5971 llite: merge lclient.h into llite/vvp_internal.h | ||
13075 LU-5971 llite: rename ccc_device to vvp_device | ||
13077 LU-5971 llite: rename ccc_object to vvp_object | ||
13086 LU-5971 llite: rename ccc_page to vvp_page | ||
13088 LU-5971 llite: rename ccc_lock to vvp_lock | ||
13351 LU-5971 llite: merge ccc_io and vvp_io | ||
13347 LU-5971 llite: remove struct ll_ra_read | ||
13363 LU-5971 llite: use vui prefix for struct vvp_io members | ||
13376 LU-5971 llite: move vvp_io functions to vvp_io.c | ||
13377 LU-5971 llite: rename ccc_req to vvp_req | ||
13714 LU-5971 llite: rename struct ccc_grouplock to ll_grouplock |
Regressions
Planning | Review | Landed |
---|---|---|
Move definition of LDLM_GID_ANY to lustre_dlm.h | ||
LU-6046 audit comments in cl_object.h |
Test and Fix Phase
For this phase, we will complete the following:
- Contractor demonstrates the code passing the complement of tests in Contractor's Autotest environment with the code applied to the Lustre Master tree.
- Contractor demonstrates the code runs successfully at scale (typically completing a 48 hour SWL run on the Hyperion platform at Lawrence Livermore National Laboratory).
- Contractor executes performance regression testing identifying and addressing performance regressions related to the development of the revised code. This performance testing will be run on a system with at least 100 clients and will compare results of IOR, mdtest on builds before and after the implementation of the CLIO Simplification HLD. Degradation of more than 5% will be taken as a failure, but small drops will be accepted as within normal variation.
Introduction
The following milestone completion document applies to CLIO Simplification Project recorded in the OpenSFS Lustre Development contract SFS-DEV-004 agreed September 25, 2014. The CLIO Simplification code is functionally complete and recorded in the Implementation Milestone. Completion of this milestone requires the following tasks to be executed:
- Contractor demonstrates the code passing the complement of tests in Contractor's Autotest environment with the code applied to the Lustre Master tree.
- Contractor demonstrates the code runs successfully at scale (typically completing a 48 hour SWL run on the Hyperion platform at the Lawrence Livermore National Laboratory).
- Contractor executes performance regression testing, identifying and addressing performance regressions related to the development of the revised code. This performance testing will be run on a system with at least 100 clients and will compare results of IOR, mdtest on builds before and after the implementation of the CLIO Simplification High Level Design. Degradation of more than 5% will be taken as a failure, but small drops will be accepted as within normal variation.
NOTE: This task list includes agreed enhancements in item 3. They are: lnet-selftest has been omitted as redundant, and mdtest has been selected as a better alternative to mdsrate.
Overview of CLIO Simplification
CLIO Simplification work was completed with six high-level tasks. These are:
- cl_lock re-factoring (simplified and cache-less).
- Liblustre removal.
- Implement function calls and cleanup obsolete OBD methods.
- Remove lov_stripe_md (LSM) direct access beyond LOV layer.
- Remove cfs_ prefixed functions, where appropriate.
- Remove ccc_ layer.
This work has been completed in the following patches:
Autotest results
The complete series of patches is recorded at http://review.whamcloud.com/13737/ and below. This series (patch set 3) successfully passed Autotest on March 27th. This result is recorded here:
NOTE: since completing these tests, many unrelated patches have landed on master, which has required a re-base of this patch series.
48hr SWL run on Hyperion
SWL completed a 48 hour run on Hyperion on March 12 with no observed issues. A summary of the test is below:
Summary
=======
Start Time: Thu Mar 12 05:59:15 PDT 2015
Job Totals
  Passed: 14346
  Failed: 0
  Terminated: 64
  Unknown: 0
  Total: 14410
  Failure Rate: 0.00%
Run Times
  Wall Clock Run Time: 2253.22 hrs.
  Node Run Time: 14018.81 node-hrs.
  SWL Node Utilization: 145.18%
  SLURM Node Utilization: n/a
Excessive Run-time Variation Job Count: 0
Overall Job Coverage: 21.7% (138/636)
Passed Job Coverage: 21.5% (137/636)
End Time: Sat Mar 14 06:16:02 PDT 2015
SWL Run Time: 173807 sec. (48.28) hrs.)
Failure Mode Summary
  Mode | Count | Description
  129 | 3431 | TBD
Failure Mode Breakdown
  Test | Mode | Count | Description
  IO | 129 | 3431 | TBD
Report generated on Sat Mar 14 06:50:43 PDT 2015
NOTE: SWL runs continuously. This test run was ended after 48 hours. Jobs that were running when the test run was completed are recorded as terminated in this summary.
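The headline figures in the SWL summary can be cross-checked with a few lines of arithmetic. This is a minimal Python sketch using only numbers copied from the summary; the variable names are illustrative and are not part of SWL.

```python
# Figures copied from the SWL summary above.
passed, failed, terminated, unknown = 14346, 0, 64, 0
total_reported = 14410
run_time_sec = 173807

# The job categories should add up to the reported total.
total = passed + failed + terminated + unknown

# Failure rate as a percentage of all jobs.
failure_rate = 100.0 * failed / total

# Convert the SWL run time from seconds to hours.
run_time_hours = run_time_sec / 3600

print(f"total jobs: {total} (reported {total_reported})")
print(f"failure rate: {failure_rate:.2f}%")
print(f"run time: {run_time_hours:.2f} hrs")
```

The job categories sum to the reported total, and the run time in seconds corresponds to just over the 48 hours required by the milestone.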
Performance tests
This series of tests is designed to verify that the CLIO Simplification project has not negatively affected the performance of the Lustre filesystem. The tests were executed on Hyperion, which runs 16 threads for single-client tests and 1600 IOR threads for 100-client tests. The baseline for performance was selected as Lustre 2.6.0. The build with CLIO patches applied was created from http://review.hpdd.intel.com/13318/ (since merged into http://review.hpdd.intel.com/13714 for landing).
Five consecutive tests were run for each metric and the mean of the five runs was computed. This mean is used to calculate the percentage difference against the baseline. The complete result set is recorded in Appendix A. Observed CLIO Simplification performance that is more than 5% slower than the baseline is presented in red.
Guidelines for the reader:
- CLIO Simplification patches have been landed into Master over the last 9 months. During this time, 988 patches have landed which may or may not be responsible for changes in performance.
- Variability in performance computing is commonly observed during tests on Hyperion. It is not unusual to see a 10% variation between consecutive runs of the same code. The figure below illustrates variability in performance over 15 recent consecutive tags. NOTE: two runs of 2.6.90 on different dates (2.6.90.1 and 2.6.90.2 in the figure below) show significant differences in read performance for otherwise identical Lustre versions.
Performance of 15 consecutive tags including 2.6.0 and 2.7.0 releases as well as more recent master tags. Read and write bandwidth of a single shared file with 100 clients is recorded. Significant variability can be observed between any two consecutive tags.
 | 2.6.0 vs CLIO | 2.6.53.1 vs CLIO |
---|---|---|
Read performance difference | 101% | 97% |
Write performance difference | 90% | 100% |
OBSERVATIONS: Figure 1 shows that the 2.6.0 baseline (far left) has write performance above the typical performance for the 15 recent tags. Because this baseline value is unusually high, more typical observations appear slow by comparison. Choosing a more common value for 100-client SSF read and write (i.e. tag 2.6.53.1) provides a better baseline value and shows performance within tolerance.
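The comparison method used throughout this section (mean of five runs, expressed as a percentage of the baseline, with a 5% tolerance) can be sketched in Python as follows. This is an illustration only: it assumes the reported percentages are the ratio of the CLIO mean to the baseline mean, the function names are hypothetical, and the bandwidth samples are invented for the example rather than measured data.

```python
from statistics import mean

TOLERANCE_PERCENT = 5.0  # degradation of more than 5% counts as a failure


def percent_of_baseline(baseline_runs, candidate_runs):
    """Mean of the candidate runs as a percentage of the baseline mean."""
    return 100.0 * mean(candidate_runs) / mean(baseline_runs)


def within_tolerance(percent):
    """True unless performance dropped by more than the tolerance."""
    return percent >= 100.0 - TOLERANCE_PERCENT


# Hypothetical bandwidth samples in MB/s, five runs each (not measured data).
baseline = [1000.0, 1020.0, 980.0, 1010.0, 990.0]
candidate = [900.0, 910.0, 890.0, 905.0, 895.0]

pct = percent_of_baseline(baseline, candidate)
print(f"{pct:.0f}% of baseline, within tolerance: {within_tolerance(pct)}")
```

With a check of this form, the 90% write result against the 2.6.0 baseline falls outside tolerance, while the 100% result against tag 2.6.53.1 falls inside it, which is why the choice of baseline matters.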
IOR, 100 Clients, File Per Process (FPP)
Read performance: 96%
Write performance: 102%
OBSERVATION: Both observations show CLIO Simplification performance within tolerance.
IOR, Single Client, File Per Process (FPP)
 | Hyperion | OpenSFS test 1 | OpenSFS test 2 |
---|---|---|---|
Read performance difference | 104% | 104% | |
Write performance difference | 92% | 150% | 97% |
OBSERVATIONS: Write performance was below tolerance during our run on Hyperion on this test. This apparent slow-down was not reproducible over three re-runs on the OpenSFS cluster where between 97% and 150% write performance was observed.
Read performance: 101%
Write performance: 388%
OBSERVATIONS: The 2.6.0 baseline showed write performance on Hyperion at 340MB/s. The CLIO Simplification client performed at 1400MB/s. Task LU-1669 (out of scope for this contract) is thought to be primarily responsible for the large improvement in write performance between 2.6.0 and the accumulated unrelated CLIO Simplification patches.
mdtest, Single Client
Best FPP tree filestat: 115%
Best SSF tree dircreate: 103%
Worst FPP tree treecreate: 97%
Worst SSF tree rm: 97%
OBSERVATIONS: All 32 observations show CLIO Simplification performance within tolerance. Only the best and the worst are included here.
mdtest, 100 Clients
 | Hyperion run 1, file per process | Hyperion run 1, single shared file | Hyperion run 2, file per process | Hyperion run 2, single shared file |
---|---|---|---|---|
Dir create | 108% | 104% | 88% | 98% |
Dir stat | 96% | 100% | 100% | 100% |
Dir rm | 110% | 98% | 95% | 96% |
File create | 181% | 235% | 201% | 168% |
File stat | 96% | 99% | 98% | 100% |
File rm | 108% | 104% | 104% | 102% |
Tree create | 100% | 86% | 100% | 100% |
Tree rm | 114% | 104% | 106% | 97% |
OBSERVATIONS: Out of 32 metrics, only two are observed out of tolerance for mdtest at 100-client scale. The out-of-tolerance observations were not repeated between consecutive runs, and they are judged to be due to variability occurring at 100-client scale. A significant performance increase is observed on metrics including ‘file create’ and ‘file rm’.
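The count of out-of-tolerance metrics can be reproduced directly from the table. The sketch below is a minimal Python check using the percentages copied from the mdtest 100-client table; the data structure and names are illustrative assumptions.

```python
# Percentages copied from the mdtest 100-client table above, in column order:
# run 1 FPP, run 1 SSF, run 2 FPP, run 2 SSF.
RESULTS = {
    "Dir create": [108, 104, 88, 98],
    "Dir stat": [96, 100, 100, 100],
    "Dir rm": [110, 98, 95, 96],
    "File create": [181, 235, 201, 168],
    "File stat": [96, 99, 98, 100],
    "File rm": [108, 104, 104, 102],
    "Tree create": [100, 86, 100, 100],
    "Tree rm": [114, 104, 106, 97],
}

THRESHOLD = 95  # a value below this is a degradation of more than 5%

# Collect every (metric, value) pair that falls below the threshold.
out_of_tolerance = [
    (metric, value)
    for metric, values in RESULTS.items()
    for value in values
    if value < THRESHOLD
]
total = sum(len(values) for values in RESULTS.values())

print(f"{len(out_of_tolerance)} of {total} metrics out of tolerance:")
for metric, value in out_of_tolerance:
    print(f"  {metric}: {value}%")
```

The two values flagged are in different runs and different columns, which matches the observation that neither was repeated between consecutive runs.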
Conclusion
Functional testing and SWL testing of the CLIO Simplification stack were completed successfully. Performance testing is also complete. Of 49 values presented within the performance testing, four were observed more than 5% below the 2.6.0 baseline value. On review, these four observations can be attributed to the challenges of running repeatable tests with low variability at scale. Further tests were run to check whether CLIO Simplification patches introduced these observed regressions, but no evidence was found to support this being the case. We are confident that the CLIO Simplification patches do not introduce regressions and this phase of the project is complete.
Appendix
A PDF of the report, including detailed results, is included in this document: File:CLIO test and fix complete report.pdf
Landing Phase
Patches for a simplified CLIO stack were landed over a period of approximately six months with the final outstanding CLIO Simplification patch landed on Mon, 29 Jun 2015. The complete list of patches created by Intel and their commitment into the Lustre Master is recorded below.
During pursuit of the project goals, more than 89KLOC have been removed. This figure represents over 10% of the Lustre code base and the successful execution of this project delivers a simplified code base for future enhancement. With successful completion of this project, this milestone was agreed on 2015-07-07.
Change requests
004-001 CLIO ioctl's should be functions
CHANGE REQUEST: 004-001 CLIO ioctl's should be functions.
BACKGROUND: Ioctl calls were included in the CLIO Simplification design to replace some obsolete OBD operations. Alternatively, individual functions can replace the OBD operations instead.
CHANGE: Do not implement ioctl calls. Implement functions.
ACTIONS REQUIRED:
- Ensure none of the current patches are landed.
- Update the design document with the new design.
- Update the ticket LU-5823 with new activity.
- Execute work to complete LU-5823.
STATUS: Approved on 4th Dec 2014
004-002 Omit CLIO Demonstration milestone and associated milestone payment
CHANGE REQUEST: 004-002 Omit CLIO Demonstration milestone and associated milestone payment
BACKGROUND: The Demonstration milestone has been rendered redundant by a precisely specified 'Test and Fix' milestone that will execute before Demonstration. In the current plan, 'Test and Fix' (10 weeks) includes specific tests to run (including 48hr SWL, and performance characterization). There was agreement that a useful Demonstration Milestone is exactly defined by 'Test and Fix' Milestone. The plan requires specification of Demonstration during the Implementation phase and no additional work beyond 'Test and Fix' has been identified.
CHANGE: Omit Demonstration Milestone from the plan of record.
ACTIONS REQUIRED:
- Move directly from 'Test and Fix' milestone to 'Landing' milestone.
- Communicate new plan to stakeholders.
STATUS: Submitted for review 2nd March 2015
Lessons Learned
Attendees: James, Jinshan, Doug, Richard, John.
"Variability should be communicated from the raw results through to the validation metrics."
"Schedule wasn't too bad: six weeks behind on demo, four weeks behind on Landing."
"Good headlines for the project: 89KLOC removed."
"Clean-up project was not overwhelming. One-or two regressions. Exposed problems in the test suite. Really need to figure out why they only become visible after time. go back and analyze why this issues aren't immediately."
"Thought it went pretty well."
"Spent time re-basing and avoiding collisions. Topic branches would be useful for something like this in the future."
"CLIO work was separated well from the DNE2 and LFSCK3/4 work areas. This avoided costly collisions during landing"
"Test matrix needed to be made available to assist communication, record configuration, and communicate data."
"Using the OpenSFS wiki as the canonical project tool worked well. It allowed our work to be visible and avoided copying work between wikis."