Contract SFS-DEV-004: Difference between revisions

From OpenSFS Wiki
(17 intermediate revisions by 2 users not shown)
Line 3: Line 3:
== Overview ==
== Overview ==


The goal of the CLIO Simplification Implementation contract is the implementation in the Lustre source code of the CLIO Simplification Design that resulted from [[Media:CLIOSimplificationDesign_HighLevelDesign.pdf|Project 2]] of [[Contract SFS-DEV-003]].
The Lustre client implementation for the IO path (called CLIO) is responsible for issuing RPC commands for reading and writing data to the OSTs. CLIO was reconstructed in Lustre 2.0 for cross-platform portability. The CLIO implementation is too complex for the current usage, thus making the code hard to understand and maintain. This project implements the tasks described during work for OpenSFS contract 003 (see below). This work includes:


For the contract statement of work, see [[Media:SFS-DEV-004_SOW.pdf|SFS-DEV-004_SOW.pdf]]
=== cl_lock re-factoring ===
 
cl_lock is highly complex and difficult to maintain. As a result, enhancements to the client code are time consuming and a significant number of bugs have been traced to cl_lock portions of the code.
 
=== ioctl calls implementation ===
 
ioctl calls are inconsistently implemented in CLIO. By re-organizing these calls, the removal of the old OBD API becomes possible.
 
=== removal of obsolete OBD API call-backs ===
 
Remove unused code that misleads and confuses developers who are unfamiliar with Lustre client code.
 
=== removal of non-Linux interfaces ===
 
Remove unused code that misleads and confuses developers who are unfamiliar with Lustre client code.
 
=== removal of stripe md access beyond LOV layer ===
 
Remove code that does not observe public interfaces as it misleads and confuses developers who are unfamiliar with Lustre client code.
 
For the contract statement of work, see [[Media:SFS-DEV-004_SOW.pdf|SFS-DEV-004_SOW.pdf]]. The goal of the CLIO Simplification Implementation contract is the implementation in the Lustre source code of the CLIO Simplification Design that resulted from [[Media:CLIOSimplificationDesign_HighLevelDesign.pdf|Project 2]] of [[Contract SFS-DEV-003]].
 
=== Contract Completion ===
 
The contract reached completion in July of 2015.


== Key People ==
== Key People ==
Line 40: Line 64:
! Milestone task !! Target Completion !! Actual Completion
! Milestone task !! Target Completion !! Actual Completion
|-
|-
| Implementation || Jan 26th 2015 ||
| Implementation || Jan 26th 2015 || Mar 13th 2015
|-
|-
| Test and fix || Apr 6th 2015 ||
| Test and fix || Apr 6th 2015 || Jun 16th 2015
|-
|-
| Demonstration || May 4th 2015 ||
| Demonstration || May 4th 2015 || Omitted (see change request 004-002)
|-
|-
| Landing || Jun 1st 2015 ||
| Landing || Jun 1st 2015 || Jul 7th 2015
|}
|}


Line 57: Line 81:
* [[SFS-DEV-004 Minutes 2015-03-26]]
* [[SFS-DEV-004 Minutes 2015-03-26]]
* [[SFS-DEV-004 Minutes 2015-05-28]]
* [[SFS-DEV-004 Minutes 2015-05-28]]
* [[SFS-DEV-004 Minutes 2015-06-25]]


== Working in Progress or Completed ==
== Work Completed ==


=== cl_lock re-factoring (simplified and cache-less) DONE ===
=== cl_lock re-factoring (simplified and cache-less) DONE ===
Line 69: Line 94:
! Planning !! Review !! Landed
! Planning !! Review !! Landed
|-
|-
|  || [http://review.whamcloud.com/11013 11013 LU-3259 clio: get rid of cl_req] || [http://review.whamcloud.com/10858 10858] LU-3259 clio: cl_lock simplification
|  || || [http://review.whamcloud.com/10858 10858] LU-3259 clio: cl_lock simplification
|-
|-
|  || ||  
|  || ||  
Line 97: Line 122:
|}
|}


=== function calls implementation and cleanup obsolete OBD methods STARTED ===
=== function calls implementation and cleanup obsolete OBD methods DONE ===


[https://jira.hpdd.intel.com/browse/LU-5823 LU-5823 Replace some obsolete obd operations with CLIO ioctl interface] OBD API operations for read, write, setattr, getattr, etc. became obsolete after the MDT, OFD and client restructuring was completed. This work removes these redundant operations. Redundant code remains in CLIO, and interfaces that are not referenced by any module will be targeted for removal.
[https://jira.hpdd.intel.com/browse/LU-5823 LU-5823 Replace some obsolete obd operations with CLIO ioctl interface] OBD API operations for read, write, setattr, getattr, etc. became obsolete after the MDT, OFD and client restructuring was completed. This work removes these redundant operations. Redundant code remains in CLIO, and interfaces that are not referenced by any module will be targeted for removal.
Line 105: Line 130:
! Planning !! Review !! Landed
! Planning !! Review !! Landed
|-
|-
|  || [http://review.whamcloud.com/#/c/13514/ 13514 LU-5823 llite: Remove access of stripe in ll_setattr_raw] || [http://review.whamcloud.com/12452 12452] LU-5823 clio: add coo_getstripe interface
|  || || [http://review.whamcloud.com/12452 12452] LU-5823 clio: add coo_getstripe interface
|-
|-
|  || || [http://review.whamcloud.com/12494 12494] LU-5823 clio: add cl_object_find_cbdata()
|  || || [http://review.whamcloud.com/12494 12494] LU-5823 clio: add cl_object_find_cbdata()
Line 120: Line 145:
|-
|-
|  || || [http://review.whamcloud.com/13426 13426] LU-5814 obd: remove unused LSM parameters
|  || || [http://review.whamcloud.com/13426 13426] LU-5814 obd: remove unused LSM parameters
|-
|  || || [http://review.whamcloud.com/#/c/13514/ 13514 LU-5823 llite: Remove access of stripe in ll_setattr_raw]
|}
|}
   
   
Line 174: Line 201:
|}
|}


=== Remove lov_stripe_md (LSM) direct access beyond LOV layer STARTED ===
=== Remove lov_stripe_md (LSM) direct access beyond LOV layer DONE ===


[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 encapsulate lov_stripe_md (LSM) to LOV layer]
[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 encapsulate lov_stripe_md (LSM) to LOV layer]
Line 183: Line 210:
! Planning !! Review !! Landed
! Planning !! Review !! Landed
|-
|-
|  || [http://review.whamcloud.com/13722 13722 LU-5814 lov: remove LSM from struct lustre_md] || [http://review.whamcloud.com/12442 12442] LU-5814 lov: remove LL_IOC_RECREATE_{FID,OBJ}
|  || || [http://review.whamcloud.com/12442 12442] LU-5814 lov: remove LL_IOC_RECREATE_{FID,OBJ}
|-
|-
|  || [http://review.whamcloud.com/13696 13696 LU-5814 lov: move LSM to LOV layer] || [http://review.whamcloud.com/12446 12446] LU-5814 echo: remove userspace LSM handling
|  || || [http://review.whamcloud.com/12446 12446] LU-5814 echo: remove userspace LSM handling
|-
|-
|  || [http://review.whamcloud.com/#/c/13737/ 13737 LU-5814 obd: rename obd_unpackmd() to md_unpackmd()] || [http://review.whamcloud.com/12445 12445] LU-5814 lov: remove unused {get,set}_info handlers
|  || || [http://review.whamcloud.com/12445 12445] LU-5814 lov: remove unused {get,set}_info handlers
|-
|-
|  || || [http://review.whamcloud.com/12447 12447] LU-5418 echo: replace lov_stripe_md with lov_oinfo
|  || || [http://review.whamcloud.com/12447 12447] LU-5418 echo: replace lov_stripe_md with lov_oinfo
|-
|-
|  || || [http://review.whamcloud.com/12618 12618] LU-5814 llite: remove ll_objects_destroy()  
|  || || [http://review.whamcloud.com/12618 12618] LU-5814 llite: remove ll_objects_destroy()  
Line 204: Line 231:
|-
|-
|  || || [http://review.whamcloud.com/13695 13695 LU-5814 lov: use obd_get_info() to get def/max LOV EA sizes]
|  || || [http://review.whamcloud.com/13695 13695 LU-5814 lov: use obd_get_info() to get def/max LOV EA sizes]
|-
|  || || [http://review.whamcloud.com/13722 13722 LU-5814 lov: remove LSM from struct lustre_md]
|-
|  || || [http://review.whamcloud.com/13696 13696 LU-5814 lov: move LSM to LOV layer]
|-
|  || || [http://review.whamcloud.com/#/c/13737/ 13737 LU-5814 obd: rename obd_unpackmd() to md_unpackmd()]
|}
|}


NOTE: struct obd_info:oi_md cannot be removed now because of inter-dependencies between clean-up patches.
NOTE: struct obd_info:oi_md cannot be removed now because of inter-dependencies between clean-up patches.


=== Remove non-linux interfaces STARTED ===
=== Remove non-linux interfaces DONE ===


Two parts
Two parts
Line 229: Line 262:
* cfs_hlist* are needed for Linux kernel compatibility and will remain.
* cfs_hlist* are needed for Linux kernel compatibility and will remain.


==== Remove ccc_ layer STARTED ====
==== Remove ccc_ layer DONE ====


[https://jira.hpdd.intel.com/browse/LU-5971 LU-5971 removal of ccc_ layer] With the removal of liblustre, the ccc_ layer is redundant and complex. The remaining useful functions will be merged into the vfs vm posix layer and the ccc_ layer will be removed.
[https://jira.hpdd.intel.com/browse/LU-5971 LU-5971 removal of ccc_ layer] With the removal of liblustre, the ccc_ layer is redundant and complex. The remaining useful functions will be merged into the vfs vm posix layer and the ccc_ layer will be removed.
Line 247: Line 280:
| || || [http://review.whamcloud.com/#/c/13088 13088] LU-5971 llite: rename ccc_lock to vvp_lock
| || || [http://review.whamcloud.com/#/c/13088 13088] LU-5971 llite: rename ccc_lock to vvp_lock
|-
|-
| || || [http://review.whamcloud.com/#/c/13351/ 13351] LU-5971 llite: merge ccc_io and vvp_io
| || || [http://review.whamcloud.com/#/c/13351/ 13351] LU-5971 llite: merge ccc_io and vvp_io
|-
|-
| || || [http://review.whamcloud.com/#/c/13347/ 13347] LU-5971 llite: remove struct ll_ra_read
| || || [http://review.whamcloud.com/#/c/13347/ 13347] LU-5971 llite: remove struct ll_ra_read
|-
|-
| || || [http://review.whamcloud.com/#/c/13363/ 13363] LU-5971 llite: use vui prefix for struct vvp_io members
| || || [http://review.whamcloud.com/#/c/13363/ 13363] LU-5971 llite: use vui prefix for struct vvp_io members
Line 280: Line 313:


# Contractor executes performance regression testing identifying and addressing performance regressions related to the development of the revised code. This performance testing will be run on a system with at least 100 clients and will compare results of IOR, mdtest on builds before and after the implementation of the CLIO Simplification HLD. Degradation of more than 5% will be taken as a failure, but small drops will be accepted as within normal variation.
# Contractor executes performance regression testing identifying and addressing performance regressions related to the development of the revised code. This performance testing will be run on a system with at least 100 clients and will compare results of IOR, mdtest on builds before and after the implementation of the CLIO Simplification HLD. Degradation of more than 5% will be taken as a failure, but small drops will be accepted as within normal variation.
=== Introduction ===
The following milestone completion document applies to CLIO Simplification Project recorded in the OpenSFS Lustre Development contract SFS-DEV-004 agreed September 25, 2014.
The CLIO Simplification code is functionally complete and recorded in the Implementation Milestone. Completion of this milestone requires the following tasks to be executed:
# Contractor demonstrates the code passing the complement of tests in Contractor's Autotest environment with the code applied to the Lustre Master tree.
# Contractor demonstrates the code runs successfully at scale (typically completing a 48 hour SWL run on the Hyperion platform at the Lawrence Livermore National Laboratory).
# Contractor executes performance regression testing, identifying and addressing performance regressions related to the development of the revised code. This performance testing will be run on a system with at least 100 clients and will compare results of IOR, mdtest on builds before and after the implementation of the CLIO Simplification High Level Design. Degradation of more than 5% will be taken as a failure, but small drops will be accepted as within normal variation.
NOTE: This task list includes two agreed enhancements in item 3: lnet-selftest has been omitted as redundant, and mdtest has been selected as a better alternative to mdsrate.
Overview of CLIO Simplification.
CLIO Simplification work was completed with six high-level tasks. These are:
* cl_lock re-factoring (simplified and cache-less).
* Liblustre removal.
* Implement function calls and cleanup obsolete OBD methods.
* Remove lov_stripe_md (LSM) direct access beyond LOV layer.
* Remove cfs_ prefixed functions, where appropriate.
* Remove ccc_ layer.
This work has been completed in the following patches:
{|
|'''Change #'''
|'''Work ticket'''
|-
|[http://review.hpdd.intel.com/11013 11013]
|[http://jira.hpdd.intel.com/browse/LU-3259 LU-3259 clio: get rid of cl_req]
|-
|[http://review.hpdd.intel.com/10858 10858]
|[http://jira.hpdd.intel.com/browse/LU-3259 LU-3259 clio: cl_lock simplification]
|-
|[http://review.hpdd.intel.com/10657 10657]
|[http://jira.hpdd.intel.com/browse/LU-2675 LU-2675 build: remove liblustre and libsysio]
|-
|[http://review.hpdd.intel.com/11772 11772]
|[http://jira.hpdd.intel.com/browse/LU-2675 LU-2675 mgc: remove libmgc.c]
|-
|[http://review.hpdd.intel.com/11423 11423]
|[http://jira.hpdd.intel.com/browse/LU-2675 LU-2675 build: remove Darwin “support”]
|-
|[http://review.hpdd.intel.com/11385 11385]
|[http://jira.hpdd.intel.com/browse/LU-2675 LU-2675 build: remove WinNT “support”]
|-
|[http://review.hpdd.intel.com/13514 13514]
|[http://jira.hpdd.intel.com/browse/LU-5823 LU-5823 llite: Remove access of stripe in ll_setattr_raw]
|-
|[http://review.hpdd.intel.com/12452 12452]
|[http://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: add coo_getstripe interface]
|-
|[http://review.hpdd.intel.com/12494 12494]
|[http://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: add cl_object_find_cbdata()]
|-
|[http://review.hpdd.intel.com/13422 13422]
|[http://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: use CIT_SETATTR for FSFILT_IOC_SETFLAGS]
|-
|[http://review.hpdd.intel.com/12535 12535]
|[http://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: add cl_object_fiemap()]
|-
|[http://review.hpdd.intel.com/12638 12638]
|[http://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: add coo_obd_info_get and coo_data_version]
|-
|[http://review.hpdd.intel.com/12748 12748]
|[http://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: remove IOC_LOV_GETINFO]
|-
|[http://review.hpdd.intel.com/12639 12639]
|[http://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: get rid of lov_stripe_md reference]
|-
|[http://review.hpdd.intel.com/13426 13426]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 obd: remove unused LSM parameters]
|-
|[http://review.hpdd.intel.com/13722 13722]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: remove LSM from struct lustre_md]
|-
|[http://review.hpdd.intel.com/12442 12442]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: remove LL_IOC_RECREATE_{FID,OBJ}]
|-
|[http://review.hpdd.intel.com/13696 13696]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: move LSM to LOV layer]
|-
|[http://review.hpdd.intel.com/12446 12446]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 echo: remove userspace LSM handling]
|-
|[http://review.hpdd.intel.com/13737 13737]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 obd: rename obd_unpackmd() to md_unpackmd()]
|-
|[http://review.hpdd.intel.com/12445 12445]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: remove unused {get,set}_info handlers]
|-
|[http://review.hpdd.intel.com/12447 12447]
|[http://jira.hpdd.intel.com/browse/LU-5418 LU-5418 echo: replace lov_stripe_md with lov_oinfo]
|-
|[http://review.hpdd.intel.com/12618 12618]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 llite: remove ll_objects_destroy()]
|-
|[http://review.hpdd.intel.com/12581 12581]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: flatten struct lov_stripe_md]
|-
|[http://review.hpdd.intel.com/13426 13426]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 obd: remove unused LSM parameters]
|-
|[http://review.hpdd.intel.com/13680 13680]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: add cl_object_layout_get()]
|-
|[http://review.hpdd.intel.com/13690 13690]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 llite: replace lli_has_smd with lli_layout_type]
|-
|[http://review.hpdd.intel.com/13694 13694]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 llite: add cl_object_maxbytes()]
|-
|[http://review.hpdd.intel.com/13695 13695]
|[http://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: use obd_get_info() to get def/max LOV EA sizes]
|-
|[http://review.hpdd.intel.com/12592 12592]
|[http://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: merge lclient.h into llite/vvp_internal.h]
|-
|[http://review.hpdd.intel.com/13075 13075]
|[http://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: rename ccc_device to vvp_device]
|-
|[http://review.hpdd.intel.com/13077 13077]
|[http://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: rename ccc_object to vvp_object]
|-
|[http://review.hpdd.intel.com/13086 13086]
|[http://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: rename ccc_page to vvp_page]
|-
|[http://review.hpdd.intel.com/13088 13088]
|[http://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: rename ccc_lock to vvp_lock]
|-
|[http://review.hpdd.intel.com/13351 13351]
|[http://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: merge ccc_io and vvp_io]
|-
|[http://review.hpdd.intel.com/13347 13347]
|[http://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: remove struct ll_ra_read]
|-
|[http://review.hpdd.intel.com/13363 13363]
|[http://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: use vui prefix for struct vvp_io members]
|-
|[http://review.hpdd.intel.com/13376 13376]
|[http://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: move vvp_io functions to vvp_io.c]
|-
|[http://review.hpdd.intel.com/13377 13377]
|[http://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: rename ccc_req to vvp_req]
|-
|[http://review.hpdd.intel.com/13714 13714]
|[http://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: rename struct ccc_grouplock to ll_grouplock]
|-
|[http://review.hpdd.intel.com/13074 13074]
|[http://jira.hpdd.intel.com/browse/LU-6028 LU-6028 Move definition of LDLM_GID_ANY to lustre_dlm.h]
|-
|[http://review.hpdd.intel.com/13137 13137]
|[http://jira.hpdd.intel.com/browse/LU-6046 LU-6046 audit comments in cl_object.h]
|}
=== Autotest results ===
The complete series of patches is recorded at http://review.whamcloud.com/13737/ and below. This series (patch set 3) successfully passed Autotest on March 27th. The results are recorded here:
* [https://testing.hpdd.intel.com/test_sessions/2182427a-d444-11e4-b009-5254006e85c2 review-zfs]
* [https://testing.hpdd.intel.com/test_sessions/46100420-d420-11e4-8aa9-5254006e85c2 review-dne-part-1]
* [https://testing.hpdd.intel.com/test_sessions/04fc4a46-d454-11e4-8d0e-5254006e85c2 review-dne-part-2]
* [https://testing.hpdd.intel.com/test_sessions/d7c2db02-d3e1-11e4-bce9-5254006e85c2 review-ldiskfs]
NOTE: since these tests were completed, many unrelated patches have landed on master, which has required a rebase of this patch series.
=== 48hr SWL run on Hyperion ===
SWL completed a 48 hour run on Hyperion on March 12 with no observed issues. A summary of the test is below:
<pre>
Summary
=======
Start Time: Thu Mar 12 05:59:15 PDT 2015
Job Totals
  Passed:      14346
  Failed:      0
  Terminated:  64
  Unknown:      0
  Total:        14410
  Failure Rate: 0.00%
Run Times
  Wall Clock Run Time:      2253.22 hrs.
  Node Run Time:            14018.81 node-hrs.
  SWL Node Utilization:      145.18%
  SLURM Node Utilization:    n/a
Excessive Run-time Variation Job Count: 0
Overall Job Coverage: 21.7% (138/636)
Passed Job Coverage: 21.5% (137/636)
End Time: Sat Mar 14 06:16:02 PDT 2015
SWL Run Time: 173807 sec. (48.28) hrs.)
                              Failure Mode Summary
Mode  Count                      Description
====== ======= ==========================================================
129  3431    TBD
                              Failure Mode Breakdown
  Test    Mode  Count                      Description
======== ====== ======= ==========================================================
IO        129  3431    TBD
Report generated on Sat Mar 14 06:50:43 PDT 2015
</pre>
NOTE: SWL runs continuously. This test run was ended after 48 hours. Jobs that were running when the test run was ended are recorded as terminated in this summary.
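The derived figures in the summary can be cross-checked from its raw counts. The following Python sketch is only an illustration: the variable names are invented here, and the assumption that the failure rate is failed jobs divided by total jobs is ours, not something stated by the SWL tool.

```python
# Figures copied from the SWL summary above; variable names are hypothetical.
passed, failed, terminated, unknown = 14346, 0, 64, 0
run_seconds = 173807
jobs_covered, jobs_passed_covered, jobs_available = 138, 137, 636

# Total jobs is the sum of all outcome buckets.
total = passed + failed + terminated + unknown

# Assumption: failure rate = failed / total, expressed as a percentage.
failure_rate = 100.0 * failed / total

# Convert the run time to hours and the coverage counts to percentages.
run_hours = run_seconds / 3600.0
overall_coverage = 100.0 * jobs_covered / jobs_available
passed_coverage = 100.0 * jobs_passed_covered / jobs_available

print(f"total jobs:       {total}")
print(f"failure rate:     {failure_rate:.2f}%")
print(f"run time:         {run_hours:.2f} hrs")
print(f"overall coverage: {overall_coverage:.1f}%")
print(f"passed coverage:  {passed_coverage:.1f}%")
```

Under these assumptions the computed values agree with the totals, run time and coverage percentages reported in the summary.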
=== Performance tests ===
This series of tests is designed to verify that the CLIO Simplification project has not negatively affected the performance of the Lustre filesystem. The tests were executed on Hyperion, using 16 threads for single-client tests and 1600 IOR threads for 100-client tests. Lustre 2.6.0 was selected as the performance baseline. The build with the CLIO patches applied was created from http://review.hpdd.intel.com/13318/ (since merged into http://review.hpdd.intel.com/13714 for landing).
Five consecutive tests were run for each metric, and the mean of the five runs was used to calculate the percentage difference against the baseline. The complete result set is recorded in Appendix A. Observed CLIO Simplification performance that is more than 5% slower than the baseline is presented in red.
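The comparison method can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the report: the function names, the encoding of the 5% tolerance as a 95% threshold, and the sample bandwidth figures are all assumptions made for the example.

```python
from statistics import mean

# Assumed encoding of the contract rule: a result more than 5% below
# the baseline (i.e. under 95% of it) counts as a failure.
TOLERANCE_PCT = 95.0


def percent_of_baseline(baseline_runs, candidate_runs):
    """Return the candidate mean as a percentage of the baseline mean."""
    return 100.0 * mean(candidate_runs) / mean(baseline_runs)


def within_tolerance(pct):
    """True when the result is no more than 5% below the baseline."""
    return pct >= TOLERANCE_PCT


# Hypothetical bandwidth samples in MB/s, five runs each; not measured data.
baseline = [1000.0, 1020.0, 980.0, 1010.0, 990.0]
candidate = [960.0, 970.0, 950.0, 980.0, 940.0]

pct = percent_of_baseline(baseline, candidate)
print(f"{pct:.0f}% of baseline, within tolerance: {within_tolerance(pct)}")
```

Expressing each result as a percentage of the baseline mean is what allows a single 5% rule to be applied uniformly to bandwidth figures from IOR and operation rates from mdtest.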
Guidelines for the reader:
* CLIO Simplification patches have been landed into Master over the last 9 months. During this time, 988 patches have landed which may or may not be responsible for changes in performance.
* Variability in performance is commonly observed during tests on Hyperion. It is not unusual to see a 10% variation between consecutive runs of the same code. The figure below illustrates variability in performance over 15 recent consecutive tags. NOTE: two runs of 2.6.90 on different dates (2.6.90.1 and 2.6.90.2 in the figure below) show significant differences in read performance for otherwise identical Lustre versions.
[[File:cliofig1.png]]
Performance of 15 consecutive tags including 2.6.0 and 2.7.0 releases as well as more recent master tags. Read and write bandwidth of a single shared file with 100 clients is recorded. Significant variability can be observed between any two consecutive tags.
=== IOR, 100 Clients, Single Shared File (SSF) ===
{|
|<br />
|2.6.0 vs CLIO
|2.6.53.1 vs CLIO
|-
|Read performance difference
|'''101%'''
|'''97%'''
|-
|Write performance difference
|'''90%'''
|'''100%'''
|}
OBSERVATIONS: Figure 1 shows that the 2.6.0 baseline (far left) has write performance above the typical performance for the 15 recent tags. Because this baseline value is unusually high, more typical observations appear slow by comparison. Choosing a more typical value for 100-client SSF read and write (i.e. tag 2.6.53.1) provides a better baseline and shows performance within tolerance.
=== IOR, 100 Clients, File Per Process (FPP) ===
Read performance: 96%
Write performance: 102%
OBSERVATION: Both observations show CLIO Simplification performance within tolerance.
=== IOR, Single Client, File Per Process (FPP) ===
{|
|<br />
|Hyperion
|OpenSFS test 1
|OpenSFS test 2
|-
|Read performance difference
|'''104%'''
|'''104%'''
|<br />
|-
|Write performance difference
|'''92%'''
|'''150%'''
|'''97%'''
|}
OBSERVATIONS: Write performance was below tolerance during our run on Hyperion on this test. This apparent slow-down was not reproducible over three re-runs on the OpenSFS cluster where between '''97%''' and '''150%''' write performance was observed.
=== IOR, Single Client, Single Shared File (SSF) ===
Read performance: 101%
Write performance: 388%
OBSERVATIONS: Observation of the 2.6.0 baseline showed write performance on Hyperion at 340MB/s. The CLIO Simplification client performed at 1400MB/s. Task LU-1669 (out of scope for this contract) is thought to be primarily responsible for the large improvement in write performance between 2.6.0 and the accumulated unrelated CLIO Simplification patches.
=== mdtest, Single Client ===
Best FPP tree filestat: '''115%'''<br />Best SSF tree dircreate: '''103%''' <br />Worst FPP tree treecreate: '''97%'''<br />Worst SSF tree rm: '''97%'''
OBSERVATIONS: All 32 observations show CLIO Simplification performance within tolerance. Only the best and the worst are included here.
=== mdtest, 100 Clients ===
{|
|
|Hyperion run 1
|Hyperion run 1
|Hyperion run 2
|Hyperion run 2
|-
|<br />
|File per process
|Single shared file
|File per process
|Single shared file
|-
|Dir create
|'''108%'''
|'''104%'''
|'''88%'''
|'''98%'''
|-
|Dir stat
|'''96%'''
|'''100%'''
|'''100%'''
|'''100%'''
|-
|Dir rm
|'''110%'''
|'''98%'''
|'''95%'''
|'''96%'''
|-
|File create
|'''181%'''
|'''235%'''
|'''201%'''
|'''168%'''
|-
|File stat
|'''96%'''
|'''99%'''
|'''98%'''
|'''100%'''
|-
|File rm
|'''108%'''
|'''104%'''
|'''104%'''
|'''102%'''
|-
|Tree create
|'''100%'''
|'''86%'''
|'''100%'''
|'''100%'''
|-
|Tree rm
|'''114%'''
|'''104%'''
|'''106%'''
|'''97%'''
|}
OBSERVATIONS: Of 32 metrics, only two were observed out of tolerance for mdtest at 100-client scale. The out-of-tolerance observations were not repeated between consecutive runs and are judged to be due to the variability that occurs at 100-client scale. A significant performance increase is observed on metrics including ‘file create’ and ‘file rm’.
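The tolerance check behind this observation can be reproduced from the table above. The Python sketch below is illustrative only: the percentages are copied from the table, while the dictionary layout, the column labels and the 95% threshold (our encoding of the "more than 5% degradation" rule) are assumptions made for the example.

```python
# Percent-of-baseline values copied from the mdtest 100-client table above.
COLUMNS = ["run 1 FPP", "run 1 SSF", "run 2 FPP", "run 2 SSF"]
RESULTS = {
    "Dir create":  [108, 104, 88, 98],
    "Dir stat":    [96, 100, 100, 100],
    "Dir rm":      [110, 98, 95, 96],
    "File create": [181, 235, 201, 168],
    "File stat":   [96, 99, 98, 100],
    "File rm":     [108, 104, 104, 102],
    "Tree create": [100, 86, 100, 100],
    "Tree rm":     [114, 104, 106, 97],
}

# Assumed threshold: anything under 95% is a degradation of more than 5%.
TOLERANCE_PCT = 95

total_metrics = sum(len(values) for values in RESULTS.values())

# Collect every (metric, column, value) that falls below the threshold.
out_of_tolerance = [
    (metric, column, value)
    for metric, values in RESULTS.items()
    for column, value in zip(COLUMNS, values)
    if value < TOLERANCE_PCT
]

print(f"{len(out_of_tolerance)} of {total_metrics} metrics out of tolerance")
for metric, column, value in out_of_tolerance:
    print(f"  {metric} ({column}): {value}%")
```

With this threshold, neither flagged metric is out of tolerance in both Hyperion runs, which is consistent with the attribution to run-to-run variability.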
=== Conclusion ===
Functional testing and SWL testing of the CLIO Simplification stack were completed successfully. Performance testing is also complete. Of 49 values presented within the performance testing, four were observed more than 5% below the 2.6.0 baseline value. On review, these four observations can be attributed to the challenges of running repeatable tests with low variability at scale. Further tests were run to check whether CLIO Simplification patches introduced these observed regressions, but no evidence was found to support this being the case. We are confident that the CLIO Simplification patches do not introduce regressions and this phase of the project is complete.
=== Appendix ===
A PDF of the report, including detailed results, is included in this document.
[[File:CLIO_test_and_fix_complete_report.pdf]]
== Landing Phase ==
Patches for a simplified CLIO stack were landed over a period of approximately six months, with the final outstanding CLIO Simplification patch landed on Mon, 29 Jun 2015. The complete list of patches created by Intel, with the commits that landed them on Lustre master, is recorded below.
{|
|Change
|LU ticket
|files changed
|lines added
|lines deleted
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=3f3a24dc5d7d421e1514dc49cc7c2eb5cb762b26 3f3a24d]
|[https://jira.hpdd.intel.com/browse/LU-3259 LU-3259 clio: cl_lock simplification]
|34
|1535
|6170
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=cdfbc722f4d63d3ed3740cbb549062f712010d90 cdfbc72]
|[https://jira.hpdd.intel.com/browse/LU-2675 LU-2675 build: remove liblustre and libsysio]
|149
|1
|39166
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=ca209745ef7d4e33ed0ba8fe8a3fe2ea6ed4eb8a ca20974]
|[https://jira.hpdd.intel.com/browse/LU-2675 LU-2675 mgc: remove libmgc.c]
|3
|1
|171
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=01def2b635ff0b7bacde158d9124334c42cd5d2b 01def2b]
|[https://jira.hpdd.intel.com/browse/LU-2675 LU-2675 build: remove Darwin "support"]
|126
|14
|13084
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=c2c14f31da5f69770d3a46627c81335f5b8d7821 c2c14f3]
|[https://jira.hpdd.intel.com/browse/LU-2675 LU-2675 build: remove WinNT "support"]
|75
|23
|24016
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=72714911b716b9ec8eba294d852164e7a3e4b380 7271491]
|[https://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: add coo_getstripe interface]
|9
|173
|116
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=2d686e9c9cc3c3c47cce92a0ff495b04efacd3a9 2d686e9]
|[https://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: add cl_object_find_cbdata()]
|10
|132
|162
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=1b209744469c5c4296aa496de114d53d03aaa071 1b20974]
|[https://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: use CIT_SETATTR for FSFILT_IOC_SETFLAGS]
|15
|74
|336
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=c16ecc8600c57f5b2338c59649654bb2716780f6 c16ecc8]
|[https://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: add cl_object_fiemap()]
|7
|687
|604
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=49b17944e1a61f88bddb5595bb053a555c8c08da 49b1794]
|[https://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: add coo_obd_info_get and coo_data_version]
|9
|303
|85
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=3151aa574e2c9bd3343dad11577cba3c55c16dca 3151aa5]
|[https://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: remove IOC_LOV_GETINFO]
|10
|8
|255
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=be5ef474be66c7978b427421556b50e5a1b51077 be5ef47]
|[https://jira.hpdd.intel.com/browse/LU-5823 LU-5823 clio: get rid of lov_stripe_md reference]
|2
|5
|15
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=315f6e0237b676a7512a4d2fa5765ad57483676e 315f6e0]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 obd: remove unused LSM parameters]
|17
|49
|70
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=6acf93339ad3297f2e5c659f2269c05df6198f74 6acf933]
|[https://jira.hpdd.intel.com/browse/LU-5823 LU-5823 llite: Remove access of stripe in ll_setattr_raw]
|8
|62
|70
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=c2c332bd8418bfe117e8d0a2b9a10f6e6e0cd204 c2c332b]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: remove LL_IOC_RECREATE_{FID,OBJ}]
|4
|2
|175
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=761e93f471aed74cabd9de75310d7b028875a9f0 761e93f]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 echo: remove userspace LSM handling]
|2
|52
|162
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=a3a6d677f39160a9c7c714b4fa9248ce78a1dd2a a3a6d67]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: remove unused {get,set}_info handlers]
|3
|15
|179
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=eecdcf3004952f5d3a7126e0d6790c0be82f0a56 eecdcf3]
|[https://jira.hpdd.intel.com/browse/LU-5418 LU-5418 echo: replace lov_stripe_md with lov_oinfo]
|7
|102
|298
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=be56983d01670119ed88923cc9b5c336f4552302 be56983]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 llite: remove ll_objects_destroy()]
|13
|15
|270
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=3740cc5acd83b99f185dc4bf8ea27cf472ab51d2 3740cc5]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: flatten struct lov_stripe_md]
|2
|11
|25
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=315f6e0237b676a7512a4d2fa5765ad57483676e 315f6e0]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 obd: remove unused LSM parameters]
|17
|49
|70
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=427e6a469722cf14b2cd80cec991a4154b4aae50 427e6a4]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: add cl_object_layout_get()]
|12
|375
|323
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=25670bb8c21deb64cfbb277bdeeab6e7ee39aa0e 25670bb]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 llite: remove lli_has_smd]
|7
|58
|116
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=94fe3dadb5774af0eb0fec6d21aa73a22ac4838c 94fe3da]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 llite: add cl_object_maxbytes()]
|7
|128
|117
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=800e18fc318096e0e552e9cb1927ad99b61d205e 800e18f]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: use obd_get_info() to get def/max LOV EA sizes]
|5
|50
|44
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=5ccd7a4a556b1a847eb5bff8b2395522a6f4bca8 5ccd7a4]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: remove LSM from struct lustre_md]
|12
|178
|216
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=d136d6bda8bb59d5055d2f64bef2abe6fbbfceda d136d6b]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 lov: move LSM to LOV layer]
|11
|110
|354
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=8f27184b14a192848429e52ac234805c324e1f7a 8f27184]
|[https://jira.hpdd.intel.com/browse/LU-5814 LU-5814 obd: rename obd_unpackmd() to md_unpackmd()]
|7
|32
|78
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=fb4f05246a7e738bf6b759811a32ad8f8743cb6e fb4f052]
|[https://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: merge lclient.h into llite/vvp_internal.h]
|15
|469
|516
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=ae37991bcc117adfb5bb31a46aa635d396b24694 ae37991]
|[https://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: rename ccc_device to vvp_device]
|5
|107
|133
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=e2d2fbc07bf8f45e19d8f3127c3a7088351126d6 e2d2fbc]
|[https://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: rename ccc_object to vvp_object]
|15
|246
|304
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=be4372fddbada6d026f4188a7e88c6a11d0a83d4 be4372f]
|[https://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: rename ccc_page to vvp_page]
|12
|223
|300
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=9bf46408b3c2c8b7f939d7000a9e8df38c3fd6ed 9bf4640]
|[https://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: rename ccc_lock to vvp_lock]
|6
|57
|77
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=bb6dbca9c2c9bdcd33663d6449b27a671fcaf902 bb6dbca]
|[https://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: merge ccc_io and vvp_io]
|11
|223
|297
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=719757dd2526e003911eed0b0830ed70d278cdd4 719757d]
|[https://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: remove struct ll_ra_read]
|3
|38
|118
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=2bac2cd8f7bf7f31b92e976d500d89b958ab1788 2bac2cd]
|[https://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: use vui prefix for struct vvp_io members]
|8
|216
|213
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=f384d789b21689f72dc1ad03bfbf16b114748ea2 f384d78]
|[https://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: move vvp_io functions to vvp_io.c]
|4
|243
|269
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=88c8560727253eb04811cad643a3dcca5a553788 88c8560]
|[https://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: rename ccc_req to vvp_req]
|6
|156
|138
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=6eda93c7b5f65324bdc831100a17c0bef1a3c078 6eda93c]
|[https://jira.hpdd.intel.com/browse/LU-5971 LU-5971 llite: reorganize variable and data structures]
|14
|289
|344
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=0596cbd6d1947d3f479892a3b4f92c50b092ab0c 0596cbd]
|[https://jira.hpdd.intel.com/browse/LU-6028 LU-6028 ldlm: move LDLM_GID_ANY to lustre_dlm.h]
|2
|5
|2
|-
|[http://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=193f39a6ce80e51809b00715c3740a2f977ece09 193f39a]
|[https://jira.hpdd.intel.com/browse/LU-6046 LU-6046 clio: update comments after cl_lock simplification]
|2
|19
|124
|-
|
|<br />Total<br />
|686
|6535
|89582
|}
During pursuit of the [http://wiki.opensfs.org/Contract_SFS-DEV-004 project goals], more than 89KLOC have been removed. This figure represents over 10% of the Lustre code base and the successful execution of this project delivers a simplified code base for future enhancement. With successful completion of this project, this milestone was agreed on 2015-07-07.


== Change requests ==


==== STATUS: Submitted for review 2nd March 2015 ====
== Lessons Learned ==
Attendees: James, Jinshan, Doug, Richard, John.
"Variability should be communicated from the raw results through to the validation metrics."
"Schedule wasn't too bad: six weeks behind on demo, four weeks behind on Landing."
"Good headlines for the project: 89KLOC removed."
"Clean-up project was not overwhelming. One or two regressions. Exposed problems in the test suite. Really need to figure out why they only become visible after time. Go back and analyze why these issues aren't immediately visible."
"Thought it went pretty well."
"Spent time re-basing and avoiding collisions. Topic branches would be useful for something like this in the future."
"CLIO work was separated well from the DNE2 and LFSCK3/4 work areas. This avoided costly collisions during landing"
"Test matrix needed to be made available to assist communication, record configuration, and communicate data."
"Using the OpenSFS wiki as the canonical project tool worked well. It allowed our work to be visible and avoided copying work between wikis."

Latest revision as of 14:43, 19 November 2015


Overview

The Lustre client implementation for the IO path (called CLIO) is responsible for issuing RPC commands for reading and writing data to the OSTs. CLIO was reconstructed in Lustre 2.0 for cross-platform portability. The CLIO implementation is too complex for the current usage, thus making the code hard to understand and maintain. This project implements the tasks described during work for OpenSFS contract 003 (see below). This work includes:

cl_lock re-factoring

cl_lock is highly complex and difficult to maintain. As a result, enhancements to the client code are time consuming and a significant number of bugs have been traced to cl_lock portions of the code.

ioctl calls implementation

ioctl calls are inconsistently implemented in CLIO. By re-organizing these calls, the removal of the old OBD API becomes possible.

removal of obsolete OBD API call-backs

Remove unused code that misleads and confuses developers who are unfamiliar with Lustre client code.

removal of non-linux interfaces

Remove unused code that misleads and confuses developers who are unfamiliar with Lustre client code.

removal of stripe md access beyond LOV layer

Remove code that does not observe public interfaces as it misleads and confuses developers who are unfamiliar with Lustre client code.

For the contract statement of work, see SFS-DEV-004_SOW.pdf. The goal of the CLIO Simplification Implementation contract is the implementation in the Lustre source code of the CLIO Simplification Design that resulted from Project 2 of Contract SFS-DEV-003.

Contract Completion

The contract reached completion in July of 2015.

Key People

OpenSFS

  • Sarp Oral - OpenSFS Contract Administrator
  • Christopher Morrone - OpenSFS Technical Representative

Project Approval Committee (PAC)

  • Christopher Morrone - PAC Chair
  • Colin Faber
  • Patrick Farrell
  • Jason Hill
  • James Simmons
  • Cory Spitz

Intel

  • Richard Henwood - Project Manager
  • Andreas Dilger - Consulting Architect
  • Jinshan Xiong - Lead Engineer

Important Dates

The official start date of work is agreed to be October 13, 2014.

The contract lists milestone target dates in weeks relative to the start date. With the start date agreed, here we can just list actual dates to keep things easy to understand.

Milestone task   Target Completion   Actual Completion
Implementation   Jan 26th 2015       Mar 13th 2015
Test and fix     Apr 6th 2015        Jun 16th 2015
Demonstration    May 4th 2015        Omitted (see change request 004-002)
Landing          Jun 1st 2015        Jul 7th 2015

Meeting Minutes

Work Completed

cl_lock re-factoring (simplified and cache-less) DONE

LU-3259 cl_lock re-factoring The cl_lock is necessary because it communicates the DLM lock for a specific IO. The current implementation is highly complex. This work will write a simplified cl_lock. The new lock will be cache-less and replace the current implementation.

Planning Review Landed
10858 LU-3259 clio: cl_lock simplification

Removal of liblustre DONE

LU-2675 removal of liblustre

Planning Review Landed
10657 LU-2675 build: remove liblustre and libsysio
11772 LU-2675 mgc: remove libmgc.c
Completed as part of Removal of Dead Code project 10172 LU-2675 llite: remove liblustre includes
Completed as part of Removal of Dead Code project 10195 LU-2675 lmv: remove liblustre includes
Completed as part of Removal of Dead Code project 10196 LU-2675 lov: remove liblustre includes
11423 LU-2675 build: remove Darwin "support"
11385 LU-2675 build: remove WinNT "support"

function calls implementation and cleanup obsolete OBD methods DONE

LU-5823 Replace some obsolete obd operations with CLIO ioctl interface OBD API operations for read, write, setattr, getattr, etc. became obsolete after the MDT, OFD and client restructuring were completed. This work removes these redundant operations. Redundant code remains in CLIO, and interfaces that are not referenced by any module will be targeted for removal.

Planning Review Landed
12452 LU-5823 clio: add coo_getstripe interface
12494 LU-5823 clio: add cl_object_find_cbdata()
13422 LU-5823 clio: use CIT_SETATTR for FSFILT_IOC_SETFLAGS
12535 LU-5823 clio: add cl_object_fiemap()
12638 LU-5823 clio: add coo_obd_info_get and coo_data_version
12748 LU-5823 clio: remove IOC_LOV_GETINFO
12639 LU-5823 clio: get rid of lov_stripe_md reference
13426 LU-5814 obd: remove unused LSM parameters
13514 LU-5823 llite: Remove access of stripe in ll_setattr_raw
Function Status
o_precreate GONE
o_create Used by Echoclient
o_create_async GONE
o_destroy Used by Echoclient
o_setattr Used by Echoclient
o_setattr_async GONE
o_getattr Used by Echoclient
o_getattr_async GONE
o_brw GONE
o_merge_lvb GONE
o_adjust_kms GONE
o_punch GONE
o_sync GONE
o_migrate GONE
o_copy GONE
o_preprw Used by Echoclient and OST server-side code
o_commitrw Used by Echoclient and OST server-side code
o_enqueue GONE
o_cancel GONE
o_change_cbdata GONE
o_find_cbdata GONE
o_change_cbdata GONE
o_extent_calc GONE

Remove lov_stripe_md (LSM) direct access beyond LOV layer DONE

LU-5814 encapsulate lov_stripe_md (LSM) to LOV layer The current CLIO implementation has a good interface to file layout operations. Legacy code still exists that does not use this interface. The code that does not use the file layout interface will be reviewed and targeted for removal or re-design to use the file layout interface.

Planning Review Landed
12442 LU-5814 lov: remove LL_IOC_RECREATE_{FID,OBJ}
12446 LU-5814 echo: remove userspace LSM handling
12445 LU-5814 lov: remove unused {get,set}_info handlers
12447 LU-5418 echo: replace lov_stripe_md with lov_oinfo
12618 LU-5814 llite: remove ll_objects_destroy()
12581 LU-5814 lov: flatten struct lov_stripe_md
13426 LU-5814 obd: remove unused LSM parameters
13680 LU-5814 lov: add cl_object_layout_get()
13690 LU-5814 llite: replace lli_has_smd with lli_layout_type
13694 LU-5814 llite: add cl_object_maxbytes()
13695 LU-5814 lov: use obd_get_info() to get def/max LOV EA sizes
13722 LU-5814 lov: remove LSM from struct lustre_md
13696 LU-5814 lov: move LSM to LOV layer
13737 LU-5814 obd: rename obd_unpackmd() to md_unpackmd()

NOTE: struct obd_info:oi_md cannot be removed now because of inter-dependencies between clean-up patches.

Remove non-linux interfaces DONE

Two parts

Remove some cfs_ prefixed functions. DONE

Planning Review Landed
NON-INTEL: 6956 LU-1346 libcfs: cleanup libcfs primitive (linux-prim.h)
NON-INTEL: 11797 LU-3963 libcfs: remove last of cfs list wrappers
NON-INTEL: 13070 LU-3963 libcfs: Use kernel's strncasecmp and remove cfs_get_blocked_sigs

NOTE:

  • cfs_snprintf() does have uses, is not equivalent to snprintf(), and will remain.
  • cfs_hlist* are needed for Linux kernel compatibility and will remain.

Remove ccc_ layer DONE

LU-5971 removal of ccc_ layer With the removal of liblustre, the ccc_ layer is redundant and complex. The remaining useful functions will be merged into vfs vm posix layer and the ccc_ layer will be removed.

Planning Review Landed
12592 LU-5971 llite: merge lclient.h into llite/vvp_internal.h
13075 LU-5971 llite: rename ccc_device to vvp_device
13077 LU-5971 llite: rename ccc_object to vvp_object
13086 LU-5971 llite: rename ccc_page to vvp_page
13088 LU-5971 llite: rename ccc_lock to vvp_lock
13351 LU-5971 llite: merge ccc_io and vvp_io
13347 LU-5971 llite: remove struct ll_ra_read
13363 LU-5971 llite: use vui prefix for struct vvp_io members
13376 LU-5971 llite: move vvp_io functions to vvp_io.c
13377 LU-5971 llite: rename ccc_req to vvp_req
13714 LU-5971 llite: rename struct ccc_grouplock to ll_grouplock

Regressions

Planning Review Landed
Move definition of LDLM_GID_ANY to lustre_dlm.h
LU-6046 audit comments in cl_object.h

Test and Fix Phase

For this phase, we will complete the following:

  1. Contractor demonstrates the code passing the complement of tests in Contractor's Autotest environment with the code applied to the Lustre Master tree.
  2. Contractor demonstrates the code runs successfully at scale (typically completing a 48 hour SWL run on the Hyperion platform at Lawrence Livermore National Laboratory).
  3. Contractor executes performance regression testing, identifying and addressing performance regressions related to the development of the revised code. This performance testing will be run on a system with at least 100 clients and will compare results of IOR, mdtest on builds before and after the implementation of the CLIO Simplification HLD. Degradation of more than 5% will be taken as a failure, but small drops will be accepted as within normal variation.

Introduction

The following milestone completion document applies to CLIO Simplification Project recorded in the OpenSFS Lustre Development contract SFS-DEV-004 agreed September 25, 2014. The CLIO Simplification code is functionally complete and recorded in the Implementation Milestone. Completion of this milestone requires the following tasks to be executed:

  1. Contractor demonstrates the code passing the complement of tests in Contractor's Autotest environment with the code applied to the Lustre Master tree.
  2. Contractor demonstrates the code runs successfully at scale (typically completing a 48 hour SWL run on the Hyperion platform at the Lawrence Livermore National Laboratory).
  3. Contractor executes performance regression testing, identifying and addressing performance regressions related to the development of the revised code. This performance testing will be run on a system with at least 100 clients and will compare results of IOR, mdtest on builds before and after the implementation of the CLIO Simplification High Level Design. Degradation of more than 5% will be taken as a failure, but small drops will be accepted as within normal variation.

NOTE: This task list includes agreed enhancements in item 3. They are: lnet-selftest has been omitted as redundant. Mdtest has been selected as a better alternative to mdsrate.

Overview of CLIO Simplification

CLIO Simplification work was completed with six high-level tasks. These are:

  • cl_lock re-factoring (simplified and cache-less).
  • Liblustre removal.
  • Implement function calls and cleanup obsolete OBD methods.
  • Remove lov_stripe_md (LSM) direct access beyond LOV layer.
  • Remove cfs_ prefixed functions, where appropriate.
  • Remove ccc_ layer.

This work has been completed in the following patches:

Change # Work ticket
11013 LU-3259 clio: get rid of cl_req
10858 LU-3259 clio: cl_lock simplification
10657 LU-2675 build: remove liblustre and libsysio
11772 LU-2675 mgc: remove libmgc.c
11423 LU-2675 build: remove Darwin “support”
11385 LU-2675 build: remove WinNT “support”
13514 LU-5823 llite: Remove access of stripe in ll_setattr_raw
12452 LU-5823 clio: add coo_getstripe interface
12494 LU-5823 clio: add cl_object_find_cbdata()
13422 LU-5823 clio: use CIT_SETATTR for FSFILT_IOC_SETFLAGS
12535 LU-5823 clio: add cl_object_fiemap()
12638 LU-5823 clio: add coo_obd_info_get and coo_data_version
12748 LU-5823 clio: remove IOC_LOV_GETINFO
12639 LU-5823 clio: get rid of lov_stripe_md reference
13426 LU-5814 obd: remove unused LSM parameters
13722 LU-5814 lov: remove LSM from struct lustre_md
12442 LU-5814 lov: remove LL_IOC_RECREATE_{FID,OBJ}
13696 LU-5814 lov: move LSM to LOV layer
12446 LU-5814 echo: remove userspace LSM handling
13737 LU-5814 obd: rename obd_unpackmd() to md_unpackmd()
12445 LU-5814 lov: remove unused {get,set}_info handlers
12447 LU-5418 echo: replace lov_stripe_md with lov_oinfo
12618 LU-5814 llite: remove ll_objects_destroy()
12581 LU-5814 lov: flatten struct lov_stripe_md
13426 LU-5814 obd: remove unused LSM parameters
13680 LU-5814 lov: add cl_object_layout_get()
13690 LU-5814 llite: replace lli_has_smd with lli_layout_type
13694 LU-5814 llite: add cl_object_maxbytes()
13695 LU-5814 lov: use obd_get_info() to get def/max LOV EA sizes
12592 LU-5971 llite: merge lclient.h into llite/vvp_internal.h
13075 LU-5971 llite: rename ccc_device to vvp_device
13077 LU-5971 llite: rename ccc_object to vvp_object
13086 LU-5971 llite: rename ccc_page to vvp_page
13088 LU-5971 llite: rename ccc_lock to vvp_lock
13351 LU-5971 llite: merge ccc_io and vvp_io
13347 LU-5971 llite: remove struct ll_ra_read
13363 LU-5971 llite: use vui prefix for struct vvp_io members
13376 LU-5971 llite: move vvp_io functions to vvp_io.c
13377 LU-5971 llite: rename ccc_req to vvp_req
13714 LU-5971 llite: rename struct ccc_grouplock to ll_grouplock
13074 LU-6028 Move definition of LDLM_GID_ANY to lustre_dlm.h
13137 LU-6046 audit comments in cl_object.h

Autotest results

The complete series of patches are recorded at http://review.whamcloud.com/13737/ and below. This series (patch set 3) successfully passed Autotest on March 27th. This result is recorded here:

NOTE: since completing these tests, many unrelated patches have landed on master that have obligated a re-base of this patch series.

48hr SWL run on Hyperion

SWL completed a 48 hour run on Hyperion on March 12 with no observed issues. A summary of the test is below:

Summary
=======

Start Time: Thu Mar 12 05:59:15 PDT 2015
Job Totals
  Passed:       14346
  Failed:       0
  Terminated:   64
  Unknown:      0
  Total:        14410
  Failure Rate: 0.00%
Run Times
  Wall Clock Run Time:       2253.22 hrs.
  Node Run Time:             14018.81 node-hrs.
  SWL Node Utilization:      145.18%
  SLURM Node Utilization:    n/a
Excessive Run-time Variation Job Count: 0
Overall Job Coverage: 21.7% (138/636)
Passed Job Coverage: 21.5% (137/636)
End Time: Sat Mar 14 06:16:02 PDT 2015
SWL Run Time: 173807 sec. (48.28) hrs.)

                              Failure Mode Summary
 Mode   Count                       Description
====== ======= ==========================================================
 129   3431    TBD

                               Failure Mode Breakdown
  Test    Mode   Count                       Description
======== ====== ======= ==========================================================
IO        129   3431    TBD

Report generated on Sat Mar 14 06:50:43 PDT 2015

NOTE: SWL runs continuously. This test run was ended after 48 hours. Jobs that were running when the test run was completed are recorded as terminated in this summary.
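The headline figures in the summary can be cross-checked from the values it reports. The following is a minimal sketch, not part of the SWL tooling: the helper name `parse_swl_time` is hypothetical, and it assumes an English locale for the day and month abbreviations and that both timestamps share the same timezone (PDT).

```python
from datetime import datetime


def parse_swl_time(stamp):
    """Parse a timestamp such as 'Thu Mar 12 05:59:15 PDT 2015' (hypothetical helper)."""
    parts = stamp.split()
    # Drop the timezone token (index 4); both stamps in the summary use PDT.
    del parts[4]
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")


start = parse_swl_time("Thu Mar 12 05:59:15 PDT 2015")
end = parse_swl_time("Sat Mar 14 06:16:02 PDT 2015")

elapsed_sec = int((end - start).total_seconds())
elapsed_hrs = round(elapsed_sec / 3600, 2)

# Job counts copied from the "Job Totals" block of the summary.
jobs = {"Passed": 14346, "Failed": 0, "Terminated": 64, "Unknown": 0}
total_jobs = sum(jobs.values())
failure_rate = 100.0 * jobs["Failed"] / total_jobs

print(elapsed_sec, elapsed_hrs)            # 173807 48.28
print(total_jobs, f"{failure_rate:.2f}%")  # 14410 0.00%
```

The computed run time and job total agree with the "SWL Run Time" and "Total" lines of the summary.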

Performance tests

This series of tests is designed to verify that the CLIO Simplification project has not negatively affected the performance of the Lustre filesystem. The tests were executed on Hyperion. Hyperion runs with 16 threads for single-client tests, and 1600 IOR threads for 100-client tests. The baseline for performance was selected as Lustre 2.6.0. The build with CLIO patches applied was created from http://review.hpdd.intel.com/13318/ (since merged into http://review.hpdd.intel.com/13714 for landing).

Five consecutive tests were run for each metric, and the mean of the five runs was used to calculate the percentage difference against the baseline. The complete result set is recorded in Appendix A. Observed CLIO Simplification performance that is more than 5% slower than the baseline is presented in red.

Guidelines for the reader:

  • CLIO Simplification patches have been landed into Master over the last 9 months. During this time, 988 patches have landed which may or may not be responsible for changes in performance.
  • Variability in performance computing is commonly observed during tests on Hyperion. It is not unknown to see a 10% variation between consecutive runs of the same code. The figure below illustrates variability in performance over 15 recent consecutive tags. NOTE: two runs of 2.6.90 on different dates (2.6.90.1 and 2.6.90.2 on the figure below) show significant differences in read performance for otherwise identical Lustre versions.

Figure 1 (Cliofig1.png): Performance of 15 consecutive tags including 2.6.0 and 2.7.0 releases as well as more recent master tags. Read and write bandwidth of a single shared file with 100 clients is recorded. Significant variability can be observed between any two consecutive tags.
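The comparison method described above (mean of five runs, expressed as a percentage of the baseline mean, with a 5% tolerance) can be sketched as follows. This is an illustration only: the function names are hypothetical and the bandwidth samples are made-up values, not measurements from Hyperion.

```python
from statistics import mean

TOLERANCE_PCT = 5.0  # degradation of more than 5% is taken as a failure


def percent_of_baseline(baseline_runs, candidate_runs):
    """Return the candidate mean as a percentage of the baseline mean."""
    return 100.0 * mean(candidate_runs) / mean(baseline_runs)


def within_tolerance(pct, tolerance_pct=TOLERANCE_PCT):
    """True unless the candidate is more than tolerance_pct slower than baseline."""
    return pct >= 100.0 - tolerance_pct


# Hypothetical bandwidth samples in MB/s (NOT measured data), five runs each.
baseline = [1000.0, 1020.0, 980.0, 1010.0, 990.0]
candidate = [960.0, 985.0, 970.0, 990.0, 945.0]

pct = percent_of_baseline(baseline, candidate)
print(f"{pct:.0f}% of baseline, within tolerance: {within_tolerance(pct)}")
```

Comparing means hides run-to-run spread, which is why the guidelines above ask the reader to keep the observed variability in mind when interpreting a single percentage.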

IOR, 100 Clients, Single Shared File (SSF)


                               2.6.0 vs CLIO   2.6.53.1 vs CLIO
Read performance difference    101%            97%
Write performance difference   90%             100%

OBSERVATIONS: Figure 1 shows that the 2.6.0 baseline (far left) has write performance above the typical performance for 15 recent tags. This unusually high value for the 2.6.0 baseline means that more typical observations appear slow by comparison. Choosing a more common value for 100 client SSF read and write (i.e. tag 2.6.53.1) provides a better baseline value and shows performance within tolerance.

IOR, 100 Clients, File Per Process (FPP)

Read performance: 96%
Write performance: 102%

OBSERVATION: Both observations show CLIO Simplification performance within tolerance.

IOR, Single Client, File Per Process (FPP)


Hyperion OpenSFS test 1 OpenSFS test 2
Read performance difference 104% 104%
Write performance difference 92% 150% 97%

OBSERVATIONS: Write performance was below tolerance during our run on Hyperion on this test. This apparent slow-down was not reproducible over three re-runs on the OpenSFS cluster where between 97% and 150% write performance was observed.

IOR, Single Client, Single Shared File (SSF)

Read performance: 101%
Write performance: 388%

OBSERVATIONS: Observation of the 2.6.0 baseline showed write performance on Hyperion at 340MB/s. The CLIO Simplification client performed at 1400MB/s. Task LU-1669 (out of scope for this contract) is thought to be primarily responsible for the large improvement in write performance between 2.6.0 and the accumulated unrelated CLIO Simplification patches.

mdtest, Single Client

Best FPP tree filestat: 115%
Best SSF tree dircreate: 103%
Worst FPP tree treecreate: 97%
Worst SSF tree rm: 97%

OBSERVATIONS: All 32 observations show CLIO Simplification performance within tolerance. Only the best and the worst are included here.

mdtest, 100 Clients

              Hyperion run 1     Hyperion run 1       Hyperion run 2     Hyperion run 2
              File per process   Single shared file   File per process   Single shared file
Dir create    108%               104%                 88%                98%
Dir stat      96%                100%                 100%               100%
Dir rm        110%               98%                  95%                96%
File create   181%               235%                 201%               168%
File stat     96%                99%                  98%                100%
File rm       108%               104%                 104%               102%
Tree create   100%               86%                  100%               100%
Tree rm       114%               104%                 106%               97%

OBSERVATIONS: Out of 32 metrics, only two are observed out of tolerance for mdtest at 100 client scale. Out of tolerance observations were not repeated between consecutive runs and they are judged to be due to variability occurring at 100 client scale. A significant performance increase is observed on metrics including ‘file create’ and ‘file rm’.
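The tolerance check on the 100-client mdtest table can be reproduced with a short script. This is a sketch under stated assumptions: the values are copied from the table above, the column order (run 1 FPP, run 1 SSF, run 2 FPP, run 2 SSF) is assumed from the table header, and the names `MDTEST_100_CLIENTS` and `out_of_tolerance` are hypothetical.

```python
TOLERANCE_FLOOR = 95  # below 95% of baseline means more than 5% degradation

# Assumed column order, taken from the table header.
COLUMNS = ("run 1 FPP", "run 1 SSF", "run 2 FPP", "run 2 SSF")

# Percent-of-baseline values copied from the mdtest, 100 Clients table.
MDTEST_100_CLIENTS = {
    "Dir create": (108, 104, 88, 98),
    "Dir stat": (96, 100, 100, 100),
    "Dir rm": (110, 98, 95, 96),
    "File create": (181, 235, 201, 168),
    "File stat": (96, 99, 98, 100),
    "File rm": (108, 104, 104, 102),
    "Tree create": (100, 86, 100, 100),
    "Tree rm": (114, 104, 106, 97),
}


def out_of_tolerance(table, floor=TOLERANCE_FLOOR):
    """Return (metric, column, percent) for every value below the floor."""
    return [
        (metric, COLUMNS[i], pct)
        for metric, values in table.items()
        for i, pct in enumerate(values)
        if pct < floor
    ]


flagged = out_of_tolerance(MDTEST_100_CLIENTS)
for metric, column, pct in flagged:
    print(f"{metric} ({column}): {pct}%")
```

With the values as listed, the script flags two entries, Dir create in run 2 and Tree create in run 1, and neither repeats in the other run.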

Conclusion

Functional testing and SWL testing of the CLIO Simplification stack was completed successfully. Performance testing is also complete. Of 49 values presented within the performance testing, four were observed more than 5% below the 2.6.0 baseline value. On review, these four observations can be attributed to the challenges of running repeatable tests with low variability at scale. Further tests were run to check whether CLIO Simplification patches introduced these observed regressions, but no evidence was found to support this being the case. We are confident that the CLIO Simplification patches do not introduce regressions and this phase of the project is complete.

Appendix

A PDF of the report, including detailed results, is included in this document. File:CLIO test and fix complete report.pdf

Landing Phase

Patches for a simplified CLIO stack were landed over a period of approximately six months with the final outstanding CLIO Simplification patch landed on Mon, 29 Jun 2015. The complete list of patches created by Intel and their commitment into the Lustre Master is recorded below.

Change LU ticket files changed lines added lines deleted
3f3a24d LU-3259 clio: cl_lock simplification 34 1535 6170
cdfbc72 LU-2675 build: remove liblustre and libsysio 149 1 39166
ca20974 LU-2675 mgc: remove libmgc.c 3 1 171
01def2b LU-2675 build: remove Darwin "support" 126 14 13084
c2c14f3 LU-2675 build: remove WinNT "support" 75 23 24016
7271491 LU-5823 clio: add coo_getstripe interface 9 173 116
2d686e9 LU-5823 clio: add cl_object_find_cbdata() 10 132 162
1b20974 LU-5823 clio: use CIT_SETATTR for FSFILT_IOC_SETFLAGS 15 74 336
c16ecc8 LU-5823 clio: add cl_object_fiemap() 7 687 604
49b1794 LU-5823 clio: add coo_obd_info_get and coo_data_version 9 303 85
3151aa5 LU-5823 clio: remove IOC_LOV_GETINFO 10 8 255
be5ef47 LU-5823 clio: get rid of lov_stripe_md reference 2 5 15
315f6e0 LU-5814 obd: remove unused LSM parameters 17 49 70
6acf933 LU-5823 llite: Remove access of stripe in ll_setattr_raw 8 62 70
c2c332b LU-5814 lov: remove LL_IOC_RECREATE_{FID,OBJ} 4 2 175
761e93f LU-5814 echo: remove userspace LSM handling 2 52 162
a3a6d67 LU-5814 lov: remove unused {get,set}_info handlers 3 15 179
eecdcf3 LU-5418 echo: replace lov_stripe_md with lov_oinfo 7 102 298
be56983 LU-5814 llite: remove ll_objects_destroy() 13 15 270
3740cc5 LU-5814 lov: flatten struct lov_stripe_md 2 11 25
315f6e0 LU-5814 obd: remove unused LSM parameters 17 49 70
427e6a4 LU-5814 lov: add cl_object_layout_get() 12 375 323
25670bb LU-5814 llite: remove lli_has_smd 7 58 116
94fe3da LU-5814 llite: add cl_object_maxbytes() 7 128 117
800e18f LU-5814 lov: use obd_get_info() to get def/max LOV EA sizes 5 50 44
5ccd7a4 LU-5814 lov: remove LSM from struct lustre_md 12 178 216
d136d6b LU-5814 lov: move LSM to LOV layer 11 110 354
8f27184 LU-5814 obd: rename obd_unpackmd() to md_unpackmd() 7 32 78
fb4f052 LU-5971 llite: merge lclient.h into llite/vvp_internal.h 15 469 516
ae37991 LU-5971 llite: rename ccc_device to vvp_device 5 107 133
e2d2fbc LU-5971 llite: rename ccc_object to vvp_object 15 246 304
be4372f LU-5971 llite: rename ccc_page to vvp_page 12 223 300
9bf4640 LU-5971 llite: rename ccc_lock to vvp_lock 6 57 77
bb6dbca LU-5971 llite: merge ccc_io and vvp_io 11 223 297
719757d LU-5971 llite: remove struct ll_ra_read 3 38 118
2bac2cd LU-5971 llite: use vui prefix for struct vvp_io members 8 216 213
f384d78 LU-5971 llite: move vvp_io functions to vvp_io.c 4 243 269
88c8560 LU-5971 llite: rename ccc_req to vvp_req 6 156 138
6eda93c LU-5971 llite: reorganize variable and data structures 14 289 344
0596cbd LU-6028 ldlm: move LDLM_GID_ANY to lustre_dlm.h 2 5 2
193f39a LU-6046 clio: update comments after cl_lock simplification 2 19 124

Total
686 6535 89582

During pursuit of the project goals, more than 89KLOC have been removed. This figure represents over 10% of the Lustre code base and the successful execution of this project delivers a simplified code base for future enhancement. With successful completion of this project, this milestone was agreed on 2015-07-07.
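The totals row can be recomputed from the plain-text rows of the landing table. A minimal sketch, assuming each row ends with the three numeric columns (files changed, lines added, lines deleted); the function name `sum_landing_rows` is hypothetical, and only the first three rows are included here for brevity.

```python
def sum_landing_rows(rows):
    """Sum the three trailing numeric columns of landing-table rows."""
    totals = [0, 0, 0]  # files changed, lines added, lines deleted
    for row in rows:
        # The description may contain spaces, so split from the right.
        _, files, added, deleted = row.rsplit(None, 3)
        for i, value in enumerate((files, added, deleted)):
            totals[i] += int(value)
    return tuple(totals)


# First three rows of the landing table, copied as plain text.
sample = [
    "3f3a24d LU-3259 clio: cl_lock simplification 34 1535 6170",
    "cdfbc72 LU-2675 build: remove liblustre and libsysio 149 1 39166",
    "ca20974 LU-2675 mgc: remove libmgc.c 3 1 171",
]

files, added, deleted = sum_landing_rows(sample)
print(files, added, deleted)  # 186 1537 45507
```

Feeding every row of the table to the same function gives the figures to compare against the Total row.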


Change requests

004-001 CLIO ioctl's should be functions

CHANGE REQUEST: 004-001 CLIO ioctl's should be functions.

BACKGROUND: Ioctl calls were included in the CLIO Simplification design to replace some obsolete OBD operations. Alternatively, individual functions can replace the OBD operations instead.

CHANGE: Do not implement ioctl calls. Implement functions.

ACTIONS REQUIRED:

  • Ensure none of the current patches are landed.
  • Update the design document with the new design.
  • Update the ticket LU-5823 with new activity.
  • Execute work to complete LU-5823.


STATUS: Approved on 4th Dec 2014

004-002 Omit CLIO Demonstration milestone and associated milestone payment

CHANGE REQUEST: 004-002 Omit CLIO Demonstration milestone and associated milestone payment

BACKGROUND: The Demonstration milestone has been rendered redundant by a precisely specified 'Test and Fix' milestone that will execute before Demonstration. In the current plan, 'Test and Fix' (10 weeks) includes specific tests to run (including 48hr SWL, and performance characterization). There was agreement that a useful Demonstration Milestone is exactly defined by 'Test and Fix' Milestone. The plan requires specification of Demonstration during the Implementation phase and no additional work beyond 'Test and Fix' has been identified.

CHANGE: Omit Demonstration Milestone from the plan of record.

ACTIONS REQUIRED:

  • Move directly from 'Test and Fix' milestone to 'Landing' milestone.
  • Communicate new plan to stakeholders.

STATUS: Submitted for review 2nd March 2015

Lessons Learned

Attendees: James, Jinshan, Doug, Richard, John.

"Variability should be communicated from the raw results through to the validation metrics."

"Schedule wasn't too bad: six weeks behind on demo, four weeks behind on Landing."

"Good headlines for the project: 89KLOC removed."

"Clean-up project was not overwhelming. One or two regressions. Exposed problems in the test suite. Really need to figure out why they only become visible after time. Go back and analyze why these issues aren't immediately visible."

"Thought it went pretty well."

"Spent time re-basing and avoiding collisions. Topic branches would be useful for something like this in the future."

"CLIO work was separated well from the DNE2 and LFSCK3/4 work areas. This avoided costly collisions during landing"

"Test matrix needed to be made available to assist communication, record configuration, and communicate data."

"Using the OpenSFS wiki as the canonical project tool worked well. It allowed our work to be visible and avoided copying work between wikis."