Contract SFS-DEV-004

From OpenSFS Wiki

Revision as of 07:36, 16 June 2015


Overview

The goal of the CLIO Simplification Implementation contract is the implementation in the Lustre source code of the CLIO Simplification Design that resulted from Project 2 of Contract SFS-DEV-003.

For the contract statement of work, see SFS-DEV-004_SOW.pdf

Key People

OpenSFS

  • Sarp Oral - OpenSFS Contract Administrator
  • Christopher Morrone - OpenSFS Technical Representative

Project Approval Committee (PAC)

  • Christopher Morrone - PAC Chair
  • Colin Faber
  • Patrick Farrell
  • Jason Hill
  • James Simmons
  • Cory Spitz

Intel

  • Richard Henwood - Project Manager
  • Andreas Dilger - Consulting Architect
  • Jinshan Xiong - Lead Engineer

Important Dates

The official start date of work is agreed to be October 13, 2014.

The contract lists milestone target dates in weeks relative to the start date. Now that the start date is agreed, the calendar dates are listed here for clarity.

Milestone task   Target Completion   Actual Completion
Implementation   Jan 26th 2015
Test and fix     Apr 6th 2015
Demonstration    May 4th 2015
Landing          Jun 1st 2015
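
The contract expresses these targets in weeks from the start date. As a minimal sketch, the week offsets can be recovered from the calendar dates in the table. The offsets printed here are derived from those dates, not quoted from the contract, and the names used are illustrative only.

```python
from datetime import date

START = date(2014, 10, 13)  # agreed start date of work

# Target completion dates copied from the table above.
targets = {
    "Implementation": date(2015, 1, 26),
    "Test and fix": date(2015, 4, 6),
    "Demonstration": date(2015, 5, 4),
    "Landing": date(2015, 6, 1),
}


def weeks_from_start(target, start=START):
    """Return the whole number of weeks between start and target."""
    days = (target - start).days
    if days % 7:
        raise ValueError("target is not a whole number of weeks from start")
    return days // 7


offsets = {name: weeks_from_start(d) for name, d in targets.items()}
for name, weeks in offsets.items():
    print(f"{name}: week {weeks}")
```

The 10-week gap between Implementation and Test and fix agrees with the "Test and Fix (10 weeks)" figure quoted in change request 004-002.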

Meeting Minutes

Work in Progress or Completed

cl_lock re-factoring (simplified and cache-less) DONE

LU-3259 cl_lock re-factoring

The cl_lock is necessary because it communicates the DLM lock for a specific IO. The current implementation is highly complex. This work will write a simplified cl_lock. The new lock will be cache-less and will replace the current implementation.

Planning Review Landed
10858 LU-3259 clio: cl_lock simplification

Removal of liblustre DONE

LU-2675 removal of liblustre

Planning Review Landed
10657 LU-2675 build: remove liblustre and libsysio
11772 LU-2675 mgc: remove libmgc.c
Completed as part of Removal of Dead Code project 10172 LU-2675 llite: remove liblustre includes
Completed as part of Removal of Dead Code project 10195 LU-2675 lmv: remove liblustre includes
Completed as part of Removal of Dead Code project 10196 LU-2675 lov: remove liblustre includes
11423 LU-2675 build: remove Darwin "support"
11385 LU-2675 build: remove WinNT "support"

function calls implementation and cleanup obsolete OBD methods STARTED

LU-5823 Replace some obsolete obd operations with CLIO ioctl interface

OBD API operations for read, write, setattr, getattr, etc. became obsolete after the MDT, OFD and client restructuring was completed. This work removes these redundant operations. Redundant code remains in CLIO, and interfaces that are not referenced by any module will be targeted for removal.

Planning Review Landed
13514 LU-5823 llite: Remove access of stripe in ll_setattr_raw 12452 LU-5823 clio: add coo_getstripe interface
12494 LU-5823 clio: add cl_object_find_cbdata()
13422 LU-5823 clio: use CIT_SETATTR for FSFILT_IOC_SETFLAGS
12535 LU-5823 clio: add cl_object_fiemap()
12638 LU-5823 clio: add coo_obd_info_get and coo_data_version
12748 LU-5823 clio: remove IOC_LOV_GETINFO
12639 LU-5823 clio: get rid of lov_stripe_md reference
13426 LU-5814 obd: remove unused LSM parameters
Function Status
o_precreate GONE
o_create Used by Echoclient
o_create_async GONE
o_destroy Used by Echoclient
o_setattr Used by Echoclient
o_setattr_async GONE
o_getattr Used by Echoclient
o_getattr_async GONE
o_brw GONE
o_merge_lvb GONE
o_adjust_kms GONE
o_punch GONE
o_sync GONE
o_migrate GONE
o_copy GONE
o_preprw Used by Echoclient and OST server-side code
o_commitrw Used by Echoclient and OST server-side code
o_enqueue GONE
o_cancel GONE
o_change_cbdata GONE
o_find_cbdata GONE
o_change_cbdata GONE
o_extent_calc GONE

Remove lov_stripe_md (LSM) direct access beyond LOV layer STARTED

LU-5814 encapsulate lov_stripe_md (LSM) to LOV layer

The current CLIO implementation has a good interface to file layout operations. Legacy code still exists that does not use this interface. The code that does not use the file layout interface will be reviewed and targeted for removal or re-design to use the file layout interface.

Planning Review Landed
13737 LU-5814 obd: rename obd_unpackmd() to md_unpackmd() 12442 LU-5814 lov: remove LL_IOC_RECREATE_{FID,OBJ}
12446 LU-5814 echo: remove userspace LSM handling
12445 LU-5814 lov: remove unused {get,set}_info handlers
12447 LU-5814 echo: replace lov_stripe_md with lov_oinfo
12618 LU-5814 llite: remove ll_objects_destroy()
12581 LU-5814 lov: flatten struct lov_stripe_md
13426 LU-5814 obd: remove unused LSM parameters
13680 LU-5814 lov: add cl_object_layout_get()
13690 LU-5814 llite: replace lli_has_smd with lli_layout_type
13694 LU-5814 llite: add cl_object_maxbytes()
13695 LU-5814 lov: use obd_get_info() to get def/max LOV EA sizes
13722 LU-5814 lov: remove LSM from struct lustre_md
13696 LU-5814 lov: move LSM to LOV layer

NOTE: struct obd_info:oi_md cannot be removed now because of inter-dependencies between clean-up patches.

Remove non-linux interfaces STARTED

Two parts

Remove some cfs_ prefixed functions. DONE

Planning Review Landed
NON-INTEL: 6956 LU-1346 libcfs: cleanup libcfs primitive (linux-prim.h)
NON-INTEL: 11797 LU-3963 libcfs: remove last of cfs list wrappers
NON-INTEL: 13070 LU-3963 libcfs: Use kernel's strncasecmp and remove cfs_get_blocked_sigs

NOTE:

  • cfs_snprintf() is still in use and is not equivalent to snprintf(), so it will remain.
  • cfs_hlist* are needed for Linux kernel compatibility and will remain.

Remove ccc_ layer STARTED

LU-5971 removal of ccc_ layer

With the removal of liblustre, the ccc_ layer is redundant and complex. The remaining useful functions will be merged into the vfs vm posix (vvp) layer and the ccc_ layer will be removed.

Planning Review Landed
12592 LU-5971 llite: merge lclient.h into llite/vvp_internal.h
13075 LU-5971 llite: rename ccc_device to vvp_device
13077 LU-5971 llite: rename ccc_object to vvp_object
13086 LU-5971 llite: rename ccc_page to vvp_page
13088 LU-5971 llite: rename ccc_lock to vvp_lock
13351 LU-5971 llite: merge ccc_io and vvp_io
13347 LU-5971 llite: remove struct ll_ra_read
13363 LU-5971 llite: use vui prefix for struct vvp_io members
13376 LU-5971 llite: move vvp_io functions to vvp_io.c
13377 LU-5971 llite: rename ccc_req to vvp_req
13714 LU-5971 llite: rename struct ccc_grouplock to ll_grouplock

Regressions

Planning Review Landed
13074 LU-6028 Move definition of LDLM_GID_ANY to lustre_dlm.h
13137 LU-6046 audit comments in cl_object.h

Test and Fix Phase

For this phase, we will complete the following:

  1. Contractor demonstrates the code passing the complement of tests in Contractor's Autotest environment with the code applied to the Lustre Master tree.
  2. Contractor demonstrates the code runs successfully at scale (typically completing a 48 hour SWL run on the Hyperion platform at Lawrence Livermore National Laboratory).
  3. Contractor executes performance regression testing, identifying and addressing performance regressions related to the development of the revised code. This performance testing will be run on a system with at least 100 clients and will compare results of IOR and mdtest on builds before and after the implementation of the CLIO Simplification HLD. Degradation of more than 5% will be taken as a failure, but small drops will be accepted as within normal variation.

Introduction

The following milestone completion document applies to the CLIO Simplification Project recorded in the OpenSFS Lustre Development contract SFS-DEV-004, agreed September 25, 2014. The CLIO Simplification code is functionally complete and recorded in the Implementation Milestone. Completion of this milestone requires the following tasks to be executed:

  1. Contractor demonstrates the code passing the complement of tests in Contractor's Autotest environment with the code applied to the Lustre Master tree.
  2. Contractor demonstrates the code runs successfully at scale (typically completing a 48 hour SWL run on the Hyperion platform at the Lawrence Livermore National Laboratory).
  3. Contractor executes performance regression testing, identifying and addressing performance regressions related to the development of the revised code. This performance testing will be run on a system with at least 100 clients and will compare results of IOR, mdtest on builds before and after the implementation of the CLIO Simplification High Level Design. Degradation of more than 5% will be taken as a failure, but small drops will be accepted as within normal variation.

NOTE: This task list includes agreed enhancements in item 3: lnet-selftest has been omitted as redundant, and mdtest has been selected as a better alternative to mdsrate.

Overview of CLIO Simplification

CLIO Simplification work was completed with six high-level tasks. These are:

  • cl_lock re-factoring (simplified and cache-less).
  • Liblustre removal.
  • Implement function calls and cleanup obsolete OBD methods.
  • Remove lov_stripe_md (LSM) direct access beyond LOV layer.
  • Remove cfs_ prefixed functions, where appropriate.
  • Remove ccc_ layer.

This work has been completed in the following patches:

Change # Work ticket
11013 LU-3259 clio: get rid of cl_req
10858 LU-3259 clio: cl_lock simplification
10657 LU-2675 build: remove liblustre and libsysio
11772 LU-2675 mgc: remove libmgc.c
11423 LU-2675 build: remove Darwin “support”
11385 LU-2675 build: remove WinNT “support”
13514 LU-5823 llite: Remove access of stripe in ll_setattr_raw
12452 LU-5823 clio: add coo_getstripe interface
12494 LU-5823 clio: add cl_object_find_cbdata()
13422 LU-5823 clio: use CIT_SETATTR for FSFILT_IOC_SETFLAGS
12535 LU-5823 clio: add cl_object_fiemap()
12638 LU-5823 clio: add coo_obd_info_get and coo_data_version
12748 LU-5823 clio: remove IOC_LOV_GETINFO
12639 LU-5823 clio: get rid of lov_stripe_md reference
13426 LU-5814 obd: remove unused LSM parameters
13722 LU-5814 lov: remove LSM from struct lustre_md
12442 LU-5814 lov: remove LL_IOC_RECREATE_{FID,OBJ}
13696 LU-5814 lov: move LSM to LOV layer
12446 LU-5814 echo: remove userspace LSM handling
13737 LU-5814 obd: rename obd_unpackmd() to md_unpackmd()
12445 LU-5814 lov: remove unused {get,set}_info handlers
12447 LU-5814 echo: replace lov_stripe_md with lov_oinfo
12618 LU-5814 llite: remove ll_objects_destroy()
12581 LU-5814 lov: flatten struct lov_stripe_md
13426 LU-5814 obd: remove unused LSM parameters
13680 LU-5814 lov: add cl_object_layout_get()
13690 LU-5814 llite: replace lli_has_smd with lli_layout_type
13694 LU-5814 llite: add cl_object_maxbytes()
13695 LU-5814 lov: use obd_get_info() to get def/max LOV EA sizes
12592 LU-5971 llite: merge lclient.h into llite/vvp_internal.h
13075 LU-5971 llite: rename ccc_device to vvp_device
13077 LU-5971 llite: rename ccc_object to vvp_object
13086 LU-5971 llite: rename ccc_page to vvp_page
13088 LU-5971 llite: rename ccc_lock to vvp_lock
13351 LU-5971 llite: merge ccc_io and vvp_io
13347 LU-5971 llite: remove struct ll_ra_read
13363 LU-5971 llite: use vui prefix for struct vvp_io members
13376 LU-5971 llite: move vvp_io functions to vvp_io.c
13377 LU-5971 llite: rename ccc_req to vvp_req
13714 LU-5971 llite: rename struct ccc_grouplock to ll_grouplock
13074 LU-6028 Move definition of LDLM_GID_ANY to lustre_dlm.h
13137 LU-6046 audit comments in cl_object.h

Autotest results

The complete series of patches is recorded at http://review.whamcloud.com/13737/ and below. This series (patch set 3) successfully passed Autotest on March 27th. This result is recorded here:

NOTE: since completing these tests, many unrelated patches have landed on master that have obligated a re-base of this patch series.

48hr SWL run on Hyperion

SWL completed a 48 hour run on Hyperion on March 12 with no observed issues. A summary of the test is below:

Summary
=======

Start Time: Thu Mar 12 05:59:15 PDT 2015
Job Totals
  Passed:       14346
  Failed:       0
  Terminated:   64
  Unknown:      0
  Total:        14410
  Failure Rate: 0.00%
Run Times
  Wall Clock Run Time:       2253.22 hrs.
  Node Run Time:             14018.81 node-hrs.
  SWL Node Utilization:      145.18%
  SLURM Node Utilization:    n/a
Excessive Run-time Variation Job Count: 0
Overall Job Coverage: 21.7% (138/636)
Passed Job Coverage: 21.5% (137/636)
End Time: Sat Mar 14 06:16:02 PDT 2015
SWL Run Time: 173807 sec. (48.28) hrs.)

                              Failure Mode Summary
 Mode   Count                       Description
====== ======= ==========================================================
 129   3431    TBD

                               Failure Mode Breakdown
  Test    Mode   Count                       Description
======== ====== ======= ==========================================================
IO        129   3431    TBD

Report generated on Sat Mar 14 06:50:43 PDT 2015

NOTE: SWL runs continuously. This test run was ended after 48 hours. Jobs that were running when the test run was completed are recorded as terminated in this summary.
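
As a quick consistency check on the summary above, the following sketch recomputes the derived figures from the raw counts. How SWL itself derives these values is not documented here, so the formulas (failure rate as failed jobs over total jobs, coverage as a simple ratio) are assumptions.

```python
# Figures copied from the SWL summary above.
passed, failed, terminated, unknown, total = 14346, 0, 64, 0, 14410
run_time_sec = 173807
jobs_covered, jobs_passed_covered, jobs_available = 138, 137, 636

# The job categories should add up to the reported total.
assert passed + failed + terminated + unknown == total

# Assumption: terminated jobs do not count as failures.
failure_rate = 100.0 * failed / total
run_time_hrs = run_time_sec / 3600
overall_coverage = 100.0 * jobs_covered / jobs_available
passed_coverage = 100.0 * jobs_passed_covered / jobs_available

print(f"Failure rate:         {failure_rate:.2f}%")
print(f"SWL run time:         {run_time_hrs:.2f} hrs")
print(f"Overall job coverage: {overall_coverage:.1f}%")
print(f"Passed job coverage:  {passed_coverage:.1f}%")
```

Under these assumptions the recomputed values match the reported failure rate, run time, and coverage figures.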

Performance tests

This series of tests is designed to verify that the CLIO Simplification project has not negatively affected the performance of the Lustre filesystem. The tests were executed on Hyperion, using 16 threads for single-client tests and 1600 IOR threads for 100-client tests.

The baseline for performance was selected as Lustre 2.6.0. The build with CLIO patches applied was created from http://review.hpdd.intel.com/13318/ (since merged into http://review.hpdd.intel.com/13714 for landing).

Five consecutive tests were run for each metric, and the mean of the five runs was computed. This mean is used to calculate the percentage difference against the baseline. The complete result set is recorded in Appendix A. Observed CLIO Simplification performance that is more than 5% slower than the baseline is presented in red.

Guidelines for the reader:

  • CLIO Simplification patches have been landed into Master over the last 9 months. During this time, 988 patches have landed which may or may not be responsible for changes in performance.
  • Variability in performance is commonly observed during tests on Hyperion. It is not unusual to see a 10% variation between consecutive runs of the same code. The figure below illustrates variability in performance over 15 recent consecutive tags. NOTE: two runs of 2.6.90 on different dates (2.6.90.1 and 2.6.90.2 in the figure below) show significant differences in read performance for otherwise identical Lustre versions.

Figure 1 (Cliofig1.png): Performance of 15 consecutive tags, including the 2.6.0 and 2.7.0 releases as well as more recent master tags. Read and write bandwidth of a single shared file with 100 clients is recorded. Significant variability can be observed between any two consecutive tags.
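
The comparison method used in this section (mean of five runs, expressed relative to the baseline, with a 5% tolerance) can be sketched as follows. The exact formula used to produce the reported percentages is not stated, so treating each result as "candidate mean as a percentage of baseline mean" is an assumption. The function names and the sample bandwidths are hypothetical and are not measured data.

```python
from statistics import mean

TOLERANCE_PCT = 5.0  # degradation beyond this counts as a failure


def percent_of_baseline(baseline_runs, candidate_runs):
    """Mean candidate result as a percentage of the mean baseline result."""
    return 100.0 * mean(candidate_runs) / mean(baseline_runs)


def within_tolerance(pct, tolerance=TOLERANCE_PCT):
    """True unless the candidate is more than `tolerance` percent slower."""
    return pct >= 100.0 - tolerance


# Hypothetical bandwidths in MB/s, five runs each; NOT measured data.
baseline = [1000.0, 1020.0, 980.0, 1010.0, 990.0]
candidate = [960.0, 975.0, 955.0, 970.0, 965.0]

pct = percent_of_baseline(baseline, candidate)
print(f"{pct:.1f}% of baseline, within tolerance: {within_tolerance(pct)}")
```

With this convention, values above 100% indicate that the CLIO Simplification build was faster than the baseline, and values below 95% fall outside tolerance.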

IOR, 100 Clients, Single Shared File (SSF)


                               2.6.0 vs CLIO   2.6.53.1 vs CLIO
Read performance difference    101%            97%
Write performance difference   90%             100%

OBSERVATIONS: Figure 1 shows that the 2.6.0 baseline (far left) has write performance above the typical performance of the 15 recent tags. Because this baseline value is unusually high, more typical observations appear slow by comparison. Choosing a more typical value for 100-client SSF read and write (i.e. tag 2.6.53.1) provides a better baseline and shows performance within tolerance.

IOR, 100 Clients, File Per Process (FPP)

Read performance difference: 96%
Write performance difference: 102%

OBSERVATIONS: Both observations show CLIO Simplification performance within tolerance.

IOR, Single Client, File Per Process (FPP)


Hyperion OpenSFS test 1 OpenSFS test 2
Read performance difference 104% 104%
Write performance difference 92% 150% 97%

OBSERVATIONS: Write performance was below tolerance during our run on Hyperion on this test. This apparent slow-down was not reproducible over three re-runs on the OpenSFS cluster where between 97% and 150% write performance was observed.

IOR, Single Client, Single Shared File (SSF)

Read performance difference: 101%
Write performance difference: 388%

OBSERVATIONS: Observation of the 2.6.0 baseline showed write performance on Hyperion at 340 MB/s. The CLIO Simplification client performed at 1400 MB/s. Task LU-1669 (out of scope for this contract) is thought to be primarily responsible for the large improvement in write performance between 2.6.0 and the CLIO Simplification build, which also includes accumulated unrelated patches.

mdtest, Single Client

Best FPP tree filestat: 115%
Best SSF tree dircreate: 103%
Worst FPP tree treecreate: 97%
Worst SSF tree rm: 97%

OBSERVATIONS: All 32 observations show CLIO Simplification performance within tolerance. Only the best and the worst are included here.

mdtest, 100 Clients

              Hyperion run 1                         Hyperion run 2
              File per process   Single shared file  File per process   Single shared file
Dir create    108%               104%                88%                98%
Dir stat      96%                100%                100%               100%
Dir rm        110%               98%                 95%                96%
File create   181%               235%                201%               168%
File stat     96%                99%                 98%                100%
File rm       108%               104%                104%               102%
Tree create   100%               86%                 100%               100%
Tree rm       114%               104%                106%               97%

Change requests

004-001 CLIO ioctl's should be functions

CHANGE REQUEST: 004-001 CLIO ioctl's should be functions.

BACKGROUND: Ioctl calls were included in the CLIO Simplification design to replace some obsolete OBD operations. Alternatively, individual functions can replace the OBD operations instead.

CHANGE: Do not implement ioctl calls. Implement functions.

ACTIONS REQUIRED:

  • Ensure none of the current patches land.
  • Update the design document with the new design.
  • Update the ticket LU-5823 with new activity.
  • Execute work to complete LU-5823.


STATUS: Approved on 4th Dec 2014

004-002 Omit CLIO Demonstration milestone and associated milestone payment

CHANGE REQUEST: 004-002 Omit CLIO Demonstration milestone and associated milestone payment

BACKGROUND: The Demonstration milestone has been rendered redundant by a precisely specified 'Test and Fix' milestone that will execute before Demonstration. In the current plan, 'Test and Fix' (10 weeks) includes specific tests to run (including 48hr SWL, and performance characterization). There was agreement that a useful Demonstration Milestone is exactly defined by 'Test and Fix' Milestone. The plan requires specification of Demonstration during the Implementation phase and no additional work beyond 'Test and Fix' has been identified.

CHANGE: Omit Demonstration Milestone from the plan of record.

ACTIONS REQUIRED:

  • Move directly from 'Test and Fix' milestone to 'Landing' milestone.
  • Communicate new plan to stakeholders.

STATUS: Submitted for review 2nd March 2015

Lessons Learned

Variability should be communicated from the raw results through to the validation metrics.
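
One possible way to carry variability through to the validation metric is to report each percentage together with an uncertainty derived from the spread of the individual runs. The sketch below is an illustration only, not the method used in this report. It assumes the baseline and candidate runs are independent and uses first-order error propagation on the standard error of each mean. The function names and sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(runs):
    """Return (mean, standard error of the mean) for a list of results."""
    return mean(runs), stdev(runs) / sqrt(len(runs))


def ratio_with_uncertainty(baseline_runs, candidate_runs):
    """Candidate/baseline as a percentage, with a propagated uncertainty.

    Assumes the two sets of runs are independent and uses first-order
    error propagation for a quotient.
    """
    b, b_err = summarize(baseline_runs)
    c, c_err = summarize(candidate_runs)
    pct = 100.0 * c / b
    pct_err = pct * sqrt((c_err / c) ** 2 + (b_err / b) ** 2)
    return pct, pct_err


# Hypothetical bandwidths in MB/s, five runs each; NOT measured data.
baseline = [1000.0, 1100.0, 900.0, 1050.0, 950.0]
candidate = [960.0, 1040.0, 880.0, 1000.0, 920.0]

pct, pct_err = ratio_with_uncertainty(baseline, candidate)
print(f"{pct:.1f}% +/- {pct_err:.1f}% of baseline")
```

Reporting the uncertainty alongside each percentage makes it visible when a result that appears to breach the 5% threshold is within run-to-run noise.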