MDS SMP Node Affinity Demonstration

Overview

This document describes the work required to demonstrate that the SMP Node Affinity code meets the agreed acceptance criteria. The SMP Node Affinity code is functionally complete. The purpose of this final demonstration phase is to show that the code provides an enhancement to Lustre when used in a production-like environment.

The purpose of the demonstration is to show that the sub-project functions as intended. This shall be done by executing test cases designed to prove the Acceptance Criteria defined in the Solution Architecture.

Acceptance Criteria

SMP Node Affinity will be accepted as properly functioning if:

  • Metadata performance (creation/removal/stat) improves under these conditions:
    • A reasonable number of clients (16+)
    • A fat MDS (8+ cores)
    • Narrow stripe-count files only (stripe counts 0, 1, 2, 4)
  • No regression in single-client metadata performance
  • No functionality regression

Baseline

The baseline measurement for this Demonstration will be Lustre 2.2 GA. The SMP Node Affinity patches are included in Lustre Master, so the Lustre Master version will be compared against Lustre 2.2 GA for the purposes of demonstrating acceptance.

Hardware

MDS node:

  • 2 x Intel Xeon X5650 2.67GHz six-core processors (2 hyper-threads per core), presenting as 24 CPUs to Linux
  • 24GB DDR3 1333MHz memory
  • Dual-port 40Gb/s 4X QSFP InfiniBand adapter card (Mellanox MT26428)
  • 1 QDR IB port on motherboard
  • SSD as external journal device (Intel SSDSA2CW120G3); SATA II enterprise hard drive as MDT (single disk, WDC WD2502ABYS-02B7A0)

OSS:

  • 2 x AMD Opteron 6128 2.0GHz eight-core processors
  • 16GB DDR3 1333MHz memory
  • Dual-port 40Gb/s 4X QSFP InfiniBand adapter card (Mellanox MT26428)
  • 1 QDR IB port on motherboard
  • 3 x 1TB SATA II enterprise hard drives (single disk, WDC WD1003FBYX-01Y7B0)

Client:

  • Quad-core Intel Xeon E5507 (2.26GHz/4MB/800)
  • Mellanox ConnectX 6 QDR InfiniBand 40Gbps controller (MTS3600Q-1BNC)
  • 12GB DDR3 1333MHz E/R memory

Network:

  • InfiniBand between all nodes: Mellanox MTS3600Q managed QDR switch with 36 ports.

Test Methodology

The following tools will be used to provide a production-like load and to measure MDS performance: LNet selftest and mdtest.

1. LNet selftest

Note: in order to generate a suitable load, high concurrency is required on the client side. In this environment, high concurrency means more than one thousand concurrent RPCs issued from 16 clients.
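
As an illustration of how such a load could be generated, the sketch below drives an LNet selftest ping session from 16 clients with 64 concurrent requests each (roughly 1024 RPCs in flight). The NIDs, group names, and 60-second sampling window are assumptions for illustration, not the actual demonstration scripts.

    # Minimal LNet selftest sketch (assumed NIDs and group names): generate a
    # high-concurrency ping load against the MDS from 16 clients.
    modprobe lnet_selftest
    export LST_SESSION=$$                          # all lst commands share this session

    lst new_session smp_ping
    lst add_group server 192.168.1.1@o2ib          # MDS NID (assumed)
    lst add_group clients 192.168.1.[11-26]@o2ib   # 16 client NIDs (assumed)

    lst add_batch ping_batch
    # 64 concurrent requests per client x 16 clients ~= 1024 RPCs in flight
    lst add_test --batch ping_batch --concurrency 64 --from clients --to server ping
    lst run ping_batch

    lst stat server clients &                      # report RPC rates once per second
    sleep 60; kill $!                              # sample for 60 seconds (assumed window)
    lst stop ping_batch
    lst end_session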

2. mdtest

  • Use multiple mounts on each client:
    • Each thread works under a private mount.
    • A sufficient workload for the "shared directory" case is achievable because the target directory has a separate inode under each client mount, so client-side operations are not serialized.
    • A high working load could also be generated by turning off mdc_rpc_lock on all clients, but that only helps the directory-per-thread case; for the shared-directory case, operations would still be serialized by the VFS mutex on the client side. Multiple mounts are therefore chosen because they perform well in both cases.
  • Verify the server working load by checking the total number of service threads on the MDS (see the sketch after this list):
    • Service threads are created on demand; if there are enough active RPCs on the server side, the number of service threads should reach its upper limit.
    • If there are more active RPCs than the upper limit of service threads, they do not generate extra workload for the filesystem or the ptlrpc service threads; each thread handles one RPC at a time, so the additional RPCs are simply queued on the service.
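
A hypothetical illustration of both points is sketched below: 32 private mounts per client, followed by a check that the MDS service thread count has reached its limit. The fsname, NIDs, mount paths, and the mds.MDS.mdt.* parameter names are assumptions based on standard Lustre 2.x tunables.

    # On each client: 32 private mounts of the same file system, so every
    # mdtest task gets its own client-side inode for the target directory
    # (MGS NID and fsname are assumed).
    for i in $(seq 0 31); do
        mkdir -p /mnt/testfs_$i
        mount -t lustre mds01@o2ib:/testfs /mnt/testfs_$i
    done

    # On the MDS: verify the working load is high enough that the number of
    # started service threads has reached the configured upper limit.
    lctl get_param mds.MDS.mdt.threads_max mds.MDS.mdt.threads_started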

Test File System Configuration

  • 16 clients, each running 32+ threads.
  • 1 MDS, 2+ OSS (6 OSTs on each OSS).
  • Each test is repeated three times; the median and maximum are recorded.
  • Each test is run with both Lustre 2.2 and Lustre Master (a pre-run sanity check is sketched below).
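
Before each run, this configuration could be sanity-checked with a few standard commands; the mount point below is an assumption.

    lctl get_param version     # confirm which build is under test (2.2 GA vs. Master)
    lfs df -h /mnt/testfs_0    # expect 1 MDT and 12 OSTs (2 OSS x 6 OSTs each)
    lfs check servers          # confirm all client connections to MDS/OSS are healthy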

Test Results

LNet selftest results to be collected (a sweep script is sketched after this list):

  • Aggregate selftest ping rate on the server
    • Clients [1, 2, 4, 8, 16] x threads [32, 64]
  • Single-client selftest ping rate
    • Clients [1] x threads [1, 2, 4, 8, 16, 32, 64]
  • Aggregate bandwidth of selftest 4K BRW read & write on the server
    • Read: clients [1, 2, 4, 8, 16] x [32, 64] threads
    • Write: clients [1, 2, 4, 8, 16] x [32, 64] threads
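
The client-count x thread-count sweeps above could be scripted around LNet selftest as in this sketch for the 4K BRW read case; NIDs, group names, and the sampling window are assumptions, and the write case would substitute "brw write".

    # Sweep clients [1, 2, 4, 8, 16] x concurrency [32, 64] for 4K BRW reads
    # (assumes the lnet_selftest module is loaded, as in the earlier sketch).
    export LST_SESSION=$$
    ALL_CLIENTS=$(for n in $(seq 11 26); do printf '192.168.1.%d@o2ib ' "$n"; done)

    for nclients in 1 2 4 8 16; do
        for threads in 32 64; do
            lst new_session brw_${nclients}c_${threads}t
            lst add_group server 192.168.1.1@o2ib
            lst add_group clients $(echo $ALL_CLIENTS | cut -d' ' -f1-$nclients)
            lst add_batch brw_batch
            lst add_test --batch brw_batch --concurrency $threads \
                --from clients --to server brw read size=4K
            lst run brw_batch
            lst stat server &                # aggregate read bandwidth on the server
            sleep 60; kill $!
            lst stop brw_batch
            lst end_session
        done
    done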

mdtest results to be collected (an example invocation is sketched after this list):

  • All of the following tests run for mknod, 1-stripe, 2-stripe, and 4-stripe files.
  • Total file count:
    • 256K files.
    • 1 million files.
  • 32 mounts on each client:
    • Clients [1, 2, 4, 8, 16] x 32 threads performing create/stat/unlink.
    • Directory per thread.
    • Shared directory for all threads.
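
One hypothetical way to drive the shared-directory case with mdtest is sketched below for 16 clients x 32 threads and 1 million files. The mount paths, host file, and directory names are assumptions, as is the use of mdtest's '@'-separated multi-directory option to spread tasks across the per-client mounts; the mknod (0-stripe) case is not shown.

    # Default stripe count for files created in the shared target directory;
    # repeat the run with -c 2 and -c 4 for the other stripe settings.
    mkdir -p /mnt/testfs_0/shared
    lfs setstripe -c 1 /mnt/testfs_0/shared

    # One path per client mount, all resolving to the same server-side
    # directory; mdtest assigns tasks to the '@'-separated paths round-robin.
    DIRS=$(for i in $(seq 0 31); do printf '/mnt/testfs_%d/shared@' "$i"; done)
    DIRS=${DIRS%@}

    # 16 clients x 32 tasks = 512 tasks; 1,048,576 files / 512 = 2048 per task.
    mpirun -np 512 --hostfile clients.txt mdtest -F -n 2048 -i 3 -d "$DIRS"

    # Directory-per-thread variant: add -u so each task works in a private
    # working directory instead of the shared one.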

Test Duration

LNet selftest

  • 2 hours to set up the environment and prepare scripts
  • 2 releases, 2 hours per release
  • TOTAL: 2 + 2 x 2 = 6 hours

mdtest

  • 4 hours to set up the environment and prepare scripts
  • 2 releases (v2_2, master)
  • 2 total file counts (256K, 1 million)
  • 4 stripe settings (0, 1, 2, 4)
  • 2 directory modes (dir-per-thread, shared dir) x 6 client counts (1, 2, 4, 8, 12, 16) = 12 combinations
  • 3 repetitions
  • assume each round takes 4 minutes
  • TOTAL: 4 x 60 + (2 releases x 2 file counts x 4 stripe settings x 12 combinations x 3 repetitions x 4 minutes) = 240 + 2304 = 2544 minutes = 42.4 hours

Estimated total time

  • 6 + 42.4 = 48.4 hours

With these estimates, a minimum of 4 working days is required. One additional day should be included as a risk reserve, and a further 1-2 days are needed to summarize the data.