== Introduction ==

The following scope statement applies to the SMP Node Affinity project within the SFS-DEV-001 contract/SOW, dated 08/01/2011.

== Problem Statement ==

Followers of CPU design are observing steady increases in core count. As core count increases, thread resource contention (particularly locking) threatens to throttle Lustre server performance.

Multi-core CPUs are now common. Core count is expected to continue climbing, and as it does, a Lustre server will need to increase its thread count to exploit the CPU capacity. In Lustre today, additional threads contend for a single wait-queue, share one request queue, and are serialized on only a couple of global locks. Hence, without additional work, Lustre servers are expected to perform badly on multi-core CPUs for reasons including:

* Overhead of lock contention.
* Overhead of process switching between cores.
* The scheduler attempting to balance threads across cores.

These factors can be addressed by splitting the computing cores into configurable Compute Partitions. Lustre RPC service threads will then be bound to a specific Compute Partition (see the sketch after this list). This design:

* Reduces the overhead of threads switching cores by keeping each thread running on the same cores as its cached data.
* Avoids needless contention on the inter-CPU memory subsystem.
* Keeps RPC request processing local to the resources it affects.
* Allows the protocol stack to scale effectively as the number of cores increases.

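To illustrate the threading model, the following is a minimal user-space sketch, assuming Linux and pthreads; the <code>compute_partition</code> structure and <code>bind_to_partition</code> helper are hypothetical names for illustration, not Lustre's actual libcfs API. It shows the core idea: restricting a service thread's runnable set to the cores of one partition.

<syntaxhighlight lang="c">
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Hypothetical compute partition: a fixed set of core IDs. */
struct compute_partition {
	const int *cores;   /* core IDs owned by this partition */
	int        ncores;  /* number of cores in the partition */
};

/*
 * Pin the calling service thread to its partition's cores so the
 * scheduler keeps the thread (and its cache footprint) local.
 */
static int bind_to_partition(const struct compute_partition *cpt)
{
	cpu_set_t mask;
	int i;

	CPU_ZERO(&mask);
	for (i = 0; i < cpt->ncores; i++)
		CPU_SET(cpt->cores[i], &mask);

	return pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
}
</syntaxhighlight>

The kernel implementation achieves the same effect with in-kernel scheduler affinity primitives; the point is simply that each service thread runs only on its partition's cores, so its working set stays in that partition's caches.
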
== Project Goals ==

SMP node affinity is concerned with improving the vertical scalability of a Lustre server by addressing software bottlenecks on multi-core machines. This will enable Lustre to fully exploit increasingly powerful server hardware as it becomes available.

SMP node affinity will implement the following features:

* General libcfs APIs to provide a framework to support Compute Partitions
* Fine-grained SMP locking for Lustre
* SMP node affinity threading mode for Lustre

Benchmarks for metadata performance will be made on a cluster with a reasonable number of clients and a large number (12+) of cores.

The new features will demonstrate that the performance of a single metadata server with a large number (12+) of cores is improved for file operations: create/remove/stat.

Code will land on the WC-Lustre master branch.

== In-Scope ==

* High-level design document for SMP node affinity of Lustre.
* Improve the small-message rate of LNET on a multi-core (12+) server.
* Improve the small-RPC rate of ptlrpc on a multi-core (12+) server.
* Improve general metadata performance on a multi-core (12+) server for narrow stripe-count files.
* OST stack performance will not exhibit regression.
* SMP node affinity development will take place against the WC-Lustre 2.x baseline. Code will cover:
** General SMP improvements for Lustre and LNET.
** libcfs APIs to support CPU partitions and NUMA allocators (see the allocator sketch after this list).
** SMP node affinity threading mode for Lustre and LNET:
*** LND threads
*** ptlrpc service threads
*** ptlrpc reply handling threads
* Update the manual to include SMP node affinity tuning parameters.

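To make the NUMA-allocator item concrete, here is a minimal user-space sketch using libnuma, assuming each compute partition maps onto a single NUMA node. The <code>cpt_alloc</code>/<code>cpt_free</code> names are hypothetical and merely stand in for the kernel-side allocators the libcfs APIs would provide.

<syntaxhighlight lang="c">
#include <numa.h>
#include <stddef.h>

/*
 * Allocate memory on the NUMA node backing a given partition, so that
 * request buffers stay local to the service threads that touch them.
 */
static void *cpt_alloc(int numa_node, size_t size)
{
	if (numa_available() < 0)
		return NULL;    /* this host has no NUMA support */
	return numa_alloc_onnode(size, numa_node);
}

static void cpt_free(void *ptr, size_t size)
{
	numa_free(ptr, size);
}
</syntaxhighlight>

(Compile with <code>-lnuma</code>.) Node-local allocation matters because a thread pinned to one partition pays a latency and interconnect-traffic penalty for every access to memory homed on a remote NUMA node.
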
== Out of Scope ==

* Wide stripe-count files: these involve additional factors that would not be helpful for demonstrating the performance improvements.

== Project Constraints ==

* Liang is the only engineer with the expertise to lead this work.
* A test cluster with suitable multi-core (12+) nodes must be available to WC engineers regardless of nationality.

== Key Deliverables ==

* Signed-off Milestone document for the project phases:
** Solution Architecture
** High-level Design
* Test plan
* Source code against WC-Lustre 2.x that implements the feature requirements and runs on the test cluster.
* Code landed on the WC-Lustre 2.x master branch as issue [[LU-56]].

=== Glossary ===

* CPU partition<br> A CPU partition is a subset of the processing resources of a system: it can be a single core, any specified number of cores, all cores in a NUMA node, or all CPUs of the system. The number of CPU partitions can be set by libcfs APIs.<br> A fat server will be divided into several compute partitions. Each compute partition contains the cores in its CPU partition, a memory pool, a message queue, and a thread pool. It is a little like a virtual machine, although much simpler, since the Lustre RPC processing threads will still have access to all Lustre global state if needed.
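
Putting the glossary definition in code form, a partition's per-partition state might look like the sketch below. The field names are hypothetical illustrations of the definition above, not Lustre's actual libcfs structures.

<syntaxhighlight lang="c">
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

struct request;                      /* a queued RPC request (opaque here) */

/*
 * Hypothetical per-partition state mirroring the glossary definition:
 * cores, a memory pool, a request queue, and a thread pool.
 */
struct compute_partition_state {
	cpu_set_t       cores;       /* cores owned by this partition */
	void           *mem_pool;    /* partition-local memory pool   */
	struct request *req_queue;   /* per-partition request queue   */
	pthread_t      *threads;     /* service threads bound here    */
	int             nthreads;    /* size of the thread pool       */
};
</syntaxhighlight>

Because each partition's threads consume only their own queue and pool, cross-partition lock traffic is limited to the cases where a request genuinely needs shared global state.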