DNE RemoteDirectories Final wiki version

From OpenSFS

Executive Summary

This document finalizes the activities undertaken during DNE Sub Project 2.1: Remote Directories, within the OpenSFS Lustre Development contract SFS-DEV-001 signed 7/30/2011. Notable milestones during the project include:

  • Linear scaling demonstrated.
  • Scale testing completed on Hyperion.
  • All assets from the project have been attached to the public ticket LU-50.

Statement of Work

DNE1: Remote Directories distributes the Lustre namespace over multiple metadata targets (MDTs) under administrative control using a Lustre-specific mkdir command. Whereas normal users are only able to create child directories and files on the same MDT as the parent directory, administrators can use this command to create a directory on a different MDT. The contents of any directory remain limited to a single MDT. Rename and hardlink operations between files and directories on different MDTs return EXDEV, forcing applications and utilities to treat them as if they are on different file systems. This limits the complexity of the implementation of this subproject while delivering capacity and performance scaling benefits for the entire namespace in aggregate.

Metadata update operations that span multiple MDTs are sequenced and synchronized in two ways. The link count on an MDT object is created and/or incremented before the object is referenced by the remote directory entry, and the remote directory entry is updated before the link count is decremented and/or the MDT object it referenced is destroyed. Although this may result in an orphan MDT object under some failure conditions, it ensures that the Lustre namespace remains intact under any and all failure scenarios. All the other metadata operations avoid synchronous I/O and execute with full performance.

This project includes the implementation of OST FIDs (File Identifiers). These are required to overcome a limitation in the current 2.x Lustre protocol that would otherwise prevent a single file system from having more than 8 MDTs. Addressing this technical debt in the first subproject of DNE avoids protocol compatibility issues that would arise if this feature were implemented after Remote Directories were used in production.
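The ordering rule for cross-MDT updates described in this section can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `MDT` class, the `remote_mkdir` and `remote_rmdir` helpers, the `crash_after_step` parameter, and the checker functions are all hypothetical names invented for illustration and do not correspond to Lustre code. The model shows why an interrupted operation can leave an orphan object but never a directory entry that points at a missing object.

```python
class Crash(Exception):
    """Simulated failure between the two steps of a cross-MDT operation."""


class MDT:
    """Toy metadata target (hypothetical, not a Lustre structure)."""

    def __init__(self, index):
        self.index = index
        self.nlink = {}    # object id -> link count
        self.entries = {}  # (parent id, name) -> (mdt index, object id)


def remote_mkdir(parent_mdt, child_mdt, parent_id, name, child_id,
                 crash_after_step=None):
    # Step 1: create the object (link count 1) on the remote MDT first.
    child_mdt.nlink[child_id] = 1
    if crash_after_step == 1:
        raise Crash()
    # Step 2: only then insert the directory entry that references it.
    parent_mdt.entries[(parent_id, name)] = (child_mdt.index, child_id)


def remote_rmdir(parent_mdt, child_mdt, parent_id, name,
                 crash_after_step=None):
    _, child_id = parent_mdt.entries[(parent_id, name)]
    # Step 1: remove the directory entry first.
    del parent_mdt.entries[(parent_id, name)]
    if crash_after_step == 1:
        raise Crash()
    # Step 2: only then drop the link count and destroy the object.
    child_mdt.nlink[child_id] -= 1
    if child_mdt.nlink[child_id] == 0:
        del child_mdt.nlink[child_id]


def dangling_entries(mdts):
    """Entries that reference an object which does not exist."""
    by_index = {m.index: m for m in mdts}
    return [key for m in mdts
            for key, (idx, oid) in m.entries.items()
            if oid not in by_index[idx].nlink]


def orphans(mdts):
    """Objects that exist but are not referenced by any entry."""
    referenced = {ref for m in mdts for ref in m.entries.values()}
    return [(m.index, oid) for m in mdts for oid in m.nlink
            if (m.index, oid) not in referenced]
```

In this model, a crash after step 1 of either helper leaves `dangling_entries` empty while `orphans` may report the leftover object, which mirrors the trade-off stated above.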

Summary of Scope

In Scope

  • Interoperability of 1.8/2.1 clients with WC-Lustre 2.x FID-enabled OSTs.
  • Multiple MDTs running on the same MDS node.
  • Failover/failback of MDT to active backup MDS node.
  • Administrative documentation in the form of man pages for new user tools and an update to the Lustre 2.x manual.

Out of Scope

  • Accessibility of remote DNE directories with 1.8 or 2.1 clients.
  • Interoperability of DNE-enabled MDTs and non-FID-based 1.8/2.1 OSTs.
  • Rename and hard-link operations across directories on different MDTs (these return -EXDEV).
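Because cross-MDT renames return EXDEV, applications have to handle them the same way they handle a rename between two file systems. The following is a minimal sketch of that pattern for regular files. The function name `move_file` and its injectable `rename` parameter are assumptions made for illustration; only the standard library calls are real.

```python
import errno
import os
import shutil


def move_file(src, dst, rename=os.rename):
    """Move a regular file, falling back to copy-and-delete on EXDEV.

    `rename` is injectable only so the fallback can be exercised
    without a real multi-MDT file system (an illustrative assumption).
    """
    try:
        rename(src, dst)
        return "renamed"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise  # any other error is a genuine failure
    # EXDEV: source and destination cannot be linked by a plain rename,
    # so copy the data and metadata, then remove the source.
    shutil.copy2(src, dst)
    os.unlink(src)
    return "copied"
```

Python's `shutil.move` already applies the same copy-then-remove fallback when `os.rename` fails, so existing tools built on it should not need changes.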

The complete scope statement is available at [1]

Summary of Solution Architecture

DNE1: Remote Directories introduces a useful minimum of distributed metadata functionality. The purpose is primarily to ensure that efforts concentrate on clean code restructuring for DNE. The phase focuses on extensive testing to shake out bugs and oversights not only in the implementation but also in administrative procedures. DNE brings new usage patterns that must necessarily adapt to manage multiple metadata servers.

The Lustre namespace will be distributed by allowing directory entries to reference sub-directories on different metadata targets (MDTs). Individual directories will remain bound to a single MDT; therefore, metadata throughput on single directories will stay limited by single MDS performance, but metadata throughput aggregated over distributed directories will scale. The creation of non-local subdirectories will initially be restricted to administrators with a Lustre-specific mkdir command. This will ensure that administrators retain control over namespace distribution to guarantee performance isolation.
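The placement rule described above can be sketched as a toy namespace that records which MDT each directory is bound to. Everything here is hypothetical: the `Namespace` class, the `mdt_index` and `is_admin` parameters, and the choice to place the root on MDT 0 are assumptions for illustration, not the interface of the Lustre-specific mkdir command.

```python
class Namespace:
    """Toy model: every directory is bound to exactly one MDT."""

    def __init__(self):
        # Assumption for this sketch: the root directory lives on MDT 0.
        self.mdt_of = {"/": 0}

    def mkdir(self, path, mdt_index=None, is_admin=False):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.mdt_of:
            raise FileNotFoundError(parent)
        if path in self.mdt_of:
            raise FileExistsError(path)
        parent_mdt = self.mdt_of[parent]
        if mdt_index is None:
            # Ordinary mkdir: the child stays on the parent's MDT.
            mdt_index = parent_mdt
        elif mdt_index != parent_mdt and not is_admin:
            # Only administrators may create a non-local subdirectory.
            raise PermissionError("remote directory creation is restricted")
        self.mdt_of[path] = mdt_index
        return mdt_index


ns = Namespace()
ns.mkdir("/projA", mdt_index=1, is_admin=True)  # placed on MDT 1 by an admin
ns.mkdir("/projA/run1")                         # inherits MDT 1 from its parent
```

In this model, work under `/projA` is served by MDT 1 and does not compete with directories left on MDT 0, which is the performance isolation the text refers to.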

Acceptance Criteria

DNE1: Remote Directories will be deemed complete when:

  1. All unit/integration tests pass.
  2. All performance scaling tests pass.
  3. The filesystem is still available when any MDT other than MDT0 is down or disabled.

The complete Solution Architecture is available at: [2]

Summary of High Level Design

The DNE1: Remote Directories stack is built from the following layers:

  • Metadata Target (MDT): unpacks requests and handles ldlm locks.
  • Object Update Target (OUT): sits in the same layer as the MDT and handles updates between MDTs. An update is a low-level modification to a storage object that applies directly to OSDs, for example, increasing a link count. An operation is a complete and consistent modification to the file system that applies to the MDD layer, for example, rename or unlink.
  • Metadata Device (MDD): decomposes the received request (operation) into object updates.
  • Logical Object Device (LOD): a mapping layer that manages the communication between targets: MDT to OST, and MDT to MDT (OUT).
  • Object Storage Device (OSD): local object storage layer that handles local object updates.
  • Object Synchronous Proxy (OSP): a synchronous proxy for the remote OSD, used to communicate with another MDT/OST.
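The distinction between an operation and its object updates, and the split between local and remote handling, can be sketched as follows. This is an illustration only: the `Update` record, the `decompose_remote_unlink` and `route` functions, and the action names are hypothetical and do not reflect the actual Lustre interfaces. The update order follows the sequencing rule given in the Statement of Work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """One low-level object update (hypothetical record for this sketch)."""
    target_mdt: int
    action: str
    obj: str


def decompose_remote_unlink(parent_mdt, child_mdt, parent, name, child):
    """MDD-like step: turn one unlink operation into ordered updates."""
    return [
        # The directory entry is removed first...
        Update(parent_mdt, "delete_entry", f"{parent}/{name}"),
        # ...and only then is the remote object's link count dropped
        # and the object destroyed.
        Update(child_mdt, "decrement_nlink", child),
        Update(child_mdt, "destroy", child),
    ]


def route(updates, local_mdt):
    """LOD-like step: local updates go to the OSD, remote ones to an OSP."""
    return [("OSD" if u.target_mdt == local_mdt else "OSP", u)
            for u in updates]


updates = decompose_remote_unlink(0, 1, "parent_dir", "sub", "child_obj")
for layer, update in route(updates, local_mdt=0):
    print(layer, update.action, update.obj)
```

Here the entry removal is handled locally while the two updates on the child object are forwarded through the proxy, matching the roles of the OSD and OSP in the list above.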

The complete High Level Design is available at: [3]

Summary of Demonstration

One unexpected result is visible: with a stripe count higher than 0, only a small performance increase is observed in Opencreate. After additional work, the cause was judged to be a performance issue in the MDD->LOV->OSC->OST path. Further investigation on this topic will be conducted during the SMP Node Affinity project.

The complete Demonstration Milestone is available at: [4]

Delivery

Complete code is available at: [5]

The commit at which the code completed Milestone review by Senior and Principal Engineer is available at: http://git.whamcloud.com/?p=fs%2Flustre-release.git;a=commit;h=19223651ed250966c0445c91dc91a5b9131dec35