Introduction

With the release of DNE Phase I Remote Directories, Lustre* file systems now support more than one MDT in a single filesystem. This feature has some limitations:

Due to the synchronous nature of DNE Phase I remote operations, by default they are configured so only an administrator can create or unlink a remote directory. Create and unlink are the only 'cross-MDT' operations to be allowed in Phase I. All other cross-MDT operations (link, rename) will return EXDEV.
Cross MDT operations must be synchronous and metadata performance may be impacted especially when DNE is based on ZFS.
Moving files or directories between MDTs can only be achieved using file copy and unlink commands. As a result, data objects on OST will be moved resulting in redundant data transfer operations.
All name entries of a directory must reside on one MDT, so single-directory performance is the same as on a single-MDT file system.

DNE Phase II will resolve these issues. This document is divided into two sections. The first section, 'Asynchronous Cross-MDT Operation Design', addresses the first three issues above. The second section, 'Striped Directories Design', describes a design that resolves the remaining issue. [AZ: sounds more like Phase III now?]

Asynchronous Cross-MDT Operation Design

Of the first three limitations enumerated in the introduction, the most important issue is how to implement asynchronous cross-MDT updates and their recovery. The design assumes that, in rare cases, the file system may become inconsistent after recovery, and that both servers and clients must be able to detect such inconsistencies and return a proper error code to the user. In the meantime, LFSCK will be able to detect such inconsistencies and attempt to resolve them. This design document assumes knowledge of the DNE phase II Solution Architecture, DNE Phase 1 Solution Architecture and DNE Phase 1 High Level Design.

Definitions

Operation and Update

An operation is one complete metadata operation, e.g. open/create of a file, mkdir, or rename, that leaves the namespace in a consistent state. A request from a client usually includes only one metadata operation. The MDT decomposes the operation into several updates; for example, mkdir is decomposed into name entry insertion, parent link count increment, and object creation.

Master and slave (remote) MDT

In DNE, a client typically sends a metadata request to one MDT, called the master MDT for this request. The master MDT then decomposes the operation into updates and distributes these updates to other MDTs, which are called slave or remote MDTs for this request.

Functional Statements

In DNE Phase II:

All metadata operations are allowed to be cross-MDT, which will be asynchronous if they are not dependent on other uncommitted updates. If the cross-MDT operation is dependent on an uncommitted update from a different client, it will use the existing Commit-on-Share (COS) implementation to ensure that the prerequisite operation is committed first.
Normal users (without administrator privilege) can perform cross-MDT operations. [AZ: probably we still want some limitations here, imagine a regular user doing a rename from one directory into another one, causing lots of syncs and as a result very slow processing on MDT?][AED: except users can already do sync() repeatedly today and slow down the filesystem, so I don't see that this is worse.]
A migration tool will be provided to move individual inodes from one MDT to another MDT, without introducing redundant data object transfers on OSTs.

Implementation

In DNE Phase I, the master MDT collects the updates in the transaction declare phase, and then sends these updates to other MDTs during transaction start. For local transactions the declare phase is only for reserving resources, like journal credits, etc. To unify the transaction phase for local and remote operation, DNE Phase II will collect remote updates in the execution phase, i.e. between transaction start and stop, then distribute updates at transaction stop. The process is:

1. The client sends the request to the master MDT.
2. The master MDT enqueues all LDLM locks, gets back the object attributes, caches those attributes, and performs sanity checks.
3. The master MDT creates and starts the transaction, and decomposes the operation into updates in the MDD layer, which might include both local and remote updates, and records all of these updates in the prepared object_update_request buffer.
4. The master MDT executes local updates during the transaction execution phase.
5. In transaction stop, the master MDT first generates an additional update to the master's last_rcvd file containing the lsd_client_data (master transno, client XID, object pre_versions) for that client's operation, and adds this to the update request buffer. If recovery of the master MDT is later needed, it is able to update the last_rcvd slot for the client by replaying the update log from a slave MDT, even if the client is not available.
6. The master MDT distributes the updates and the update buffer prepared in step 3 to all of the remote MDTs.
7. The slave MDT(s) execute their local updates asynchronously, write all of the updates into their local log, and then reply to the master MDT with their local transno.
8. After the master MDT gets all replies from the slave MDT(s), it releases the LDLM locks and replies to the client with the master transno generated in step 5, and the client adds the request to its replay list.
9. After the operation and its update log are committed to disk on the master MDT, it piggybacks the last_committed_transno to the client in RPC replies or pings, and the client removes the request from its replay list, in the same way as for a normal replay request.
10. When the updates are committed on the slave MDT(s), they notify the master MDT using the normal last_committed_transno in RPC replies or pings.
11. After the master MDT sees that all of the remote updates are committed on the slave MDT(s), it first cancels its local update log record.
12. When the local update record cancellation has committed, the master MDT sends requests to the remote MDT(s) to cancel their corresponding update request log records. The remote MDT(s) use ou_master_index + ou_batchid to identify and cancel the update records belonging to that operation. These records may be kept in memory (e.g. in a hash table) for easier location during normal operation. In case of an MDT crash, the update llog recovery will load the uncancelled update records into memory again for processing.
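
The ordering of log cancellation in the last steps above can be illustrated with a small state tracker. This is only a sketch under stated assumptions: struct op_track, its fields, the MAX_SLAVES limit, and the function names are hypothetical and do not come from the Lustre source. It shows that the master's record may be cancelled only after every slave has committed, and that slave records may be cancelled only after the master's cancellation has itself committed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SLAVES 8    /* illustrative limit, not part of the design */

/* Hypothetical per-operation state kept on the master MDT. */
struct op_track {
        uint64_t batchid;                     /* master transno (ou_batchid) */
        int      slave_count;                 /* number of slave MDTs involved */
        uint64_t slave_transno[MAX_SLAVES];   /* transno from each slave reply */
        bool     slave_committed[MAX_SLAVES]; /* slave has committed its updates */
        bool     master_cancelled;            /* local log record cancelled */
        bool     master_cancel_committed;     /* that cancellation is on disk */
};

/* A slave reported its last committed transno in a reply or ping. */
static void op_slave_committed(struct op_track *op, int slave,
                               uint64_t last_committed)
{
        assert(slave >= 0 && slave < op->slave_count);
        if (last_committed >= op->slave_transno[slave])
                op->slave_committed[slave] = true;
}

/* The master may cancel its own record once all slaves have committed. */
static bool op_can_cancel_master(const struct op_track *op)
{
        for (int i = 0; i < op->slave_count; i++)
                if (!op->slave_committed[i])
                        return false;
        return true;
}

/* Slave records may go only after the master's cancellation is committed. */
static bool op_can_cancel_slaves(const struct op_track *op)
{
        return op->master_cancelled && op->master_cancel_committed;
}
```

Cancelling in this order means that a crash at any point leaves at least one MDT holding the update records until the operation is known to be committed everywhere.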

[AZ: the description above mixes the number of things, IMHO. probably, it'd be better to have a separate document describing async updates with no MDT/LDLM mentioned, just a mechanism accepting set of updates and ensuring they are applied atomically.]

[Figure: commit_flow.png]

Note: LDLM lock is not needed during recovery.

If the replay request comes from the client, the master MDT will re-enqueue the lock for the replay request.
The failover MDT will not accept new requests from clients during recovery. Commit on Share (COS) will be applied to all cross-MDT operations, which ensures that all conflicting updates have been committed to disk, so cross-MDT replay updates should not conflict, i.e. LDLM locks are not needed for replay between MDTs.

Update request format

As described earlier, one metadata operation will be decomposed into several updates. These updates will be distributed to all other MDTs by an update RPC. In DNE Phase II each RPC only includes the updates for a single operation. The format of the update RPC is:

/* Update request HEADER */
struct object_update_request {
        __u32                      ourq_magic;      /* UPDATE_REQUEST_MAGIC_V2 */
        __u16                      ourq_count;      /* number of object_update records in request */
        __u16                      ourq_padding;    /* currently unused */
        struct object_update       ourq_updates[0]; /* ourq_count object_update records follow */
};
 
/* Object update */
struct object_update {          
        __u16           ou_type;                /* enum update_type */
        __u16           ou_params_count;        /* update parameters count */
        __u32           ou_master_index;        /* master MDT/OST index */
        __u32           ou_flags;               /* enum update_rec_flags */
        __u32           ou_padding1;            /* padding 1 */
        __u64           ou_batchid;             /* op transno on master */
        struct lu_fid   ou_fid;                 /* object to be updated */
        struct object_update_param ou_params[0]; /* update params */
};

enum update_rec_flags {
        OUT_UPDATE_FL_OST       = 0x00000001, /* update master is OST */
        OUT_UPDATE_FL_SYNC      = 0x00000002, /* commit update before reply */
        OUT_UPDATE_FL_COMMITTED = 0x00000004, /* ou_batchid is committed globally */
        OUT_UPDATE_FL_NOLOG     = 0x00000008, /* idempotent update does not need to be logged */
};
 
/* Parameters of each update */ 
struct object_update_param {
        __u16   oup_len;          /* length of this parameter in bytes */
        __u16   oup_padding;      /* currently unused */
        __u32   oup_padding2;     /* currently unused */
        char    oup_buf[0];       /* update-specific parameter data, oup_len bytes */
};
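
To make the variable-length layout concrete, the following sketch computes the packed size of one object_update from the lengths of its parameters. It is illustrative only: the sk_* names are hypothetical mirrors of the structures above using fixed-width C types, and the rounding of each parameter buffer to an 8-byte boundary is an assumption that the format description above does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the fixed part of struct object_update_param. */
struct sk_param_hdr {
        uint16_t oup_len;
        uint16_t oup_padding;
        uint32_t oup_padding2;
};      /* 8 bytes on common ABIs, followed by oup_len bytes of data */

/* Hypothetical mirror of struct lu_fid. */
struct sk_fid {
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

/* Hypothetical mirror of the fixed part of struct object_update. */
struct sk_update_hdr {
        uint16_t      ou_type;
        uint16_t      ou_params_count;
        uint32_t      ou_master_index;
        uint32_t      ou_flags;
        uint32_t      ou_padding1;
        uint64_t      ou_batchid;
        struct sk_fid ou_fid;
};      /* 40 bytes on common ABIs, followed by the parameters */

/* Round up to a multiple of 8 (assumed alignment of packed parameters). */
static size_t sk_round8(size_t n)
{
        return (n + 7) & ~(size_t)7;
}

/* Packed size of one parameter: fixed header plus padded data. */
static size_t sk_param_size(uint16_t len)
{
        return sizeof(struct sk_param_hdr) + sk_round8(len);
}

/* Packed size of one update carrying 'count' parameters. */
static size_t sk_update_size(const uint16_t *param_lens, uint16_t count)
{
        size_t size = sizeof(struct sk_update_hdr);

        assert(param_lens != NULL || count == 0);
        for (uint16_t i = 0; i < count; i++)
                size += sk_param_size(param_lens[i]);
        return size;
}
```

A sender could sum sk_update_size() over all updates of an operation to check whether the request still fits in one RPC before packing it.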

Update reply format

After OUT handles these updates, the result of each update will be packed into the reply buffer. The format is:

/* Updates result HEADER */
struct object_update_reply {
        __u32   ourp_magic;         /* magic of the reply */
        __u16   ourp_count;         /* number of the update reply */
        __u16   ourp_padding;       /* unused for now */
        __u16   ourp_lens[0];       /* length of each update reply */
};
 
/* The result of each object update */
struct object_update_result {
        __u32   our_rc;             /* The return result of this update. */
        __u16   our_datalen;        /* length of our_data */
        __u16   our_padding;        /* unused for now */
        __u32   our_data[0];        /* holding the reply of the update, for example attributes of the object for ATTR_GET update */ 
};
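
As an illustration of how the ourp_lens array can be used, the sketch below computes the byte offset of the n-th object_update_result inside a reply buffer. The rp_* names are hypothetical. The sketch assumes that the results are packed back to back after the length array and that both the header area and each result are padded to 8 bytes, which the format description above does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the fixed part of struct object_update_reply. */
struct rp_reply_hdr {
        uint32_t ourp_magic;
        uint16_t ourp_count;
        uint16_t ourp_padding;
        /* uint16_t ourp_lens[ourp_count] follows in the buffer */
};

/* Round up to a multiple of 8 (assumed alignment). */
static size_t rp_round8(size_t n)
{
        return (n + 7) & ~(size_t)7;
}

/*
 * Return the offset of result 'index' from the start of the reply,
 * or (size_t)-1 if 'index' is out of range.
 */
static size_t rp_result_offset(const struct rp_reply_hdr *hdr,
                               const uint16_t *lens, uint16_t index)
{
        size_t off;

        assert(hdr != NULL && lens != NULL);
        if (index >= hdr->ourp_count)
                return (size_t)-1;

        /* Skip the header and the array of per-result lengths. */
        off = rp_round8(sizeof(*hdr) + hdr->ourp_count * sizeof(uint16_t));

        /* Skip the results that precede the requested one. */
        for (uint16_t i = 0; i < index; i++)
                off += rp_round8(lens[i]);
        return off;
}
```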

Update buffer for creating striped directory

AZ: Should this section be updated to reflect changes in the design?

According to the process described above, during striped directory creation, sub-stripes will be created on all of the MDTs involved. If all of these updates are stored in the prepared update buffer, it might become too large to fit in a single update RPC (maximum size 1 MB), so the update buffer will be compressed in this case. The stripe creation updates are all very similar; the only difference is the stripe_index. The update buffer will therefore include:

Updates for creating master object on the master MDT.
Updates for setting stripe EA on the master object.
Updates for inserting nth name entry for the nth sub-stripe. (Compressed update)
Updates for updating the last_rcvd on the master MDT (see Failover section).
Updates for setting stripe EA on the nth stripe. (Compressed update)
Updates for creating nth sub-stripe.

[Figure: Slide2.png]

Open/create regular remote files

When creating regular remote files,

Client allocates the FID and sends create request to the master MDT where the file is located.
Master MDT creates the object by the FID.
Slave MDT inserts the name entry into its parent.

Note: open(O_CREAT) is not allowed to create remote regular files in this phase, so if the user wants to create and open a remote regular file on a remote MDT, the file should first be created, then opened in a separate system call.

Unlink remote files/directories

In common with DNE Phase I, DNE Phase II clients will send the unlink request to the MDT where the inode is located:

The client sends the unlink request to the master MDT.
The master MDT enqueues the LDLM locks of the remote parent and the file.
The master MDT then decreases the nlink of the inode. If the nlink reaches zero:
1. if the file is currently open, the file is moved to the ORPHAN directory.
2. if the file is not open, the file is destroyed.
The remote MDT removes the name entry from the parent.

Rename remote files/directories

In contrast to other cross-MDT metadata operations, rename between multiple MDTs involves four objects, which might be on different MDTs. This adds additional complexity. Additional care must be taken before renaming directories: the relationship between the source and the target directory must be checked to avoid moving the parent into the subdirectory of its child. The checking process for directory renames is protected by a global rename lock to ensure the parent/child relationship will not change during the check, in addition to locks on each of the directories being renamed. If the source and target parent directories are the same directory, then the global rename lock can be dropped since it is not possible to change the parent/child relationship. The global lock will be held on the MDT0 namespace.

For a rename of dir_S/src to dir_T/tgt, assume that MDT1 holds dir_S, MDT2 holds src, MDT3 holds dir_T, and MDT4 holds tgt.

The client sends rename RPC to MDT4 if the tgt object exists, since it will need to unlink the target inode and may have to handle the open-unlinked file, otherwise to MDT2 where the src object exists (though this is not a hard requirement). This is the master MDT.
If the client sends the RPC to an MDT that looks up the tgt name under DLM lock and finds that the tgt object exists on a remote MDT, the MDT will return -EREMOTE and the client must resend the RPC to the MDT with the tgt object.
If src is a directory, the master MDT acquires the global rename lock. The master MDT then gets the LDLM lock of dir_S and dir_T according to their parent and child relationship, then gets the LDLM lock of their child name hashes.
If tgt is a directory that is not empty then the rename fails with -ENOTEMPTY.
If src is a directory, the master MDT checks the relationship between src and dir_T. If src is an ancestor of dir_T, the rename fails with -EINVAL.
MDT1 deletes the entry src and sets the ctime/mtime of dir_S.
If src is a directory, MDT2 deletes the old dir_S ".." entry and inserts the new dir_T ".." entry, sets the ctime/mtime of src, decrements its nlink count, and also updates the linkEA of src.
The master MDT removes the old tgt entry if it exists, and inserts a new tgt entry with the src object FID, and decrements the nlink count of dir_T if this is a directory. The link count of dir_T does not change.
If src is a directory, the master MDT releases the global rename lock.
If the tgt object exists, MDT4 destroys the tgt object.
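
The parent/child relationship check that guards directory renames can be sketched as an ancestor walk. This is a simplified, single-node illustration with hypothetical names: the namespace is modelled as an array of parent indices, whereas the real check would follow parent FIDs across MDTs while holding the global rename lock. It rejects a rename that would move a directory into its own subtree.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * parent[i] is the index of the parent of node i; the root is its own
 * parent. Returns true if 'anc' is 'node' itself or one of its ancestors.
 */
static bool is_ancestor(const int *parent, int nnodes, int anc, int node)
{
        assert(parent != NULL);
        assert(anc >= 0 && anc < nnodes && node >= 0 && node < nnodes);

        /* A well-formed tree needs at most nnodes steps to reach the root. */
        for (int steps = 0; steps < nnodes; steps++) {
                if (node == anc)
                        return true;
                if (parent[node] == node)       /* reached the root */
                        return false;
                node = parent[node];
        }
        return false;   /* malformed (cyclic) input */
}

/* Returns 0 if moving directory 'src' under 'dir_t' is allowed. */
static int rename_check(const int *parent, int nnodes, int src, int dir_t)
{
        return is_ancestor(parent, nnodes, src, dir_t) ? -EINVAL : 0;
}
```

The walk is only meaningful while the tree cannot change underneath it, which is why the design serializes directory renames with the global rename lock for the duration of the check.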

[AZ: Frankly, I'm not sure why do we need this description here. this seem to be a normal rename process, nothing really special to DNE?]

All of these updates will be stored in the update logs on every MDT. If any MDT fails and restarts, it will notify the other MDTs to send all of these updates to the failover MDT, which will then redo the updates for the failed MDT. This is discussed in detail in the Failover section.

Migration

In DNE Phase II, the migration tool (lfs mv file -i target_MDT) will be provided to help users to move individual inodes from one MDT to another MDT, without copying data on the OSTs.

Migrating regular files will be performed as follows:

For lfs mv -i MDT3 file1, MDT1 holds the name entry of file1, MDT2 holds file1 inode, and MDT3 will be the target MDT where the file will be migrated

The client sends the migrate request to MDT1.
MDT1 checks whether the file is currently opened, and returns -EBUSY if it is opened by another process.
MDT1 acquires LAYOUT, UPDATE, and LOOKUP locks on the file, and it will also lock the parent of the file. Note: if there are multiple links of this file, it needs to lock all of its parents. [Di: But how to order these ldlm locks?]
MDT3 creates a new file with the same layout, and updates linkEA.
MDT1 updates the entry with new FID.
MDT2 destroys the old object, but if there are multiple links for the old object, it also needs to walk through all of the name-entries and update the FID in all these name entries.
MDT1 releases LAYOUT, UPDATE, and LOOKUP locks.
Client clears the inode cache of file1 so that the client is not caching the old layout.

Migrating directories is more complicated:

For lfs mv -i MDT3 dir1, MDT1 holds the name entry of dir1, MDT2 holds the dir1 inode and all of its directory entries, and MDT3 is the target MDT to which the directory will be migrated.

The client sends the migrate request to MDT1, where the name entry of the directory is located.
MDT1 checks whether the directory is currently opened, and returns -EBUSY if it is opened by another process.
MDT1 acquires LAYOUT, UPDATE, and LOOKUP locks on the directory.
MDT3 creates the new directory. Note: if dir1 is a non-empty directory, MDT1 needs to iterate over all of the entries of the directory and send them to MDT3, which will insert all of the entries into the new directory; the linkEA of each child also needs to be updated.
MDT2 destroys the old directory.
MDT1 updates the entry with the new FID.
MDT1 releases the LAYOUT, UPDATE, and LOOKUP locks.

When an entire directory tree is being migrated from one MDT to a second MDT, individual files and directories will be migrated from top to bottom, i.e. the parent will be migrated to the new MDT first, then its children. In this way, if another process creates files or directories during the migration:

If the parent of the creation has been moved to the new MDT, the file/directory will be created on the new MDT.
If the parent of the creation has not been moved to the new MDT yet, the newly created file/directory will be moved to the new MDT later in the migration.

This design ensures that all directories will be migrated to the new MDT in all cases.

After a directory has been migrated to the new MDT, the directory on the old MDT becomes an orphan, i.e. it cannot be accessed from the namespace. The orphan cannot be destroyed until all of its children have been moved to the new MDT. In this way, migrating a directory does not need to update the parent FID in the linkEA of all of the children, since each child can still find its parent on the old MDT using fid2path during migration.

Failover

In DNE Phase I all cross-MDT requests are synchronous and there are no replay requests between MDTs. This design simplifies recovery between MDTs in DNE Phase I. With DNE Phase II, all cross-MDT operations are asynchronous and there will be replay requests between MDTs. This makes recovery more complex than for DNE Phase I.

As described earlier, all of the updates of a cross-MDT operation will be recorded on every MDT involved in that operation. During recovery the updates will be sent to the failed MDT, which is the new master MDT for these updates, and are replayed there. In addition to the updates of the operation itself, the update that modifies last_rcvd on the master MDT will also be added to the update log so that it can be replayed even if the client fails. A new index update method (index_update) is used for this. The record will include:

Master MDT index, used to identify the master MDT during recovery.
local FID { FID_SEQ_LOCAL_FILE, LAST_RECV_OID, 0 }, to represent the last_rcvd file.
The lsd_client_data structure: the client UUID is used as the index key, and the rest of the body is the value.

During the update, the master MDT will first locate the last_rcvd file by FID, then locate the record in the file by client UUID, and then update the whole body of lsd_client_data.

Recovery in DNE Phase II will be driven by the master MDT, and can be divided into three steps:

When an MDT restarts after a crash, it will read all of the updates with the same ou_master_index from the local update llog, and also read the updates related to itself from all other MDTs.
1. The failover MDT will then compare the local update records with the records from the remote MDTs. To identify the update records that belong to the same operation, a new unique identifier (Distribution ID, DID) will be created on the master MDT for each operation and recorded in all of the update records of the operation, so the recovery process can find all of the records of an operation by checking the DID and master_index. The DID should be unique across restarts, so the last committed transno on the master MDT will be used as the starting number, and it will be incremented in memory for each operation.
  1. If the update records exist on both the local (master) and the remote MDTs, the operation has been committed on all MDTs, so these update records are simply cancelled.
  2. If the update records exist only on the master and not on the remote MDT, these updates have only been executed on the master MDT, so they need to be resent to the remote MDTs and re-executed there.
  3. If the update records exist only on the remote MDT and not on the master, either the operation has not yet been executed on the new master MDT, or the operation has been executed but its update records were already cancelled. In this case the ou_batchid (transno of local updates, see the Implementation section above) in the update records must be checked.
    1. If it is larger than the last_committed transno of the master MDT, the operation has not been done on the master MDT, and these updates need to be re-executed.
    2. If it is smaller than the last_committed transno of the master MDT, the operation has been done on the master MDT, and the update records only need to be cancelled.
After the first step, the recovery thread will pick the replay requests or update records (created in step 1) according to their transno, and redo these requests or updates. Note: the replay requests and updates might be duplicates, but the recovery thread should be able to tell by the transno in the request and the update records. AZ: it would be good to describe this in details.
If there are any failures during the above two steps, the lfsck daemon will be triggered to fix the filesystem. (question)
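
The comparison rules in the first recovery step can be summarized as a small decision function. This is a sketch with hypothetical names (enum rec_action, recovery_action) rather than an implementation from the Lustre source. The text above does not say what happens when ou_batchid equals the master's last_committed transno; the sketch assumes that case counts as already committed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcomes of comparing master and remote update records. */
enum rec_action {
        REC_NOTHING,             /* no record anywhere, nothing to recover */
        REC_CANCEL,              /* operation complete, cancel the records */
        REC_RESEND_TO_REMOTE,    /* executed on master only, resend updates */
        REC_REEXECUTE_ON_MASTER, /* not executed on master, redo updates */
};

static enum rec_action recovery_action(bool on_master, bool on_remote,
                                       uint64_t batchid,
                                       uint64_t master_last_committed)
{
        /* Records on both sides: committed everywhere. */
        if (on_master && on_remote)
                return REC_CANCEL;

        /* Only the master has the records: the remote never committed. */
        if (on_master)
                return REC_RESEND_TO_REMOTE;

        /* Only the remote has the records: compare against last_committed. */
        if (on_remote) {
                assert(batchid != 0);   /* a real transno is never zero here */
                return batchid > master_last_committed ?
                       REC_REEXECUTE_ON_MASTER : REC_CANCEL;
        }

        return REC_NOTHING;
}
```

For example, a record found only on a remote MDT with ou_batchid 120, while the master's last_committed transno is 100, yields REC_REEXECUTE_ON_MASTER.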

Recovery of cross-MDT operations requires the participation of all involved MDTs. In case of multiple MDT failures, normal service therefore cannot resume until all failed MDTs fail over or reboot. The system administrator may disable a permanently failed MDT (by lctl deactivate) to allow recovery to complete on the remaining MDTs.

AZ: processing under the normal conditions is not described? how/when llogs are cancelled?

Failure cases

Failure: Both master and slave fail, and the updates have been committed on both MDTs.
Failover: The master replays the update to the slave. The slave determines whether the update has already been executed by checking its update llog, and generates the reply.

Failure: Both master and slave fail, and the updates have not yet been committed on either MDT.
Failover: The client does a normal replay (or a resend if it received no reply), and the master redoes the whole operation from scratch. If the client has also failed, nothing is left to be done.

Failure: Both master and slave fail, and the updates have been committed on the master MDT but not on the slave MDT.
Failover: The master resends the update to the slaves and the slaves redo the updates. The client may resend or replay, which is handled using the client's last_rcvd slot on the master.

Failure: The master is alive and the slave fails without committing the update.
Failover: The master replays (or resends, if it received no reply) the updates to the slave MDT, and the slave redoes the updates.

Failure: The master is alive and the slave fails having committed the update.
Failover: The master replays (or resends, if it received no reply), and the slave generates the reply from the master's last_rcvd slot based on the XID (== master transno, if resend).

Failure: The master fails without committing and the slave is alive.
Failover: The slave replays the updates to the master MDT when it restarts. The master checks whether the update has been executed by checking its local update llog, and redoes the update if it is not found. This also updates the client's last_rcvd slot from the update, so if the client replays or resends, the request can be handled normally.

Commit on Share

During recovery, if one update replay fails, all related updates may also fail in the subsequent replay process. For example, client1 creates a remote directory on MDT1 whose name entry is on MDT0, and other clients then create files under the remote directory on MDT1. Suppose MDT0 fails before the name entry insertion has been committed to disk. If the recovery then fails for some reason, i.e. the directory is not connected to the namespace at all, none of the files under this directory can be accessed. To avoid this, Commit on Share will be applied to cross-MDT operations, i.e. if the MDT finds that the object being updated was modified by a previous cross-MDT operation, that previous cross-MDT operation will be committed first. So in the previous example, the creation of the remote directory must be committed to disk before any files are created under it. AZ: I think this applies to any operations, imagine local mkdir /A and then following distributed mkdir /A/B. if /A is missing, we won't be able to recovery the 2nd mkdir?

Commit on Share (COS) will be implemented with COS locks, based on the current local COS implementation. During a cross-MDT operation, all locks on remote objects (remote locks) will be held on the master MDT, and all of the remote locks will be COS locks. If these COS locks are revoked, the master MDT will sync not only itself but also the remote MDTs.

For example, consider these two consecutive operations: 1. mv dir_S/s dir_T/t 2. touch dir_T/t/tmp (MDT1 holds dir_S, MDT2 holds s, MDT3 holds dir_T, MDT4 holds t)

The client sends the rename request to MDT1.
MDT1 detects the remote rename, takes LDLM COS locks on all four objects, and finishes the rename; the four LDLM locks remain cached on MDT1.
The client sends the open/create request to MDT2.
MDT2 enqueues the LDLM lock for s, and MDT1 revokes its lock on s. Because it is a COS lock, MDT1 syncs all of the MDTs involved in the previous rename, i.e. MDT1, MDT2, MDT3, and MDT4.

Compatibility

MDT-MDT

In DNE Phase I updates between MDTs are synchronous. In DNE Phase II updates are asynchronous. To avoid the complications introduced by multiple different MDT versions, DNE Phase II requires all MDTs to be of the same version, i.e. they must be upgraded at the same time. MDTs will check versions during connection setup and deny connect requests from old MDT versions.

MDT-OST

There are no protocol changes between MDT and OST in DNE Phase II.

CLIENT-MDT

In DNE Phase II rename requests will be sent to the MDT where the target file is located. This is different from DNE Phase I. An old client (<= Lustre software version 2.4.0) will still send the request to the MDT where the source parent is, and the source parent will return -EREMOTE to the old client. A 2.4.0 client does not understand -EREMOTE so a patch will be added to 2.4 series to redirect such rename requests to the MDT where the target file is, if it gets -EREMOTE from the MDT. Old MDTs that do not support DNE Phase II will return -EXDEV for cross-MDT rename, which will typically be handled by userspace tools by copying the files between the directories.

Striped Directories Design

Introduction

In DNE Phase I all of the name entries of a directory reside on a single MDT. As a result, single-directory performance is expected to be the same as on a single-MDT file system. In DNE Phase II, striped directories will be introduced to improve single-directory performance. This document discusses how striped directories will be implemented. It assumes knowledge of the DNE phase II async cross-MDT operation High Level Design and the DNE phase I Remote Directory High Level Design.

Functional Statement

Similar to file striping, a striped directory will split its name entries across multiple MDTs. Each MDT keeps the directory entries for a certain range of the hash space. For example, if there are N MDTs and the hash range is 0 to MAX_HASH, the first MDT will keep records with hashes [0, MAX_HASH/N - 1], the second one with hashes [MAX_HASH/N, 2 * MAX_HASH/N - 1], and so on. During file creation, LMV will calculate the hash value from the name, then create the file in the corresponding stripe on one MDT. The user will also be able to choose a different hash function to stripe the directory. A directory can only be striped during creation and cannot be re-striped after creation in DNE Phase II.

[Figure: stripe_hash.png]

[AED: this diagram is not correct. Stripe 0 does not hold [0, max_hash/4] and stripe 1 does not hold [max_hash/4, max_hash/2], etc. Instead, it would be better to show stripe 0 holding [hash % 4 == 0], and stripe 1 holding [hash % 4 == 1]. ]
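
The two hash-to-stripe mappings discussed here can be contrasted in a short sketch: the range-based split described in the Functional Statement, and the modulo mapping suggested in the reviewer comment. The function names and the max_hash parameter are hypothetical, and neither function is taken from the Lustre source. In the range-based variant the last stripe is assumed to absorb any remainder of the hash space.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Range-based mapping: stripe i holds hashes in
 * [i * (max_hash / n), (i + 1) * (max_hash / n) - 1].
 * Hashes beyond the last full range fall into the last stripe (assumption).
 */
static uint32_t stripe_by_range(uint64_t hash, uint64_t max_hash, uint32_t n)
{
        uint64_t width, idx;

        assert(n > 0 && max_hash >= n);
        width = max_hash / n;
        idx = hash / width;
        return idx >= n ? n - 1 : (uint32_t)idx;
}

/* Modulo mapping: stripe i holds every hash with hash % n == i. */
static uint32_t stripe_by_modulo(uint64_t hash, uint32_t n)
{
        assert(n > 0);
        return (uint32_t)(hash % n);
}
```

With four stripes and a hash space of 0 to 1000, stripe_by_range() places hash 250 in stripe 1, while stripe_by_modulo() places it in stripe 2. The modulo form spreads adjacent hash values across all stripes instead of filling one stripe's range at a time.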

Definition

The first stripe of each striped directory is called the master stripe, and it is usually on the same MDT as its parent. The other stripes are called remote stripes.

Logical Statement

Similar to a striped file, a client will get the directory layout information after lookup and then build the layout information for this directory in LMV. For any operation under the striped directory, the client will first calculate the hash value from the name, then find the stripe from the hash and the layout. Finally, the client will send the request to the MDT where the stripe is located. If a large number of threads access the striped directory simultaneously, the threads can go to different MDTs and their requests can be handled by each MDT concurrently and independently. This improves single-directory performance.

Directory Layout

The directory layout information will be stored in the EA of every stripe as follows:

struct lmv_mds_md {
        __u32 lmv_magic;                      /* stripe format version */
        __u32 lmv_count;                      /* stripe count */
        __u32 lmv_master;                     /* master MDT index */
        __u32 lmv_hash_type;                  /* dir stripe policy, i.e. which hash function is used */
        __u32 lmv_layout_version;             /* Used for directory restriping */
        __u32 lmv_padding1;
        __u32 lmv_padding2;
        __u32 lmv_padding3;
        char lmv_pool_name[LOV_MAXPOOLNAME];  /* pool name */
        struct lu_fid lmv_data[0];            /* FIDs for each stripe */
};

lmv_hash_type indicates which hash function the directory will use to split its name entries.

[AED: this part of the design needs to be updated to account for LU-5223]

Directory stripe lock

Currently all of the name entries of a directory are protected by the UPDATE lock of that directory. As a result, the client invalidates all entries in the directory when the UPDATE lock is revoked. In a striped directory each stripe has its own UPDATE lock, and if a thread modifies the striped directory, the MDT only needs to acquire the UPDATE lock of a single stripe. Consequently, the client will only invalidate the name entries of that stripe, instead of all of the entries of the directory. When deleting the striped directory, the MDT needs to acquire each of the stripe locks. When performing readdir of the striped directory, the client must acquire each stripe lock to cache the directory contents. Stripe locks do not need to be acquired simultaneously.

Create striped directory

Creating a striped directory is similar to creating a striped file:

The client allocates FIDs for all stripes and sends the create request to the master MDT.
The master MDT sends object create updates to each remote MDT to create the stripes.
For each remote stripe, the parent FID in LinkEA will be the Master stripe FID, which will also be put into the ".." directory of each remote stripe, i.e. the remote stripes will physically be remote subdirectories of the master stripe to satisfy lfsck. During readdir, LMV will ignore this subdirectory relationship, and recognize it as individual stripe of the directory (it will be collapsed by LMV on the client with the layout and skipped during readdir.) This design simplifies LFSCK consistency checking and reduces the number of objects modified during rename (for ".." and LinkEA).

Delete striped directory

The client sends the delete request to the master MDT, which then acquires all of the stripe locks of the directory. The master MDT checks that all of the stripes are empty and then destroys them.

Create/lookup files/directories under striped directory

When a file/directory is being created/looked up under a striped directory:

The client first calculates the hash from the name and the lmv_hash_type of the striped directory. Next, the client derives the MDT index from the hash and sends the create/lookup request to that MDT.
The MDT creates/looks up the file or directory independently. Note: when creating a new directory, the MDT only needs to modify the attributes of the local stripe (for example, increasing nlink and updating mtime), which avoids sending attrset updates between MDTs. It also means that when a client retrieves the attributes of a striped directory, it needs to walk through all of the stripes on the different MDTs and merge the attributes from each stripe.
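The name-to-stripe mapping described above can be sketched as follows. This is a minimal illustration, not the Lustre implementation: the hash used here (64-bit FNV-1a) is only a placeholder for whichever function lmv_hash_type selects, the modulo mapping is an assumption, and the names sketch_name_hash and sketch_name_to_stripe are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the hash selected by lmv_hash_type: 64-bit FNV-1a.
 * This is an assumption for illustration, not the hash Lustre uses. */
uint64_t sketch_name_hash(const char *name, size_t len)
{
	uint64_t h = 14695981039346656037ULL; /* FNV offset basis */

	for (size_t i = 0; i < len; i++) {
		h ^= (unsigned char)name[i];
		h *= 1099511628211ULL;            /* FNV prime */
	}
	return h;
}

/* Map a name to a stripe index in [0, stripe_count). The client would then
 * send the create/lookup request to the MDT holding that stripe. */
uint32_t sketch_name_to_stripe(const char *name, size_t len,
			       uint32_t stripe_count)
{
	assert(stripe_count > 0);
	return (uint32_t)(sketch_name_hash(name, len) % stripe_count);
}
```

Because the mapping depends only on the name and the directory layout, every client picks the same stripe for the same name without consulting the master MDT.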

Readdir of striped directory

During readdir() a client will iterate over all stripes and for each stripe it will get a stripe lock and then read directory entries. Each directory's hash range should be in the range [1, 2^63-1]. The readdir() operation will proceed in hash order concurrently among all of the stripes that make up the directory, and the client will perform a merge sort of the hashes of all the returned entries to a single stream, up to the lowest hash value at the end of the returned directory pages. This allows a single 64-bit cookie to represent the readdir offset within all of the stripes in the directory. There is no more chance of hash collision with the readdir cookie in a striped directory than there is with a single directory of equivalent size.
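The merge step can be sketched as a k-way merge over per-stripe cursors. This is a simplified illustration under stated assumptions: each stripe's entries are already fully fetched and sorted by hash, paging and the "lowest hash at the end of the returned pages" cutoff are ignored, and struct stripe_cursor and readdir_next are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cursor over the entry hashes returned by one stripe. */
struct stripe_cursor {
	const uint64_t *hashes; /* entry hashes, ascending */
	size_t count;           /* number of entries in hashes[] */
	size_t pos;             /* next entry to consume */
};

/* Return the next entry hash in global hash order across all stripes.
 * Returns 1 and stores the hash in *out, or 0 when every stripe is drained.
 * The value stored in *out doubles as the single readdir cookie. */
int readdir_next(struct stripe_cursor *stripes, size_t nstripes, uint64_t *out)
{
	size_t best = nstripes; /* nstripes means "no candidate yet" */

	assert(out != NULL);
	for (size_t i = 0; i < nstripes; i++) {
		if (stripes[i].pos >= stripes[i].count)
			continue; /* this stripe is exhausted */
		if (best == nstripes ||
		    stripes[i].hashes[stripes[i].pos] <
		    stripes[best].hashes[stripes[best].pos])
			best = i;
	}
	if (best == nstripes)
		return 0;
	*out = stripes[best].hashes[stripes[best].pos++];
	return 1;
}
```

A linear scan is enough for a handful of stripes; a heap would be the usual choice if the stripe count were large.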

Getattr of striped directory

The client iterates over all of the stripes, gets the attributes of each one, and then merges them together:

size/blocks/nlink: add all together from every stripe.
ctime/mtime/atime: choose the newest one as the xtime of the striped directory.
uid/gid: should be the same for all stripes.
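The merge rules above can be sketched as a small function. This is an illustration only: struct stripe_attr and merge_stripe_attrs are hypothetical names, the fields are a simplified subset of real inode attributes, and returning -1 on a uid/gid mismatch is an assumption about how an inconsistency might be reported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-stripe attributes. */
struct stripe_attr {
	uint64_t size, blocks;
	uint32_t nlink;
	int64_t atime, mtime, ctime;
	uint32_t uid, gid;
};

static int64_t newest(int64_t a, int64_t b)
{
	return a > b ? a : b;
}

/* Merge the attributes of n stripes into *out.
 * Returns 0 on success, -1 if the stripes disagree on uid/gid. */
int merge_stripe_attrs(const struct stripe_attr *s, size_t n,
		       struct stripe_attr *out)
{
	assert(n > 0);
	*out = s[0];
	for (size_t i = 1; i < n; i++) {
		if (s[i].uid != out->uid || s[i].gid != out->gid)
			return -1;
		/* size/blocks/nlink: summed over every stripe */
		out->size += s[i].size;
		out->blocks += s[i].blocks;
		out->nlink += s[i].nlink;
		/* xtime: the newest value wins */
		out->atime = newest(out->atime, s[i].atime);
		out->mtime = newest(out->mtime, s[i].mtime);
		out->ctime = newest(out->ctime, s[i].ctime);
	}
	return 0;
}
```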

Rename in the same striped directory

The client sends the rename request to the MDT where the master stripe of the source parent is located. If the rename stays within one stripe, it is handled the same way as a rename within a single directory. If the rename is under the same striped directory but between different stripes on different MDTs:

(mv dir_S/src dir_S/tgt, where dir_S is a striped directory, MDT0 holds the master stripe, MDT1 holds src, and MDT2 holds tgt)

The client sends the rename request to MDT2.
MDT2 acquires the LDLM locks (both inode bits and hash of the file name) of the source and target stripes according to their FID order.
MDT1 deletes entry src, sets the mtime of the stripe, updates the linkEA of src, and, if src is a directory, decreases the nlink of the local stripe.
MDT2 deletes entry tgt, inserts entry src, and, if tgt is a directory, increases the nlink of the local stripe.
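Taking locks "according to their FID order" gives every MDT the same acquisition order for a given pair of stripes, so two concurrent renames cannot deadlock by locking the pair in opposite orders. A minimal sketch follows; the struct is a simplified stand-in for a FID, the comparison order (sequence, then object id, then version) is an assumption, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a FID; field names are illustrative. */
struct sketch_fid {
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

/* Three-way comparison: negative, zero, or positive. */
int sketch_fid_cmp(const struct sketch_fid *a, const struct sketch_fid *b)
{
	if (a->seq != b->seq)
		return a->seq < b->seq ? -1 : 1;
	if (a->oid != b->oid)
		return a->oid < b->oid ? -1 : 1;
	if (a->ver != b->ver)
		return a->ver < b->ver ? -1 : 1;
	return 0;
}

/* Decide which of two stripes to lock first. */
void sketch_lock_order(const struct sketch_fid *a, const struct sketch_fid *b,
		       const struct sketch_fid **first,
		       const struct sketch_fid **second)
{
	assert(first != NULL && second != NULL);
	if (sketch_fid_cmp(a, b) <= 0) {
		*first = a;
		*second = b;
	} else {
		*first = b;
		*second = a;
	}
}
```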

Rename between different striped directories

Rename between different striped directories is a more complicated case, with potentially six MDTs involved in the process:

(mv dir_S/src dir_T/tgt, MDT1 holds the source stripe of dir_S where the name entry of src is located, MDT2 holds src object, MDT3 holds the target stripe of dir_T where the name entry of tgt is located, MDT4 holds tgt object)

The client sends the rename request to MDT4 if the tgt object exists, otherwise to MDT2 where the src object is located (though this is not a hard requirement). This MDT is the master MDT.
If the client sends the RPC to an MDT that looks up the tgt name under DLM lock and finds that the tgt object exists on a remote MDT, that MDT will return -EREMOTE and the client must resend the RPC to the MDT holding the tgt object.
If the renamed object is a directory, the master MDT acquires the global rename lock. The master MDT gets the LDLM locks of the dir_S and dir_T stripes according to their FID order, then gets the LDLM locks of their child name hashes.
If the renamed object is a directory, the master MDT checks the relationship between the dir_S and dir_T stripes. If dir_S is the parent of tgt, the rename is not allowed.
MDT1 deletes entry src and sets the ctime/mtime of dir_S.
If the renamed object is a directory, MDT2 deletes the old dir_S ".." entry and inserts the new dir_T ".." entry, sets the ctime/mtime of src, and also updates the linkEA of src.
The master MDT deletes the old entry tgt if it exists, inserts the new entry tgt with the src object FID, and also updates the link count of the local stripe if this is a directory.
If the renamed object is a directory, the master MDT releases the global rename lock.
If the tgt object exists, MDT4 destroys tgt.

If the object being renamed is itself a striped directory, only the master stripe will have its ".." and linkEA entry updated.
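The choice of master MDT and the -EREMOTE resend rule from the rename steps can be sketched as two small decision functions. Both function names and the integer MDT indices are hypothetical, and the sketch covers only the routing decision, not locking or the updates themselves.

```c
#include <assert.h>
#include <errno.h>

/* Client side: pick the MDT that should receive the rename RPC.
 * Prefer the MDT holding the tgt object when tgt exists, otherwise the MDT
 * holding the src object (the document notes this is not a hard requirement). */
int client_pick_rename_mdt(int tgt_exists, int src_obj_mdt, int tgt_obj_mdt)
{
	return tgt_exists ? tgt_obj_mdt : src_obj_mdt;
}

/* MDT side: after looking up tgt under DLM lock, decide whether this MDT
 * can act as master. Returns 0 to proceed, or -EREMOTE to make the client
 * resend the RPC to the MDT that holds the tgt object. */
int mdt_check_rename_target(int this_mdt, int tgt_exists, int tgt_obj_mdt)
{
	assert(this_mdt >= 0);
	if (tgt_exists && tgt_obj_mdt != this_mdt)
		return -EREMOTE;
	return 0;
}
```

The server-side check is what makes the client-side choice only a hint: a client with a stale view of tgt still ends up at the correct master after one resend.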

LinkEA

LinkEA is used by fid2path to build the path from an object FID. The LinkEA includes the parent FID and name. During fid2path an MDT looks up the object's parents to build the path until the root is reached. For a striped directory, the master stripe FID is stored in the linkEA of every other stripe, and the parent FID of the striped directory is stored in the master stripe. As a result, if the object is under a striped directory, the MDT gets the stripe object first, then locates the master stripe, and then continues the fid2path process. For clients that do not understand striped directories (if supported), this may appear as a "//" component in the generated pathname, which will fail safe.
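The parent walk can be sketched over a toy in-memory object table. This is an illustration under stated assumptions: objects are referenced by array index rather than by FID, each object has a single linkEA entry, and the is_slave_stripe flag (a hypothetical field) marks a stripe whose linkEA parent is the master stripe so that it contributes no path component.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy object record; indices stand in for FIDs. */
struct sketch_obj {
	int parent;          /* parent recorded in linkEA, -1 for the root */
	const char *name;    /* name recorded in linkEA */
	int is_slave_stripe; /* non-zero: linkEA parent is the master stripe */
};

/* Build the path of objs[idx] into buf by walking linkEA parents up to the
 * root. Returns 0 on success, -1 if the path does not fit in buf. */
int sketch_fid2path(const struct sketch_obj *objs, int idx,
		    char *buf, size_t buflen)
{
	char tmp[256];

	assert(buflen >= 2 && buflen <= sizeof(tmp));
	buf[0] = '\0';
	while (objs[idx].parent >= 0) {
		if (!objs[idx].is_slave_stripe) {
			/* Prepend "/<name>" to the path built so far. */
			int n = snprintf(tmp, sizeof(tmp), "/%s%s",
					 objs[idx].name, buf);

			if (n < 0 || (size_t)n >= buflen)
				return -1;
			memcpy(buf, tmp, (size_t)n + 1);
		}
		/* A slave stripe is collapsed: hop to the master stripe. */
		idx = objs[idx].parent;
	}
	if (buf[0] == '\0')
		strcpy(buf, "/");
	return 0;
}
```

With a table holding the root, a striped directory "d" (master stripe), one slave stripe of "d", and a file "f" in that slave stripe, the walk goes file, slave stripe, master stripe, root and yields "/d/f".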

Change log

An operation on a striped directory will be added to the change log in the same way as for a normal directory. The added operations include: create directory, unlink directory, create files under a striped directory, etc. Currently there are two users of the change log:

lustre_rsync may be enhanced to understand striped directories:
1. If the lustre_rsync target is a Lustre file system, it will try to recreate the striped directory with the original stripe count. If it succeeds, it will reproduce all operations under the striped directory.
2. If it cannot create the striped directory with that stripe count (for example, there are not enough MDTs on the target file system), or the lustre_rsync target is not a Lustre file system, it will create a normal directory, and all striped directory operations will be converted to normal directory operations.
3. Apart from the original directory creation, all of the lustre_rsync operations proceed as normal.
The striped directory implementation does not interact with HSM. This behaviour is consistent with DNE Phase I.

Recovery

Recovery of a striped directory will use the redo log as described in the DNE Phase II async cross-MDT operation High Level Design.

In case of on-disk corruption in a striped directory, the LFSCK Phase III MDT-MDT Consistency project will address the distributed verification and repair.

Protocol Compatibility

Old clients (Lustre software version 2.4 or earlier) do not understand striped directories, so -ENOSUPP will be returned when a client that does not support this feature tries to access the striped EA on a new MDT (Lustre software version 2.6 or later).

Disk Compatibility

The striped directory will be introduced in DNE Phase II as part of Lustre 2.6, so the incompatible LMAI_STRIPED flag is set in the LMA of the striped directory. If an MDT older than version 2.6 tries to access the striped directory, it will generate an -ENOSUPP error because this flag is not in its list of supported object features.
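The incompatible-feature check can be sketched as a bitmask test. This is an illustration only: the struct, the function name, and the bit value chosen for the striped flag are assumptions, and the standard errno EOPNOTSUPP is used here as a stand-in for the error code named in the text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit value for the striped-directory incompat flag. */
#define SKETCH_LMAI_STRIPED 0x08u

/* Hypothetical, reduced view of the LMA: only the incompat flags. */
struct sketch_lma {
	uint32_t incompat;
};

/* Refuse access when the object carries any incompat flag that this MDT
 * does not list as supported. Returns 0 if access may proceed. */
int sketch_lma_check(const struct sketch_lma *lma, uint32_t supported)
{
	assert(lma != NULL);
	if (lma->incompat & ~supported)
		return -EOPNOTSUPP; /* stand-in for the error named in the text */
	return 0;
}
```

An MDT that predates striped directories would not have the flag in its supported mask, so the check fails without that MDT needing any knowledge of the new layout.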

Other names and brands may be the property of others.
