DNE StripedDirectories HighLevelDesign wiki version
Introduction
With the release of DNE Phase I Remote Directories, Lustre* file systems now support more than one MDT in a single filesystem. This feature has some limitations:
- Due to the synchronous nature of DNE Phase I remote operations, by default they are configured so only an administrator can create or unlink a remote directory. Create and unlink are the only 'cross-MDT' operations to be allowed in Phase I. All other cross-MDT operations (link, rename) will return EXDEV.
- Cross-MDT operations must be synchronous, so metadata performance may be impacted, especially when DNE is based on ZFS.
- Moving files or directories between MDTs can only be achieved using file copy and unlink commands. As a result, data objects on OST will be moved resulting in redundant data transfer operations.
- All of the name entries of a directory must be on one MDT, so single-directory performance is the same as on a single-MDT filesystem.
DNE Phase II will resolve these issues. This document is divided into two sections. The first section 'Asynchronous Cross-MDT Operation Design' is concerned with resolution of the first three issues above. A separate section 'Striped Directories Design' describes a design to resolve the remaining issue. [AZ: sounds more like Phase III now? (smile) ]
Asynchronous Cross-MDT Operation Design
Of the first three limitations enumerated in the introduction, the most important is how to implement asynchronous cross-MDT updates and their recovery. An assumption is made that the file system may become inconsistent after recovery in some rare cases, and that both servers and clients should be able to detect such inconsistency and return a proper error code to the user. Meanwhile, LFSCK will be able to detect such inconsistencies and attempt to resolve them. This design document assumes knowledge of the DNE Phase II Solution Architecture, DNE Phase 1 Solution Architecture and DNE Phase 1 High Level Design.
Definitions
Operation and Update
Operation means one complete metadata operation, e.g. open/create of a file, mkdir or rename, that leaves the namespace in a consistent state. A request from a client usually includes only one metadata operation. The MDT will decompose the operation into several updates; for example, mkdir will be decomposed into name entry insertion, parent link count increment, and object create.
Master and slave (remote) MDT
In DNE, a client typically sends the metadata request to one MDT, called the master MDT for this request. The master MDT then decomposes the operation into updates and redistributes these updates to other MDTs, which are called slave or remote MDTs for this request.
Functional Statements
In DNE Phase II:
- All metadata operations are allowed to be cross-MDT, which will be asynchronous if they are not dependent on other uncommitted updates. If the cross-MDT operation is dependent on an uncommitted update from a different client, it will use the existing Commit-on-Share (COS) implementation to ensure that the prerequisite operation is committed first.
- Normal users (without administrator privilege) can perform cross-MDT operations. [AZ: probably we still want some limitations here, imagine a regular user doing a rename from one directory into another one, causing lots of syncs and as a result very slow processing on MDT?][AED: except users can already do sync() repeatedly today and slow down the filesystem, so I don't see that this is worse.]
- A migration tool will be provided to move individual inodes from one MDT to another MDT, without introducing redundant data object transfers on OSTs.
Implementation
In DNE Phase I, the master MDT collects the updates in the transaction declare phase, and then sends these updates to other MDTs during transaction start. For local transactions the declare phase is only for reserving resources, like journal credits, etc. To unify the transaction phase for local and remote operation, DNE Phase II will collect remote updates in the execution phase, i.e. between transaction start and stop, then distribute updates at transaction stop. The process is:
1. The client sends the request to the master MDT.
2. The master MDT enqueues all LDLM locks and gets back the object attributes, then caches those attributes and does sanity checks on the master MDT.
3. The master MDT creates and starts the transaction, and decomposes the operation into updates in the MDD layer, which might include both local and remote updates, and records all of these updates in the prepared object_update_request buffer.
4. The master MDT executes local updates during the transaction execution phase.
5. In transaction stop, the master MDT will first generate an additional update to the master's last_rcvd file containing the lsd_client_data (master transno, client XID, object pre_versions) for that client's operation, and adds this to the update request buffer. If recovery of the master MDT is later needed, it is able to update the last_rcvd slot for the client by replaying the update log from a slave MDT, even if the client is not available.
6. The master MDT distributes the updates and the update buffer prepared in step 3 to all of the remote MDTs.
7. The slave MDT(s) will execute their local updates asynchronously, and also write all of the updates into their local log, then reply with their local transno to the master MDT.
8. After the master MDT gets all replies from the slave MDT(s), it releases the LDLM locks and replies to the client with the master transno generated in step 5, and the client will add the request to its replay list.
9. After the operation and its update log are committed to disk on the master MDT, it will piggyback the last_committed_transno to the client in RPC replies or pings, and the client will remove the request from the replay list, in the same way as for a normal replay request.
10. When the updates are committed on the slave MDT(s), they will notify the master MDT using the normal last_committed_transno in RPC replies or pings.
11. After the master MDT sees that all of the remote updates are committed on the slave MDT(s), it will cancel its local update log record first.
12. When the local update record cancellation has committed, the master MDT sends requests to the remote MDT(s) to cancel their corresponding update request log records (identified by ou_master_index + ou_batchid). The remote MDT(s) will use the ou_master_index + ou_batchid to cancel their update records belonging to that operation. These may be kept in memory (e.g. hash table) for easier location during normal operation. In case of an MDT crash, the update llog recovery will load the uncancelled update records into memory again for processing.
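The in-memory lookup mentioned in the last step can be pictured with a small sketch. This is only an illustration under stated assumptions: the table size, the hash mixing constant, and the names upd_table, upd_table_add and upd_table_cancel are hypothetical and do not come from the Lustre code; the only thing taken from the design is that a record is identified by ou_master_index + ou_batchid.

```c
#include <stddef.h>
#include <stdint.h>

#define UPD_TABLE_SIZE 64       /* hypothetical fixed size, for illustration only */

/* One uncancelled update record, keyed by (ou_master_index, ou_batchid). */
struct upd_rec {
    uint32_t master_index;      /* ou_master_index of the operation */
    uint64_t batchid;           /* ou_batchid (operation transno on the master) */
    int      in_use;            /* non-zero while the record is not cancelled */
};

struct upd_table {
    struct upd_rec slots[UPD_TABLE_SIZE];
};

/* Hypothetical hash: mix both halves of the key into a home slot. */
static size_t upd_hash(uint32_t master_index, uint64_t batchid)
{
    uint64_t h = (batchid * 0x9E3779B97F4A7C15ULL) ^ master_index;

    return (size_t)(h % UPD_TABLE_SIZE);
}

/* Remember an uncancelled record. Returns 0 on success, -1 if the table is full. */
int upd_table_add(struct upd_table *t, uint32_t master_index, uint64_t batchid)
{
    size_t start = upd_hash(master_index, batchid);

    for (size_t i = 0; i < UPD_TABLE_SIZE; i++) {
        struct upd_rec *r = &t->slots[(start + i) % UPD_TABLE_SIZE];

        if (!r->in_use) {
            r->master_index = master_index;
            r->batchid = batchid;
            r->in_use = 1;
            return 0;
        }
    }
    return -1;
}

/*
 * Cancel the record of one operation. Returns 1 if a record was found and
 * cancelled, 0 otherwise. The probe does not stop at empty slots, because
 * earlier cancellations may have left holes in the probe sequence.
 */
int upd_table_cancel(struct upd_table *t, uint32_t master_index, uint64_t batchid)
{
    size_t start = upd_hash(master_index, batchid);

    for (size_t i = 0; i < UPD_TABLE_SIZE; i++) {
        struct upd_rec *r = &t->slots[(start + i) % UPD_TABLE_SIZE];

        if (r->in_use && r->master_index == master_index &&
            r->batchid == batchid) {
            r->in_use = 0;
            return 1;
        }
    }
    return 0;
}
```

A remote MDT would call upd_table_add() when it logs an update and upd_table_cancel() when the master's cancel request arrives; after a crash, the table would be rebuilt by adding every uncancelled record read back from the update llog.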
[AZ: the description above mixes the number of things, IMHO. probably, it'd be better to have a separate document describing async updates with no MDT/LDLM mentioned, just a mechanism accepting set of updates and ensuring they are applied atomically.]
[Figure: commit_flow.png]
Note: LDLM lock is not needed during recovery.
- If the replay request comes from the client, the master MDT will re-enqueue the lock for the replay request.
- The failover MDT will not accept new requests from clients during recovery. Commit on Share (COS) will be applied to all cross-MDT operations, which ensures that all conflicting updates have been committed to disk, so cross-MDT replay updates should not conflict, i.e. LDLM locks are not needed for replay between MDTs.
Update request format
As described earlier, one metadata operation will be decomposed into several updates. These updates will be distributed to all other MDTs by update RPC. In DNE Phase II each RPC only includes the updates for a single operation. The format of the update RPC is:
    /* Update request HEADER */
    struct object_update_request {
        __u32                ourq_magic;     /* UPDATE_REQUEST_MAGIC_V2 */
        __u16                ourq_count;     /* number of object_update records in request */
        __u16                ourq_padding;   /* currently unused */
        struct object_update ourq_updates[0]; /* array of object_update records */
    };

    /* Object update */
    struct object_update {
        __u16                      ou_type;         /* enum update_type */
        __u16                      ou_params_count; /* update parameters count */
        __u32                      ou_master_index; /* master MDT/OST index */
        __u32                      ou_flags;        /* enum update_rec_flags */
        __u32                      ou_padding1;     /* padding 1 */
        __u64                      ou_batchid;      /* op transno on master */
        struct lu_fid              ou_fid;          /* object to be updated */
        struct object_update_param ou_params[0];    /* update params */
    };

    enum update_rec_flags {
        OUT_UPDATE_FL_OST       = 0x00000001, /* update master is OST */
        OUT_UPDATE_FL_SYNC      = 0x00000002, /* commit update before reply */
        OUT_UPDATE_FL_COMMITTED = 0x00000004, /* ou_batchid is committed globally */
        OUT_UPDATE_FL_NOLOG     = 0x00000008, /* idempotent update does not need to be logged */
    };

    /* Parameters of each update */
    struct object_update_param {
        __u16 oup_len;      /* length of this parameter in bytes */
        __u16 oup_padding;  /* currently unused */
        __u32 oup_padding2; /* currently unused */
        char  oup_buf[0];   /* update-specific parameter data, oup_len bytes */
    };
Update reply format
After OUT handles these updates, the result of each update will be packed into the reply buffer. The format is:

    /* Updates result HEADER */
    struct object_update_reply {
        __u32 ourp_magic;   /* magic of the reply */
        __u16 ourp_count;   /* number of update replies */
        __u16 ourp_padding; /* unused for now */
        __u16 ourp_lens[0]; /* length of each update reply */
    };

    /* The result of each object update */
    struct object_update_result {
        __u32 our_rc;       /* the return result of this update */
        __u16 our_datalen;  /* length of our_data */
        __u16 our_padding;  /* unused for now */
        __u32 our_data[0];  /* holds the reply of the update, for example
                             * attributes of the object for an ATTR_GET update */
    };
Update buffer for creating striped directory
AZ: Should this section be updated to reflect changes in the design?
According to the process described above, during striped directory creation the sub-stripe(s) will be created on all of the MDTs involved. If all of these updates were stored in the prepared update buffer, it might become too large to fit in a single update RPC (maximum size 1 MB). The update buffer will therefore be compressed in this case: the stripe creation updates are nearly identical, and the only difference between them is the stripe_index. The update buffer will include:
- Updates for creating master object on the master MDT.
- Updates for setting stripe EA on the master object.
- Updates for inserting the nth name entry for the nth sub-stripe (compressed update).
- Updates for updating the last_rcvd on the master MDT (see Failover section).
- Updates for setting the stripe EA on the nth stripe (compressed update).
- Updates for creating the nth sub-stripe.
Open/create regular remote files
When creating regular remote files,
- Client allocates the FID and sends create request to the master MDT where the file is located.
- Master MDT creates the object by the FID.
- Slave MDT inserts the name entry into its parent.
Note: open(O_CREAT) is not allowed to create remote regular files in this phase, so if the user wants to create and open a remote regular file on a remote MDT, the file should first be created, then opened in a separate system call.
Unlink remote files/directories
In common with DNE Phase I, DNE Phase II clients will send unlink request to the MDT where the inode is located:
- The client sends unlink request to the master MDT.
- The master MDT enqueues the LDLM lock of the remote parent and the file.
- The master MDT then decreases the nlink of the inode. If the nlink is zero:
  - if the file is currently open, it moves the file to the ORPHAN directory.
  - if the file is not open, it destroys the file.
- The remote MDT will remove the name entry from the parent.
Rename remote files/directories
In contrast to other cross-MDT metadata operations, rename between multiple MDTs involves four objects, which might be on different MDTs. This adds additional complexity. Additional care must be taken before renaming directories: the relationship between the source and the target directory must be checked to avoid moving the parent into the subdirectory of its child. The checking process for directory renames is protected by a global rename lock to ensure the parent/child relationship will not change during the check, in addition to locks on each of the directories being renamed. If the source and target parent directories are the same directory, then the global rename lock can be dropped since it is not possible to change the parent/child relationship. The global lock will be held on the MDT0 namespace.
During a rename of dir_S/src to dir_T/tgt, where MDT1 holds dir_S, MDT2 holds src, MDT3 holds dir_T, and MDT4 holds tgt:
- The client sends the rename RPC to MDT4 if the tgt object exists, since it will need to unlink the target inode and may have to handle the open-unlinked file, otherwise to MDT2 where the src object exists (though this is not a hard requirement). This is the master MDT.
- If the client sends the RPC to an MDT and it looks up the tgt name under DLM lock and the tgt object exists on a remote MDT, the MDT will return -EREMOTE and the client must resend the RPC to the MDT with the tgt object.
- If src is a directory, the master MDT acquires the global rename lock. The master MDT then gets the LDLM locks of dir_S and dir_T according to their parent and child relationship, then gets the LDLM locks of their child name hashes.
- If tgt is a directory that is not empty, then the rename fails with -ENOTEMPTY.
- If src is a directory, the master MDT checks the relationship between dir_S and dir_T. If src is an ancestor of dir_T, i.e. the rename would move a directory into its own subtree, the rename fails with -EINVAL.
- MDT1 deletes the entry src and sets the ctime/mtime of dir_S.
- If src is a directory, MDT2 deletes the old dir_S ".." entry and inserts the new dir_T ".." entry, sets the ctime/mtime of src, decrements its nlink count, and also updates the linkEA of src.
- The master MDT removes the old tgt entry if it exists, and inserts a new tgt entry with the src object FID, and decrements the nlink count of dir_T if this is a directory. The link count of dir_T does not change.
- If src is a directory, then the master MDT releases the global rename lock.
- If the tgt object exists, MDT4 destroys the tgt object.
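The relationship check that guards directory renames (never move a directory into its own subtree) can be sketched as an ancestor walk. This is a minimal sketch under simplifying assumptions: the namespace is modelled as an in-memory array of parent indices, which is hypothetical, whereas in the design the walk crosses MDTs and is protected by the global rename lock. The names dir_is_ancestor and rename_dir_check are illustrative only.

```c
#include <errno.h>
#include <stddef.h>

#define NO_PARENT ((size_t)-1)  /* marks the root directory in the parent array */

/*
 * parent[i] is the index of the parent of directory i, or NO_PARENT for
 * the root. Returns 1 if 'anc' is 'dir' itself or one of its ancestors.
 * The step counter bounds the walk so a corrupted (cyclic) parent array
 * cannot loop forever.
 */
int dir_is_ancestor(const size_t *parent, size_t ndirs, size_t anc, size_t dir)
{
    size_t steps = 0;

    while (dir < ndirs && steps++ < ndirs) {
        if (dir == anc)
            return 1;
        dir = parent[dir];
    }
    return 0;
}

/*
 * Check performed before renaming directory 'src' into directory 'dir_t':
 * fails with -EINVAL if the rename would move src into its own subtree.
 */
int rename_dir_check(const size_t *parent, size_t ndirs, size_t src, size_t dir_t)
{
    return dir_is_ancestor(parent, ndirs, src, dir_t) ? -EINVAL : 0;
}
```

Because the parent chain must not change while it is being walked, the check only makes sense while the rename lock is held, which is why the design serializes directory renames across different parents.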
[AZ: Frankly, I'm not sure why do we need this description here. this seem to be a normal rename process, nothing really special to DNE?]
All of these updates will be stored in the update logs on every MDT. If any MDT fails and restarts, it will notify the other MDTs to send all of these updates to the failover MDT, which will then redo the updates for the failed MDT. This is discussed in detail in the Failover section.
Migration
In DNE Phase II, the migration tool (lfs mv file -i target_MDT) will be provided to help users move individual inodes from one MDT to another MDT, without copying data on the OSTs.
Migrating regular files will be performed as follows:
For lfs mv -i MDT3 file1, MDT1 holds the name entry of file1, MDT2 holds the file1 inode, and MDT3 will be the target MDT where the file will be migrated:
- The client sends the migrate request to MDT1.
- MDT1 checks whether the file is currently opened, and returns -EBUSY if it is opened by another process.
- MDT1 acquires LAYOUT, UPDATE, and LOOKUP locks on the file, and it will also lock the parent of the file. Note: if there are multiple links of this file, it needs to lock all of its parents. [Di: But how to order these ldlm locks?]
- MDT3 creates a new file with the same layout, and updates linkEA.
- MDT1 updates the entry with the new FID.
- MDT2 destroys the old object, but if there are multiple links for the old object, it also needs to walk through all of the name entries and update the FID in all of these name entries.
- MDT1 releases the LAYOUT, UPDATE, and LOOKUP locks.
- The client clears the inode cache of file1 so that the client is not caching the old layout.
Migrating directories is more complicated:
For lfs mv -i MDT3 dir1, MDT1 holds the name entry of dir1, MDT2 holds the dir1 inode and all the directory entries, and MDT3 will be the target MDT where the directory will be migrated:
- The client sends the migrate request to MDT1, where the directory is located.
- MDT1 checks whether the directory is currently opened, and returns -EBUSY if it is opened by another process.
- MDT1 acquires LAYOUT, UPDATE, and LOOKUP locks on the directory.
- MDT3 creates the new directory. Note: if dir1 is a non-empty directory, MDT1 needs to iterate over all of the entries of the directory and send them to MDT3, which will insert all of the entries into the new directory. The linkEA of each child also needs to be updated.
- MDT2 destroys the old directory.
- MDT1 updates the entry with the new FID.
- MDT1 releases the LAYOUT, UPDATE, and LOOKUP locks.
When an entire directory tree is being migrated from one MDT to a second MDT, individual files and directories will be migrated from top to bottom, i.e. the parent will be migrated to the new MDT first, then its children. In this way, if another process creates files or directories during the migration:
- If the parent of the creation has been moved to the new MDT, the file/directory will be created on the new MDT.
- If the parent of the creation has not been moved to the new MDT yet, the new created file/directory will be moved to the new MDT in the following migration.
This design ensures that all directories will be migrated to the new MDT in all cases.
After migrating the directory to the new MDT, the directory on the old MDT will become an orphan, i.e. it cannot be accessed from the namespace. The orphan cannot be destroyed until all of its children are moved to the new MDT. In this way, migrating a directory does not need to update the parent FID in the linkEA of all of the children, since all of the children can still find their parent on the old MDT using fid2path during migration.
Failover
In DNE Phase I all cross-MDT requests are synchronous and there are no replay requests between MDTs. This design simplifies recovery between MDTs in DNE Phase I. With DNE Phase II, all cross-MDT operations are asynchronous and there will be replay requests between MDTs. This makes recovery more complex than for DNE Phase I.
As described earlier, all of the updates of a cross-MDT operation will be recorded on every MDT involved in that operation. During recovery the updates will be sent to the failed MDT, which is the new master MDT for these updates, and replayed there. In addition to the updates of the operation itself, the update that modifies the last_rcvd file on the master MDT will also be added to the update log so that it can be replayed even if the client fails. A new index update method (index_update) is used. This record will include:
- The master MDT index, used to identify the master MDT during recovery.
- The local FID { FID_SEQ_LOCAL_FILE, LAST_RECV_OID, 0 }, which represents the last_rcvd file.
- The lsd_client_data structure: the client UUID is used as the index key, and the rest of the body is the value.
During update, the master MDT will first locate the last_rcvd file by FID, then locate the record in the file by client UUID, then update the whole body of lsd_client_data.
Recovery in DNE Phase II will be driven by master MDT, and can be divided into three steps:
1. When an MDT restarts after a crash, it will read all of the updates with the same ou_master_index from the local update llog, and also read the updates related to itself from all other MDTs.
   - The failover MDT will then compare the local update records with the records from the remote MDTs. To identify update records from the same operation, a new unique identifier (Distribution ID) will be created on the master MDT for each operation and recorded in all of the update records of the operation, so the recovery process can find all of the records of an operation by checking the ID and master_index. The DID (Distribution ID) should be unique across restarts, so the last committed transno on the master MDT will be used as the starting number, and it will be increased in memory for each operation.
   - If the update records exist on both the local (master) and remote MDTs, the operation has been committed on all MDTs, so these update records are simply cancelled.
   - If the update records exist only on the master, but not on the remote MDT, these updates have only been executed on the master MDT, so they need to be resent to the remote MDTs and re-executed there.
   - If the update records do not exist on the master, but only on the remote MDT, then either the operation has not been executed on the new master MDT yet, or the operation has been executed but these update records have already been cancelled. In this case the ou_batchid (transno of local updates, see the Implementation section above) in the update records must be checked:
     - if it is larger than the last_committed transno of the master MDT, the operation has not been done on the master MDT, and these updates need to be re-executed.
     - if it is smaller than the last_committed transno of the master MDT, the operation has been done on the master MDT, and only the update records need to be cancelled.
2. After the first step, the recovery thread will pick the replay requests or update records (created in step 1) according to the transno, and redo these requests or updates. Note: the replay requests and updates might be duplicates, but the recovery thread should be able to tell by the transno in the request and update records. AZ: it would be good to describe this in details.
3. If there are any failures during the above 2 steps, the lfsck daemon will be triggered to fix the filesystem. (question)
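The record comparison in the first recovery step amounts to a small decision table, sketched below. The enum and function names are hypothetical, and one point is an assumption: the text only covers ou_batchid being larger or smaller than last_committed, so this sketch treats the equal case as already committed.

```c
#include <stdint.h>

/* Hypothetical names for the outcomes described in recovery step 1. */
enum recovery_action {
    RA_NONE,        /* no update records for this operation anywhere */
    RA_CANCEL,      /* operation is complete: cancel the update records */
    RA_RESEND,      /* resend the updates to the remote MDT(s) */
    RA_REEXECUTE,   /* re-execute the updates on the master MDT */
};

/*
 * Decide what to do with one operation during recovery of its master MDT.
 *
 * on_master / on_remote: whether update records of the operation were
 * found in the master's local llog / in a remote MDT's llog.
 * batchid: ou_batchid from the records (operation transno on the master).
 * master_last_committed: last committed transno of the master MDT.
 */
enum recovery_action recovery_decide(int on_master, int on_remote,
                                     uint64_t batchid,
                                     uint64_t master_last_committed)
{
    if (on_master && on_remote)
        return RA_CANCEL;       /* committed on all MDTs */
    if (on_master)
        return RA_RESEND;       /* only executed on the master */
    if (on_remote) {
        /* Assumption: batchid == last_committed counts as committed. */
        if (batchid > master_last_committed)
            return RA_REEXECUTE;    /* never committed on the master */
        return RA_CANCEL;           /* master already cancelled its record */
    }
    return RA_NONE;
}
```

For example, a record found only on a remote MDT with ou_batchid 120 while the master's last committed transno is 100 would be re-executed on the master, whereas the same record with ou_batchid 80 would only be cancelled.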
Recovery of cross-MDT operations requires the participation of all involved MDTs. In the case of multiple MDT failures, normal service therefore cannot resume until all failed MDTs fail over or reboot. The system administrator may disable a permanently failed MDT (by lctl deactivate) to allow recovery to complete on the remaining MDTs.
AZ: processing under the normal conditions is not described? how/when llogs are cancelled?
Failure cases
| Failures | Failover |
|---|---|
| Both master and slave fail and updates have been committed on both MDTs. | The master will replay the update to the slave, and the slave will know whether the update has been executed by checking the update llog and generate the reply. |
| Both master and slave fail and updates have not been committed on either MDT yet. | The client will do normal replay (or resend if no reply), and the master will redo the whole operation from scratch. If the client has also failed, nothing is left to be done. |
| Both master and slave fail and updates have been committed on the master MDT but not on the slave MDT. | The master resends the update to the slaves and the slaves will redo the updates. The client may resend or replay, and this will be handled by the client last_rcvd on the master. |
| The master is alive and the slave fails without committing the update. | The master replays (or resends if no reply) the updates to the slave MDT, and the slave will redo the updates. |
| The master is alive and the slave fails having committed the update. | The master replays (or resends if no reply), and the slave will generate the reply from the master last_rcvd slot based on XID (== master transno, if resend). |
| The master fails without committing and the slave is alive. | The slave replays the updates to the master MDT when it restarts. The master will check whether the update has been executed by checking its local update llog, and redo the update if not found. This also updates the client's last_rcvd slot from the update, so if the client replays or resends it can be handled normally. |
During recovery, if one update replay fails, all related updates may also fail in the subsequent replay process. For example, client1 creates a remote directory on MDT1 whose name entry is on MDT0, and other clients then create files under the remote directory on MDT1. If MDT0 fails before the name entry insertion has been committed to disk and the recovery fails for some reason, the directory is not connected to the namespace at all, and none of the files under this directory can be accessed. To avoid this, Commit on Share will be applied to cross-MDT operations: if the MDT finds that the object being updated was modified by a previous cross-MDT operation, that previous operation will be committed first. So in the example above, the creation of the remote directory must be committed to disk before any files are created under it. AZ: I think this applies to any operations, imagine local mkdir /A and then following distributed mkdir /A/B. if /A is missing, we won't be able to recovery the 2nd mkdir?
Commit on Share (COS) will be implemented with COS locks, based on the current local COS implementation. During a cross-MDT operation, all locks on remote objects (remote locks) will be held on the master MDT, and all of the remote locks will be COS locks. If these COS locks are revoked, the master MDT will not only sync itself, but also sync the remote MDTs.
For example, these two consecutive operations: 1. mv dir_S/s dir_T/t 2. touch dir_T/s/tmp (MDT1 holds dir_S, MDT2 holds s, MDT3 holds dir_T, MDT4 holds t)
- Client sends rename request to MDT1
- MDT1 detects the remote rename, takes LDLM COS locks on all four objects, and finishes the rename. The four LDLM locks are cached on MDT1.
- Client sends open/create request to MDT2
- MDT2 enqueues the LDLM lock for s and MDT1 revokes its lock on s. Because it is a COS lock, MDT1 will sync all of the MDTs involved in the previous rename, i.e. MDT1, MDT2, MDT3, MDT4.
Compatibility
MDT-MDT
In DNE Phase I updates between MDTs are synchronous. In DNE Phase II updates are asynchronous. To avoid the complications introduced by multiple different MDT versions, DNE Phase II requires all MDTs to be of the same version, i.e. they must be upgraded at the same time. MDTs will check versions during connection setup and deny connect requests from an old MDT version.
MDT-OST
There are no protocol changes between MDT and OST in DNE Phase II.
CLIENT-MDT
In DNE Phase II rename requests will be sent to the MDT where the target file is located. This is different from DNE Phase I. An old client (Lustre software version 2.4.0 or earlier) will still send the request to the MDT where the source parent is, and that MDT will return -EREMOTE to the old client. A 2.4.0 client does not understand -EREMOTE, so a patch will be added to the 2.4 series to redirect such rename requests to the MDT where the target file is when the client gets -EREMOTE from the MDT. Old MDTs that do not support DNE Phase II will return -EXDEV for cross-MDT rename, which will typically be handled by userspace tools by copying the files between the directories.
Striped Directories Design
Introduction
In DNE Phase I all of the name entries of one directory are on a single MDT. As a result, single-directory performance is expected to be the same as on a single-MDT file system. In DNE Phase II a striped directory will be introduced to improve single-directory performance. This document discusses how striped directories will be implemented. It assumes knowledge of the DNE Phase II async cross-MDT operation High Level Design and the DNE Phase I Remote Directory High Level Design.
Functional Statement
Similar to file striping, a striped directory will split the name entries across multiple MDTs. Each MDT keeps the directory entries for a certain range of the hash space. For example, if there are N MDTs and the hash range is 0 to MAX_HASH, the first MDT will keep records with hashes [0, MAX_HASH/N - 1], the second one with hashes [MAX_HASH/N, 2 * MAX_HASH/N - 1], and so on. During file creation, LMV will calculate the hash value from the name, then create the file in the corresponding stripe on one MDT. The user will also be able to choose a different hash function to stripe the directory. The directory can only be striped during creation and cannot be re-striped after creation in DNE Phase II.
[Figure: stripe_hash.png]
[AED: this diagram is not correct. Stripe 0 does not hold [0, max_hash/4] and stripe 1 does not hold [max_hash/4, max_hash/2], etc. Instead, it would be better to show stripe 0 holding [hash % 4 == 0], and stripe 1 holding [hash % 4 == 1]. ]
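The two ways of mapping a hash to a stripe that are discussed here, the contiguous ranges from the functional statement and the modulo mapping suggested in the review comment, can be compared side by side. The following is only a sketch: the function names are hypothetical, the hash value is taken as an input rather than computed from a name, and the clamping of the last range is an assumption made so that hashes near max_hash still land in a valid stripe.

```c
#include <stdint.h>

/*
 * Range mapping: stripe k holds hashes in
 * [k * max_hash / count, (k + 1) * max_hash / count - 1].
 * Precondition: count > 0. Hashes above the last full range are clamped
 * into the last stripe (an assumption of this sketch).
 */
uint32_t stripe_by_range(uint64_t hash, uint64_t max_hash, uint32_t count)
{
    uint64_t width = max_hash / count;
    uint64_t idx;

    if (width == 0)
        width = 1;              /* degenerate case: max_hash < count */
    idx = hash / width;
    return idx >= count ? count - 1 : (uint32_t)idx;
}

/*
 * Modulo mapping: stripe k holds every hash with hash % count == k.
 * Precondition: count > 0.
 */
uint32_t stripe_by_modulo(uint64_t hash, uint32_t count)
{
    return (uint32_t)(hash % count);
}
```

With four stripes and max_hash = 1000, hash 250 falls in stripe 1 under the range mapping and in stripe 2 under the modulo mapping, so a client and an MDT must agree on which mapping a given directory layout uses.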
Definition
The first stripe of each striped directory will be called the master stripe, which is usually on the same MDT as its parent. Other stripes will be called remote stripes.
Logical Statement
Similar to a striped file, a client will get the directory layout information after lookup and then build the layout information for this directory in LMV. For any operation under the striped directory, the client will first calculate the hash value from the name, then get the stripe from the hash and the layout. Finally, the client will send the request to the MDT where the stripe is. If a large number of threads access the striped directory simultaneously, each thread can go to a different MDT, and these requests can be handled by each MDT concurrently and independently. This improves single-directory performance.
Directory Layout
The directory layout information will be stored in the EA of every stripe as follows:
    struct lmv_mds_md {
        __u32         lmv_magic;          /* stripe format version */
        __u32         lmv_count;          /* stripe count */
        __u32         lmv_master;         /* master MDT index */
        __u32         lmv_hash_type;      /* dir stripe policy, i.e. indicates
                                           * which hash function is to be used */
        __u32         lmv_layout_version; /* used for directory restriping */
        __u32         lmv_padding1;
        __u32         lmv_padding2;
        __u32         lmv_padding3;
        char          lmv_pool_name[LOV_MAXPOOLNAME]; /* pool name */
        struct lu_fid lmv_data[0];        /* FIDs for each stripe */
    };
lmv_hash_type indicates which hash function the directory will use to split its name entries.
[AED: this part of the design needs to be updated to account for LU-5223]
Directory stripe lock
Currently all of the name entries of one directory are protected by the UPDATE lock of this directory. As a result, the client will invalidate all entries in this directory during UPDATE lock revocation. In a striped directory each stripe has its own UPDATE lock, and if a thread tries to modify the striped directory the MDT only needs to acquire the UPDATE lock of a single stripe. Consequently, the client will only invalidate the name entries of this stripe, instead of all of the entries of the directory. When deleting the striped directory, the MDT needs to acquire each of the stripe locks. When performing readdir of the striped directory, the client must acquire each stripe lock to cache the directory contents. Stripe locks do not need to be acquired simultaneously.
Create striped directory
Creating a striped directory is similar to creating a striped file:
- The client allocates FIDs for all stripes and sends the create request to the master MDT.
- The master MDT sends object create updates to each remote MDT to create the stripes.
- For each remote stripe, the parent FID in LinkEA will be the Master stripe FID, which will also be put into the ".." directory of each remote stripe, i.e. the remote stripes will physically be remote subdirectories of the master stripe to satisfy lfsck. During readdir, LMV will ignore this subdirectory relationship, and recognize it as individual stripe of the directory (it will be collapsed by LMV on the client with the layout and skipped during readdir.) This design simplifies LFSCK consistency checking and reduces the number of objects modified during rename (for ".." and LinkEA).
Delete striped directory
The client sends the delete request to the master MDT, which then acquires all of the stripe locks of the directory. The master MDT checks whether all of the stripes are empty and, if so, destroys all of the stripes.
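The ordering described above (lock every stripe, verify every stripe is empty, only then destroy) can be sketched with a small in-memory model. This is a minimal illustration under stated assumptions, not MDT code: `struct demo_stripe`, `demo_striped_rmdir()` and the use of `-ENOTEMPTY` for the non-empty case are all hypothetical names and choices made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for one stripe of a striped directory. */
struct demo_stripe {
	unsigned int nr_entries; /* name entries currently in this stripe */
	int locked;              /* 1 while the stripe lock is held */
	int destroyed;           /* 1 once the stripe object is destroyed */
};

static void demo_unlock_all(struct demo_stripe *s, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		s[i].locked = 0;
}

/*
 * Delete a striped directory: take every stripe lock, check that every
 * stripe is empty, and only then destroy the stripes. Nothing is
 * destroyed if any stripe still holds an entry.
 */
static int demo_striped_rmdir(struct demo_stripe *s, size_t nr)
{
	assert(s != NULL);

	for (size_t i = 0; i < nr; i++)
		s[i].locked = 1;

	for (size_t i = 0; i < nr; i++) {
		if (s[i].nr_entries != 0) {
			demo_unlock_all(s, nr);
			return -ENOTEMPTY; /* assumed error code */
		}
	}

	for (size_t i = 0; i < nr; i++)
		s[i].destroyed = 1;

	demo_unlock_all(s, nr);
	return 0;
}
```

The emptiness check is completed for all stripes before the first destroy, so a single non-empty stripe leaves the whole directory intact.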
Create/lookup files/directories under striped directory
When a file or directory is created or looked up under a striped directory:
- The client first calculates the hash from the name and the lmv_hash_type of the striped directory. Next, the client derives the MDT index from the hash and sends the create/lookup request to that MDT.
- The MDT creates or looks up the file or directory independently. Note: when creating a new directory, the MDT only needs to modify the attributes of the local stripe (for example, increasing nlink and updating mtime), which avoids sending attrset updates between MDTs. It also means that when a client retrieves the attributes of a striped directory, it needs to walk through all of the stripes on the different MDTs and then merge the attributes from each stripe.
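The client-side mapping from a name to a stripe can be sketched as follows. This is an illustration only: the selector values in `enum demo_hash_type`, the two hash functions, and the "hash modulo stripe count" reduction are assumptions made for the example, since this document does not specify the actual lmv_hash_type encoding or how the hash is reduced to an index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical selector values; the real lmv_hash_type encoding is not
 * given in this document. */
enum demo_hash_type {
	DEMO_HASH_ALL_CHARS = 1,
	DEMO_HASH_FNV_1A_64 = 2,
};

/* Sum of all name bytes: cheap, but poorly distributed. */
static uint64_t demo_hash_all_chars(const char *name, size_t len)
{
	uint64_t h = 0;

	for (size_t i = 0; i < len; i++)
		h += (unsigned char)name[i];
	return h;
}

/* 64-bit FNV-1a over the name bytes. */
static uint64_t demo_hash_fnv_1a_64(const char *name, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL; /* FNV offset basis */

	for (size_t i = 0; i < len; i++) {
		h ^= (unsigned char)name[i];
		h *= 0x100000001b3ULL;      /* FNV prime */
	}
	return h;
}

/*
 * Map a name to a stripe index in [0, stripe_count).
 * Returns -1 for an unknown hash type or a zero stripe count.
 */
static int demo_name_to_stripe(const char *name, size_t len,
			       uint32_t hash_type, uint32_t stripe_count)
{
	uint64_t h;

	assert(name != NULL);
	if (stripe_count == 0)
		return -1;

	switch (hash_type) {
	case DEMO_HASH_ALL_CHARS:
		h = demo_hash_all_chars(name, len);
		break;
	case DEMO_HASH_FNV_1A_64:
		h = demo_hash_fnv_1a_64(name, len);
		break;
	default:
		return -1;
	}
	return (int)(h % stripe_count);
}
```

The returned index would then be used to pick the stripe FID from the layout (lmv_data[] in struct lmv_mds_md), and the request goes to the MDT holding that stripe.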
Readdir of striped directory
During readdir() a client will iterate over all stripes and for each stripe it will get a stripe lock and then read directory entries. Each directory's hash range should be in the range [1..2^63-1]. The readdir() operation will proceed in hash order concurrently among all of the stripes that make up the directory, and the client will perform a merge sort of the hashes of all the returned entries to a single stream, up to the lowest hash value at the end of the returned directory pages. This allows a single 64-bit cookie to represent the readdir offset within all of the stripes in the directory. There is no more chance of hash collision with the readdir cookie in a striped directory than there is with a single directory of equivalent size.
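The merge step can be sketched as a k-way merge over one page of sorted hashes per stripe. This is a minimal sketch under assumptions: `struct demo_stripe_page`, its `eof` flag, and `demo_merge_pages()` are hypothetical, and real directory pages carry full entries rather than bare hashes. The cutoff is the lowest end-of-page hash among stripes that still have more entries to return, because such a stripe may hold unseen entries just above its last returned hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one directory page returned by one stripe. */
struct demo_stripe_page {
	const uint64_t *hashes; /* entry hashes, sorted ascending */
	size_t count;           /* number of hashes in this page */
	size_t pos;             /* next hash not yet merged */
	int eof;                /* 1 if the stripe has no further pages */
};

/*
 * Merge the pages into a single hash-ordered stream, stopping at the
 * lowest end-of-page hash. Returns the number of hashes written to out;
 * *cookie is set to the last emitted hash (the resume offset).
 */
static size_t demo_merge_pages(struct demo_stripe_page *pages, size_t nr,
			       uint64_t *out, size_t max, uint64_t *cookie)
{
	uint64_t cutoff = UINT64_MAX;
	size_t n = 0;

	assert(pages != NULL && out != NULL && cookie != NULL);

	for (size_t i = 0; i < nr; i++) {
		if (pages[i].eof)
			continue; /* this stripe cannot hide later entries */
		if (pages[i].count == 0)
			return 0; /* need a page from this stripe first */
		if (pages[i].hashes[pages[i].count - 1] < cutoff)
			cutoff = pages[i].hashes[pages[i].count - 1];
	}

	while (n < max) {
		size_t best = nr;

		/* Pick the stripe whose next unmerged hash is smallest. */
		for (size_t i = 0; i < nr; i++) {
			if (pages[i].pos >= pages[i].count)
				continue;
			if (best == nr ||
			    pages[i].hashes[pages[i].pos] <
			    pages[best].hashes[pages[best].pos])
				best = i;
		}
		if (best == nr || pages[best].hashes[pages[best].pos] > cutoff)
			break;
		out[n++] = pages[best].hashes[pages[best].pos++];
	}

	if (n > 0)
		*cookie = out[n - 1];
	return n;
}
```

Entries above the cutoff stay unmerged until the limiting stripe returns its next page, which is what lets one 64-bit value describe the position in every stripe at once.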
Getattr of striped directory
The client iterates over all of the stripes to get their attributes and then merges them together:
- size/blocks/nlink: add all together from every stripe.
- ctime/mtime/atime: choose the newest one as the xtime of the striped directory.
- uid/gid: should be same for all stripes.
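The merge rules listed above can be sketched as a small helper. This is an illustration under assumptions: `struct demo_stripe_attr` and `demo_merge_attrs()` are hypothetical, timestamps are reduced to whole seconds, and a uid/gid mismatch is simply reported as an error because the list states the values should be the same on every stripe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-stripe attributes as fetched from each MDT. */
struct demo_stripe_attr {
	uint64_t size;
	uint64_t blocks;
	uint32_t nlink;
	int64_t atime;
	int64_t mtime;
	int64_t ctime;
	uint32_t uid;
	uint32_t gid;
};

static int64_t demo_newest(int64_t a, int64_t b)
{
	return a > b ? a : b;
}

/*
 * Merge per-stripe attributes into one view of the striped directory:
 * size/blocks/nlink are summed, the newest timestamps win, and uid/gid
 * must agree. Returns 0 on success, -1 on empty input or mismatch.
 */
static int demo_merge_attrs(const struct demo_stripe_attr *s, size_t nr,
			    struct demo_stripe_attr *out)
{
	assert(s != NULL && out != NULL);
	if (nr == 0)
		return -1;

	*out = s[0];
	for (size_t i = 1; i < nr; i++) {
		if (s[i].uid != out->uid || s[i].gid != out->gid)
			return -1;
		out->size += s[i].size;
		out->blocks += s[i].blocks;
		out->nlink += s[i].nlink;
		out->atime = demo_newest(out->atime, s[i].atime);
		out->mtime = demo_newest(out->mtime, s[i].mtime);
		out->ctime = demo_newest(out->ctime, s[i].ctime);
	}
	return 0;
}
```

The nlink sum follows the list literally; any correction for per-stripe "." and ".." links is outside what this section describes and is not modelled here.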
Rename in the same striped directory
The client sends the rename request to the MDT where the master stripe of the source parent is located. If the rename is within a single stripe, it is the same as a rename within a normal directory. If the rename is under the same striped directory but between different stripes on different MDTs (mv dir_S/src dir_S/tgt, where dir_S is a striped directory, MDT0 holds the master stripe, MDT1 holds src, and MDT2 holds tgt):
- The client sends the rename request to MDT2.
- MDT2 acquires the LDLM locks (both inode bits and the hash of the file name) of the source and target stripes according to their FID order.
- MDT1 deletes entry src, sets the mtime of the stripe, updates the linkEA of src, and, if src is a directory, decreases the nlink of the local stripe.
- MDT2 deletes entry tgt, inserts entry src, and, if tgt is a directory, increases the nlink of the local stripe.
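Taking the stripe locks "according to their FID order" gives every MDT the same acquisition order for a given pair of stripes, which is what prevents two concurrent renames from deadlocking on each other. A minimal sketch of such an ordering follows; `struct demo_fid` (sequence, object id, version) and both helpers are hypothetical, and the comparison order of the fields is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical FID: sequence, object id within the sequence, version. */
struct demo_fid {
	uint64_t seq;
	uint32_t oid;
	uint32_t ver;
};

/* Total order on FIDs: returns <0, 0 or >0, like memcmp(). */
static int demo_fid_cmp(const struct demo_fid *a, const struct demo_fid *b)
{
	if (a->seq != b->seq)
		return a->seq < b->seq ? -1 : 1;
	if (a->oid != b->oid)
		return a->oid < b->oid ? -1 : 1;
	if (a->ver != b->ver)
		return a->ver < b->ver ? -1 : 1;
	return 0;
}

/*
 * Decide which of two stripes to lock first. The result depends only on
 * the FIDs, never on which stripe is "source" or "target".
 */
static void demo_lock_order(const struct demo_fid *a, const struct demo_fid *b,
			    const struct demo_fid **first,
			    const struct demo_fid **second)
{
	assert(a != NULL && b != NULL && first != NULL && second != NULL);

	if (demo_fid_cmp(a, b) <= 0) {
		*first = a;
		*second = b;
	} else {
		*first = b;
		*second = a;
	}
}
```

Calling `demo_lock_order()` with the arguments swapped yields the same `first` and `second`, so a rename from stripe A to stripe B and a concurrent rename from B to A both lock in the same sequence.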
Rename between different striped directory
Rename between different striped directories is a more complicated case, with potentially six MDTs involved in the process:
(mv dir_S/src dir_T/tgt, where MDT1 holds the source stripe of dir_S in which the name entry of src is located, MDT2 holds the src object, MDT3 holds the target stripe of dir_T in which the name entry of tgt is located, and MDT4 holds the tgt object)
- The client sends the rename request to MDT4 if the tgt object exists, otherwise to MDT2 where the src object exists (though this is not a hard requirement). This is the master MDT.
- If the client sends the RPC to an MDT, and that MDT looks up the tgt name under DLM lock and finds that the tgt object exists on a remote MDT, the MDT will return -EREMOTE and the client must resend the RPC to the MDT with the tgt object.
- If the renamed object is a directory, the master MDT acquires the global rename lock. The master MDT gets the LDLM locks of the dir_S and dir_T stripes according to their FID order, then gets the LDLM locks of their child name hashes.
- If the renamed object is a directory, the master MDT checks the relationship between the dir_S and dir_T stripes. If dir_S is the parent of tgt, the rename is not allowed.
- MDT1 deletes entry src and sets the ctime/mtime of dir_S.
- If the renamed object is a directory, MDT2 deletes the old dir_S ".." entry and inserts the new dir_T ".." entry, sets the ctime/mtime of src, and also updates the linkEA of src.
- The master MDT deletes the old entry tgt if it exists, inserts the new entry tgt with the src object FID, and also updates the link count of the local stripe if this is a directory.
- If the renamed object is a directory, the master MDT releases the global rename lock.
- If the tgt object exists, MDT4 destroys tgt.
If the object being renamed is itself a striped directory, only the master stripe will have its ".." and linkEA entries updated.
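The choice of master MDT and the -EREMOTE resend rule from the steps above can be sketched as two small decisions. This is only an illustration: `DEMO_NO_MDT`, `demo_pick_master_mdt()` and `demo_rename_precheck()` are hypothetical names, MDTs are reduced to integer indices, and the sketch covers only the routing decision, not locking or the updates themselves.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical marker meaning "the tgt object does not exist". */
#define DEMO_NO_MDT (-1)

/*
 * Client side: prefer the MDT holding the tgt object if it exists,
 * otherwise the MDT holding the src object.
 */
static int demo_pick_master_mdt(int src_obj_mdt, int tgt_obj_mdt)
{
	assert(src_obj_mdt >= 0);
	return tgt_obj_mdt != DEMO_NO_MDT ? tgt_obj_mdt : src_obj_mdt;
}

/*
 * Server side: after looking up tgt under lock, an MDT that finds the
 * tgt object on another MDT asks the client to resend there.
 */
static int demo_rename_precheck(int this_mdt, int tgt_obj_mdt)
{
	if (tgt_obj_mdt != DEMO_NO_MDT && tgt_obj_mdt != this_mdt)
		return -EREMOTE;
	return 0;
}
```

With the MDT numbering used in the example above, a client that guessed MDT2 while tgt exists on MDT4 would receive -EREMOTE from `demo_rename_precheck(2, 4)` and resend to MDT4.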
LinkEA
LinkEA is used by fid2path to build the path from an object FID. The LinkEA includes the parent FID and name. During fid2path, an MDT looks up the object's parents to build the path until the root is reached. For a striped directory, the master stripe FID is stored in the linkEA of each of the other stripes, and the parent FID of the striped directory is stored in the master stripe. As a result, if the object is under a striped directory, the MDT gets the stripe object first, then locates the master stripe, and then continues the fid2path process. For clients that do not understand striped directories (if supported), this may appear as a "//" component in the generated pathname, which will fail safe.
Change log
An operation on a striped directory is added to the change log in the same way as for a normal directory. The added operations include: create directory, unlink directory, create files under a striped directory, etc. Currently there are two users of the change log:
- lustre_rsync may be enhanced to understand striped directories:
  - If the lustre_rsync target is a Lustre file system, it will try to recreate the striped directory with the original stripe count. If it succeeds, it will reproduce all operations under the striped directory.
  - If it cannot create the striped directory with that stripe count (for example, there are not enough MDTs on the target file system), or the lustre_rsync target is not a Lustre file system, it will create a normal directory, and all striped directory operations will be converted to normal directory operations.
  - Apart from the original directory creation, all of the lustre_rsync operations proceed as normal.
- The striped directory implementation does not interact with HSM. This behaviour is consistent with DNE Phase I.
Recovery
Recovery of a striped directory will use the redo log as described in the DNE Phase II async cross-MDT operation High Level Design.
In case of on-disk corruption in a striped directory, the LFSCK Phase III MDT-MDT Consistency project will address the distributed verification and repair.
Protocol Compatibility
Since old clients (<= Lustre software version 2.4) do not understand striped directories, -ENOSUPP will be returned when a client that does not support this feature tries to access the striped EA on a new MDT (>= Lustre software version 2.6).
Disk Compatibility
The striped directory will be introduced in DNE Phase II as part of Lustre 2.6, so the incompatible LMAI_STRIPED flag is set in the LMA of the striped directory. If an MDT older than version 2.6 tries to access the striped directory, it will generate an -ENOSUPP error because this flag is not in its list of supported object features.
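The "unknown incompatible flag means refuse" check can be sketched as a bitmask test. This is a minimal sketch under assumptions: the bit values below are illustrative and not taken from this document, `demo_check_incompat()` is a hypothetical helper, and the standard `EOPNOTSUPP` errno stands in for the -ENOSUPP error named in the text.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative bit values only; the real flag encoding is not given in
 * this document. */
#define DEMO_LMAI_OLDER_FEATURES 0x00000007u
#define DEMO_LMAI_STRIPED        0x00000008u

/*
 * Refuse an object whose LMA carries any incompatible-feature bit that
 * this server does not list as supported.
 */
static int demo_check_incompat(uint32_t lma_incompat, uint32_t supported)
{
	if (lma_incompat & ~supported)
		return -EOPNOTSUPP; /* stand-in for the document's -ENOSUPP */
	return 0;
}
```

An older MDT in this model supports only `DEMO_LMAI_OLDER_FEATURES`, so `demo_check_incompat(DEMO_LMAI_STRIPED, DEMO_LMAI_OLDER_FEATURES)` fails, while a newer MDT that adds `DEMO_LMAI_STRIPED` to its supported mask accepts the same object.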
* Other names and brands may be the property of others.