Single-shared file performance in Lustre depends heavily on the interaction between where each task places its data within the file and how the file is laid out across Lustre OSTs (striping).
In IOR single-shared file mode, a group of processes all write to a single file. A number of parameters control how those processes lay out and transfer the data to the storage system. Each process makes a series of sequential write calls of TransferSize bytes each into an individual IOR block, totaling BlockSize bytes of contiguous data. One block from each process is laid out sequentially into a segment, and multiple segments make up the whole file. The stride is the distance between two consecutive blocks from the same process, and is equal to the SegmentSize (the number of processes times BlockSize).
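The block and segment arithmetic described above can be sketched as follows. This is a minimal illustration, not IOR's own code: the function name `ior_offset` and its parameter names are hypothetical, and it assumes the segmented layout described in this section, with BlockSize a whole multiple of TransferSize.

```python
MiB = 1 << 20  # bytes in one mebibyte


def ior_offset(rank, segment, transfer, num_tasks, block_size, transfer_size):
    """Byte offset in the shared file of one write call (hypothetical helper)."""
    if block_size % transfer_size != 0:
        raise ValueError("block_size must be a multiple of transfer_size")
    if transfer >= block_size // transfer_size:
        raise ValueError("transfer index lies outside the block")
    segment_size = num_tasks * block_size  # one block per task in each segment
    return segment * segment_size + rank * block_size + transfer * transfer_size


# Example: 4 tasks, 4 MiB blocks, 1 MiB transfers.
first = ior_offset(1, 0, 0, num_tasks=4, block_size=4 * MiB, transfer_size=MiB)
second = ior_offset(1, 1, 0, num_tasks=4, block_size=4 * MiB, transfer_size=MiB)
print(first // MiB, second // MiB, (second - first) // MiB)  # 4 20 16
```

The last printed value is the stride for rank 1: 16 MiB, which equals the segment size of 4 tasks times 4 MiB.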
None of this has any impact on how Lustre physically organizes the data among its OSTs (though it may well have a large performance impact). The Lustre layout is controlled entirely by the striping parameters StripeSize and StripeCount. StripeCount is the number of OSTs across which Lustre stripes a given file, RAID0 style. StripeSize is the size of each of those RAID0 stripes: the amount of data Lustre stores on one OST before moving on to the next OST.
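As a rough sketch of the RAID0 mapping just described, the helper below converts a file offset into a stripe number, an OST position, and an offset within that OST's object. The name `lustre_location` is hypothetical, and the OST index it returns is a position within the file's layout (0 to StripeCount - 1), not a real OST number on the filesystem.

```python
MiB = 1 << 20  # bytes in one mebibyte


def lustre_location(offset, stripe_size, stripe_count):
    """Map a file offset to (stripe index, OST position, offset in the OST object).

    Assumes a plain RAID0 layout: stripes are dealt round-robin to the OSTs.
    """
    stripe_index = offset // stripe_size          # which stripe of the file
    ost_position = stripe_index % stripe_count    # round-robin over the OSTs
    stripes_before = stripe_index // stripe_count # earlier stripes on this OST
    object_offset = stripes_before * stripe_size + offset % stripe_size
    return stripe_index, ost_position, object_offset


# Example: 1 MiB stripes over 4 OSTs; byte 5 MiB is the sixth stripe of the file.
print(lustre_location(5 * MiB, stripe_size=MiB, stripe_count=4))  # (5, 1, 1048576)
```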
When IOR blocks don't line up on Lustre stripe boundaries, a number of problems manifest:
- Extra bulk RPCs are required, as a single TransferSize write might be split across two different OSTs, requiring two separate RPCs. Fewer, larger RPCs are more efficient in Lustre.
- Multiple IOR processes may be writing into the same Lustre stripe simultaneously. While this is perfectly legitimate in Lustre, it means that the OST must break the extent lock (the "ownership rights" for a range of the stripe) into smaller pieces. This lock negotiation itself requires RPC interaction between the OST and clients, impacting performance.
- Different processes writing to the same OST at the same time cause disk drive head contention, as the drive tries to write to two different positions within the stripe.
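The first two problems in the list above can be checked with simple arithmetic. The sketch below is illustrative only: `transfer_is_split` and `ranks_per_stripe` are hypothetical helpers, and they look at a single segment whose first block starts at offset 0.

```python
KiB = 1 << 10
MiB = 1 << 20


def transfer_is_split(offset, transfer_size, stripe_size):
    """True if a write of transfer_size bytes at offset crosses a stripe boundary."""
    first_stripe = offset // stripe_size
    last_stripe = (offset + transfer_size - 1) // stripe_size
    return first_stripe != last_stripe


def ranks_per_stripe(num_tasks, block_size, stripe_size):
    """Map each stripe index to the set of ranks whose block overlaps it."""
    owners = {}
    for rank in range(num_tasks):
        start = rank * block_size
        end = start + block_size  # exclusive end of this rank's block
        for stripe in range(start // stripe_size, (end - 1) // stripe_size + 1):
            owners.setdefault(stripe, set()).add(rank)
    return owners


# Misaligned example: 1.5 MiB blocks on 1 MiB stripes.
print(transfer_is_split(512 * KiB, MiB, MiB))        # True
print(ranks_per_stripe(2, 1536 * KiB, MiB))          # {0: {0}, 1: {0, 1}, 2: {1}}
```

In this example stripe 1 is shared by ranks 0 and 1, which is the situation that forces the extent lock to be split.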
Using a stripe size that is an exact multiple of the block size avoids the split-write issue, but still requires the extent locks to be split among the blocks, and suffers the same drive head contention.
Setting the block size to a multiple of the stripe size is better: it removes both the split writes and the extent lock splitting. It still suffers from a different sort of contention, however. Two different processes may be writing to two different stripes of the same stripe object (the collection of stripes for a given file on a single OST), so drive head contention remains.
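To illustrate the large-block case, the sketch below lists which ranks touch each OST within one segment when the block size is a whole multiple of the stripe size. The helper `ranks_per_ost` is hypothetical, and OST indices are positions within the file's layout rather than real OST numbers.

```python
from collections import defaultdict

MiB = 1 << 20


def ranks_per_ost(num_tasks, block_size, stripe_size, stripe_count, segment=0):
    """Map OST position -> set of ranks writing to it during one segment."""
    if block_size % stripe_size != 0:
        raise ValueError("this sketch assumes block_size is a multiple of stripe_size")
    stripes_per_block = block_size // stripe_size
    result = defaultdict(set)
    for rank in range(num_tasks):
        # Index of the first stripe covered by this rank's block in this segment.
        first_stripe = (segment * num_tasks + rank) * stripes_per_block
        for stripe in range(first_stripe, first_stripe + stripes_per_block):
            result[stripe % stripe_count].add(rank)
    return dict(result)


# 4 tasks, 2 MiB blocks, 1 MiB stripes over 4 OSTs.
print(ranks_per_ost(4, 2 * MiB, MiB, 4))
# {0: {0, 2}, 1: {0, 2}, 2: {1, 3}, 3: {1, 3}}
```

No stripe is shared here, so no extent lock needs splitting, yet every OST still serves two ranks through the same stripe object.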
When block sizes align exactly with stripe sizes, an ideal 1:1 correspondence between IOR processes and OSTs can be achieved, but only if the number of stripes is an integral multiple of the number of processes. Otherwise, different processes might try to write to the same OST, with results similar to the large block case. Also note that since Lustre's optimal network RPC size is 1MB, the ideal transfer size is also 1MB.
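The aligned case can be checked the same way. The sketch below assumes BlockSize equals StripeSize, reads "number of stripes" as StripeCount, and assumes all processes work on the same segment at the same time. Both helper names are hypothetical.

```python
def ost_for_block(rank, segment, num_tasks, stripe_count):
    """OST position holding a rank's block when BlockSize == StripeSize."""
    return (segment * num_tasks + rank) % stripe_count


def shared_osts(num_tasks, stripe_count, segment=0):
    """OST positions targeted by more than one rank within one segment."""
    targets = {}
    for rank in range(num_tasks):
        ost = ost_for_block(rank, segment, num_tasks, stripe_count)
        targets.setdefault(ost, []).append(rank)
    return {ost: ranks for ost, ranks in targets.items() if len(ranks) > 1}


print(shared_osts(num_tasks=4, stripe_count=4))  # {}
print(shared_osts(num_tasks=4, stripe_count=8))  # {}
print(shared_osts(num_tasks=6, stripe_count=4))  # {0: [0, 4], 1: [1, 5]}
```

With 4 processes on 4 or 8 OSTs, no OST is shared within a segment. With 6 processes on 4 OSTs, two OSTs each receive writes from two ranks at once.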
Original PDF: File:SSFstriping.pdf