DNE RemoteDirectories Implementation wiki version


Implementation milestone 1

Milestone Completion Criteria

Per the contract, three Implementation milestones have been defined by Whamcloud. This document is concerned with completing the first Implementation milestone, which is agreed as: “Demonstrate working DNE code. The sanity.sh and mdsrate-create tests will pass in a DNE environment. Suitable new regression tests for the remote directory functionality will be added and passed, including functional Use Cases for upgrade and downgrade.” These requirements, which together demonstrate working DNE code, are enumerated as:

  • sanity.sh passed.
  • mdsrate-create passed.
  • Regression tests implemented and passed.
  • Upgrade demonstrated.
  • Downgrade demonstrated.

These requirements are demonstrated below.

sanity.sh passed

The results from the sanity.sh run are recorded in maloo at: https://maloo.whamcloud.com/test_sets/4c945302-9bf0-11e1-8837-52540035b04c A screenshot of the page is available as Appendix A: sanity.sh screenshot.

mdsrate-create passed

Imp1 fig1.png

Data was collected from a Hyperion test run between April 12th and April 17th, 2012. The test configuration included 100 clients, with 4 threads on each client and an individual mount point for each thread. Each thread performs 10,000 file open/create operations within unique directories. The units of the y-axis are completed file operations per second.
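
For reference, a run of this shape could be driven with an mdsrate invocation along the following lines. This is a sketch only: the mpirun arguments, hostfile path, and mount-point layout are assumptions, and the option names are those of the mdsrate command shown in Appendix A of Milestone 3, which may differ from the build used on Hyperion.

# Sketch only: 100 clients x 4 threads = 400 MPI ranks, each with its own
# mount point, each creating 10000 files in a unique directory.
mpirun -np 400 -machinefile ./hyperion_clients \
    mdsrate --create --nfiles 10000 --ndirs 400 \
            --mntcount 4 --mntfmt='/mnt/lustre%d' \
            --dirfmt='mdsrate-dir%d' --filefmt 'f%%d'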

Regression tests implemented and passed

test_230 has been created to test DNE Remote Directories functionality:

test_230() {
       # Requires at least two MDTs; skip otherwise
       [ "$MDTCOUNT" -lt "2" ] && skip_env "skipping remote directory test" && return
       local MDTIDX=1

       # A plain mkdir creates a local directory on MDT0
       mkdir -p $DIR/$tdir/test_230_local
       local mdt_idx=$($GETSTRIPE -M $DIR/$tdir/test_230_local)
       [ $mdt_idx -ne 0 ] &&
               error "create local directory on wrong MDT $mdt_idx"

       # Create a remote directory on MDT1 and verify its MDT index
       $LFS setdirstripe -i $MDTIDX $DIR/$tdir/test_230 ||
               error "create remote directory failed"
       mdt_idx=$($GETSTRIPE -M $DIR/$tdir/test_230)
       [ $mdt_idx -ne $MDTIDX ] &&
               error "create remote directory on wrong MDT $mdt_idx"

       # Files created under the remote directory must also live on MDT1
       createmany -o $DIR/$tdir/test_230/t- 10 ||
               error "create files on remote directory failed"
       mdt_idx=$($GETSTRIPE -M $DIR/$tdir/test_230/t-0)
       [ $mdt_idx -ne $MDTIDX ] && error "create files on wrong MDT $mdt_idx"
       rm -r $DIR/$tdir || error "unlink remote directory failed"
}
run_test 230 "Create remote directory and files under the remote directory"
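
The new test can be run on its own against a multi-MDT setup through the standard lustre/tests harness, roughly as follows. This is a sketch only: the MDSCOUNT and ONLY conventions are those of the test framework, and the exact variable controlling the MDT count may differ between branches.

# Sketch only: format and mount a two-MDT test filesystem, then run test 230.
cd lustre/tests
MDSCOUNT=2 sh llmount.sh
MDSCOUNT=2 ONLY=230 sh sanity.sh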

Upgrade demonstrated

Upgrade functionality is demonstrated as part of the test test_32c in conf-sanity: https://maloo.whamcloud.com/test_sets/5cfc5278-9d2e-11e1-8587-52540035b04c

== conf-sanity test 32c: Upgrade with writeconf =================== 14:09:40 (1336932580)
Loading modules from /work/orion_release/orion/lustre-dev/lustre
../libcfs/libcfs/libcfs options: 'libcfs_panic_on_lbug=0'
debug=-1
subsystem_debug=all -lnet -lnd -pinger
../lnet/lnet/lnet options: 'networks=tcp accept=all'
gss/krb5 is not supported
/work/orion_release/orion/lustre-dev/lustre/utils/tunefs.lustre
arch
commit
kernel
list
mdt
ost
sha1sums
Upgrading from disk2_2-ldiskfs.tar.bz2, created with:
  Commit: 2.2
  Kernel: 2.6.32-220.el6_lustre.g4554b65.x86_64
    Arch: x86_64
debug=-1
mount old MDT ....
mkfs new MDT....
mkfs.lustre: Warning: default mount option `errors=remount-ro' is missing
mount new MDT....
Mount client with 2 MDTs 
Create the local directory and files on the old MDT
total: 10 creates in 0.04 seconds: 230.77 creates/second
Verify the MDT index of these files...Pass
Create the remote directory and files on new MDT
total: 10 creates in 0.05 seconds: 188.40 creates/second
Verify the MDT index of these files...Pass
Skip b1_8 images before we have 1.8 compatibility
Skip b1_8 images before we have 1.8 compatibility
Resetting fail_loc on all nodes...done.

Downgrade demonstrated

Downgrade functionality is demonstrated by the upgrade_downgrade.sh run below: a filesystem is formatted and populated with Lustre 2.3, upgraded to the DNE branch where a second MDT and a remote directory are added, then downgraded back to Lustre 2.3, after which the original files and a backup of the remote directory contents are verified.

[root@testnode1 tests]# sh -vx upgrade_downgrade.sh
#Upgrade tests
#go to lustre 2.3 make old ldiskfs
cd /work/lustre-2.3/lustre/tests
+ cd /work/lustre-2.3/lustre/tests
testnode=${testnode:-"`hostname`"}
hostname
++ hostname
+ testnode=testnode1
LOAD=y sh llmount.sh
+ LOAD=y
+ sh llmount.sh
Loading modules from /work/lustre-2.3/lustre/tests/..
debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super
subsystem_debug=all -lnet -lnd -pinger
../lnet/lnet/lnet options: 'networks=tcp accept=all'
gss/krb5 is not supported
quota/lquota options: 'hash_lqs_cur_bits=3'
set -vx
+ set -vx
../utils/mkfs.lustre --reformat --mgs --mdt --device-size=1048576 /tmp/lustre-mdt-2.3
+ ../utils/mkfs.lustre --reformat --mgs --mdt --device-size=1048576 /tmp/lustre-mdt-2.3
Permanent disk data:
Target: lustre-MDTffff
Index: unassigned
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x75
(MDT MGS needs_index first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:
formatting backing filesystem ldiskfs on /dev/loop0
target name lustre-MDTffff
4k blocks 262144
options -I 512 -i 2048 -q -O dirdata,uninit_bg,dir_nlink,huge_file,flex_bg -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre-MDTffff -I 512 -i 2048 -q -O dirdata,uninit_bg,dir_nlink,huge_file,flex_bg -E lazy_journal_init -F /dev/loop0 262144
Writing CONFIGS/mountdata
../utils/mkfs.lustre --reformat --mgsnode=$testnode --ost --device-size=1048576 /tmp/lustre-ost-2.3
+ ../utils/mkfs.lustre --reformat --mgsnode=testnode1 --ost --device-size=1048576 /tmp/lustre-ost-2.3
Permanent disk data:
Target: lustre-OSTffff
Index: unassigned
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x72
(OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.122.162@tcp
formatting backing filesystem ldiskfs on /dev/loop0
target name lustre-OSTffff
4k blocks 262144
options -I 256 -q -O extents,uninit_bg,dir_nlink,huge_file,flex_bg -G 256 -E resize=4290772992,lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre-OSTffff -I 256 -q -O extents,uninit_bg,dir_nlink,huge_file,flex_bg -G 256 -E resize=4290772992,lazy_journal_init -F /dev/loop0 262144
Writing CONFIGS/mountdata
mount -t lustre -o loop,user_xattr,acl /tmp/lustre-mdt-2.3 /mnt/mds
+ mount -t lustre -o loop,user_xattr,acl /tmp/lustre-mdt-2.3 /mnt/mds
mount -t lustre -o loop /tmp/lustre-ost-2.3 /mnt/ost1
+ mount -t lustre -o loop /tmp/lustre-ost-2.3 /mnt/ost1
mount -t lustre $testnode:/lustre /mnt/lustre
+ mount -t lustre testnode1:/lustre /mnt/lustre
cp /etc/fstab /mnt/lustre
+ cp /etc/fstab /mnt/lustre
cp /etc/hosts /mnt/lustre
+ cp /etc/hosts /mnt/lustre
umount /mnt/lustre
+ umount /mnt/lustre
umount /mnt/ost1
+ umount /mnt/ost1
umount /mnt/mds
+ umount /mnt/mds
losetup -d /dev/loop0
+ losetup -d /dev/loop0
losetup -d /dev/loop1
+ losetup -d /dev/loop1
losetup -d /dev/loop2
+ losetup -d /dev/loop2
ioctl: LOOP_CLR_FD: No such device or address
losetup -d /dev/loop3
+ losetup -d /dev/loop3
ioctl: LOOP_CLR_FD: No such device or address
losetup -d /dev/loop4
+ losetup -d /dev/loop4
ioctl: LOOP_CLR_FD: No such device or address
losetup -d /dev/loop5
+ losetup -d /dev/loop5
ioctl: LOOP_CLR_FD: No such device or address
losetup -d /dev/loop6
+ losetup -d /dev/loop6
ioctl: LOOP_CLR_FD: No such device or address
losetup -d /dev/loop7
+ losetup -d /dev/loop7
ioctl: LOOP_CLR_FD: No such device or address
LOAD=y sh llmountcleanup.sh
+ LOAD=y
+ sh llmountcleanup.sh
Stopping clients: testnode1 /mnt/lustre (opts:-f)
Stopping clients: testnode1 /mnt/lustre2 (opts:-f)
osd_ldiskfs 296768 0 
fsfilt_ldiskfs 119600 0 
mdd 426496 3 osd_ldiskfs,cmm,mdt
ldiskfs 354264 2 osd_ldiskfs,fsfilt_ldiskfs
jbd2 101384 3 osd_ldiskfs,fsfilt_ldiskfs,ldiskfs
crc16 35328 1 ldiskfs
obdclass 1109104 29 llite_lloop,lustre,obdfilter,ost,osd_ldiskfs,cmm,fsfilt_ldiskfs,mdt,mdd,mds,mgs,mgc,lov,osc,mdc,lmv,fid,fld,lquota,ptlrpc
lvfs 72256 22 llite_lloop,lustre,obdfilter,ost,osd_ldiskfs,cmm,fsfilt_ldiskfs,mdt,mdd,mds,mgs,mgc,lov,osc,mdc,lmv,fid,fld,lquota,ptlrpc,obdclass
libcfs 344320 24 llite_lloop,lustre,obdfilter,ost,osd_ldiskfs,cmm,fsfilt_ldiskfs,mdt,mdd,mds,mgs,mgc,lov,osc,mdc,lmv,fid,fld,lquota,ptlrpc,obdclass,lvfs,ksocklnd,lnet
exportfs 39296 2 fsfilt_ldiskfs,nfsd
modules unloaded.
echo "go to DNE branch do upgrade"
+ echo 'go to DNE branch do upgrade'
go to DNE branch do upgrade
cd /work/lustre-dne/lustre/tests
+ cd /work/lustre-dne/lustre/tests
LOAD=y sh llmount.sh
+ LOAD=y
+ sh llmount.sh
Loading modules from /work/lustre-dne/lustre/tests/..
../libcfs/libcfs/libcfs options: 'libcfs_panic_on_lbug=0'
debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super
subsystem_debug=all -lnet -lnd -pinger
../lnet/lnet/lnet options: 'networks=tcp accept=all'
gss/krb5 is not supported
mount -t lustre -o loop,user_xattr,acl,abort_recov,write_conf /tmp/lustre-mdt-2.3 /mnt/mds1
+ mount -t lustre -o loop,user_xattr,acl,abort_recov,write_conf /tmp/lustre-mdt-2.3 /mnt/mds1
../utils/mkfs.lustre --reformat --mgsnode=$testnode --mdt --device-size=1048576 --index 1 /tmp/lustre-mdt-new
+ ../utils/mkfs.lustre --reformat --mgsnode=testnode1 --mdt --device-size=1048576 --index 1 /tmp/lustre-mdt-new
Permanent disk data:
Target: lustre:MDT0001
Index: 1
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x61
(MDT first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.122.162@tcp
formatting backing filesystem ldiskfs on /dev/loop1
target name lustre:MDT0001
4k blocks 262144
options -I 512 -i 2048 -q -O dirdata,uninit_bg,dir_nlink,huge_file,flex_bg -E lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre:MDT0001 -I 512 -i 2048 -q -O dirdata,uninit_bg,dir_nlink,huge_file,flex_bg -E lazy_journal_init -F /dev/loop1 262144
Writing CONFIGS/mountdata
mount -t lustre -o loop,user_xattr,acl,abort_recov,write_conf /tmp/lustre-mdt-new /mnt/mds2
+ mount -t lustre -o loop,user_xattr,acl,abort_recov,write_conf /tmp/lustre-mdt-new /mnt/mds2
mount -t lustre -o loop,abort_recov,write_conf /tmp/lustre-ost-2.3 /mnt/ost1
+ mount -t lustre -o loop,abort_recov,write_conf /tmp/lustre-ost-2.3 /mnt/ost1
mount -t lustre $testnode:/lustre /mnt/lustre
+ mount -t lustre testnode1:/lustre /mnt/lustre
diff /mnt/lustre/fstab /etc/fstab || { echo "the file is diff1" && exit 1; }
+ diff /mnt/lustre/fstab /etc/fstab
diff /mnt/lustre/hosts /etc/hosts || { echo "the file is diff1" && exit 1; }
+ diff /mnt/lustre/hosts /etc/hosts
../utils/lfs setdirstripe -i 1 /mnt/lustre/test_mdt1 || { echo "create remote directory failed" && exit 1; }
+ ../utils/lfs setdirstripe -i 1 /mnt/lustre/test_mdt1
mdt_idx=$(../utils/lfs getstripe -M /mnt/lustre/test_mdt1)
../utils/lfs getstripe -M /mnt/lustre/test_mdt1
++ ../utils/lfs getstripe -M /mnt/lustre/test_mdt1
+ mdt_idx=1
[ $mdt_idx -ne 1 ] && { echo "create remote directory on wrong MDT" && exit 1; }
+ '[' 1 -ne 1 ']'

mkdir /mnt/lustre/test_mdt1/dir
+ mkdir /mnt/lustre/test_mdt1/dir
mdt_idx=$(../utils/lfs getstripe -M /mnt/lustre/test_mdt1/dir)
../utils/lfs getstripe -M /mnt/lustre/test_mdt1/dir
++ ../utils/lfs getstripe -M /mnt/lustre/test_mdt1/dir
+ mdt_idx=1
[ $mdt_idx -ne 1 ] && { echo "create remote directory on wrong MDT" && exit 1; }
+ '[' 1 -ne 1 ']'
cp /mnt/lustre/hosts /mnt/lustre/test_mdt1/dir/hosts
+ cp /mnt/lustre/hosts /mnt/lustre/test_mdt1/dir/hosts
cp /mnt/lustre/fstab /mnt/lustre/test_mdt1/dir/fstab
+ cp /mnt/lustre/fstab /mnt/lustre/test_mdt1/dir/fstab
echo "downgrade DNE to single MDT"
+ echo 'downgrade DNE to single MDT'
downgrade DNE to single MDT
mkdir /mnt/lustre/test_mdt1_backup
+ mkdir /mnt/lustre/test_mdt1_backup
cp -R /mnt/lustre/test_mdt1/ /mnt/lustre/test_mdt1_backup/
+ cp -R /mnt/lustre/test_mdt1/ /mnt/lustre/test_mdt1_backup/
../utils/lctl dk > /tmp/debug.out
+ ../utils/lctl dk
umount /mnt/lustre/
+ umount /mnt/lustre/
umount /mnt/mds2
+ umount /mnt/mds2
umount /mnt/mds1
+ umount /mnt/mds1
umount /mnt/ost1
+ umount /mnt/ost1
LOAD=y sh llmountcleanup.sh
+ LOAD=y
+ sh llmountcleanup.sh
Stopping clients: testnode1 /mnt/lustre (opts:-f)
Stopping clients: testnode1 /mnt/lustre2 (opts:-f)
osd_ldiskfs 343056 0 
fsfilt_ldiskfs 43776 0 
ldiskfs 354392 2 osd_ldiskfs,fsfilt_ldiskfs
mdd 338320 3 osd_ldiskfs,cmm,mdt
fld 113776 7 osd_ldiskfs,lod,obdfilter,cmm,mdt,lmv,fid
obdclass 1221936 28 llite_lloop,lustre,osd_ldiskfs,osp,lod,obdfilter,ost,cmm,mdt,mdd,mgs,mgc,lov,osc,mdc,lmv,fid,fld,ptlrpc
lvfs 59024 22 llite_lloop,lustre,osd_ldiskfs,fsfilt_ldiskfs,osp,lod,obdfilter,ost,cmm,mdt,mdd,mgs,mgc,lov,osc,mdc,lmv,fid,fld,ptlrpc,obdclass
libcfs 344192 24 llite_lloop,lustre,osd_ldiskfs,fsfilt_ldiskfs,osp,lod,obdfilter,ost,cmm,mdt,mdd,mgs,mgc,lov,osc,mdc,lmv,fid,fld,ptlrpc,obdclass,lvfs,ksocklnd,lnet
jbd2 101384 2 osd_ldiskfs,ldiskfs
crc16 35328 1 ldiskfs
modules unloaded.
losetup -d /dev/loop0
+ losetup -d /dev/loop0
losetup -d /dev/loop1
+ losetup -d /dev/loop1
losetup -d /dev/loop2
+ losetup -d /dev/loop2
losetup -d /dev/loop3
+ losetup -d /dev/loop3
ioctl: LOOP_CLR_FD: No such device or address
losetup -d /dev/loop4
+ losetup -d /dev/loop4
ioctl: LOOP_CLR_FD: No such device or address
losetup -d /dev/loop5
+ losetup -d /dev/loop5
ioctl: LOOP_CLR_FD: No such device or address
losetup -d /dev/loop6
+ losetup -d /dev/loop6
ioctl: LOOP_CLR_FD: No such device or address
losetup -d /dev/loop7
+ losetup -d /dev/loop7
ioctl: LOOP_CLR_FD: No such device or address
cd /work/lustre-2.3/lustre/tests
+ cd /work/lustre-2.3/lustre/tests
LOAD=y sh llmount.sh
+ LOAD=y
+ sh llmount.sh
Loading modules from /work/lustre-2.3/lustre/tests/..
debug=vfstrace rpctrace dlmtrace neterror ha config ioctl super
subsystem_debug=all -lnet -lnd -pinger
../lnet/lnet/lnet options: 'networks=tcp accept=all'
gss/krb5 is not supported
quota/lquota options: 'hash_lqs_cur_bits=3'
/work/lustre-2.3/lustre/utils/tunefs.lustre --writeconf /tmp/lustre-mdt-2.3
+ /work/lustre-2.3/lustre/utils/tunefs.lustre --writeconf /tmp/lustre-mdt-2.3
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x5
(MDT MGS )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:
Permanent disk data:
Target: lustre-MDT0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x105
(MDT MGS writeconf )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters:
Writing CONFIGS/mountdata
/work/lustre-2.3/lustre/utils/tunefs.lustre --writeconf /tmp/lustre-ost-2.3
+ /work/lustre-2.3/lustre/utils/tunefs.lustre --writeconf /tmp/lustre-ost-2.3
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.122.162@tcp
Permanent disk data:
Target: lustre-OST0000
Index: 0
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x102
(OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.122.162@tcp
Writing CONFIGS/mountdata
mount -t lustre -o loop,user_xattr,acl,abort_recov /tmp/lustre-mdt-2.3 /mnt/mds
+ mount -t lustre -o loop,user_xattr,acl,abort_recov /tmp/lustre-mdt-2.3 /mnt/mds
mount -t lustre -o loop,abort_recov /tmp/lustre-ost-2.3 /mnt/ost1
+ mount -t lustre -o loop,abort_recov /tmp/lustre-ost-2.3 /mnt/ost1
mount -t lustre $testnode:/lustre /mnt/lustre
+ mount -t lustre testnode1:/lustre /mnt/lustre
diff /mnt/lustre/fstab /etc/fstab || { echo "the file is different" && exit 1; }
+ diff /mnt/lustre/fstab /etc/fstab
diff /mnt/lustre/hosts /etc/hosts || { echo "the file is different" && exit 1; }
+ diff /mnt/lustre/hosts /etc/hosts
diff /mnt/lustre/test_mdt1_backup/test_mdt1/dir/fstab /etc/fstab || { echo "the file is different" && exit 1; }
+ diff /mnt/lustre/test_mdt1_backup/test_mdt1/dir/fstab /etc/fstab
diff /mnt/lustre/test_mdt1_backup/test_mdt1/dir/hosts /etc/hosts || { echo "the file is different" && exit 1; }
+ diff /mnt/lustre/test_mdt1_backup/test_mdt1/dir/hosts /etc/hosts
umount /mnt/lustre
+ umount /mnt/lustre
umount /mnt/mds
+ umount /mnt/mds
umount /mnt/ost1
+ umount /mnt/ost1


Implementation Milestone 2

Milestone Completion Criteria

Per the contract, three Implementation milestones have been defined by Whamcloud. This document is concerned with completing the second Implementation milestone, which is agreed as: “Demonstrate DNE recovery and failover. Suitable DNE-specific recovery and failover tests will be added and passed.” These requirements are demonstrated below.

Demonstrate DNE recovery and fail-over

Recovery and fail-over testing is performed by the replay-dual, replay-single and recovery-small tests.

replay-dual

Verify recovery from two clients after server failure. https://maloo.whamcloud.com/test_sets/7767e970-24cd-11e2-9e7c-52540035b04c

recovery-small

Verify RPC replay after communications failure. https://maloo.whamcloud.com/test_sets/76bb0e82-24cb-11e2-9e7c-52540035b04c

replay-single

https://maloo.whamcloud.com/test_sets/dd004406-24ca-11e2-9e7c-52540035b04c The test platform is the OpenSFS Functional Test Cluster. The test run containing these results is recorded in Maloo at: https://maloo.whamcloud.com/test_sessions/4f6f3d5a-24bf-11e2-9e7c-52540035b04c A screenshot of the test session is included as Appendix A.
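
For reference, these suites can be pointed at a DNE configuration through the standard lustre/tests harness. The following is a minimal sketch only; MDSCOUNT selects the number of MDTs, while failover nodes, NIDs, and device names come from the site test configuration file and are not shown here.

# Sketch only: run the recovery suites against a two-MDT test filesystem.
cd lustre/tests
MDSCOUNT=2 sh llmount.sh
MDSCOUNT=2 sh recovery-small.sh
MDSCOUNT=2 sh replay-single.sh
MDSCOUNT=2 sh replay-dual.sh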

DNE-specific recovery and failover tests added and passed

Change / Commit Message

  • 4318 tests: add parallel sanity tests to dne
  • 4319 dne: add dne test into insanity.sh
  • 4320 dne: add remote dir test to recovery-xxx-scale
  • 4321 dne: add remote dir check in replay-vbr.
  • 4367 tests: DNE fixes for conf sanity.
  • 4366 tests: Add dne specific tests to sanityN
  • 4365 tests: add create remote directory to racer
  • 4364 tests: add DNE upgrade tests.
  • 4363 tests: support multiple node fails
  • 4362 tests: add dne test cases in replay-single
  • 4361 tests: add dne tests cases in replay-dual
  • 4360 tests: add DNE specific tests in recovery-small
  • 4359 tests: Add test_mkdir in sanity for DNE
  • 4358 tests: Add DNE test cases in sanity 230.

Appendix A

Imp2fig1.png

NOTE: Tests unrelated to recovery and failover exhibit a small number of failures. These result from unresolved issues on the Master branch. DNE is now based on Master, so these failures will be resolved in DNE as the corresponding fixes land on Master.


Implementation Milestone 3

Milestone Completion Criteria

Per the contract, three Implementation milestones have been defined by Intel. This document is concerned with completing the third Implementation milestone, which is agreed as: “Performance and scaling testing will be run on available testing resources. The Lustre Manual will be updated to include DNE Documentation.” These requirements are demonstrated below.

Performance of DNE code on OpenSFS Cluster

Performance tests were run with the 'mdsrate' MPI program on 24 clients of the OpenSFS Functional Test Cluster. Each client has 8 mount points, giving a total of 192 threads driving load to the file system. Each thread operates on 20K files for a given test, resulting in a total test load of 3840K files per run. Each test was run once. The OpenSFS Cluster hardware is described in Appendix A.

Performance measurements on 2012-12-17

Imp3fig1.png The performance of create, unlink, and mkdir on the current Master branch is significantly below the level of Lustre 2.3. DNE code has not yet landed on Master, so it is not responsible for this regression. The performance of the DNE-enabled code is comparable to the current Master branch. Further investigation indicates that the regression is caused by quota accounting (LU-2442), which is enabled by default in 2.4 but not in Lustre 2.3 or earlier. The quota code serializes metadata operations, which defeats the SMP scaling introduced in 2.3. DNE scaling will be re-tested with quota disabled as part of the next DNE milestone.

Performance measurements on 2012-12-20

Imp3fig2.png Results from the ongoing analysis are shown in the figure above. These results illustrate that, after the initial analysis, fixed builds of Master and DNE could be rapidly tested and the results made available. Overall performance for both Master and DNE is improved compared with the results from 2012-12-17. Work continues on the analysis of the performance differences, which still suggest a significant regression.

Scaling of DNE code on OpenSFS Cluster

Scaling tests were performed on 2012-12-17 using Master and DNE code from the same date. This code displays poor performance compared to the code tested on 2012-12-20, shown in Figure 2. Four tests of typical operations have been completed. These include:

  • create
  • stat
  • unlink
  • mknod

Scale tests were run with the 'mdsrate' MPI program on 24 clients of the OpenSFS Functional Test Cluster. Each client has 8 mount points, giving a total of 192 threads driving load to the file system. Each thread operates on 20K files for a given test, resulting in a total test load of 3840K files per run. Each test was run once. The OpenSFS Cluster hardware is described in Appendix A. In all cases, projected values are estimated as a linear extrapolation from the best performing MDS/MDT configuration.
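
As a concrete illustration of the extrapolation, the sketch below (using made-up numbers, not measurements from these runs) shows the arithmetic: the aggregate rate of the best performing configuration is scaled linearly to the projected MDT count.

# Hypothetical numbers for illustration only.
best_mdts=4          # best performing configuration (e.g. 4 MDS/MDT for create)
best_rate=80000      # measured aggregate operations/sec at that configuration
projected_mdts=8     # configuration to project to
# Linear projection: the rate is assumed to scale with the number of MDTs.
projected_rate=$(( best_rate * projected_mdts / best_mdts ))
echo "projected: ${projected_rate} ops/sec at ${projected_mdts} MDTs"   # 160000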

create scaling

Imp3fig3.png Projected values are calculated by making a linear projection from the best performing configuration. In the case of 'create', the 4 MDS/MDT configuration performs best.

stat scaling

Imp3fig4.png Projected values are calculated by making a linear projection from the best performing configuration. In the case of 'stat', the 1 MDS/MDT configuration performs best.

unlink scaling

Imp3fig5.png Projected values are calculated by making a linear projection from the best performing configuration. In the case of 'unlink', the 4 MDS/MDT configuration performs best.

mknod scaling

Imp3fig6.png Projected values are calculated by making a linear projection from the best performing configuration. In the case of 'mknod', the 4 MDS/MDT configuration performs best.


In the case of the stat operation, scaling between 1 and 2 MDTs increases linearly. An initial analysis suggests that the client load may reach a maximum at ~350K stats/second and be unable to drive the additional capacity of larger MDT configurations. Further performance testing, analysis, and improvements will be performed as part of the next milestone. Create, unlink, and mknod illustrate that a single MDT under load performs worse than two MDTs under the same load. An initial analysis suggests that as MDTs are added, additional memory and disk cache become available with the new MDS nodes; since the total load is constant, each MDS handles fewer inodes.

Update Lustre Manual

The Lustre Manual update is currently in review at http://review.whamcloud.com/4773. It covers the following topics:

  • adding an MDT
  • removing an MDT
  • upgrading to multiple-MDT configurations
  • designing active-active MDS configurations
  • a warning against chained remote directories

The changes were made against a snapshot of the Manual taken on 2012-12-17. An illustrative command sequence for the first two topics is sketched below.
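
As a rough illustration of adding an MDT and placing a remote directory on it, the commands below follow the pattern already shown in the upgrade log above (mkfs.lustre with an explicit MDT index, mount, then lfs setdirstripe). The device and mount-point names are assumptions; removing an MDT additionally requires migrating its contents back to the remaining MDTs first, for which the Manual text should be consulted.

# Sketch only: add a second MDT (index 1) to an existing filesystem "lustre".
# Device and mount point names are illustrative.
mkfs.lustre --mgsnode=mgsnode@tcp --mdt --index=1 --fsname=lustre /dev/sdb
mount -t lustre /dev/sdb /mnt/mdt1

# Place a new remote directory on the new MDT.
lfs setdirstripe -i 1 /mnt/lustre/remote_dir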

Conclusion

Measurements performed on 2012-12-17 indicated that the Master branch exhibited a performance regression compared to Lustre 2.3; DNE performance on 2012-12-17 was found to be on par with Master performance on the same date. An initial analysis was conducted and new builds of both Master and DNE were prepared. On 2012-12-20 the performance measurements of Master and DNE were repeated. These results show that the performance of Master and DNE has significantly improved, but some measurements (stat, unlink) remain behind 2.3. Once the performance regression in Master is understood and resolved, DNE performance will be re-tested against the improved Master performance. The stat operation scales well between one and two MDTs but does not continue to show linear performance scaling up to four MDTs (the only other measurement). More investigation will be performed for the next milestone, including testing with three MDTs. Implementation phase 3 has been completed according to the agreed criteria.

Appendix A: OpenSFS Functional Test Cluster configuration and specification

MDS server

(2) Intel E5620 2.4GHz Westmere (Total 8 Cores)
(1) 64GB DDRIII 1333MHz ECC/REG - (8x8GB Modules Installed)
(1) On Board Dual 10/100/1000T Ports
(1) 500GB SATA Enterprises 24x7
(1) 40GB SSD OCZ SATA
(8) Hot Swap Drive Bays for SATA/SAS
(6) PCi-e Slots 8X
(3) QDR 40GB QSFP to QSFP iB Cables
(3) Mellanox QDR 40GB QSFP Single Port

Each pair of MDS nodes shares one storage enclosure, configured as below:

(1) Intel E5620 2.4GHz Westmere (Total 4 Cores)
(1) 12GB DDRIII 1333MHz ECC/REG - (3x4GB Modules Installed)
(1) On Board Dual 10/100/1000T Ports
(1) On Board IPMI 2.0 Via 3rd. Lan
(6) PCi-e Slots 8X
(2) Mellanox QDR 40GB QSFP Single Port (Connected to MDS Server)
(1) LSI/3Ware 9750SA-4i with BBU installed (raid 0/1/5/6...)
(1) SM826E16-R920LPB 12 Bays 2U Case with Dual 920W PS,
(10) 2TB Enterprises HDDs. 24x7 SATA II
(2) 120GB SSD (Raid 1)

The MDTs are configured with an external journal on a local SSD (size = 7GB).

OSS server

(2) Intel E5620 2.4GHz Westmere (Total 8 Cores)
(2) Copper Base CP0217 CPU Cooler 1U with Heat Pipe
(1) 64GB DDRIII 1333MHz ECC/REG - (8x8GB Modules Installed)
(1) On Board Dual 10/100/1000T Ports
(1) On Board VGA
(1) On Board IPMI 2.0 Via 3rd. Lan
(1) 500GB SATA Enterprises 24x7
(1) 40GB SSD OCZ SATA
(8) Hot Swap Drive Bays for SATA/SAS
(6) PCi-e Slots 8X
(3) QDR 40GB QSFP to QSFP iB Cables
(3) Mellanox QDR 40GB QSFP Single Port

Each pair of OSS nodes shares one storage enclosure, configured as below:

(1) Intel E5620 2.4GHz Westmere (Total 4 Cores)
(1) Copper Base CP0217 CPU Cooler 1U with Heat Pipe
(1) 12GB DDRIII 1333MHz ECC/REG - (3x4GB Modules Installed)
(1) On Board Dual 10/100/1000T Ports
(6) PCi-e Slots 8X
(2) Mellanox QDR 40GB QSFP Single Port (Connected to OSS Server)
(1) LSI/3Ware 9750SA-4i with BBU installed (raid 0/1/5/6...)
(1) SM846E16-R1200B 24 Bays 4U Case with Dual 1200W PS,
(20) 2TB Enterprises HDDs. 24x7 SATA II (2 x 8+2 Raid 6)
(4) 120GB SSD (Raid 1, 2+2)

All nodes run RHEL 6.3/x86_64.

DNE build: http://review.whamcloud.com/4414
Master build: http://review.whamcloud.com/4614

Test command:

/usr/lib64/openmpi/bin/mpirun -mca plm_ssh_agent rsh -np 192 -machinefile /home/minh.diep/bin/machinefile /home/minh.diep/bin/mdsrate --create|stat|unlink|mknod --mdtcount ## --mntcount 8 --mntfmt='/mnt/lustre%d' --dirfmt='dnedir%d' --nfiles 20000 --ndirs 192 --filefmt 'g%%d'
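
For example, a single 'create' run against a 4-MDT configuration would instantiate the template above roughly as follows. Only the operation and the MDT count are filled in, as an illustration; all other arguments are taken verbatim from the command above.

# Illustrative instantiation of the command template for a create run on 4 MDTs.
/usr/lib64/openmpi/bin/mpirun -mca plm_ssh_agent rsh -np 192 \
    -machinefile /home/minh.diep/bin/machinefile \
    /home/minh.diep/bin/mdsrate --create --mdtcount 4 \
    --mntcount 8 --mntfmt='/mnt/lustre%d' --dirfmt='dnedir%d' \
    --nfiles 20000 --ndirs 192 --filefmt 'g%%d'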