Zero to Hero

==== Starting from scratch ====


HPC systems provide value through their ability to run big jobs faster than other systems. It is natural to look for performance metrics; users need to know if your system can run their job, managers want to know if they received good value from their vendor, and so on.  
 
 
When a system is procured, the vendor often provides peak performance numbers. These are based on the theoretical maximum for each component. Storage benchmarks can often run at 80% or more of the theoretical peak. Good results are almost never achieved on the first attempt. What follows is a guide to improving benchmark results until they reach the level you are after.


==== First attempts ====
A normal test of read and write rates begins with a program that creates random data and writes it to a file. The new file is then read by the processor and sent to /dev/null.  
 
 
The first time you try this, expect to be underwhelmed. It's far from the peak that you expected. What could be wrong? Where is the bottleneck?  Is the problem with IO parameters such as the block size, or is there a problem with parallelism? 


 
Start over: restrict your test to one processor and one IO device. Determine the optimal parameters for this configuration. Then see if another processor helps. See if you can double the speed when you are using two IO devices. Find out if one processor can drive two IO devices at twice the speed of one device.


=== dd ===
 
 
If you have done this sort of thing before, you are probably familiar with the dd command.  If not, it’s time to go to the man page for dd and check it out. Then run something like:


'''dd if=/dev/zero of=/mnt/filesystem/testdir/file bs=4k count=1M'''


Compare the results from your test program (if you started with another test program) with the dd results. Vary the block size. You should see improvement as you start to add threads, but the returns are probably diminishing. Make sure your network isn't the bottleneck; if you've worked with networked filesystems before, you may have run into this already.
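
As a rough sketch (the mount point is the same placeholder used in the command above, and the sizes are only examples), a block-size sweep, a read-back test, and a pair of parallel streams might look like this:

 # Hypothetical sweep: write roughly the same amount of data at several block sizes.
 dd if=/dev/zero of=/mnt/filesystem/testdir/file bs=64k count=64k
 dd if=/dev/zero of=/mnt/filesystem/testdir/file bs=1M count=4k
 dd if=/dev/zero of=/mnt/filesystem/testdir/file bs=4M count=1k
 
 # Read the file back and discard it, to test the read side.
 dd if=/mnt/filesystem/testdir/file of=/dev/null bs=1M
 
 # Two streams in parallel; compare the aggregate rate against a single stream.
 dd if=/dev/zero of=/mnt/filesystem/testdir/file1 bs=1M count=4k &
 dd if=/dev/zero of=/mnt/filesystem/testdir/file2 bs=1M count=4k &
 wait

Keep in mind that reading back a file you just wrote may be served from the client cache, so for read numbers you may want to write from one client and read from another, or otherwise make sure the cache is cold.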
 
=== More clients ===
Add another client. Take the scripts and configurations you've got from your single-client dd runs and run them on multiple clients; with two clients you should get roughly double the performance, and now you're getting somewhere. So you start adding 4, then 8, then more clients, watching the numbers creep up. The limitations of dd are becoming obvious: it's getting hard to script all this and keep things coordinated.
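
A minimal sketch of what that coordination can look like, assuming passwordless ssh to the clients (the hostnames and mount point here are hypothetical):

 #!/bin/bash
 # Start one dd stream on each client, wait for all of them to finish,
 # and keep each client's dd output (dd reports its rate on stderr).
 CLIENTS="client01 client02 client03 client04"
 for host in $CLIENTS; do
     ssh "$host" "dd if=/dev/zero of=/mnt/filesystem/testdir/file.$host bs=1M count=4k" \
         > "dd.$host.out" 2>&1 &
 done
 wait

Tools such as pdsh can do the same job with less scripting, but either way the bookkeeping grows with the client count, which is exactly the problem the cluster-aware benchmarks below are meant to solve.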
 
=== Running IOR on the Cluster ===


The first thing to do with IOR is to get it compiled and running. Using the parameters you found best with dd, run IOR and confirm that with 4 or 8 processors, IOR gives results that are very close to those from an ensemble of dd runs.
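
As a starting point, a file-per-process IOR run with 8 processes might look something like the line below; the transfer size, block size, and path are only stand-ins for the values you settled on with dd, and the binary name depends on how your build installed it:

'''mpirun -np 8 ior -a POSIX -w -r -F -t 1m -b 4g -o /mnt/filesystem/testdir/ior_file'''

The -F option gives each process its own file, which is the closest match to an ensemble of independent dd runs; dropping it switches to a single shared file, which is a different (and usually harder) workload.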


IOR, IOzone and xdd are the tools that the Hero Run task group uses. Select your favorite and become familiar with it. Write scripts, run them on your cluster, and use them for testing when you get new hardware.
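
Such a script can be as small as a loop over the parameters you care about; the transfer sizes, process count, and path below are placeholders:

 #!/bin/bash
 # Hypothetical parameter sweep: rerun IOR at several transfer sizes and
 # keep the output files so results can be compared across runs or hardware.
 for t in 64k 256k 1m 4m; do
     mpirun -np 8 ior -a POSIX -w -r -F -t "$t" -b 4g \
         -o /mnt/filesystem/testdir/ior_file > "ior.t$t.out"
 done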
 


These codes are also useful for diagnostics. For example, if you have one drive group running at half speed and seven drive groups running at full speed, you will have difficulty finding the slow drive group if all of the tests you run are eight-way parallel.
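
For instance, if the filesystem in question is Lustre, one way to isolate a single drive group is to pin a test file to one OST before writing to it; the OST index and path below are only examples:

 lfs setstripe -c 1 -i 3 /mnt/filesystem/testdir/ost3_file
 dd if=/dev/zero of=/mnt/filesystem/testdir/ost3_file bs=1M count=4k

Repeating this for each OST index tests the drive groups one at a time, so a slow one stands out instead of hiding inside the aggregate number.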


== References ==
Xyratex [http://goo.gl/7AWwQ Benchmarking Methodology] paper
