Thanks to the new campus agreement with Mathworks, Matlab Distributed Computing Server is available and installed on the Cosine cluster.
In order to use Matlab, the module must be loaded:
$ module avail matlab
$ module load matlab/R2014b
Interactive Matlab
Interactive Matlab sessions can be run in text-only mode or using the full Matlab GUI:
- the matlab command will try to start the Matlab desktop GUI using X-Windows; if X is not available, a text-only session will be started
- to specify a text-only interactive Matlab session, use matlab -nodisplay
If you want to run a text-only Matlab session, you should:
- log in to the cluster, e.g. using ssh ...
- start a session on a node on the cluster using qlogin -pe orte <numberOfCoresRequested>
- from this session, you should:
  - load the appropriate Matlab module, e.g. module load matlab
  - start Matlab using matlab -nodisplay
Non-Interactive Matlab Jobs
There are two main categories of non-interactive Matlab SGE jobs that you can run on the cluster:
- array jobs run multiple copies of a job across the cluster differentiated by a task ID;
- distributed jobs use Matlab Distributed Computing Server (MDCS) to run across nodes on the cluster and allow communication between tasks.
Array jobs
Array jobs should be used when the job does not require any synchronisation between tasks. The script will be launched multiple times, with a varying index. The index is accessible via the environment variable SGE_TASK_ID.
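As a rough illustration of how the task index reaches Matlab, the following is a minimal sketch of an SGE array job script. The job name, the input file naming scheme and the Matlab function process_file are all hypothetical placeholders, and the guards around module and matlab are only there so the script can be dry-run away from the cluster.

```shell
#!/bin/bash
# Job name (hypothetical).
#$ -N matlab_array
# Run from the directory the job was submitted from.
#$ -cwd
# Ten tasks; SGE sets SGE_TASK_ID to 1..10, one value per task.
#$ -t 1-10

# Default to 1 so the script can be dry-run by hand outside SGE.
task_id="${SGE_TASK_ID:-1}"

# Hypothetical naming scheme: one input file per task.
input_file="input_${task_id}.dat"

# process_file stands in for your own Matlab function.
matlab_cmd="process_file('${input_file}'); exit"

# Load the Matlab module where the module command is available.
if command -v module >/dev/null 2>&1; then
    module load matlab/R2014b
fi

if command -v matlab >/dev/null 2>&1; then
    matlab -nodisplay -r "${matlab_cmd}"
else
    echo "matlab not found; would run: matlab -nodisplay -r \"${matlab_cmd}\""
fi
```

Saved under a name of your choosing (say array_job.sh), a script like this would be submitted once with qsub array_job.sh, and SGE would then start one copy per task.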
Typical uses of array jobs would include:
- processing a set of input files with each job processing a different file;
- processing a single large file using multiple jobs each of which processes a section of the file;
- examining the performance of a model using multiple sets of model parameters.
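Splitting one large file across tasks comes down to giving each task its own range of records. The sketch below shows one way to compute that range, assuming records are numbered from 1 and the total count is known in advance; chunk_bounds is a hypothetical helper, not something provided on the cluster.

```shell
# chunk_bounds TOTAL NTASKS TASK_ID
# Prints "first last": the 1-based, inclusive record range for one task.
# (Hypothetical helper, for illustration only.)
chunk_bounds() {
    local total=$1 ntasks=$2 task=$3
    # Ceiling division, so every record is covered even when TOTAL is not
    # a multiple of NTASKS.
    local size=$(( (total + ntasks - 1) / ntasks ))
    local first=$(( (task - 1) * size + 1 ))
    local last=$(( task * size ))
    if [ "$last" -gt "$total" ]; then
        last=$total
    fi
    echo "$first $last"
}

# SGE sets SGE_TASK_ID and SGE_TASK_LAST inside an array job. SGE_TASK_LAST
# equals the number of tasks only when the range starts at 1 with step 1.
# The defaults below just make a dry run outside SGE possible.
chunk_bounds 1000 "${SGE_TASK_LAST:-4}" "${SGE_TASK_ID:-1}"
```

If a task ends up with first greater than last, it has nothing to process; this can happen when there are more tasks than chunks.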
An example can be found on the cluster in:
/cm/shared/examples/matlab/array
Distributed Jobs
Rather than submitting SGE jobs that execute Matlab scripts on the cluster nodes, distributed jobs launch tasks on cluster nodes from within Matlab. Distributed jobs require the cluster to be configured within Matlab, and submission scripts which define how tasks should be launched on cluster nodes. The submission of the jobs is performed via Matlab GUI or command line interface.
In order to run distributed jobs, you should:
- Configure Matlab to use the cluster, either using a cluster profile or programmatically
- Create an independent and/or communicating job submission script
- Submit (run) your job
Matlab cluster profiles
Using GUI configuration utility
To configure a profile, start the Matlab GUI and go to Parallel -> Manage Cluster Profiles.
A new window will pop up. In the new window, click on Add -> Custom -> Generic.
A new profile will be created. Rename it to something sensible (you will refer to it by this name in your code). Let's call it Cosine.
Next, make sure you have provided the following information in the Properties tab (leaving all of the other options at their defaults):
Main Properties |
Description of this cluster: | Cosine HPC |
Folder where cluster stores job data: | use default (unless you want to specify alternative location) |
Number of workers available to the cluster: | 32 |
Root folder of MATLAB installation for workers: | use default |
Cluster uses MathWorks hosted licensing: | false |
Submit Functions |
Function called when submitting independent jobs: | @independentSubmitFcn |
Function called when submitting communicating jobs: | @communicatingSubmitFcn |
Cluster Environment |
Cluster nodes' operating system: | Unix |
Job storage location is accessible from client and cluster nodes: | yes |
Workers |
Range of number of workers to run the job: | [1 32] |
Jobs and task functions |
Function to query cluster about the job state: | @getJobStateFcn |
Function to manage cluster when you call delete on a job: | @deleteJobFcn |
Note that once the profile has been loaded, you can override its settings from the submission script.
Once the profile has been set up, click OK. Next, select the newly created profile and validate the configuration.
Importing Cluster Profiles
You can import a profile using either the Cluster Profile Manager or the Matlab parallel.importProfile(filename) command:
parallel.importProfile('/cm/shared/examples/matlab/distributed/Cosine.settings');
To import settings from the Cluster Profile Manager, use:
- Parallel -> Manage Cluster Profiles
- Add -> Import
- and select the appropriate settings file.
Programmatically
Rather than using a previously defined cluster profile, the cluster details can be configured ad-hoc in a .m script file:
cluster = parallel.cluster.Generic();
cluster.NumWorkers = 32;
cluster.JobStorageLocation = '/homes/cosine/helpdesk/matlab/';
cluster.IndependentSubmitFcn = @independentSubmitFcn;
cluster.CommunicatingSubmitFcn = @communicatingSubmitFcn;
cluster.OperatingSystem = 'unix';
cluster.HasSharedFilesystem = true;
cluster.GetJobStateFcn = @getJobStateFcn;
cluster.DeleteJobFcn = @deleteJobFcn;
cluster.RequiresMathWorksHostedLicensing = false;
To save the cluster definition as a profile for later re-use, use:
cluster.saveAsProfile('Cosine')
To load a previously saved cluster definition, use:
cluster = parcluster('Cosine')
Passing Additional Parameters to SGE
If you want to pass additional arguments to SGE, specify the submit function as {@communicatingSubmitFcn, 'list_of_additional_qsub_parameters'}. For example, to request 5GB of memory and have emails sent to name@domain.name at the beginning and end of the job, the submit function should be specified as:
cluster = parcluster('Cosine');
cluster.CommunicatingSubmitFcn = {@communicatingSubmitFcn, '-l h_vmem=5G -m be -M name@domain.name'};
pp = parpool(cluster);
parfor i=1:10
    [~, hn] = system('hostname'); % the second output holds the command's output
    disp(strtrim(hn));
end
...
delete(pp)
Make sure that the options you pass to the qsub command are syntactically correct, otherwise the job will fail (see the qsub man page for the list of available options).
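One way to check an option string before pasting it into Matlab is to hand it to qsub with the -verify flag, which prints the would-be job instead of submitting it. This is only a sketch: it assumes the submit function passes the string to qsub unchanged, which you should confirm against the submit function used on the cluster.

```shell
# Option string in the same form as the one handed to the submit function.
qsub_opts='-l h_vmem=5G -m be -M name@domain.name'

if command -v qsub >/dev/null 2>&1; then
    # -verify prints details of the would-be job without submitting it;
    # -b y treats "hostname" as a command rather than a script file.
    # $qsub_opts is deliberately unquoted so it splits into separate options.
    qsub -verify $qsub_opts -b y hostname
else
    echo "qsub not found; options to check: ${qsub_opts}"
fi
```

A mistyped option should make qsub report an error here, rather than later when Matlab submits the job.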
Independent Jobs
An independent job is defined as follows (from http://www.mathworks.co.uk/help/distcomp/program-independent-jobs.html):
An Independent job is one whose tasks do not directly communicate with each other, that is, the tasks are independent of each other. The tasks do not need to run simultaneously, and a worker might run several tasks of the same job in succession. Typically, all tasks perform the same or similar functions on different data sets in an embarrassingly parallel configuration.
Independent jobs are created using the Matlab createJob() function.
An independent job example:
/cm/shared/examples/matlab/distributed/independent
Note: Matlab will submit the job to SGE without the need to write a submission script:
$ matlab -nodisplay < independent.m
Communicating Jobs
A communicating job is defined as follows (from http://www.mathworks.co.uk/help/distcomp/introduction.html):
Communicating jobs are those in which the workers can communicate with each other during the evaluation of their tasks. A communicating job consists of only a single task that runs simultaneously on several workers, usually with different data. More specifically, the task is duplicated on each worker, so each worker can perform the task on a different set of data, or on a particular segment of a large data set. The workers can communicate with each other as each executes its task. The function that the task runs can take advantage of a worker's awareness of how many workers are running the job, which worker this is among those running the job, and the features that allow workers to communicate with each other.
Communicating jobs are required for:
- parfor loops, which allow multiple loop iterations to be executed in parallel;
- spmd blocks, which run a single program on multiple data, i.e. the same program runs on all workers with behaviour determined by the varying data on each worker.
Communicating jobs are created using the Matlab createCommunicatingJob() function and can have a Type of either pool or spmd:
- a pool job runs the supplied task on one worker and uses the remaining workers as a pool to execute parfor loops, spmd blocks etc.; the total number of workers available for parallel code is therefore one less than the total number of workers;
- a spmd job runs the supplied task on all the workers, with no task fundamentally in control; effectively, an spmd job acts as if the entire task is within an spmd block.
Communication between spmd workers (whether in an spmd job or spmd block) occurs using the lab* functions (see Matlab help). Control of spmd workers is usually exerted by message passing and testing data values (e.g. using the worker with labindex of 1 to control other workers).
A communicating job example:
/cm/shared/examples/matlab/distributed/communication
Note: Matlab will submit the job to SGE without the need to write a submission script:
$ matlab -nodisplay < communication.m
Non-blocking jobs
The examples in both the communicating and independent jobs sections submit the job, then wait (block) until the job is complete, subsequently extracting the results and deleting the job, i.e.
cluster = parcluster('Cosine');
ijob = createJob(cluster);
....
submit(ijob);
wait(ijob, 'finished'); % wait for the job to finish
results = fetchOutputs(ijob); % retrieve results
...
delete(ijob); % delete the job
In some situations this might not be desired, e.g. where the client is not allowed to run for a long time on the submit host. In such cases a non-blocking submit script should be used instead. The only difference from the communicating and independent scripts defined earlier is that a non-blocking job doesn't have the wait and delete calls.
A non-blocking independent submit script:
/cm/shared/examples/matlab/distributed/independent/independent_noblock.m
A non-blocking communicating submit script:
/cm/shared/examples/matlab/distributed/communication/communication_noblock.m
Once the job has been completed, the results can be fetched programmatically:
cluster = parcluster('Cosine');
job = cluster.findJob('ID',1);
job_output = fetchOutputs(job);
The ID used in cluster.findJob('ID', ...) above is the internal Matlab job ID as displayed at the end of the example non-blocking submit scripts, not the SGE job ID.
Once you have finished with it you can delete it using