Main.GridEngine r1.9 - 26 Jul 2007 - 18:11 - AndreGauthier

Quick Reference

On the clusters, an easy way to submit jobs is to use the Grid Engine, which automatically starts your job on the dali node with the least load. There's a GUI, but I think it's simpler just to use the command line. It's important for everyone to use Grid Engine rather than just ssh'ing to compute servers, because the Grid Engine can't manage jobs it doesn't know about.

Grid Engine commands should be run from sydney. They will not be available from your local machine.

  • Quick Start: If all you want to do is run jobs interactively from the command line, like you're doing now, all you have to do is: (a) modify your dot-file as it says in "Setup", (b) log in to dalisrv, and (c) run qlogin instead of ssh. That's it!

  • Setup:

These files are already sourced for you in your init scripts; just know that this is what happens:

source /opt/gridengine/default/common/settings.sh

If you're using csh, then instead you call

source /opt/gridengine/default/common/settings.csh

  • Submitting a job: Place your commands in a shell script. Then do:

qsub -cwd -o stdout.txt -e stderr.txt myscript.sh args

Alternatively, to run a binary rather than a script, you must use the flag -b y:

qsub -b y -cwd -o stdout.txt -e stderr.txt mybinary args

  • Deleting a job: You can get the job number using qstat. Then do:

qdel JOB_ID

  • Viewing jobs: In general, use qstat. Perhaps the most useful thing to do is to get detailed information about your jobs only:

qstat -r -u USERNAME

If you see 'Eqw' next to a job, that means it's not running because of an error. Probably you misspelled the name of the script or something. See "If your job has an Error state", below.

  • Viewing hosts. You can view all the hosts in the Grid Engine and what their current load is by doing

qstat -f

  • New: Use Trevor's tstat to quickly view useful stats:

tstat

  • Working interactively. Sometimes, you really just want a shell prompt. Then do

qrsh

which will give you an interactive shell on some available Grid Engine machine. This way you take up a slot on the Grid Engine queue, so if someone else submits a large array job, your interactive job will still be assigned a processor. This is great for running Matlab. Variations: To log in to a specific host, use qrsh -q all.q@compute-0-1. If all the slots are full, but you want your shell to wait in the queue for the next available slot, do qrsh -now n.

This information was taken from the Grid Engine documentation (http://gridengine.sunsource.net/project/gridengine/howto/basic_usage.html).
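As a minimal sketch of what myscript.sh might contain (the file contents are an assumption for illustration, not something prescribed by this page), the script below just records where it ran. Submitted with the qsub line from "Submitting a job", its echo output should end up in stdout.txt, and because of -cwd the job should start in the directory you submitted from.

```shell
#!/bin/sh
# myscript.sh: hypothetical example job script.
# Submit from the directory that holds it:
#   qsub -cwd -o stdout.txt -e stderr.txt myscript.sh arg1 arg2

JOB_HOST=$(uname -n)   # name of the node the job landed on
JOB_DIR=$(pwd)         # with -cwd, the directory you submitted from

echo "Running on ${JOB_HOST} in ${JOB_DIR}"
echo "Arguments: $*"
```

Running the script by hand first (sh myscript.sh arg1 arg2) is a cheap way to catch typos before the job sits in the queue.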

Subtle Gotchas

We try to make it easy to run scripts with Grid Engine, but there are a few common problems we haven't ironed out yet.

  1. Your environment variables. Your startup scripts (.profile, .cshrc, and so on) don't always seem to be read in correctly in the Grid Engine. So be careful: your path may not be what you think it is. It is best to use absolute paths. Recently, Charles has found that using -S /bin/bash does not read his .profile, but using -S /bin/sh does read his .profile. Note that sh is just linked to bash, so you don't even lose any bash features by doing this. Obviously, this doesn't help you if you use csh.
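One defensive pattern, sketched here under the assumption that you submit a plain sh script, is to set PATH inside the job script itself and log it, so the job does not depend on which startup files were read. The PATH value and the program location in the comment are placeholders, not settings taken from the cluster.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd

# Do not rely on .profile or .cshrc having been read: set PATH explicitly.
# (The directories listed here are an assumption; adjust to your setup.)
PATH=/usr/local/bin:/usr/bin:/bin
export PATH

# Log the environment the job actually sees; this lands in the job's stdout.
echo "PATH inside the job: ${PATH}"

# Call your own programs by absolute path (hypothetical location):
#   /home/USERNAME/bin/my-program input.txt
```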

Stupid Grid Engine Tricks

Here are some neat tricks that the Grid Engine can do.

Requiring Some Amount of Physical RAM

Sometimes your job uses a lot of memory, and you want to restrict it to run on nodes where there is enough free memory, so you don't have swapping. You can do this using job resources, like this:

qsub -l mem_free=2G my-job.sh

But what if you want to put constraints on things other than free memory? "Can I do that?" you ask. Boy, can you. You can require all kinds of things about the grid node handling your job (e.g., that it have enough physical RAM). For a full list, try qstat -F. For example, you can constrain a job to a particular host using "-l hostname=compute-0-0" (although you probably should never want to do this).

Array Jobs

You can start an array job: a bunch of jobs run in parallel that differ only in that each one is associated with a different numeric index. Each job can find out its index via the SGE_TASK_ID environment variable.

For example, suppose you want to do 10-fold cross validation in parallel. Create a script that runs one fold, using that environment variable to know which fold to run. Then doing

qsub -t 1-10 myjob-cv.sh

would give you 10-fold cross-validation in parallel! This is way cool. I must try this.
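Here is a rough sketch of what such a myjob-cv.sh could look like. The data/foldN.train and data/foldN.test file names and the commented-out train-and-test command are made up for illustration; only SGE_TASK_ID comes from Grid Engine.

```shell
#!/bin/sh
# myjob-cv.sh: hypothetical sketch, one cross-validation fold per array task.
# Submit with: qsub -t 1-10 myjob-cv.sh

# Run a single fold. The file layout below is an assumption.
run_fold() {
    fold=$1
    echo "Fold ${fold}: train on data/fold${fold}.train, test on data/fold${fold}.test"
    # ./train-and-test "data/fold${fold}.train" "data/fold${fold}.test"   # hypothetical
}

# Grid Engine sets SGE_TASK_ID for each array task; fall back to 1 so the
# script can also be tried by hand outside the queue.
run_fold "${SGE_TASK_ID:-1}"
```

Because the fold number is the only thing that varies, the same script can be tested locally with, e.g., SGE_TASK_ID=3 sh myjob-cv.sh.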

Passing Through Environment Variables

If you have custom shell variables in your environment and want to use them in Grid Engine, you need to export them to your Grid Engine runtime environment using the -v option. For example, say you have the variable COLLECTION=/work1/mycollection. Then do

qsub -v COLLECTION=$COLLECTION my-job.sh

and the environment variable COLLECTION will be set in the job context. You may have problems if the value of COLLECTION contains a space. There's probably a workaround, but I haven't explored it.
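Since a forgotten -v only shows up once the job is already running, it can help to have the job script check for the variable first. The following is a sketch of that idea; check_collection is a hypothetical helper name, not part of Grid Engine.

```shell
#!/bin/sh
# Sketch for the top of my-job.sh: fail early if COLLECTION was not passed in.

# Hypothetical helper: succeeds and reports the value if COLLECTION is set,
# otherwise prints a hint to stderr and returns non-zero.
check_collection() {
    if [ -z "${COLLECTION:-}" ]; then
        echo "COLLECTION is not set; submit with: qsub -v COLLECTION=/path my-job.sh" >&2
        return 1
    fi
    echo "Using collection at ${COLLECTION}"
}

# In a real job script you would stop here on failure:
#   check_collection || exit 1
```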

Embedding Command-line Arguments

qsub has many command-line options. Rather than specifying them on the command line, you can keep them in the script file by putting them in special #$ comments. This means the qsub command line is one less thing you have to log / keep track of.

For example, say you wanted a script to always be run as an array job. You could use a command-line argument, as in the last section. Or you can do

# my-job.sh

#$ -t 1-10

echo `uname` $SGE_TASK_ID

And then you just do qsub my-job.sh to submit it!

Here's another example of some parameters used at the top of a perl script:

 # ---------------------------
 # -- Grid Engine parameters 
 # -----------------
 # --- interpreter 
 #$ -S /usr/bin/perl
 # -----------------
 # --- run from current dir
 #$ -cwd 
 # -----------------
 # --- job name
 #$ -N my.job.name
 # -----------------
 # --- stdio redirection
 #$ -e /dev/null
 #$ -o /dev/null
 # -----------------
 # --- force jobs to linux 
 #$ -l arch=glinux
 # -----------------
 # --- restartable
 #$ -r y
 # --- run 20 instances of this script
 #$ -t 1-20

In the actual script, the task id is retrieved using:

$taskid = $ENV{"SGE_TASK_ID"};

Note that options specified to qsub on the command-line override these, so if you want to, e.g., resubmit just one task in an array job, you can do something like

qsub -t 3-3 my-large-array-job.sh

Waiting for Jobs to Finish

Sometimes you don't want qsub to return to the command line until the job actually finishes. (This happened to me when I was using Makefiles for data processing, but wanted to offload some processing steps to the GE.) To do this, use the -sync y option.
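As a sketch of how this can be used from a Makefile rule or a driver script: with -sync y, qsub should not return until the job is done, and its exit status should reflect the job's, so steps can be chained with &&. The QSUB variable and the step1.sh / step2.sh names are assumptions added for illustration; setting QSUB=echo lets you see the command without submitting anything.

```shell
#!/bin/sh
# QSUB is a hypothetical override so the sketch can be tried without a
# cluster, e.g. QSUB=echo prints the command instead of submitting it.
QSUB=${QSUB:-qsub}

# Submit one step and block until it finishes.
run_step() {
    "$QSUB" -sync y -cwd -o stdout.txt -e stderr.txt "$@"
}

# Usage (step1.sh and step2.sh are hypothetical scripts):
#   run_step step1.sh && run_step step2.sh
```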

Suspending Jobs

If you want to temporarily suspend one of your jobs (e.g., to make way for someone else), you can do

qmod -sj JOB_ID

To start your job up again, you can do

qmod -usj JOB_ID

If your job has an Error state

Sometimes, you'll make a mistake somewhere, and you'll have a job in an error state. When you do qstat -f, you'll see errors like this:

 ############################################################################ 
  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
 ############################################################################
       5 0.00000 run.tiny.g casutton     Eqw   11/07/2005 11:50:42     1        

See the big E? That means for some reason, the Grid Engine couldn't even start my job. To find out more, I do qstat -j JOB_ID. In this case:

$ qstat -j 5

 job_number:                 5
  ...LOTS OF INFORMATION DELETED...
  error reason    1:          11/07/2005 11:50:56 [1567:17869]: error: can't chdir to /m/vinci6/data2/casutton/experiments/pw/synt
  scheduling info:            queue instance "all.q@dalisvr.cs.umass.edu" dropped because it is disabled
                                     job is in error state

Oh, so that helps. In this case, the problem was that SGE couldn't find the directory in which I wanted to run the script.
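If you have many jobs, a small filter can pick out the ones in an error state. This is only a sketch: error_job_ids is a made-up helper, and it assumes the state is the fifth whitespace-separated column, as in the sample qstat line shown earlier in this section.

```shell
#!/bin/sh
# Print the IDs of jobs whose state column contains "E" (for example Eqw).
# Reads qstat-style lines on stdin; header and separator lines are skipped
# because their first field is not a number.
error_job_ids() {
    awk '$1 ~ /^[0-9]+$/ && $5 ~ /E/ { print $1 }'
}

# Try it on the sample line from this page.
sample='       5 0.00000 run.tiny.g casutton     Eqw   11/07/2005 11:50:42     1'
echo "$sample" | error_job_ids

# On the cluster (not run here), this could be combined with qstat -j:
#   qstat -u USERNAME | error_job_ids | while read -r id; do
#       qstat -j "$id" | grep "error reason"
#   done
```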

Running a Job On a Particular Machine

Occasionally, you may need to run a job on a particular machine. This should be fairly rare, but if you do need it, you can use the Rocks cluster-fork command. There is information on this in the Rocks documentation (http://dalisrv.cs.umass.edu/rocks-documentation/4.1/launching-interactive-jobs.html#CLUSTER-FORK).

Trevor Strohman's Python library

This Python library is meant to allow large parameter-sweep submissions to a Grid Engine cluster. Each individual cluster job is represented by a Job object, and these objects can have dependencies on other jobs. Once a set of job objects has been created, the sge.build_submission function dispatches these jobs for execution on the cluster. The library automatically redirects stdout and stderr of the submitted jobs.

http://ciir.cs.umass.edu/~strohman/code/sge.py

Sydney's web interface

http://sydney.cs.umass.edu/ganglia/

-- AndreGauthier - 20 Sep 2006
Copyright © 1999-2008 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.