...

To submit jobs to the YARCC physics cluster you first need to load the Grid Engine module.

Panel
titleLoad Grid Engine Tools

-bash-4.1$ module load sge

 


When you log in, you will be directed to one of several login nodes. These allow Linux command line access to the system, which is necessary for editing programs, compiling and running code. Usage of the login nodes is shared amongst all who are logged in. These systems should not be used for running your code, other than for short test runs.

...

The default resource allocations for jobs are:

Default Resource Allocation
Resource      Default   Maximum
Time limit    -         24 hours and 7 days
Memory        2GB       256GB
Cores         1         800

Armed with this information, the scheduler is able to dispatch the jobs at some point in the future when the resources become available. A fair-share policy is in operation to guide the scheduler towards allocating resources fairly between users.   
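As a sketch of how these defaults can be overridden, resources can also be requested on the qsub command line using the same Grid Engine resource names (h_rt, h_vmem) that appear later on this page. The script name is a placeholder and the parallel environment name 'smp' is an assumption that may differ on YARCC.

Code Block
languagebash
titleRequesting non-default resources (illustrative)
# request 48 hours of runtime and 4GB of memory per slot
-bash-4.1$ qsub -l h_rt=48:00:00 -l h_vmem=4G myjob.job

# request 8 cores; total memory requested is then 8 x h_vmem
# 'smp' is an assumed parallel environment name
-bash-4.1$ qsub -pe smp 8 -l h_vmem=2G myjob.job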

Job submission

Currently the facility is configured with a single general access queue, allowing submission to all available compute resources. Thus, there is no need to specify a queue name in job submissions.   

...

For example submission scripts, please look at these Job Script Files.
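As a rough sketch of the general form of such a script, the directives below use standard Grid Engine options (-cwd, -N, -l); the job name, resource values and program are placeholders, and the real examples on the Job Script Files page should be preferred.

Code Block
languagebash
titleMinimal serial job script (sketch)
#!/bin/bash
#$ -cwd                 # run the job from the directory it was submitted from
#$ -N my_test_job       # job name, as reported by qstat (placeholder)
#$ -l h_rt=0:15:00      # wallclock time limit (hh:mm:ss)
#$ -l h_vmem=2G         # memory per slot

# placeholder for the program you actually want to run
./my_program

The script would then be submitted with qsub, for example 'qsub my_test_job.job'.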

Querying queues

The qstat command may be used to display information on the current status of Grid Engine jobs and queues.

Panel
titleList your jobs in the queue
-bash-4.1$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  10202 0.00000 node_test  abs4         qw    06/02/2014 09:17:30                                    1 1-1000:1
  10203 0.00000 node_test  abs4         qw    06/02/2014 09:17:31                                    1 1-1000:1


Column           Description
job-ID           A number used to uniquely identify your job within the Grid Engine system. Use this number when you want to halt a job via the qdel command.
prior            The user's current job priority, based upon current and recent cluster utilisation
name             The job's name, as specified by the job submission script's -N directive
user             The username of the job owner
state            Current job status: r (running), t (in transfer) and qw (queued and waiting)
submit/start at  For waiting jobs: the time the job was submitted. For running jobs: the time the job started running
queue            For running jobs, the queue and compute node the job is running on
slots            The number of job slots the job is consuming (1 for serial jobs, greater than 1 for parallel jobs)
ja-task-ID       A special field for task arrays

...


By default, users will only see their own jobs in the qstat output. To see all jobs, use a username wildcard:
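For example, a quoted wildcard can be passed to the -u switch (quoting it stops the shell expanding it):

Code Block
languagebash
titleList jobs belonging to all users
-bash-4.1$ qstat -u '*'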

...

Important switches to qstat are:  


 

Switch        Action
-help         Prints a list of all options
-f            Prints full display output of all queues
-g c          Prints a 'cluster queue' summary - good for understanding what resources are free, across different queue types
-g t          Prints 'traditional' output, i.e. a line per queue used, rather than a line per job
-u username   Displays all jobs for a particular username
-j jobid      Displays details about a particular job
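For instance, the following combination is a typical way to see what is free and then to inspect one of your own jobs (the job number is taken from the earlier example and is purely illustrative):

Code Block
languagebash
titleTypical qstat usage
# summary of used and available slots per cluster queue
-bash-4.1$ qstat -g c

# full details of a single job, including its resource requests
-bash-4.1$ qstat -j 10202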

Job deletion

To delete a job from the queue use the qdel <jobid> command, where <jobid> is the number identifying the job (available from qstat).

...

A user can delete all of their jobs from the batch queues with the -u option, as shown below.
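For example (the job number and username are taken from the examples above and are purely illustrative):

Code Block
languagebash
titleDeleting jobs
# delete a single job
-bash-4.1$ qdel 10202

# delete all of your own jobs
-bash-4.1$ qdel -u abs4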

Information about old jobs

To display a list of recently completed jobs

...

To see information about completed jobs, use the qacct command: -o restricts the output to a particular user, and -d <n> shows jobs from the last n days.
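For example, to summarise your own jobs from the last 7 days (the username is illustrative):

Code Block
languagebash
titleAccounting summary for recent jobs
-bash-4.1$ qacct -o abs4 -d 7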

Information about the system

The qhost command checks the status of the compute nodes

Panel
titleList status of nodes
bash-4.1$ qhost
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
global - - - - - - - - - -
ecgberht lx-amd64 2 1 2 2 0.03 3.7G 1.6G 2.0G 65.9M
elecnode0 lx-amd64 32 2 16 32 16.04 62.9G 9.9G 2.0G 0.0
elecnode1 lx-amd64 32 2 16 32 16.03 62.9G 10.1G 2.0G 0.0
elecnode2 lx-amd64 32 2 16 32 15.99 62.9G 7.7G 2.0G 0.0
elecnode3 lx-amd64 32 2 16 32 16.23 62.9G 7.7G 2.0G 0.0
elecnode4 lx-amd64 32 2 16 32 16.16 62.9G 10.1G 2.0G 0.0
elecnode5 lx-amd64 32 2 16 32 16.20 62.9G 10.1G 2.0G 0.0
elecnode6 lx-amd64 32 2 16 32 16.13 62.9G 10.1G 2.0G 0.0
elecnode7 lx-amd64 32 2 16 32 16.09 62.9G 10.1G 2.0G 0.0
rnode0 lx-amd64 32 2 16 32 0.00 62.8G 856.6M 2.0G 0.0
rnode1 lx-amd64 32 2 16 32 1.00 62.9G 834.7M 2.0G 0.0
rnode2 lx-amd64 32 2 16 32 1.00 126.0G 1.2G 2.0G 0.0
rnode3 lx-amd64 32 2 16 32 1.02 62.9G 831.2M 2.0G 0.0
rnode4 lx-amd64 32 2 16 32 1.00 126.0G 1.1G 2.0G 0.0
rnode5 lx-amd64 32 2 16 32 1.01 62.9G 842.1M 2.0G 0.0

FAQ

Why do I get an email saying I am not using the /scratch filestore, when I am?

If you receive an email about using the wrong directory for job submission, and you are sure you are submitting your job from the scratch directory, you have probably edited your file on a Windows PC and transferred it to /scratch.

...

We do not recommend that you edit your files on a Windows PC and transfer them to the cluster.
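The likely cause is Windows (CRLF) line endings in the job script. A common fix, assuming the dos2unix utility is available on the login node, is to convert the file in place before submitting it (the filename is a placeholder):

Code Block
languagebash
titleRemoving Windows line endings
-bash-4.1$ dos2unix myjob.job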

My jobs are not running, they just sit in the queue

There are two main causes for your jobs to sit in the queue and not run (they remain in the pending state):

  1. Use the command:
    qalter -w v <job-id>


    Code Block
    languagebash
    $ qalter -w v 275566
    verification: found suitable queue(s)

    This means your job script is OK. If you get the message

    verification: no suitable queues

    there is something wrong with your job script; see (2).

  2. The job script is wrong
    1. You may have missed out the runtime parameter "#$ -l h_rt=0:15:00"
    2. You have asked for a resource that is not available; the scheduler will wait for the resource to appear. For example, specifying "#$ -l h_vmem=512G" when we have no nodes with 512 Gbytes of memory. Remember the total memory requested is slots * memory.
  3. The resources you are requesting are currently in use
    1. You may be requesting a node with 20 cores and 128GB of memory. This node may be in use by another job; the node may be either fully utilised, or only one slot may be active.
    2. The queue suitable for the job is not currently enabled. For example, a job requiring 48 hours to run will only start on the weekend queue, which is enabled on Friday evening.
  4. Use the command 'qstat -j <jobid>' to display why the job is not currently running, as shown in the example below.
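For example (the job number is the one used above and is purely illustrative); for a pending job, the 'scheduling info:' section near the end of the output normally explains why it has not yet started:

Code Block
languagebash
titleWhy is my job still queued?
-bash-4.1$ qstat -j 275566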

What resources did my job use?

In order to tune your job submission parameters, use the '-m e' directive so that an email summarising the resources used is sent when your job finishes:
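A sketch of the relevant directives in a job script (the email address is a placeholder):

Code Block
languagebash
titleEmail a resource summary at the end of the job
#$ -M your.name@york.ac.uk    # address to send the mail to (placeholder)
#$ -m e                       # send mail when the job ends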

...

Code Block
languagebash
titleResources used by a terminated job
abs4@login2.york.ac.uk$ qacct -j 830402
==============================================================
qname        yarcc-testing       
hostname     tnode0.york.ac.uk   
group        csrv                
owner        abs4                
project      NONE                
department   defaultdepartment   
jobname      bad.job             
jobnumber    830402              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Tue Nov 22 08:25:59 2016
start_time   Tue Nov 22 08:28:04 2016
end_time     Tue Nov 22 08:28:04 2016
granted_pe   NONE                
slots        1                   
failed       0    
exit_status  0                   
ru_wallclock 0s
ru_utime     0.031s
ru_stime     0.041s
ru_maxrss    5.395KB
ru_ixrss     0.000B
ru_ismrss    0.000B
ru_idrss     0.000B
ru_isrss     0.000B
ru_minflt    7481                
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   8                   
ru_oublock   24                  
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     93                  
ru_nivcsw    19                  
cpu          0.071s
mem          0.000GBs
io           0.000GB
iow          0.000s
maxvmem      0.000B
arid         undefined
ar_sub_time  undefined
category     -U special14day,iggi-users,special3day,testusers -u abs4 -l h_rt=120,h_vmem=1G

 

...