Physnodes - 3) Submitting Jobs to the physics cluster
In order to submit jobs to the physics cluster you first need to load the Grid Engine module:
-bash-4.1$ module load sge
You will first log in to physlogin, which provides Linux command-line access to the system for editing, compiling and running code. The physlogin nodes are shared amongst everyone who is logged in, so they should not be used for running your code other than for short test runs.
Access to the job/batch submission system is through the login nodes. When a submitted job executes, processors on the compute nodes are made available exclusively for running that job.
In order to interact with the job/batch system, the user must first give some indication of the resources they require. At a minimum these include:
- how long the job needs to run for
- how many processors the job should run on
The default resource allocations for jobs are:
Resource | Default | Maximum |
---|---|---|
Time limit | - | 24 hours (7 days on the weekend queue) |
Memory | 2 GB | 256 GB |
Cores | 1 | 800 |
Armed with this information, the scheduler is able to dispatch jobs at some point in the future when the resources become available. A fair-share policy is in operation to guide the scheduler towards allocating resources fairly between users.
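For illustration, the sketch below shows how such resource requests might appear as Grid Engine directives in a job script. The job name and limits are placeholders, and the h_rt and h_vmem resource names are the ones used elsewhere on this page; adapt the script to your own code.
#!/bin/bash
# Minimal example job script (a sketch only - adapt to your own code)
#$ -cwd                 # run the job in the directory it was submitted from
#$ -N my_test_job       # job name, shown in the qstat 'name' column
#$ -l h_rt=0:15:00      # wallclock time limit (hh:mm:ss)
#$ -l h_vmem=2G         # memory per slot
echo "Running on $(hostname)"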
Job submission
Currently the facility is configured with a single general access queue, allowing submission to all available compute resources. Thus, there is no need to specify a queue name in job submissions.
The general form of the qsub command to submit a job is:
$ qsub [options] script_file_name [--script-args]
where script_file_name is a file containing the commands to be executed by the batch job.
For example submission scripts please look at these Job Script Files.
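As a quick illustration, resource requests can also be given as options on the qsub command line rather than as #$ directives inside the script; the script name below is just a placeholder:
-bash-4.1$ qsub -l h_rt=0:15:00 -l h_vmem=2G my_script.sh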
Querying queues
The qstat command may be used to display information on the current status of Grid Engine jobs and queues.
-bash-4.1$ qstat
job-ID prior  name      user        state submit/start at    queue                         slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 10202 0.00000 node_test abs4        qw   06/02/2014 09:17:30                                   1 1-1000:1
 10203 0.00000 node_test abs4        qw   06/02/2014 09:17:31                                   1 1-1000:1
Column | Description |
---|---|
job-ID | A number used to uniquely identify your job within the Grid Engine system. Use this number when you want to halt a job via the qdel command. |
prior | The user's current job priority, based upon current and recent cluster utilisation |
name | The job's name, as specified by the job submission script's -N directive |
user | The username of the job owner |
state | Current job status: r (running), t (in transfer) and qw (queued and waiting) |
submit/start at | For waiting jobs: the time the job was submitted. For running jobs: the time the job started running |
queue | For running jobs, the queue and compute node the job is running on |
slots | The number of job slots the job is consuming (1 for serial jobs, greater than 1 for parallel jobs) |
ja-task-ID | A special field for task arrays |
By default, users will only see their own jobs in the qstat output. To see all jobs, use a username wildcard:
-bash-4.1$ qstat -u \*
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
10063 0.55500 sh jaw500 r 05/27/2014 13:39:39 elec-cluster@elecnode0.york.ac 1
10064 0.55500 sh jaw500 r 05/27/2014 13:39:39 elec-cluster@elecnode1.york.ac 1
10065 0.55500 sh jaw500 r 05/27/2014 13:39:39 elec-cluster@elecnode2.york.ac 1
10066 0.55500 sh jaw500 r 05/27/2014 13:39:39 elec-cluster@elecnode4.york.ac 1
10067 0.55500 sh jaw500 r 05/27/2014 13:39:39 elec-cluster@elecnode7.york.ac 1
10068 0.55500 sh jaw500 r 05/27/2014 13:39:39 elec-cluster@elecnode6.york.ac 1
10069 0.55500 sh jaw500 r 05/27/2014 13:39:39 elec-cluster@elecnode5.york.ac 1
10070 0.55500 sh jaw500 r 05/27/2014 13:39:39 elec-cluster@elecnode3.york.ac 1
Important switches to qstat are:
Switch | Action |
---|---|
-help | Prints a list of all options. |
-f | Prints full display output of all queues |
-g c | Print a 'cluster queue' summary - good for understanding what resources are free, across different queue types |
-g t | Print 'traditional' output, i.e. print a line per queue used, rather than a line per job |
-u username | Displays all jobs for a particular username. |
-j jobid | Displays details about a particular job |
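For example, before submitting you might check which queues have free slots, and then examine one of your own waiting jobs (the job id is a placeholder):
-bash-4.1$ qstat -g c
-bash-4.1$ qstat -j <jobid>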
Job deletion
To delete a job from the queue use the qdel <jobid> command, where jobid is a number referring to the specified job (available from qstat).
-bash-4.1$ qsub slow_job
Your job 10204 ("slow_job") has been submitted
-bash-4.1$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
10204 0.00000 slow_job abs4 qw 06/02/2014 09:26:47 1
-bash-4.1$ qdel 10204
abs4 has registered the job 10204 for deletion
-bash-4.1$ qstat
-bash-4.1$
To force deletion of running jobs, use the -f option.
A user can delete all of their own jobs from the batch queues with the -u <username> option.
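For example (the job id and username are placeholders):
-bash-4.1$ qdel -f <jobid>
-bash-4.1$ qdel -u <username>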
Information about old jobs
To display a list of recently completed jobs:
-bash-4.1$ qstat -s z
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
10201 0.55500 node_test abs4 z 06/02/2014 09:05:02 1 1-1000:1
10202 0.55500 node_test abs4 z 06/02/2014 09:17:30 1 1-1000:1
10203 0.55500 node_test abs4 z 06/02/2014 09:17:31 1 1-1000:1
10204 0.00000 slow_job abs4 z 06/02/2014 09:26:47 1
-bash-4.1$
To see accounting information about completed jobs, use the qacct command: -o <username> selects jobs for a particular user, and -d <n> restricts the output to jobs from the last n days.
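For example, to summarise your own jobs from the last 7 days (the username is a placeholder):
-bash-4.1$ qacct -o <username> -d 7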
Information about the system
The qhost command displays the status of the compute nodes:
bash-4.1$ qhost
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
global - - - - - - - - - -
ecgberht lx-amd64 2 1 2 2 0.03 3.7G 1.6G 2.0G 65.9M
elecnode0 lx-amd64 32 2 16 32 16.04 62.9G 9.9G 2.0G 0.0
elecnode1 lx-amd64 32 2 16 32 16.03 62.9G 10.1G 2.0G 0.0
elecnode2 lx-amd64 32 2 16 32 15.99 62.9G 7.7G 2.0G 0.0
elecnode3 lx-amd64 32 2 16 32 16.23 62.9G 7.7G 2.0G 0.0
elecnode4 lx-amd64 32 2 16 32 16.16 62.9G 10.1G 2.0G 0.0
elecnode5 lx-amd64 32 2 16 32 16.20 62.9G 10.1G 2.0G 0.0
elecnode6 lx-amd64 32 2 16 32 16.13 62.9G 10.1G 2.0G 0.0
elecnode7 lx-amd64 32 2 16 32 16.09 62.9G 10.1G 2.0G 0.0
rnode0 lx-amd64 32 2 16 32 0.00 62.8G 856.6M 2.0G 0.0
rnode1 lx-amd64 32 2 16 32 1.00 62.9G 834.7M 2.0G 0.0
rnode2 lx-amd64 32 2 16 32 1.00 126.0G 1.2G 2.0G 0.0
rnode3 lx-amd64 32 2 16 32 1.02 62.9G 831.2M 2.0G 0.0
rnode4 lx-amd64 32 2 16 32 1.00 126.0G 1.1G 2.0G 0.0
rnode5 lx-amd64 32 2 16 32 1.01 62.9G 842.1M 2.0G 0.0
FAQ
Why do I get an email saying I am not using the /scratch filestore, when I am?
If you receive an email about using the wrong directory for job submission, and you are sure you are submitting your job from the scratch directory, you have probably edited your file on a Windows PC and transferred it to /scratch.
Editing text files on Windows inserts additional carriage-return (DOS line-ending) characters into the file. After you have transferred the file to the login server, use the following command to remove these characters:
dos2unix <job-script>
We do not recommend that you edit your files on a Windows PC and transfer them to the cluster.
My jobs are not running, they just sit in the queue
There are two main reasons why your jobs may sit in the queue and not run (that is, they remain in the pending, qw, state). To check whether your job could ever run, verify it with:
qalter -w v <jobid>
For example:
$ qalter -w v 275566
verification: found suitable queue(s)
This means your job script is OK. If instead you get the message
verification: no suitable queues
there is something wrong with your job script (see the first cause below).
- The job script is wrong
  - You may have missed out the runtime parameter "#$ -l h_rt=0:15:00"
  - You have asked for a resource that is not available, so the scheduler will wait indefinitely for it to appear; for example, specifying "#$ -l h_vmem=512G" when no node has 512 Gbytes of memory. Remember that the total memory requested is slots * memory.
- The resources you are requesting are currently in use
  - You may be requesting a node with 20 cores and 128GB of memory. That node may be being used by another job: it may be fully utilised, or only one slot may be free.
  - The queue suitable for the job is not currently enabled. For example, a job requiring 48 hours to run will only start on the weekend queue, which is enabled on Friday evening.
Use the command 'qstat -j <jobid>' to display why the job is not currently running.
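If the verification step shows that a pending job is asking for more than is available, one option (a sketch, assuming h_rt and h_vmem were the only resources you requested, since -l replaces the whole resource list) is to lower the request on the queued job with qalter rather than deleting and resubmitting it:
-bash-4.1$ qalter -l h_rt=0:15:00,h_vmem=2G <jobid>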
What resources did my job use?
In order to tune your job submission parameters, use the '-m e' directive to have the batch system email you a summary of the resources used:
Job 754967 (mpi-16_job) Complete
User             = abs4
Queue            = its-4hour@rnode2.york.ac.uk
Host             = rnode2.york.ac.uk
Start Time       = 02/18/2015 10:41:21
End Time         = 02/18/2015 10:41:27
User Time        = 00:00:01
System Time      = 00:00:01
Wallclock Time   = 00:00:06
CPU              = 00:00:03
Max vmem         = 69.367M
Exit Status      = 0
Here we can see the job used 3 seconds of CPU time and 69 MB of memory.
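To receive this end-of-job summary, the job script needs to request mail when the job finishes; a minimal sketch of the relevant directives (the address is a placeholder) is:
#$ -m e                      # send mail when the job ends
#$ -M your.name@york.ac.uk   # address the mail is sent to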
More information can be found by using the "qacct" command:
abs4@login2.york.ac.uk$ qacct -j 830402
==============================================================
qname        yarcc-testing
hostname     tnode0.york.ac.uk
group        csrv
owner        abs4
project      NONE
department   defaultdepartment
jobname      bad.job
jobnumber    830402
taskid       undefined
account      sge
priority     0
qsub_time    Tue Nov 22 08:25:59 2016
start_time   Tue Nov 22 08:28:04 2016
end_time     Tue Nov 22 08:28:04 2016
granted_pe   NONE
slots        1
failed       0
exit_status  0
ru_wallclock 0s
ru_utime     0.031s
ru_stime     0.041s
ru_maxrss    5.395KB
ru_ixrss     0.000B
ru_ismrss    0.000B
ru_idrss     0.000B
ru_isrss     0.000B
ru_minflt    7481
ru_majflt    0
ru_nswap     0
ru_inblock   8
ru_oublock   24
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     93
ru_nivcsw    19
cpu          0.071s
mem          0.000GBs
io           0.000GB
iow          0.000s
maxvmem      0.000B
arid         undefined
ar_sub_time  undefined
category     -U special14day,iggi-users,special3day,testusers -u abs4 -l h_rt=120,h_vmem=1G