Supercomputer

Using Univa Grid Engine (UGE)

Announcements

July 13, 2010

We changed the method of raising the memory-use limit.
  • Each queue has a memory-use limit of 2 GB and, as in the past, a job is killed if a single process uses more than 2 GB of memory.
  • The method of raising the limit has been changed as described below.
    [New method]
    Specify the option shown below when executing 'qsub'.
    % qsub -l s_vmem=8G
    => The limit is increased to 8 GB.
    => The job will be killed if memory use exceeds 8 GB.

    [Previous method]
    Write the lines below in the job script.
    ·For bash
    ulimit -v 8388608
    ulimit -d 8388608
    ·For tcsh
    limit vmemoryuse 8388608
    limit datasize 8388608
    ·For zsh
    limit addressspace 8388608
    limit datasize 8388608
  • As announced in the workshops, declaring the amount of memory a job requires ensures that multiple jobs with large memory requirements are not submitted to the same host. Follow the instructions below to declare the requirement and raise the limit at the same time. Please note that in Shirokane2 the value of 's_vmem' is specified in bytes (for example, 8G), whereas the value of 'mem_req' is just the number of GB (for example, 8). In Shirokane3, the values of both 's_vmem' and 'mem_req' are specified in bytes.
    (Shirokane3)
    % qsub -l s_vmem=8G -l mem_req=8G
    => The limit is increased to 8 GB.
    => The job will be killed if memory use exceeds 8 GB.
    => 8 GB is subtracted from the installed memory set for the host (128 or 2048).
    => For a 128 GB host this results in 128-8=120; the job will not be submitted to the host if this number is zero.
    (Shirokane2)
    % qsub -l s_vmem=8G -l mem_req=8
    => The limit is increased to 8 GB.
    => The job will be killed if memory use exceeds 8 GB.
    => 8 GB is subtracted from the installed memory set for the host (32 or 128).
    => For a 32 GB host this results in 32-8=24; the job will not be submitted to the host if this number is zero.
    When specifying a memory-use limit of over 2 GB in Shirokane2 or 5.3 GB in Shirokane3, please specify only s_vmem and mem_req, and do not combine them with -pe def_slot. If they are combined with def_slot, the values specified for s_vmem and mem_req are multiplied by the value specified for def_slot, and the result is applied to the job.
  • Click here to access a workshop document that discusses the above changes.
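
The accounting described in this announcement comes down to two small calculations. The sketch below is only an illustration: the function names are hypothetical, and the figures are the ones used in the examples above (a 32 GB host, an 8 GB request).

```shell
#!/bin/bash
# Hypothetical helpers illustrating the memory arithmetic in this announcement.

# Memory (GB) left on a host after a job declares mem_req.
remaining_mem_gb() {
    local installed_gb=$1 mem_req_gb=$2
    echo $(( installed_gb - mem_req_gb ))
}

# Memory (GB) applied to a job when s_vmem/mem_req are combined with
# -pe def_slot: the per-slot value is multiplied by the number of slots.
effective_mem_gb() {
    local per_slot_gb=$1 slots=$2
    echo $(( per_slot_gb * slots ))
}

remaining_mem_gb 32 8    # prints 24 (a 32 GB host after one mem_req=8 job)
effective_mem_gb 8 4     # prints 32 (s_vmem=8G together with -pe def_slot 4)
```

The second result shows why combining these options with def_slot is discouraged: an 8 GB request silently becomes a 32 GB request.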

  • What is UGE?

    UGE (Univa Grid Engine) is a grid computing system.
    Using UGE, programs previously executed on one computer can be executed on multiple computers.
    Please use UGE when executing large numbers of jobs.

    • There are two types of UGE: an open source version and a production version.
    • Both batch jobs and interactive jobs can be executed. When a user submits a job from a certain computer, UGE dispatches the job to another computer with a lower load and executes it there. The execution results are sent back to the user.

    What are the benefits of using UGE?

    • Many programs can be executed at the same time by submitting jobs to multiple computers.
    • UGE will automatically handle scheduling to ensure that large numbers of jobs can be executed smoothly.
    • An array job function is available, which sets multiple parameters to one program and then submits the job.
    • Jobs can be submitted using a GUI.
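
As an illustration of the array job function mentioned above, the sketch below assumes the standard Grid Engine '-t' option and the 'SGE_TASK_ID' environment variable, neither of which is described on this page. The parameter list and the helper name are hypothetical.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -t 1-3
# Sketch of an array job script. Submitted with qsub, it runs as three
# tasks, and Grid Engine sets SGE_TASK_ID to 1, 2 and 3 in turn.

# Hypothetical parameter list: one entry per task.
params=(alpha beta gamma)

# Print the parameter for a given task number (1-based).
pick_param() {
    local task_id=$1
    echo "${params[$(( task_id - 1 ))]}"
}

# Outside an array job SGE_TASK_ID is not a number, so fall back to task 1.
case "${SGE_TASK_ID:-}" in
    ''|*[!0-9]*) task_id=1 ;;
    *)           task_id=$SGE_TASK_ID ;;
esac
echo "task ${task_id}: parameter $(pick_param "$task_id")"
```

Each task receives a different parameter from the same script, so one 'qsub' call replaces three separate submissions.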

    How to use UGE

    • All users can start using UGE immediately.
    • A job is submitted by writing the commands to be executed in a shell script and running the 'qsub' command, as shown below.
      % qsub [ShellScript name]
      When the above command is executed, the standard output and standard error of the shell script are written to files in the user's home directory.
    • By default, the script submitted as a job is run by /bin/csh. Even if "#!/bin/bash" or "#!/usr/local/bin/perl" appears at the beginning of the script, the job is still run by /bin/csh. To specify the intended interpreter, submit the job in the manner shown below:
      % qsub -S [Path of the interpreter executing the script] [Script name]
    • The job is normally executed in the user's home directory. To execute the job in the current directory, use the option below:
      % qsub -cwd [Script name]
      The current directory must be in an area (under the home directory, etc.) that can be referenced from all execution hosts with the same path.
    • It is also possible to submit a binary directly as a job, using the option below:
      % qsub -b y [Binary name]
      Many binaries can be submitted in this way, but submitting a binary directly is not recommended. For general use, please submit a job as a shell script.
    • To execute a job in the home directory and write the standard output to [path A] and the standard error to [path B], use the options below:
      % qsub -o [path A] -e [path B]
    • To check the execution state of a submitted job, use the command below:
      % qstat
      The '-f' option shows the state of job execution in greater detail.
    • To delete a submitted job, check the job ID using the 'qstat' command and execute the first command below. The '-u' option deletes all the jobs of the specified user ID.
      % qdel [Job ID]
      % qdel -u [User ID]
    • The user's guide for N1GridEngine (the production version of UGE) can be found at the following Web site (the operating method is the same as that for UGE).
      N1GridEngine User's Guide
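
The submit, check and delete steps above can be chained in a script once the job ID is known. The sketch below assumes that 'qsub' reports the job ID in the form 'Your job 10765 ("name") has been submitted', as in the 'qlogin' transcript on this page; 'parse_job_id' is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from the message printed by qsub.
parse_job_id() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Sample message in the format shown for qlogin on this page.
echo 'Your job 10765 ("sample.sh") has been submitted' | parse_job_id   # prints 10765

# On a login node the helper could be used as follows:
#   job_id=$(qsub sample.sh | parse_job_id)
#   qstat               # check the state of the job
#   qdel "$job_id"      # delete it if necessary
```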

    About the shell script

    • The shell script is a text file that contains UNIX commands.
    • To execute the shell script as a UGE job, it is necessary to change the shebang (#!) line or to add a '#$ -S' line.
      For perl
      [For changing]
      #!/usr/local/bin/perl → #$ -S /usr/local/bin/perl

      [For adding]
      #!/usr/local/bin/perl
      #$ -S /usr/local/bin/perl

      Other examples
      #$ -S /bin/bash
      #$ -S /bin/tcsh
    • It is also possible to specify the interpreter when submitting a job. In this case, the above change is not necessary.
      % qsub -S /usr/local/bin/perl [Script name]
    • In a shell script used for UGE, a line beginning with '#$' is passed to UGE as an option.
      #$ -cwd
      In this way, 'qsub' options can be specified in the shell script itself.
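
Putting the points above together, a job script can carry all of its 'qsub' options in '#$' lines. The sketch below writes such a script to a temporary directory and lists the options UGE would read from it; the file names are placeholders, and the final command is only printed, not executed.

```shell
#!/bin/bash
# Write a small job script whose '#$' lines carry the qsub options.
# The file names below are placeholders.
job_script="$(mktemp -d)/sample.sh"
cat > "$job_script" <<'EOF'
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -o sample.out
#$ -e sample.err
echo "running on $(hostname)"
EOF

# Show the options that UGE would read from the script.
grep '^#\$ ' "$job_script"

# The script would then be submitted without further options:
echo "qsub $job_script"
```

Keeping the options in the script means that every submission of the same script uses the same settings.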

    Priority among users when executing jobs

    • Priority among users when executing jobs is determined by the number of 'tickets' held by each user. Here, 'priority' refers to the priority when submitting to the execution host after executing the 'qsub' command.
    • Initially, all users have 10,000,000 tickets.
    • The number of tickets held by each user decreases in proportion to the CPU utilization of the UGE jobs the user submits. The number of tickets is automatically restored 1 week after job execution. The more time that has passed since a job was executed, the less its CPU utilization affects the number of tickets.

    Precautions when using UGE

    • UGE does not automatically process jobs by multi-threading: it monitors the load of several computers and executes large numbers of jobs sequentially.
    • Please place files referred to by the job under the home directory, etc., where they can be accessed from all computers.
    • A child process created by a UGE job is not under the control of UGE. For a shell script that creates a large number of child processes, the job can be executed smoothly by writing the script so that each child process is submitted as a UGE job (that is, the script runs 'qsub' itself).

    About the environment variable setting necessary for UGE use

    • The environment variable is set by simply logging in.

    Using the 'qmon' command

    • 'qmon' is a command tool for UGE used with a GUI. By utilizing qmon, jobs can be submitted using a GUI.
    • 'qmon' can be utilized by logging in to 'ngw.hgc.jp' and using the command 'qmon' on the command line.
    • Because the X Window System is utilized, X Server software is required to utilize 'qmon' from a Windows machine.
      X Server software for Windows includes ASTEC-X, WiredX, etc.
      When utilizing the above software, please be aware of your personal firewall settings.
      Software   Port being used
      ASTEC-X    6000/TCP
      WiredX     8000/TCP
    • The settings for the 'DISPLAY' environment variable on 'ngw' may be necessary, depending on the X Server software. It may also be necessary to execute the 'xhost' command on a PC in use.
    • For Mac OS X users, the X Server software is not pre-installed on your computer. However, a CD labeled with 'Xcode' is included at purchase; consequently, 'qmon' can be utilized by installing X11 from the CD. It is also possible to download X11 for free from the Apple homepage.

    About the current queue settings

    The available queues are as follows.
    Shirokane2
    Queue name      How to submit a job                   Number of slots   Upper limit of execution time
    lljobs.q        qsub -l ljob -q lljobs.q sample.sh    256               2 months
    ljobs.q         qsub -l ljob sample.sh                1,728             2 weeks
    mjobs.q         qsub sample.sh                        8,340             2 days (default queue)
    sjobs.q         qsub -l sjob sample.sh                452               8 hours
    lmem.q          qsub -l lmem sample.sh                132               2 weeks
    intr.q          qlogin                                144               none
    web.q           qsub -q web.q sample.sh               none              none
    cp.q            qsub -l cp sample.sh                  8                 none
                    qlogin -l cp
    mjobs_rerun.q   qsub sample.sh                        1,408             2 days
    'intr.q' is an interactive queue. You can log in to a computing node using the 'qlogin' command and use it to execute an interactive job or to debug batch jobs. The computing node for login is selected automatically according to the load of the computers.
    [username@hostname ~]$ qlogin
    local configuration ngw01i not defined - using global configuration
    Your job 10765 ("QLOGIN") has been submitted
    waiting for interactive job to be scheduled ...
    Your interactive job 10765 has been successfully scheduled.
    [username@ncXXX ~]$
    mjobs_rerun.q is a queue assigned to the same nodes as the exclusive queue. As with mjobs.q, the upper limit of the execution time for this queue is 2 days. Once jobs have been submitted up to the maximum utilization limit of mjobs.q, further jobs are submitted to mjobs_rerun.q.
    When a job is submitted without options, as shown below, it is submitted to either mjobs.q or mjobs_rerun.q. For users who can use the exclusive queue, the job may instead be submitted to the exclusive queue.
    [username@hostname ~]$ qsub sample.sh
    Because a job submitted in this way can also run in mjobs_rerun.q, it is more likely to be executed when the system is busy.
    However, because mjobs_rerun.q is assigned to the same nodes as the exclusive queue, a job running in mjobs_rerun.q is passed back if a job is submitted to the exclusive queue on the same node. A job that is passed back is rescheduled and executed again from the start.
    If you prefer that a job not be submitted to mjobs_rerun.q, set the option as shown below. Note that "-q" designates a queue, so the job will not be submitted to the exclusive queue or web.q either.
    [username@hostname ~]$ qsub -q mjobs.q sample.sh
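
The execution-time limits listed for the batch queues can be turned into a small helper that picks the 'qsub' options for an expected run time. This is a sketch only: 'queue_opts_for_hours' is a hypothetical name, "2 months" is taken as 60 days (an assumption), and lmem.q, web.q and cp.q are left out.

```shell
#!/bin/bash
# Hypothetical helper: print the qsub options that select a queue able to
# hold a job of the given run time in hours, using the limits listed above
# (8 hours, 2 days, 2 weeks, 2 months; 2 months is assumed to be 60 days).
queue_opts_for_hours() {
    local hours=$1
    if   (( hours <= 8 ));    then echo "-l sjob"
    elif (( hours <= 48 ));   then echo ""                      # default: mjobs.q
    elif (( hours <= 336 ));  then echo "-l ljob"
    elif (( hours <= 1440 )); then echo "-l ljob -q lljobs.q"
    else
        echo "no batch queue allows ${hours} hours" >&2
        return 1
    fi
}

opts=$(queue_opts_for_hours 100)
echo "qsub ${opts} sample.sh"    # prints: qsub -l ljob sample.sh
```

A short job does not have to use 'sjobs.q'; the helper simply prefers the queue with the tightest limit that still fits.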

    How to immediately execute a job with an execution time less than 8 hours (use of 'sjobs.q')

    • 'sjobs.q' has an upper limit for execution time of 8 hours. This is the elapsed (wall-clock) time, not the CPU time.
    • 'sjobs.q' is intended for the following situations.
      Queues other than 'sjobs.q' are filled with the jobs of other users.
      A job with an execution time of less than 8 hours needs to be executed immediately.
    • Jobs are not normally submitted to 'sjobs.q'.
      To submit a job to 'sjobs.q', specify the option as described below:
      % qsub -l sjob [ShellScript name]
      A job cannot be submitted to 'sjobs.q' by specifying '-q sjobs.q' alone; '-l sjob' is required.
      To submit a job to a specific computer included in 'sjobs.q', execute the following command:
      % qsub -l sjob -q sjobs.q@ncXXXi [ShellScript name]
    • In 'sjobs.q', the job is killed by the system when the job execution time exceeds 8 hours.
    • The killing of a job is not reported to the user.
    • After execution, please execute the command below to confirm the submitted queue or execution time.
      % qacct -j [Job ID]
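
Because the killing of a job is not reported, it can help to summarize the 'qacct' record after the job ends. The sketch below assumes that the 'qacct -j' output contains lines beginning with 'qname', 'ru_wallclock' (in seconds) and 'exit_status'; 'summarize_qacct' is a hypothetical helper and the sample values are made up.

```shell
#!/bin/bash
# Hypothetical helper: summarize `qacct -j` output read from stdin.
# Assumes the output contains lines starting with qname, ru_wallclock
# (seconds) and exit_status. The limit defaults to 8 hours (28800 s).
summarize_qacct() {
    awk -v limit="${1:-28800}" '
        $1 == "qname"        { queue = $2 }
        $1 == "ru_wallclock" { wall = $2 + 0 }
        $1 == "exit_status"  { status = $2 }
        END {
            printf "queue=%s wallclock=%ds exit_status=%s", queue, wall, status
            if (wall >= limit) printf " (reached the %ds limit)", limit
            printf "\n"
        }'
}

# Made-up sample values, for illustration only.
printf 'qname sjobs.q\nru_wallclock 28805\nexit_status 137\n' | summarize_qacct
# On a login node: qacct -j [Job ID] | summarize_qacct
```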


    How to smoothly execute multi-thread jobs and jobs that create child processes

    • When submitting a job to UGE that creates multiple processes, it is possible that UGE will have problems with load balancing.
    • UGE controls the number of jobs according to the number of slots defined in the queue. However, when several jobs are submitted to the slots of one computer and each of these jobs creates multiple processes, more processes than the defined number of slots are executed on that computer.
    • If at all possible, please do not submit to UGE a job that creates multiple processes.
    • If it is necessary to submit such a job to UGE, use the option below when submitting the job:
      % qsub -pe def_slot 2 [ShellScript]
    • A job submitted with the above option uses two slots. By declaring the number of slots the job uses in this way, you can prevent an excessive number of processes from being placed on the same computer.
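
One way to keep a multi-threaded job within the slots it requested is to derive the thread count from the slot count inside the job script. The sketch below assumes the standard Grid Engine 'NSLOTS' environment variable, which is not described on this page; 'threads_for_job' and the program name are hypothetical.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -pe def_slot 2
# Sketch: keep the number of threads equal to the number of slots requested.
# Grid Engine sets NSLOTS for the job; fall back to 1 outside a job.
threads_for_job() {
    echo "${NSLOTS:-1}"
}

export OMP_NUM_THREADS="$(threads_for_job)"
echo "running with ${OMP_NUM_THREADS} thread(s)"
# ./my_threaded_program   # hypothetical multi-threaded program
```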

    How to specify the available memory required on the execution host and then submit the job

    • The host to which a UGE job is submitted is not known in advance. However, when submitting a job that uses a large amount of memory, sufficient memory is required on the host that executes it.
    • By specifying the necessary memory, the job will only be submitted to a host with sufficient memory.
    • % qsub -l mem_free=20G sample.sh

          -> The job is only submitted to a computer with more than 20 GB of available memory.

    • [Note] When a large number of jobs are submitted at the same time, the available memory may turn out to be less than that required. This situation can be avoided by setting the execution order of the jobs and leaving extra time between them, for example by inserting a sleep job.

    • % qsub -l mem_free=20G [Script name]
          -> The job is only submitted to computers with more than 20 GB of available memory.
      % qsub -N sleep_job -b y sleep 60
          -> The job is submitted as a 'sleep_job' and the execution is complete in 60 seconds.
      % qsub -l mem_free=20G -hold_jid sleep_job [Script name]
          -> After execution of the 'sleep_job' is complete,
                the job is submitted to a computer with more than 20 GB of available memory.


      "I want to set the job execution order and then submit the job."

      • This is possible by executing the commands below:
        % qsub -N job1 [Script name]
        % qsub -N job2 -hold_jid job1 [Script name]
        % qsub -N job3 -hold_jid job1,job2 [Script name]
      • The jobs are given the names 'job1', 'job2' and 'job3'. When 'job1' execution is complete, 'job2' is executed; when both 'job1' and 'job2' are complete, 'job3' is executed.
      • The same execution order can be achieved by using the expression shown below:
        % qsub -N job1 [Script name]
        % qsub -N job2 -hold_jid job1 [Script name]
        % qsub -N job3 -hold_jid "job*" [Script name]
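
For a longer sequence, the commands can be generated rather than typed. The sketch below prints a linear chain in which each job waits only for the one before it; 'emit_chain' and the script names are hypothetical, and the commands are printed rather than executed.

```shell
#!/bin/bash
# Hypothetical helper: print qsub commands that run the given scripts one
# after another, each job waiting for the previous one via -hold_jid.
emit_chain() {
    local i=1 prev="" script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            echo "qsub -N job${i} ${script}"
        else
            echo "qsub -N job${i} -hold_jid ${prev} ${script}"
        fi
        prev="job${i}"
        i=$(( i + 1 ))
    done
}

emit_chain step1.sh step2.sh step3.sh
```

Piping the output to 'sh' on a login node would submit the chain.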

      "I want to set the job execution order and to control a job to be executed later based on the outcome of an earlier job."

      • "'Job B' is to be executed after the execution of 'Job A'. In the case of an error in 'Job A', I want to cancel the execution of 'Job B'." This is possible using the setting below:
        Write the command below in the error handling of 'Job A'.
        % qdel jobB

        Set the execution order of the jobs and submit them.
        % qsub -N jobA [Script name]
        % qsub -N jobB -hold_jid jobA [Script name]
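
The error handling in 'Job A' might look like the sketch below. By default the 'qdel' command is only printed so that the script can be tried safely; 'run_job_a' is a hypothetical wrapper, and 'false' stands in for a failing main command.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -N jobA
# Sketch of the error handling in Job A. By default the qdel command is only
# printed; set QDEL=qdel in a real job so that it is executed.
QDEL=${QDEL:-echo qdel}

# Run the given command; if it fails, cancel the waiting job 'jobB'.
run_job_a() {
    if ! "$@"; then
        $QDEL jobB
        return 1
    fi
}

# 'false' stands in for a failing main command of Job A.
run_job_a false || echo "jobA failed, jobB cancelled"
```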

      "I want to specify the queue in which the job is to be executed and then to submit the job"

      • Some users can utilize queues other than 'ljobs.q' and 'sjobs.q'. In this case, use the '-q' option to specify the queue in which the job is to be executed.
        % qsub -q ljobs.q [Script name]
        The job is submitted only to 'ljobs.q'.

      • By placing the same name on multiple jobs, it is possible to retain management over the entire group.
      • A job can be named when it is submitted or while it is waiting to be executed. A job that has already been executed cannot be named.
        % qsub -N GROUP1 [Script name 1]
        % qsub -N GROUP1 [Script name 2]
        % qsub -N GROUP2 [Script name 3]
        % qsub -N GROUP2 [Script name 4]
      • Delete jobs as a group. The example below deletes GROUP1.
        % qdel GROUP1
      • Change the job options as a group. The example below changes the UGE options and the shell script arguments of GROUP2.
        % qalter [UGE option] GROUP2 [Shell script argument]


      "I want to set the environment variable and then submit the job"

      • The setting for the environment variable of the jobs depends on the submitted shell script.
      • When submitting a csh script as a job, '$HOME/.cshrc' is read, but it is not read when submitting a Perl script as a job.
      • When executing a Perl script etc., use the '-v' option to set the environment variable.

      • qsub -v LD_LIBRARY_PATH=$HOME/lib:$LD_LIBRARY_PATH,PATH=$HOME/bin:$PATH \
             -S /usr/local/bin/perl [PerlScript name]


      "I want to execute R using UGE"

      • By executing R in batch mode, R can be executed using UGE.
      • First, create a text file that includes the R commands (as below).
      File name:/home/[userID]/r_batch.R

      x <- matrix(1:12,3,4)
      x

      • Create a shell script that executes R in batch mode.
      File name:/home/[userID]/R.sh

      #!/bin/tcsh
      #$ -S /bin/tcsh
      /usr/local/bin/R CMD BATCH /home/[userID]/r_batch.R

      • Perform 'qsub' and submit the job.
      qsub /home/[userID]/R.sh
      • R writes the execution result to the home directory as "[Name of the file containing the R commands]out" (r_batch.Rout in this example).

      "I want to know the number of slots I can use."

      You can check using the qfree command. The number of available slots in the requested course will be displayed in the QUOTA LIMIT column per queue, and the number of slots in use will be displayed in the [USER] JOBS and [GROUP] JOBS columns.
      % qfree 
      SUMMARY OF RUNNING JOBS
      
                       [USER] [GROUP] QUOTA      ALL                  
              QNAME     JOBS     JOBS LIMIT     JOBS AVAIL STDBY TOTAL
      ------------- ----------------- ----- --------------------------
               cp.q        0        0     0        0    16     0    16
             intr.q        0       39    96       62     2     0    96
            ljobs.q        0        0   576       28    59   448   576
           lljobs.q        7        7    96       46    38     0    96
             lmem.q        0        0   176       32    36    16   176
            mjobs.q        0     1089  4096     1179   138  1120  3320
      mjobs_rerun.q        0        0  4096        0     0  1456  1456
            sjobs.q        2        2   304        2    62   240   304
      ------------- ----------------- ----- --------------------------
                           9     1137  9440     1349   351  3280  6040
      
      THE NUMBER OF RUNNING JOBS BY USER IN THE GROUP [GROUP]
      
          QNAME    userA    userB    userC
      ---------- --------------------------------------------------
           cp.q        0        0        0
         intr.q       32        0        0
        ljobs.q        0        0        0
       lljobs.q        0        0        0
         lmem.q        0        0        0
        mjobs.q      216      232       75
        sjobs.q        0        0        0
             qw      936        0        0
      ---------- --------------------------------------------------
      
      The meaning of each column is described below.
      Column Name     Meaning
      [USER] JOBS     The number of slots used by [USER]. The name of the user who executed the qfree command is displayed as [USER].
      [GROUP] JOBS    The number of slots used by the users of [GROUP]. The group name of the user who executed the qfree command is displayed as [GROUP].
      QUOTA LIMIT     The maximum number of slots available to [GROUP]. If it is 0, there is no limit per group. (The number of slots indicated in the TOTAL column is available for use.)
      ALL JOBS        The number of slots used by the jobs executed by all users.
      AVAIL           The number of slots that can immediately execute a job when the requested memory volume is 2 GB.
      STDBY           The number of slots of suspended or stopped calculation nodes. (The number of slots configured for calculation nodes suspended due to a small number of jobs or stopped due to a failure. A suspended calculation node starts automatically when jobs are submitted and the number of waiting jobs increases.)
      TOTAL           The number of slots per queue.
      When a user name and group name are specified as arguments, the data for the specified user and group is displayed.
      % qfree [USER]
      % qfree [USER] [GROUP]
      Help is displayed when using the "-h" option.
      % qfree -h
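
The summary table printed by 'qfree' is easy to read from a script. The sketch below extracts the AVAIL value for one queue, assuming the eight-column layout of the sample output above; 'avail_slots' is a hypothetical helper, and the two rows are copied from that sample.

```shell
#!/bin/bash
# Hypothetical helper: read qfree output on stdin and print the AVAIL value
# for one queue. Summary rows have 8 fields:
# QNAME, [USER] JOBS, [GROUP] JOBS, QUOTA LIMIT, ALL JOBS, AVAIL, STDBY, TOTAL
avail_slots() {
    awk -v q="$1" '$1 == q && NF == 8 { print $6 }'
}

# Two rows copied from the sample output above.
sample='      mjobs.q        0     1089  4096     1179   138  1120  3320
      sjobs.q        2        2   304        2    62   240   304'

echo "$sample" | avail_slots mjobs.q    # prints 138
# On a login node: qfree | avail_slots mjobs.q
```

The field-count check skips the per-user table, whose rows start with the same queue names but have fewer columns.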

      "I want to remove a job which state is 'dt' or 'dr'."

      You can forcibly remove the job by using the "qdel -f" command.
      % qdel -f [Job ID]



The University of Tokyo The Institute of Medical Science

Copyright©2005-2019 Human Genome Center