Service Desk Knowledgebase: Condor

From Computer Laboratory System Administration
Revision as of 12:39, 19 June 2015 by vrw10 (talk | contribs) (→‎Service Desk Call Handling Procedure)

This is the Condor content page of the CL Wiki Service Desk Knowledgebase. Its purpose is to provide information to the Service Desk team on how to handle problems and requests about this CL service. If you are involved with the provision of this CL service please feel free to add to the knowledge about it.

If CL staff need to tell the Service Desk team about problems with this service please email
sys-admin-aside@cl.cam.ac.uk.

Key Service Description & URLs

CL Customer Documentation

Further CL Sys-Admin Resources

NOTE: Machines for Condor jobs are NOT always powered up and are started up on request.

Underpinning Services

Customer-base for this Service

  • All staff and students of the collegiate University

Costs

  • Free to all current staff and students of the collegiate University

SLA

  • N/A

Service Desk Call Handling Procedure

  • RT tickets can be escalated by changing the Queue to backoffice with the Owner set to Nobody and the Status as new. Tell the requestor:
    I am passing this request over to the experts who, I'm sure, will be in contact shortly.

Condor

Graham Titmus (31/01/2015)

http://www.cl.cam.ac.uk/local/sys/unix/applications/condor/condor-6.8.html

 1.2. Condor in the Computer Lab
 To use the Condor system, the execute machines need to be able to
 get Kerberos TGTs so that the filer is accessible. See the NFS
 sec=sys page for instructions on enabling filer access.

http://www.cl.cam.ac.uk/local/sys/unix/nfs-sec-krb5/#unattended

 Unattended operation (cron jobs, Condor, scripts, servers)  
 We are in the process of upgrading our cron servers, Condor 
 machines and web servers to use NFS sec=krb5. If you need to run 
 cron, condor or jobs periodically that access filer then please 
 email sys-admin to request advice on how to achieve this. There 
 are several techniques that can be used depending upon the exact 
 nature of your work, such as the TGT server. We will attempt to 
 help you find the most appropriate solution.

In the case of Condor, the appropriate solution is to use the new TGT Server. There is information on how to do so at http://www.wiki.cl.cam.ac.uk/clwiki/SysInfo/TgtServer i.e. step 1 below...

1 The user needs to set up a Kerberos ticket on the machine.

2 The Xen VMs need to be started. These are named pb0xx, e.g. to start machine 30 with 2 CPUs and 7GB memory use

cl-condor-start pb030 2 7000000000

3 The PATH variable needs setting to

PATH=/opt/condor-6.8.3/bin:$PATH;export PATH

4 The user can then use the commands to submit and monitor jobs.
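
Steps 2 and 3 above can be sketched as a small shell helper. This is only a sketch, not an official CL tool: the function name condor_start_cmd is hypothetical, and it assumes that the third argument of cl-condor-start is memory in bytes with 1GB counted as 1000000000, as the 7GB example suggests. It only prints the command, so the line can be checked before it is actually run.

```shell
# Hypothetical helper: build the cl-condor-start command line for one Xen VM.
# Usage: condor_start_cmd <machine> <cpus> <memory-in-GB>
condor_start_cmd() {
    machine=$1
    cpus=$2
    mem_gb=$3
    # Assumption: memory is passed in bytes, with 1GB = 1000000000 bytes.
    mem_bytes=$((mem_gb * 1000000000))
    echo "cl-condor-start $machine $cpus $mem_bytes"
}

# Step 2: show the command for machine 30 with 2 CPUs and 7GB of memory.
condor_start_cmd pb030 2 7

# Step 3: put the Condor binaries on the PATH.
PATH=/opt/condor-6.8.3/bin:$PATH; export PATH
```

Once the printed command looks right it can be pasted into the shell and followed by the usual submit and monitor commands.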

A common problem is that jobs are held. Running

condor_q -analyze 

just says "Request is held". The usual cause is underestimating the amount of RAM needed: the job repeatedly runs out of memory and fails, so it is held.

condor_q
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 
1374.31 mr472 2/4 00:53 0+00:31:52 R 0 32128.9 runconfig.sh /home

shows that the jobs require 32GB each, whereas

condor_Qr
1374.31 2S 5U 2C mr472 vm1@pb035 (Memory >= 4000)

shows that the user seems not to have specified enough RAM when starting the machine using 'cl-condor-start'. The solution is to start up new machines with enough memory.
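
To judge how much memory the new machines need, the SIZE column of the condor_q listing can be converted into a byte count. The following is a minimal sketch under stated assumptions: the function name condor_job_mem is hypothetical, SIZE is taken to be the 8th whitespace-separated field as in the listing above and to be in megabytes, and 1MB is counted as 1000000 bytes to match the memory argument used with cl-condor-start earlier on this page.

```shell
# Hypothetical helper: read condor_q output on stdin and, for each job line,
# print the job ID, its SIZE in MB, and that size converted to bytes.
# Assumptions: SIZE is field 8 and is in megabytes; 1MB = 1000000 bytes.
condor_job_mem() {
    # Only lines whose first field looks like a job ID (cluster.proc) are used,
    # so the header line is skipped.
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { printf "%s %s %.0f\n", $1, $8, $8 * 1000000 }'
}

# Example with the job line shown above:
echo '1374.31 mr472 2/4 00:53 0+00:31:52 R 0 32128.9 runconfig.sh /home' | condor_job_mem
```

In normal use the listing would be piped in directly, e.g. condor_q | condor_job_mem, and the byte figure gives a lower bound for the memory to request when starting a machine.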

Dealing with the XenPool Machines

See Accessing the Xen Console

Contacts

Primary

Availability

  • 24x7

Hints, Tips & Known Issues

TGT server

Ian Mackey (20/4/15)

The command "condor_q" comes back with:

-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk
CEDAR:6001:Failed to connect to <128.232.0.69:9654>
ssh-remote-0:~$ condor_q  
-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk
CEDAR:6001:Failed to connect to <128.232.0.69:9654>

ssh-remote-0:~$ condor_q -analyze

-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk
CEDAR:6001:Failed to connect to <128.232.0.69:9654>
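
When triaging a ticket with this error it can help to pull the address and port out of the message so that basic reachability can be checked. This is a minimal sketch, not a documented CL procedure: the function name schedd_addr is hypothetical, and it assumes the error line has exactly the form shown above.

```shell
# Hypothetical helper: extract "host port" from a CEDAR connection error line
# read on stdin. Lines that do not match produce no output.
schedd_addr() {
    sed -n 's/.*Failed to connect to <\([0-9.]*\):\([0-9]*\)>.*/\1 \2/p'
}

# Example with the error line shown above:
echo 'CEDAR:6001:Failed to connect to <128.232.0.69:9654>' | schedd_addr
```

The two values could then be handed to a TCP probe such as nc -z -w 5, if a netcat variant supporting those options is installed on the machine.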



TGT server

Piete Brooks (12/3/15)

The new kid on the block is a TGT server which allows people to do things like:

  1. retain a valid TGT at all times on named machines
  2. request a TGT on a named machine by typing 'cl-tgt'
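
The second use can be wrapped in a check so that a ticket is requested only when none is valid. This is a sketch under assumptions: the function name ensure_tgt and the TGT_CHECK and TGT_FETCH variables are hypothetical, 'klist -s' is assumed to be available and to exit 0 only when the credentials cache holds an unexpired ticket, and 'cl-tgt' is assumed to need no arguments, as the description above suggests.

```shell
# Both commands are overridable so the logic can be tried without Kerberos.
TGT_CHECK="${TGT_CHECK:-klist -s}"   # assumed to exit 0 when a valid ticket is cached
TGT_FETCH="${TGT_FETCH:-cl-tgt}"     # CL helper named in the list above

# Hypothetical helper: request a TGT only if the check command fails.
ensure_tgt() {
    if $TGT_CHECK; then
        echo "TGT already valid"
    else
        echo "no valid TGT, requesting one"
        $TGT_FETCH
    fi
}
```

Calling ensure_tgt before submitting Condor jobs or starting a long-running script avoids filer access failing part-way through for want of a ticket.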


We have basically told users of the 'cron' and 'condor' services that they have to use it (despite the red warnings about it being dangerous).

We need to get some experience with it, and have our 'security aware' users mull it over before suggesting it as a "std" way for slogin-serv to retain TGTs. (mgk25 is very keen on it)

Categorising Keywords

  • Condor pool