Service Desk Knowledgebase: Condor
This is the Condor content page of the CL Wiki Service Desk Knowledgebase. Its purpose is to provide information to the Service Desk team on how to handle problems and requests about this CL service. If you are involved with the provision of this CL service please feel free to add to the knowledge about it.
If CL staff need to tell the Service Desk team about problems with this service please email
sys-admin-aside@cl.cam.ac.uk.
SERVICE PORTFOLIO
Key Service Description & URLs
- Condor batch system
- Condor – local guide
- Computer Laboratory News (Twitter use @UC_CL_SysAdm)
CL Customer Documentation
Further CL Sys-Admin Resources
NOTE: Machines for Condor jobs are NOT always powered up and are started up on request.
Underpinning Services
- XenE - Xen
Customer-base for this Service
- All staff and students of the collegiate University
Costs
- Free to all current staff and students of the collegiate University
SLA
- N/A
Service Desk Call Handling Procedure
- RT tickets can be escalated by changing the Queue to backoffice with the Owner set to Nobody and the Status as new. Tell the requestor:
I am passing this request over to the experts who, I'm sure, will be in contact shortly.
Condor
Graham Titmus (31/01/2015)
http://www.cl.cam.ac.uk/local/sys/unix/applications/condor/condor-6.8.html
1.2. Condor in the Computer Lab: "To use the Condor system, the execute machines need to be able to get Kerberos TGTs so that the filer is accessible. See the NFS sec=sys page for instructions on enabling filer access."
http://www.cl.cam.ac.uk/local/sys/unix/nfs-sec-krb5/#unattended
Unattended operation (cron jobs, Condor, scripts, servers): We are in the process of upgrading our cron servers, Condor machines and web servers to use NFS sec=krb5. If you need to run cron, Condor or other periodic jobs that access the filer then please email sys-admin to request advice on how to achieve this. There are several techniques that can be used depending upon the exact nature of your work, such as the TGT server. We will attempt to help you find the most appropriate solution.
In the case of Condor, the appropriate solution is to use the new TGT Server. Instructions are at http://www.wiki.cl.cam.ac.uk/clwiki/SysInfo/TgtServer (this is step 1 below).
1 The user needs to set up a Kerberos ticket on the machine.
2 The Xen VMs need to be started; these are named pb0xx. For example, to start machine 30 with 2 CPUs and 7GB of memory use
cl-condor-start pb030 2 7000000000
3 The PATH variable needs setting to
PATH=/opt/condor-6.8.3/bin:$PATH;export PATH
4 The user can then use the commands to submit and monitor jobs.
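As a rough aid for step 2, the sketch below builds the 'cl-condor-start' command line from a machine number, CPU count and memory in GB. It assumes the third argument is memory in bytes using decimal gigabytes, which is what the 7GB example above suggests; the helper names gb_to_bytes and condor_start_cmd are made up for illustration. It only prints the command, it does not start anything.

```shell
#!/bin/sh
# Hypothetical helpers: not part of the CL tooling.

# Convert whole decimal gigabytes to bytes (assumed unit for cl-condor-start).
gb_to_bytes() {
    echo $(( $1 * 1000000000 ))
}

# $1 = machine number, $2 = number of CPUs, $3 = memory in GB.
# Prints the command to run; machine names are assumed to follow pb0xx.
condor_start_cmd() {
    printf 'cl-condor-start pb%03d %s %s\n' "$1" "$2" "$(gb_to_bytes "$3")"
}

# Reproduces the example from step 2.
condor_start_cmd 30 2 7
```

Printing rather than running the command lets the user check the memory figure before a VM is started.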
A common problem is that jobs are held:
condor_q -analyze
just says "Request is held". A common cause is underestimating the amount of RAM needed: the job repeatedly runs out of memory and fails, so it is held.
condor_q
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1374.31 mr472 2/4 00:53 0+00:31:52 R 0 32128.9 runconfig.sh /home
shows jobs require 32GB each,
condor_Qr 1374.31 2S 5U 2C mr472 vm1@pb035 (Memory >= 4000)
shows that the user seems not to have specified enough RAM when starting the machine with 'cl-condor-start'. The solution is to start up new machines with enough memory.
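The check described above can be sketched as a small filter over condor_q output. It assumes the column layout shown in the example (SIZE is the eighth whitespace-separated field and is in MB) and that the machine's memory limit is also in MB, as the "Memory >= 4000" figure suggests. The function name jobs_over_mb is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list jobs whose SIZE exceeds a limit in MB.
# Reads condor_q-style rows on stdin; rows whose first field is not a
# job ID (such as the header) are skipped.
jobs_over_mb() {
    awk -v limit="$1" '$1 ~ /^[0-9]+\.[0-9]+$/ && $8 + 0 > limit + 0 { print $1, $8 }'
}

# Sample rows copied from the example above.
sample='ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1374.31 mr472 2/4 00:53 0+00:31:52 R 0 32128.9 runconfig.sh /home'

# Prints the job ID and its SIZE in MB for jobs larger than 4000 MB.
printf '%s\n' "$sample" | jobs_over_mb 4000
```

In real use the input would come from condor_q itself, for example `condor_q | jobs_over_mb 4000`.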
Dealing with the XenPool Machines
Contacts
Primary
- sys-admin-comment@cl.cam.ac.uk (Goes to CL back office team)
Availability
- 24x7
Hints, Tips & Known Issues
TGT server
Ian Mackey (20/4/15)
The command "condor_q" comes back with:
-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk CEDAR:6001:Failed to connect to <128.232.0.69:9654>
ssh-remote-0:~$ condor_q
-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk
CEDAR:6001:Failed to connect to <128.232.0.69:9654>
ssh-remote-0:~$ condor_q -analyze
-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk
CEDAR:6001:Failed to connect to <128.232.0.69:9654>
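When passing a ticket like this to the back office it may help to quote the address that could not be reached. The sketch below pulls it out of the error text; it assumes the message format shown above, and the function name failed_schedd_addr is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read condor_q output on stdin and print each
# distinct address that a "Failed to connect to <...>" line mentions.
failed_schedd_addr() {
    sed -n 's/.*Failed to connect to <\([^>]*\)>.*/\1/p' | sort -u
}

# Sample text copied from the report above.
sample='-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk
CEDAR:6001:Failed to connect to <128.232.0.69:9654>'

printf '%s\n' "$sample" | failed_schedd_addr
```

If nothing is printed, the output did not contain this error and the problem is probably a different one.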
TGT server
Piete Brooks (12/3/15)
The new kid on the block is a TGT server which allows people to do things like:
- retain a valid TGT at all times on named machines
- request a TGT on a named machine by typing 'cl-tgt'
We have basically told users of the 'cron' and 'condor' services that they have to use it (despite the red warnings about it being dangerous).
We need to get some experience with it, and have our 'security aware' users mull it over before suggesting it as a "std" way for slogin-serv to retain TGTs. (mgk25 is very keen on it)
Categorising Keywords
- Condor pool