Service Desk Knowledgebase: Condor: Difference between revisions

From Computer Laboratory System Administration
 
(10 intermediate revisions by 2 users not shown)
Line 31: Line 31:


==Service Desk Call Handling Procedure==
* [https://rt.cl.cam.ac.uk/ RT] tickets can be escalated by changing the '''Queue''' to '''sys-admin''' with the '''Owner''' set to '''Nobody''' and the '''Status''' as '''New'''.  Tell the requestor:<br /> ''I am passing this request over to our Condor team who, I'm sure, will be in contact shortly.''


===Dealing with the XenPool Machines===
* [http://helpdesk.csx.cam.ac.uk/ RT] tickets can be escalated by changing the '''Queue''' to '''backoffice''' with the '''Owner''' set to '''Nobody''' and the '''Status''' as '''new'''.  Tell the requestor:<br /> ''I am passing this request over to the experts who, I'm sure, will be in contact shortly.''
See [https://wiki.cam.ac.uk/cl-sys-admin/Service_Desk_Knowledgebase:_XenE#Accessing_the_Xen_Console Accessing the Xen Console]


== Contacts ==
===Condor===
[http://www.lookup.cam.ac.uk/person/gt19 Graham Titmus] (31/01/2015)


'''Primary'''
http://www.cl.cam.ac.uk/local/sys/unix/applications/condor/condor-6.8.html
* [mailto:sys-admin-comment@cl.cam.ac.uk sys-admin-comment@cl.cam.ac.uk] (Goes to CL back office team)
  '''1.2. Condor in the Computer Lab'''
  To use the condor system, the execute machines needs to be able to
  get Kerberos TGTs so that the filer is accessible, See the NFS
  sec=sys page for instructions on enabling filer access."


==Availability==
http://www.cl.cam.ac.uk/local/sys/unix/nfs-sec-krb5/#unattended
  '''Unattended operation (cron jobs, Condor, scripts, servers)''' 
  We are in the process of upgrading our cron servers, Condor
  machines and web servers to use NFS sec=krb5. If you need to run
  cron, condor or jobs periodically that access filer then please
  email sys-admin to request advise on how to achieve this. There
  are several techniques that can be used depending upon the exact
  nature of your work, such as the TGT server. We will attempt to
  help you find the most appropriate solution.


* 24x7
In the case of Condor, the appropriate solution is to use the new TGT Server. There is information on how to do so at
http://www.wiki.cl.cam.ac.uk/clwiki/SysInfo/TgtServer (i.e. step 1 below).


==Hints, Tips & Known Issues==
1 The user needs to set up a [http://www.wiki.cl.cam.ac.uk/rowiki/SysInfo/TgtServer Kerberos Ticket] on the machine.
 
===Condor…===
[http://www.lookup.cam.ac.uk/person/gt19 Graham Titmus] (31/01/2015)
 
 
1 The user needs to set up a [http://www.wiki.cl.cam.ac.uk/rowiki/SysInfo/TgtServer Kerberos Ticket] on the machine.


2 The Xen VMs need to be started - these are named pb0xx.  e.g. to start machine 30 with 2 CPUs and 7GB memory use
Line 70: Line 75:
shows the user seems not to have specified suitable RAM when starting the machine using 'cl-condor-start'.
The solution is to start up new machines with enough memory.
----
 
===Dealing with the XenPool Machines===
See [https://wiki.cam.ac.uk/cl-sys-admin/Service_Desk_Knowledgebase:_XenE#Accessing_the_Xen_Console Accessing the Xen Console]
 
== Contacts ==
 
'''Primary'''
* [mailto:sys-admin-comment@cl.cam.ac.uk sys-admin-comment@cl.cam.ac.uk] (Goes to CL back office team)
 
==Availability==
 
* 24x7
 
==Hints, Tips & Known Issues==
 
===TGT server===
[http://www.lookup.cam.ac.uk/person/iwm21 Ian Mackey] (20/4/15)
 
The command "condor_q" comes back with:
-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk
CEDAR:6001:Failed to connect to <128.232.0.69:9654>
 
ssh-remote-0:~$ condor_q 
-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk
CEDAR:6001:Failed to connect to <128.232.0.69:9654>
ssh-remote-0:~$ condor_q -analyze
-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk
CEDAR:6001:Failed to connect to <128.232.0.69:9654>
 
 
 
 
===TGT server===
[http://www.lookup.cam.ac.uk/person/pb22 Piete Brooks] (12/3/15)
 
The new kid on the block is a [http://www.wiki.cl.cam.ac.uk/clwiki/SysInfo/TgtServer TGT server] which allows people to do things like:
# retain a valid TGT at all times on named machines
# request a TGT on a named machine by typing 'cl-tgt'
 
 
We have basically told users of the ''''cron'''' and ''''condor'''' services that they have to use it (despite the red warnings about it being dangerous).
 
We need to get some experience with it, and have our 'security aware' users mull it over before suggesting it as a "std" way for slogin-serv to retain TGTs. (mgk25 is very keen on it)


==Categorising Keywords==
* Condor pool

Latest revision as of 12:39, 19 June 2015

This is the Condor content page of the CL Wiki Service Desk Knowledgebase. Its purpose is to provide information to the Service Desk team on how to handle problems and requests about this CL service. If you are involved with the provision of this CL service, please feel free to add to the knowledge about it.

If CL staff need to tell the Service Desk team about problems with this service please email
sys-admin-aside@cl.cam.ac.uk.

Return to the Service Desk Knowledgebase SERVICE PORTFOLIO

Key Service Description & URLs

CL Customer Documentation

Further CL Sys-Admin Resources

NOTE: Machines for Condor jobs are NOT always powered up and are started up on request.

Underpinning Services

Customer-base for this Service

  • All staff and students of the collegiate University

Costs

  • Free to all current staff and students of the collegiate University

SLA

  • N/A

Service Desk Call Handling Procedure

  • RT tickets can be escalated by changing the Queue to backoffice with the Owner set to Nobody and the Status as new. Tell the requestor:
    I am passing this request over to the experts who, I'm sure, will be in contact shortly.

Condor

Graham Titmus (31/01/2015)

http://www.cl.cam.ac.uk/local/sys/unix/applications/condor/condor-6.8.html

 1.2. Condor in the Computer Lab
 To use the condor system, the execute machines needs to be able to
 get Kerberos TGTs so that the filer is accessible, See the NFS
 sec=sys page for instructions on enabling filer access."

http://www.cl.cam.ac.uk/local/sys/unix/nfs-sec-krb5/#unattended

 Unattended operation (cron jobs, Condor, scripts, servers)  
 We are in the process of upgrading our cron servers, Condor 
 machines and web servers to use NFS sec=krb5. If you need to run 
 cron, condor or jobs periodically that access filer then please 
 email sys-admin to request advise on how to achieve this. There 
 are several techniques that can be used depending upon the exact 
 nature of your work, such as the TGT server. We will attempt to 
 help you find the most appropriate solution.

In the case of Condor, the appropriate solution is to use the new TGT Server. There is information on how to do so at http://www.wiki.cl.cam.ac.uk/clwiki/SysInfo/TgtServer (i.e. step 1 below).

1 The user needs to set up a Kerberos Ticket on the machine.

2 The Xen VMs need to be started; these are named pb0xx. For example, to start machine 30 with 2 CPUs and 7GB memory use

cl-condor-start pb030 2 7000000000

3 The PATH variable needs setting to

PATH=/opt/condor-6.8.3/bin:$PATH;export PATH

4 The user can then use the commands to submit and monitor jobs.
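
Steps 2 and 3 can be sketched as a small shell helper. This is only an illustration: the function names vm_name and start_cmd are hypothetical, and the memory argument is assumed to be a byte count with 1GB taken as 1000000000, which is inferred from the single example in step 2 rather than from any documentation of cl-condor-start. The helper prints the command instead of running it.

```shell
#!/bin/sh
# Step 3: put the Condor binaries on the PATH (path taken from this page).
PATH=/opt/condor-6.8.3/bin:$PATH; export PATH

# vm_name N: print the pb0xx name for machine number N (30 -> pb030).
# Pass N without a leading zero; printf would treat "030" as octal.
vm_name() {
    printf 'pb%03d\n' "$1"
}

# start_cmd N CPUS GB: print (not run) the cl-condor-start line for step 2.
# Assumption: the third argument is bytes, with 1GB = 1000000000.
start_cmd() {
    printf 'cl-condor-start %s %s %s000000000\n' "$(vm_name "$1")" "$2" "$3"
}

# Machine 30 with 2 CPUs and 7GB, as in the example above.
start_cmd 30 2 7
```

The printed line can be checked by eye against the example in step 2 before it is run for real.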

A common problem is that jobs are held:

condor_q -analyze 

just says "Request is held". A common cause is underestimating the amount of RAM needed. The job will repeatedly run out of memory and fail, and so be held.

condor_q
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 
1374.31 mr472 2/4 00:53 0+00:31:52 R 0 32128.9 runconfig.sh /home

shows that the jobs require 32GB each, while

condor_Qr
1374.31 2S 5U 2C mr472 vm1@pb035 (Memory >= 4000)

shows the user seems not to have specified suitable RAM when starting the machine using 'cl-condor-start'. The solution is to start up new machines with enough memory.
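
The comparison above can be sketched as a small filter over condor_q output. This is a hedged illustration only: the function name big_jobs is hypothetical, and it assumes the column layout shown above, where SIZE is the eighth whitespace-separated field and is in megabytes. The 4000 threshold is the Memory value from the condor_Qr line.

```shell
#!/bin/sh
# big_jobs LIMIT_MB: read condor_q output on stdin and print the ID, owner
# and SIZE of every job whose SIZE exceeds LIMIT_MB.
# Assumes the layout shown above: the SUBMITTED column takes two fields
# (date and time), so SIZE is field 8.
big_jobs() {
    awk -v limit="$1" \
        '$1 ~ /^[0-9]+\.[0-9]+$/ && $8 + 0 > limit + 0 { print $1, $2, $8 }'
}

# Sample input copied from the output above; normally: condor_q | big_jobs 4000
printf '%s\n' \
    'ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD' \
    '1374.31 mr472 2/4 00:53 0+00:31:52 R 0 32128.9 runconfig.sh /home' |
    big_jobs 4000
```

Any job listed by the filter is larger than the machines it can currently match, which points to restarting machines with a larger memory argument to 'cl-condor-start'.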

Dealing with the XenPool Machines

See Accessing the Xen Console

Contacts

Primary

Availability

  • 24x7

Hints, Tips & Known Issues

TGT server

Ian Mackey (20/4/15)

The command "condor_q" comes back with:

-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk
CEDAR:6001:Failed to connect to <128.232.0.69:9654>
ssh-remote-0:~$ condor_q  
-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk
CEDAR:6001:Failed to connect to <128.232.0.69:9654>

ssh-remote-0:~$ condor_q -analyze

-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk
CEDAR:6001:Failed to connect to <128.232.0.69:9654>
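
One way to narrow this down is to check whether anything is listening at the address named in the error. The sketch below is an assumption-laden illustration, not a documented procedure: the function names schedd_addr and port_open are hypothetical, and port_open assumes an nc that supports -z (connect without sending data) and -w (timeout in seconds).

```shell
#!/bin/sh
# schedd_addr: read condor_q error text on stdin and print "HOST PORT"
# for the first <host:port> address found.
schedd_addr() {
    sed -n 's/.*<\([0-9.]*\):\([0-9]*\)>.*/\1 \2/p' | head -n 1
}

# port_open HOST PORT: succeed if a TCP connection can be opened.
# Assumes nc supports -z and -w; not every nc variant does.
port_open() {
    nc -z -w 5 "$1" "$2"
}

# Sample error line copied from the output above.
msg='-- Failed to fetch ads from: <128.232.0.69:9654> : ssh-remote-0.cl.cam.ac.uk'
printf '%s\n' "$msg" | schedd_addr
```

Passing the two printed values to port_open on ssh-remote-0 indicates whether the connection is being refused, which may help the back office team when the ticket is escalated.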



TGT server

Piete Brooks (12/3/15)

The new kid on the block is a TGT server which allows people to do things like:

  1. retain a valid TGT at all times on named machines
  2. request a TGT on a named machine by typing 'cl-tgt'


We have basically told users of the 'cron' and 'condor' services that they have to use it (despite the red warnings about it being dangerous).

We need to get some experience with it, and have our 'security aware' users mull it over before suggesting it as a "std" way for slogin-serv to retain TGTs. (mgk25 is very keen on it)

Categorising Keywords

  • Condor pool