Filer review

From Computer Laboratory System Administration
Revision as of 14:19, 15 February 2012 by mgk25 (talk | contribs)
This is an evolving early-draft report by the ad-hoc Filer Working Group, which started in early 2012 to review the use and configuration of the departmental filer. For more information about the filer, there are also the departmental filespace user documentation, some notes on the NetApp file server by and for sys-admin, and its own man pages.

The Computer Laboratory has operated a centrally provided NFS file store for Unix/Linux systems continuously since the mid 1980s. This service hosts the commonly used home directories and working directories of most users, and is widely used by research groups to collaborate, via group directories for shared project files and software. At present, this service is provided by a NetApp FAS3140-R5 storage server "elmer" (SN: 210422922), running under "Data ONTAP Release 7.3.3". This server also provides access to the same filespace to other operating systems via the CIFS and WebDAV protocols. It also hosts disk images for virtual machines, which are accessed over block-level protocols such as iSCSI. An additional FAS2040-R5 server "echo" (SN: 200000186549) handles off-site backup using SnapVault.

Review

The Computer Laboratory's IT Advisory Panel initiated on 28 October 2011 an ad-hoc working group to review the provision of departmental file space, headed by Markus Kuhn. The initial focus of this review will be the configuration and use of the existing filer "elmer", with a particular view to identifying and eliminating obstacles that currently prevent remote NFS access by private Linux home computers (something that has long been available to Windows users). It is believed that this project also provides an opportunity to rid the configuration of the filer of historic baggage and to streamline and simplify its use by departmentally administered Linux machines. The working group may later extend its remit and welcomes suggestions.

The initial fact-finding phase of the review was conducted by Markus Kuhn and Martyn Johnson and focussed on namespace management, authentication, and access from outside the department.

Namespace management

The NetApp operating system requires filer administrators to structure the storage space at several levels. Familiarity with these will help to understand some of the historic design decisions made.

  • An aggregate is a collection of physical discs, made up of one or more RAID-sets. It is the smallest unit that can be physically unplugged and moved intact to a different filer. Elmer has two aggregates because one cannot mix different major disk technologies (Fibre Channel vs. SATA) in an aggregate. The backup filer echo has just one. Discs can be added to an aggregate on the fly, but never removed.
  • A volume is a major unit of space allocation within an aggregate. Typically, volumes have reserved space, though it is possible to over-commit if one really wants to. Many properties are bound to a volume, e.g. language(?). Significantly, a volume is the unit of snapshotting – each volume has its own snapshot schedule and retention policy.
  • A q-tree ("quota tree") is a magic directory within the root directory of a volume which has a quota attached to it and all its descendants. (This is merely for quota; there is no space reservation associated with a q-tree.)
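The relationship between these three layers can be inspected on the filer's own command line. The following is a sketch using Data ONTAP 7-mode commands (run on the filer console; the exact output format varies between ONTAP releases):

```
# List aggregates, their RAID groups and member discs:
aggr status -r

# List volumes and show which aggregate each one lives in:
vol status

# List the q-trees within each volume (quota only, no space reservation):
qtree status

# Show per-q-tree quota usage:
quota report
```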

When we first got the filer in 19??, the aggregate layer did not exist, and a volume was just a collection of discs. A single volume therefore could not grow very large, and since a q-tree cannot span multiple volumes (and hence multiple sets of discs), this imposed an upper bound on the size of a q-tree. It was thus not feasible to put, for example, all user home directories into a single q-tree. In addition, the backup system imposed constraints on the total number of q-trees, so it was also not possible to give every user their own q-tree. As a compromise, Martyn Johnson created eight q-trees called homes-1 to homes-8, which are now all located in volume 1, along with various q-trees for each research group with group filespace (and for various other functions):

 $ ls /a/elmer-vol1
 grp-cb1  grp-op1  grp-se1  grp-th1  homes-2  homes-5  homes-8  sys-rt
 grp-da1  grp-pr1  grp-sr1  grp-th2  homes-3  homes-6  sys-1
 grp-nl1  grp-rb1  grp-sr9  homes-1  homes-4  homes-7  sys-pk1
 $ ls /a/elmer-vol3
 grp-dt1  grp-nl4  grp-rb4   grp-sr3  grp-sr7  sys-lg1  sys-ww1
 grp-dt2  grp-nl9  grp-rb9   grp-sr4  grp-sr8  sys-li1
 grp-nl2  grp-rb2  grp-sr11  grp-sr5  grp-th9  sys-li9
 grp-nl3  grp-rb3  grp-sr2   grp-sr6  sys-acs  sys-pk2
 $ ls /a/elmer-vol4
 misc-clbib  misc-repl  sys-bmc  www-1  www-2
 $ ls /a/elmer-vol5
 WIN32Repository  grp-sr10  grp-te1     scr-1  scr-3  scr-5
 grp-rb5          grp-sr12  misc-arch1  scr-2  scr-4  www-3
 $ ls /a/elmer-vol6
 MSprovision  grp-nl8  grp-rb6  sys-ct  sys-rt2  www-4
 $ ls /a/elmer-vol8
 grp-ai1  grp-dt8  grp-dt9  grp-nl7
 $ ls /a/elmer-vol9
 ah433-nosnap  iscsi-nosnap1  misc-nosnap1

As a result of this compromise, the pathname of a (super)home directory on the filer, such as

 vol1/homes-1/maj1/
 vol1/homes-5/mgk25/

now includes a q-tree identifier (e.g., homes-1) that the user cannot infer from the user identifier, and which we therefore would ideally hide from users. Users should instead see simple pathnames such as /homes/maj1. Therefore, a two-stage mapping system between filer pathnames and user-visible pathnames was implemented for NFSv3:

  • Server-side mapping: Firstly, the filer's /etc/exports file (/a/elmer-vol0/etc/exports on lab-managed Linux machines) uses the -actual option, as in "/vol/userfiles/mgk25 -actual=/vol/vol1/homes-5/mgk25", to export each superhome directory of a user under an alias pathname that lacks the q-tree identifier.
  • Client-side mapping: Secondly, the autofs system is used to individually mount such user directories under a more customary location in the client-side namespace, using mount entries such as "elmer:/vol/userfiles/mgk25/unix_home on /auto/homes/mgk25" or "elmer:/vol/vol3/grp-rb2/ecad on /auto/groups/ecad". Finally, symbolic links such as "/homes -> /auto/homes", "/usr/groups -> /auto/groups", and "/anfs -> /auto/anfs" are created to give access via customary short pathnames.
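The net effect of the two mapping stages can be illustrated with plain symbolic links in a scratch directory. This is a local simulation only: on the real system the first step happens inside the filer's exports table and the second is performed by the automounter, not by symlinks. The pathnames are the real ones from the text above.

```shell
#!/bin/sh
# Simulate the two-stage pathname mapping in a throw-away directory.
set -e
root=$(mktemp -d)

# "Server side": the physical location includes the q-tree identifier.
mkdir -p "$root/vol/vol1/homes-5/mgk25/unix_home"

# The -actual export alias hides the q-tree identifier:
mkdir -p "$root/vol/userfiles"
ln -s "$root/vol/vol1/homes-5/mgk25" "$root/vol/userfiles/mgk25"

# "Client side": autofs mounts the alias under /auto/homes, and a
# symlink /homes -> /auto/homes provides the customary short pathname.
mkdir -p "$root/auto/homes"
ln -s "$root/vol/userfiles/mgk25/unix_home" "$root/auto/homes/mgk25"
ln -s "$root/auto/homes" "$root/homes"

# The user-visible pathname now resolves to the physical location:
readlink -f "$root/homes/mgk25"

rm -rf "$root"
```

Running the script prints the fully resolved physical path, ending in vol/vol1/homes-5/mgk25/unix_home, even though the user only ever typed a path under /homes.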

This solution grew historically and was motivated by two considerations:

  • When new users are created, their home directory should become instantly available on all client machines, which meant that the mapping that eliminates the q-tree identifier from the pathname had to be performed on the server, as there was no practical way to push out such changes in real-time to all client machines.
  • There was already an existing historic automount configuration infrastructure and customary namespace in place on lab-managed Unix clients when the filer was installed.

This arrangement is far from ideal, and causes a number of problems:

  • Quota restrictions (as shown on lab-managed Linux machines by "cl-rquota") are reported in terms of q-tree identifiers (e.g., homes-5, scr-1, www-1) that have hardly any useful link with the pathnames that the user is accustomed to, making it difficult to guess which subdirectory needs cleaning up if one runs out of quota.
  • Some research groups historically had to fragment their group space across multiple q-trees, which complicates understanding quotas and namespace.
  • Some users have quota in others' home directories (namely those sharing the same one of the eight homes-* q-trees), but others do not, which can lead to confusion when users try to collaborate using shared directories in someone's home directory.
  • A very elaborate system of symbolic links and autofs configuration is needed to create the customary client-side namespace, which requires substantial setup and tweaking on lab-managed machines that was never documented or supported for implementation on private machines.
  • The scheme relies on the -actual option in the filer's /etc/exports, which works for NFSv3, but apparently not for NFSv4. As a result, we cannot switch to NFSv4 with the current setup, and we are losing out on the substantial performance and ease-of-tunneling advantages that NFSv4 would have, in particular for remote access (only one single TCP connection to the filer needed, "delegation" to maintain consistency of a local file cache, etc.).
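To illustrate the ease-of-tunneling point: because NFSv4 multiplexes all traffic over a single TCP connection to port 2049, it could in principle be carried through one ssh port forward. The following is a hypothetical sketch only; the gateway hostname is invented and NFSv4 service on the filer is exactly what the current setup does not yet support:

```
# Forward local port 2049 to the filer's NFSv4 port via a
# departmental ssh gateway (hostnames are hypothetical):
ssh -f -N -L 2049:elmer.cl.cam.ac.uk:2049 crsid@gate.cl.cam.ac.uk

# Mount the forwarded port as NFSv4 (requires root on the client):
mount -t nfs4 localhost:/ /mnt/filer
```

No equivalent single-tunnel arrangement is practical for NFSv3, which also needs the separate portmapper, mountd and lock-manager services.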

There are two reasons why fixing this is non-trivial and was not done long ago:

  1. The autofs configuration required for client-side mapping is disseminated via LDAP from an LDIF master file that is currently generated by a very complex set of historically grown scripts that are considered unmaintainable and not fully understood by any member of sys-admin. This system is overdue for reimplementation.
  2. The filer lacks any more efficient command than simple copying to move files from one q-tree into another. Therefore, reorganising the way in which home and research-group directories are distributed across q-trees would not only take several days to complete (which could be disruptive), but would also cause significant "churn" in the backup system, essentially doubling for a long time the amount of disc space required on the backup system.
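For context on point 1: client-side automounters can read their maps directly from LDAP using the standard autofs schema, so reimplementation need not invent a new mechanism. A map entry in the generated LDIF typically looks like the following (the DN suffix and map name here are hypothetical; the mount information is the real pathname from the text above):

```
dn: automountKey=mgk25,automountMapName=auto.homes,dc=example,dc=org
objectClass: top
objectClass: automount
automountKey: mgk25
automountInformation: elmer:/vol/userfiles/mgk25/unix_home
```

The maintainability problem lies not in this format, which is simple, but in the scripts that generate such entries from the user database.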