Filer review
- This is an evolving early-draft report by the ad-hoc Filer Working Group, which began work in early 2012 to review the use and configuration of the departmental filer. For more information about the filer, see also the departmental filespace user documentation, the notes on the NetApp file server written by and for sys-admin, and the filer's own man pages.
The Computer Laboratory has operated a centrally provided NFS file store for Unix/Linux systems continuously since the mid 1980s. This service hosts the home directories and working directories of most users, and is widely used by research groups to collaborate via group directories for shared project files and software. At present, the service is provided by a NetApp FAS3140-R5 storage server "elmer" (SN: 210422922), running under "Data ONTAP Release 7.3.3". This server also provides access to the same filespace from other operating systems via the CIFS and WebDAV protocols, and it hosts disk images for virtual machines, which are accessed over block-level protocols such as iSCSI. An additional FAS2040-R5 server "echo" (SN: 200000186549) handles off-site backup using SnapVault.
Review
The Computer Laboratory's IT Advisory Panel initiated on 28 October 2011 an ad-hoc working group, headed by Markus Kuhn, to review the provision of departmental file space. The initial focus of this review will be the configuration and use of the existing filer "elmer", with a particular view to identifying and eliminating obstacles that currently prevent remote NFS access by private Linux home computers (something that has long been available to Windows users). It is believed that this project also provides an opportunity to rid the configuration of the filer of historic baggage, and to streamline and simplify its use by departmentally administered Linux machines. The working group may later extend its remit and welcomes suggestions.
The initial fact-finding phase of the review was conducted by Markus Kuhn and Martyn Johnson and focussed on namespace management, authentication, and access from outside the department.
Namespace management
The NetApp operating system requires filer administrators to structure the storage space at several levels. Familiarity with these levels helps to understand some of the historic design decisions; an illustrative sketch of the corresponding administration commands follows the list below.
- An aggregate is a collection of physical discs, made up of one or more RAID sets. It is the smallest unit that can be physically unplugged and moved intact to a different filer. Elmer has two aggregates because different major disc technologies (Fibre Channel vs. SATA) cannot be mixed within one aggregate. The backup filer echo has just one. Discs can be added to an aggregate on the fly, but never removed.
- A volume is a major unit of space allocation within an aggregate. Typically, they have reserved space, though it is possible to over-commit if one really wants to. Many properties are bound to a volume, e.g. language(?). Significantly, a volume is the unit of snapshotting – each volume has its own snapshot schedule and retention policy.
- A q-tree ("quota tree") is a magic directory within the root directory of a volume; a quota can be attached to it and covers all its descendants. (This is merely for quota accounting; there is no space reservation associated with a q-tree.)
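To illustrate how these three levels relate, the following Data ONTAP 7-mode commands sketch how an aggregate, a volume, a snapshot schedule and a q-tree might be created. The names, sizes and disc counts are made up for illustration and do not reflect elmer's actual layout:

aggr create aggr1 -t raid_dp 28        # aggregate built from 28 discs as RAID-DP sets
vol create vol1 aggr1 2t               # flexible volume with 2 TB of reserved space in aggr1
snap sched vol1 0 2 6                  # per-volume snapshot schedule (weekly/daily/hourly)
qtree create /vol/vol1/homes-1         # q-tree in the root directory of vol1
quota on vol1                          # enforce the tree quotas listed in /etc/quotas

A matching tree quota would then be a line in the filer's /etc/quotas file along the lines of "/vol/vol1/homes-1 tree 300G" (again, the limit is illustrative only).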
When we first got the filer in 19??, the aggregate layer did not exist and a volume was just a collection of discs. A single volume therefore could not grow very big, and since a q-tree cannot span multiple volumes (and hence could not span multiple sets of discs), this imposed an upper bound on the size of a q-tree. It was therefore not feasible to put, for example, all user home directories into a single q-tree. In addition, the backup system imposed constraints on the total number of q-trees, so it was also not possible to give every user their own q-tree. As a compromise, Martyn Johnson created eight q-trees called homes-1 to homes-8, which are now all located in volume 1, along with various q-trees for each research group with group filespace (and for various other functions):
$ ls /a/elmer-vol1
grp-cb1  grp-op1  grp-se1  grp-th1  homes-2  homes-5  homes-8  sys-rt
grp-da1  grp-pr1  grp-sr1  grp-th2  homes-3  homes-6  sys-1
grp-nl1  grp-rb1  grp-sr9  homes-1  homes-4  homes-7  sys-pk1
$ ls /a/elmer-vol3
grp-dt1  grp-nl4  grp-rb4  grp-sr3   grp-sr7  sys-lg1  sys-ww1
grp-dt2  grp-nl9  grp-rb9  grp-sr4   grp-sr8  sys-li1
grp-nl2  grp-rb2  grp-sr11 grp-sr5   grp-th9  sys-li9
grp-nl3  grp-rb3  grp-sr2  grp-sr6   sys-acs  sys-pk2
$ ls /a/elmer-vol4
misc-clbib  misc-repl  sys-bmc  www-1  www-2
$ ls /a/elmer-vol5
WIN32Repository  grp-sr10  grp-te1     scr-1  scr-3  scr-5
grp-rb5          grp-sr12  misc-arch1  scr-2  scr-4  www-3
$ ls /a/elmer-vol6
MSprovision  grp-nl8  grp-rb6  sys-ct  sys-rt2  www-4
$ ls /a/elmer-vol8
grp-ai1  grp-dt8  grp-dt9  grp-nl7
$ ls /a/elmer-vol9
ah433-nosnap  iscsi-nosnap1  misc-nosnap1
As a result of this compromise, the pathname of a (super)home directory on the filer, such as
vol1/homes-1/maj1/
vol1/homes-5/mgk25/
now includes a q-tree identifier (e.g., homes-1) that the user cannot infer from the user identifier, and which we would therefore ideally hide from users. Users should instead see simple pathnames such as /homes/maj1. Therefore, a two-stage mapping between filer pathnames and user-visible pathnames was implemented for NFSv3 (a configuration sketch follows the list below):
- Server-side mapping: Firstly, the filer's /etc/exports file (visible as /a/elmer-vol0/etc/exports on lab-managed Linux machines) uses the -actual option, as in "/vol/userfiles/mgk25 -actual=/vol/vol1/homes-5/mgk25", to export each user's superhome directory under an alias pathname that lacks the q-tree identifier.
- Client-side mapping: Secondly, the autofs system is used to individually mount such user directories under a more customary location in the client-side namespace, using mount entries such as "elmer:/vol/userfiles/mgk25/unix_home on /auto/homes/mgk25" or "elmer:/vol/vol3/grp-rb2/ecad on /auto/groups/ecad". Finally, symbolic links such as "/homes -> /auto/homes", "/usr/groups -> /auto/groups", and "/anfs -> /auto/anfs" give access via customary short pathnames.
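Put together, the two stages might look roughly as follows. The export line is taken from the scheme described above, whereas the autofs map name (auto.homes) and the mount options are merely illustrative and not the exact configuration generated on lab-managed machines:

# filer, /etc/exports: alias pathname without the q-tree identifier (NFSv3 only)
/vol/userfiles/mgk25  -actual=/vol/vol1/homes-5/mgk25

# client, autofs map "auto.homes": one entry per user
mgk25  -fstype=nfs,rw,hard,intr  elmer:/vol/userfiles/mgk25/unix_home

# client, /etc/auto.master entry and convenience symlink
/auto/homes  auto.homes
# ln -s /auto/homes /homes

With such a map in place, accessing /homes/mgk25 triggers an automount of elmer:/vol/userfiles/mgk25/unix_home, and the q-tree identifier homes-5 never appears on the client.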
This solution grew historically and was motivated by two considerations:
- When new users are created, their home directory should become instantly available on all client machines, which meant that the mapping needed to eliminate the q-tree identifier from the pathname had to be performed on the server, as there was no practical way to push out such changes in real-time to all client machines.
- There was already an existing historic automount configuration infrastructure and customary namespace in place on lab-managed Unix clients when the filer was installed.
This arrangement is far from ideal, and causes a number of problems:
- Quota restrictions (as shown on lab-managed Linux machines by "cl-rquota") are reported in terms of q-tree identifiers (e.g., homes-5, scr-1, www-1) that have hardly any useful link to the pathnames that users are accustomed to, making it difficult to guess which subdirectory needs cleaning up when one runs out of quota.
- Some research groups historically had to fragment their group space across multiple q-trees, which complicates both their quota reports and their namespace.
- Some users have quota in other users' home directories (namely those sharing the same one of the eight homes-* q-trees), but others do not, which can lead to confusion when users try to collaborate using shared directories in someone's home directory.
- A very elaborate system of symbolic links and autofs configuration is needed to create the customary client-side namespace, which requires substantial setup and tweaking on lab-managed machines and was never documented or supported for implementation on private machines.
- The scheme relies on the -actual option in the filer's /etc/exports, which works for NFSv3, but apparently not for NFSv4. As a result, we cannot switch to NFSv4 with the current setup, and we are losing out on the substantial performance and ease-of-tunneling advantages that NFSv4 would offer, in particular for remote access (only a single TCP connection to the filer is needed, "delegation" maintains the consistency of a local file cache, etc.); see the mount sketch after this list.
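For comparison, once NFSv4 exports were available, a private Linux machine could in principle mount its home directory over a single TCP connection to port 2049 with something like the following. The hostname, export path and options shown here are illustrative assumptions, not a supported configuration:

# hypothetical NFSv4 mount from a private Linux client
mount -t nfs4 -o proto=tcp elmer.cl.cam.ac.uk:/homes/mgk25 /mnt/elmer-home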
There are two reasons why fixing this is non-trivial and has not already been done:
- The autofs configuration required for client-side mapping is disseminated via LDAP from an LDIF master file that is currently generated by a very complex set of historically grown scripts, which are considered unmaintainable and are not fully understood by any member of sys-admin. This system is overdue for reimplementation (an illustrative LDIF sketch follows at the end of this list).
- The filer lacks any command more efficient than simple copying for moving files from one q-tree into another. Therefore, reorganising the way in which home and research-group directories are distributed across q-trees would not only take several days to complete (which could be disruptive), but would also cause significant "churn" in the backup system, essentially doubling the amount of disc space required on the backup system for a long time.
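For reference, the LDAP-published autofs data mentioned above would, assuming the standard automount LDAP schema, contain entries roughly like the following. The DN suffix, map name and mount options here are illustrative and not the actual contents of the generated LDIF master file:

dn: automountMapName=auto.homes,dc=example,dc=org
objectClass: top
objectClass: automountMap
automountMapName: auto.homes

dn: automountKey=mgk25,automountMapName=auto.homes,dc=example,dc=org
objectClass: top
objectClass: automount
automountKey: mgk25
automountInformation: -fstype=nfs,rw,hard,intr elmer:/vol/userfiles/mgk25/unix_home

Any reimplementation of the generator would still have to produce map data of essentially this shape for each user and group directory.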