Editing XML

From RavenWiki
Revision as of 16:51, 4 March 2009 by jw35 (First cut)

Most of the configuration files for the Internet2 SP are written in XML. This is a convenient format for expressing structured data in a file, but potentially confusing for SP administrators who will have to edit such files. Here is a brief introduction to editing them.

File format

The files are plain text files. Edit them with a text editor such as vi, Emacs or Notepad, not a word processor. Under Windows, some of the example files are distributed with Unix line endings and so appear in Notepad as one long line - either use a different editor (WordPad seems to work) or convert the line endings.

Many of the lines in the example files are long - it is easier to edit them in a reasonably wide window.

XML

XML files can appear complicated and yet the underlying format is quite simple.

XML files consist of elements. These have a start-tag and an end-tag and may have some content. For example:

 <Foo>
   Here is some text
 </Foo>

Start and end tags always come in pairs, with the one exception that an element with no content such as

 <Foo></Foo>

can but need not be written:

 <Foo/>

[HTML authors normally write such empty tags with a space before the '/>'. This is an old compatibility trick for pre-XHTML web browsers that is probably no longer needed. It isn't needed for XML either, though it does little harm.]

Note that if expanding an empty element into the full form, perhaps because you want to add some content, you have to remember to remove the extra '/'.
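
The equivalence of the two forms, and the effect of a forgotten '/', can be checked with any XML parser. Here is a minimal sketch using Python's standard xml.etree.ElementTree module. Python is used purely for illustration and is not part of the SP software; the describe helper is a name made up for this example.

```python
import xml.etree.ElementTree as ET


def describe(element):
    """Return the tag name, text content and child elements of an element."""
    return (element.tag, element.text, list(element))


long_form = ET.fromstring("<Foo></Foo>")
short_form = ET.fromstring("<Foo/>")
print(describe(long_form) == describe(short_form))  # True

# Expanding <Foo/> but forgetting to remove the '/' leaves a complete empty
# element followed by stray text and a stray end-tag, which is an error.
try:
    ET.fromstring("<Foo/>Here is some text</Foo>")
    forgotten_slash_accepted = True
except ET.ParseError:
    forgotten_slash_accepted = False
print(forgotten_slash_accepted)  # False
```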

Tag names are case-sensitive - <Foo> isn't the same as <FOO> which isn't the same as <foo>. The case of start and end tags has to match too. Some tag names contain ':' - this has a special meaning in XML but for our purposes you can ignore it and treat the whole thing as the tag's name.
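
As an illustration of the case rule, this sketch feeds a parser one element whose tags match and one whose tags differ only in case. It uses Python's standard xml.etree.ElementTree module as a stand-in parser, and the parses helper is invented for the example.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if text is accepted by the parser, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses("<Foo>Here is some text</Foo>"))  # True
print(parses("<Foo>Here is some text</foo>"))  # False

# Names that differ only in case are different names.
print(ET.fromstring("<FOO/>").tag == ET.fromstring("<foo/>").tag)  # False
```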

An element's content is everything between the start and end tags. This is typically either text or one or more further elements:

 <Foo>
   <Bar>
     Some text
   </Bar>
   <Baz/>
 </Foo>

Here the Foo element contains a Bar and a Baz element, Bar contains some text, and Baz is empty. An element's content can also be a mixture of text and other elements, but this format isn't seen in Shibboleth configuration files. Where elements are nested, their start and end tags must appear in the right order - this isn't allowed:

 <Foo><Bar></Foo></Bar>
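
To see how a program views nesting, the following sketch parses a nested document like the one shown earlier and lists the children of Foo, then tries the overlapping form. It assumes Python's standard xml.etree.ElementTree module as the parser; other parsers behave the same way for well-formedness but word their errors differently.

```python
import xml.etree.ElementTree as ET

document = """
<Foo>
  <Bar>
    Some text
  </Bar>
  <Baz/>
</Foo>
"""

root = ET.fromstring(document)
children = [child.tag for child in root]
print(children)                      # ['Bar', 'Baz']
print(root.find("Bar").text.strip())  # Some text

# Overlapping elements are rejected outright.
try:
    ET.fromstring("<Foo><Bar></Foo></Bar>")
    overlap_rejected = False
except ET.ParseError as error:
    overlap_rejected = True
    print("rejected:", error)
```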

Whitespace is part of content though applications may ignore it. It is a good idea to avoid adding whitespace around text content that has a defined meaning, such as a URL. So write

 <Foo>http://www.example.com/</Foo>

rather than

 <Foo>
   http://www.example.com/
 </Foo>

Otherwise it's usually safe to include whitespace for indentation and layout purposes.
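
The difference between the two layouts is visible as soon as the content is read back. This is a small sketch using Python's standard xml.etree.ElementTree module. Whether a given application strips the surrounding whitespace is up to that application, so the strip() call here is only what a tolerant program might do.

```python
import xml.etree.ElementTree as ET

compact = ET.fromstring("<Foo>http://www.example.com/</Foo>")
spread_out = ET.fromstring("<Foo>\n   http://www.example.com/\n </Foo>")

print(repr(compact.text))     # 'http://www.example.com/'
print(repr(spread_out.text))  # '\n   http://www.example.com/\n '

# Only after stripping do the two values compare equal.
print(spread_out.text == compact.text)          # False
print(spread_out.text.strip() == compact.text)  # True
```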

Elements can have attributes. These appear as name/value pairs inside the element's start-tag. The name and value are separated by '=' and optional white space; values are always enclosed in either single or double quotes and contain text; name/value pairs are separated by white space:

 <Foo attribute="value" otherattribute='something else'>
   Here is some text
 </Foo>

Each attribute can appear only once in any particular tag.
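
The attribute rules can be tried out in the same way. The sketch below, again assuming Python's standard xml.etree.ElementTree module, reads the attributes of an element like the one above and then shows that a repeated attribute and an unquoted value are both rejected. The parses helper is a name invented for the example.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if text is accepted by the parser, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


element = ET.fromstring(
    """<Foo attribute = "value" otherattribute='something else'>
  Here is some text
</Foo>"""
)
# Either kind of quote is fine, as is whitespace around '='.
print(element.attrib)

print(parses('<Foo attribute="one" attribute="two"/>'))  # False: repeated name
print(parses("<Foo attribute=value/>"))                  # False: missing quotes
```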

Some characters can't appear as themselves in text. '<' can never appear because it looks like the start of a tag; "'" and '"' can't appear in attribute values surrounded by single and double quotes respectively. These three characters are instead written as '&lt;', '&apos;' and '&quot;'. Because of this encoding, '&' can't appear as itself either, and so has to be written as '&amp;'. For symmetry, '>' can be written '&gt;'.

Other characters that are hard to type can similarly be entered using a decimal or hexadecimal Unicode character code. For example, '<' can also be written '&#60;' or '&#x3c;'.
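
When a value has to be pasted into a configuration file, it can help to let a program do the escaping. The following sketch uses the escape function from Python's standard xml.sax.saxutils module to encode text, and xml.etree.ElementTree to show that a parser turns the encoded forms back into the original characters. The sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() always handles '&', '<' and '>'; other characters can be added.
print(escape("if a < b && c > d"))  # if a &lt; b &amp;&amp; c &gt; d
print(escape('say "hi"', {'"': "&quot;", "'": "&apos;"}))  # say &quot;hi&quot;

# A parser reverses the encoding, including numeric character codes.
decoded = ET.fromstring("<Foo>&lt; &#60; &#x3c; &amp; &gt;</Foo>").text
print(decoded)  # < < < & >
```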

XML documents can contain comments. These start with '<!--' and end with '-->' and have the unusual restriction that they can not contain '--'. They can span multiple lines but can not be nested. Comments can not appear inside tags.
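
The comment rules are easy to trip over when commenting out part of a file, so here is a short sketch that tries each case. It assumes Python's standard xml.etree.ElementTree module as the parser, and the parses helper is invented for the example.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if text is accepted by the parser, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


multi_line = "<Foo><!-- a comment\n over two lines --><Bar/></Foo>"
double_hyphen = "<Foo><!-- not -- allowed --><Bar/></Foo>"
nested = "<Foo><!-- outer <!-- inner --> outer --><Bar/></Foo>"
inside_tag = "<Foo <!-- not here --> ><Bar/></Foo>"

print(parses(multi_line))     # True
print(parses(double_hyphen))  # False
print(parses(nested))         # False
print(parses(inside_tag))     # False
```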

Errors

Despite the superficial resemblance to HTML, programs processing XML are almost always totally unforgiving of mistakes and will abort at the first one they encounter, rather than trying to guess what was intended as tends to happen in HTML contexts. As a result, if you have an error and fix it then it is entirely possible that a program will just move on and report another error. It's common to have to go round a loop finding and fixing errors several times.

An XML document is well formed if it meets the following requirements:

  • Every start-tag has an end-tag (or is an 'empty element' shortcut)
  • Elements do not overlap
  • The document contains only one top-level element
  • All attribute values are quoted
  • There is no more than one attribute with the same name in any element
  • No un-escaped '<' or '&' in text

The Internet2 Shibboleth programs will refuse to process files that are not well formed.
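
It can save time to check a file for well-formedness before handing it to the SP. The sketch below is one way to do that with Python's standard xml.etree.ElementTree module. It is not part of the Shibboleth software, the check_well_formed name is made up for the example, and the SP's own parser may describe the same error in different words.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if text is well formed, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        line, column = error.position
        return (line, column, str(error))
    return None


samples = {
    "missing end-tag": "<Foo><Bar></Foo>",
    "overlapping elements": "<Foo><Bar></Foo></Bar>",
    "two top-level elements": "<Foo/><Bar/>",
    "unquoted attribute value": "<Foo a=b/>",
    "repeated attribute": '<Foo a="1" a="2"/>',
    "un-escaped '&'": "<Foo>fish & chips</Foo>",
    "well formed": '<Foo a="1"><Bar/></Foo>',
}

for label, text in samples.items():
    print(label, "->", check_well_formed(text))
```

Only the first error is reported each time, which matches the fix-and-retry loop described above. To check a file rather than a string, ET.parse can be given the file's path in place of the ET.fromstring call.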

Beyond this it is possible to define the detailed structure of XML documents - which elements are allowed or required inside others, which attributes are allowed or required, and so on. XML documents that conform to this more detailed definition are said to be valid. It appears that the Internet2 Shibboleth programs vary as to whether they check for validity. The ones that don't will silently ignore misspelled element or attribute names, names in the wrong case and the like, without generating any error messages.
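
The following sketch illustrates why such mistakes go unnoticed: a document with misspelled names is still well formed, so a parser accepts it. It uses Python's standard xml.etree.ElementTree module, and the element names and the EXPECTED_NAMES list are invented for the example rather than taken from any Shibboleth file. Comparing names against a list is only a rough spelling check, not real validation, which needs a validating parser and the relevant schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary, invented for this example.
EXPECTED_NAMES = {"Config", "Host", "Path"}


def unexpected_names(text):
    """Return the sorted element names in text that are not expected."""
    root = ET.fromstring(text)
    return sorted({el.tag for el in root.iter() if el.tag not in EXPECTED_NAMES})


# Well formed, so it parses without error, but two names are wrong.
document = "<Config><Host/><host/><Pathh/></Config>"
print(unexpected_names(document))  # ['Pathh', 'host']
```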