Editing XML

Most of the configuration files for the Internet2 SP are written in XML. This is a convenient format for expressing structured data in a file, but potentially confusing for SP administrators who will have to edit such files. Here is a brief introduction to editing them.

File format

The files are plain text. Edit them with a text editor such as vi, Emacs or Notepad, not a word processor. Under Windows, some of the example files are distributed with Unix line endings and so appear in Notepad as one long line - either use a better editor (Wordpad seems to work, or consider Notepad++) or convert the line endings.
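If no better editor is to hand, converting line ends is a small job for a script. The following is a minimal sketch in Python (chosen for illustration only, it is not part of the SP software); the function name is made up for this example, and it works on the raw bytes of a file that you would read and write yourself.

```python
def to_windows_line_ends(data: bytes) -> bytes:
    """Convert Unix (LF) line ends to Windows (CR LF) line ends.

    Existing CR LF pairs are first reduced to LF so that they are not doubled.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


unix_text = b"<Foo>\n  <Bar/>\n</Foo>\n"
windows_text = to_windows_line_ends(unix_text)
print(windows_text)
```

Running the function on data that already has Windows line ends leaves it unchanged, so it is safe to apply twice.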

Many of the lines in the example files are long - it is easier to edit them in a reasonably wide window.

Try to avoid guessing because there are so many possibilities that the odds of getting things to work this way are low. If you must guess, make sure you can reliably undo any change because the result of repeated guessing rapidly becomes an unusable 'tag soup'. It's a good idea to make a copy of any file before you alter it.
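Making the copy can be scripted if you prefer. This is only a sketch: the helper name `backup`, the '.bak' suffix and the file name 'example.xml' are assumptions made for the illustration, and the demonstration works in a temporary directory rather than on a real configuration file.

```python
import shutil
import tempfile
from pathlib import Path


def backup(path):
    """Copy path to a sibling file with '.bak' appended and return the copy's path."""
    source = Path(path)
    target = source.with_name(source.name + ".bak")
    shutil.copy2(source, target)  # copy2 also preserves the modification time
    return target


# Demonstration on a throw-away file (hypothetical name).
with tempfile.TemporaryDirectory() as directory:
    original = Path(directory) / "example.xml"
    original.write_text("<Foo/>\n")
    copy = backup(original)
    backup_name = copy.name
    copy_matches = copy.read_text() == original.read_text()

print(backup_name, copy_matches)
```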

Elements

XML files can appear complicated and yet the underlying format is quite simple.

XML files consist of elements. These have a start-tag and an end-tag and may have some content. For example

 <Foo>Here is some text</Foo>

This is an element 'Foo' consisting of the start-tag '<Foo>', content 'Here is some text' and end-tag '</Foo>'. Start and end tags always come in pairs, with the one exception that an element with no content such as

 <Foo></Foo>

can but need not be written as the shortcut

 <Foo/>

[ HTML authors normally write such empty tags with a space before the '/>' - this is an old compatibility trick for pre-XHTML web browsers that is probably no longer needed; it isn't required for XML though it does little harm ]

Note that if you expand an empty element into the full form, perhaps because you want to add some content, you have to remember to remove the extra '/'.
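One way to convince yourself that the long and short forms of an empty element are equivalent is to give both to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree module, which is an assumption of this illustration and has nothing to do with the SP software itself.

```python
import xml.etree.ElementTree as ET

# Parse the long form and the shortcut form of an empty element.
long_form = ET.fromstring("<Foo></Foo>")
short_form = ET.fromstring("<Foo/>")

# Each gives an element named 'Foo' with no text and no child elements.
for element in (long_form, short_form):
    print(element.tag, repr(element.text), len(element))
```

Both lines of output are the same: the name 'Foo', no text and zero children.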

Element names are case-sensitive - <Foo> isn't the same as <FOO> which isn't the same as <foo>. The case of start and end tags has to match too. Some tag names contain ':' - this has a special meaning in XML but for our purposes you can ignore it and treat the whole thing as the tag's name.
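The effect of case-sensitivity can be seen by asking a parser whether it accepts a document. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name `parses` is made up for the example.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if the XML parser accepts text, False if it reports an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses("<Foo>text</Foo>"))  # start and end tags match
print(parses("<Foo>text</foo>"))  # case differs, so the end tag does not match

# Names that differ only in case are different names.
print(ET.fromstring("<Foo/>").tag == ET.fromstring("<FOO/>").tag)
```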

An element's content is everything between the start and end tags. This is typically either text or one or more further elements

 <Foo>
   <Bar>
     Some text
   </Bar>
   <Baz/>
 </Foo>

Here the Foo element contains a Bar and a Baz element, Bar contains some text, Baz is empty. An element's content can also be a mixture of text and other elements, but this format isn't seen in Shibboleth configuration files. Where elements are nested, their start and end tags must appear in the right order - this isn't allowed

 <Foo><Bar></Foo></Bar>
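The difference between properly nested and overlapping elements can be checked with a parser. The following sketch assumes Python's standard xml.etree.ElementTree module; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Properly nested: Foo contains Bar and Baz.
nested = "<Foo><Bar>Some text</Bar><Baz/></Foo>"
root = ET.fromstring(nested)
child_names = [child.tag for child in root]
print(root.tag, child_names)

# Overlapping: Foo is closed while Bar is still open.
overlapping = "<Foo><Bar></Foo></Bar>"
try:
    ET.fromstring(overlapping)
    overlap_error = None
except ET.ParseError as error:
    overlap_error = error
print("overlapping elements rejected:", overlap_error is not None)
```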

Whitespace is part of content though applications may ignore it. It is a good idea to avoid adding whitespace around content in which whitespace might be significant or otherwise cause confusion. So for example write

  <Foo>http://www.example.com/</Foo>

rather than

  <Foo>
    http://www.example.com/
  </Foo>

Otherwise it's usually safe and helpful to include whitespace for indentation and layout purposes.
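To see why whitespace around a value can matter, compare what a parser hands back for the two layouts of the URL example. This is a sketch using Python's standard xml.etree.ElementTree module; whether a given program trims the extra whitespace is up to that program.

```python
import xml.etree.ElementTree as ET

compact = ET.fromstring("<Foo>http://www.example.com/</Foo>")
spread = ET.fromstring("<Foo>\n    http://www.example.com/\n  </Foo>")

# repr() makes the line breaks and spaces in the content visible.
print(repr(compact.text))
print(repr(spread.text))

# The two only compare equal after the surrounding whitespace is stripped.
print(spread.text == compact.text, spread.text.strip() == compact.text)
```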

Attributes

Elements can have attributes. These appear as name/value pairs inside the element's start-tag (or inside a 'shortcut' tag for an empty element). The name and value are separated by '=' and optional white space; values are always enclosed in either single or double quotes and contain text; name/value pairs are separated by white space

 <Foo attribute="value" otherattribute='something else'>
   Here is some text
 </Foo>

Each attribute can appear only once in any particular tag. Attribute names are case sensitive; attribute values are just text, the case-sensitivity of which will depend on the program processing the data.
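The rules for attributes can be tried out with a parser. The sketch below assumes Python's standard xml.etree.ElementTree module and reuses the made-up attribute names from the example above.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring(
    """<Foo attribute="value" otherattribute='something else'>Here is some text</Foo>"""
)

# Single and double quotes are equivalent; the quotes are not part of the value.
print(element.attrib)

# Attribute names are case-sensitive, so 'Attribute' is not found.
print(element.get("attribute"), element.get("Attribute"))

# Repeating an attribute in one tag is an error.
try:
    ET.fromstring('<Foo attribute="a" attribute="b"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print("duplicate attribute accepted:", duplicate_accepted)
```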

Character References

Some characters can't appear as themselves in text. '<' can't appear in text or attribute values because it looks like the start of a tag; "'" and '"' can't appear in attribute values surrounded by single and double quotes respectively. These three characters are instead written as '&lt;', '&apos;' and '&quot;'. Strictly speaking these named forms are entity references and the numeric forms described below are character references, but they are used in the same way. Because of this encoding, '&' can't appear as itself either, and so has to be written as '&amp;'. For symmetry, '>' can be written '&gt;'.

Other characters that are hard to type can similarly be entered using either a decimal or hexadecimal Unicode character code such as '&#60;' or '&#x3c;'.
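A parser turns all of these forms back into the character they stand for, and a library routine can do the encoding for you. This sketch assumes Python's standard xml.etree.ElementTree module and the `escape` function from xml.sax.saxutils; the sample text is invented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The named, decimal and hexadecimal forms all decode to '<'.
forms = ["&lt;", "&#60;", "&#x3c;"]
decoded = [ET.fromstring(f"<Foo>{form}</Foo>").text for form in forms]
print(decoded)

# escape() replaces '&', '<' and '>' so that text is safe as element content.
raw = "Fish & chips < 5"
escaped = escape(raw)
print(escaped)

# Parsing the escaped text gives back the original.
round_trip = ET.fromstring(f"<Foo>{escaped}</Foo>").text
print(round_trip == raw)
```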

Comments

XML documents can contain comments. These start with '<!--' and end with '-->' and have the unusual restriction that they cannot contain '--' (which in particular means that they can't contain a run of dashes as a horizontal separator). Comments can span multiple lines but cannot be nested. Comments cannot appear inside tags (and so can't appear between attributes).
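The comment rules can be checked in the same way as the others. This is a sketch using Python's standard xml.etree.ElementTree module, with a made-up helper `parses` and invented sample documents.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if the XML parser accepts text, False if it reports an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


multi_line = "<Foo><!-- a comment\n spanning two lines --><Bar/></Foo>"
double_dash = "<Foo><!-- section -- break --><Bar/></Foo>"
inside_tag = "<Foo <!-- comment --> attribute='value'/>"

print(parses(multi_line))   # comments may span lines
print(parses(double_dash))  # '--' inside a comment is an error
print(parses(inside_tag))   # a comment inside a tag is an error
```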

Errors, 'Well formed' and 'Valid' documents

Despite the superficial resemblance to HTML, programs processing XML are almost always totally unforgiving of mistakes and will abort at the first one they encounter, rather than trying to guess what was intended as tends to happen in HTML contexts. As a result, if you have an error and fix it then it is entirely possible that a program will just move on and report another error. It's common to have to go round a loop finding and fixing errors several times.

Worse, most XML programs report errors from their own perspective, rather than yours. As a result a mistake in one part of a document will often only be reported as an error much further on in the file when the program finally realises that something is wrong. For example

 <Foo>
   <Bar>
     Some text
   <Baz/>
 </Foo>

was intended to have a closing </Bar> tag immediately before the <Baz/> one. Most programs will only report an error when they process the final line and realise that the opening and closing tags don't match.
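This behaviour is easy to reproduce. The sketch below feeds the example to Python's standard xml.etree.ElementTree parser (an assumption of this illustration; other parsers word their messages differently) and prints where the error was reported.

```python
import xml.etree.ElementTree as ET

# The example above: the </Bar> end-tag is missing before <Baz/>.
broken = """<Foo>
  <Bar>
    Some text
  <Baz/>
</Foo>"""

try:
    ET.fromstring(broken)
    error_line = None
except ET.ParseError as error:
    # position is a (line, column) pair; lines are counted from 1.
    error_line, error_column = error.position
    print(f"parser reported: {error}")

print(f"error reported on line {error_line}, but the mistake is before line 4")
```

The parser points at the final line, where it sees '</Foo>' while still expecting '</Bar>', not at the place where the end-tag was left out.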

An XML document is well formed if it meets the following requirements

  • Every start-tag has an end-tag (or is an 'empty element' shortcut)
  • Elements do not overlap
  • The document contains only one top-level element
  • All attribute values are quoted
  • There is no more than one attribute with the same name in any element
  • No un-escaped '<' or '&' in text

The Internet2 Shibboleth programs will refuse to process files that are not well formed.
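It can save time to check that a file is well formed before giving it to the SP. The following is a minimal sketch of such a check using Python's standard xml.etree.ElementTree module; the function name and the sample documents are made up for the illustration, and for a real file you would pass its contents (or use `ET.parse` with the file's path) instead.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(text):
    """Return None if text is well formed, otherwise the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        return str(error)
    return None


# One sample for each requirement in the list above, plus one correct document.
samples = {
    "missing end-tag": "<Foo><Bar></Foo>",
    "overlapping elements": "<Foo><Bar></Foo></Bar>",
    "two top-level elements": "<Foo/><Bar/>",
    "unquoted attribute value": "<Foo attribute=value/>",
    "duplicate attribute": '<Foo attribute="a" attribute="b"/>',
    "unescaped ampersand": "<Foo>Fish & chips</Foo>",
    "well formed": '<Foo attribute="a"><Bar>Fish &amp; chips</Bar></Foo>',
}

for label, text in samples.items():
    print(f"{label}: {well_formedness_error(text) or 'OK'}")
```

Like the SP, this check stops at the first error it finds, so a file may need several passes.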

Beyond this it is possible to define the detailed structure of XML documents - which elements are allowed or required inside others, which attributes are allowed or required, and so on. XML documents that conform to this more detailed definition are said to be valid. It appears that the Internet2 Shibboleth programs vary as to whether they check for validity. The ones that don't will just ignore mis-spelled element or attribute names, names in the wrong case, and similar mistakes, and these will not generate error messages.
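The difference between 'well formed' and 'valid' can be illustrated with a parser that, like some of the programs above, does not check validity. Python's standard xml.etree.ElementTree module is such a parser. The sketch below is not real validation against a schema: it only compares element names with a hand-written list, and the names in that list are hypothetical rather than taken from any Shibboleth file.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, for illustration only.
EXPECTED_ELEMENTS = {"Foo", "Bar", "Baz"}


def unexpected_elements(text):
    """Return the sorted names of elements that are not in EXPECTED_ELEMENTS."""
    root = ET.fromstring(text)
    found = {element.tag for element in root.iter()}
    return sorted(found - EXPECTED_ELEMENTS)


# Well formed, so it parses without complaint, but 'bar' is in the wrong case.
misspelled = "<Foo><bar>Some text</bar><Baz/></Foo>"
print(unexpected_elements(misspelled))
```

The document is accepted by the parser, and only the extra check reveals the name in the wrong case.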