Architectural Forms: A New Generation (Draft 2.3)

1. Introduction

This is a preliminary description of a work in progress called "Architectural Forms: A New Generation", or AF:NG for short. Send comments to cowan@ccil.org. This document is subject to change without notice.

Copyright 2002 John Cowan.

AF:NG provides the facilities, but does not employ the syntax, of SGML Architectural Forms. AF:NG is intended to be used in conjunction with the schema language RELAX NG, but is not dependent on it in any way.

The purpose of AF:NG is to provide for tightly specified transformations of XML documents, consisting of renaming or omitting elements, attributes, and character data. AF:NG is not intended as a general-purpose transformation language like XSLT or Omnimark. Using AF:NG, a recipient may, instead of specifying a schema to which documents must conform exactly, specify a schema to be applied to the output of an AF:NG transformation. In that way, the actual element and attribute names, and to some degree the document structure, may vary from the schema without rendering the document unacceptable. In particular, it is easy to use AF:NG to reduce a complex document to a much simpler one, when only a subset of the document is of interest to the recipient.

The information provided to AF:NG consists of a short XML document called an architectural map, or archmap, plus the appearance of a special attribute called the form attribute within the source document. The name of the form attribute is given in the archmap, and it is the only required portion of the archmap.

Note: This draft of AF:NG does not have the ability to map a source attribute into architectural character data.

2. Architectural Map Schema

The following RELAX NG schema (in the compact, non-XML syntax) specifies the syntax of an archmap:

namespace default = "x-whatever:somethingorother"
datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"

inheritable =
        attribute arch-ns {text}?,
        attribute source-ns {text}?

tokenmap =
        element tokenmap {
           attribute to {xsd:token},
           attribute from {xsd:token}
        }

start = element archmap {
            inheritable,
            attribute form-att {xsd:Name},
            attribute doc-elem {xsd:Name}?,
            attribute output {"transform" | "decorate"}?,
            element form {
                inheritable,
                attribute data {"preserve" | "ignore"}?,
                attribute children {"process" | "skip" | "literal"}?,
                attribute name {xsd:Name},
                attribute arch-elem {"#NONE" | xsd:Name}?,
                attribute source-elem {xsd:Name}?,
                element attmap {
                    inheritable,
                    attribute arch-att {xsd:Name},
                    (attribute value {text} |
                     (attribute source {"#CONTENT" | xsd:Name}, tokenmap*))?
                }*
            }*
        }

If an inheritable attribute is not present on an element, it takes the value specified on the nearest ancestor element that has it, analogously to the treatment of xml:lang and xml:space. The arch-ns attribute specifies the namespace name of the architecture, and is used to namespace-qualify names appearing in form-att, doc-elem, arch-elem, and arch-att attributes. The source-ns attribute specifies the namespace name used in references to the source document, and is used to namespace-qualify names appearing in source-elem and source attributes.
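The inheritance rule can be illustrated with a minimal Python sketch. It is not part of AF:NG; the function name resolve_inherited and the urn:example namespace names are hypothetical, and the archmap is written without a namespace of its own, as in the example in Section 6.

```python
import xml.etree.ElementTree as ET

# The two attributes the schema groups under "inheritable".
INHERITABLE = ("arch-ns", "source-ns")

def resolve_inherited(root):
    """Map every archmap element to its effective inheritable attributes.

    A value given on an element overrides the value inherited from its
    nearest ancestor; otherwise the ancestor's value is used.
    """
    resolved = {}

    def walk(elem, inherited):
        current = dict(inherited)
        for name in INHERITABLE:
            if name in elem.attrib:
                current[name] = elem.attrib[name]
        resolved[elem] = current
        for child in elem:
            walk(child, current)

    walk(root, {})
    return resolved

# Hypothetical archmap: the form overrides source-ns, the attmap inherits both.
archmap = ET.fromstring(
    '<archmap form-att="class" arch-ns="urn:example:arch"'
    ' source-ns="urn:example:src">'
    '<form name="para" source-elem="p" source-ns="urn:example:other">'
    '<attmap arch-att="kind" value="x"/>'
    '</form>'
    '</archmap>'
)
resolved = resolve_inherited(archmap)
form = archmap.find("form")
attmap = form.find("attmap")
print(resolved[attmap])
```

Here the attmap element carries neither attribute itself, so it picks up arch-ns from the archmap element and source-ns from the enclosing form.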

3. General Processing Model

An AF:NG processor descends the tree of elements in document order, processing each element in accordance with which form it matches in the architectural map, and in accordance with its current modes. The initial modes are: data mode is preserve, children mode is process.

Each element in the source document is tested by the method given in Section 4 to see if it matches any of the form elements in the archmap.

  1. If the source element matches some form, then the source element's name is set to be the value of the form's arch-elem attribute, as namespace qualified by the arch-ns attribute. If there is no arch-elem attribute, the value of the name attribute is used instead. If the value of the arch-elem attribute is the string #NONE, then the source element makes no contribution to the output, although its children may. If the data or children attributes are present on the form, then the corresponding processing modes are altered for the duration of processing the source element.
  2. If the source element does not match any form, and it is the document element (root) of the source document, then the source element's name is set to be the value of the archmap's doc-elem attribute, as namespace qualified by the arch-ns attribute. If there is no doc-elem attribute, the value of the form-att attribute is used instead.
  3. Otherwise, if a form attribute is present in the source element, the source element's name is set to be the value of the form attribute, qualified by the arch-ns attribute of the archmap element. This rule makes it unnecessary to write a form whose only effect would be to rename the element to the value of its form attribute.
  4. Otherwise, if the output attribute has the value "decorate", then the source element's name is unchanged.
  5. Otherwise, the source element makes no contribution to the output, although its children may.
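The five naming rules above can be sketched as a single Python function. This is an illustration under simplifying assumptions, not a normative algorithm: the function name output_name and its parameters are hypothetical, forms and the archmap are passed as plain dictionaries of their attributes, the matched form (or None) is supplied by the caller, and namespace qualification is left out.

```python
def output_name(tag, form_value, form, is_root, archmap_attrs):
    """Return the output name of a source element, or None if the element
    itself contributes nothing to the output (its children still may).

    tag           -- name of the source element
    form_value    -- value of its form attribute, or None if absent
    form          -- attributes of the matched form, or None if no match
    is_root       -- True for the document element
    archmap_attrs -- attributes of the archmap element
    """
    if form is not None:                          # rule 1
        arch_elem = form.get("arch-elem", form["name"])
        return None if arch_elem == "#NONE" else arch_elem
    if is_root:                                   # rule 2
        return archmap_attrs.get("doc-elem", archmap_attrs["form-att"])
    if form_value is not None:                    # rule 3
        return form_value
    if archmap_attrs.get("output") == "decorate":  # rule 4
        return tag
    return None                                   # rule 5

# Attributes of the archmap element used in Section 6.
attrs = {"form-att": "class", "doc-elem": "story"}
print(output_name("html", None, None, True, attrs))
print(output_name("p", "lead", None, False, attrs))
print(output_name("body", None, None, False, attrs))
```

With the map from Section 6, the three calls correspond to the html element (rule 2), a p element with class="lead" (rule 3), and the body element (rule 5).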

In any case, any form attribute present in the element is removed. If the matched form element has any attmap elements as children, attribute mapping is done according to the method given in Section 5. The children of the element are then processed according to the current modes, possibly as modified by the form:

  1. Character data is discarded if the current data mode is "ignore".
  2. If the current children mode is "skip", all child elements are discarded.
  3. If the current children mode is "literal", all descendant elements are included in the output as-is, with no processing applied to them.
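The effect of the two modes on an element's children can be sketched in Python as follows. The sketch uses the mode values from the schema in Section 2 (preserve and ignore for data; process, skip, and literal for children). The tree model is an assumption made for illustration: character data is a str, an element is a (name, children) tuple, and process_element is a hypothetical callback that stands in for the rest of the processor. Copying a literal subtree together with its character data is also an assumption, since the draft does not say how the data mode applies inside it.

```python
def effective_modes(form_attrs, data_mode, children_mode):
    """Modes in force while one source element is being processed.

    A data or children attribute on the matched form overrides the
    current mode; otherwise the current mode carries over.
    """
    return (form_attrs.get("data", data_mode),
            form_attrs.get("children", children_mode))

def process_children(children, data_mode, children_mode, process_element):
    """Return the output items produced by an element's children."""
    out = []
    for child in children:
        if isinstance(child, str):
            if data_mode == "preserve":   # "ignore" drops character data
                out.append(child)
        elif children_mode == "skip":     # child elements are discarded
            continue
        elif children_mode == "literal":  # copied as-is, no processing
            out.append(child)
        else:                             # "process": hand back to the processor
            out.extend(process_element(child))
    return out

def upper_case(node):
    """Stand-in for real element processing: upper-case the name."""
    return [(node[0].upper(), node[1])]

children = ["a", ("b", ["x"]), "c"]
print(process_children(children, "preserve", "process", upper_case))
print(process_children(children, "ignore", "skip", upper_case))
```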

4. Element Matching

The form-att attribute, as namespace qualified by the arch-ns attribute, specifies the name of the form attribute for the source document. The processor matches elements in the source document with the form elements in the architectural map according to the following rules:

  1. If the source element contains a form attribute, and if some form element in the archmap contains a name attribute whose value is equal to the value of the form attribute, then the element matches that form.
  2. Otherwise, if the source element does not contain a form attribute, and if some form element in the archmap contains a source-elem attribute whose value (as qualified by the source-ns attribute) is equal to the name of the source element, then the element matches that form. This rule allows the form attribute of an element to be omitted in the case where it is always the same for every instance of that element.
  3. Otherwise, the source element does not match any form.
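The matching rules above can be sketched in Python with the standard xml.etree.ElementTree module. The function name match_form is hypothetical, and the sketch compares plain names only, leaving out qualification by arch-ns and source-ns.

```python
import xml.etree.ElementTree as ET

def match_form(source_elem, archmap):
    """Return the form element matched by source_elem, or None."""
    form_att = archmap.get("form-att")
    form_value = source_elem.get(form_att)
    if form_value is not None:
        # Rule 1: the form attribute selects a form by its name attribute.
        for form in archmap.findall("form"):
            if form.get("name") == form_value:
                return form
        # A form attribute is present, so rule 2 does not apply (rule 3).
        return None
    # Rule 2: no form attribute, so look for a form by source-elem.
    for form in archmap.findall("form"):
        if form.get("source-elem") == source_elem.tag:
            return form
    return None  # rule 3

# The trivial map from Section 6.
archmap = ET.fromstring(
    '<archmap form-att="class" doc-elem="story">'
    '<form name="para" source-elem="p"/>'
    '<form name="title" source-elem="title" arch-elem="#NONE" data="ignore"/>'
    '</archmap>'
)
plain_p = ET.fromstring('<p>text</p>')
lead_p = ET.fromstring('<p class="lead">text</p>')
print(match_form(plain_p, archmap).get("name"))
print(match_form(lead_p, archmap))
```

Note that the p element with class="lead" matches no form at all: it carries a form attribute, so the source-elem="p" form is not considered, and the element is later renamed by rule 3 of Section 3.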

5. Attribute Mapping

If a source element has matched some form, and attmap child elements exist in the form, then attribute mapping must be done. This provides additional attributes in the output whose values are either fixed, or are derived from attributes or character data in the source.

Each attmap element specifies an architectural attribute to appear in the output. The name of the attribute is specified by the arch-att attribute, as qualified by the arch-ns attribute. Any pre-existing attribute of that name in the source element is removed.

The value of each architectural attribute is determined as follows:

  1. If the attmap element has a source attribute whose value is the string "#CONTENT", then the character data of the source element is removed and used as the architectural value.
  2. Otherwise, if the attmap element has a source attribute, its value (as qualified by the source-ns attribute) names an attribute of the source element. The value of that attribute becomes the value of the architectural attribute, and that attribute of the source element is removed from the output.
  3. Otherwise, if the attmap element has a value attribute, its value is used as the architectural value.
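The three cases above can be sketched in Python as follows. The function name map_attribute is hypothetical; an attmap is passed as a dictionary of its attributes, and the source element is reduced to a dictionary of attributes plus a string of character data. Two behaviours are assumptions, because the draft does not address them: when the named source attribute is missing, or when the attmap has neither source nor value, no architectural attribute is produced. Namespace qualification is left out.

```python
def map_attribute(attmap, attrs, text):
    """Apply one attmap; return the new (attributes, character data)."""
    attrs = dict(attrs)                  # do not modify the caller's dict
    arch_att = attmap["arch-att"]
    source = attmap.get("source")
    if source == "#CONTENT":             # case 1: character data is consumed
        value, text = text, ""
    elif source is not None:             # case 2: source attribute is consumed
        value = attrs.pop(source, None)
    else:                                # case 3: fixed value, if any
        value = attmap.get("value")
    # Any pre-existing attribute with the architectural name is removed.
    attrs.pop(arch_att, None)
    if value is not None:
        attrs[arch_att] = value
    return attrs, text

print(map_attribute({"arch-att": "color", "source": "bgcolor"},
                    {"bgcolor": "white"}, ""))
print(map_attribute({"arch-att": "label", "source": "#CONTENT"},
                    {}, "Hello"))
```

The source attribute is read before the pre-existing architectural attribute is removed, so an attmap whose source and arch-att name the same attribute keeps its value.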

If an attmap element has tokenmap children, then the value of the architectural attribute is treated as a whitespace-separated list of tokens, and appropriate normalization is done. Each token that is equal to the from attribute of some tokenmap is replaced with the to attribute of that tokenmap. All replacements are done in parallel, so the result of one replacement is never itself replaced.
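Parallel token replacement can be sketched in Python as follows. The function name apply_tokenmaps is hypothetical, and tokenmaps are passed as (from, to) pairs. The sketch assumes that normalization means trimming the value and collapsing each run of whitespace to a single space; if two tokenmaps share the same from value, which the draft does not address, the later one wins here.

```python
def apply_tokenmaps(value, tokenmaps):
    """Replace tokens in a whitespace-separated list, all in parallel."""
    table = dict(tokenmaps)          # from-token -> to-token
    tokens = value.split()           # split on any run of whitespace
    # Each token is looked up exactly once against the original table,
    # so a replacement is never replaced again.
    return " ".join(table.get(token, token) for token in tokens)

# Swapping two tokens only works because replacement is parallel.
print(apply_tokenmaps("  a b   c ", [("a", "b"), ("b", "a")]))  # prints "b a c"
```

A sequential implementation would turn a into b and then back into a, which is why the swap is a useful test case.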

6. Examples

Here is a sample source document:

<html><head><title>Reuters Health Information (2002-02-01):
Whooping cough increasing among US infants</title></head>
<body bgcolor="white"><p class="headline"><strong>
Whooping cough increasing among US infants</strong></p>
<p class="lead">ATLANTA, Feb 01 (Reuters Health) - In the last
20 years, the number of cases of whooping cough increased overall
in the US, especially among infants too young to receive three
pertussis vaccine doses, according to the US Centers for Disease
Control and Prevention (CDC).</p>
<p>Whooping cough, or pertussis, is caused by infection with
the Bordetella pertussis bacterium. Symptoms of whooping cough
include having a cough lasting 14 or more days accompanied by
a gasping sound or "whoop" while coughing. Children may also
vomit or have difficulty breathing during a coughing spell.</p>
<p>Until the advent of the pertussis vaccine in the late
1940s, the respiratory illness was a major cause of illness
and death, especially among infants and small children.
Since the introduction of the vaccine (usually administered as part
of the diphtheria-tetanus-pertussis combo vaccine), rates
for whooping cough have dropped dramatically in the developed
world.</p>
</body></html>

Here is a trivial map:

<archmap form-att="class" doc-elem="story">
    <form name="para" source-elem="p"/>
    <form name="title" source-elem="title" arch-elem="#NONE" data="ignore"/>
</archmap>

The effect of this map is:

  1. Source elements that carry the form attribute class are renamed to the value of that attribute.
  2. Elements named p without a class attribute are renamed to para.
  3. Elements named title are suppressed completely, including their character content.
  4. The html element, because it is the document element, is renamed to story.
  5. All other element tags disappear from the output; all text is retained.

The output document is:

<story>
<headline>
Whooping cough increasing among US infants</headline>
<lead>ATLANTA, Feb 01 (Reuters Health) - In the last
20 years, the number of cases of whooping cough increased overall
in the US, especially among infants too young to receive three
pertussis vaccine doses, according to the US Centers for Disease
Control and Prevention (CDC).</lead>
<para>Whooping cough, or pertussis, is caused by infection with
the Bordetella pertussis bacterium. Symptoms of whooping cough
include having a cough lasting 14 or more days accompanied by
a gasping sound or "whoop" while coughing. Children may also
vomit or have difficulty breathing during a coughing spell.</para>
<para>Until the advent of the pertussis vaccine in the late
1940s, the respiratory illness was a major cause of illness
and death, especially among infants and small children.
Since the introduction of the vaccine (usually administered as part
of the diphtheria-tetanus-pertussis combo vaccine), rates
for whooping cough have dropped dramatically in the developed
world.</para>
</story>
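The example can be checked mechanically with a reduced processor. The Python sketch below is an illustration under stated assumptions, not a reference implementation: the function name transform is hypothetical, and it covers only element matching, renaming, and the data mode. The children mode, attmap elements, and namespaces are left out, attributes other than the form attribute are assumed to be copied to the output unchanged, and the document element is assumed to produce an output element. The source document is shortened to keep the sketch small.

```python
from xml.dom import minidom

def transform(source_xml, archmap_xml):
    """Apply a reduced subset of AF:NG and return the output as a string."""
    src = minidom.parseString(source_xml)
    amap = minidom.parseString(archmap_xml).documentElement
    forms = [n for n in amap.childNodes
             if n.nodeType == n.ELEMENT_NODE and n.tagName == "form"]
    form_att = amap.getAttribute("form-att")
    out = minidom.Document()

    def match(elem):
        # Section 4: by form attribute if present, else by source-elem.
        if elem.hasAttribute(form_att):
            wanted = elem.getAttribute(form_att)
            return next((f for f in forms
                         if f.getAttribute("name") == wanted), None)
        return next((f for f in forms
                     if f.hasAttribute("source-elem")
                     and f.getAttribute("source-elem") == elem.tagName), None)

    def name_for(elem, form, is_root):
        # Section 3, rules 1 to 5; None means "no tags in the output".
        if form is not None:
            name = form.getAttribute("arch-elem") or form.getAttribute("name")
            return None if name == "#NONE" else name
        if is_root:
            return amap.getAttribute("doc-elem") or form_att
        if elem.hasAttribute(form_att):
            return elem.getAttribute(form_att)
        if amap.getAttribute("output") == "decorate":
            return elem.tagName
        return None

    def process(elem, parent, data_mode, is_root=False):
        form = match(elem)
        if form is not None and form.hasAttribute("data"):
            data_mode = form.getAttribute("data")
        name = name_for(elem, form, is_root)
        target = parent
        if name is not None:
            target = out.createElement(name)
            parent.appendChild(target)
            for attr_name, attr_value in elem.attributes.items():
                if attr_name != form_att:   # the form attribute is removed
                    target.setAttribute(attr_name, attr_value)
        for child in elem.childNodes:
            if child.nodeType == child.TEXT_NODE:
                if data_mode == "preserve":
                    target.appendChild(out.createTextNode(child.data))
            elif child.nodeType == child.ELEMENT_NODE:
                process(child, target, data_mode)

    process(src.documentElement, out, "preserve", is_root=True)
    return out.documentElement.toxml()

SOURCE = ('<html><head><title>T</title></head><body bgcolor="white">'
          '<p class="headline"><strong>H</strong></p><p>Body text.</p>'
          '</body></html>')
ARCHMAP = ('<archmap form-att="class" doc-elem="story">'
           '<form name="para" source-elem="p"/>'
           '<form name="title" source-elem="title" arch-elem="#NONE"'
           ' data="ignore"/>'
           '</archmap>')
print(transform(SOURCE, ARCHMAP))
# prints: <story><headline>H</headline><para>Body text.</para></story>
```

The head, body, and strong tags disappear because they match no form and carry no class attribute, while the title text is dropped because its form sets the data mode to ignore.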