by Egon Willighagen
About the author:
Will receive his master's degree this year and will start his
PhD research on chemometrics. Still enjoys basketball a lot, as he
does LinuxFocus and Linux in general.
This article contains the presentation given at the Libre Software Meeting in Bordeaux in July. It explains the XML database used for automatic generation of the LinuxFocus.org(/Nederlands) web site.
The system used for document and translation management in the LinuxFocus project consists of several ASCII files, including resdb.txt, issuedb.txt and maindb.txt. These files have a fixed format, and they're used to generate web pages. However, they are difficult to extend, and the separated nature of the data makes it hard to manage all the information available for an article.
LinuxFocus did not automatically generate much web content when I started the new database. As an editor on the Dutch team, I was eager to have the index.html files on the web site dynamically generated. Editing several HTML files each time a new article was translated took a lot of effort and caused many broken links. Therefore, I wanted a new system to which I could add information easily, and from which I could easily generate index pages for the web site. I started working on it sometime in the summer of 2000.
The choice of XML was a bit arbitrary. Suggestions had been made to use a relational database, but I was experienced in XML and preferred a system of text-based files. It soon turned out that a new numbering scheme would be useful, because the database could then use one type of ID instead of the two or three schemes then in use. Guido Socher did all the renumbering, which was quite an effort (my thanks!).
The Document Type Definition (DTD) was already in development, and a little bit of content was in the database for testing purposes. With the new uniform numbering scheme, the time was right to load the database with content. After having added about 20 articles, it became clear that this was an enormous project. Writing scripts to use the old files was possible, but not all information that the new database could contain was available, and, as explained, the information that was available was distributed over several files. Fortunately, Floris Lambrechts got involved, and I have to thank him deeply for adding most of the content to the database. Without his help, the system would not be what it is today.
Along with the new format came the ability to add new information, and over the past year several new kinds of data have been added to the database. Early extensions were a table of authors, translators, editors and other people involved in LinuxFocus, and file locations. File locations were added because several file naming schemes have been used since the beginning of LinuxFocus. During the renumbering this was reduced to two schemes: some files use server-side includes and the .shtml extension, whereas older articles use the .html extension. The <file> tag can be used to override the default. (The current default uses the format "article" + article number + ".shtml". This might include an optional ".meta" in case the file is in LinuxFocus' meta format.)
Now that the database had reached critical mass, I finally got around to benchmarking the software I was writing. The current XSLT stylesheets are not the first implementation; they were preceded by Perl-based code. With the growing size of the database, performance became important, and the first try was simply not good enough. But before I explain the tools, I'll explain the database format.
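The default naming rule and the <file> override can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the position of the optional ".meta" part (between the article number and the extension) is an assumption, since the text above does not spell it out.

```python
def default_article_file(number, meta=False):
    """Build the default file name: "article" + number + ".shtml".

    ASSUMPTION: the optional ".meta" part is placed between the article
    number and the ".shtml" extension; the article does not say where it goes.
    """
    middle = ".meta" if meta else ""
    return f"article{number}{middle}.shtml"


def resolve_article_file(number, explicit=None, meta=False):
    """Return the explicit <file> value if one is given, else the default."""
    return explicit if explicit else default_article_file(number, meta)


print(resolve_article_file(206))
print(resolve_article_file(52, "Nederlands/July1998/article52.html"))
```

An older article that still uses the .html extension would carry an explicit <file> entry, so the default is never consulted for it.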
XML, first of all, is a syntax specification for markup languages. XML defines how markup should look. The syntax describes the sequence of characters allowed in a "well-formed" XML document. It declares that a document has one root element and that an element consists of a start tag, content (text, child elements, or both), and an end tag. These tags consist of a "<" character followed by a name and, at the end, a ">" character. An end tag must have a "/" just in front of the name. Empty tags, like HTML's <br>, take a "/" after the name. A start tag may contain attributes, and these also have a specific syntax. XML tags look like this:
<greeting>Hello, world!</greeting>
or, for an empty tag:
<br/>
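Well-formedness is easy to test with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed is made up for this illustration. It only checks syntax, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


for sample in (
    "<greeting>Hello, world!</greeting>",  # start tag, content, end tag
    "<br/>",                               # empty tag
    "<greeting>Hello, world!",             # missing end tag
    "<a><b></a></b>",                      # wrongly nested elements
    "<a/><b/>",                            # more than one root element
):
    print(sample, is_well_formed(sample))
```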
Besides syntax, languages also have semantics, which describe how certain elements relate to each other. The semantics of HTML declare that the <body> tag should be contained by the <html> element, and not the other way around. The semantics also describe that the <img> element is empty, as is the <br> element. If these semantics are given in a formal notation, they can be parsed by a program and used to validate a document. One of these formal notations is called Document Type Definition, or DTD for short. If a document passes the validation process, it is called a valid document. You have to be careful with XML because its validation is very strict.
Now that we know what a DTD is, let's have a look at the LinuxFocus XML Database DTD. For several of the specifications we will provide an example. By examining these examples you will get an idea of how the information is contained in LinuxFocus' XML database.
The root element in the LinuxFocus XML database, or one of its extensions/localizations, is the <database> element.
<!ELEMENT database (themes?, persons?, issues?, articles?)>
First, note that the "?" means the child element may occur zero or one times. Thus, the database may contain information about LinuxFocus' themes, persons, issues and articles. Since this is very straightforward, I'll move on to a more interesting example.
The themes are contained within the <themes> element which is a child element of <database>. Each theme has a unique ID, a title, and optionally an abstract and an image.
<!ELEMENT themes (theme+)>
<!ELEMENT theme (title*, desc?, img?)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT desc (#PCDATA)>
<!ELEMENT img EMPTY>
Some of these elements must have attributes, which are also declared in the DTD. Any textual content is contained in an element with the xml:lang attribute. The value of that attribute may be any token conforming to the ISO 639 standard for language codes. Examples are "en", "fr" and "nl". The xml:lang attribute and the ID attribute type are both defined in the XML specification itself.
<!ATTLIST theme id ID #REQUIRED>
<!ATTLIST title xml:lang NMTOKEN #REQUIRED>
<!ATTLIST desc xml:lang NMTOKEN #REQUIRED>
<!ATTLIST img src CDATA #REQUIRED>
An example database might look like this:
<database>
  <themes>
    <theme id="hw">
      <title xml:lang="en">Hardware</title>
      <img src="Hardware.jpg"/>
    </theme>
  </themes>
</database>
Issues are contained in the <issues> element. Like themes, each issue has a unique ID.
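To give an idea of how such a database can be read by a program, here is a small Python sketch that looks up theme titles for one language. It uses the standard xml.etree.ElementTree module; the function name theme_titles is hypothetical and this is not the code used by LinuxFocus.

```python
import xml.etree.ElementTree as ET

# ElementTree reports the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DATABASE = """<database>
  <themes>
    <theme id="hw">
      <title xml:lang="en">Hardware</title>
      <img src="Hardware.jpg"/>
    </theme>
  </themes>
</database>"""


def theme_titles(xml_text, lang):
    """Map each theme ID to its title in the requested language."""
    root = ET.fromstring(xml_text)
    titles = {}
    for theme in root.findall("./themes/theme"):
        for title in theme.findall("title"):
            if title.get(XML_LANG) == lang:
                titles[theme.get("id")] = title.text
    return titles


print(theme_titles(DATABASE, "en"))
```

A language for which no title exists simply yields an empty mapping, which is what a localized database would fill in.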
<!ELEMENT issues (issue+)>
<!ELEMENT issue (title+, published?, file*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT published EMPTY>
<!ELEMENT file (#PCDATA)>
The <published> element flags published issues. The next issue and the SomeLanguage2Eng pseudo issues do not have this element. The <title> element again has the @xml:lang attribute. The <file> element denotes the directory in which this issue is located. It must not point to index.html, because it is used to determine file locations.
An example (note that we use the @code attribute for sorting):
<issue id="ToBeWritten" code="999996">
  <title xml:lang="en">Not yet written articles</title>
</issue>
<issue id="September2001" code="200109">
  <title xml:lang="en">September2001</title>
</issue>
Information about authors and translators is stored in <person> elements. Each person must have a unique ID.
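The sorting on @code and the <published> flag might be used as in the following Python sketch. The wrapping <issues> element is added so the fragment parses on its own, the function names are hypothetical, and treating the code as a number is an assumption.

```python
import xml.etree.ElementTree as ET

ISSUES = """<issues>
  <issue id="ToBeWritten" code="999996">
    <title xml:lang="en">Not yet written articles</title>
  </issue>
  <issue id="September2001" code="200109">
    <title xml:lang="en">September2001</title>
  </issue>
</issues>"""


def issue_ids_by_code(xml_text):
    """Return the issue IDs ordered by their code attribute.

    ASSUMPTION: codes are compared as integers.
    """
    root = ET.fromstring(xml_text)
    issues = sorted(root.findall("issue"), key=lambda i: int(i.get("code")))
    return [issue.get("id") for issue in issues]


def published_issue_ids(xml_text):
    """Return the IDs of issues that carry a <published/> child element."""
    root = ET.fromstring(xml_text)
    return [i.get("id") for i in root.findall("issue")
            if i.find("published") is not None]


print(issue_ids_by_code(ISSUES))
print(published_issue_ids(ISSUES))
```

With codes such as 999996, pseudo issues like ToBeWritten sort after all real issues.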
<!ELEMENT persons (person+)>
<!ELEMENT person ((name|email)*,(homepage|nickname|desc|team)*)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT homepage (#PCDATA)>
<!ELEMENT nickname (#PCDATA)>
<!ELEMENT desc (#PCDATA|%html-els;)*>
<!ELEMENT team EMPTY>
Each person can have the following information: a name, one or more email addresses, homepages and nicknames. If the person is also part of a translation team, we add a <team> element. For example, the line <team xml:lang="nl"/> in a <person> element means that the person belongs to the Dutch team. Finally, each person can have a description, which may contain additional web links.
<person id="nl-ew">
  <name>Egon Willighagen</name>
  <email>firstname.lastname@example.org</email>
  <team xml:lang="nl"/>
</person>
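Listing the members of a translation team then becomes a simple filter on the <team> element. The Python sketch below is an illustration only: the second person ("xx-demo") is invented to show a person without a team, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

PERSONS = """<persons>
  <person id="nl-ew">
    <name>Egon Willighagen</name>
    <email>firstname.lastname@example.org</email>
    <team xml:lang="nl"/>
  </person>
  <person id="xx-demo">
    <name>Hypothetical Author</name>
  </person>
</persons>"""


def team_members(xml_text, lang):
    """Return the names of all persons with a <team> entry for lang."""
    root = ET.fromstring(xml_text)
    members = []
    for person in root.findall("person"):
        teams = {team.get(XML_LANG) for team in person.findall("team")}
        if lang in teams:
            members.append(person.findtext("name"))
    return members


print(team_members(PERSONS, "nl"))
```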
The articles are of course the most interesting part of the database.
<!ELEMENT articles (article+)>
<!ELEMENT article (title+, (file|personref|abstract|issueref|themeref|
                            nometa|nohtml|translation|proofread)*)>
<!ELEMENT abstract (#PCDATA)>
<!ELEMENT nohtml EMPTY>
<!ELEMENT nometa EMPTY>
<!ELEMENT translation (personref*, (reserved|finished|proofread)*)>
<!ELEMENT reserved (#PCDATA)>
<!ELEMENT finished (#PCDATA)>
<!ELEMENT proofread (personref*, (reserved|finished)*)>
<!ATTLIST article id ID #REQUIRED
                  xml:lang NMTOKEN #IMPLIED
                  type (article|coverpage) "article"
                  next IDREF #IMPLIED
                  prev IDREF #IMPLIED>
<!ATTLIST file xml:lang NMTOKEN #REQUIRED
               type (target|meta) "target">
<!ATTLIST translation from NMTOKEN #REQUIRED
                      to NMTOKEN #REQUIRED>
Each article has at least one title; one for each language. The <file> element can be used to give the article's file location, for both the META format and the HTML version (see example below). In cases where no META or HTML version is available, the optional <nohtml/> and <nometa/> elements may be used. Each article can have an abstract. Having the abstract in the database means it can be used to create index web pages.
The <article> element has five attributes: the required @id; the optional xml:lang, which denotes the language in which the article was originally written; @type, which marks cover pages (for translation purposes these are also treated as articles); and the optional @next and @prev, which tie the articles of a series together.
An article is associated with an issue and with a theme through the <issueref> and <themeref> elements, both of which have an @href attribute. The value of this attribute must be the unique ID of the associated issue or theme.
<article id="article206" xml:lang="en">
  <title xml:lang="en">Using XML and XSLT to build LinuxFocus.org(/Nederlands)</title>
  <personref href="nl-ew"/>
  <issueref href="ToBeWritten"/>
  <themeref href="appl"/>
  <abstract xml:lang="en">
    This article shows you how parts of the Dutch web site of LinuxFocus
    are generated with XSLT tools from the XML database. It compares this
    with the (very) much slower DOM tools in Perl.
  </abstract>
</article>
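Because the references are plain IDs, a program can check that every @href points at something that exists. The following Python sketch does that for a cut-down database; the theme title "Applications" is an assumption, the <persons> section is deliberately left out to show a dangling reference, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

DATABASE = """<database>
  <themes>
    <theme id="appl"><title xml:lang="en">Applications</title></theme>
  </themes>
  <issues>
    <issue id="ToBeWritten" code="999996">
      <title xml:lang="en">Not yet written articles</title>
    </issue>
  </issues>
  <articles>
    <article id="article206" xml:lang="en">
      <title xml:lang="en">Using XML and XSLT to build LinuxFocus.org(/Nederlands)</title>
      <personref href="nl-ew"/>
      <issueref href="ToBeWritten"/>
      <themeref href="appl"/>
    </article>
  </articles>
</database>"""


def dangling_refs(xml_text):
    """List (article id, reference tag, href) for unresolved references."""
    root = ET.fromstring(xml_text)
    known = {el.get("id") for el in root.iter() if el.get("id")}
    missing = []
    for article in root.iter("article"):
        for ref in article.iter():
            # personref, issueref and themeref all end in "ref"
            if ref.tag.endswith("ref") and ref.get("href") not in known:
                missing.append((article.get("id"), ref.tag, ref.get("href")))
    return missing


print(dangling_refs(DATABASE))
```

Here the issue and theme references resolve, while the person reference is reported because no person with that ID is present in the sample.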
A localized <article> element looks like:
<article id="52">
  <title xml:lang="nl">Enlightenment</title>
  <file xml:lang="nl">Nederlands/July1998/article52.html</file>
  <translation from="en" to="nl">
    <personref href="nl-tu"/>
    <reserved>2000-09-06</reserved>
    <finished>2000-10-04</finished>
    <proofread>
      <personref href="nl-fl"/>
      <reserved>2000-10-04</reserved>
      <finished>2000-10-04</finished>
    </proofread>
  </translation>
  <abstract xml:lang="nl">
    Enlightenment is een Linux window-manager met uitgebreide
    mogelijkheden. Dit artikel bespreekt ze, samen met de installatie en
    de instelling van E. Dit alles is niet voor beginners daar E op het
    moment nog in beta-stadium verkeert.
  </abstract>
</article>
Note that this translation was reserved on a certain date, has been finished, and has also been proofread. In all cases the person who did the work is linked with a <personref> element.
For all elements, the best tutorial is the current database itself.
One of the reasons for creating this new format was to automatically create web indices from it. Now that we understand (?) the database format, let's see how we can use it to generate web pages.
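The translation record lends itself to simple status reports. The Python sketch below reads the translator, the proofreader and the time between reservation and completion from such an element; the function name and the shape of the returned dictionary are made up for this illustration, and the abstract is left out for brevity.

```python
import xml.etree.ElementTree as ET
from datetime import date

ARTICLE = """<article id="52">
  <title xml:lang="nl">Enlightenment</title>
  <translation from="en" to="nl">
    <personref href="nl-tu"/>
    <reserved>2000-09-06</reserved>
    <finished>2000-10-04</finished>
    <proofread>
      <personref href="nl-fl"/>
      <reserved>2000-10-04</reserved>
      <finished>2000-10-04</finished>
    </proofread>
  </translation>
</article>"""


def translation_summary(article_xml):
    """Summarize who translated and proofread an article, and how long it took."""
    article = ET.fromstring(article_xml)
    tr = article.find("translation")
    proof = tr.find("proofread")
    # find/findtext with a plain tag name only look at direct children,
    # so the dates inside <proofread> are not picked up here.
    reserved = date.fromisoformat(tr.findtext("reserved"))
    finished = date.fromisoformat(tr.findtext("finished"))
    return {
        "from": tr.get("from"),
        "to": tr.get("to"),
        "translator": tr.find("personref").get("href"),
        "proofreader": proof.find("personref").get("href") if proof is not None else None,
        "days_to_translate": (finished - reserved).days,
    }


print(translation_summary(ARTICLE))
```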
First, a bit of history. The first implementation used Perl modules to interface with the database. Though the interface was very clean, the implementation was very slow. The information was held in an in-memory XML representation called the Document Object Model (DOM). Most implementations of DOM, however, are very slow, at least much slower than the alternative, the Simple API for XML (SAX).
But if the task is just web page generation, a third alternative seems best: XSLT. This is an XML-based transformation language. Many XSLT processors currently exist, and most programming languages are supported. Some time ago there was a LinuxFocus article on XML::XSLT, one of the Perl XSLT implementations. Since the publication of that article, more implementations have emerged, and there are a few that I recommend.
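The difference between the two interfaces shows up even in a tiny example. A DOM parser builds the whole document tree in memory before you can query it, whereas a SAX parser hands you one event at a time. The Python sketch below counts articles both ways with the standard xml.dom.minidom and xml.sax modules; the sample data and names are invented, and it illustrates the two programming models rather than measuring speed.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical sample data: three minimal articles.
SAMPLE = "<database><articles>" + "".join(
    f'<article id="a{n}"><title xml:lang="en">Title {n}</title></article>'
    for n in range(3)
) + "</articles></database>"


def count_with_dom(text):
    """Build the complete tree in memory, then query it."""
    doc = xml.dom.minidom.parseString(text)
    return len(doc.getElementsByTagName("article"))


class ArticleCounter(xml.sax.ContentHandler):
    """Count <article> start tags as the parser streams through the input."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "article":
            self.count += 1


def count_with_sax(text):
    """Process the document as a stream of events without building a tree."""
    handler = ArticleCounter()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.count


print(count_with_dom(SAMPLE), count_with_sax(SAMPLE))
```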
An XSLT processor takes two files for input. One is the XML source to transform. The other is the XSLT stylesheet that defines the transformation. For generation of LinuxFocus web pages the following XSLT stylesheets are available:
To generate mainindex.html, for example, the Dutch team runs:
sabcmd stylesheets/mainindex.xslt db/lfdb.nl.xml > ../mainindex.html
The stylesheets know where the English root database is, and just need the localized database as XML input. Some sheets need an additional parameter:
sabcmd stylesheets/theme.xslt db/lfdb.nl.xml '$theme=appl' > ../Themes/appl.html
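When many pages have to be generated, the two commands above can be driven from a script. The Python sketch below assembles the same argument lists and, if sabcmd is installed, runs them with the output redirected to a file. The function names are hypothetical; only the argument order and the '$name=value' parameter form are taken from the commands shown here.

```python
import shlex
import subprocess


def sabcmd_args(stylesheet, database, params=None):
    """Build the argument list: sabcmd STYLESHEET DATABASE [$name=value ...]."""
    args = ["sabcmd", stylesheet, database]
    for name, value in (params or {}).items():
        args.append(f"${name}={value}")
    return args


def run_stylesheet(stylesheet, database, output, params=None):
    """Run sabcmd and write its standard output to the given file.

    Requires the sabcmd program to be on the PATH.
    """
    with open(output, "wb") as out:
        subprocess.run(sabcmd_args(stylesheet, database, params),
                       stdout=out, check=True)


# Show the command line for the theme page without executing it.
print(shlex.join(sabcmd_args("stylesheets/theme.xslt", "db/lfdb.nl.xml",
                             {"theme": "appl"})))
```

Passing the arguments as a list avoids the shell quoting that the '$theme=appl' parameter needs on the command line.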
The Dutch index.html is also generated from the database, but uses a slightly more complex setup. The index.html is made with Guido Socher's lfpagecomposer from a set of preprocessed input files. These preprocessed input files are in turn generated from a set of .pre files such as:
<H2>Vorige nummers</H2>
<p>Dit zijn de uitgaven van LinuxFocus in het Nederlands:
<ul>
<!-- macro xslt previssues -->
</ul>
<H2>Recent vertaalde artikelen</H2>
<!-- macro xslt recently_translated -->
These files are simply HTML fragments with a macro that applies a stylesheet to your localized database. The processing is done with a program called apply_stylesheets.pl, which looks for <!-- macro xslt [stylesheet] --> commands and processes the database with the named stylesheet. Note that the .xslt extension is omitted. Our Makefile contains:
%.shtml: %.pre
	@echo "Making $*..."
	@../../xml/bin/apply_stylesheets.pl $*.pre
The resulting *.shtml files are used by the lfpagecomposer script. The stylesheets that are used to generate the index.html are: issuetoc.xslt, previssues.xslt and recently_translated.xslt.
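The macro lookup in the .pre files can be sketched as follows. This is not the actual apply_stylesheets.pl, which is a Perl program; it is a hypothetical Python illustration of finding the macros, appending the omitted .xslt extension, and replacing each macro with whatever a stylesheet run returns.

```python
import re

# Matches <!-- macro xslt NAME --> and captures NAME.
MACRO = re.compile(r"<!--\s*macro\s+xslt\s+(\S+)\s*-->")

PRE = """<H2>Vorige nummers</H2>
<p>Dit zijn de uitgaven van LinuxFocus in het Nederlands:
<ul>
<!-- macro xslt previssues -->
</ul>"""


def find_stylesheets(pre_text):
    """Return the stylesheet file names referenced by the macros."""
    return [name + ".xslt" for name in MACRO.findall(pre_text)]


def expand_macros(pre_text, render):
    """Replace each macro with the output of render(stylesheet_file)."""
    return MACRO.sub(lambda m: render(m.group(1) + ".xslt"), pre_text)


print(find_stylesheets(PRE))
# A stand-in renderer; a real one would run the XSLT processor.
print(expand_macros(PRE, lambda sheet: f"[output of {sheet}]"))
```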
To use this system for other languages, you need to do the following:
The second step is a bit unfortunate. In principle only the text in the output needs to be localized, but the stylesheets do not have localization properties yet. This is possible, however, and I would like to see it implemented.
I recommend using a DTD-aware XML editor. In Emacs you can, for example, use the psgml major mode. This gives you the ability to validate the document (with nsgmls), which helps a lot in avoiding mistakes. In Emacs you can then also right-click to see the elements and attributes you can insert at that specific place in the XML file. (Thanks to Jaime Villate for his excellent talk at the LSM conference in Bordeaux this year.)
Another great help is the Dutch localization of the XML database. If you run into trouble you can consult that file. Though the content is mostly Dutch, you can see how the database elements are organized. If that does not help, you can always email me.
Localizing the stylesheets is probably a bit tricky, because text is intermingled with XSLT commands. You must not touch the commands (unless you know what you're doing), so that the stylesheets keep working. I plan to have the stylesheets localized in the future, which would mean that you only need to edit a file that contains your translations and no XSLT commands, but this is not yet done.
OK, this should help you to get started. Most things you can copy/paste from the Dutch files. All files are FDL and GPL. In the next year these are my plans with this system:
© Egon Willighagen, FDL
2001-09-02, generated by lfparser version 2.17