2.1.0

Current release (2.1.0, unix)
Current release (2.1.0, dos)

Overview

Abstract

REXML is an XML processor for the language Ruby. REXML is conformant (passes 100% of the Oasis non-validating tests), and includes full XPath support. It is reasonably fast, and is implemented in pure Ruby. Best of all, it has a clean, intuitive API.

This software is distributed under the Ruby license.

Introduction

Why REXML? There are, at the time of this writing, already two XML parsers for Ruby. The first is a Ruby binding to a native XML parser. This is a fast parser, using proven technology. However, it isn't very portable. The second is a native Ruby implementation, and as useful as it is, it has (IMO) a difficult API.

I have this problem: I dislike obfuscated APIs. There are several XML parser APIs for Java. Most of them follow DOM or SAX, and are very similar in philosophy to an increasing number of Java APIs. Namely, they look like they were designed by theorists who never had to use their own APIs. The extant XML APIs, in general, suck. They take a markup language which was specifically designed to be very simple, elegant, and powerful, and wrap an obnoxious, bloated, and large API around it. I was always having to refer to the API documentation to do even the most basic XML tree manipulations; nothing was intuitive, and almost every operation was complex.

Then along came Electric XML.

Ah, bliss. Look at the Electric XML API. First, the library is small; less than 500K. Next, the API is intuitive. You want to parse a document? doc = new Document( some_file ). Create and add a new element? element = parent.addElement( tag_name ). Write out a subtree? element.write( writer ). Now how about DOM? To parse some file: parser = new DOMParser(); parser.parse( new InputSource( new FileInputStream( some_file ) ) ). Create a new element? First you have to know the owning document of the to-be-created node (can anyone say "global variables, or obtuse, multi-argument methods"?) and call element = doc.createElement( tag_name ); parent.appendChild( element ). "appendChild"? Where did they get that from? How many different methods do we have in Java in how many different classes for adding children to parents? addElement()? add()? put()? appendChild()? Heaven forbid that you want to create an Element elsewhere in the code without having access to the owning document. I'm not even going to go into what travesty of code you have to go through to write out an XML sub-tree in DOM.

So, I use Electric XML extensively. It is small, fast, and intuitive. I.e., the API doesn't add a bunch of work to the task of writing software. When I started to write more software in Ruby, I needed an XML parser. I wasn't keen on the native library binding, "XMLParser", because I try to avoid complex library dependencies in my software, when I can. For a long time, I used NQXML, because it was the only other parser out there. However, the NQXML API can be even more painful than the Java DOM API. Almost all element operations require some indirect node access... you had to do something like element.node.attr['key'], and it is never obvious to me when you access the element directly, or the node... or, really, why they're two different objects, anyway. This is even more unfortunate since Ruby is so elegant and intuitive, and bad APIs really stand out. I'm not, by the way, trying to insult NQXML; I just don't like the API.

I wrote the people at TheMind (Electric XML... get it?) and asked them if I could do a translation to Ruby. They said yes. After a few weeks of hacking on it for a couple of hours each week, and after having gone down a few blind alleys in the translation, I had a working beta. I.e., it parsed, but hadn't gone through a lot of strenuous testing. Along the way, I had made a few changes to the API, and a lot of changes to the code. First off, Ruby does iterators differently than Java. Java uses a lot of helper classes. Helper classes are exactly the kinds of things that theorists come up with... they look good on paper, but using them is like chewing glass. You find that you spend 50% of your time writing helper classes just to support the other 50% of the code that actually solves the problem you were trying to solve in the first place. In this case, the Java helper classes are either Enumerations or Iterators. Ruby, on the other hand, uses blocks, which is much more elegant. Rather than:

for (Enumeration e=parent.getChildren(); e.hasMoreElements(); ) {
   Element child = (Element)e.nextElement();
   // Do something with child
}

you get:

parent.each_child { |child|
   # Do something with child
}

Can't you feel the peace and contentment in this block of code? Ruby is the language Buddha would have programmed in.

Anyhoo, I chose to use blocks in REXML directly, since this is more common to Ruby code than for x in y ... end, which is as orthogonal to the original Java as possible.
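To make the block style concrete, here is a minimal sketch. It assumes the REXML that ships with current Ruby (loaded with require 'rexml/document'), which may not match 2.1.0 method for method; each_element is used here because it visits only child elements, and the document and element names are invented for the example.

```ruby
require 'rexml/document'

# Hypothetical document, purely for illustration.
doc = REXML::Document.new('<parent><first/><second/><third/></parent>')

# Iterate over the child elements with a block; no Enumeration or
# Iterator helper object is needed.
names = []
doc.root.each_element { |child| names << child.name }
```

After this runs, names holds the child element names in document order.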

Also, I changed the naming conventions to more Ruby-esque method names. For example, the Java method getAttributeValue() becomes in Ruby get_attribute_value(). This is a toss-up. I actually like the Java naming convention more [1], but the latter is more common in Ruby code, and I'm trying to make things easy for Ruby programmers, not Java programmers.
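As a rough illustration of what the Ruby-style names look like in use, the sketch below builds and reads a small tree. It assumes the REXML bundled with current Ruby; the exact method names in 2.1.0 may differ, and the element and attribute names are made up.

```ruby
require 'rexml/document'

# Hypothetical document and names, purely for illustration.
doc = REXML::Document.new('<inventory/>')

# add_element creates the child and attaches it in one call, without
# having to go through the owning document first.
item = doc.root.add_element('item', { 'sku' => 'A1' })
item.text = 'widget'

sku = item.attributes['sku']   # attribute value as a String
xml = doc.root.to_s            # the serialized subtree
```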

The biggest change was in the code. The Java version of Electric XML did a lot of efficient String-array parsing, character by character. Ruby, however, has ubiquitous, efficient, and powerful regular expression support. All regex functions are done in native code, so they are very fast, and the power of Ruby regex rivals that of Perl. Therefore, a direct conversion of the Java code to Ruby would have been more difficult, and much slower, than using Ruby regexps. I therefore used regexps. In doing so, I cut the number of lines of source code by half.
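To give a feel for regex-driven parsing, here is a small sketch. It is not REXML's actual parser code: the START_TAG and ATTRIBUTE patterns and the scan_start_tags helper are hypothetical, and they handle only simple start tags with quoted attributes.

```ruby
# Illustration only; these patterns are NOT taken from REXML.
# START_TAG captures: element name, raw attribute text, optional "/".
START_TAG = /<([A-Za-z_][\w.\-]*)((?:\s+[\w:.\-]+\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*(\/?)>/
# ATTRIBUTE captures: name, double-quoted value, single-quoted value.
ATTRIBUTE = /([\w:.\-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/

# Returns one [name, attributes] pair per start tag found in the input.
def scan_start_tags(xml)
  xml.scan(START_TAG).map do |name, raw_attrs, _self_closing|
    attrs = {}
    # Only one of the two value groups matches for a given attribute.
    raw_attrs.scan(ATTRIBUTE) { |key, dq, sq| attrs[key] = dq || sq }
    [name, attrs]
  end
end

tags = scan_start_tags(%q{<doc><item id="1" kind='a'/><item id="2"/></doc>})
```

The regex engine does the character-level work in native code, so the Ruby side is left with a few lines of bookkeeping.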

Finally, by this point the API looks almost nothing like the original Electric XML API, and practically none of the code is even vaguely similar. However, even though the actual code is completely different, I did borrow Electric's approach to processing XML, and am deeply indebted to the Electric XML code for inspiration.

One last thing. If you use and like this software, and you feel compelled to make some contribution to the author by way of saying "thanks", and you happen to know what a tea cozy is and where to get them, then you can send me one. Send those puppies to: Sean Russell, 60252 Rimfire Rd., Bend, OR 97702, USA. If you're outside of the US, make sure you write "gift" on it to avoid the taxes. If you don't want to send a tea cozy, you can also send money. Or don't send anything. Offer me a job I can't refuse, in Western Europe somewhere.

Features

Operation

Installation

Run ruby bin/install.rb. By the way, you really should look at these sorts of files before you run them as root. They could contain anything, and since (in Ruby, at least) they tend to be mercifully short, it doesn't hurt to glance over them. If you want to uninstall REXML, run ruby bin/install.rb -u.

Unit tests

If you have runit or Test::Unit installed (with the runit API), you can run the unit test cases. You can run both installed and not installed tests; to run the tests before installing REXML, run ruby -I. bin/suite.rb. To run them with an installed REXML, use ruby bin/suite.rb.

Benchmarks

There is a benchmark suite in benchmarks/. To run the benchmarks, change into that directory and run ruby comparison.rb. If you have nothing else installed, only the benchmarks for REXML will be run. However, if you have any of the following installed, benchmarks for those tools will also be run:

The results will be written to index.html.

General Usage

Please see the Tutorial

The API documentation can be downloaded from here, 82Kb, (or in ZIP format, 413Kb, if you're a masochist). A better solution for the documentation is to download and install Dave Thomas' rdoc and generate the API docs yourself; then you'll be sure to have the latest API docs and won't have to keep downloading the doc archive. The unit tests in test/ and the benchmarking code in benchmark provide additional examples of using REXML. The Tutorial provides examples with commentary.

Status

Change Log

2.1.0: IO optimizations, and support for ISO-8859-1 output. Fixed up pretty-printing a little. Now, if pretty-printing is turned on, text nodes are stripped before printing. This, obviously, can mess up what you'd expect from :respect_whitespace, but pretty printing, by definition, must change your formatting. Updated the tutorial a bit. Please see the section on adding text for a warning, if you're using a non-UTF-8 compatible encoding. Changed behavior of Element.attributes.each. It now iterates over key, value pairs, rather than attributes. This was a feature request. Expanded the unit tests and subsequently fixed a number of obscure bugs. I'm distributing the API documentation separately from the main distribution now, because the API docs constitute nearly 50% of the total distribution size. Fixed a bug in namespace handling in attributes. Completely updated the API documentation for Element, Element.Elements, and Element.Attributes; the rest of the classes to follow. I'm seriously contemplating removing the examples from the API documentation, because most of them are practically duplicates of the unit tests in test/.

2.0.4: 2.0 munged the encoding value in output. This is fixed. I left debugging turned on in XPath in 2.0.2 :-/

2.0.2: Added grouping '(...)' and the preceding:: and following:: axes. This means that, aside from functional bugs, XPath should have no missing functionality bugs. Keep in mind that not all Functions are tested, though.

2.0.1: Added some unit tests, and fixed a namespace XPath bug WRT attribute default NS's. Unicode support was screwing up the upper end of ASCII support; chars between 0xF0 and 0xFD were getting munged. This has been fixed, at the cost of a small amount of speed. Optimized the descendant axes of XPath; it should be significantly faster for '//' and other descendant operations. Added several user contributed unit tests. Re-added QuickPath, the old, non-fully-XPath compliant, yet much faster, XPath processor. Everything is being converted to UTF8 now, and the XML declaration reflects this. See the bugs for more information.

2.0: True XPath support. Finally. XPath is fully implemented now, and passes all of the tests I can throw at it, including complex XPaths such as '*[* and not(*/node()) and not(*[not(@style)]) and not(*/@style != */@style)]'. It may be slower than it was, but it should be reasonably efficient for what it is doing. The XPath spec doesn't help, and thwarts most attempts at optimization. Please see the notes on XPath for more information. Oh, and some minor bugs were fixed in the XML parser.

1.2.8: Fixed a bug pointed out by Peter Verhage where the element names weren't being properly parsed if a namespace was involved.

1.2.7: Fixing problems with the 1.2.6 distribution :-/. Added an "applications using REXML" section in this document -- send me those links! Added rdoc documentation. I'm not using API2XML anymore. I think API2XML was the right model, generating XML rather than HTML (which is what rdoc does), but rdoc does a much better job at parsing Ruby source, and I really didn't want to go there in the first place. Also, I had forgotten to generate the Tutorial HTML.

1.2.6: Documentation fix (TR). Fixed a bug in Element.add (and, therefore, Element.add_element). Added Robert Feldt's terse xml constructor to contrib/ (check it out; it's handy). Tobias discovered a terrible bug, whereby ENTITY wasn't printing out a final '>'. After a long discussion with a couple of users, and some review of the XML spec, I decided to reverse the default handling of whitespace and pretty printing. REXML now no longer defaults to pretty printing, and preserves whitespace unless otherwise directed. Added provisional namespace support to XPath. XPath is going to require another rewrite.

1.2.5: Bug fixes: doctypes that had spaces between the closing ] and > generated errors. There was a small bug that caused too many newlines to be generated in some output. Eelis van der Weegen (what a great name!) pointed out one of the numerous API errors. Julian requested that add_attributes take both Hash (original) and array of arrays (as produced by StreamListener). I killed the mailing list, accidentally, and fixed it again. Fixed a bug in next_sibling, caused by a combination of mixing overriding <=>() and using Array.index().

1.2.4: Changes since 1.1b: 100% OASIS valid tests passed. UTF-8/16 support. Many bug fixes. to_a() added to Parent and Element.elements. Updated tutorial. Added variable IOSource buffer size, for stream parsing. delete() now fails silently rather than throwing an exception if it can't find the element to delete. Added a patch to support REXMLBuilder. Reorganized file layout in distribution; added a repackaging program; added the logo.

1.1b: Changes since 1.1a: Stream parsing added. Bug fixes in entity parsing. New XPath implementation, fixing many bugs and making feature complete. Completed whitespace handling, adding much functionality and fixing several bugs. Added convenience methods for inserting elements. Improved error reporting. Fixed attribute content to correctly handle quotes and apostrophes. Added mechanisms for handling raw text. Cleaned up utility programs (profile.rb, comparison.rb, etc.). Improved speed a little. Brought REXML up to 98.9% OASIS valid source compliance.

Speed and Completeness

Unfortunately, NQXML is the only package REXML can be compared against; XMLParser uses expat, which is a native library, and really is a different beast altogether. So in comparing NQXML and REXML you can look at four things: speed, size, completeness, and API.

Benchmarks

REXML is faster than NQXML in some things, and slower than NQXML in a couple of things. You can see this for yourself by running the supplied benchmarks. Most of the places where REXML is slower are because of the convenience methods [3]. On the positive side, most of the convenience methods can be bypassed if you know what you are doing. Check the benchmark comparison page for a general comparison. You can look at the benchmark code yourself to decide how much salt to take with them.

The sizes of the XML parsers are close [4]. NQXML 1.1.3 has 1580 non-blank, non-comment lines of code; REXML 2.0 has 2340 [5].

REXML is a mostly conformant XML 1.0 parser. It supports multiple language encodings, and internal processing uses the required UTF-8 and UTF-16 encodings. It passes 100% of the Oasis non-validating tests. Furthermore, it provides a full implementation of XPath.

The last thing is the API, and this is where I think REXML wins. The core API is clean and intuitive, and things work the way you would expect them to. Convenience methods abound, and you can code for either convenience or speed. REXML code is terse, and readable, like Ruby code should be. The best way to decide which you like more is to write a couple of small applications in each, then use the one you're more comfortable with.

XPath

As of release 2.0, XPath 1.0 is fully implemented.

I fully expect bugs to crop up from time to time, so if you see any bogus XPath results, please let me know. That said, since I'm now following the XPath grammar and spec fairly closely, I suspect that you won't be surprised by REXML's XPath very often, and it should become rock solid fairly quickly.

Check the "bugs" section for known problems; there are little bits of XPath here and there that are not yet implemented, but I'll get to them soon.

Namespace support is rather odd, but it isn't my fault. I can only do so much and still conform to the specs. In particular, XPath attempts to help as much as possible. Therefore, in the trivial cases, you can pass namespace prefixes to Element.elements[...] and so on -- in these cases, XPath will use the namespace environment of the base element you're starting your XPath search from. However, if you want to do something more complex, like pass in your own namespace environment, you have to use the XPath first(), each(), and match() methods. Also, default namespaces force you to use the XPath methods, rather than the convenience methods, because there is no way for XPath to know what the mappings for the default namespaces should be. This is exactly why I loathe namespaces -- a pox on the person(s) who thought them up!
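The three cases described above can be sketched roughly as follows. This assumes the REXML bundled with current Ruby, where REXML::XPath.first and REXML::XPath.match accept a prefix-to-URI Hash as their third argument; the namespace URIs, prefixes, and element names are invented for the example.

```ruby
require 'rexml/document'

# Hypothetical namespace URIs and element names, for illustration only.
xml = <<~XML
  <root xmlns="urn:example:default" xmlns:a="urn:example:a">
    <a:item id="1"/>
    <a:item id="2"/>
    <note>plain</note>
  </root>
XML
doc = REXML::Document.new(xml)

# Trivial case: the prefix "a" is resolved from the namespace
# declarations in scope in the document itself.
first_item = doc.root.elements['a:item']

# Own namespace environment: the prefixes "x" and "d" exist only in
# this Hash, not in the document.
ns = { 'x' => 'urn:example:a', 'd' => 'urn:example:default' }
item_ids = REXML::XPath.match(doc.root, 'x:item', ns).map { |e| e.attributes['id'] }

# Default namespace: it has no prefix in the document, so one is
# supplied through the Hash and the XPath methods are used directly.
note = REXML::XPath.first(doc, '//d:note', ns)
```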

Namespaces

Namespace support is now fairly stable. One thing to be aware of is that REXML is not (yet) a validating parser. This means that some invalid namespace declarations are not caught.

Mailing list

There is a low-volume mailing list dedicated to REXML. To subscribe, send an empty email to ser-rexml-subscribe@germane-software.com. This list is more or less spam proof. To unsubscribe, similarly send a message to ser-rexml-unsubscribe@germane-software.com.

Applications that use REXML

FAQ

Why is Element.elements indexed off of '1' instead of '0'?
Because of XPath. The XPath specification states that the index of the first child node is '1'. Although it may be counter-intuitive to base elements on 1, it is more undesirable to have element.elements[0] == element.elements[ 'node()[1]' ]. Since I can't change the XPath specification, the result is that Element.elements[1] is the first child element.
Why isn't REXML a validating parser?
Because validating parsers must include code that parses and interprets DTDs. I hate DTDs. REXML supports the barest minimum of DTD parsing, and even that isn't complete. There is DTD parsing code in the works, but I only work on it when I'm really, really bored. Rumor has it that a contributor is working on a DTD parser for REXML; rest assured that any such contribution will be included with REXML as soon as it is available.
I'm trying to create an ISO-8859-1 document, but when I add text to the document it isn't being properly encoded.
Regardless of what the encoding of your document is, when you add text programmatically to a REXML document you must ensure that you are only adding UTF-8 to the tree. In particular, you can't add ISO-8859-1 encoded text that contains characters above 0x80 to REXML trees -- you must convert it to UTF-8 before doing so. Luckily, this is easy: text.unpack('C*').pack('U*') will do the trick. 7-bit ASCII is identical to UTF-8, so you probably won't need to worry about this.
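Two of the answers above, 1-based indexing and converting ISO-8859-1 text before adding it, can be sketched as follows. This assumes the REXML bundled with current Ruby; the document, the element names, and the sample text are made up for the example.

```ruby
require 'rexml/document'

# Hypothetical document, for illustration only.
doc = REXML::Document.new('<list><one/><two/></list>')

# Indexing follows XPath: 1 is the first child element.
first     = doc.root.elements[1]
via_xpath = REXML::XPath.first(doc.root, '*[1]')

# "cafe" with an e-acute, as raw ISO-8859-1 bytes (0xE9 is above 0x80).
latin1 = "caf\xE9".b

# The conversion from the FAQ: read each byte as a code point, then
# write the code points back out as UTF-8.
utf8 = latin1.unpack('C*').pack('U*')

# Only the UTF-8 string goes into the tree.
doc.root.add_element('name').text = utf8
```

The byte-to-code-point trick works because the first 256 Unicode code points are numbered the same as the ISO-8859-1 characters.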

Known Bugs

Please send me bug reports. If you really want your bug fixed fast, include an runit or Test::Unit method (or methods) that illustrates the problem. You don't have to send me an entire suite; all bug submissions go into test/contrib_test.rb. If you don't send me a unit test, I'll have to write one myself, which will mean that your bug will take longer to fix.

When submitting bug reports, please include the version of Ruby and of REXML that you're using, and the operating system you're running on. Just run: ruby -vrrexml/rexml -e 'p REXML::Version,PLATFORM' and paste the results in your bug report.

To Do

Requested features

Credits

I've had help from a number of resources; if I haven't listed you here, it means that I just haven't gotten around to adding you, or that I'm a dork and have forgotten. In either case, feel free to write me and complain. I may ignore you, but at least you tried. (Actually, I don't consciously ignore anybody except spammers.)

1) This is no longer true. I'm a convert to the Ruby naming scheme, for Ruby. The reason being that Ruby does a superb job of hiding the difference between attributes and methods; in fact, for all intents and purposes, you can't access attributes directly; all attribute accessors are methods. What this means in the long run is that there is no reason to have different naming conventions for attributes and methods.
2) Be aware, however, that REXML is neither DOM nor SAX compliant, and will never be. The DOM and SAX APIs are unwieldy.
3) For example, element.elements[index] isn't really an array operation; index can be an Integer or an XPath, and this feature is relatively time expensive.
4) As measured with ruby -nle 'print unless /^\s*(#.*|)$/' *.rb | wc -l
5) REXML started out with about 1200, but that number has been steadily increasing as features are added. XPath accounts for 541 lines of that code, so the core REXML has about 1800 LOC.