XML and compression: gz bzip2 and python example transform


XML Matters: XML and compression
A nice article on transforming XML before compressing it, and on the inherent advantage of structuring the data first.
I was curious about what was being done with gzip and XML.

When you think about compressing documents, you normally think first of general compression algorithms like Lempel-Ziv and Huffman, and of the common utilities that implement variations on them. Specifically, on Unix-like platforms, what first comes to mind is usually the utility gzip; on other platforms, zip is more common (using utilities such as PKZIP, Info-ZIP, and WinZip). gzip turns out to be quite consistently better than zip, but only by small margins. These utilities indeed tend substantially to reduce the size of XML files. However, it also turns out that you can obtain considerably better compression rates by two means, either individually or in combination.

[...]
Conclusion
Though the utility presented here is a preliminary attempt, even in this early form it does surprisingly well -- at least in some cases -- at squeezing those last bytes out of compressed XML files. With a little refinement and experimentation, I expect that a few percent more reduction could be obtained. Part of what makes writing this utility hard is that bzip2 does such a good job to start with. I was honestly surprised by just how effective the Burrows-Wheeler algorithm was when I started empirical testing.
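To get a feel for the gzip versus bzip2 comparison mentioned above, here is a rough sketch using only Python's standard `zlib` and `bz2` modules. The sample document and the helper names (`make_sample_xml`, `compressed_sizes`) are made up for illustration and are not from the article. Actual ratios depend heavily on the input, so treat it as a way to run your own test rather than as a benchmark.

```python
import bz2
import zlib


def make_sample_xml(n_records=2000):
    """Build a repetitive, synthetic XML document (illustrative data only)."""
    rows = [
        f'<entry id="{i}"><host>server{i % 17}.example.com</host>'
        f'<status>{200 if i % 5 else 404}</status>'
        f'<bytes>{(i * 7919) % 100000}</bytes></entry>'
        for i in range(n_records)
    ]
    return ("<log>\n" + "\n".join(rows) + "\n</log>\n").encode("utf-8")


def compressed_sizes(data):
    """Return the byte counts for the raw data and two compressors."""
    return {
        "raw": len(data),
        # zlib implements DEFLATE, the same algorithm gzip uses
        "zlib": len(zlib.compress(data, 9)),
        # bz2 implements the Burrows-Wheeler based bzip2 format
        "bzip2": len(bz2.compress(data, 9)),
    }


if __name__ == "__main__":
    for name, size in compressed_sizes(make_sample_xml()).items():
        print(f"{name:6s} {size:8d} bytes")
```

Swapping in a real XML file (read in binary mode) in place of `make_sample_xml()` is the more interesting experiment.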

Some commercial utilities attempt to perform XML compression in a manner that utilizes knowledge of the specific DTDs of compressed documents. It is quite likely that these techniques obtain additional compression. However, xml2struct.py and XMill have the nice advantage of being simple command-line tools that you can transparently apply to XML files. Custom programming of every compression is not always desirable or possible. But where it is, squeezing out even more bytes might be an attainable goal.
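The "structure the data first" idea can be sketched roughly as follows. This is not the actual format used by xml2struct.py or XMill; it is a simplified, hypothetical transform that pulls the tag skeleton and the text of each element type into separate streams, so that similar values sit next to each other before bzip2 sees them. It is one-way only (no reverse transform is written here) and it drops whitespace-only text, so it is a toy for experimenting, not a usable compressor. The `StreamSplitter` and `restructure` names are my own.

```python
import bz2
import xml.sax


class StreamSplitter(xml.sax.ContentHandler):
    """Collect element structure and character data into separate streams."""

    def __init__(self):
        super().__init__()
        self.structure = []   # tag skeleton, one token per event
        self.streams = {}     # element or attribute name -> list of values
        self._stack = []      # currently open elements

    def startElement(self, name, attrs):
        self._stack.append(name)
        self.structure.append("<" + name)
        for key in attrs.getNames():
            self.streams.setdefault(f"{name}@{key}", []).append(attrs.getValue(key))

    def endElement(self, name):
        self._stack.pop()
        self.structure.append(">")

    def characters(self, content):
        # Whitespace-only text is dropped (an assumption of this sketch).
        # The parser may also deliver one text node in several chunks.
        if content.strip() and self._stack:
            self.streams.setdefault(self._stack[-1], []).append(content)
            self.structure.append(".")  # placeholder: "a text value goes here"


def restructure(xml_bytes):
    """Return the skeleton followed by one NUL-separated section per stream."""
    handler = StreamSplitter()
    xml.sax.parseString(xml_bytes, handler)
    parts = ["\n".join(handler.structure)]
    for key in sorted(handler.streams):
        parts.append(key + "\n" + "\n".join(handler.streams[key]))
    return "\x00".join(parts).encode("utf-8")


SAMPLE = (
    b'<log><entry id="1"><host>a</host></entry>'
    b'<entry id="2"><host>b</host></entry></log>'
)

if __name__ == "__main__":
    plain = bz2.compress(SAMPLE)
    restructured = bz2.compress(restructure(SAMPLE))
    print("bzip2 of original XML:     ", len(plain), "bytes")
    print("bzip2 of restructured form:", len(restructured), "bytes")
```

On a document as tiny as `SAMPLE` the numbers mean nothing. Whether restructuring helps at all is something to check against larger, real files.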

About this Entry

This page contains a single entry by klsh published on January 14, 2005 1:16 PM.