libxml-ruby vs nokogiri vs hpricot

Update: Aaron told me that he is going to be re-running the benchmarks this weekend so we’ll get a more complete set of data from the machine that originally ran the tests.

If you’re into parsing XML or HTML with Ruby, chances are you’re familiar with the various gems out there for getting the job done.  Lately there has been a lot of debate about which one is fastest, and to settle it Aaron Patterson (author of Nokogiri and Mechanize) wrote a test suite.

After its release, RubyInside posted about how the tests showed Nokogiri to be faster than Hpricot in the article Ruby XML Performance Shootout: Nokogiri vs LibXML vs Hpricot vs REXML.  Later in the day, I saw Why’s post announcing the release of hpricot 0.7 and decided to modify Aaron’s tests to use Hpricot.XML. Here are the results:

The large-document tests were run at N=5 to get a clearer picture of the differences between the various gems.  At N=2 the results were too close to separate, which indicated that a larger sample was needed.

test_IO_parsing(XmlTruth::DOM::XML::LargeDocumentParsingTest) N=5
                  user     system      total        real   kBps
null          0.690000   0.070000   0.760000 (  0.768641) 46343.68
nokogiri      2.790000   0.130000   2.920000 (  3.015303) 11813.62
libxml-ruby   2.970000   0.140000   3.110000 (  3.130175) 11380.08
hpricot      13.660000   0.370000  14.030000 ( 14.088780) 2528.37
.
test_in_memory_parsing(XmlTruth::DOM::XML::LargeDocumentParsingTest) N=5
                  user     system      total        real   kBps
null          1.240000   0.010000   1.250000 (  1.260841) 28252.30
nokogiri      4.360000   0.060000   4.420000 (  4.444468) 8014.83
libxml-ruby   4.570000   0.050000   4.620000 (  4.641338) 7674.87
hpricot      13.750000   0.210000  13.960000 ( 14.045647) 2536.13
.
test_simple_xpath(XmlTruth::DOM::XML::LargeDocumentXPathSearchTest) N=5
                  user     system      total        real   kBps
nokogiri     44.430000   0.300000  44.730000 ( 44.972003) 792.09
libxml-ruby  40.950000   0.210000  41.160000 ( 41.300780) 862.49
hpricot      18.410000   0.090000  18.500000 ( 18.540239) 1921.32
.
test_IO_parsing(XmlTruth::DOM::XML::SmallDocumentParsingTest) N=1944
                  user     system      total        real   kBps
null          8.150000   0.130000   8.280000 (  8.326070) 4278.17
nokogiri     17.850000   0.100000  17.950000 ( 17.950534) 1984.36
libxml-ruby  19.010000   0.260000  19.270000 ( 19.370769) 1838.87
hpricot      25.320000   0.460000  25.780000 ( 25.827516) 1379.16
.
test_in_memory_parsing(XmlTruth::DOM::XML::SmallDocumentParsingTest) N=1944
                  user     system      total        real   kBps
null          3.960000   0.030000   3.990000 (  4.005522) 8892.82
nokogiri     18.140000   0.200000  18.340000 ( 18.403396) 1935.53
libxml-ruby  19.760000   0.230000  19.990000 ( 19.999905) 1781.03
hpricot      15.980000   0.150000  16.130000 ( 16.133157) 2207.90
.
Finished in 426.233021 seconds.

5 tests, 0 assertions, 0 failures, 0 errors

You can find my fork of the test suite on GitHub: Patrick Tulskie’s Fork of XMLTruth
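For readers who don’t want to clone the suite, the general shape of a comparison like this can be sketched with Ruby’s standard Benchmark module. This is not the code from Aaron’s suite or from my fork: the helper name `compare_parsers`, the sample document, and the behaviour of the "null" baseline (it only copies the input string) are assumptions made for illustration. Parsers are only registered if their gems happen to be installed.

```ruby
require 'benchmark'

# Hypothetical helper (not part of XMLTruth): parse the same XML string
# n times with each parser and report CPU total, wall-clock time and kB/s.
def compare_parsers(xml, parsers, n)
  kilobytes = xml.bytesize / 1024.0
  parsers.each_with_object({}) do |(name, parser), results|
    tms = Benchmark.measure { n.times { parser.call(xml) } }
    results[name] = {
      total: tms.total,
      real:  tms.real,
      kbps:  tms.real > 0 ? (kilobytes * n) / tms.real : Float::INFINITY
    }
  end
end

# Assumed "null" baseline: touches the input without parsing it.
parsers = { 'null' => ->(xml) { xml.dup } }

# Register whichever parser gems are available on this machine.
begin
  require 'nokogiri'
  parsers['nokogiri'] = ->(xml) { Nokogiri::XML(xml) }
rescue LoadError
  # nokogiri not installed; skip it
end

begin
  require 'libxml'
  parsers['libxml-ruby'] = ->(xml) { LibXML::XML::Parser.string(xml).parse }
rescue LoadError
  # libxml-ruby not installed; skip it
end

begin
  require 'hpricot'
  parsers['hpricot'] = ->(xml) { Hpricot.XML(xml) }
rescue LoadError
  # hpricot not installed; skip it
end

# Made-up sample document, far smaller than the files used in the real tests.
xml = '<root>' + '<item>text</item>' * 1_000 + '</root>'

results = compare_parsers(xml, parsers, 5)
results.each do |name, r|
  printf("%-12s %10.6f (%10.6f) %10.2f\n", name, r[:total], r[:real], r[:kbps])
end
```

Numbers from a toy document like this one will not match the figures above; the point is only to show where the total, real and kBps columns come from.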

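From this small sample of tests, Nokogiri and libxml-ruby appear to perform similarly, with totals within roughly 10% of each other in every test. This makes sense, since Nokogiri is built on the native libxml2 library of the current operating environment. Nokogiri clearly excels at parsing larger documents (2.92s total versus 14.03s for Hpricot on the large IO test), while Hpricot handles smaller, in-memory documents somewhat faster (16.13s versus 18.34s for Nokogiri). Hpricot was also more than twice as fast on the simple XPath search over the large document (18.50s versus 44.73s and 41.16s).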
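To make those comparisons easier to check, here is a small sketch that takes the "total" column from the output above and works out, for each test, which gem was fastest and how much slower the others were. The helper name `fastest_and_ratios` is made up for this example, and the "null" baseline is left out because it does not parse anything.

```ruby
# Total CPU seconds copied from the "total" column of the benchmark output.
TOTALS = {
  'large IO parse'        => { 'nokogiri' => 2.92,  'libxml-ruby' => 3.11,  'hpricot' => 14.03 },
  'large in-memory parse' => { 'nokogiri' => 4.42,  'libxml-ruby' => 4.62,  'hpricot' => 13.96 },
  'large simple XPath'    => { 'nokogiri' => 44.73, 'libxml-ruby' => 41.16, 'hpricot' => 18.50 },
  'small IO parse'        => { 'nokogiri' => 17.95, 'libxml-ruby' => 19.27, 'hpricot' => 25.78 },
  'small in-memory parse' => { 'nokogiri' => 18.34, 'libxml-ruby' => 19.99, 'hpricot' => 16.13 }
}.freeze

# Hypothetical helper: returns the fastest gem and each gem's time
# relative to the fastest one (1.0 means "the winner").
def fastest_and_ratios(timings)
  winner, best = timings.min_by { |_, seconds| seconds }
  ratios = timings.transform_values { |seconds| (seconds / best).round(2) }
  [winner, ratios]
end

TOTALS.each do |test, timings|
  winner, ratios = fastest_and_ratios(timings)
  puts "#{test}: fastest is #{winner}; relative times #{ratios}"
end
```

A ratio of 2.0 means that gem took twice as long as the fastest one in that test.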

In real-world scenarios, one might expect Nokogiri to be the better choice for parsing large XML or HTML documents from disk into a database, whereas Hpricot might be a better fit for a web crawler, where it is rare for a page’s DOM to be larger than 1MB.

Please post any other thoughts you might have in the comments.

Posted in Comedy, New Stuff, Software Development, ruby on March 18th, 2009.

2 Responses to “libxml-ruby vs nokogiri vs hpricot”

  1. June 11th, 2009 at 1:02 pm #JamesD

    Thanks for the useful info. It’s so interesting

  2. December 9th, 2009 at 7:09 pm #Patrick Tulskie

    No problem. Glad you found it useful.
