libxml-ruby vs nokogiri vs hpricot
Update: Aaron told me that he is going to be re-running the benchmarks this weekend so we’ll get a more complete set of data from the machine that originally ran the tests.
If you’re into parsing XML or HTML with Ruby, chances are you’re familiar with the various gems out there for getting the job done. Lately there has been a lot of back-and-forth about which one is the fastest, and to settle it, Aaron Patterson (author of Nokogiri and Mechanize) wrote a test suite.
After its release, RubyInside posted about how the tests showed Nokogiri to be much faster than Hpricot in this article: Ruby XML Performance Shootout: Nokogiri vs LibXML vs Hpricot vs REXML. Later in the day, I saw Why’s post about the release of Hpricot (hpricot 0.7) and decided to modify Aaron’s tests to use Hpricot.XML. Here are the results:
The large-document tests were run at N=5 (the small-document tests at N=1944) to get a clearer picture of the differences between the various gems. At N=2 the results were too close to call, which indicated that a larger sample was needed.
test_IO_parsing(XmlTruth::DOM::XML::LargeDocumentParsingTest) N=5
user system total real kBps
null 0.690000 0.070000 0.760000 ( 0.768641) 46343.68
nokogiri 2.790000 0.130000 2.920000 ( 3.015303) 11813.62
libxml-ruby 2.970000 0.140000 3.110000 ( 3.130175) 11380.08
hpricot 13.660000 0.370000 14.030000 ( 14.088780) 2528.37
.
test_in_memory_parsing(XmlTruth::DOM::XML::LargeDocumentParsingTest) N=5
user system total real kBps
null 1.240000 0.010000 1.250000 ( 1.260841) 28252.30
nokogiri 4.360000 0.060000 4.420000 ( 4.444468) 8014.83
libxml-ruby 4.570000 0.050000 4.620000 ( 4.641338) 7674.87
hpricot 13.750000 0.210000 13.960000 ( 14.045647) 2536.13
.
test_simple_xpath(XmlTruth::DOM::XML::LargeDocumentXPathSearchTest) N=5
user system total real kBps
nokogiri 44.430000 0.300000 44.730000 ( 44.972003) 792.09
libxml-ruby 40.950000 0.210000 41.160000 ( 41.300780) 862.49
hpricot 18.410000 0.090000 18.500000 ( 18.540239) 1921.32
.
test_IO_parsing(XmlTruth::DOM::XML::SmallDocumentParsingTest) N=1944
user system total real kBps
null 8.150000 0.130000 8.280000 ( 8.326070) 4278.17
nokogiri 17.850000 0.100000 17.950000 ( 17.950534) 1984.36
libxml-ruby 19.010000 0.260000 19.270000 ( 19.370769) 1838.87
hpricot 25.320000 0.460000 25.780000 ( 25.827516) 1379.16
.
test_in_memory_parsing(XmlTruth::DOM::XML::SmallDocumentParsingTest) N=1944
user system total real kBps
null 3.960000 0.030000 3.990000 ( 4.005522) 8892.82
nokogiri 18.140000 0.200000 18.340000 ( 18.403396) 1935.53
libxml-ruby 19.760000 0.230000 19.990000 ( 19.999905) 1781.03
hpricot 15.980000 0.150000 16.130000 ( 16.133157) 2207.90
.
Finished in 426.233021 seconds.
5 tests, 0 assertions, 0 failures, 0 errors
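For anyone unfamiliar with the columns above, here is a minimal sketch of how a row like these could be produced with Ruby’s standard Benchmark module. This is not the code from Aaron’s suite: the measure_throughput helper and the stand-in "parser" are hypothetical, and the kBps formula (total kilobytes processed divided by wall-clock seconds) is my assumption, though it is consistent with the numbers in the output.

```ruby
require 'benchmark'

# Hypothetical helper, not part of the XMLTruth suite.
# Runs the given block n times over `data` and returns the timings
# plus an assumed throughput figure.
def measure_throughput(label, data, n)
  tms = Benchmark.measure { n.times { yield data } }
  kilobytes = data.bytesize * n / 1024.0
  # Assumption: kBps = kilobytes processed per second of wall-clock time.
  kbps = tms.real > 0 ? kilobytes / tms.real : Float::INFINITY
  { label: label, user: tms.utime, system: tms.stime,
    total: tms.total, real: tms.real, kbps: kbps }
end

# A synthetic document; the real suite uses its own fixture files.
xml = "<root>" + "<item>text</item>" * 10_000 + "</root>"

# Stand-in for a real parser call such as Hpricot.XML(doc).
row = measure_throughput("tag-scan", xml, 5) { |doc| doc.scan(/<item>/).size }

puts format("%-12s %10.6f %10.6f %10.6f (%10.6f) %10.2f",
            row[:label], row[:user], row[:system],
            row[:total], row[:real], row[:kbps])
```

Swapping the block for a real parser call is all it would take to compare gems this way, as long as every gem is fed the same data and the same N.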
You can find my fork of the test suite on GitHub: Patrick Tulskie’s Fork of XMLTruth
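From this small sample of tests, Nokogiri and libxml-ruby appear to perform similarly on most items. That makes sense, since Nokogiri is built on the native libxml2 library of the current operating environment. Nokogiri clearly excels at parsing larger documents: it got through the large document roughly 4.7 times faster than Hpricot when reading from IO (3.0 s versus 14.1 s) and about 3.2 times faster in memory. Hpricot, on the other hand, handled the small in-memory documents somewhat faster (16.1 s versus 18.4 s) and, in this run, was more than twice as fast as either libxml-based gem on the simple XPath search.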
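To make the comparison easier to eyeball, the wall-clock ("real") times from the output can be expressed relative to the fastest gem in each test. The snippet below is just a rough sketch of that arithmetic; the relative_times helper is something I made up for illustration, and the numbers are copied by hand from the results.

```ruby
# Wall-clock ("real") seconds copied from the benchmark output.
real_seconds = {
  "large IO parsing"        => { "nokogiri" => 3.015303,  "libxml-ruby" => 3.130175,  "hpricot" => 14.088780 },
  "large in-memory parsing" => { "nokogiri" => 4.444468,  "libxml-ruby" => 4.641338,  "hpricot" => 14.045647 },
  "large simple XPath"      => { "nokogiri" => 44.972003, "libxml-ruby" => 41.300780, "hpricot" => 18.540239 },
  "small IO parsing"        => { "nokogiri" => 17.950534, "libxml-ruby" => 19.370769, "hpricot" => 25.827516 },
  "small in-memory parsing" => { "nokogiri" => 18.403396, "libxml-ruby" => 19.999905, "hpricot" => 16.133157 }
}

# Hypothetical helper: divide every time by the fastest one in the test,
# so the winner reads 1.0 and the others show how many times slower they were.
def relative_times(times)
  fastest = times.values.min
  times.transform_values { |t| (t / fastest).round(2) }
end

real_seconds.each do |test, times|
  puts "#{test}: #{relative_times(times)}"
end
```

Keep in mind these ratios come from a single run on one machine, so treat them as a rough guide and not as a ranking.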
In real-world scenarios, one might expect Nokogiri to be the ideal solution for parsing large XML or HTML documents from disk into a database, whereas Hpricot might be the better gem for a web crawler, where a page’s DOM is rarely more than 1 MB.
Please post any other thoughts you might have in the comments.
June 11th, 2009 at 1:02 pm #JamesD
Thanks for the useful info. It’s so interesting
December 9th, 2009 at 7:09 pm #Patrick Tulskie
No problem. Glad you found it useful.