Using Nokogiri to parse XML
I recently found myself needing to parse some XML. This is an ordinary task in most languages, and in Ruby it is made very easy by the ‘Nokogiri’ gem. Roughly translated from Japanese, Nokogiri means ‘saw’. I can only assume this is a reference to the fact that Nokogiri will help you saw through XML/HTML. Nokogiri lets you parse and query an HTML or XML document using a simple syntax. Although Ruby ships with REXML for parsing XML, I think the overhead of using Nokogiri is well worth it since it allows for cleaner and more readable code.
So let's jump right into it. Here is a simple use case for Nokogiri.
require 'net/http' # standard lib
require 'uri'      # standard lib
require 'nokogiri'
# fetch the play over HTTP as a string
uri = URI.parse("http://www.ibiblio.org/xml/examples/shakespeare/macbeth.xml")
url_data = Net::HTTP.get(uri)
# parse the string into a Nokogiri XML document
xml = Nokogiri.XML(url_data)
# check for parse errors
errors = xml.errors
puts errors
Above is a script that pulls in an XML-formatted version of Shakespeare's Macbeth using a simple HTTP call. We then use Nokogiri to parse the response into an XML document. Nokogiri makes it very easy to check whether there were any errors while reading in the XML: simply call the errors method on the xml object.
Once the document is read into memory, we can perform many different operations. Nokogiri includes a ton of methods for processing your XML and manipulating the data into your desired state. Most often you will use the css method to grab the data you want and then process it.
xml.css("SPEECH").each do |speech|
  speaker_name = speech.css('SPEAKER').text
  puts speaker_name
end
The css method, shown above, lets you query the document using CSS3 selector syntax. Nokogiri uses your selectors to search the document and returns a NodeSet, an array-like collection of the matching elements. You can alternatively use xpath(); however, CSS queries are often the easiest and most succinct way to express what you’re looking for, so don’t be afraid to use them. That last bit is a direct quote from the Nokogiri XML documentation, and I wholeheartedly agree with it. Finally, a selector returns the full tag along with its content. If you only need the tag's content, simply append the text method, which converts the XML or HTML tag to its text.
Example:
speaker_name = speech.css('SPEAKER')
will return <SPEAKER>MACBETH</SPEAKER>
speaker_name = speech.css('SPEAKER').text
will return MACBETH
One more point to note: the css selector is relative to the node it is called on. In the loop above, I call css on the speech variable in each iteration, so it only grabs the speaker for that single chunk of XML.
If you want to know more about Nokogiri, you can visit the project's website and check out their more in-depth tutorial on parsing HTML/XML.