Simple HTML Parsing With Rails

July 31, 2011

I often want to do some simple HTML parsing, usually to find tags or run validations, and I am left frustrated by the lack of documentation on how to use what is built into Rails.  Sure, there are great gems like Nokogiri, but why add a big fat gem to your list of dependencies when it’s all built right in already?

That is perhaps the most frustrating part about it.  I know from testing that Rails definitely has some HTML parsing punch, but where’s the documentation?

Well, I finally did some digging and here’s what I found.

The first trick is wrapping your HTML document, presumably a String, in a tokenizer that Rails can step through:

my_html = "<div class='content doublespace' id='main-content'>this is some html</div>"
tokens = HTML::Tokenizer.new(my_html)

Excellent. Now we can loop through all of the nodes and check whether any of them have the class “content”.  But first we need to convert each token from a String into an HTML node object.  The tokenizer walks the HTML tags and text blocks and hands each one back to us as a String, which we then parse:

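tags = []
# tokens.next returns nil once the input is exhausted
while (token = tokens.next)
  # arguments: parent, line, position, content, strict
  node = HTML::Node.parse(nil, 0, 0, token, false)
  # keep opening and self-closing tags; skip text nodes and closing tags
  tags << node if node.tag? && node.closing != :close
end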

Now we can easily check the attributes for each tag:

tags.first.name # => "div"
tags.first.attributes # => {"class"=>"content doublespace", "id"=>"main-content"}
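
To get back to the original goal of checking for the class “content”, here is a minimal sketch in plain Ruby.  It works on a Hash shaped like the attributes output above, so it does not touch the Rails parser at all.  The has_class? helper is my own hypothetical name, not something Rails provides:

```ruby
# A Hash in the same shape as tags.first.attributes above.
attributes = { "class" => "content doublespace", "id" => "main-content" }

# Hypothetical helper (not part of Rails): true when the whitespace-separated
# class attribute contains exactly the given class name.
def has_class?(attributes, class_name)
  attributes.fetch("class", "").to_s.split.include?(class_name)
end

has_class?(attributes, "content") # => true
has_class?(attributes, "double")  # => false, partial names do not match
```

Splitting the class string on whitespace, rather than doing a substring search, keeps “double” from matching “doublespace”.  With the tags array from earlier, something like tags.select { |tag| has_class?(tag.attributes, "content") } should pick out the matching nodes.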

So there it is.  From here we can use all the well-documented parts of Ruby to do all the parsing we like.
