12 Dec, 2006

Published at 03:38AM

Tagged with site, views, and programming

This post has 0 comments

Filter out html blocks using regex

A recent comment made me realize that I had completely forgotten to style the pre block, so I did. Then I realized that using a pre block in a post at all raises a potential problem, one I hadn’t noticed before. As you may know, I limit the post text on the Home and Bits pages to a certain number of words. What if I posted a code snippet or blockquote and the word limit happened to fall in the middle of the block, so that its tags were never properly closed? Not only would the XHTML fail to validate, but I’m not sure how well the CSS would cope. I’m already accounting for a missing </p> tag, which is easy because a </p> is the only thing that could be missing, right? Wrong. It could just as easily be a closing pre or blockquote tag.
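To make the problem concrete, here is a minimal sketch of a word limit cutting a pre block in half. The truncate_words helper is hypothetical (the site's real truncation code isn't shown in this post), and it assumes the limit is applied by splitting on whitespace.

```ruby
# Hypothetical helper, not the site's actual code: keep only the first
# `limit` whitespace-separated words of the rendered HTML.
def truncate_words(html, limit)
  html.split(/\s+/).first(limit).join(' ')
end

html = "<p>Intro text here.</p>\n<pre>\nputs 'one'\nputs 'two'\n</pre>"
excerpt = truncate_words(html, 5)

# Count opening and closing pre tags that survived the cut.
opened = excerpt.scan(/<pre>/i).size
closed = excerpt.scan(/<\/pre>/i).size

puts excerpt
puts "unclosed pre blocks: #{opened - closed}"
```

The excerpt ends inside the pre block, so the opening tag survives while the closing one is cut off.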

So I thought I would dabble in some regular expression goodness and try to come up with something I could use to pull out those special blocks only on the Home and Bits page(s), where I’m not pulling the full post. Here’s what I came up with:

# application_helper.rb
# (Note: replace the "span" with "pre" - I was having issues displaying it properly)
def strip_html_blocks(text)
  text.gsub(/<span>[^<]*<\/span>|<blockquote>.*?<\/blockquote>/im,'')
end

# view
<div class="post-body">
  <%= strip_html_blocks(to_html(p.body)) %>
</div>

So there you have it. The to_html method just converts the Textile text to HTML before passing it to the strip_html_blocks method. After all, the tags need to be there before I can strip them. I’m far from knowing regex well, but this seems to work in all of my situations. It is also forgiving when you have multiple blocks with text in between: the non-greedy .*? stops at the first closing tag, so it won’t suck out everything between the first opening tag and the last closing tag. For instance, say you had:

<p>Here are some words.</p>
<blockquote>
  <p>Here is a quote</p>
</blockquote>
<p>Here are some words in between.</p>
<blockquote>
  <p>Here is another quote</p>
</blockquote>
<p>And more text.</p>

Passing this text into the strip_html_blocks method would return (forgiving):

<p>Here are some words.</p>
<p>Here are some words in between.</p>
<p>And more text.</p>

Instead of (non-forgiving):

<p>Here are some words.</p>
<p>And more text.</p>
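The two results above differ only in the quantifier. Here is a small standalone sketch that runs both forms against the sample text. The LAZY and GREEDY constant names are made up for illustration; LAZY is the blockquote half of the pattern in strip_html_blocks.

```ruby
# Non-greedy: each match stops at the nearest closing tag (forgiving).
LAZY   = /<blockquote>.*?<\/blockquote>/im
# Greedy: one match runs from the first opening tag to the last closing tag.
GREEDY = /<blockquote>.*<\/blockquote>/im

html = <<~HTML
  <p>Here are some words.</p>
  <blockquote>
    <p>Here is a quote</p>
  </blockquote>
  <p>Here are some words in between.</p>
  <blockquote>
    <p>Here is another quote</p>
  </blockquote>
  <p>And more text.</p>
HTML

lazy_result   = html.gsub(LAZY, '')
greedy_result = html.gsub(GREEDY, '')

puts lazy_result.include?('in between')    # true: the middle paragraph is kept
puts greedy_result.include?('in between')  # false: it was swallowed with the quotes
```

Both results still contain the newlines that surrounded the removed blocks. That is harmless in rendered HTML, but the raw strings have blank lines where the blocks used to be.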

But by all means, if you see a fault somewhere, please let me know. Like I said, I’m hardly a regex programmer. Regexes always look so easy once they work. Oh, and the .*? would not work inside the pre tags. It was not forgiving at all, so the [^<]* is a workaround. Anyway, this took me a lot longer than it probably should have, so I thought I would post it in case someone else has a similar situation and wants to remove an HTML block using regex.
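One fault worth knowing about: because [^<]* cannot match a < character, a pre block that contains any nested tag is left in place. The sketch below shows that, along with one possible variant that uses a backreference and tolerates attributes on the opening tag. VARIANT is an assumption for illustration, not what this site runs, and it does not explain why .*? misbehaved inside pre in my setup.

```ruby
# The pattern from the post, with "pre" substituted for "span" as the note says.
ORIGINAL = /<pre>[^<]*<\/pre>|<blockquote>.*?<\/blockquote>/im

# Hypothetical variant: allow attributes on the opening tag, then match lazily
# up to the closing tag with the same name (\1 refers back to pre or blockquote).
VARIANT = /<(pre|blockquote)\b[^>]*>.*?<\/\1>/im

plain  = "<p>Before.</p>\n<pre>\nputs 1\n</pre>\n<p>After.</p>"
nested = "<p>Before.</p>\n<pre><code>puts 1</code></pre>\n<p>After.</p>"

plain_stripped   = plain.gsub(ORIGINAL, '')   # pre block removed
nested_original  = nested.gsub(ORIGINAL, '')  # unchanged: <code> blocks the match
nested_variant   = nested.gsub(VARIANT, '')   # pre block removed

puts nested_original == nested
puts nested_variant
```

Neither pattern copes with a block nested inside another block of the same name, such as a blockquote within a blockquote. For input like that, a real HTML parser would be the safer tool.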


Ryan Heath | Site Management A Ruby on Rails production.

This site is a Formed Function. Formed Function LLC | @formedfunction | Get in Touch