Newlines and Sed

This was actually written by floyd_n_milan, but he had trouble posting it on Blogger, so I am doing him the favor of posting it on my blog 🙂

This is a not-so-nice little problem.

Say you need to strip all the tags from an HTML file. That seems pretty simple
if every tag is contained on a single line, like this:

<html>

There could even be two tags on the same line, like this:

<html><head>

That’s not a problem since both tags start and end on the same line. Here’s
what you can do in sed to get rid of them:

sed -e 's/<[^>]*>//g'

It reads: substitute every pattern that starts with a <, is followed by zero
or more characters that are not >, and ends with a >, with nothing (the empty
string), for all instances on the line.

Why’s the [^>]* necessary then? Why not just <.*>?

Consider this:

<title>Hello World</title>

What would happen with <.*> here? sed will match from the < of <title> all the
way to the > of </title>, because .* is greedy: it always takes the longest
possible match. That removes everything on the line and leaves it empty, which
is not what we want. We just want to remove the <title> and </title>, keeping
the Hello World intact.

Hence, we use [^>]*. This means: match zero or more characters that are not >.
So, in essence, we’re matching the shortest pair of <>. Inside this pair,
after the initial <, there can be no > until the one that ends the pattern.
This’ll match <title> and </title> separately and keep Hello World intact.
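
To see the difference for yourself, here is a small sketch that runs both
patterns on the example line. The variable names greedy and negated are just
made up for this demo, and it assumes a POSIX shell with sed on the PATH.

```shell
# Same input line, two different patterns.
line='<title>Hello World</title>'

# Greedy: .* runs from the first < to the last > on the line.
greedy=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')

# Negated class: each match stops at the first > after its <.
negated=$(printf '%s\n' "$line" | sed -e 's/<[^>]*>//g')

printf 'greedy:  [%s]\n' "$greedy"    # prints: greedy:  []
printf 'negated: [%s]\n' "$negated"   # prints: negated: [Hello World]
```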

The problem still remains though. The above sed command will only work if the
entire tag is on one single line, because by default sed processes its input
one line at a time. So, if something like this comes up:

<p><font size="5"
face="blah">Blah blah blah</font></p>

sed will remove the <p> fine. Then it’ll find the <font but won’t find the
corresponding > on the same line. On the next line, it’ll remove the </font>
and the </p> at the end, but won’t remove the face="blah"> at the start
because it can’t find the initial <.
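
A quick way to reproduce this, as a sketch: the variable names input and
line_by_line are only for illustration, and the two-line string is the example
from above.

```shell
# The two-line example, with the <font ...> tag split across lines.
input='<p><font size="5"
face="blah">Blah blah blah</font></p>'

# sed sees one line at a time, so the split tag never matches as a whole.
line_by_line=$(printf '%s\n' "$input" | sed -e 's/<[^>]*>//g')

printf '%s\n' "$line_by_line"
# prints:
# <font size="5"
# face="blah">Blah blah blah
```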

This problem can be solved using the multiline pattern space in sed. This
script will work:

sed -e '/</{
N
s/<[^>]*>//g
}'

First, the /</ address selects every line that contains a <. The commands
inside the {} then operate on that line.

The N command appends the next line of input to the pattern space, separated
from the current line by a newline, so both lines are held there together. So
the content that sed operates upon now looks like this:

<p><font size="5"\nface="blah">Blah blah blah</font></p>

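with \n being just another character in the line. Since [^>] also matches a
newline, the contents now match the <[^>]*> used by the substitute command
properly. Keep in mind that a single N joins only two lines, so a tag that
spans three or more lines would still be left partly in place.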
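
Here is a sketch that runs the script on the two-line example, next to one
possible variation for tags that span more than two lines. The variation is
not part of the original script: strip_tags is a made-up helper name, and the
three-line input is invented for the demo. It loops with a label and the b
command, appending lines for as long as the pattern space ends inside an
unclosed tag. One thing to watch: what N does when there is no next line
differs between sed implementations (GNU sed prints the pattern space and
quits, while POSIX lets sed quit without printing), so a file that ends in the
middle of a tag may behave differently from one sed to another.

```shell
# Hypothetical helper (not from the original script): keep appending lines
# while there is a < with no > after it, then strip the complete tags.
strip_tags() {
  sed -e ':a' -e '/<[^>]*$/{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g'
}

two_lines='<p><font size="5"
face="blah">Blah blah blah</font></p>'

# Made-up input with one tag spread over three lines.
three_lines='<a
href="x"
>link</a>'

# The script from the post, applied to the two-line example.
original=$(printf '%s\n' "$two_lines" | sed -e '/</{
N
s/<[^>]*>//g
}')

looped_two=$(printf '%s\n' "$two_lines" | strip_tags)
looped_three=$(printf '%s\n' "$three_lines" | strip_tags)

printf '%s\n' "$original"       # prints: Blah blah blah
printf '%s\n' "$looped_two"     # prints: Blah blah blah
printf '%s\n' "$looped_three"   # prints: link
```

Either way, this is still a regex hack that treats every < as the start of a
tag, so it suits simple files better than arbitrary HTML.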
