Newlines and Sed

This was actually written by floyd_n_milan, but he had some problem posting it on Blogger, so I am doing him the favor of posting it on my blog 🙂

This is a not-so-nice little problem.

Say you need to strip all the tags from an HTML file. That seems pretty simple
if every tag is contained on a single line, like this:

<html>

There could even be two tags on the same line, like this:

<html><head>

That’s not a problem since both tags start and end on the same line. Here’s
what you can do in sed to get rid of them:

sed -e 's/<[^>]*>//g'

It reads: for every match on the line, take a pattern that starts with a <,
continues with zero or more characters that are not >, and ends with a >, and
replace it with nothing (the empty string).

Why’s the [^>]* necessary then? Why not just <.*>?

Consider this:

<title>Hello World</title>

What would happen with <.*> here? sed will match from the < of <title> all the
way to the > of </title>, because its regular expressions are greedy: they
take the longest possible match. This’ll empty the entire line, which is not
what we want. We just want to remove the <title> and </title>, keeping the
Hello World intact.

Hence, we use [^>]*. This means, match zero or more characters that are not >.
So, in essence, we’re matching the shortest pair of <>. Inside this pair,
after the initial <, there can be no > except the one that ends the pattern.
This’ll match both <title> and </title> separately and keep Hello World
intact.
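
To see the difference for yourself, here is a small sketch that runs both
patterns on the same line. The function names strip_greedy and strip_tags are
made up for this illustration; only the two sed expressions come from the text
above.

```shell
# Hypothetical helper: the greedy pattern, which matches from the first <
# to the last > on the line.
strip_greedy() {
    sed -e 's/<.*>//g'
}

# Hypothetical helper: the bracket-expression pattern, where each match
# stops at the first > it meets.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

printf '%s\n' '<title>Hello World</title>' | strip_greedy
# prints an empty line

printf '%s\n' '<title>Hello World</title>' | strip_tags
# prints: Hello World
```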

The problem still remains though. The above sed command will only work if the
entire tag is on one single line, because sed can read the file only line by
line. So, if something like this comes up:

<p><font size="5"
face="blah">Blah blah blah</font></p>

sed will remove the <p> fine. Then it’ll find the <font but won’t find the
corresponding > on the same line. On the next line, it’ll remove the </font>
and the </p> at the end, but won’t remove the face="blah"> at the start
because it can’t find the initial <.
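
You can reproduce this with a quick experiment. The variable name sample is
just for illustration; the input is the two-line snippet from above.

```shell
# The two-line sample from the text, stored with a real newline in it.
sample='<p><font size="5"
face="blah">Blah blah blah</font></p>'

# Feed it to the single-line command: the split <font ...> tag survives.
printf '%s\n' "$sample" | sed -e 's/<[^>]*>//g'
# prints:
# <font size="5"
# face="blah">Blah blah blah
```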

This problem can be solved using the multiline pattern space in sed. This
script handles a tag that is split across two lines:

sed -e '/</{
N
s/<[^>]*>//g
}'

First, the /</ address selects any line that contains a <. The commands inside
the {} then operate only on such lines.

The N command makes sed append the next input line to the pattern space,
separated by a newline, so both lines are held there together. So the content
that sed operates upon now looks like this:

<p><font size="5"\nface="blah">Blah blah blah</font></p>

with \n being just another character in the line. The contents then match the
<[^>]*> used by the substitute command properly.
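
One caveat: N pulls in exactly one extra line, so a tag spread over three or
more lines would still get through. A possible way around that, which is not
part of the original script and is only a sketch, is to loop: keep
substituting and appending lines for as long as an unclosed < remains. The
function name strip_tags_multiline and the label again are hypothetical names
chosen here.

```shell
# Hypothetical helper: strip tags even when one tag spans several lines.
#   :again          label to jump back to
#   s/<[^>]*>//g    remove every complete tag currently in the pattern space
#   $!{ ... }       only if this is not the last input line:
#   /</{ N; b again }   a < is still left over, so append the next line
#                       and try the substitution again
strip_tags_multiline() {
    sed -e ':again
s/<[^>]*>//g
$!{
/</{
N
b again
}
}'
}

# A tag spread over three lines.
three_lines='<p><font
size="5"
face="blah">Blah</font></p>'

printf '%s\n' "$three_lines" | strip_tags_multiline
# prints: Blah
```

Bear in mind that a literal < in ordinary text (say, a < b) will make the loop
keep appending lines until it finds a >, so this is only as reliable as the
markup it is given.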
