sed-Strip HTML tags (or XML tags)

Status
Not open for further replies.

kellnerp

Mechanical
Feb 11, 2005
1,141
0
0
US
So far I have this:
Code:
sed -r 's/(<[^>\n]*>)//g'
It does pretty well with tags that sit entirely on one line, but things like <img src=blah blah blah that may extend over more than one line are not being caught.

Likewise, for things like <style type=text/css> ... </style>, I want to remove not just the tags but also the text between them, and that is not being caught either. Again, <style></style> tag pairs run over multiple lines in the general case.

Is there a way to accomplish this in sed? I can do it in awk already.
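
One possible approach, sketched here as an illustration rather than a tested answer from the thread: sed works one line at a time, so a tag split across lines is never in the pattern space as a whole. The sketch below first deletes whole lines from a line containing <style through the line containing </style>, then joins the remaining input into one pattern space before stripping tags. The function name strip_html is hypothetical. It assumes lowercase tag names, that <style and </style> sit on different lines, and that nothing else worth keeping shares a line with them.

```shell
# Hypothetical helper: reads HTML on stdin, writes stripped text to stdout.
strip_html() {
  # Pass 1: delete every line from one matching "<style" through the
  # next line matching "</style>" (whole lines are removed).
  sed -e '/<style/,/<\/style>/d' |
  # Pass 2: append lines until the last one is read, so the whole input
  # is in the pattern space, then delete each "<...>" run. [^>] also
  # matches the embedded newlines, so multi-line tags are caught.
  sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g'
}

# Usage: strip_html < page.html > page.txt
```

Holding the whole file in the pattern space is fine for ordinary pages but may be slow or memory-hungry on very large inputs.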

TOP
CSWP, BSSE
Phenom IIx6 1100T = 8GB = FX1400 = XP64SP2 = SW2009SP3
"Node news is good news."

kellnerp,

Quite a few years ago, I wrote a crude SGML parser in Perl. Is there any reason you are not using Perl to do this?

Perl's split function searches a string for an arbitrary sequence and returns everything in the string up to the sequence and everything in the string after it. If you are messing with text, this is an awesome tool.

JHG