
sed-Strip HTML tags (or XML tags)

Status
Not open for further replies.

kellnerp

Mechanical
Feb 11, 2005
1,141
So far I have this:
Code:
sed -r 's/(<[^>\n]*>)//g'
It does pretty well with tags that sit entirely on one line, but things like <img src=blah blah blah that may extend over more than one line are not being caught.

Likewise, things like <style type=text/css> ... </style>, where I want to remove not just the tags but also the text between them, are not being caught. Again, <style></style> tag pairs run over multiple lines in the general case.

Is there a way to accomplish this in sed? I can do it in awk already.
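
For reference, one way this might be done in sed alone is to pull the whole input into the pattern space first, so that the substitutions can see across line breaks. The following is only a minimal sketch, assuming GNU sed (for -E and for N/b inside a $!{ } block). The function name strip_tags is made up for illustration. Since sed has no non-greedy match, the <style> rule assumes the style content contains no "</" sequence other than the closing </style> itself, and it only matches lowercase "style".
Code:
```shell
# strip_tags: hypothetical helper; reads HTML/XML on stdin, writes text on stdout.
strip_tags() {
  sed -E '
    # Slurp the whole input into the pattern space so a tag that
    # spans several lines can be matched by a single regex.
    :slurp
    $!{
      N
      b slurp
    }
    # Remove <style ...> ... </style> together with the enclosed text.
    # The body is matched as: any char except "<", or "<" not followed by "/".
    s,<style[^>]*>([^<]|<[^/])*</style>,,g
    # Remove every remaining tag; [^>] also matches embedded newlines.
    s/<[^>]*>//g
  '
}
```
It would be used as a filter, for example: strip_tags < page.html > page.txt (file names here are placeholders). Slurping the whole file is fine for ordinary pages but holds the entire input in memory, and a regex like this is not a real HTML parser, so comments or a ">" inside an attribute value will trip it up.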

TOP
CSWP, BSSE
Phenom IIx6 1100T = 8GB = FX1400 = XP64SP2 = SW2009SP3
"Node news is good news."
 

kellnerp,

Quite a few years ago, I wrote a crude SGML parser in Perl. Is there any reason you are not using Perl to do this?

Perl's split function searches a string for an arbitrary pattern, and it returns everything in the string up to the match and everything in the string after it. If you are messing with text, this is an awesome tool.

JHG
 