Library code snippets

Converting HTML to Text

Whether you want to convert an HTML page into pure text so you can parse out that special piece of information, or you simply want to load a page from the Net into your own word processing package, this mini function could come in handy.

It’s called StripTags and accepts an HTML string. Using a regular expression, it identifies all <tags>, removes them, and returns the modified string. Here’s the code:

Public Function StripTags(ByVal HTML As String) As String
    ' Removes tags from passed HTML.
    ' Regex.Replace is a Shared method, so no Regex instance is needed.
    Return System.Text.RegularExpressions.Regex.Replace( _
        HTML, "<[^>]*>", "")
End Function

Here’s a simple example demonstrating how you could use this function in code (see Figure 7-2 for my sample application):

Dim strData As String = StripTags("<body><b>Welcome!</b></body>")
' strData now contains "Welcome!"

I admit, it doesn’t look like much, but this little snippet can be a true lifesaver, especially if you’ve ever tried doing it yourself using the InStr and Mid functions. Have fun!
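
StripTags only removes tags, so line breaks such as <br>, entities such as &nbsp;, and the text inside elements like <title> all need extra handling. What follows is a minimal sketch of one way to layer those steps on top of the same regular expression; it is not part of the original snippet. The module name HtmlTextSketch and the function name HtmlToPlainText are hypothetical, and the code assumes .NET Framework 4.0 or later so that System.Net.WebUtility.HtmlDecode is available. Regular expressions cannot parse arbitrary HTML reliably, so treat this as a convenience for simple, well-formed input.

```vbnet
Imports System.Net
Imports System.Text.RegularExpressions

' Hypothetical module, shown for illustration only.
Public Module HtmlTextSketch

    Public Function HtmlToPlainText(ByVal html As String) As String
        ' IgnoreCase matches <BR> as well as <br>; Singleline lets "." span line breaks.
        Dim opts As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

        ' 1. Remove whole elements whose contents should not appear in the text.
        '    \1 refers back to whichever tag name the first group matched.
        Dim result As String = Regex.Replace( _
            html, "<(title|script|style)\b[^>]*>.*?</\1\s*>", "", opts)

        ' 2. Line breaks in the HTML source are not meaningful, so collapse each
        '    one (plus surrounding whitespace) into a single space.
        result = Regex.Replace(result, "\s*[\r\n]+\s*", " ")

        ' 3. Turn <br>, <br/> and </p> into real line breaks.
        result = Regex.Replace(result, "<br\s*/?>|</p\s*>", Environment.NewLine, opts)

        ' 4. Strip every remaining tag, exactly as StripTags does.
        result = Regex.Replace(result, "<[^>]*>", "")

        ' 5. Decode entities such as &amp; and &nbsp;. HtmlDecode turns &nbsp;
        '    into the non-breaking space character (code 160), which is then
        '    swapped for an ordinary space.
        result = WebUtility.HtmlDecode(result).Replace(ChrW(160), " "c)

        Return result.Trim()
    End Function

End Module
```

With this in place, HtmlToPlainText("<title>Blah</title><b>Welcome!</b><br>Line&nbsp;two") should return "Welcome!" on the first line and "Line two" on the second. The order of the steps matters: source line breaks are collapsed before <br> is converted, so the new line breaks are not swallowed, and entities are decoded last so that an encoded &lt; is never mistaken for the start of a tag.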

Comments

  1. 01 Oct 2007 at 16:16

    Almino,

     I have been given a task to convert an HTML string with tags to text. It appears that this function will work, but it won't handle all cases, such as <br> and HTML-encoded spacing (&nbsp;).

     Do you know how to modify the existing code to handle this? Your help is appreciated. I am not at all good with regular expressions.  Thanks.

    - Raju

  2. 16 Aug 2007 at 15:51
    I agree, it really is a lifesaver. If anyone wants a C# version, here it is.

    private string StripTags(string HTML)
    {
        // Removes tags from passed HTML
        System.Text.RegularExpressions.Regex objRegEx =
            new System.Text.RegularExpressions.Regex("<[^>]*>");

        return objRegEx.Replace(HTML, "");
    }

    Thanks,

    Almino

  3. 22 Jan 2007 at 22:23
    This code snippet ROCKS. EXACTLY what I needed. THANKS! :D

  4. 30 Sep 2003 at 11:10
    This is very helpful.  Additionally, most people will want to do more formatting: first replace all line breaks with a null string, and then replace all HTML line breaks (<br>, etc.) with line breaks, so that the end result looks more like the original document.

    Finally, what regexp would you use to get rid of not only the HTML tags, but also the text between particular tags?  Specifically, I am thinking of <title>Blah blah blah</title>.

Karl Moore