Converting HTML to Text
Whether you want to convert an HTML page into pure text so you can parse out that special piece of information, or you simply want to load a page from the Net into your own word processing package, this mini function could come in handy.
It’s called StripTags and accepts an HTML string. Using a regular expression, it identifies all <tags>, removes them, and returns the modified string. Here’s the code:
Public Function StripTags(ByVal HTML As String) As String
    ' Removes tags from passed HTML.
    ' Regex.Replace is a Shared method, so no Regex instance is needed.
    Return System.Text.RegularExpressions.Regex.Replace( _
        HTML, "<[^>]*>", "")
End Function
Here’s a simple example demonstrating how you could use this function in code (see Figure 7-2 for my sample application):
strData = StripTags("<body><b>Welcome!</b></body>")
I admit, it doesn’t look like much, but this little snippet can be a true lifesaver, especially if you’ve ever tried doing it yourself using the InStr and Mid functions. Have fun!
Almino,
I have been given a task to convert an HTML string with tags to text. It looks as if this function will work, but it won't handle cases such as <br> and HTML-encoded spacing (&nbsp;).
Do you know how to modify the existing code to handle this? Your help is appreciated. I am not good at all with regular expressions. Thanks.
- Raju
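One possible way to cover the two cases raised above is to convert <br> tags to line breaks before stripping the remaining tags, and then replace the &nbsp; entity with an ordinary space. This is only a minimal sketch: the module name HtmlTextSketch and the function name StripTagsKeepBreaks are hypothetical and not part of the original article, and only the &nbsp; entity is handled here.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module HtmlTextSketch
    ' Hypothetical helper, not from the original article.
    Public Function StripTagsKeepBreaks(ByVal HTML As String) As String
        ' Turn <br>, <br/>, <br /> (with or without attributes) into line breaks.
        Dim result As String = Regex.Replace( _
            HTML, "<br\b[^>]*>", Environment.NewLine, RegexOptions.IgnoreCase)
        ' Remove all remaining tags, using the same pattern as StripTags.
        result = Regex.Replace(result, "<[^>]*>", "")
        ' Replace the non-breaking-space entity with a plain space.
        ' Assumption: other entities (&amp;, &lt;, ...) are left untouched.
        Return result.Replace("&nbsp;", " ")
    End Function
End Module
```

With this sketch, StripTagsKeepBreaks("Line one<br>Line&nbsp;two") should give "Line one" and "Line two" on separate lines. If other entities also need decoding, a general-purpose decoder such as HttpUtility.HtmlDecode may be a better fit than adding more Replace calls.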
private string StripTags(string HTML)
{
    // Removes tags from passed HTML
    System.Text.RegularExpressions.Regex objRegEx = new System.Text.RegularExpressions.Regex("<[^>]*>");
    return objRegEx.Replace(HTML, "");
}
Thanks,
Almino
Finally, what regular expression would you use to get rid of not only the HTML tags, but also the text between particular tags? Specifically, I am thinking of <title>Blah blah blah</title>.
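One approach that might work here is to remove the whole <title> element, contents included, in a first pass and then strip the remaining tags as before. The sketch below assumes the element is not nested and that a regular expression is acceptable for the input at hand. The names HtmlTitleSketch and StripTagsAndTitle are hypothetical, chosen only for illustration.

```vbnet
Imports System.Text.RegularExpressions

Public Module HtmlTitleSketch
    ' Hypothetical helper, not from the original article.
    Public Function StripTagsAndTitle(ByVal HTML As String) As String
        ' Singleline lets "." match line breaks, so a title spanning lines is caught.
        Dim opts As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        ' ".*?" is lazy, so the match stops at the first closing </title>.
        Dim result As String = Regex.Replace( _
            HTML, "<title\b[^>]*>.*?</title\s*>", "", opts)
        ' Strip the remaining tags with the same pattern as StripTags.
        Return Regex.Replace(result, "<[^>]*>", "")
    End Function
End Module
```

For example, StripTagsAndTitle("<html><head><title>Blah blah blah</title></head><body><b>Welcome!</b></body></html>") returns "Welcome!". The same two-pass idea could be applied to other elements, such as <script> or <style>, by swapping the tag name in the first pattern.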