Converting HTML to Text
Whether you want to convert an HTML page into pure text so you can parse out that special piece of information, or you simply want to load a page from the Net into your own word processing package, this mini function could come in handy.
It’s called StripTags and accepts an HTML string. Using a regular expression, it identifies all <tags>, removes them, and returns the modified string. Here’s the code:
Public Function StripTags(ByVal HTML As String) As String
    ' Removes tags from passed HTML.
    ' Regex.Replace is a Shared method, so it is called on the type
    ' itself rather than through an uninitialized variable.
    Return System.Text.RegularExpressions.Regex.Replace(HTML, "<[^>]*>", "")
End Function
Here’s a simple example demonstrating how you could use this function in code (see Figure 7-2 for my sample application):
strData = StripTags("<body><b>Welcome!</b></body>")
I admit, it doesn’t look like much, but this little snippet can be a true lifesaver, especially if you’ve ever tried stripping tags by hand with the InStr and Mid functions. Have fun!
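If you want to try the function outside a form, here is a minimal console sketch. The module name StripTagsDemo, the TagPattern field, and the second sample string are invented for illustration; the pattern is the same one used above, just held in a reusable Regex instance.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripTagsDemo
    ' Same pattern as in StripTags: "<", any run of characters that are not ">", then ">".
    ' (TagPattern is a hypothetical name, not part of the original snippet.)
    Private ReadOnly TagPattern As New Regex("<[^>]*>")

    Public Function StripTags(ByVal HTML As String) As String
        ' Replace every tag with an empty string
        Return TagPattern.Replace(HTML, "")
    End Function

    Sub Main()
        ' Prints: Welcome!
        Console.WriteLine(StripTags("<body><b>Welcome!</b></body>"))
        ' Attributes sit inside the angle brackets, so they go too. Prints: Hello world
        Console.WriteLine(StripTags("<p class=""x"">Hello <a href=""#"">world</a></p>"))
    End Sub
End Module
```

Note that the pattern only removes the tags themselves. Entities such as &amp;amp; and the text inside elements like script or title are left in place.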
Almino,
I have been given a task to convert an HTML string with tags to text. It looks as if this function will work, but it won't handle all the cases, such as <br> and HTML-coded spacing (&nbsp;).
Do you know how to modify the existing code to handle this? Your help is appreciated. I am not at all good with regular expressions. Thanks.
- Raju
private string StripTags(string HTML)
{
    // Removes tags from passed HTML
    System.Text.RegularExpressions.Regex objRegEx = new System.Text.RegularExpressions.Regex("<[^>]*>");
    return objRegEx.Replace(HTML, "");
}
Thanks,
Almino
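For the <br> and &nbsp; cases raised above, one possible approach (a sketch, not a tested drop-in) is to turn line-break tags into newlines before stripping and then decode the entities afterwards. The names HtmlToTextSketch and HtmlToText are made up here. The sketch assumes System.Net.WebUtility is available (.NET Framework 4.0 or later); on older versions System.Web.HttpUtility.HtmlDecode plays the same role.

```vbnet
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

Module HtmlToTextSketch
    ' Hypothetical helper: StripTags plus handling for <br> and HTML entities.
    Public Function HtmlToText(ByVal html As String) As String
        ' 1. Turn <br>, <br/>, <br /> (with or without attributes) into line breaks
        Dim result As String = Regex.Replace(html, "<br\b[^>]*>", _
            Environment.NewLine, RegexOptions.IgnoreCase)

        ' 2. Remove every remaining tag, as StripTags does
        result = Regex.Replace(result, "<[^>]*>", "")

        ' 3. Decode entities such as &amp; and &nbsp;
        result = WebUtility.HtmlDecode(result)

        ' 4. &nbsp; decodes to the non-breaking space U+00A0; swap it for a plain space
        Return result.Replace(ChrW(160), " "c)
    End Function
End Module
```

With this in place, HtmlToText("Line one<br>Line&nbsp;two") should come back as two lines, "Line one" and "Line two". Decoding after stripping matters: if the entities were decoded first, an encoded &lt;b&gt; in the text would turn into a real tag and be removed.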
Finally, what regexp would you use to get rid of not only the html tags, but text between particular tags? Specifically i am thinking of <title>Blah blah blah</title>.
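One way to do that, offered as a sketch rather than a definitive answer, is to remove the whole element (opening tag, contents, closing tag) in a first pass and strip the remaining tags in a second. The names RemoveElementSketch, RemoveElement and StripTagsExcept are hypothetical. The pattern does not cope with an element nested inside another of the same name, which is fine for title but not for something like div.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RemoveElementSketch
    ' Hypothetical helper: removes <tagName ...>...</tagName> including the text inside.
    Public Function RemoveElement(ByVal html As String, ByVal tagName As String) As String
        Dim name As String = Regex.Escape(tagName)
        ' .*? is lazy, so it stops at the first closing tag.
        ' Singleline lets "." match newlines; IgnoreCase covers <TITLE> as well.
        Dim pattern As String = "<" & name & "\b[^>]*>.*?</" & name & "\s*>"
        Return Regex.Replace(html, pattern, "", _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)
    End Function

    ' Drops the named element with its contents, then strips all other tags.
    Public Function StripTagsExcept(ByVal html As String, ByVal tagName As String) As String
        Return Regex.Replace(RemoveElement(html, tagName), "<[^>]*>", "")
    End Function
End Module
```

For example, StripTagsExcept("<html><head><title>Blah blah blah</title></head><body>Hi</body></html>", "title") should return just "Hi".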