Using the 'IsMatch' Method in Regular Expressions to Screen Scrape a Webpage
By Steve Schofield
April 26, 2006
This article shows a code tip I discovered while developing a web service to 'screen scrape' a webpage.
The web service reads a remote URL and determines whether a certain phrase is present in the HTML. Regular expressions are well suited to the task, but the syntax takes some experience to work with.
In .NET 2.0, the Regex class in the System.Text.RegularExpressions namespace has a handy method called 'IsMatch' that makes this task much easier.
The code snippet below accepts two arguments (the URL to read and the text to search for). It makes an HTTP request and reads the response through a StreamReader, one line at a time, which makes it easy to test each line of the remote HTML for the text.
The one thing I discovered when using the 'IsMatch' method is that the match is case and space sensitive. For example, if you are searching 'http://www.iislogs.com' for text in the title of the page and pass the phrase ( IIS Logs - ), exactly that text is searched for, including its capitalization and spacing.
I hope this example helps in your Regular Expressions adventures.
Imports System

Module Module1

    Sub Main()
        ' Report whether the phrase appears anywhere in the page's HTML.
        Dim strRetValue As String = _
            readWebPage("http://www.iislogs.com", "IIS Logs - ")
        Console.WriteLine(strRetValue)
    End Sub

    ' Returns "Listed" if strArgument (a regular expression pattern)
    ' matches any line of the page at strSource, "Not Listed" if it
    ' does not, or "Errored:" plus the message if the request fails.
    Private Function readWebPage( _
            ByVal strSource As String, _
            ByVal strArgument As String) As String
        Dim strLine As String
        Dim objSR As System.IO.StreamReader = Nothing
        Dim objResponse As System.Net.WebResponse = Nothing

        Try
            Dim objRequest As System.Net.WebRequest = _
                System.Net.WebRequest.Create(strSource)
            objResponse = objRequest.GetResponse()
            objSR = New System.IO.StreamReader( _
                objResponse.GetResponseStream(), _
                System.Text.Encoding.ASCII)

            ' Test one line at a time; the match is case sensitive.
            Do While Not objSR.EndOfStream
                strLine = objSR.ReadLine()
                If System.Text.RegularExpressions.Regex.IsMatch( _
                        strLine, strArgument) Then
                    Return "Listed"
                End If
            Loop

            Return "Not Listed"
        Catch f As Exception
            Return "Errored:" & f.Message
        Finally
            ' Runs on every exit path, including the early Return.
            If objSR IsNot Nothing Then objSR.Close()
            If objResponse IsNot Nothing Then objResponse.Close()
        End Try
    End Function

End Module
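The case and space sensitivity described earlier can be checked in isolation, without a network request. The sketch below is only an illustration: the module name, the ContainsPhrase helper, and the sample HTML line are hypothetical and not part of the original web service. It assumes you may want a case-insensitive search (RegexOptions.IgnoreCase), and that the search text should sometimes be treated literally (Regex.Escape), since 'IsMatch' interprets its second argument as a regular expression pattern.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module IsMatchDemo

    ' Hypothetical helper: treats phrase as literal text, not as a pattern,
    ' and optionally ignores case.
    Public Function ContainsPhrase( _
            ByVal text As String, _
            ByVal phrase As String, _
            ByVal ignoreCase As Boolean) As Boolean
        Dim options As RegexOptions = RegexOptions.None
        If ignoreCase Then options = RegexOptions.IgnoreCase
        Return Regex.IsMatch(text, Regex.Escape(phrase), options)
    End Function

    Sub Main()
        ' Made-up sample line standing in for HTML read from a page.
        Dim line As String = "<title>IIS Logs - Home</title>"

        ' Exact case and spacing: prints True.
        Console.WriteLine(Regex.IsMatch(line, "IIS Logs - "))
        ' Different case: prints False.
        Console.WriteLine(Regex.IsMatch(line, "iis logs - "))
        ' No space before the hyphen: prints False.
        Console.WriteLine(Regex.IsMatch(line, "IIS Logs- "))
        ' Same lowercase phrase, ignoring case: prints True.
        Console.WriteLine(ContainsPhrase(line, "iis logs - ", True))

        ' Parentheses are grouping characters in a pattern, so the raw
        ' phrase does not match its own text: prints False.
        Console.WriteLine(Regex.IsMatch("IIS Logs (beta)", "Logs (beta)"))
        ' Escaped by the helper, it matches literally: prints True.
        Console.WriteLine(ContainsPhrase("IIS Logs (beta)", "Logs (beta)", False))
    End Sub

End Module
```

Whether to escape the search text is a design choice. Passing it through unescaped, as readWebPage does, lets the caller supply a real pattern, while escaping is safer when the text comes from someone who expects a plain phrase search.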
Steve Schofield is a Senior Internet Support Specialist with ORCS Web, Inc., a company that provides managed hosting solutions for clients who develop and deploy their applications on Microsoft Windows platforms. Services include shared hosting, dedicated hosting, and webfarm hosting, with a specialty in .NET, SQL Server, and architecting highly scalable solutions.