Using the 'IsMatch' Method in Regular Expressions to Screen Scrape a Webpage
By Steve Schofield
April 26, 2006

This article shares a code tip I discovered while developing a web service to 'screen scrape' a webpage. The web service reads a remote URL and then determines whether a certain phrase is present in the HTML. Regular expressions are well suited to the task, but the syntax takes some experience to work with. In .NET 2.0, the Regex class in the System.Text.RegularExpressions namespace has a handy method called 'IsMatch' which makes this task much easier.
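As a minimal sketch of the call on its own, the console module here runs the shared 'IsMatch' method against an in-memory string rather than a downloaded page. The module name, the ContainsPattern helper, and the sample markup are made up for illustration.

```vb
Imports System.Text.RegularExpressions

Module IsMatchSketch

    ' Hypothetical helper: True when strPattern matches anywhere in strHtml.
    Function ContainsPattern(ByVal strHtml As String, _
                             ByVal strPattern As String) As Boolean
        Return Regex.IsMatch(strHtml, strPattern)
    End Function

    Sub Main()
        ' Made-up sample markup, not taken from any real page.
        Dim strHtml As String = "<title>IIS Logs - Sample Page</title>"

        Console.WriteLine(ContainsPattern(strHtml, "IIS Logs - "))        ' True
        Console.WriteLine(ContainsPattern(strHtml, "<title>.*</title>"))  ' True
        Console.WriteLine(ContainsPattern(strHtml, "Apache Logs"))        ' False
    End Sub

End Module
```

The second call shows that the argument is a pattern, not only a fixed phrase: ".*" matches whatever sits between the two title tags.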

The readWebPage function in the listing that follows accepts two arguments (the URL to read and the text to search for). It makes an HTTP request, wraps the response stream in a StreamReader, and reads the page one line at a time. Each line is passed to 'IsMatch' to determine whether the text is present.

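The one thing I discovered when using the 'IsMatch' method is that the match is case and space sensitive. For example, if you are searching 'http://www.iislogs.com' for text in the title of the page and pass the phrase ( IIS Logs - ), that exact text, with the same capitalization and spacing, is what must appear in the page. Keep in mind that 'IsMatch' treats the argument as a regular expression pattern, so characters such as . * ( and ? have a special meaning unless they are escaped.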

I hope this example helps in your Regular Expressions adventures.

Module Module1

    Sub Main()
        Dim strRetValue As String = _
            readWebPage("http://www.iislogs.com", "IIS Logs - ")
        Console.WriteLine(strRetValue)
    End Sub

    ' Downloads strSource and reports whether any line of the HTML
    ' matches the regular expression pattern in strArgument.
    Private Function readWebPage( _
            ByVal strSource As String, _
            ByVal strArgument As String) As String
        Try
            ' Created inside the Try so a malformed URL is also reported.
            Dim objRequest As System.Net.WebRequest = _
                System.Net.WebRequest.Create(strSource)

            ' Using blocks close the response and the reader on every exit
            ' path, including the early Return and any exception.
            Using objResponse As System.Net.WebResponse = _
                    objRequest.GetResponse()
                Using objSR As New System.IO.StreamReader( _
                        objResponse.GetResponseStream(), _
                        System.Text.Encoding.ASCII)

                    ' The page is tested one line at a time, so a phrase
                    ' split across two lines will not be found.
                    Do While Not objSR.EndOfStream
                        Dim strLine As String = objSR.ReadLine()
                        If System.Text.RegularExpressions.Regex.IsMatch( _
                                strLine, strArgument) Then
                            Return "Listed"
                        End If
                    Loop
                End Using
            End Using

            Return "Not Listed"
        Catch f As Exception
            Return "Errored:" & f.Message
        End Try
    End Function

End Module
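
If the case sensitivity described earlier is not wanted, one possible variation (a sketch, not part of the original web service) is to pass RegexOptions.IgnoreCase to 'IsMatch' and to run the phrase through Regex.Escape so that it is treated as literal text. The module name and the ContainsPhrase helper are hypothetical.

```vb
Imports System.Text.RegularExpressions

Module LooseMatchSketch

    ' Hypothetical helper: treats strPhrase as literal text and ignores case.
    Function ContainsPhrase(ByVal strLine As String, _
                            ByVal strPhrase As String) As Boolean
        ' Regex.Escape neutralizes metacharacters such as "." and "(".
        Dim strPattern As String = Regex.Escape(strPhrase)
        Return Regex.IsMatch(strLine, strPattern, RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        ' Made-up sample markup, not taken from any real page.
        Dim strLine As String = "<title>IIS Logs - Sample Page</title>"

        ' Plain IsMatch is case sensitive, so the lowercase phrase fails.
        Console.WriteLine(Regex.IsMatch(strLine, "iis logs - "))    ' False
        ' With IgnoreCase the same phrase is found.
        Console.WriteLine(ContainsPhrase(strLine, "iis logs - "))   ' True
        ' Spacing still matters: two spaces do not match one.
        Console.WriteLine(ContainsPhrase(strLine, "IIS  Logs - "))  ' False
    End Sub

End Module
```

Spacing remains significant in this variation. Matching loosely on whitespace would need an explicit pattern such as \s+ between the words.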

Steve Schofield is a Senior Internet Support Specialist with ORCS Web, Inc., a company that provides managed hosting solutions for clients who develop and deploy their applications on Microsoft Windows platforms. Services include shared hosting, dedicated hosting, and webfarm hosting, with a specialty in .NET, SQL Server, and architecting highly scalable solutions.
