Convert a Blog Website into a SQL Data Dump

The challenge I faced: I wanted to catalog a blog site, format the content, and dump it into a SQL database to help out a buddy.

Solution Misstep 1:

I started down the road of using .NET's WebRequest and WebResponse objects.

You'll find plenty of examples of this on Google, and they run like this:

// requires: using System; using System.IO; using System.Net; using System.Text;

// used to build the entire page as it is read in
StringBuilder sb = new StringBuilder();

// buffer used on each read operation
byte[] buf = new byte[8192];

// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://myblog.com/blog/");

// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// we will read data via the response stream
Stream resStream = response.GetResponseStream();

string tempString = null;
string beginstr = "";  // placeholder: marker text that identifies the WOD within the page
int begin = -1;
int count = 0;

FileStream fs = new FileStream("test.txt", FileMode.Create);
StreamWriter writer = new StreamWriter(fs);

do
{
	// fill the buffer with data
	count = resStream.Read(buf, 0, buf.Length);

	// make sure we read some data
	if (count != 0)
	{
		// translate from bytes to ASCII text
		tempString = Encoding.ASCII.GetString(buf, 0, count);
		sb.Append(tempString);

		// locate and isolate the WOD
		begin = tempString.IndexOf(beginstr);
		if (begin > -1)
		{
			writer.WriteLine(tempString);
			Console.WriteLine(tempString);
		}
	}
}
while (count > 0); // any more data to read?

writer.Close();
Console.WriteLine("Press enter and all that.");
Console.ReadLine();

Wow, that feels like a whole lot of code for something pretty simple.

 

Resolution:

A buddy stumbled across this in a post.  It does everything the code above does in one line.  Just for kicks, it also strips the HTML tags out for you.  Nice.

public string RemoveHtml(string sURL)
{
	try
	{
		using (System.Net.WebClient wc = new System.Net.WebClient())
		using (System.IO.StreamReader reader = new System.IO.StreamReader(wc.OpenRead(sURL)))
			return System.Text.RegularExpressions.Regex.Replace(reader.ReadToEnd(), "<[^>]*>", "");
	}
	catch (Exception ex)
	{
		MessageBox.Show(ex.Message);
		return null;
	}
}

I’ll take it.  If you look at the two sets of code, you’ll notice the second snippet creates a WebClient object and just grabs the HTML dump with OpenRead.  The first snippet creates a WebRequest, fires it off, and gets the WebResponse.  Then it fills a byte-array buffer and runs a do-while loop to process each chunk.  Wow!  I’ll let you decide which code you’d want to use.  Needless to say, the smaller code footprint got me there faster.
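If you want to see it in action, the call itself is about as small as it gets (the URL here is just a stand-in for whatever page you're after):

// grab a blog page and print its tag-stripped text
string text = RemoveHtml("http://myblog.com/blog/");
if (text != null)
{
	Console.WriteLine(text);
}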

 

Problem:

Alas, none of this was getting me any closer to crawling through a specific set of URLs.  Eventually I realized I was looking for a sitemap generator, and where better to look for advice than Google.  They’ve compiled a very nice page of sitemap generators (http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators), and although I was looking forward to building my own in C#, the first rule of programming is laziness, so I quickly found a free tool that got me closer to what I wanted for this task:

http://wonderwebware.com/sitemap-generator/download.html

Slick tool.
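Once you have the list of URLs out of the generator, walking them is just a loop.  Here's a rough sketch of the idea, assuming the sitemap has been saved as a plain text file with one URL per line (the file names are made up) and reusing the RemoveHtml helper from above:

// rough sketch: read a list of URLs (one per line) and dump each page's stripped text
// "urls.txt" and "pages.txt" are placeholder file names
string[] urls = System.IO.File.ReadAllLines("urls.txt");
using (System.IO.StreamWriter outFile = new System.IO.StreamWriter("pages.txt"))
{
	foreach (string url in urls)
	{
		string text = RemoveHtml(url);        // helper shown earlier
		if (text != null)
		{
			outFile.WriteLine(url);
			outFile.WriteLine(text);
			outFile.WriteLine("----------");  // simple separator between pages
		}
	}
}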

That got me an HTML dump of all the blog pages.  A little Notepad++ and Excel sorting got me what I wanted (date/day/content).

Then I dumped the Excel data into a SQL database table.  Done.
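If you wanted to script that last hop instead of pushing the rows in by hand, a parameterized insert per row is all it takes.  A minimal sketch, assuming a BlogPosts(PostDate, DayName, Content) table and a local SQL Server database (both names are placeholders):

// minimal sketch: insert one (date/day/content) row into SQL Server
// connection string, table, and column names are assumptions
using (System.Data.SqlClient.SqlConnection conn =
	new System.Data.SqlClient.SqlConnection("Server=.;Database=BlogDump;Integrated Security=true"))
{
	conn.Open();
	string sql = "INSERT INTO BlogPosts (PostDate, DayName, Content) VALUES (@date, @day, @content)";
	using (System.Data.SqlClient.SqlCommand cmd = new System.Data.SqlClient.SqlCommand(sql, conn))
	{
		cmd.Parameters.AddWithValue("@date", new DateTime(2010, 1, 1));  // example values
		cmd.Parameters.AddWithValue("@day", "Friday");
		cmd.Parameters.AddWithValue("@content", "post text goes here");
		cmd.ExecuteNonQuery();
	}
}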

Conclusion? This was a case where I was reinventing the wheel.  Standing on the shoulders of giants saved me a lot of thrashing.

Thanks,

Mike
