Convert a Blog Website into a SQL Data Dump

The challenge I faced: I wanted to catalog a blog site, format the content, and dump it into a SQL database to help out a buddy.

Solution Misstep 1:

I started down the road of using .NET's WebRequest and WebResponse objects.

You can find plenty of examples of this on Google, and they run something like this:

using System;
using System.IO;
using System.Net;
using System.Text;

// used to build the entire input
StringBuilder sb = new StringBuilder();
// used on each read operation
byte[] buf = new byte[8192];

// prepare the web page we will be asking for (the URL was elided in the original)
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("");
// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// we will read data via the response stream
Stream resStream = response.GetResponseStream();
string tempString = null;
// marker for the start of the WOD (the actual value was elided in the original)
string beginstr = "";
int count = 0;
int begin = 0;
FileStream fs = new FileStream("test.txt", FileMode.Create);
StreamWriter writer = new StreamWriter(fs);

do
{
	// fill the buffer with data
	count = resStream.Read(buf, 0, buf.Length);
	// make sure we read some data
	if (count != 0)
	{
		// translate from bytes to ASCII text
		tempString = Encoding.ASCII.GetString(buf, 0, count);
		sb.Append(tempString);
		// locate and isolate the WOD
		begin = tempString.IndexOf(beginstr);
		if (begin > -1)
			writer.Write(tempString.Substring(begin));
	}
} while (count > 0); // any more data to read?

writer.Close();
Console.WriteLine("Press enter and all that.");
Console.ReadLine();

Wow, that feels like a whole lot of code for something pretty simple.



A buddy stumbled across this in a post.  It does everything the code above does in a single expression.  Just for kicks it also strips the HTML out for you.  Nice.

   public string RemoveHtml(string sURL)
   {
       try
       {
           using (System.Net.WebClient wc = new System.Net.WebClient())
               // grab the page in one call, then strip the HTML tags with a regex
               return System.Text.RegularExpressions.Regex.Replace(
                   new System.IO.StreamReader(wc.OpenRead(sURL)).ReadToEnd(), "<[^>]*>", "");
       }
       catch (Exception) { return null; }
   }

I’ll take it.  If you look at the two sets of code, you’ll notice the second snippet just creates a WebClient object and grabs the HTML dump with OpenRead.  The first snippet creates a WebRequest, fires it off, and gets the WebResponse.  Then it fills a byte-array buffer and runs a do-while loop to pull the page in one chunk at a time.  Wow!  I’ll let you decide which code you’d want to use.  Needless to say, the smaller code footprint was the faster route.



Alas, none of this was getting me any closer to crawling through a specific set of URLs.  Eventually I realized I was looking for a SiteMap generator, and where better to look for advice than Google.  They’ve compiled a very nice page of SiteMap generators, and although I was looking forward to building my own in C#, the first rule of programming is laziness, so I quickly found a free tool that got me closer to what I wanted for this task:

Slick tool.

That got me an HTML dump of all the blog pages.  A little Notepad++ and Excel sorting got me what I wanted (date/day/content).  Here are a couple of links I found along the way:

Then I dumped the Excel sheet into a SQL database table.  Done.
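For what it’s worth, the landing table was nothing fancy.  Here’s a minimal sketch, assuming SQL Server; the table and column names are illustrative, not the actual schema:

```sql
-- Illustrative schema, assuming SQL Server: one row per blog post,
-- matching the date/day/content columns sorted out in Excel.
CREATE TABLE BlogPosts (
    Id       INT IDENTITY(1,1) PRIMARY KEY,
    PostDate DATE         NOT NULL,
    DayName  VARCHAR(16)  NULL,
    Content  VARCHAR(MAX) NULL
);

-- The Excel rows can be loaded with the Import/Export wizard, or inserted
-- directly (sample values, not real post data):
INSERT INTO BlogPosts (PostDate, DayName, Content)
VALUES ('2010-01-04', 'Monday', 'Sample post body...');
```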

Conclusion? This was a case where I was reinventing the wheel.  Standing on the shoulders of giants saved me a lot of thrashing.



© Copyright Duke Hall