Tuesday, February 14, 2012

Unit Testing HTML Parsing, While Keeping AppHarbor and RefactorPro Happy

 

Short Version

My “unit” tests were starting a development web server, reading page responses with HttpWebRequest, loading them into HtmlAgilityPack HtmlDocuments and parsing them.  AppHarbor (it’s American, so no u after o) wouldn’t start the dev web server, and RefactorPro parsed a very large test HTML page because it was part of a VS project, which took up gobs of memory.  The solution was to turn the HTML pages into txt files, store them as project resources and use HtmlDocument.LoadHtml to keep everything self-contained.

Long Version

Yes, my unit tests were actually starting a web server and reading test versions of pages from it.  This breaks the general tenet of unit testing that tests shouldn’t go “out of memory”, meaning they shouldn’t read from disk, a database, the network, etc.  Everything should be contained within code.

So, it should have been no surprise that AppHarbor wouldn’t start a copy of the development web server for my unit tests to run against.  I knew the time had come to somehow bypass the network and read the HTML files from disk.  Use of the local dev server would be limited to working offline on the GO train.

I was also facing an issue with DevExpress’s RefactorPro.  It likes to parse every HTML file in your Visual Studio projects, just like it parses your code.  Unfortunately, one of the test pages is a list of all 11,000 products at the LCBO, and the LCBO website’s HTML is needlessly complex.  When the HTML page was included in the project it sent Visual Studio’s memory use through the roof.  I like RefactorPro/CodeRush too much to keep it unloaded while working on my LCBO Drink Locator code, so I kept the HTML page excluded from the project instead.  It was still on disk, and the dev web server would serve it up when requested anyway.

My first approach to the AppHarbor issue was to attempt loading the existing HTML file from disk.  My assumption was that if my unit test is in some bin\debug or bin\release folder then my HTML file is in ..\..\..\AnotherProject\doc.html.  This is true on my PC, but on AppHarbor’s build server of unknown technology configured with unknown settings, subject to change, it was not so.

The next thing I tried was to add the HTML doc as a project resource to one of the unit tests, and to load it into an HtmlAgilityPack HtmlDocument.  Thank you, StackOverflow, for suggesting the approach.  Unfortunately, adding the HTML doc as a resource meant it was part of a project again, sending memory use sky high.  Because it was a resource, I couldn’t just exclude it and still have it be read from disk.  I could now load the doc from “memory”, i.e. not from disk or network, but suffered long waits as RefactorPro parsed the LCBO’s crazy HTML.  (How many nested tables does a page need?)

I noticed HtmlDocument has a method called LoadHtml that accepts a plain string, which gave me an aha moment.  There was no need to make the project resources actual HTML docs; they could exist as txt files.  This meant RefactorPro would ignore them, and I could still load them from memory.  A rename and re-add as resource fixed the issue.
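Here is a minimal sketch of that idea, not the actual test code from the project.  The class names, the inline HTML and the resource name mentioned in the comments are all made up for illustration; the only HtmlAgilityPack calls used are HtmlDocument.LoadHtml and SelectSingleNode.  It assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class TestPages
{
    // Builds an HtmlDocument from HTML text that is already in memory,
    // so no disk or network access happens at parse time.
    public static HtmlDocument FromText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }
}

public static class Example
{
    public static void Main()
    {
        // In a test project the text would come from a .txt project resource,
        // for example Properties.Resources.ProductList (hypothetical name).
        // An inline string stands in for it here to keep the sketch self-contained.
        string html =
            "<html><body><table><tr><td>Sample product</td></tr></table></body></html>";

        HtmlDocument doc = TestPages.FromText(html);

        // Query the parsed document exactly as if it had been fetched from a server.
        HtmlNode cell = doc.DocumentNode.SelectSingleNode("//td");
        Console.WriteLine(cell.InnerText);
    }
}
```

Because the resource is just text, the file extension doesn’t matter to the parser; it only matters to tools that decide what to analyze based on the extension.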

All my unit tests now pass on my PC and when AppHarbor runs them.

Not sure why AppHarbor only sees 51 tests, and not 61, but ok.

FYI, the app is hosted at http://lcbodrinklocator.apphb.com/.  It’s a work in progress, but its ability to destroy planets, er, find drinks is fully functional.
