Eden Ridgway's Blog

.Net and Web Development Information






Friday, November 06, 2009 #

I've been quite curious about the various BigTable-like implementations, as they could have some bearing on what I am involved with at work. After reading various write-ups about the competing approaches (e.g. HyperTable, HBase, Dynomite, Voldemort, MongoDB) I decided that Cassandra looked like the best option for what I'm after. What I didn't realise was that setting it all up wasn't going to be as straightforward as it appears, because there is no end-to-end getting-started guide. So the objective of this post is not to explain Cassandra, but to save you several hours of jumping all over the web trying to work out what you should do next and how to generate the necessary C# Cassandra Thrift client code. If parts of the post don't apply, e.g. VM setup, simply skip ahead to the parts that do. If you want more information about Cassandra you should check out the following posts:


So to get Cassandra up and running in a virtual machine (or a physical one if you choose) you should follow these steps (valid as at 5 Nov 2009):

  1. Create a virtual machine to host Ubuntu 9.10 x86 Desktop. I use VirtualBox as my VM host on Windows so I will refer to it here, but feel free to substitute your favourite VM software. The VM I created has 384MB allocated to it with an 8GB virtual hard drive, using a bridged network adapter (so it can be accessed from the Windows XP host). Once you've gone through the install process, install the guest tools so that you can change the screen resolution, share folders, etc. To do this with VirtualBox go to Devices > Install Guest Additions, then open up a terminal window and execute

    sudo /media/cdrom/VBoxLinuxAdditions-x86.run

    Once this is complete you will need to reboot your VM.
  2. Now it is time to install the Cassandra/Thrift prerequisites and the software that we will use to set everything up:

    sudo apt-get install libboost-dev automake libtool flex bison pkg-config g++ -y
    sudo apt-get install sun-java6-jdk -y
    sudo apt-get install ant -y

  3. Once that is done we can install other useful Ubuntu software that we will use:

    sudo apt-get install rapidsvn -y
    sudo apt-get install monodevelop -y

    I also like to install the following:

    sudo apt-get install nautilus-open-terminal -y
    sudo apt-get install samba smbfs smbclient winbind -y
    sudo apt-get install nautilus-share -y

  4. You may discover that there is an issue with your Java runtime home directory. If that occurs, select Sun's Java Runtime as the default Java version by running the following command from the terminal window and selecting the third option:

    sudo update-alternatives --config java

  5. Now open up Nautilus by going to Places > Home Folder. Right click and create a Development folder. Within the Development folder create another two folders: Cassandra and Thrift.
  6. We can now get the source for Cassandra and Thrift. Run RapidSVN by going to Applications > Programming > RapidSVN. Check out the source from SVN by going to the RapidSVN menu option Repository > Checkout. In the URL enter http://svn.apache.org/repos/asf/incubator/thrift/trunk and for the Destination Directory browse to the Thrift development folder. So it should look like this:

    Now get the Cassandra code by doing the same thing but entering http://svn.apache.org/repos/asf/incubator/cassandra/trunk as the URL and selecting the Cassandra directory.
  7. Once we have all the code we are ready to compile Cassandra, so let's do that by executing the following commands from the terminal window:

    cd ~/Development/Cassandra
    ant

  8. Start the Cassandra server by opening a new terminal window and running the following commands:

    sudo ~/Development/Cassandra/bin/cassandra -f

    The running server will look like the screenshot below. You should take note of the Thrift binding address of localhost and the port (9160) that the C# client will connect to later.

  9. Now we want to start communicating with the server, so we will need to compile Thrift so that we can use it to generate our C# client:

    cd ~/Development/Thrift
    ./bootstrap.sh
    ./configure
    make
    sudo make install

  10. Now we can finally generate a C# client for the Thrift calls to Cassandra:

    cd ~/Development/Cassandra/interface/
    ~/Development/Thrift/compiler/cpp/thrift -gen csharp cassandra.thrift

    This will create the following directories inside of the Cassandra/interface folder: gen-csharp/Apache/Cassandra. These files along with the Thrift.dll will be copied into a new demo project we will create using MonoDevelop in the Development folder. So again in the terminal window execute:

    mkdir ~/Development/CassandraDemo
    cp ~/Development/Cassandra/interface/gen-csharp ~/Development/CassandraDemo -R
    cp ~/Development/Thrift/lib/csharp/Thrift.dll ~/Development/CassandraDemo

  11. Open up MonoDevelop and create a new console application by going to File > New > Solution. Now do the following:
    1. Select Console project
    2. Enter CassandraDemo as the name
    3. For the location, browse to the Development folder that we created earlier in your home directory.
    4. Uncheck the 'Create separate Solution directory' check box.
    5. Click the Forward button
    6. Don't make any changes on the next screen and just click Ok.
    7. Right click on the CassandraDemo project in the left tree view and select Add > Add Files.... Navigate to ~/Development/CassandraDemo/gen-csharp/Apache/Cassandra and select all the files in the folder. You will now have a project that looks like this:

    8. Add a reference to the Thrift assembly by right clicking on References and selecting Edit References. Change to the .Net Assembly tab and double click on the Thrift.dll and click OK.
    9. Now we can add the sample code to test it all out. Double click on Main.cs to make sure it is the file in focus in the editor, then replace the contents of the file with this:
      using System;
      using System.Collections.Generic;
      using Apache.Cassandra;
      using Thrift.Protocol;
      using Thrift.Transport;

      namespace CassandraDemo
      {
          class MainClass
          {
              static void Main(string[] args)
              {
                  TTransport transport = new TSocket("localhost", 9160);
                  TProtocol protocol = new TBinaryProtocol(transport);
                  Cassandra.Client client = new Cassandra.Client(protocol);

                  Console.WriteLine("Opening connection");
                  transport.Open();

                  System.Text.Encoding utf8Encoding = System.Text.Encoding.UTF8;

                  long timeStamp = DateTime.Now.Millisecond;
                  ColumnPath nameColumnPath = new ColumnPath()
                  {
                      Column_family = "Standard1",
                      Column = utf8Encoding.GetBytes("name")
                  };

                  Console.WriteLine("Inserting name columns");

                  //Insert the data into the column 'name' for two different keys
                  client.insert("Keyspace1", "1", nameColumnPath,
                      utf8Encoding.GetBytes("Joe Bloggs"), timeStamp, ConsistencyLevel.ONE);
                  client.insert("Keyspace1", "2", nameColumnPath,
                      utf8Encoding.GetBytes("Joe Soap"), timeStamp, ConsistencyLevel.ONE);

                  //Simple single value get using the key to get column 'name'
                  ColumnOrSuperColumn returnedColumn = client.get("Keyspace1", "1", nameColumnPath, ConsistencyLevel.ONE);
                  Console.WriteLine("Column Data in Keyspace1/Standard1: name: {0}, value: {1}",
                      utf8Encoding.GetString(returnedColumn.Column.Name),
                      utf8Encoding.GetString(returnedColumn.Column.Value));

                  Console.WriteLine("Getting slice range");

                  //Read an entire row
                  SlicePredicate predicate = new SlicePredicate()
                  {
                      Slice_range = new SliceRange()
                      {
                          //Start and Finish cannot be null
                          Start = new byte[0],
                          Finish = new byte[0],
                          Count = 10,
                          Reversed = false
                      }
                  };

                  ColumnParent parent = new ColumnParent() { Column_family = "Standard1" };
                  Dictionary<string, List<ColumnOrSuperColumn>> results = client.multiget_slice("Keyspace1",
                      new List<string>() { "1", "2" }, parent, predicate, ConsistencyLevel.ONE);

                  foreach (KeyValuePair<string, List<ColumnOrSuperColumn>> resultPair in results)
                  {
                      Console.WriteLine("Key: {0}", resultPair.Key);

                      foreach (ColumnOrSuperColumn resultColumn in resultPair.Value)
                      {
                          Column column = resultColumn.Column;
                          Console.WriteLine("name: {0}, value: {1}",
                              utf8Encoding.GetString(column.Name), utf8Encoding.GetString(column.Value));
                      }
                  }

                  Console.WriteLine("Closing connection");
                  transport.Close();
              }
          }
      }
    10. Hit F5 and watch the Application Output window. This is what you should see:


  12. Hopefully this is enough to get you going and have you fill in the remaining gaps. If you found it useful please post a comment. If the Cassandra guys would like to use the C# example they are welcome to it. Good luck!
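One closing caveat: the sample code assumes the Keyspace1 keyspace and Standard1 column family that ship in the default conf/storage-conf.xml of the Cassandra trunk. If the inserts fail complaining about the keyspace or column family, check that your storage-conf.xml contains a definition along these lines (a from-memory sketch — element and attribute names may vary between Cassandra versions):

```xml
<Keyspaces>
  <Keyspace Name="Keyspace1">
    <!-- Standard1 is a plain (non-super) column family whose column
         names are compared as raw bytes -->
    <ColumnFamily Name="Standard1" CompareWith="BytesType" />
  </Keyspace>
</Keyspaces>
```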

Saturday, January 03, 2009 #

I’ve always used SQL scripts to manage database creation and migration. At my current company we have a database project with patch scripts that conditionally execute based on the database version. These scripts are simply run in sequence by an installer, which stores the current database version as a database extended property. The approach relies on version-based naming of the SQL scripts and on developers remembering to update the patch script whenever they make a structure or stored procedure change. Of course one could try to use a tool such as Redgate SQL Compare to generate diff patch scripts, but that won’t always suffice as it won’t capture refactorings such as column name changes or data migration (for more read The Problem with Database Diffs).

The use of SQL scripts is not without its problems. The most annoying of these is that a patch script may break because SQL Server will parse and validate your script against the existing table structure even if the statements are not going to run due to the version check condition. This forces one to use nasty EXEC/sp_executesql alter/create statements with double-escaped strings (Yuck!). I find the following particularly problematic with this approach:

  1. The string-based SQL statements will always pass validation, even though they may be completely broken when they are actually executed.
  2. It is impractical to generate these by hand and the best way to create them is to use SQL Management Studio.
  3. If your stored procedures are not wrapped in strings, updating the patch script requires you to script the procedures out of the database again. Basically this increases development friction and discourages updating of the patch script.
  4. It becomes difficult to determine when a structural change was made as one has to trawl through huge SQL scripts and it is not easy to use a regex search to find a change.

Hence I long for something a little more strongly typed, which is why the growing number of .Net based database migration frameworks have caught my eye. I want to be sure that there isn’t something obvious that I am missing out on by not using one. Fortunately Ben Scheirman has done a fair amount of research already and wrote a fairly good round-up of available .Net migration tools. Looking at the tools there are clearly two different types of frameworks available:

  1. Structured SQL script execution – These tools provide a framework under which SQL scripts will be executed to create and upgrade a database to a specific version. Examples of these are: Tarantino and LiquiBase.
  2. Rails like Active Record Migration – Commands to upgrade or downgrade the database schema and data can be coded in a .Net language, with the possibility of executing SQL statements if required. Tools that fall into this category are: RikMigrations, SubSonic Migrations, Machine.Migrations and Migrator.Net.
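To make the second category concrete, a Rails-like migration in .Net looks roughly like the sketch below. It is modelled on the Migrator.Net style from memory, so treat the class, attribute and method names as assumptions (pseudocode in C# clothing) rather than the exact API of any one tool:

```csharp
//Hypothetical Migrator.Net-style migration. The version number on the
//attribute orders migrations; Up() moves the schema forward, Down() reverts it.
[Migration(20090103001)]
public class AddCustomerTable : Migration
{
    public override void Up()
    {
        Database.AddTable("Customer",
            new Column("Id", DbType.Int32, ColumnProperty.PrimaryKeyWithIdentity),
            new Column("Name", DbType.String, 100));
    }

    public override void Down()
    {
        Database.RemoveTable("Customer");
    }
}
```

The framework keeps a schema-version table in the target database and, on upgrade, runs the Up() of every migration with a version higher than the stored one (and the Down() methods in reverse order for a downgrade).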

The obvious question that springs to mind when looking at these frameworks is whether or not the additional development friction (additional tasks that must be performed during the course of development) is worth the benefit. To be honest I don’t see a massive difference between the structured SQL script execution approach and what I’ve been doing all along. However the likes of LiquiBase do provide functionality that nicely captures database refactorings and diffs, so maybe it could be used as more of a general toolset. The Active Record Migration approach, however, would be a more radical departure, so I’d like to explore it more.

Here are my thoughts on Active Record Migration like frameworks:

  1. Best paired with an ORM framework, such as NHibernate or SubSonic, where one is not writing a lot of custom SQL, and hence the upgrade classes can be kept clean and focused on structural and data migration. That is not to say that there would be no benefit in having them execute a lot of SQL statements to add/update stored procedures; it is just that it may get a little messy and possibly tedious.
  2. Using C# code to do database change would provide an opportunity to use something like log4net to dump out progress information. I've always longed for a decent way of doing this that doesn't involve a custom proc that inserts into a table. I really like the idea of having a degree of compile time checking for the database changes. Of course with SQL scripts you could always run them on each build and if they break, you fail the build.
  3. The whole idea of downgrading a database scares me because if someone accidentally points an old version of an application (with automatic migration logic) to the database (think xcopy deployment) you could end up losing tables and data. Of course one should always back up a database before doing a deployment, but I’m talking about a situation where one doesn’t know it is pointing to the wrong database until it is too late. I hope these tools back up the database before applying changes. In my mind a downgrade should only be performed if a rollout fails and one has to revert to a backup taken before the upgrade was undertaken.
  4. I really like the idea of using anonymous objects to insert data into the database (RikMigrations supports this) and avoiding the need to write insert statements.
  5. Do you not run into potential development concurrency and structuring issues? For example, during a development cycle, does one dump all changes into one class or spread them across several?
  6. The database vendor abstraction that it can provide is only attractive to a small number of projects. Furthermore, one will invariably need to execute vendor-specific statements, so the approach is best suited to a multi-vendor ORM such as NHibernate.
  7. The approach may not fly in an organization where a DBA must review all SQL changes. Although I don’t see any reason why the frameworks couldn’t generate these for you.

I think if I was currently working on a project that used an ORM like NHibernate or LINQ to Entities, the migration frameworks would be a lot more appealing. However with other types of projects I’m worried that the additional effort isn’t really going to be worth the benefit.

Am I missing any other potential benefits (or problems)? I would really like to hear if there are other compelling reasons to adopt the approach.

Wednesday, January 30, 2008 #

Looking at technical interview questions on the Internet it appears that getting candidates to write some code that prints out the Fibonacci series is quite popular. I personally prefer giving candidates slightly more relevant practical questions, but I thought I'd present the various solutions that one may take. For those of you who are unfamiliar with the Fibonacci series you should take a look at the Wikipedia page on it. Each number in the series is the sum of the previous two values except for the first and second numbers which are zero and one respectively. So the interviewer is expecting to see the following series of numbers:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584...

The Fibonacci series is often used by computer science teachers to teach recursion, even though it is a poor fit (see: How Not To Teach Recursion). However in all likelihood this is what the interviewer is after, so the expected standard solution should look like this:

internal static int GetFibonacciValue(int number)
{
    if (number < 0)
        throw new InvalidOperationException("Negative numbers are not supported by this implementation");

    if (number <= 1)
        return number;

    return GetFibonacciValue(number - 1) + GetFibonacciValue(number - 2);
}

A lot of people forget the negative number check, so including it may even earn you "bonus points". If you want to have a little fun you could always use a terse lambda expression to solve it like this:

Func<int, int> fibonacci = null;
fibonacci = n => n > 1 ? fibonacci(n - 1) + fibonacci(n - 2) : n;

The problem with the approaches above is that they perform really badly because of their recursive nature, so you could always show off and tell them that there is a closed form of the solution which allows you to solve it in the following manner:

private static readonly double rootOfFive = Math.Sqrt(5);
private static readonly double goldenRatio = (1 + rootOfFive) / 2;

internal static int GetFibonacciValue(int number)
{
    return Convert.ToInt32((Math.Pow(goldenRatio, number) - Math.Pow(-goldenRatio, -number)) / rootOfFive);
}

If you forget the closed form you could also tell the interviewer that using a simple for loop is far more efficient and ask him if he would be satisfied with that:

internal static int GetFibonacciValue(int number)
{
    if (number < 0)
        throw new InvalidOperationException("Negative numbers are not supported by this implementation");

    if (number <= 1)
        return number;

    int runningTotal = 1;
    int twoValuesBack = 1;

    for (int n = 2; n < number; n++)
    {
        int previousTotal = runningTotal;
        runningTotal = twoValuesBack + runningTotal;
        twoValuesBack = previousTotal;
    }

    return runningTotal;
}

That should be enough for you to ace that question during your technical interview. You may be curious as to how significant the performance differences between the approaches are, so I timed how long each took to generate the numbers from 0 to 121393. The graph below shows the average number of ticks it took to complete the operations.


If you are busy doing interviews, I hope this helps you out. Good luck!

Monday, January 28, 2008 #

With rich AJAX web applications one needs to look at optimising resources such as your CSS, JavaScript and image files. Yahoo has distilled a great collection of best practices into their YSlow Firefox extension. These should be taken as a guideline and not gospel. In general, however, the core principles are as follows:

  1. Keep server roundtrips to a minimum. The techniques for the various resource types are as follows:
    • JavaScript - combine them into a single file if possible. One has to make certain they are combined in the correct order in the file.
    • CSS - again merge them as far as possible.
    • Images - Use CSS sprites where one combines several images into one and uses the background image offset to display the one you are after. To create the sprites you can always use this online sprite generator.
  2. Reduce the size of each request. This is done in two ways:
    • GZip/Deflate compress text resources by configuring IIS and/or using an ASP.Net HTTP Module.
    • Reduce file sizes. The tools I use for each type are:
      • JavaScript - Jazmin provides pretty good minification by removing whitespace. If you want better minification then you should take a look at the Dojo ShrinkSafe compressor which will shorten the name of local variables and functions to reduce the file size even more. The downside to this is it makes troubleshooting a production issue more difficult.
      • CSS - CSSTidy not only reduces whitespace but optimises the CSS expressions as well.
      • Images - As PNGs are my image format of choice I use PNGGauntlet to reduce the image file size.
  3. Ensure resources are cached. If you know that the files are going to change infrequently then all you have to do is set IIS to enable content expiration for a folder/file to expire after a set number of days, say 10. This will ensure that the client browser caches the resources and does not re-download them. However, depending on the frequency of your deployments and your server architecture, you most likely want to ensure that your etags are set up correctly. These are used to indicate whether or not the cached browser resources have changed. When loading a page the browser will send a request to the server, with the etag, for each resource, and if it has not changed then the server will only return a 304 status code. Note that this advice is contrary to Yahoo's, who tell you to remove etags in a web farm scenario. What you should rather do is ensure that the etag change number is the same for each server in the web farm. This ensures that the etag generated for the same resource on a different server is the same (how to do this varies depending on your version of IIS - read this post for more IIS6 information). If you are using a single web server, however, the whole thing is a non-issue for you.
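As an aside, if you are running IIS7 the content-expiration part can also be set in web.config instead of through the IIS manager. The fragment below (matching the 10-day example above) makes IIS send a Cache-Control: max-age header for static files:

```xml
<system.webServer>
  <staticContent>
    <!-- Cache static resources on the client for 10 days -->
    <clientCache cacheControlMode="UseMaxAge" cacheControlMaxAge="10.00:00:00" />
  </staticContent>
</system.webServer>
```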

Okay, so we know we want to combine and minify our CSS and JavaScript files when we deploy to our test, staging and production environments, but we don't want to develop and debug our applications with monolithic files with no whitespace and cryptic names. The problem is that our pages contain references to these files, and once we merge them together for deployment those references will be incorrect. Well, this is quite easily solved using compiler directives, like so:

<% #if DEBUG %>
<script type="text/javascript" src='<%= ResolveUrl("~/JavaScript/File1.js") %>'></script>
<script type="text/javascript" src='<%= ResolveUrl("~/JavaScript/File2.js") %>'></script>
<script type="text/javascript" src='<%= ResolveUrl("~/JavaScript/File3.js") %>'></script>
<% #else %>
<script type="text/javascript" src='<%= ResolveUrl("~/JavaScript/Section.minified.js") %>'></script>
<% #endif %>

For CSS files you can either use the directives around alternative <link> elements or around @import statements, like this:

<style type="text/css">
<% #if DEBUG %>
@import '<%= ResolveUrl("~/Css/File1.css") %>';
@import '<%= ResolveUrl("~/Css/File2.css") %>';
<% #else %>
@import '<%= ResolveUrl("~/Css/CombinedStyles.minified.css") %>';
<% #endif %>
</style>

There is a catch to this, however: since the aspx pages are compiled separately, setting your application to compile as a release build will not cause the references to change. What you need to do is change the compilation setting in the application web.config file, which should be set to debug="false" in the production environment:

<compilation debug="false">

Now on to the merging and minification. This could be accomplished using MSBuild or a batch file, but a batch file is simpler, so I'll discuss that. In one of the build steps for the web site(s), invoke the batch file, which ensures that the optimised files are always up to date. So you want to add something like this to your pre/post build compilation step in the project property window:


I setup the "Build Scripts" folder to contain the optimisation programs as well as text files with lists of files to merge, for example:


You'll notice that I group the JavaScript files by section of the site and not by page. The granularity of file grouping really depends on the application, but by section or function is a pretty good starting point. Each line of the file will contain a file name, like this:


I don't need to specify the full relative path of the files with respect to the build folder, because when I develop web sites all JavaScript files are placed within a JavaScript folder off the root of the web site.
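So, as an illustration, AdminJsFiles.txt would simply contain entries like the following (the file names here are the placeholder ones from the script references earlier; yours will differ):

```
File1.js
File2.js
File3.js
```

The order of the lines matters, since the files are concatenated from top to bottom.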

In my simplified scenario the "Optimise Web Resources.bat" batch file contains the following:


:: Remember where we were invoked from and change to the batch file's directory
set OLDDIR=%CD%
cd /d %0\..

::Ensure there are no old temporary files lying around
del *.js
del *.css

::Create a single file for the JavaScript in the admin section
for /F "tokens=*" %%f in (AdminJsFiles.txt) do call type "..\Web Sites\Site\JavaScript\%%f" >> AdminCombined.js
type AdminCombined.js | jazmin > AdminCombined.minified.js
copy AdminCombined.minified.js "..\Web Sites\Site\JavaScript\AdminCombined.minified.js"

::Combine the Stylesheets into a single file
for /F "tokens=*" %%f in (CssFilesToMerge.txt) do call type "..\Web Sites\Site\CSS\%%f" >> CombinedStyles.css
csstidy.exe CombinedStyles.css --template=high CombinedStyles.minified.css
copy CombinedStyles.minified.css "..\Web Sites\Site\CSS\CombinedStyles.minified.css"

::Remove the temporary files
del *.js
del *.css

::Restore the old "current directory"
chdir /d %OLDDIR%

The one part of the batch file that confuses most people is the FOR /F "tokens=*" statement. That reads in a line at a time from the file specified in brackets and puts it in the %%f variable. The contents of each file are then appended into a single file using a "type SourceFile >> DestinationFile" statement. In the case of the JavaScript files they are then piped through the Jazmin compressor/minifier, the output of which is redirected into an output file, i.e. AdminCombined.minified.js. With the CSS files the command line syntax is a little different, but the outcome is the same.

Hopefully the rest of the batch file is pretty self evident and there is enough information here to get you started on implementing the approach yourself. If there are areas that need more explanation please leave a comment on the blog post.

Sunday, January 27, 2008 #

SQL Server Common Table Expressions (CTEs) are most often used to solve problems such as paging and hierarchical queries, however in this post I want to demonstrate another use: booking of timeslots. If you were writing a system for an attraction with predefined timeslots, say to see the damper in the Taipei 101 building, you would need to be able to distribute the allocation of tickets over these timeslots. So if a group of 16 people arrived, depending on the space available in the timeslot they may be allocated over several timeslots and would have to patiently wait until their turn came around.

The most obvious way to solve this problem is to use a looping structure that moves through the timeslots, allocates the tickets and reduces the number of tickets available for each timeslot. Unfortunately, to ensure that slots are not overbooked while this is going on, one has to lock the tables affected by the allocation process. This can cause performance and concurrency issues as the number of requested tickets, and therefore the lock duration, increases. A more efficient and elegant solution is to use the inherent looping/recursive nature of a CTE to do this allocation in one step. So instead of doing a loop, insert-into-select, loop-again cycle, the code can simply do a single insert-into-select statement to block out the allocated timeslots in one implicit transaction.

To demonstrate the approach, let us take a scenario where we want to allocate 16 tickets. The number of remaining places available per timeslot are as per the diagram below.


Our simplified database would therefore have a Timeslot table that contains the time and the running available capacity for each one, like this:

Time                  Available Capacity
1 Jan 2008 08:00:00   0
1 Jan 2008 08:20:00   5
1 Jan 2008 08:40:00   0
1 Jan 2008 09:00:00   7
1 Jan 2008 09:20:00   2
1 Jan 2008 09:40:00   10

We want to allocate 16 tickets across those timeslots. So the CTE we create should operate in the following manner:


The CTE to do this looks like this:


DECLARE @SeatsRequested int;
SET @SeatsRequested = 16;

WITH TimeSlotAllocation (Time, SeatsAllocated, SeatsLeftToAllocate) AS
(
    --Find the first time slot to use
    SELECT TOP 1
        Time,
        dbo.MinNumber(@SeatsRequested, AvailableCapacity) AS SeatsAllocated,
        @SeatsRequested - dbo.MinNumber(@SeatsRequested, AvailableCapacity) AS SeatsLeftToAllocate
    FROM Timeslot
    WHERE AvailableCapacity > 0
    ORDER BY Time

    UNION ALL

    --Sequentially find the other timeslots that can be used to fulfill the sale request
    SELECT
        TS.Time,
        dbo.MinNumber(TSA.SeatsLeftToAllocate, TS.AvailableCapacity),
        TSA.SeatsLeftToAllocate - dbo.MinNumber(TSA.SeatsLeftToAllocate, TS.AvailableCapacity)
    FROM Timeslot TS
    --Join on to the next timeslot (the example slots are 20 minutes apart)
    INNER JOIN TimeSlotAllocation TSA ON TS.Time = DATEADD(minute, 20, TSA.Time)
    WHERE TSA.SeatsLeftToAllocate > 0
)
SELECT Time, SeatsAllocated, SeatsLeftToAllocate FROM TimeSlotAllocation;

You’ll notice the code relies on a user-defined function, MinNumber, which simply returns the minimum value of the two supplied parameters. As you can see the logic is pretty straightforward:

  1. Find the first available timeslot.
  2. Allocate the lesser of the number of SeatsLeftToAllocate or the available timeslot capacity.
  3. Store the number of tickets remaining in SeatsLeftToAllocate
  4. Move on to the next timeslot by adding the number of minutes between timeslots.
  5. Repeat steps 2 to 4 until there are no SeatsLeftToAllocate.
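For completeness, a minimal implementation of the MinNumber helper used above could look like this (a sketch — any equivalent scalar function will do):

```sql
CREATE FUNCTION dbo.MinNumber(@First int, @Second int)
RETURNS int
AS
BEGIN
    --Return the lesser of the two supplied values
    RETURN CASE WHEN @First < @Second THEN @First ELSE @Second END;
END
```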

Of course the final solution would include other logic, such as bounding the allocation of timeslots by day and checking that all tickets had been allocated, but this is merely an illustration of an approach.

If you want to play around with a working example, here is the code.