Clickpass: Building a fault-tolerant, redundant web server architecture

When we first sat down to pencil out the architecture of Clickpass, one of our chief concerns was uptime. We were building a single-sign-on (SSO) and authentication system that both users and our partners would rely on, and it had to stay up.

Clickpass is a service aimed at developers that makes it easy to integrate OpenID and single-sign-on into their websites. Doing so has traditionally required a lot of work and been difficult for developers to get right in a way that is meaningful and effective for the end user.

The challenge Clickpass solves

Integrating OpenID requires a lot of thinking. It requires thinking about how you will transition users who are already signed up to your site onto a new SSO system, and what the flow and interface for that process should be. How will the new system mesh with the authentication and registration systems you already have?

An SSO web service

Clickpass delivers a solution to these challenges with a single, OpenID-based web service which plugs into a raw OpenID installation and handles all of the UI and flow for the consumer website. It makes OpenID simple to install and easy to get right.

The service also acts as an OpenID converter, plugging into proprietary authentication systems from Google, Facebook and Microsoft at one end and spitting out an OpenID service at the other. By doing this conversion the service delivers millions more users than raw OpenID alone.

The importance of uptime

In building a service oriented around registration and authentication we knew it was essential that the service should be reliable. If there's one thing you want from a web service delivering authentication it's uptime.

Building redundancy on a budget

Although redundancy was critical, we did not have a lot of money to splash around achieving it. Early on we talked with an ex-CTO of Real Media who kindly gave us an hour of his time to advise us. It was a very helpful conversation, but when he started describing how best to split the development between our data centre team, application team and business team I had to point out that he was talking to all three of those teams on our two-person conference call.

The constraints

We had to build redundancy and we had to build it cost-effectively. We made an early decision that we weren't going to go as far as building across separate data centres in the first iteration, but we wanted to be sure that we could withstand losing any individual box, whether application or database.

We also had another important constraint. One of the features of Clickpass is that it gives users complete anonymity: we generate a unique OpenID for every site someone visits, so that if someone Googled the OpenID used at one site there would be no way to connect it to any other. What that meant was that should a database go down, it was critical we didn't lose any records, as each one mapped to a unique identity. Any hot-swap-over tools we put in place were going to have to kick in within seconds and not suffer any data loss as they did so.

The technology

Clickpass was written in Ruby on Rails. Rails is a great agile development environment, but it does not always have the simplest production setup. Hopefully this is going to change with the great work being done on mod_rails for Apache. For our setup we used Apache with Mongrel running our Rails application.

On the database side we used MySQL. For us the choice was between PostgreSQL and MySQL. I think both have their merits; it was a close call overall and both would have worked for us. We chose MySQL because its use is more common in web production setups, and we felt it would provide more support and scale better.

Our servers

The first decision we made was to have separate database and application servers. The database server would be responsible for running our database (MySQL) and our application server would run our Ruby on Rails application stack. The benefits of keeping the servers separate:

  • Easier to manage
  • Easier to scale and allocate appropriate resources
  • Better ability to troubleshoot and identify problems
  • More flexibility in disaster recovery: if the application server goes down the database is still accessible, and vice versa

We decided to get two of each, giving us four servers altogether and leaving no single point of failure.

Application server setup

Each application server ran the following stack:

  • Apache with mod_proxy_balancer: Apache served our static files and, through mod_proxy_balancer, acted as a load balancer in front of our Mongrel cluster (a sketch of this configuration follows the list).
  • Mongrel: Mongrel is a single-threaded Ruby server that can run Rails. We had a cluster of 20 of these on each machine, and our Rails application ran on them.
  • Keepalived: this took care of failover. If one application server became unresponsive, the other would take over.
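To make the Apache/Mongrel arrangement concrete, here is a minimal sketch of what such a virtual host can look like. The hostname, ports and paths are placeholders rather than our actual configuration, and only three of the twenty Mongrel members are shown.

  # Illustrative Apache 2.2 virtual host (requires mod_proxy,
  # mod_proxy_http, mod_proxy_balancer and mod_rewrite).
  <Proxy balancer://mongrel_cluster>
    # One BalancerMember per Mongrel in the cluster
    BalancerMember http://127.0.0.1:8000
    BalancerMember http://127.0.0.1:8001
    BalancerMember http://127.0.0.1:8002
  </Proxy>

  <VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/app/current/public

    # Serve static files straight from disk; proxy everything else
    # to the Mongrel cluster.
    RewriteEngine On
    RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_FILENAME} !-f
    RewriteRule ^/(.*)$ balancer://mongrel_cluster%{REQUEST_URI} [P,QSA,L]
  </VirtualHost>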

We served all dynamic images from S3, and dynamic user data was served through the database servers. Because the application servers themselves held no state, failover could be handled seamlessly through Keepalived and, in theory, we could scale our application servers horizontally without restriction.

One complication was how to make sure that requests would reach the correct application server if our primary server failed. To solve this we considered the following options:

  1. Put a load balancer in front: this would be a separate server responsible for passing requests on to Apache. In the case of failure it would stop sending requests to the failed server.
  2. Dynamic DNS routing: if an application server failed, the DNS record for clickpass.com would be shifted over to the other application server.
  3. IP takeover: in this scenario the IP address is reallocated from the failed server to the other application server.

All the options have their pros and cons. The first introduces its own single point of failure. Option 2 can lead to some downtime while the new DNS record propagates. Option 3 is often not possible if your hosting setup does not support it, and it requires the two servers to be in the same rack.

Fortunately we spoke to ServerBeach (our hosting provider) and they put all our servers in the same rack and put a router in place that could support IP takeover, so IP takeover was the solution we implemented.
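For what it's worth, the Keepalived side of IP takeover boils down to a small VRRP definition on each box. The fragment below is only an illustrative keepalived.conf sketch; the interface name, router ID, password and virtual IP are placeholders.

  # Primary application server (the standby uses state BACKUP and a
  # lower priority, e.g. 50). All values here are placeholders.
  vrrp_instance VI_APP {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
      auth_type PASS
      auth_pass s3cret
    }
    virtual_ipaddress {
      192.0.2.10    # the shared public IP that clients connect to
    }
  }

When the primary stops sending VRRP advertisements, the standby promotes itself and traffic for the shared IP starts arriving at the surviving box, which is exactly the failover behaviour described above.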

Database servers

One would think that in 2009 (well, it was 2008 at the time, but it's still the same!) it would be very easy to set up a database with a full backup and instant recovery solution. This turned out to be the hardest part of our setup.

A standard solution for MySQL is to have two servers: one acting as a master, where all writes are done, and another acting as a slave that replicates from it. This scenario has the following benefits (a sketch of the replication settings appears after the list):

  • You can read from multiple locations. This is useful as most applications do more reads than writes.
  • You can run backups and more expensive read-only admin tasks against the slave server.
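The settings behind such a master/slave pair are roughly as follows. This is an illustrative sketch: the server IDs, host, credentials and log positions are placeholders, not our production values.

  # master my.cnf (illustrative)
  [mysqld]
  server-id = 1
  log-bin   = mysql-bin

  # slave my.cnf (illustrative)
  [mysqld]
  server-id = 2
  relay-log = mysql-relay-bin
  read-only = 1

  -- run once on the slave to start replicating; the log file and
  -- position come from SHOW MASTER STATUS on the master
  CHANGE MASTER TO
    MASTER_HOST='10.0.0.1',
    MASTER_USER='repl',
    MASTER_PASSWORD='changeme',
    MASTER_LOG_FILE='mysql-bin.000001',
    MASTER_LOG_POS=4;
  START SLAVE;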

This alone did not give us a bullet-proof solution, as a failure of the master would still cause downtime. We needed an automated way to move database writes to the slave server in the scenario where the master failed.

To do this we again used Keepalived and IP takeover. If the MySQL master stopped responding, we ran a script that performed the IP takeover and reconfigured the slave to act as the master.
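The exact script isn't reproduced here, but Keepalived can invoke one via its notify_master option, and a sketch of what such a promotion script might do is below. The credentials are placeholders and the details are assumptions rather than our production script.

  #!/bin/sh
  # Illustrative notify script run by Keepalived on the slave when it
  # takes over the database IP. Credentials are placeholders.
  MYSQL="mysql -u root -pchangeme"

  # Stop pulling from the (dead) master and give the SQL thread a
  # moment to apply any relay-log events that have already arrived.
  # A production script would check the slave status properly here.
  $MYSQL -e "STOP SLAVE IO_THREAD;"
  sleep 2

  # Promote this box: drop the slave configuration and allow writes.
  $MYSQL -e "STOP SLAVE; RESET SLAVE;"
  $MYSQL -e "SET GLOBAL read_only = 0;"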

In theory this could lead to some loss of data if certain writes had not yet propagated from the MySQL master to the slave. In practice this window would be a maximum of around 100ms, as both servers were on the same rack and our write operations were all relatively fast. Given that server failures are very unlikely, this minor potential data loss was acceptable.

A more complex solution such as Hadoop or CouchDB could have been used to address some of our redundancy and fault-tolerance constraints, but it would have required a lot of custom integration work and would have been overkill at the time.

Backup solution

Backups are critical for any website. We had a simple solution: a script run from a cron job that copied our SVN repository to S3 and took an LVM snapshot of the database.
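As a rough illustration, the script can be as simple as the following. The repository path, S3 bucket, volume group and the use of s3cmd are assumptions, not the exact script we ran.

  #!/bin/sh
  # Illustrative nightly backup; paths, bucket and volume names are
  # placeholders.
  DATE=$(date +%Y%m%d)

  # 1. Dump the SVN repository and copy it to S3.
  svnadmin dump -q /var/svn/clickpass | gzip > /tmp/svn-$DATE.dump.gz
  s3cmd put /tmp/svn-$DATE.dump.gz s3://example-backups/svn/

  # 2. Take an LVM snapshot of the MySQL data volume, dropping the
  #    previous night's snapshot first. For a consistent snapshot a
  #    production script would also hold FLUSH TABLES WITH READ LOCK
  #    while the snapshot is created.
  lvremove -f /dev/vg0/mysql-snap 2>/dev/null
  lvcreate --size 5G --snapshot --name mysql-snap /dev/vg0/mysql

  # crontab entry: run every night at 03:30
  # 30 3 * * * /usr/local/bin/nightly_backup.sh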

Monitoring

We used a combination of monit and pingdom.com to monitor Clickpass, meaning that if Clickpass was ever down or slow we would be notified instantly. Unfortunately this also meant there were at least two nights where I was woken up in the middle of the night by SMS alerts! The first was before we had even launched. I was not amused, but it was a good demonstration that our monitoring worked.
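For reference, the monit side of this amounts to a handful of checks. The monitrc fragment below is only a sketch: the pidfile paths, email address, ports and thresholds are placeholders, and Pingdom handled the external uptime checks separately.

  set daemon 60                      # poll every 60 seconds
  set alert ops@example.com          # email on any event (requires set mailserver)

  check process apache with pidfile /var/run/httpd.pid
    start program = "/etc/init.d/httpd start"
    stop  program = "/etc/init.d/httpd stop"
    if failed port 80 protocol http then restart

  check process mongrel_8000 with pidfile /var/www/app/shared/pids/mongrel.8000.pid
    start program = "/usr/bin/mongrel_rails cluster::start --only 8000 -C /var/www/app/current/config/mongrel_cluster.yml"
    stop  program = "/usr/bin/mongrel_rails cluster::stop --only 8000 -C /var/www/app/current/config/mongrel_cluster.yml"
    if totalmem > 110 MB for 5 cycles then restart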

Conclusion

Our end setup at Clickpass was a fully fault-tolerant system with no single point of failure. Since launching we've managed to stay up very well, and despite adding thousands of users to the system each day, the service has stayed fast and dependable and delivered better reliability than the Yahoo front page.
