Project Name: RAM

Overview

RAM was a project created to store and share data/files across several geographic locations. We used a variety of programming languages to build a RESTful service with a simple interface for storing and retrieving files from Amazon S3, along with both web and Matlab clients.

The content below covers the entire project, but many of the more specific details focus on the "server" and "upload/download manager" components, as those were the parts I was responsible for implementing.

Goals & Considerations

In a nutshell, the high-level, primary problems we were trying to address were:

  1. Ability to back up and restore large collections of files programmatically.
  2. Secure access to the backed-up files from anywhere (to simplify working with external partners).
  3. Reliability of data storage is critical. Loss of data is not acceptable.
  4. Ability to store custom metadata about the files, so that custom search tools could be built on top of it.
  5. Easy to interface with the system from various programming languages, Matlab being the most critical.

Some considerations we had to keep in mind:

  1. Users of the system would be in several geographic locations around the world; some would be employees of the company and some would not be (and therefore would not be inside the corporate network).
  2. Users must be able to download content at a rate of at least 6.67Gb/s from all locations (there was some logic behind this number which is not relevant here).
  3. Three (3) TB of disk space was initially required, and it was expected to grow to 10TB in the foreseeable future. Requiring each collaborator to keep a mirror of the data on their own machine was not feasible (or desired).
  4. Approximately 15GB of data was expected to be transferred each month (storing 10GB, retrieving 5GB).

We looked at existing file storage systems: locally hosted and cloud-based, both proprietary and open source. The systems which allowed us to easily store files programmatically didn't seem to provide an elegant means of tracking or searching custom metadata. Additionally, many seemed overly complicated for us to extend with this functionality ourselves.

Approach

We broke the project into two primary milestones:

  1. Implement the fundamental functionality of backing up raw data from Matlab to a "durable storage" location.
  2. Implement the ability to record and search metadata at both the individual-file and "collection" levels. Include a web interface to this information to lay the groundwork for providing a more intuitive interface for complex searches.

We broke the task into four distinct responsibilities: a server API, an interface for storage and retrieval of files, a Matlab client, and a web client (more on this in the "Technical Notes" section). We broke "storage and retrieval" of files out into a separate "app codebase" so we could host multiple instances of it as necessary, depending on the throughput we saw during uploads and downloads from different locations.

Technical Notes

  1. We created four separate codebases for the full system. This was a very nice way to work, as it created a very clear division of responsibility:
    1. The server: Provided a RESTful API all clients interacted with and stored the metadata of the system (collection names, contents, usage statistics, etc). [Application stack: Rails 3.1, PostgreSQL, Nginx, and Unicorn]
    2. The upload/download manager: Abstracted interaction with the durable store (S3 in this case) so the clients did not need to know anything about where files were stored. This was a very small Sinatra application, designed so we could either embed it as middleware in our main server application or break it out and host it at each physical location, giving us an opportunity to increase upload speeds for clients in countries where communication with our S3 zone was slower than acceptable (a rough sketch of this component follows this list). [Application stack: Sinatra 1.3.1, Nginx, and Thin]
    3. Matlab client: Enabled users to manage content in the system via a set of simple commands (create, overwrite, verify, retrieve). [Application stack: Matlab, Java]
    4. Web client: Enabled users to check the status/contents of collections via a web interface. [Application stack: Nginx, Backbone.js]
  2. We used header detection on our web server to identify which requests were looking for our RESTful API and which were standard 'HTML' requests. All API requests were forwarded down to our application servers, which were serving up our Rails app. The requests for HTML pages were forwarded to our Backbone.js application (an index.html file and associated JavaScript files).
  3. We had to enable CORS support to allow for (easy) local development initially, and later, to permit requests to come from both HTTP and HTTPS versions of the site. Although all API traffic was over SSL, users were initially allowed to load the web site over a standard HTTP connection, which required us to enable CORS for the non-secure domain as well as localhost (a sketch of this follows this list).
  4. We used Relish to keep the API documentation in sync with our implementation. Additionally, we used a structure for our endpoint links which allowed us to provide a direct link to the documentation on Relish that describes the functionality and requirements of each particular endpoint.
  5. We added some custom code to our ChiliProject installation to give users who were employees a "single sign-on" experience with their corporate credentials. This was a big win, as it allowed us to easily control access by feeding off of the company's LDAP authentication system (via ChiliProject), while also giving us the flexibility to create user accounts ourselves if necessary.
  6. We used Jenkins for continuous integration and Git for version control.
  7. All traffic was transmitted over SSL, and Basic Auth was required for all endpoints except the "root" endpoint of the site (a sketch of this follows this list).
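
To make the "upload/download manager" described in note 1 a little more concrete, below is a minimal sketch of what that kind of thin Sinatra wrapper around S3 can look like. It is illustrative only: the bucket name, the endpoint paths, and the use of the (v1-era) aws-sdk gem's AWS::S3 interface are assumptions, not the actual implementation.

    # Rough sketch of the upload/download manager idea: a tiny Sinatra app that
    # hides the durable store (S3) behind two endpoints. The bucket name, paths,
    # and aws-sdk (v1) usage are illustrative assumptions.
    require 'sinatra'
    require 'aws-sdk'

    S3 = AWS::S3.new(
      :access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
      :secret_access_key => ENV['AWS_SECRET_ACCESS_KEY']
    )
    BUCKET = S3.buckets['ram-durable-store'] # hypothetical bucket name

    # Store a file's contents under the given key.
    put '/files/:key' do
      BUCKET.objects[params[:key]].write(request.body.read)
      status 201
    end

    # Retrieve a previously stored file.
    get '/files/:key' do
      object = BUCKET.objects[params[:key]]
      halt 404 unless object.exists?
      content_type 'application/octet-stream'
      object.read
    end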
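
For note 3, here is a minimal sketch of what the CORS handling can look like when done by hand in a Sinatra/Rack app. The whitelisted origins (domain names and localhost port) are made up for illustration.

    # Sketch of manual CORS handling: echo the Origin header back only for
    # whitelisted origins (the HTTP and HTTPS versions of the site, plus
    # localhost for development). All origins listed here are hypothetical.
    require 'sinatra'

    ALLOWED_ORIGINS = %w[
      http://ram.example.com
      https://ram.example.com
      http://localhost:3000
    ].freeze

    before do
      origin = request.env['HTTP_ORIGIN']
      headers 'Access-Control-Allow-Origin' => origin if ALLOWED_ORIGINS.include?(origin)
    end

    # Answer CORS preflight requests for any path.
    options '/*' do
      headers 'Access-Control-Allow-Methods' => 'GET, POST, PUT, DELETE, OPTIONS',
              'Access-Control-Allow-Headers' => 'Authorization, Content-Type'
      halt 200
    end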
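
And for note 7, a rough sketch of the "Basic Auth everywhere except the root endpoint" rule as it might be expressed in a Rails 3.1 controller. The ApiUser model and the root action's response are hypothetical stand-ins.

    # Sketch of requiring Basic Auth on every endpoint except the root.
    # ApiUser is a hypothetical stand-in for the real credential check.
    class ApplicationController < ActionController::Base
      force_ssl                          # all traffic over SSL
      before_filter :require_basic_auth

      private

      def require_basic_auth
        authenticate_or_request_with_http_basic('RAM') do |username, password|
          ApiUser.authenticate(username, password) # hypothetical credential check
        end
      end
    end

    class RootController < ApplicationController
      skip_before_filter :require_basic_auth, :only => :show

      def show
        render :json => { :name => 'RAM' } # named endpoints would go here (see "Learnings")
      end
    end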

Learnings

  1. Named endpoints are fantastic.
    • We used "named endpoints" instead of hard-coded URLs for the various actions which can be performed via the API. In a nutshell, this means all clients agree to use whatever URL is returned from a "named endpoint" in your responses. This gives you a lot of freedom to switch out your application's structure and introduce optimizations without breaking clients whenever you change a URL. At this point, it's hard to imagine creating a RESTful service and not using named endpoints instead of hard-coded URLs.
  2. Using the Accept header to handle API versioning felt really nice.
    • We did what seems to be the "new hot thing" in the Rails community: creating a vendor-specific mime-type (a sketch follows this list). I believe we (the Ruby community) were a little late to the party on that one relative to other programming language communities, but better late than never, I guess...
  3. Thinking about API versioning up front is a good idea.
    • It paid off significantly before we even "released" our first version of the system. A couple of days before the release, we were able to start making changes to the structure of the API responses based on feedback from the guy implementing the "web client" component (@roykolak) without affecting the guy implementing the "almost completed" Matlab client (@paulsexton).
  4. Sending back JSON objects for entities in your response body provides a lot of flexibility to modify your API without breaking clients.
    • For example, our initial release of the API sent endpoints back as key -> value pairs of 'endpoint_name': '/endpoint_url_here'. Doing this seemed fine, but when we wanted to change the format to add links to the documentation for that endpoint, we had to bump the API version number and old clients were now "outdated." On the second release, we returned objects similar to the following: 'endpoint_name': {'url': 'endpoint_url_here', 'docs': 'http://.../docs_link_here'}. If we want to add another piece of metadata about links now, we can...without affecting "old" clients at all...they just won't leverage the new fields (the sketch after this list shows this object form). This is another example of a pattern which I suspect has been around "forever", but it was something which slapped us in the face early on in the project.
  5. Using Cucumber and RelishApp.com has real potential for web service API documentation.
    • We used the docs link in our endpoint link objects to provide a link to the Cucumber scenario describing how to use the associated action/endpoint. Because of the proximity of the team (all next to each other), we haven't ended up using it extensively to date, but we really liked the way it felt on several levels (principle, workflow, etc). We hope the investment will pay off down the road, as we'd like to open source the project if possible...
    • As with most things, there are varying opinions about the value of using Cucumber to do integration testing on an application's API. I can see both perspectives and think it depends on the situation. However, if you are writing a publicly available API and the Cucumber steps are written with the user in mind (Joe Developer using your API), I think this is potentially a great use case for Cucumber and RelishApp. This is a scenario where your stakeholders are very likely to thoroughly read and value the effort you put into making your steps match the language of your domain (which is often a point of contention for those opposed to Cucumber in general). Additionally, your actual documentation stays in sync with your code...seeing that become a potential reality is kind of neat (I've been skeptical about it in the past).
  6. Setting up fake data at unimplemented endpoints worked well.
    • Not a huge revelation here, and not something that can always be done, but I wanted to make a note of it so I didn't forget it...
    • When we were in the middle of rolling something out, we didn't want to wait to implement some things on the client, but the service wasn't ready for them. We added in the endpoint(s) and put in some fake data so the client applications could implement the desired functionality (at least for happy-path stuff); a sketch of such a stub follows this list. At some point, the actual implementation was deployed, and a few hours later the other devs noticed the data was live. Like magic. That was kind of neat.
    • Another nice thing about how we implemented this particular feature was that we drafted three Cucumber scenarios which included the required JSON request and response bodies and threw them around the horn of developers for approval. We generally would talk through features, but it was cool to just send a link to a Relish page (see above) with everything formatted nicely and get a quick check.
    • Like I said, not a huge revelation, but I thought I'd mention it.
  7. YAGNI reared its ugly head yet again.
    • I'm not convinced breaking the "uploader/downloader" component out in advance was necessary. This was a requirement which came out of a discussion with some of our stakeholders, but is something we are not leveraging in any way at this time. Keeping that responsibility in the main "server API" application's codebase would have been convenient for several reasons (simplified deployment, testing, etc). In the event that we do end up rolling out multiple uploaders, the implementation will be very simple, but...
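
To make learnings 1 and 4 concrete, here is a rough sketch of a root action that returns named endpoints as objects (a URL plus a docs link) rather than hard-coded paths clients have to know about. The endpoint names, URLs, and docs links are invented for illustration.

    # Sketch of a root action returning named endpoints as objects. Clients look
    # up URLs by name instead of hard-coding paths, and extra fields (like
    # "docs") can be added later without breaking old clients. All names, paths,
    # and documentation URLs below are illustrative.
    class RootController < ApplicationController
      def show
        render :json => {
          :endpoints => {
            :collections => {
              :url  => '/collections',
              :docs => 'https://www.relishapp.com/example/ram/docs/collections'
            },
            :search => {
              :url  => '/search',
              :docs => 'https://www.relishapp.com/example/ram/docs/search'
            }
          }
        }
      end
    end

A client only ever follows the "collections" or "search" links it is handed, so the URLs themselves are free to change between releases.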
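
For learning 2, here is a sketch of how Accept-header versioning can be wired up with a Rails routing constraint that matches a vendor-specific mime-type. The application/vnd.ram.v1+json type and the routes shown are assumptions for illustration, not the project's actual names.

    # Sketch of Accept-header API versioning via a Rails routing constraint.
    # The vendor mime-type and the resources routed here are illustrative.
    class ApiVersionConstraint
      def initialize(version)
        @mime_type = "application/vnd.ram.v#{version}+json"
      end

      # Rails calls matches? with the incoming request for each routed scope.
      def matches?(request)
        request.headers['Accept'].to_s.include?(@mime_type)
      end
    end

    Ram::Application.routes.draw do
      scope :constraints => ApiVersionConstraint.new(1) do
        resources :collections, :only => [:index, :show, :create]
      end
    end

A client then opts into a version by sending Accept: application/vnd.ram.v1+json, and a v2 scope can later be added alongside without touching v1 clients.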
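
And for learning 6, the "fake data at unimplemented endpoints" trick is about as simple as it sounds; a sketch of the idea is below. The controller name, path, and canned payload are made up.

    # Sketch of a temporary stub endpoint: serve canned JSON so client work can
    # proceed, then swap in the real implementation later without the clients
    # having to change anything. The payload here is made up.
    class UsageStatisticsController < ApplicationController
      def show
        # TODO: replace with real statistics once the backend work lands.
        render :json => {
          :collection   => { :name => 'example-collection' },
          :file_count   => 42,
          :bytes_stored => 1_073_741_824
        }
      end
    end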

External Links

Associated Articles I've written

  1. Named Endpoints in RESTful APIs 11 Jan 2012

    An overview of the implementation and impact of using named endpoints in a recent project which made heavy use of RESTful APIs.