This week, I decided to tackle a particularly nasty piece of code in our code base: a repository class built on top of DynamoDB that was written before we had much understanding of how the DynamoDB library for C# worked (and before I understood how DynamoDB itself worked, truth be told).

The repository class supports a commonly accessed interface component, so recent data needs to be cached in our MongoDB instance, while all historical data is stored in DynamoDB. We never used the .NET document mapper for DynamoDB; instead, I wrote some logic to convert records to a local read model. Over time, this class has accumulated debris and grown huge, thanks to all the hand-written queries needed to account for the different widgets in our data visualisation dashboard.

I decided to take the plunge and write a new repository on top of LINQ2DynamoDB, a native C# DynamoDB implementation that translates LINQ statements into DynamoDB queries (similarly to LINQ2SQL). For those unfamiliar with LINQ, it provides a standardised language-level (C#) interface for talking to data stores and databases (basically anything that can be represented as an iterable list of objects). You write a LINQ query, for example:

Db.Where(x => x.Id == myId)

And it will be translated to the equivalent SQL statement.

select * from x where x.Id = @myId

This is a fairly simplistic example, but you can easily apply a bunch of functional programming-ish transformations to data using LINQ without touching a line of SQL.
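To give a flavour of those transformations, here's a sketch of a filter-group-project pipeline in LINQ. The Order entity and its properties are hypothetical, purely for illustration:

```csharp
using System.Linq;

// Hypothetical Order records; names are illustrative, not from our code base.
var totalsByCustomer = orders
    .Where(o => o.Status == OrderStatus.Completed)     // filter
    .GroupBy(o => o.CustomerId)                        // group
    .Select(g => new { CustomerId = g.Key,             // project
                       Total = g.Sum(o => o.Amount) })
    .OrderByDescending(x => x.Total)
    .ToList();
```

Against a LINQ-to-SQL-style provider, the whole chain would be translated into a single SQL statement; against an in-memory collection, the same code just runs as ordinary method calls.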

It also quickly becomes powerful when unit testing and refactoring business logic, since mocking an IEnumerable/IQueryable is far easier than mocking an entire database instance.
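As a minimal sketch of that testing approach (the Order type is again hypothetical): if the code under test depends on IQueryable rather than a concrete database, a unit test can hand it a plain in-memory list.

```csharp
using System.Collections.Generic;
using System.Linq;

// Substitute a plain list for the database; AsQueryable lets it satisfy
// an IQueryable<Order> dependency without any database instance at all.
var fakeDb = new List<Order>
{
    new Order { Id = 1, Amount = 10m },
    new Order { Id = 2, Amount = 99m },
}.AsQueryable();

// The same LINQ code the repository would run works unchanged here.
var result = fakeDb.Where(o => o.Amount > 50m).ToList();
```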

The Limit of LINQ

There are numerous issues with the leaky abstraction provided by anything that ultimately translates LINQ into another query language; they've been discussed before, so I won't rehash them here.

This week, I noticed one key deficiency of applying the LINQ database mental model to accessing DynamoDB: it largely erases the benefits of using DynamoDB in the first place. Dynamo places many restrictions on queries to keep them affordable, scalable, available and consistently fast, and the escape hatches the database engine provides to make sure you can actually get at your data are not available in LINQ2DynamoDB.

For example, you cannot apply KeyConditions or FilterExpressions to a DynamoDB Query with LINQ2DynamoDB. As soon as you start specifying anything other than the HashKey (and RangeKey), LINQ2DynamoDB will kick the engine into Scan mode, which is an incredibly expensive operation on a large dataset.
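To make the distinction concrete, here's a sketch assuming LINQ2DynamoDB's DataContext/GetTable pattern; the Reading entity and its key layout are hypothetical:

```csharp
// Assuming a table whose HashKey is UserId (entity and property
// names are hypothetical, for illustration only).
var table = dataContext.GetTable<Reading>();

// Keyed on the HashKey: this can translate to an efficient DynamoDB Query.
var keyed = table.Where(r => r.UserId == userId);

// Filtering on a non-key attribute: the provider falls back to a Scan,
// which reads through the entire table.
var scanned = table.Where(r => r.WidgetType == "gauge");
```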

So you have to be super careful and force the Query to run (using .ToList()) before filtering on anything other than the HashKey. Of course, you then take on the cost of an in-memory filter, which is again expensive and really not recommended for large datasets (we're pulling roughly 10,000 objects at a time from a query against over two million records and filtering them in memory, for example).
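The workaround looks something like this (same hypothetical Reading entity as before):

```csharp
// Run the key-only Query against DynamoDB first...
var recent = table
    .Where(r => r.UserId == userId)   // HashKey only: stays a Query
    .ToList();                        // materialise the results in memory

// ...then apply the non-key filter locally, paying the in-memory cost
// described above.
var gauges = recent.Where(r => r.WidgetType == "gauge").ToList();
```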

I'd advise you to make use of DataContext's OnLog when developing and testing, so you can check the sorts of queries that are being generated.
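Hooking it up is a one-liner; this is a sketch assuming OnLog is subscribed with a string handler (the exact member shape may vary between versions of the library):

```csharp
// Log every message from LINQ2DynamoDB so you can see whether an
// operation became a Query or a Scan.
dataContext.OnLog += message => Console.WriteLine("[LINQ2DynamoDB] " + message);
```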

What's worse is that even if I wanted to extend LINQ2DynamoDB so that you could do these things, it'd no longer be a LINQ implementation - they fall outside the scope of LINQ's theoretical model of databases. So it's not so much a fault of the developers as an impossibility, short of a rethink of LINQ's understanding of datasets.

This limit is something that really shouldn't be ignored, especially by novice developers still new to the C# scene. For newbies like myself, LINQ's magical data transforms are mostly all we've known. And that's before considering the many exotic database types in vogue at the moment, which will continue to proliferate through the developer community as more use cases are discovered for them.

As it stands, LINQ2DynamoDB is an excellent tool to get started with DynamoDB, especially if you don't have complex or large dataset needs.