Friday, September 14, 2018

On aggregates: values, references, and transactions.

Gods help me, I'm thinking about aggregates again.

I think aggregates, and the literature around aggregates, deserve a poor reputation.

Part of the problem is that the early descriptions of the concepts have implicit in them a lot of the assumptions of enterprise solutions written in Java circa 2003.  And these assumptions haven't aged very well - we are writing a lot of software today that has different design considerations.

A lot of the mechanics of what was going on were implicit; they worked because we were writing to monolithic databases using RDBMS solutions, and therefore a bunch of magic "just happened".  You didn't need to think about it, so you didn't need to talk about it.

For me, the questions that really shook me centered on creation patterns.  At first, the answers all seemed straightforward -- use a factory to create a new version of the aggregate, save it in the repository.  What's the problem?

But then I discovered distributed systems and unreliable networks, and a harder question: how do we ensure the correctness of the system when two concurrent processes are trying to create the same entity?

Add event sourcing into the mix, and it starts to get really tangled.  A new aggregate means a new stream; how does that happen?  How do subscribers find out about the new stream?  What happens if we need to move a stream, how do we find it again?

In short, I wasn't able to find a good story about indexing.

I have two protocols that I employ when I'm trying to work my way through a design snarl like this one.  The most common approach is to pull out the biggest hammer I can find, and start beating on things until fundamental principles start falling out.

That protocol hasn't been getting me anywhere for a long time.

The second protocol, which I hate falling back to, calls for letting the problem stew in the back of my mind, until I discover some obscure clue that I can use as a heuristic to study the problem, then pull out the hammer again, and beat on things until fundamental principles fall out.

For a moment, let's think about a set, as a data structure.  The semantics are easy: we can add values to it, and read them back out.  The plot twist is that enumerating the set never repeats a value -- adding a value to a set is an idempotent operation.

How do we implement that?  One answer is to enumerate all of the elements of the collection during an add.  If you find the element, return without doing any work, otherwise append the new element to the end of the collection.

Not a popular choice - the scaling properties are unfortunate.  More common is to break the problem down into something smaller; use the value to calculate a key, use the key to find a specific shard, then enumerate the subset of elements that appear in the shard.  Because the value uniquely specifies the key which uniquely specifies the shard, it must be present in this shard, or not at all.

A one bit key allows you to use two shards.  A two bit key supports four shards.  As you continue adding bits, you can continue to add shards, reducing the probability of "collision", and reducing the amount of time you spend enumerating the elements in a single shard.

A carefully chosen value to key function will balance the number of elements per shard, independent of the number of shards and the population of unique values.  An example of this would be a hash function.

In other words, hash(value) acts as a sort of signature for value.  If the hash function produces enough bytes, collision risks are negligible, and we can treat the signature as unique.
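A minimal sketch of the sharded set described above, assuming strings for values and SHA-256 as the value-to-key function (the class and method names here are mine, for illustration):

```python
import hashlib

class ShardedSet:
    def __init__(self, bits=2):
        self.bits = bits
        self.shards = [[] for _ in range(1 << bits)]  # 2^bits shards

    def _shard(self, value):
        # hash(value) acts as a signature; the low bits select a shard
        digest = hashlib.sha256(value.encode()).digest()
        return self.shards[digest[0] & ((1 << self.bits) - 1)]

    def add(self, value):
        shard = self._shard(value)
        if value not in shard:  # idempotent: re-adding does no work
            shard.append(value)

    def __contains__(self, value):
        # the value uniquely determines the shard, so we only scan one
        return value in self._shard(value)

s = ShardedSet()
s.add("alpha"); s.add("alpha"); s.add("beta")
assert "alpha" in s and "beta" in s
assert sum(len(sh) for sh in s.shards) == 2  # the duplicate add was a no-op
```

Adding bits to the key grows the shard count, exactly as described: the per-shard scan shrinks while membership checks still touch only a single shard.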

In a sense, the set uses the signature to look up the value.

That in turn implies that the signature can be used as a proxy for the value.

In our services, we typically have some persisted record of information, and that record of information changes over time.  So we have a current value, a successor value, and a bunch of constraints from the business that determine whether or not a transition to this specific successor is allowed.

This spelling is deliberately more general than the one I usually see, where the successor value is calculated by inputting the current value to a system of functions that satisfy the business constraints.  Think PUT - here is a representation of a new value; is transitioning to this new value compliant with the business constraints, or no?  Such a check might be done in two parts -- first, to ensure that the new representation describes an acceptable state, and then to determine whether the transition between the values is allowed.

Given two values, where the successor satisfies the business constraints relative to the current, we then need to change things so that the verified successor becomes the new current.  This is typically done by a reference - we "set" the reference to use the new value.

"set", however, has last writer wins semantics.  It is vulnerable to the lost update problem.  The semantics that we want are first writer wins -- normally provided by "compare and swap".

As discussed above, because signatures uniquely identify values, we can achieve the same results by performing a compare-and-swap of signatures, and then using the signatures to retrieve the values.
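The idea can be sketched as follows -- a content-addressed store of immutable values, plus a mutable reference holding the signature of the current value, moved only via compare-and-swap.  Names here are illustrative, not from any particular library:

```python
import hashlib
import json

# content-addressed value store: signature -> value
values = {}

def put_value(value):
    data = json.dumps(value, sort_keys=True).encode()
    sig = hashlib.sha256(data).hexdigest()
    values[sig] = value  # idempotent: same value, same signature
    return sig

class Reference:
    def __init__(self, sig):
        self.sig = sig

    def compare_and_swap(self, expected_sig, new_sig):
        # first writer wins: the swap succeeds only if no one has
        # moved the reference since we read expected_sig
        if self.sig != expected_sig:
            return False
        self.sig = new_sig
        return True

current = put_value({"balance": 100})
ref = Reference(current)
successor = put_value({"balance": 90})
assert ref.compare_and_swap(current, successor)          # first writer wins
assert not ref.compare_and_swap(current, put_value({}))  # lost update prevented
assert values[ref.sig] == {"balance": 90}
```

Note that the CAS only ever compares small signatures; the values themselves can be arbitrarily large and live elsewhere.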

OK, now let's introduce aggregates into the mix.  Aggregates optimize the problem of checking that a successor value is valid by partitioning that value into isolated elements.  In other words, to determine whether a change satisfies the business constraints, you only need to consider a neighborhood of elements local to each change.

Aggregates are a partitioning of the information in the data model.

What that means is that we don't need to lock the entire data model to safely make changes - we only need to acquire coarse grained locks over the neighborhoods in question.

Of course, coarse grained locks, compare-and-swap; what we are describing here is a need for transaction semantics.  We use a database to coordinate the locking.  The locks themselves are mutable references to signatures, which can be used to lookup values.

Now, let's bring it together -- what's the story for an index?  How do we store a new aggregate in a way that ensures the data model remains consistent with the business constraints?

Well, we have a value that lists the mapping of known keys to aggregate references.  That value itself has a signature.  The index itself has a reference to that signature.  When we save the new aggregate:
  • we write the value of the new aggregate
  • we compute the reference of the new aggregate
  • we write the value of the index, mapping the business key to the computed reference
  • we open a transaction:
    • add the reference for the new aggregate
    • update the reference for the index
    • commit
Easy peasy.  So long as the reference to the index is stored with the reference to the aggregate, we're good to go.
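The steps above can be sketched in code.  Writes to the object store are unconditional (values are immutable), so only the reference store needs transaction semantics -- here approximated by a single compare before the two reference writes.  All names are illustrative:

```python
import hashlib
import json

object_store = {}     # signature -> value (immutable, append-only)
reference_store = {}  # reference name -> signature (transactional)

def write_value(value):
    data = json.dumps(value, sort_keys=True).encode()
    sig = hashlib.sha256(data).hexdigest()
    object_store[sig] = value
    return sig

def save_new_aggregate(index_ref, business_key, aggregate):
    # 1. write the value of the new aggregate; 2. compute its reference
    agg_sig = write_value(aggregate)
    # 3. write the value of the index, mapping the key to that reference
    old_index_sig = reference_store[index_ref]
    index = dict(object_store[old_index_sig])
    index[business_key] = agg_sig
    new_index_sig = write_value(index)
    # 4. in one transaction: add the aggregate reference and
    #    compare-and-swap the index reference (first writer wins)
    if reference_store[index_ref] != old_index_sig:
        return False  # a concurrent creation won the race; retry
    reference_store["aggregate/" + business_key] = agg_sig
    reference_store[index_ref] = new_index_sig
    return True

reference_store["index/orders"] = write_value({})
assert save_new_aggregate("index/orders", "order-42", {"status": "new"})
```

If two processes race to create the same entity, exactly one CAS on the index reference succeeds, which is precisely the correctness property we were missing.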

Note the subdivision: our object store scales with the size of the values we are preserving, the amount of persistent history we collect, and perhaps garbage to be collected.  The reference store (with the transaction semantics), scales with the number of references.

Also, there's an interesting implication - if we can update multiple references in this use case, then of course we can update multiple references in other use cases as well.  There is nothing fundamentally wrong with updating the references of multiple aggregates in the same transaction -- so long as you know that the references are in the same store.  You don't actually need the values in the same store - the reference can include an identifier for a store and an identifier for the key.

I suspect that GDPR can be managed this way as well -- personal information goes into an access restricted object store, reachable via some opaque signature.  When you need to forget, that value gets evicted, or replaced with a tombstone.  Consumers that don't have access to the PII can still forward the signature to consumers that are authorized to access the data. 
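A sketch of "forgetting" under this scheme (names are mine): the PII lives in an access-restricted store keyed by signature, unauthorized consumers only ever handle the opaque signature, and forgetting replaces the value with a tombstone while every signature in circulation remains resolvable.

```python
import hashlib
import json

pii_store = {}  # access-restricted: signature -> personal information
TOMBSTONE = {"forgotten": True}

def remember(pii):
    data = json.dumps(pii, sort_keys=True).encode()
    sig = hashlib.sha256(data).hexdigest()
    pii_store[sig] = pii
    return sig  # the opaque signature is safe to pass around

def forget(sig):
    # the signature still resolves, but the data is gone
    pii_store[sig] = TOMBSTONE

sig = remember({"name": "Alice"})
forget(sig)
assert pii_store[sig] == TOMBSTONE
```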

Indexing in event sourced systems?  We're good here too -- here's a value for the index, here's a value for the stream; update both references, commit.  Which also implies that updating multiple streams in a single transaction is supportable as well.

In various scenarios, the value graphs can get arbitrarily complicated.  But they are just values - you can cache them freely because they aren't going to "change" (although as noted in the case of GDPR they might go away).

The above discussion owes a lot to Rich Hickey's The Language of the System, and the insights he expresses there.
