The Art in the Architecture – vSAN & Shared Nothing

April 15, 2018 in vSAN clock 11 minutes music Grizzly Bear – Knife

As I mentioned in the first article in this series, vSAN is the world’s leading hyper-converged solution,1 and with good reason. Previously, I covered the vSAN object store and its marvelous ability to foster an agile, robust datastore. I’d like to now examine the relationship between vSAN nodes, specifically the “shared nothing” concept. In general, “shared nothing” means multiple parts do not require a common component to operate within a system. It’s a frequently used and well-trusted way to ensure resiliency and agility. Way back in 1986, a brilliant article out of Berkley,2 The Case for Shared Nothing, made a strong case for this as an operational model for distributed work.

Anytime a system has parts which share state dependence to operate, the risk is a single failure causing total loss of operation. An example of a “shared state dependence” would be the chassis of a traditional storage array: if that single chassis fails, the internal redundant disks, controllers, network interfaces, and such are for naught. Any LUNs or volumes this array was presenting are no longer available. The working assumption is that this sort of thing doesn’t happen: a chassis is simply metal with parts bolted inside. However, entropy is perfect, and failures do indeed happen, especially because there’s a bit more to that enclosure than “simply metal”. I won’t belabor this point with anecdotes,3 but rather focus on the obvious: any part can fail, so it stands to reason elimination of single points of failure between parts of a system is a good thing.

3-2-1 Cluster Demonstrating an SPOF
A typical “3-2-1” configuration with traditional storage, as shown here, exemplifies how the storage system becomes a single point of failure.

vSAN has an enduring challenge: it must express resiliency and integrity using commodity hardware. A key to how this is accomplished is through a shared nothing model: each vSAN node is a world unto itself. It shares an agreement of state (e.g., what the datastore is, and what’s in it) with other nodes, but does not rely upon another node to express its part in the solution: it’s either individually available and participating in the cluster, or not at all. This isn’t to say each node has the same data (that’d be a massive waste, and ruin the fun of being a distributed system anyway) or that a certain quantity of nodes aren’t needed to express data (that’s the side effect of avoiding the aforementioned waste). Rather, this means that an individual node operates independently, representing whatever data it ought to without reliance upon anything else. Whether that node’s representation of data is enough to get whatever data I’m after is another matter.

Understanding vSAN’s Shared Nothing Implementation

To put this to practical terms, let’s examine the I/O flow in vSAN and how each node thinks of objects. vSAN has multiple software layers which get the various parts of work done. A good summary of them is found on page 51-52 of the vSAN Troubleshooting Guide, as well as the fantastic book Essential vSAN. These layers all exist within the codebase of every ESXi instance (even if they’re not a member of a cluster). So we can see right away that every node must have some knowledge of the cluster, the objects in the object store, the construction of storage media, and how to move data around all of these. Now, let’s examine what each node specifically knows as it pertains to what you really care about: your VMs in the datastore.

At the higher levels of vSAN, information is understood in the object store as objects, and its also understood these objects are constructed from components that reside locally on hosts in the cluster. This understanding is primarily what the DOM is about (though other parts of vSAN are involved in this awareness as well). For each object, there is a “DOM Owner” which is responsible for the state and path to the object. This DOM Owner will be some node in the cluster, which means each node ends up being the DOM owner for many objects over time. Other things are happening in the DOM as well (see the aforementioned resources), but let’s stick with this “Owner” piece to understand the shared nothing principles at work.

Each object only has one owner. If a node is removed from the cluster or fails, an election is held (very quickly), and a new owner is chosen for all the objects that node was an owner for. But all of this assignment, shifting, and state of ownership is not centralized. Instead, it’s distributed across all nodes, with each focused only on the changes that affect objects that node already has or newly receives. Several mechanisms exist to ensure ownership is exclusive, and re-assignment can quickly happen. But it does not require a central coordinator. If it did, the failure of that central coordinator would fail the system.

For a moment, let’s entertain the thought of having some central coordinator for this process, and let’s attempt to make it as resilient as we can. First, let’s add a backup coordinator. Great, now we’ve more or less built an array: two controller elements4 with some kind of failure association. But where shall these coordinators be placed? Seems obvious enough: let’s put them on separate nodes in vSAN. But what if those two nodes fail? What if those two nodes can’t see each other, but everything else works fine? Fine, let’s instead think about an owner, and a bunch of “potential owners”. And hey, while we’re at it, let’s make every participating node a potential owner. The more resiliency we can afford, the better, right?

Now, we just need some software to help potential owners know when to step up. Of course, the easy way to do this would be some central “nominator” that helps with this, but we just spent all this time getting rid of a central owner. We don’t want the same problem again! So instead, let’s have each node understand the state of other nodes from a membership perspective: is it a member still? Is it alive? Are there new members we should be aware of too? Great, we just talked ourselves into something that’d look a lot like vSAN’s CMMDS. We can play this game some more to understand interpretation of storage policies (CLOM), and how we’ll get access to an object (DOM Client), and more.

Let’s talk now about the node-exclusive stuff: what don’t we need to share? One really obvious thing here is the access to and movement of data within the physical storage media: what good am I as software on one totally separate piece of hardware to a disk on another piece of hardware? We can think of the DOM Component Managers and LSOM pieces as being these blank outlines that exist on each node, and are only known and filled in by stuff going on inside the node. A great benefit of shared nothing systems is you can have parts of your system focus on the only things they should care about. If I’m node #1 in a vSAN cluster, I don’t care about node #2’s disk activities, because there’s nothing useful I can do to contribute to them. Mr. Stonebraker makes a note in section 3 of his paper that for data operations, if we can restrict operations to “single site” (in vSAN, single node), then less work has to be done and considerable benefit is derived. The challenge is ensuring all the coordination happens (as I’ve belabored, there is no central coordinator). That’s one of the many extraordinary challenges the vSAN engineering teams solved.

Some parts within vSAN coordinate over the vSAN network, and some deal with node-local operations and don’t need to coordinate externally.

Some storage systems have built a connection point here and made it possible (to varying extents), that software on one part can control hardware on another. This sounds good until we realize we’ve brought a new layer of complexity, abstraction, and quite a lot of latency into the system. We’ve also muddied the waters as to what failure means: what are my fault domains? Do I have multiple? Do I have fault domains within fault domains that I have to think about? It gets to be headache-inducing. As the aforementioned paper from Mr. Stonebraker calls out, part of the elegance of shared nothing is simplicity. We know where our boundaries are for operation.

In a vSAN cluster, if the vSAN software local to the node is not operating, then I can’t do anything with the storage this node has. Given this sensitivity, I had better make sure there’s a good set of protections for the uptime of this software. I also need to make sure other nodes can take over in case I can’t keep the software up. I’ve provided a snapshot of how some of this is done here and in the previous article in this series. Regarding the uptime of the software, VMware protects this with the same mindset the VM uptime is protected: put it at tier 0 within the system. What I mean is, this problem was already solved in thinking about keeping a VMs CPU and RAM going: make sure the hypervisor, the software that provides these things, is at the lowest level possible to minimize risk. With vSAN, VMware has chosen to follow suit by placing it at a very low level: right inside the core of ESXi.5

Now we have a more complete understanding of this shared nothing architecture: data that gets distributed around the cluster is individually understood to the extent necessary, but has an ownership mechanism so that each node can do its own work without having to repeat the labor of others. This shared knowledge can transfer ownership around as needed. There’s local stuff that we don’t have to care about outside the individual nodes, like talking to the hardware on that node. Rather than criss-cross work at this point and introduce complexity and latency, let’s just make sure parts are resilient.

Improve Write Integrity

vSAN has another fantastic benefit owing to the shared nothing model it employs: writes are necessarily distributed at time of commit. When a write happens to an object (like a VMDK), that write is sent to any components which must store that data. For a VM using vSAN’s “out of the box” policy, that would involve two nodes, which each house a copy of the data. How many nodes and what kind of write operation happens varies with the policy set upon an object.6 But for any component I’ve configured via policy to have resiliency (i.e., an FTT > 0), this distributed write action occurs. What’s more, it’s simultaneous and traversing to different nodes. As discussed above, each node is able to achieve this quickly, because it’s only involved in transactions it has to care about. vSAN also has the ability to understand failures ahead of time, or time out and accept the acknowledged writes it was able to get from nodes should a failure suddenly occur.

All this considered, this is a far superior solution than the usual “accept write locally, then replicate” strategy employed when multiple discrete systems are involved in a storage solution. By having separate, shared nothing nodes acknowledge individual writes, we can be more sure data is secure when the write is acknowledged back to the VM. The VM doesn’t know or care about anything after this acknowledgement, so the less propagation or making up for lack of parity of data we must do after VM write acknowledgement, the safer we’ll be.

Is Shared Nothing the Perfect Solution?

There is an old joke that if a newspaper titles an article with a question, the answer is “no”. Consider that true here. The very heart of computer systems engineering is that there isn’t a perfect solution. Shared nothing introduces some challenges of its own, a few of which I’ve called out above. It does however lend to a marvelous way of focusing work appropriately. By employing this shared nothing approach, vSAN has been able to generate incredible performance and agility, while uniquely increasing resiliency. Next time, I’ll explore some deeper dive thoughts around how vSAN addresses challenges often found in shared nothing architectures as part of an examination on vSAN’s amazing performance design. Thanks for reading!

  1. In the short time between this article and the last, vSAN surpassed 10,000 customers. Rad. 

  2. Brilliant out of Berkley” might be the tech generation’s “Fresh outta Compton”… 

  3. My colleague, [John Nicholson](], is fond of saying, “the plural of anecdote is not data.” I agree. 

  4. Of course, controllers in an array are doing more than owning parts of data (and usually, arrays aren’t contributing to object storage decisions like this). But hopefully the metaphor is clear enough. 

  5. I’ll repeat my threat from last time that I won’t avoid this in-kernel vs. out of kernel argument for much longer. Stay tuned, friends. 

  6. I’ll dive into this sometime, but a beautiful job on covering this is done every year at VMworld in the “A Day in the Life of a vSAN I/O” session. Here’s the 2017 recording and a PDF of the slides