The Art in the Architecture – The vSAN Object Store

July 30, 2017 · 8 minutes · Music: Pelican – The Cliff


vSAN is the world’s leading hyper-converged storage solution, with over 8,000 customers at the time of this writing. There are many reasons it has been so successful, most of which are discussed and debated frequently throughout the Internet. Below the parade of higher-order, very visible reasons for success, there are multiple technical facets that make vSAN unique and remarkably powerful. In this series, The Art in the Architecture, I’ll explore aspects of vSAN’s fundamental architecture that enable the features, performance, and integrity that have led to its record-shattering success and set it apart from other storage solutions.

A Hypervisor for Hyper-Convergence

Let’s begin this exploration with the fundamental way vSAN handles data: the distributed object store. Generally, software-defined storage for hypervisors is a “bolt-on” solution meant to avoid or lessen the costs of shared storage. It offers some kind of filesystem (usually NFS) distributed across virtual hosts, so that media can be attached closely to the host instead of relying upon a discrete, purpose-built hardware array. This generally reduces hardware cost, but it invariably comes with compromises: the loss of key features found in modern hypervisors, such as a fluid and native method for balancing compute load, or the introduction of a “sideways tier” in which VMs must provide basic infrastructural support to other VMs, instead of all VMs sharing the foundation of the hypervisor.

Some of these considerations were diminished as purpose-built appliances came forth, since an engineered hardware stack could overcome the complexities and shortcomings of the workarounds being done in software.[1] However, even with these appliances, the fundamental structure providing a given volume to virtualization hosts remains the same: distribute a single filesystem across multiple separate hosts. This is a necessity, as hypervisors only support a given range of storage protocols (usually at least Fibre Channel, iSCSI, and NFS). Of these, only NFS is really an ideal choice (the “why” to this is probably worth an article by itself, but others have covered this extensively already). Thus, a hyper-converged storage solution is bound to NFS, it would seem. Unless, of course, the hypervisor could be modified to support something new.

This is precisely the case with vSAN: the vSphere hypervisor has been fundamentally modified, down to the kernel level, to take on a new type of storage system – one purpose-built for storage in the modern virtualized datacenter. For now (but not long), I’ll politely sidestep the long-tail discussion on the merits of in-kernel vs. not, and instead focus on another implication of this: there’s no requirement to obey the normal rules of vSphere datastores. So long as it provides unwavering integrity and stellar performance while not critically interfering with the features and operations that should be expected from a vSphere host, vSAN can go about this however works best. Doing things differently was built into vSAN’s fundamental design.

vSAN Cluster Diagram
vSAN allows vSphere to pool directly-attached storage into a shared object store. (Image credit: VMware)

The Storage Primitive

From the perspective of how data is managed across hosts, instead of the usual distributed filesystem made necessary when one has to retrofit a solution, vSAN provides a distributed object store that is natively and collectively understood by all vSphere hosts. Object storage differs from filesystems in several ways, but one of the most immediate advantages presented is the opportunity to define the storage primitive: what is the lowest-level thing the system understands?

In filesystems, the storage primitive is always a file, which has a static implementation and little variability. This is even more pronounced when only NFS is considered, as the rather lengthy NFS v3 spec only allows so much difference between implementations. (Block size is the usual variant, but there are certainly others.) For brevity’s sake, I’ll spare the gory details as to how one can create one kind of underlying filesystem upon one system and use NFS to “bridge” it to others. It’s simple enough to say that the characteristics of the logical filesystem upon a single system are necessarily limited by the bounds of NFS itself once it’s presented to vSphere. There’s a rational limit to how much unique design can be accomplished this way.

An object store natively understood by the hypervisor is another matter entirely. Instead of files that are in turn composed of blocks, an object store is made of, well, objects. These objects are extremely versatile and malleable, as archetypes can be defined simply and are represented within the object itself. Each object usually contains a unique identifier, some metadata describing what the object is and some of its key characteristics, and of course the data itself. vSAN implements this by having types of data natively understood by the hypervisor, with the type of object and certain other aspects identified within the object’s metadata. Types of objects (“classes” if you’d like extra points for technical correctness)[2] are very different from each other and uniquely created to be maximally efficient for whatever part of the datastore’s data they represent.
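To make the shape of an object a little more concrete, here is a minimal sketch in Python of the three parts described above: an identifier, metadata that records the class, and the data itself. The class names and fields are my own illustrative assumptions, not vSAN’s actual on-disk format or API.

```python
from dataclasses import dataclass, field
from enum import Enum
from uuid import UUID, uuid4


class ObjectClass(Enum):
    # Hypothetical class names, loosely following the VM parts named in this article.
    VIRTUAL_DISK = "vdisk"
    SNAPSHOT = "snapshot"
    VM_HOME = "vmhome"
    SWAP = "swap"


@dataclass
class StoredObject:
    object_class: ObjectClass                        # what kind of thing this is
    data: bytes = b""                                # the payload itself
    metadata: dict = field(default_factory=dict)     # descriptive key/value pairs
    uuid: UUID = field(default_factory=uuid4)        # unique identifier

    def describe(self) -> str:
        return f"{self.object_class.value} object {self.uuid} ({len(self.data)} bytes)"


disk = StoredObject(ObjectClass.VIRTUAL_DISK, data=b"\x00" * 4096,
                    metadata={"owner_vm": "web01"})
swap = StoredObject(ObjectClass.SWAP)
print(disk.describe())
print(swap.describe())
```

The point of the sketch is only that the class travels with the object, so the store can treat a swap object differently from a virtual disk without guessing from a filename.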

Most of the vSAN object classes are devoted to the parts that make up a VM: virtual disks, snapshots, VM descriptors, swap, and the like. These objects are divisible, meaning they can be broken up into smaller parts for various reasons (the most obvious of which being size; a 62 TB VMDK needs to span several capacity devices to be stored). I find it easiest to think of objects like atoms. There are sub-atomic particles like protons and neutrons, which in turn are made up of elementary particles like quarks (electrons, as leptons, are elementary themselves).[3] However, when we think about a given physical object, the building block for it that we usually point to is the atom, because this is the lowest level at which there’s a specific enough description of the matter. Likewise, while a given virtual disk has components and these in turn have sub-components, in vSAN software and operations I’m usually thinking of objects at a higher level than this.

To further the atomic analogy, usually the molecule is the real reference point for material, and the whole VM is usually the real reference point for what’s stored on vSAN. I can observe and even work with the “atoms” (VMDK, snapshots, swap object, etc.), and I can even observe and affect the “sub-atomic particles” (replicas and stripes of VMDKs, for example). Through working with the “atoms” and “sub-atomic particles”, I can cause changes to the “elementary particles” (like those pieces of VMDKs separated out to avoid size constraints) and of course observe them as well. However, my normal operation and monitoring centers around the “molecule” (VM). Okay, that’s enough physics analogies for now.[4]
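The layering in this analogy can be sketched with a bit of arithmetic. The Python below assumes a per-piece size limit of 255 GB and two replicas; both numbers are assumptions chosen for illustration rather than figures from this article, and the `lay_out` helper is hypothetical.

```python
import math
from dataclasses import dataclass

# Assumed per-piece size limit, for illustration only.
MAX_PIECE_GB = 255


@dataclass
class Component:
    replica: int   # which copy of the data this piece belongs to
    index: int     # position of the piece within that copy
    size_gb: int


def lay_out(object_size_gb: int, replicas: int = 2,
            max_piece_gb: int = MAX_PIECE_GB) -> list[Component]:
    """Split each replica of an object into pieces no larger than max_piece_gb."""
    pieces_per_replica = math.ceil(object_size_gb / max_piece_gb)
    components = []
    for r in range(replicas):
        remaining = object_size_gb
        for i in range(pieces_per_replica):
            size = min(max_piece_gb, remaining)
            components.append(Component(r, i, size))
            remaining -= size
    return components


vmdk_gb = 62 * 1024  # the 62 TB VMDK mentioned above
layout = lay_out(vmdk_gb)
print(len(layout), "pieces in total across 2 replicas")
```

In the analogy’s terms, the VMDK is the “atom”, each replica is a “sub-atomic particle”, and the size-limited pieces are the “elementary particles”. I still operate on the VM and let the store work out the pieces.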

Because each part of a VM has its own kind of objects within vSAN, I have unparalleled flexibility in how I can natively optimize for their nature within the storage system. I can do so in a way I simply cannot in a distributed filesystem. To a filesystem, everything is a file. To an object store, everything is an object, and there can be very great differences between objects. This is wonderful, because a virtual disk, a snapshot, and a swap object are not the same thing in software at the hypervisor level. Filesystems force us to cast a universal type upon them in storage, but there’s no gain in doing so except that it’s convenient for the filesystem. In the object store, I can allow unique expressions of each of these parts.

Storage Policy-Based Management

One immediate way this is realized is through the flexibility of Storage Policy-Based Management (SPBM).[5] Typically, characteristics such as how data should be stored (mirroring vs. erasure coding, how much redundancy, etc.) are determined at the volume or LUN level. Specifically, they’re determined within the storage system, whether that system is hardware or software-defined. They aren’t exposed to or controllable by the hypervisor, which is unfortunate, because this guarantees an immediate fragmentation in how administrators can think of the state of applications.

An example of the vSAN object store in action: a VMDK is stored as objects based on a policy set specifically against it.

To see and change the application’s compute state, I work via the hypervisor. To do the same for storage, I must use a different control context. This context might even be linked to or even embedded in the hypervisor’s console (as many hyper-converged products tend to offer in one way or another). However, it is still a fundamentally different context: there’s no unified story to tell around the state of configuration. VMs are one kind of container, LUNs or volumes are another. So if I want to make a change to how a VM is stored, I have to think in this context that isn’t about VMs.

To consider this another way, when I want to modify the vRAM for VMs, I don’t put them in a “RAM LUN” and hope my “RAM LUN” is big enough and set up right so I get the RAM I want. And if I want to affect the relationship of vRAM among VMs, I don’t configure a relationship or tiering mechanism between “RAM LUNs”; I just put the VMs directly into Resource Pools and set the pool to affect the VMs: there’s no separate container to think about. Storage for hypervisors ought to be as straightforward as compute resources have been for years. The vSAN object store makes this possible through providing a very direct mechanism to affect the state of specific objects, using SPBM. I’ll be sure to write sometime on how this is managed and implemented, but for now the official documentation on vSAN policies is quite clear on the matter.
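As a rough illustration of what “policy set against the object” means, here is a small Python sketch. The policy names, fields, and the `mirror_copies` helper are hypothetical and do not reflect the real SPBM API; the only rule encoded is the general one that surviving n failures with mirroring takes n + 1 full copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePolicy:
    name: str
    failures_to_tolerate: int
    layout: str  # "mirror" or "erasure-coding"


@dataclass
class VirtualDisk:
    name: str
    policy: StoragePolicy


def mirror_copies(policy: StoragePolicy) -> int:
    """Full copies needed to survive the tolerated failures (mirroring only)."""
    if policy.layout != "mirror":
        raise ValueError("copy count only applies to mirrored layouts")
    return policy.failures_to_tolerate + 1


GOLD = StoragePolicy("gold", failures_to_tolerate=2, layout="mirror")
BRONZE = StoragePolicy("bronze", failures_to_tolerate=1, layout="mirror")

# Two disks of the same VM can carry different policies.
data_disk = VirtualDisk("db01-data.vmdk", GOLD)
scratch_disk = VirtualDisk("db01-scratch.vmdk", BRONZE)
print(scratch_disk.name, "copies:", mirror_copies(scratch_disk.policy))

# Changing one disk's policy touches that object only; no LUN or volume is involved.
scratch_disk.policy = GOLD
print(scratch_disk.name, "copies:", mirror_copies(scratch_disk.policy))
```

Notice there is no container in the sketch standing between the policy and the disk, which is the whole idea: the thing I configure is the thing I care about.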

Pictured: a RAM LUN, one of the lesser known circles of Hell.

LUN Locking

Lastly, another significant benefit the object store confers is the lack of “LUN locking” for VMDKs.[6] Put simply, locking is the idea that a hold on changes to data must occur locally when a distributed filesystem propagates changes across its domain. The results and behaviors have generally been ironed out over the years, but all the same it represents a compromise that introduces complexity. There are many manifestations and implementations of locking, but let’s focus on a particularly infamous one: VMFS locking upon LUNs when VMDKs change. A great brisk read on this is Cormac Hogan’s article VMFS Locking Uncovered, from back in 2012. A pithy quote from it:

All distributed file systems need to synchronize operations between multiple hosts as well as indicate the liveliness of a host.

Now, immense effort has gone into VMFS locking (and indeed, most locking mechanisms) to minimize pain, but there’s time and resources involved in cycling and managing these locks, even if nothing goes wrong. This becomes more evident as more stuff shares the same contention domain: a LUN with a single VMDK has much less to think about than a LUN with 100 VMDKs. Additionally, it increases the number of steps needed for a write cycle. Finally, it complicates planning and design, as exemplified in this nice little write-up I happened across, VMware Datastore Sizing and Locking.

So, to avoid write locking you can try to keep all your servers on one datastore. But, that’s not really practical long-term as VMs get migrated between hosts. Or, you can minimize the number of VMs that are using each datastore. In addition to keeping the number of VMs/datastore low, a strategy to consider is to mix heavy I/O VMs with VMs that have low I/O requirements; which will help manage the queue depth for each LUN.

All true, but my eyes are watering just a bit…

vSAN’s object storage system avoids this muck by having a different relationship between hosts than hosts sharing a LUN would have: it’s shared nothing. This means locking isn’t necessary for objects, because each host handles the I/O individually and uniquely, and agreement is maintained at a higher order, at the object storage level, rather than through distributed locking at the filesystem level.[7] A closer look at this is offered by understanding the lifecycle of an I/O in vSAN, which you can read about in the vSAN Caching Algorithms whitepaper, or see in a recorded VMworld 2016 session, “A Day in the Life of a vSAN I/O”. Better yet, VMworld 2017 is just around the corner; you ought to come and hear this year’s updated vSAN I/O and vSAN technical deep dive sessions![8]
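One way to picture the shared-nothing relationship is a toy model in which every object has a single coordinating host, and ordering is decided per object rather than per LUN. The Python below is a sketch under that assumption. The host names, the hash-based owner selection, and the per-object queue are all stand-ins I made up for illustration; they are not how vSAN actually places or coordinates objects.

```python
import zlib
from collections import defaultdict

HOSTS = ["esx-a", "esx-b", "esx-c"]  # hypothetical host names


def owner_of(object_id: str, hosts=HOSTS) -> str:
    """Pick one coordinating host per object (a stand-in for real placement logic)."""
    return hosts[zlib.crc32(object_id.encode()) % len(hosts)]


class ObjectStore:
    def __init__(self, hosts=HOSTS):
        self.hosts = list(hosts)
        self.queues = defaultdict(list)  # one ordered write queue per object

    def write(self, from_host: str, object_id: str, payload: bytes) -> str:
        owner = owner_of(object_id, self.hosts)
        # Ordering is decided per object by its owner; nothing store-wide is locked.
        self.queues[object_id].append((from_host, payload))
        return owner


store = ObjectStore()
for vmdk in ("vm1.vmdk", "vm2.vmdk", "vm3.vmdk"):
    coordinator = store.write("esx-a", vmdk, b"block")
    print(vmdk, "is coordinated by", coordinator)
```

Contrast this with the shared-LUN case, where every one of those writes would land in the same contention domain no matter which VMDK it touched. Here, a write to one object never waits on anything belonging to another.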

By the way, the “shared nothing” approach is a powerful artwork in itself: stay tuned!


  1. VxRail is a different animal in this regard, due to it being a vSAN appliance. Thus, it shares the benefits discussed here. 

  2. There are actually both object “types” and “classes” in the vSAN object store, but for the sake of brevity, I’ll focus on the classes, as this is the interesting bit here. 

  3. At the risk of sending you down a rabbit hole, those links (and that whole site) are really interesting reads. Physics is weird, man. 

  4. Julius Sumner Miller would be proud, I hope. 

  5. “SPBM” is such an awkward acronym… spuh-bum? spi-bum? es-pee-bee-em? We need some fancier names! At least it’s a really accurate acronym…

  6. The inimitable Pete Koehler calls this out in his appearance on the Virtually Speaking Podcast (episode #38). He draws the immediate connection to performance, and fascinating thoughts on storage performance in general. 

  7. There is a locking implementation in vSAN within VMHOME objects, but it isn’t a shared or distributed lock in the sense I’m talking about here, thus does not run the aforementioned risks. 

  8. Links to VMworld 2017 sessions expire eventually, but you ought to be able to find them on YouTube around late September 2017 by searching “sto1926bu” and “sto2986bu”. 

Updated: June 09, 2018 (Add a TOC and some minor syntax improvements.)
