Last time in this series, I reviewed the shared nothing model behind vSAN and why it matters. I also mentioned that design considerations arise any time a shared nothing architecture is implemented, and that there is no single “perfect solution” in systems design. Indeed, in section three of the paper I referenced, The Case for Shared Nothing, Mr. Stonebraker approaches several potential challenges faced by shared nothing systems. Let’s review these challenges in light of vSAN, and use this opportunity to discuss vSAN performance as a whole. There’s quite a lot to cover here, so I’ll spread the information across more than one article.
I love and agree with this quote from Mr. Stonebraker’s paper:
To ordinary mortals tuning is a “black art”.
Too often I see storage systems requiring dozens of “micro-tweaks” to optimize I/O flow for a given workload, and I’ll admit this was one of my fears with vSAN when I first started looking at it. If a single-chassis storage array has a complex bevy of knobs to turn just right to calibrate for my infrastructure and application data, what am I getting into when we start distributing storage across shared nothing nodes? Because vSAN distributes data and individual nodes work on their own parts, how and why transactions are distributed is an important question. The failure of any one part must not unduly affect the others. And each transaction must complete as quickly as possible.
In the world of storage, we can quickly sketch some requisite boundary lines for where data should be processed: physical interactions with local storage media should be distributed across nodes for each to separately own, and we should tune those entirely to the benefit of the individual node. In other words, if an action involves local storage media on a particular host, then those operations could and should simply be done on that host. This must be balanced against the need for data in the datastore to exist in more than one place: we usually can’t afford to lose access to that data when something bad happens to a single node.
I’ll detour briefly here: In a classic case of “the exception proves the rule”, sometimes we can afford to lose access to data when something happens to a single node. I’ll provide three real-world cases I’ve seen:
- Test VMs which are built to explore some software or experiment with some new technology, but didn’t take much effort to instantiate.
- For example, if I want to muck about with some containerized apps I grabbed from Bitnami, I might decide to throw together a few VMs to give these a home.
- VMs which run highly resilient applications and are very easy to rebuild or re-instantiate. In these cases, the application itself keeps its state in agreement with other instances of the application on other VMs. These are often their own shared nothing implementations at the application level.
- Many big data analysis solutions like Hadoop or Mesosphere require instances like this.
- You could also consider something like Microsoft SQL Server’s Always On Availability Groups (AAG) to work this way, but it’s worth considering whether it’s really trivial to create a new SQL instance and deal with the effects of losing one.
- Research or data analysis applications, which spin up worker VMs to analyze something and report back. If one fails, simply spin up another to replace it; most everything is automated.
- Scientific research clusters work this way sometimes.
- Or, some kinds of big data solutions might do this for parallelism.
In these cases, maybe I don’t care about losing that data if a node fails. In vSAN, I can use a policy rule “No data redundancy”. This rule means these objects have a DOM owner and live somewhere in the cluster (maybe one node, maybe spread across a few), but have no resiliency. In this case, the loss of access to any one of the involved nodes or even storage media (disks) means a loss of data. But I don’t care because the data is “protected” some other way: it’s either trivial to re-instantiate (like test VMs or research/data analysis applications), or already protected at another level (like resilient applications).
“No data redundancy” does exactly what it says. This means the failure of any involved storage media, even a single disk, may cause permanent loss of that data! Use it only when you’re sure re-instantiating the data is less trouble than saving it.
vSAN “tunes” for these “FTT = 0” cases by simply not doing something: no part has to do the work usually required to give this data resilient components. We have fewer transactions because there’s simply less to do. Ownership of and writing to the object becomes a simpler affair. This is just one example of cases where vSAN tunes transaction performance by undertaking only what is necessary. Another good example is how vSAN does not need to redistribute data every time a new disk or node is added: data is only re-distributed if a storage device in the system is over 80% full, or a policy changes. There’s no need for vSAN to go bumping a bunch of data around when I might need every ounce of performance capability I can get should a surprise I/O spike occur. We’ll look at another key way vSAN helpfully exercises the “less is more” mindset when we discuss data placement in detail below.
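The “rebalance only when needed” behavior can be sketched in a few lines. This is purely illustrative Python, not vSAN code; the 80% fullness trigger is the one figure taken from the text, and all names and capacities here are invented:

```python
# Illustrative sketch only: vSAN's real rebalancing logic is internal.
# The 80% fullness trigger comes from the text; names are hypothetical.

REBALANCE_THRESHOLD = 0.80

def needs_rebalance(devices):
    """Return the devices whose fullness exceeds the threshold.

    devices: dict mapping device name -> (used_gb, capacity_gb)
    """
    return [
        name
        for name, (used, cap) in devices.items()
        if used / cap > REBALANCE_THRESHOLD
    ]

devices = {
    "node1-disk1": (500, 1000),  # 50% full -> leave it alone
    "node2-disk1": (850, 1000),  # 85% full -> candidate for rebalance
    "node3-disk1": (790, 1000),  # 79% full -> leave it alone
}

print(needs_rebalance(devices))  # only the device over 80% is flagged
```

The point of the sketch is what *doesn’t* happen: adding a brand-new, empty device to the dictionary triggers no data movement by itself.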
vSAN also tunes performance by making sure interactions are localized to just the involved components. To drive that home: when I make a change to a VMDK, I only need to modify the part of the VMDK involved, not write bits into a large RAID set that must be propagated across many disks. Nor do I have to access abstract containers when working with stored VMs: every virtual disk, snapshot, and other part of my VM is a first-class citizen in vSAN storage. vSAN performs targeted, simple 4K writes to local media, laser-focused on the objects I need. vSAN’s lack of reliance on disk RAID is great: pass-through disk means transactions can be specific to just what I need, without affecting other data. This is further enhanced by the use of distributed object storage, which is fundamentally designed for working on data in a highly specific and retrievable way.
There’s another key in the localization concept: each disk group within a node is its own shared nothing system. vSAN uses disk groups as targets for a read from or write to any given node. Each of these groups, which is made of one cache device and one to seven capacity devices, contributes to a single datastore in a vSAN cluster. However, they’re their own independent units when it comes to accepting a write or delivering a read. The software process working on one disk group (as part of LSOM) doesn’t care about any other, on the same node or a different node. Now we can see that vSAN solves the problem of tuning transaction performance by undertaking focused, minimal, and localized actions. This drastically reduces costly interactions (like software reaching across one node to boss around another node’s hardware) and removes some interactions altogether (like how we don’t create extra objects where resiliency isn’t needed).
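As a conceptual sketch (this is not LSOM code, and all names are hypothetical), each disk group can be modeled as an object with entirely private state, so that an operation against one group never reads or touches another:

```python
# Conceptual sketch only: the shared nothing idea applied per disk group.
# This is not vSAN/LSOM code; all names here are hypothetical.

class DiskGroup:
    """One cache device plus 1-7 capacity devices, with private state."""

    def __init__(self, name, capacity_devices):
        assert 1 <= len(capacity_devices) <= 7
        self.name = name
        self.capacity_devices = capacity_devices
        self.pending_writes = []  # private queue: no other group sees it

    def write(self, component, data):
        # Only this group's own cache/queue is touched by this write.
        self.pending_writes.append((component, data))

# Two groups, possibly on different nodes: writing to one never requires
# coordinating with, or even reading the state of, the other.
dg_a = DiskGroup("node1-dg1", ["cap1", "cap2"])
dg_b = DiskGroup("node2-dg1", ["cap1", "cap2", "cap3"])

dg_a.write("vmdk-component-1", b"...")
assert len(dg_a.pending_writes) == 1
assert dg_b.pending_writes == []  # untouched: its own shared nothing unit
```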
You’ll notice all the points discussed so far aren’t about tuning that a user of vSAN has to do: these are processes vSAN simply follows as part of its design. This is very intentional: vSAN represents the end of hunting for micro-tweaks and the beginning of fundamentally tuned storage. There are several elements you can control via storage policies and cluster settings. I’ll cover them below and in follow-up articles. But these control points are not necessary for fundamental transaction tuning: vSAN inherently handles the most crucial aspects necessary to ensure quality, consistent performance. This is a good thing because one reason you really need to tune storage performance is to deal with a very hard problem: hot spots.
While it’s all well and good to say data is splashed around evenly in vSAN (something I’ve covered from a few angles already in this series), not all VMs are made equal. Some demand more storage performance than others, whether because of the sheer volume of IOs or high sensitivity to latency, so they need more IOPS delivered. There are a few ways to address this in vSAN, but let’s start with the shared nothing fundamentals. It’s good that a given node is only focused on the actions it’s involved in, because it can perform at its individual best. However, if left unchecked, this means some nodes end up busier than others. They could even end up so busy that their maximum achievable throughput is exceeded while other nodes sit relatively idle.
The primary defense against this is dispersing data: if we can make sure that not only the volume but the intensity of components is distributed, this helps avoid bottlenecks. I had a great conversation with one of our insanely smart vSAN R&D engineers a while back about this sort of thing: how does vSAN decide where to put data in light of performance considerations? He raised a good point: determining how intense a given piece of data will be ahead of time, without manual assistance, requires the kind of psychic abilities that’d be amazing in a compute system, but might also try to kill us and send Terminators back through time to destroy any hope for our future resistance[1]. As someone who only beat the Terminator 2 arcade game once after I don’t know how many quarters, I felt really thankful engineering had the courtesy not to unleash the singularity just to shave IOPS. While I was reflecting on this, the engineer I was speaking with had finished his point, so I had to ask him to repeat it.
Beyond trying to determine throughput needs ahead of time, he explained (with the kind of special patience R&D engineers reserve for people like me) that trying to do a bunch of adjustment after data is placed has a cost of its own: moving that data around might reduce throughput until it’s moved. Also, who’s to say the thing I’m adjusting for won’t go away before I can finish the adjustment, or collide with another unexpected “hot spot”? These are well-known problems in the world of storage and have been addressed a number of ways, primarily through tiering. For those not familiar, storage tiering breaks deliverable service out into layers, each a little better than the last. These are often called “Gold, Silver, and Bronze” or something like that.[2]
Like all choices in storage system design, problems immediately arose with tiering data:
- How should data be measured to know which tier it belongs to?
- IOPS? What about when some IOs are one size and some are another?
- Throughput? What about when some data is many small IOs and some is many large IOs?
- How should placement be performed?
- Manually? How will an administrator know where to place data? How much work will it be to juggle?
- Automatically? How will the system decide where to place data in light of the above questions? How quickly should it respond to changes in workloads?
- What about when the same data only occasionally needs a higher tier?
- Do you give up valuable “gold tier” space for a workload that only needs it once a week?
- Do you try to manually or automatically escalate it to a higher tier based on observed changes?
- What will be observed to make an incidental tiering change and how quickly can the tier change be completed? How will it be ensured there’s enough space in that tier?
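The first of those measurement questions is easy to demonstrate. In this sketch (the workload numbers are invented for illustration), ranking the same two workloads by IOPS and by throughput assigns them opposite tiers:

```python
# Sketch of why tier classification is ambiguous: ranking the same two
# workloads by IOPS and by throughput picks opposite "gold tier" winners.
# All workload numbers here are invented.

workloads = {
    "oltp-db": {"iops": 8000, "io_size_kb": 4},     # many small IOs
    "backup":  {"iops": 200,  "io_size_kb": 1024},  # few large IOs
}

def throughput_mbps(w):
    """Throughput in MB/s implied by an IOPS rate and IO size."""
    return w["iops"] * w["io_size_kb"] / 1024

by_iops = max(workloads, key=lambda n: workloads[n]["iops"])
by_tput = max(workloads, key=lambda n: throughput_mbps(workloads[n]))

print(by_iops)  # "oltp-db": wins on IOPS (8000 vs 200)
print(by_tput)  # "backup": wins on throughput (200 MB/s vs ~31 MB/s)
```

Which one “belongs” in the top tier? Both answers are defensible, which is exactly the problem.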
To be sure, there are possible answers to all these questions. The journey to a better storage system is finding ideal answers. For vSAN, there’s a compelling opportunity to address these challenges: eliminate traditional tiering. If tiering was created to solve the problem of how to deal with hot spots, is there an opportunity to revisit the original problem and perhaps solve it a better way? vSAN differs from most other storage systems in that there’s no concept of “tiered zones”. This means no “pin to flash”, no “auto-tier technology”, and so on. Instead, all writes follow one universal, simple path. I’ve mentioned this session before, but if you want to understand this path well, I recommend you check out “A Day in the Life of a vSAN I/O”.[3]
Because vSAN writes directly to the cache disks (“tier” if you like) on multiple shared nothing nodes, it has the advantage of a simple, fast place to put writes as soon as they leave the VM. Thanks to not differentiating which writes follow this fast path, this means every I/O gets to be “gold tier”. Sounds nice, yes? So nice, you might wonder why every storage system doesn’t do things this way. In most cases, it’s been because the cost would quickly grow enormous: even today, fast flash disk is cheaper than ever, but still quite a bit more costly than slower media (especially spinning drives). vSAN solved this problem through a brilliant localized data destage design, in which data can be efficiently moved out of this cache layer at an appropriate time, ensuring we don’t need too much cache. Instead of separating IOs by type or intensity of the operation, vSAN simply directs all writes to cache, always. How does vSAN avoid hot spots in that cache? The simple answer is that it doesn’t need to; it only needs to make sure there aren’t more hot spots in a given cache device than that device or the software running it can handle.
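A toy model of that universal write path might look like the following. Only the “every write goes to cache first, then destages to capacity later” rule comes from the text; the sizes and the destage threshold here are invented for illustration:

```python
# Toy model of the universal write path: every write lands in cache
# first and is destaged to capacity later. The cache size and destage
# threshold are invented; only the "all writes go to cache" rule is real.

class CacheTier:
    def __init__(self, size, destage_at=0.5):
        self.size = size
        self.destage_at = destage_at  # hypothetical trigger point
        self.buffered = 0             # data sitting in cache
        self.destaged = 0             # data moved down to capacity

    def write(self, amount):
        # No classification step: every write takes the same fast path.
        self.buffered += amount
        if self.buffered / self.size >= self.destage_at:
            self._destage()

    def _destage(self):
        # Move buffered data down to the capacity devices in bulk.
        self.destaged += self.buffered
        self.buffered = 0

cache = CacheTier(size=100)
for _ in range(12):
    cache.write(10)  # "gold tier" treatment for every IO, no sorting

print(cache.destaged, cache.buffered)  # 100 destaged, 20 still in cache
```

The design choice the model highlights: because destaging happens on the cache’s schedule rather than in the write path, the incoming IOs never wait on slower media.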
Now that we’ve journeyed through the world of tiering and come out with a discovery of how vSAN’s tiering is different, let’s examine how it’s decided which cache disk a given object should go to, as this is crucial to making sure vSAN doesn’t work a disk group beyond its limits. If a write is modifying an existing VMDK, it’s clear what data needs to be changed: the VMDK that was already stored. vSAN is an object store, so a host will simply call up that object by asking its local DOM instance where the data backing that object lives. More than likely, that data is already sitting on some capacity disk on one or more hosts. vCenter also provides you with the ability to see where this data is if you’re the curious type. You can find this information by navigating to Your Cluster > Monitor > vSAN > Virtual Objects, then locating your VM in the middle pane, checking off the objects you’re interested in seeing, and selecting “View Placement Details”. This will display the hosts and even the disks used to store that given object.
How did vSAN decide to use those hosts and those particular disks? When CLOM interprets a policy, part of the determination is how many pieces will be needed to form the primary object, like a VMDK. (We call these pieces “components”.) As an example, take a 40 GB VMDK with a policy of FTT = 1 using erasure coding and a stripe width of 1. This makes the determination pretty simple: we need just 4 components, following the distributed 3-data, 1-parity layout vSAN’s erasure coding uses for “RAID-5”. If you’re interested in learning more about that specifically, I suggest the “Host Requirements” section of the vSAN Space Efficiency Technologies whitepaper.
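That component math is simple enough to sketch. The 3-data-plus-1-parity layout for FTT = 1 erasure coding comes from the text; the mirroring figures (two full copies, with the small witness component ignored in the capacity math) are the commonly documented RAID-1 behavior, and everything here ignores metadata overhead:

```python
# Rough component and capacity math for the policies discussed.
# 3 data + 1 parity for FTT=1 "RAID-5" is from the text; the RAID-1
# mirroring math is the commonly documented behavior (witness ignored).
# All of this ignores metadata overhead.

def raid5_components(ftt, stripe_width=1):
    """vSAN "RAID-5": 3 data + 1 parity per stripe (FTT=1 only)."""
    assert ftt == 1, "vSAN RAID-5 erasure coding applies to FTT=1"
    return 4 * stripe_width

def raw_capacity_gb(vmdk_gb, layout):
    """Approximate raw capacity consumed by a VMDK under a layout."""
    if layout == "ftt0":
        return vmdk_gb              # one copy, no protection
    if layout == "raid1-ftt1":
        return vmdk_gb * 2          # two full mirrors
    if layout == "raid5-ftt1":
        return vmdk_gb * 4 / 3      # 3 data + 1 parity
    raise ValueError(layout)

print(raid5_components(ftt=1))            # 4 components
print(raw_capacity_gb(40, "raid5-ftt1"))  # ~53.3 GB raw for a 40 GB VMDK
print(raw_capacity_gb(40, "raid1-ftt1"))  # 80 GB raw if mirrored instead
```

The same 40 GB VMDK with FTT = 0 would consume roughly 40 GB raw, which is the capacity side of the “do less” tuning described earlier.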
Now let’s follow the placement decisions in detail. To start, CLOM will mandate that the placement of these 4 components not overlap fault domains; anything else would defeat the purpose of an FTT. Assuming I haven’t configured any specific vSAN fault domains, this means the components simply won’t share nodes: each will definitely be placed on a different ESXi host. DOM will decide which hosts, using a few metrics to do so, primarily how “full” each host is at the time (both in terms of space consumed and number of objects stored). Remember the conversation I mentioned with the R&D engineer? He told me that vSAN does not necessarily determine which nodes are “busiest” in terms of ongoing IOs. As discussed above, it’s nearly impossible to know how to do this correctly: too much can change a moment later. So the metrics for node selection are relatively simple and focused on the predictable, known quantities in the system. Once a particular node has been selected, DOM will select a disk group on that node to use for the component. Again, which disk group is used depends on a few metrics, but not really on which is “busiest”.
So why not come up with at least some way to figure out which node and disk group is “least busy”? To a certain extent, vSAN has. If a node or disk group has relatively few components compared to others, it’s also likely (but not guaranteed) that it’s less busy than a node or disk group with more components. More importantly, trying to guess at this is expensive[4] and, once again, unreliable. Instead, vSAN solves the root problem: hot spots are handled because any modification to data always goes first to a single, simple cache device on each involved node: one device in one disk group, alongside one or more other devices working in parallel in completely different disk groups. I’ll revisit this in a future article when I discuss vSAN’s remarkable write integrity, but the point I’m making here is that this simple distributed placement model solves hot spots by engaging separate parts of the system (different nodes and disk groups) to handle the balanced load. Further, it helps that the parts controlling a given object are few and fundamental: I only have to involve the hypervisor and the disks in the host; there’s no additional software or VMs required to talk to the local disks. This is as short a path as you can hope for in VM storage, and that means those I/O peaks are going to be processed quickly.
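Pulling the placement discussion together, here’s a hypothetical sketch of the idea (not vSAN’s actual algorithm; the host statistics are invented): components must land in distinct fault domains, here simply distinct hosts, and among eligible hosts we rank by simple, known quantities like consumed space and component count rather than guessed busyness:

```python
# Hypothetical placement sketch, not vSAN's actual algorithm. Each
# component lands in a distinct fault domain (distinct hosts here), and
# hosts are ranked by simple known quantities (space used, component
# count), not by guessed "busyness". Host stats are invented.

hosts = {
    "esx01": {"used_pct": 0.62, "components": 130},
    "esx02": {"used_pct": 0.35, "components": 80},
    "esx03": {"used_pct": 0.48, "components": 95},
    "esx04": {"used_pct": 0.71, "components": 160},
    "esx05": {"used_pct": 0.40, "components": 90},
}

def place_components(hosts, count):
    """Pick `count` distinct hosts, least full first."""
    if count > len(hosts):
        raise ValueError("not enough fault domains for this policy")
    ranked = sorted(
        hosts,
        key=lambda h: (hosts[h]["used_pct"], hosts[h]["components"]),
    )
    return ranked[:count]  # one component per host, no overlap

placement = place_components(hosts, 4)  # e.g. 4 "RAID-5" components
print(placement)
```

Note what the sketch never consults: live IO rates. The ranking inputs are cheap, stable, and already known, which matches the “predictable, known quantities” point above.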
In part 2 of this review of vSAN’s performance design, we’ll conclude looking at hot spots and take this on to Mr. Stonebraker’s last concern for shared nothing systems: concurrency control. Thanks for reading!
1. That might not be exactly how he said it, but you can imagine, yeah?
2. Sometimes an environment will have even lower tiers of storage, or tiers not performing as they should, leading to designations like “cardboard tier”. That’s the kind of hilarity you’re missing out on if you’re not working in IT.
3. This recording is from VMworld 2017. I hope to see a repeat of this session at VMworld 2018, giving all of us great insight into how this story keeps getting better.
4. In this context, “expensive” means it requires a considerable amount of computation to determine.