I’ll resume our discussion from the previous article in this series, specifically around vSAN’s handling of “hot spots”. Let’s start with a return visit to The Case for Shared Nothing. Mr. Stonebraker calls out three ways to solve hot spots:
- get rid of them
- divide a hot spot record into N subrecords
- use some implementation of a reservation system
Armed with the information we’ve covered so far, we can line up vSAN’s implementations of these quite well.
As a storage system for VMs, vSAN doesn’t have the luxury of simply declining the writes a VM requests. However, it can make those writes less of a hot spot by ensuring the parts involved are few in number and high in capability. The same efficiency you get from running ESXi as a bare-metal, “tier 0” operating system to provide VMs with CPU and RAM applies to storage as well. I don’t need to proxy read/write activity through an array, VM, or anything else: the hypervisor communicates directly with the disks on one side and the VM on the other. This reduces the complexity and latency involved and avoids potential bottlenecks like the availability of a storage VM, the route to a storage fabric or array, and so on.
The storage media involved in that VM write is only a single device, which means I don’t have to muck about with disk RAID or the like. The write goes directly to a single cache disk on at least one node, and usually on two or more nodes, depending on the “Failures to Tolerate” policy involved. Further, the writes are done in parallel (see “A Day in the Life of a vSAN I/O” to understand this better). This parallel behavior means no waiting on a replication to somewhere after every write before I can acknowledge back to the VM. Even better, deduplication and compression aren’t performed until I destage down to the capacity tier, so I don’t have to worry about that, either. This pushes a great deal of I/O through very quickly, so many hot spots are avoided outright. To quickly review:
- The hypervisor communicates directly with the storage; no middleman array, fabric, or storage VM is needed.
- The write is received by a single cache device per node, which means no disk RAID complexities or chokepoints. (Ever seen what happens when the write-back cache isn’t large enough on a RAID controller?)
- The redundant writes happen in parallel rather than via replication, which means I’m not waiting on post-write activity before I can acknowledge back to the VM.
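That last point can be sketched in a few lines of Python. This is a hypothetical stand-in for illustration only, not vSAN code: the point is that the redundant writes are issued in parallel, and the acknowledgment to the VM waits only for the slowest replica rather than a serial post-write replication step.

```python
import concurrent.futures

def write_to_cache_device(node: str, data: bytes) -> str:
    """Stand-in for a write landing on one node's cache device."""
    # In a real system this would be an actual I/O; here we just succeed.
    return f"ack from {node}"

def replicated_write(data: bytes, nodes: list[str]) -> list[str]:
    """Issue the redundant writes in parallel, and acknowledge back to the
    VM only after every replica has landed."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(write_to_cache_device, n, data) for n in nodes]
        return [f.result() for f in futures]

# An FTT=1 mirror means two replicas on two different nodes (names are made up).
acks = replicated_write(b"block", ["esxi-01", "esxi-02"])
print(acks)
```

The total latency here is the maximum of the per-replica writes, not their sum, which is the essence of the parallel-acknowledge behavior described above.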
vSAN sub-divides components in two cases:
- A given object is > 255 GB in provisioned size, so separate components are created for each 255 GB “chunk”. These can be placed on the same or different disks (including other nodes).
- An object has a stripe width policy setting > 1, so the component is sub-divided into the requested number of stripes, which are always placed on different disks (also including other nodes).
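As a rough sketch of how those two rules compose, here’s a hypothetical Python helper that estimates the number of data components per replica. It’s a deliberate simplification for illustration only; real vSAN placement also involves witnesses, the RAID configuration, and free capacity:

```python
CHUNK_GB = 255  # vSAN splits components at this provisioned-size boundary

def component_count(provisioned_gb: int, stripe_width: int = 1) -> int:
    """Rough data-component count per replica: one component per 255 GB
    "chunk", each chunk further divided by the stripe width policy."""
    chunks = -(-provisioned_gb // CHUNK_GB)  # ceiling division
    return chunks * stripe_width

print(component_count(500))     # 500 GB VMDK -> 2 chunks at default stripe width
print(component_count(100, 3))  # 100 GB VMDK with stripe width 3 -> 3 stripes
```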
We can see a nice visual of these when examining the data placement of a VMDK > 255 GB in size:
The VMDK pictured has been divided into smaller parts. This alone can help certain transactions performed against it complete much faster, as there’s less data in each part. In general, computers, especially the general-purpose servers vSAN runs upon, prefer many small problems over a few large ones. Great: we’re dividing a large problem into small parts and creating an opportunity for advantage here. However, these “chunks” aren’t really made for performance’s sake: they exist so large VMDKs can fit onto the storage media we might use in vSAN. After all, I can create a 62 TB VMDK in vSAN 6.7, but I can’t buy a 62 TB capacity disk!1 Because these components are practical rather than performance-focused, they can be placed on the same disk, so there’s still a chance that a very busy VMDK isn’t engaging enough disks to work as quickly as it should.
This is where the vSAN Stripe Width policy rule comes to the rescue. John Nicholson wrote about vSAN stripes a while back; it’s a great quick read. The key point of stripes in vSAN is that they’re always on different disks from their counterparts. To see this in action, here’s a smaller VMDK, but with a stripe width of 3:
We can see that there are three parts to each replica. Those three parts within the same replica will always be on different disks, ensuring plenty of disk engagement here. As John notes in the article referenced above, this is a great option when needed, but it’s actually best to start at the default single stripe, examine metrics and data, and scale to more stripes as you need them. We often find that engaging more disks isn’t necessary for many objects in vSAN, thanks to the other intrinsic performance features I’ve noted here and in the previous article.
As you add more stripes, you need more disks to support those striped components. Check out Cormac’s great writeup on vSAN stripes and disk requirements for more info.
So we can see here that vSAN has two ways of dividing potential hot spots, for two different reasons:
- Divide large objects for practical purposes
- Offer optional division of any object to arbitrarily engage more disks
I’m sometimes asked why vSAN doesn’t have an “IOPS guarantee” policy or something similar. That might be worth a whole article in itself sometime, but the simplest answer is that vSAN focuses on delivering sufficient IOPS to every requester, instead of making you micro-manage that. Moreover, what would be done to “guarantee” those IOPS? Limit all other I/O unconditionally? That gets dangerous, such as when an important VM is being resynchronized after a disk or node failure. Reserve some threshold up front? Why not have the VM run unfettered in the first place? vSAN gives you a great deal of control over the performance of your VMs, but it doesn’t have, or need, a performance reservation. However, there are two reservation systems in vSAN that are very relevant here.
While I can’t (and personally, don’t want to) specify some “IOPS floor” for VMs, I can specify a limit. This is great for test VMs that might be subject to scripts or processes with unintended consequences. Do I really want that test SQL server ticking away at 10,000 IOPS just because someone accidentally ran an update statement that changed 2 million rows? Even if vSAN can handle that burst with no issue, there are other side effects it might create. IOPS limits ensure that a VM (or even a single VMDK) can’t make more requests than I allow. These limits can be monitored against the actual ongoing IOPS in the VM by navigating to Your VM > Monitor > Performance > vSAN > Performance, then selecting the “Virtual Disks” tab and choosing the VMDK you’ve limited from the “Virtual Disk:” drop-down.
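Conceptually, an IOPS limit behaves like a token bucket: requests above the configured rate are deferred rather than serviced. Here’s a minimal Python sketch of that idea. It’s illustrative only, and is not vSAN’s actual scheduler:

```python
import time

class IopsLimit:
    """Toy token-bucket IOPS limit: tokens refill at the configured rate,
    and each admitted I/O spends one token."""

    def __init__(self, iops: int):
        self.iops = iops
        self.tokens = float(iops)
        self.last = time.monotonic()

    def admit(self) -> bool:
        # Refill tokens in proportion to elapsed time, capped at the rate.
        now = time.monotonic()
        self.tokens = min(self.iops, self.tokens + (now - self.last) * self.iops)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller queues or delays the I/O instead

limit = IopsLimit(iops=100)
admitted = sum(limit.admit() for _ in range(1000))
print(admitted)  # roughly 100: the burst beyond the limit is deferred
```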
Adaptive resync dynamically allocates network bandwidth to the different types of data queues in vSAN based on need and available bandwidth. For example, if VM traffic is busy enough that it’s putting pressure on the network available to vSAN, resync traffic will be throttled to no more than 20% of the total bandwidth available. As soon as that pressure is relieved, the throttle is removed. This is an important consideration, because so far in this article we’ve focused on storage media while ignoring just how crucial available network bandwidth is to ongoing I/O. This is a type of reservation system, but one focused entirely on network bandwidth, as that’s especially out of vSAN’s control: it can’t dictate when or how VMs will ask for reads and writes; it just has to “roll with the punches”. If you’re interested in learning more about this feature, see Duncan’s great quick gist on vSAN Adaptive Resync and the official Tech Note on Adaptive Resync in vSAN 6.7.
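The behavior described can be modeled as a tiny function: resync uses whatever headroom VM traffic leaves, but under contention it’s held to the 20% cap. This is a toy model of the idea, not vSAN’s implementation:

```python
RESYNC_CAP = 0.20  # under contention, resync is limited to 20% of bandwidth

def resync_bandwidth(total_gbps: float, vm_demand_gbps: float) -> float:
    """Toy model of adaptive resync: if VM traffic leaves headroom, resync
    may use all of it; under pressure, resync is throttled to the cap."""
    headroom = total_gbps - vm_demand_gbps
    if headroom >= RESYNC_CAP * total_gbps:
        return headroom                 # no pressure: use whatever is free
    return RESYNC_CAP * total_gbps      # pressure: hold resync to the 20% cap

print(resync_bandwidth(10.0, 3.0))   # plenty of headroom -> 7.0
print(resync_bandwidth(10.0, 9.5))   # VM traffic dominates -> capped at 2.0
```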
So, while vSAN has some reservation systems, these take a back seat to the other methods mentioned above to control I/O flow: eliminating hot spots where possible, and sub-dividing work across agents. Mr. Stonebraker agrees with this priority in his notes about improving database performance:
> It is clear that this tactic [of using reservation systems] can be applied equally well to any of the proposed architectures; however, it is not clear that it ever dominates the “divide into subrecords” tactic. Consequently, hot spots should be solvable using conventional techniques.
Hopefully this has given some insight into the native performance characteristics of vSAN. There’s plenty more that could be said, of course; I chose to focus on the key aspects relevant to core I/O flow and shared-nothing systems. Besides what I’ve linked in this article and the previous vSAN performance design article in this series, some other great material on vSAN performance is below.
- Storage Performance on the Virtually Speaking Podcast
  - Pete Koehler chats with the hosts of Virtually Speaking about storage performance and vSAN’s advantages
- Designing for Performance on the Virtually Speaking Podcast
  - Pete returns to discuss more performance considerations and great new performance features in vSAN 6.7
- Extreme Performance Series: vSAN Performance Troubleshooting
  - Amithaba and Suraj provide an awesome deep-dive presentation on understanding, troubleshooting, and designing for performance in vSAN at VMworld 2017
- vSAN Performance Evaluation Checklist
  - A quick rundown on how to properly assess vSAN performance hands-on
- vSAN Caching Algorithms Whitepaper
  - The caching behavior of vSAN is reviewed in detail, shedding light on the performance gains found in its remarkable design
Next time in this series, let’s put this great performance to specific work by examining Storage Policy-Based Management in detail. Until then, thanks for reading!
At least, not yet. We’re up to [16 TB], so let’s see how far this goes… ↩