Finding a Better Way to Estimate IOPS for VDI

Posted on 2010/10/31

Planning storage is probably the most difficult part of architecting any VDI solution and estimating IOPS is probably the most difficult part of planning the storage. For a little background on some of the discussions I have had around this topic, see my Saving IOPS with Provisioning Services blog and the associated comments. Since posting that blog in August, I have fielded multiple requests for “numbers” that can be used for sizing. I am very leery about making decisions based on estimates, but I understand it is hard to justify a business case without some type of numbers.

Let me reiterate what I have stated in my various webinars around architecting a Citrix XenDesktop solution: the only way to prevent yourself from under-sizing the storage tier is to run a pilot and analyze the results. However, for those of you who are looking for some ballpark numbers I will provide some more detailed guidance in this blog – which of course makes it almost a whitepaper. 🙂

The goal is to never exceed the capacity of the storage layer under any normal load. Storage solutions are generally architected to provide burst capacity, but you don’t want to be running on that capacity all the time. Events like a virtual machine boot can be scheduled for off peak hours. Events like logons can be predicted because they occur around standard times. However, some events like compiling code (which is I/O intensive) will happen at random times. Statistically speaking we should be able to plan on the average IOPS across all users and multiply it by the number of users in an environment to get the storage requirement. However, because of the nature of user activity that model sometimes fails.

Existing Model for Estimating IOPS

I architect solutions using the premise that the two most taxing events on the storage tier are the boot storm and the login storm. Generally speaking I focus on sizing for the login storm: it is usually 30-50% less stressful than the boot storm, and the boot storm can be kept out of the way by starting machines during off-peak hours.

In a perfect world you could monitor the virtual machines for a while and take the highest value recorded for the IOPS rate and use that value as the expected IOPS model for all your users. This model is generally the safest approach but probably not the most cost-effective approach since sizing for that type of performance is generally extremely expensive and will drive the ROI of VDI down significantly. Enter planning based on averages instead of maximums.

The model that I have seen most often is to plan the storage based on the average IOPS for the user workload and possibly add 10-20% extra capacity for a buffer. When I use the average IOPS method together with the maximum number of simultaneous logins, the formulas look something like this:

Login IOPS = MaxSimultaneousLogins * Average Login IOPS (Incremental IOPS for new user logons)
Workload IOPS = TotalUsers * Average Workload IOPS (IOPS when all users are online)
Peak IOPS = Workload IOPS + Login IOPS (Theoretical maximum when all users are online and the last set is logging in)

So, before I went on site I would run through the calculations and provide the client with the storage sizing information necessary to be successful. As an example, here is what it would look like for a 3500-user workload assuming an Average Login IOPS value of 15 and an Average Workload IOPS of 4:

* Login rate of 120 users per minute to get all users online within 30 minutes.
* Average user login takes about 30 seconds.
* Maximum simultaneous logins would be 120 assuming the first users are finishing up as the next start.
* Login IOPS = 120 * 15 = 1800.
* Workload IOPS = 4 * 3500 = 14000
* Peak IOPS = 14000 + 1800 = 15800.
* SAN Capacity = 15800 + 20% buffer = 18960, or roughly 19000.
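
As a rough sketch, the average IOPS method above can be written as a few lines of Python. The function and parameter names are illustrative only and do not come from any tool mentioned here; the buffer is passed as a whole-number percentage to keep the arithmetic exact.

```python
def average_method_iops(total_users, max_simultaneous_logins,
                        avg_login_iops, avg_workload_iops, buffer_percent=20):
    """Size storage with the average IOPS method described above."""
    # Incremental IOPS from the users who are logging in at the same time.
    login_iops = max_simultaneous_logins * avg_login_iops
    # IOPS once every user is online and working.
    workload_iops = total_users * avg_workload_iops
    # Theoretical maximum: everyone working while the last set logs in.
    peak_iops = login_iops + workload_iops
    # Add the safety buffer using integer arithmetic.
    with_buffer = peak_iops * (100 + buffer_percent) // 100
    return {
        "login": login_iops,
        "workload": workload_iops,
        "peak": peak_iops,
        "with_buffer": with_buffer,
    }


# The 3500-user example: 120 simultaneous logins, 15 login IOPS, 4 workload IOPS.
example = average_method_iops(3500, 120, 15, 4)
print(example)
```

The values match the list above, with the buffered figure coming out at 18960 before rounding up to 19000.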

Then I would show up on site with a SAN that supports 20000 IOPS and before I could get half the users on the system I was ordering more storage. At some point I decided I needed a better model since I did not enjoy looking stupid as the expert. I am sure that has never happened to you.

My Model for Estimating IOPS

The model I now use came out of my work at the HP Solution Center where I had a P4500 (Lefthand) SAN. I have a spreadsheet that does these calculations, but at this point because it is so specific to the environment tested, I think it is better to just provide the formulas so you can adapt them to your environment.

The model includes several variables (as shown below in the screenshot) as input which I will explain. The discussion and examples are based on the same 3500-user example I provided above, which by the way had an actual peak IOPS of 28822.

[Screenshot: IOPS Estimation Model]

Number of desktops: The total number of desktops that will be using the storage array or cluster. If you have multiple clusters you should calculate this value separately for each cluster. For the discussion scenario we are using 3500 desktops.

Launch Rate: How quickly the users will be logging into the system. For my scenario all 3500 users were to be logged on within 30 minutes, which required a login rate of about 120 users per minute, or one user every half second.

Login IOPS: The amount of I/O per second (IOPS) generated while a user is logging onto the system. The value for my Windows 7 environment that made the model accurate was 100. Remember in this model the number represents only the writes. Strangely enough if I measure the perfmon counter Disk Transfers/sec it does not show a value this high. My only conclusion then is that additional overhead is added by the SAN hardware (such as would be there for RAID5 vs RAID10 configurations) and/or the hypervisor.

Workload IOPS: The amount of IOPS expected per user during the workload execution. During the test I was using the LoginVSI Medium workload so I selected 5.

Average Desktop Login Time: The amount of time (in minutes) it takes a user to log in to the desktop. The time is measured from the CtrlAltDel Login screen to desktop shell initialization complete. In my environment I had a tool that measured this time accurately so I was able to determine the average time was 30 seconds (0.5 minutes) across all users during the test.

Before explaining the output, I should explain the primary assumptions behind the model which are admittedly not a perfect representation of a true environment.

1. Desktops are in either a login state or a workload state.
2. Desktops start in the login state and move to the workload state based on the Login Time parameter.
3. Desktops enter the login state at the rate defined by the Launch Rate parameter.
4. Desktops in the login state and desktops in a workload state have different IOPS requirements.
5. Read IOPS are ignored because they amounted to less than 2% of the total IOPS.

The new model is an improvement on the previous model because the new model takes into account the higher IOPS requirements that occur during the login state which are limited to the time during which the state occurs. I believe these state-related differences in IOPS can cause the unanticipated spike in IOPS that eventually leads to the procurement of more storage.

Peak IOPS: Represents the peak IOPS expected based on the parameters provided. The following formula is used to calculate this value:

Peak IOPS = MAX((DesktopsInLoginState * Login IOPS) + (DesktopsInWorkloadState * Workload IOPS))
where the maximum is taken over every point in time during the login period.
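
To make the state-based formula concrete, here is a minimal Python sketch that steps through the login period one second at a time. It is not the spreadsheet mentioned above: the per-second time step, the evenly spread launches, and the fixed login duration are all assumptions made for illustration, and the function name is hypothetical.

```python
def peak_iops(num_desktops, launches_per_minute, login_iops,
              workload_iops, login_seconds, step_seconds=1):
    """Walk through the ramp-up and return the highest combined IOPS seen.

    Assumes launches are spread evenly and every login takes exactly
    login_seconds (a simplification of the assumptions listed above).
    """
    if launches_per_minute <= 0 or step_seconds <= 0:
        raise ValueError("launch rate and step size must be positive")
    peak = 0
    t = 0  # elapsed time in seconds
    while True:
        # Desktops that have started logging in by time t.
        launched = min(num_desktops, launches_per_minute * t // 60)
        # Desktops launched at least login_seconds ago are in the workload state.
        in_workload = min(
            num_desktops,
            launches_per_minute * max(0, t - login_seconds) // 60,
        )
        in_login = launched - in_workload
        current = in_login * login_iops + in_workload * workload_iops
        peak = max(peak, current)
        if in_workload == num_desktops:
            return peak
        t += step_seconds


# Inputs from the 3500-user discussion scenario.
print(peak_iops(3500, 120, 100, 5, login_seconds=30))
```

With these inputs and a 30-second login, this sketch never has more than 60 desktops in the login state at once. Counting 120 simultaneous logins, as the earlier average-method example did, the same formula gives 120 * 100 + 3380 * 5 = 28900. The time granularity of the original spreadsheet is not described here, so treat either figure as an illustration rather than a reproduction.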

Steady-State IOPS: Represents the amount of IOPS necessary to execute the normal workload. The following formula is used to calculate this value:

Steady-State IOPS = Number of Desktops * Workload IOPS

Estimated Boot IOPS: Represents the estimated IOPS required to boot all the machines. This value was the most difficult of the three to determine and is the least accurate which is why it comes with the +/- 10% disclaimer.

Two factors contributed to the difficulty in defining this accurately. First, the Desktop Delivery Controller is configured to start up ~10% of the idle pool at once, but because of hypervisor response times a few more or fewer could be starting at once. Second, several factors could affect the boot time for all the workstations and I had no way to accurately determine the time a machine was in the boot state within a large farm.

After analyzing multiple test runs, the best predictor of the peak boot IOPS turned out to be the number of desktops. In the end, I chose to use the following formula for the estimation, but the confidence level in this formula is lower because it is based on a correlation that may not exist in the future or with other hypervisors:

Estimated Boot IOPS = Number of Desktops * 22
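
The steady-state and boot formulas are simple multiplications, sketched below in Python. The factor of 22 and the +/- 10% range come from the text above; the function names are illustrative only.

```python
BOOT_IOPS_PER_DESKTOP = 22  # empirical factor from the test runs described above


def steady_state_iops(num_desktops, workload_iops):
    """IOPS needed to run the normal workload on every desktop."""
    return num_desktops * workload_iops


def estimated_boot_iops(num_desktops, tolerance_percent=10):
    """Return (low, estimate, high) for booting all desktops."""
    estimate = num_desktops * BOOT_IOPS_PER_DESKTOP
    low = estimate * (100 - tolerance_percent) // 100
    high = estimate * (100 + tolerance_percent) // 100
    return low, estimate, high


# The 3500-desktop scenario with a workload of 5 IOPS per user.
print(steady_state_iops(3500, 5))
print(estimated_boot_iops(3500))
```

For 3500 desktops this works out to 17500 steady-state IOPS and a boot estimate of 77000, somewhere between 69300 and 84700 once the disclaimer is applied.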

The new model was successful (meeting or exceeding the actual IOPS) when using it to retroactively estimate the peak IOPS encountered on previous tests where the number of desktops exceeded 1000. I believe this model does a better job of estimating the actual peak IOPS than using the average IOPS method. Of course hindsight is 20/20 and the results will only be as accurate as the workload IOPS value. If you are unable to complete a pilot to obtain actual results, the next best thing is to guess.

The model discussed here was created from a configuration which used standard-mode Provisioning Services (PVS) vDisks with the target device hard drive as the write-cache drive residing on the SAN and accessed via the hypervisor storage over iSCSI. The only files residing on the SAN were the write-cache drives, so virtually all of the IOPS traffic was related to write operations during the tests. Since PVS was used, the C: drive read IOPS came primarily from the PVS server. This configuration kept the read IOPS on the SAN to less than 2% of the total IOPS, with that traffic being attributed primarily to profile-related reads after the roaming profile was written to the write-cache drive.

Calculating Workload IOPS

I have seen numbers ranging between 4 and 100 used for estimating workload IOPS. I believe that in most situations the idle IOPS activity from processes such as memory swapping, temp files, log files, and auto-save events is about 4 IOPS. I therefore recommend treating 4 as a no-load value and working up from there. Remember antivirus and monitoring software will likely increase that no-load value.

The Login Consultants Virtual Session Indexer on a medium workload generates about 5 IOPS per user, which is only slightly more than idle. This behavior occurs because the workload is designed to simulate user workflows that consume RAM and processor resources. In fact the webpages that are browsed are read from disk and are not written or cached, so from an IOPS perspective the test generates very little load.

When someone asks me for ballpark guesstimates to plan storage these are the ones I feel comfortable providing based on my reading and experience at this time.

Light user: ~6 IOPS per concurrent user. This user is working in a single application and is not browsing the web.

Normal user: ~10 IOPS per concurrent user. This user is probably working in a few applications with minimal web browsing.

Power user: ~25 IOPS per concurrent user. This user usually runs multiple applications concurrently and spends considerable time browsing the web.

Heavy user: ~50 IOPS per concurrent user. This user is busy doing tasks that have high I/O requirements like compiling code or working with images or video.

Before you get too worked up about these “outrageous” values, consider that you probably have a mix of these users in the environment. For instance, you might have 20% Light, 50% Normal, 20% Power and 10% Heavy users. If you take those proportions you can calculate a “loaded” rate for the environment like this:

Loading IOPS = Light (.20*6) + Normal (.5*10) + Power (.2*25) + Heavy (.1*50) = 16.2
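
The same weighted average can be sketched in Python so you can plug in your own mix. The dictionary layout and function name are assumptions for illustration; the shares and per-user IOPS are the ballpark values listed above.

```python
def loaded_iops(user_mix):
    """Weighted-average IOPS per user.

    user_mix maps a label to (share_of_users, iops_per_user).
    """
    total_share = sum(share for share, _ in user_mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("user shares must add up to 100%")
    return sum(share * iops for share, iops in user_mix.values())


example_mix = {
    "light": (0.20, 6),
    "normal": (0.50, 10),
    "power": (0.20, 25),
    "heavy": (0.10, 50),
}

# Rounded for display because the shares are floating-point values.
print(round(loaded_iops(example_mix), 1))
```

The share check is there only to catch a mix that does not add up to 100%, which would quietly skew the loaded rate.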

Replacing the Workload IOPS rate in the original 3500-user scenario with that new loaded rate results in the following model output:

[Screenshot: Model using Loading IOPS calculations]

If those numbers seem high you might want to plan on performing a limited pilot and analyzing data from your actual environment. Remember these are still ballpark estimates and they are intentionally inflated slightly to account for unknown factors.

Analyzing the Environment

If you have the ability to perform a pilot, you can gather data and analyze the users’ actual IOPS. The key in this area is to focus on the peak average IOPS. To calculate the peak average IOPS you would look at the average IOPS for each user during the pilot and select the highest value for the calculations. I would recommend using daily averages for the IOPS (as opposed to weekly) so you get the average IOPS when the user is actually working. To make this easier to understand, here it is in a formula:

Peak Average IOPS = MAX (AvgIOPSuser1, AvgIOPSuser2, … AvgIOPSuserN)
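
Here is a small Python sketch of that calculation. The pilot data is made up purely for illustration, and the input layout (one list of IOPS samples per user, collected during a working day) is an assumption about how your monitoring tool exports data.

```python
from statistics import mean


def peak_average_iops(samples_by_user):
    """Average each user's samples, then return the busiest user and that average."""
    per_user_average = {
        user: mean(samples) for user, samples in samples_by_user.items()
    }
    busiest_user = max(per_user_average, key=per_user_average.get)
    return busiest_user, per_user_average[busiest_user]


# Hypothetical pilot data: IOPS samples taken during one working day.
pilot_samples = {
    "user1": [4, 8, 12],
    "user2": [10, 20, 30],
    "user3": [5, 5, 5],
}

print(peak_average_iops(pilot_samples))
```

Returning the user along with the value makes it easier to check whether the busiest user is representative or an outlier before sizing everyone to that number.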

By using the peak average approach you can be fairly certain that the SAN will be able to meet the needs of all of the users because you have planned based on the busiest user. I say “fairly certain” because the pilot users may not be 100% faithful at using their XenDesktop or may not be a representative sample, so the estimates that come out of the pilot may be slightly skewed. Keep in mind that you may want to monitor the average IOPS for each user on a consistent basis so that if the peak average starts to rise you can adjust before its effects are felt.

Once you have your storage tier in place and are aware of its capabilities, you should take time to understand its limitations. One area to review is its capacity for weathering boot and login storms (in terms of concurrency) which might occur. The calculation for this information is fairly straightforward.

Boot Storm Size = Total IOPS available / 300
Login Storm Size = Total IOPS available / 100

For instance if your SAN is capable of 20,000 IOPS, then it should be able to weather a boot storm of around 65 virtual machines and a login storm of around 200 simultaneous logins. You can use this information to tune your VDI environment so the storage is not inadvertently overwhelmed.
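
The two divisions can be wrapped in a small Python helper. The 300 and 100 IOPS figures are the divisors from the formulas above; the function name is illustrative, and rounding down is an assumption chosen to stay on the safe side.

```python
BOOT_IOPS_PER_VM = 300     # divisor from the boot storm formula above
LOGIN_IOPS_PER_USER = 100  # divisor from the login storm formula above


def storm_capacity(total_iops_available):
    """Return (boot_storm_size, login_storm_size), rounded down."""
    boot_storm_size = total_iops_available // BOOT_IOPS_PER_VM
    login_storm_size = total_iops_available // LOGIN_IOPS_PER_USER
    return boot_storm_size, login_storm_size


# A SAN capable of 20,000 IOPS.
print(storm_capacity(20000))
```

For 20,000 IOPS this gives 66 concurrent boots and 200 concurrent logins, in line with the rounded figures above.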

Wrapping it up

Let me wrap up with a final word on other ways to get the most out of your storage. Many vendors have solutions that significantly increase the performance of the storage tier, sometimes at a fraction of the cost of purchasing that same performance from the SAN vendor. Here are some of the technologies providing this additional performance.

1. Faster front-end drives such as SSDs that can manage under a high load and later migrate the data to slower storage.
2. Virtual appliances with large amounts of RAM that appear to the hypervisors as attached SAN storage but are actually streamlining communication with the backend SAN storage.
3. Software that converts random data access (historically slow) to sequential access (faster and more efficient) through the use of journaling techniques.
4. Virtualization-aware file systems optimized for SAN storage.

Some of these technologies are seeing up to a 90% reduction in write IOPS and can reduce the storage space requirements up to 10-fold. If you want to implement any of these new storage performance solutions you should consider a hybrid approach for estimating IOPS. One such approach is to purchase storage based on average user IOPS and rely on the burst capacity of the solution to manage the peak periods such as login or boot storms.

Feel free to comment below if you have information pertaining to this topic. If you found this information useful and would like to be notified of future blog posts, please follow me on Twitter @pwilson98.