Wednesday, September 25, 2013

Horizon Data - Storage


I use Horizon Data every day, and it really is amazing how easy it is to work with. However, maybe it’s my storage background, but I can’t help but wonder what happens to my file after I’ve uploaded it into my Horizon Data folder.

Where does it go?

Horizon Data is integrated into Horizon Workspace – which means that it’s part of the Workspace vApp.
 






The Horizon Data VM, which obviously handles the Files / Data side of things in Workspace, includes several VMDKs (which are, fortunately, thin provisioned).


If you’re running a small-scale deployment (e.g. an evaluation or demo environment), you can stick with the default configuration of having Horizon Data store things inside the VMDKs.

In production environments, however, it is recommended to use NFS storage for Horizon Data (how to add an NFS mount is documented here).
It’s worth noting that this NFS volume will be mounted into the VM, not to the ESXi Host(s).
This means that the NFS traffic will use the VM Network that the Horizon Data VM is connected to, rather than a VMkernel connection on the ESXi Hosts. So it’s worth considering whether that network has connectivity to your NFS storage array, and how much bandwidth will be available for NFS traffic on it.
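
If you want to confirm from inside the VM which NFS exports are actually mounted there, a rough Python sketch like the one below would do it. It parses text in the format of /proc/mounts (standard on Linux). The function name and the sample export nas01:/export/hzdata are made up for illustration; they are not taken from a real Horizon Data VM.

```python
def nfs_mounts(mounts_text):
    """Return (source, mount_point) pairs for NFS entries in /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line is: source mount_point fs_type options dump pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            found.append((fields[0], fields[1]))
    return found


# Hypothetical sample; on a real Linux guest you would pass in the
# contents of /proc/mounts instead.
sample = (
    "/dev/sda1 / ext3 rw 0 0\n"
    "nas01:/export/hzdata /mnt/nfs nfs rw,vers=3 0 0\n"
)
for source, mount_point in nfs_mounts(sample):
    print(source, "mounted at", mount_point)
```

If the export shows up here, the mount lives in the guest, so its traffic is leaving over the VM's own network connection.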

In any case, the key thing to bear in mind is that Horizon Data uses a MySQL database to index the files that users upload. MySQL then stores these files as Blobs.



What is a Blob?
Image courtesy of Wikipedia

A Blob is a Binary Large Object, which is basically a method for storing pretty much any kind of file within a table in a database.
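
To make the idea concrete, here is a minimal sketch of putting a file's bytes into a BLOB column and reading them back. It uses SQLite from the Python standard library purely for illustration. It is not Horizon Data's MySQL schema, and the table, column and function names are all hypothetical.

```python
import sqlite3


def store_blob(conn, name, data):
    """Insert raw bytes into the BLOB column under the given name."""
    conn.execute("INSERT INTO files (name, content) VALUES (?, ?)", (name, data))
    conn.commit()


def load_blob(conn, name):
    """Return the stored bytes for name, or None if there is no such row."""
    row = conn.execute(
        "SELECT content FROM files WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else bytes(row[0])


# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY, content BLOB)")

payload = b"hello from my Horizon folder\n"
store_blob(conn, "notes.txt", payload)
print(load_blob(conn, "notes.txt") == payload)
```

The database doesn't care what the bytes are, so a text file, an MP3 or a spreadsheet all go in the same way.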

To me, this makes a lot of sense – rather than having unstructured data scattered around a filesystem, the database keeps everything neat & tidy.

I like to think of it like this…back in 2002, I converted most of my CD collection into MP3s, and went to a lot of trouble to keep my MP3 files organized – each artist had a folder, and inside there was a folder for each album.
I started off with about 5GB of Music, which was probably like 30 folders of stuff. At the time, I used to browse through the folders, find things I wanted to listen to, then drag them into a music player (I was a big fan of WinAmp!).

11 years later, and I now have nearly 40GB of Music. I’ve been through 6 laptops in that time, and 3 USB hard drives.
The last time I looked at my Music folder, it was a mess. There were a bunch of duplicate files & folders, things in the wrong place, and things missing (some laptop migrations were a result of hard drive failures!).
So, browsing through the folders and dragging things into a music player just doesn’t work.


The last time I migrated my laptop, rather than copy the files over myself, I used iTunes to import everything, with these 2 options set:


 




Now, I see all my music in one place, and click on whatever I want to play. I don’t care where iTunes stores each file. If & when I need to migrate off this laptop, I’ll just Export my iTunes Library.







So, getting back to Horizon Data.
Here’s a file I made earlier, and uploaded into my Horizon Data folder.

If I log into the Horizon Data VM (as root), I can navigate through to the directory where the blobs are stored (/opt/zimbra/store).

Inside that directory, MySQL has structured things very carefully. Here’s what the blob looks like:


The path and the filename are both important. AFAIK they relate to tablespaces etc. within the database; I’m sure someone who understands MySQL better than I do can tell you all about it.

If I look at that .msg file, I can see it’s actually the .txt file I uploaded into my Horizon folder.
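
Eyeballing the contents works for a small text file. For anything bigger, comparing checksums is a more reliable way to show that the stored blob is byte-for-byte the same as the original. Below is a small sketch of that check. The function names are my own, and the demo uses two throwaway temp files as stand-ins for the uploaded .txt and the .msg blob, not real Horizon Data paths.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)


# Demo: two files with different names and extensions but identical bytes.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "upload.txt")
    blob = os.path.join(tmp, "blob.msg")
    for path in (original, blob):
        with open(path, "wb") as f:
            f.write(b"Hello Horizon Data\n")
    demo_match = same_content(original, blob)

print(demo_match)
```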

So, Horizon Data has changed the filename & extension, and manages its own directory structure to store things, but the actual content of my file hasn’t been modified.
From a storage perspective, this means that any block-based deduplication should work very well for files stored by Horizon Data.
File-based single instancing (leaving behind stub files etc.) wouldn’t be a good idea, but anything block-based which is invisible to the filesystem and preserves the file & directory structure should work very well at freeing up disk space.
I’d be very interested to see someone do some testing with this & see what kind of dedupe ratios they achieve.
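
For anyone who fancies trying that, here is a rough way to get a ballpark figure without an array. The sketch below splits each blob into fixed-size blocks, hashes them, and divides the total number of blocks by the number of unique ones. The 4096-byte block size and the function name are my assumptions. Real dedupe engines use their own block sizes and alignment (some use variable-length chunks), so treat the result as an estimate only.

```python
import hashlib


def dedupe_ratio(blobs, block_size=4096):
    """Estimate total-blocks / unique-blocks across an iterable of bytes objects."""
    total = 0
    unique = set()
    for data in blobs:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            total += 1
            unique.add(hashlib.sha256(block).digest())
    # With no data at all there is nothing to deduplicate, so report 1.0.
    return total / len(unique) if unique else 1.0


# Two users uploading the same 8 KB file: 4 blocks in total, 2 unique.
same_file = b"A" * 4096 + b"B" * 4096
print(dedupe_ratio([same_file, same_file]))
```

To run it against real data, you would read each file under the blob store into memory (or adapt the loop to read in chunks) and pass the contents in.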