DFS for Webfarm Usage - Content Replication and Failover
by Scott Forsyth
June 9, 2006
Windows Distributed File System (DFS) has been around
for a long time and it has always had a lot to offer. With the latest update in
Windows Server 2003 R2, DFS has become quite an impressive product.
At ORCS Web, we've recently started to use DFS for some of our high availability
offerings that use a central NAS (Network Attached Storage) content server. We're
using DFS for handling the content server, both for replication and for automatic
failover to a backup server in the event of maintenance or a server failure.
There were a number of things that I learned while researching, testing and rolling
out DFS for webfarm content hosting that I'll share here. This isn't a step-by-step
walkthrough, but rather some pointers that you will hopefully find useful.
DFS has many uses, ranging from keeping content in sync between different physical
sites to giving a single easy-to-remember path that can serve up content from a
variety of folders across a local or wide area network (thus the 'distributed'
in DFS).
DFS in its simplest form is a way to have a single friendly UNC path on your network
which can have folders distributed across multiple servers. This friendly UNC path
will be permanent while the real folders that it accesses behind the scenes can
be almost anywhere. Subfolders can point to completely different locations on disk
or to different servers on your network. This flexibility is great for our webfarm
situation and allows a primary and at least one backup server to handle the content
with a clean failover solution in the event that the primary server fails.
Installation
The installation is fairly straightforward once you understand the concepts. Partial
DFS functionality is already installed on Windows Server 2003. The replication side
of things needs to be installed separately. As long as you’ve upgraded to Windows
Server 2003 R2 you can install this from Add/Remove programs and the Distributed
File System category. I recommend installing all 3 optional features as the extra
management tools are better for managing your redundant DFS system. This needs to
be installed on the servers hosting the namespaces and the folder targets if you
will use replication.
The extra replication features of R2 do require Active Directory changes. If you
have already upgraded your domain controllers to R2, then no additional action is
required. If you haven't upgraded your domain controller to R2, no worries, you
aren't required to do so, but you do need to extend the schema.
Here is a link on how to do that.
Like anything of this nature, make sure to have a good disaster recovery plan in
place and do this at a non-peak time. But the schema installation is straightforward
and doesn't cause any interruption of service in Active Directory.
Once installed, there are three hotfixes that should be installed, located
here. One is required for the client
failback feature to fail back to the primary content server when it's back online
after a failure, another allows you to have multiple domain-based DFS namespaces
on Windows Server 2003 Standard Edition if you desire, and the third supposedly fixes
a potential RPC issue with replication, although I didn't run into this issue. KB
Article 898900 needs to be installed on all of the servers accessing DFS (the web
nodes). The other two need to be installed on the DFS content servers.
Configuration
You have two graphical tools to use at this point; both support most features. My
preference is the DFS Management tool which is available after the Add/Remove programs
step above. You'll find this in Administrative Tools.
There are 3 terms/levels to take note of: Namespace, Folder and Folder Target. This
terminology changed with R2, so don’t get confused with terms you used in the past.
Top Level - Namespace
A namespace is a container to hold the
folder and replication settings. The path to the namespace might be something like
\\Domain\Webfarm. You can have multiple namespaces per server.
Second Level - Folder
A folder is a virtual DFS folder which can
have one or more folder targets. The name of the folder is what is used in the UNC
path. For example \\Domain\Webfarm\Site1, where Site1 is the Folder.
Third Level - Folder Target
A folder target is the real location
of the content. This path is masked though and not seen in the DFS UNC path.
You can have multiple folder targets which point to different physical locations.
There are various options to determine which folder target is used, but in our case
we want to always point to a primary content server and only fail over to the backup
content server when the primary server is unavailable.
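The three levels can be pictured as a small data model. The following is only an illustrative sketch in Python, not how DFS stores its metadata; the class and method names are hypothetical, and the server names are the example names used elsewhere in this article.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Folder:
    """Second level: a virtual DFS folder whose name appears in the UNC path."""
    name: str
    targets: List[str] = field(default_factory=list)  # third level: real locations


@dataclass
class Namespace:
    """Top level: the container, e.g. \\\\Domain\\Webfarm."""
    path: str
    folders: Dict[str, Folder] = field(default_factory=dict)

    def add_folder(self, name: str, targets: List[str]) -> None:
        # Folder names are matched case-insensitively, like Windows paths.
        self.folders[name.lower()] = Folder(name, list(targets))

    def targets_for(self, unc_path: str) -> List[str]:
        """Return the folder targets hidden behind a DFS path."""
        prefix = self.path.lower() + "\\"
        if not unc_path.lower().startswith(prefix):
            raise ValueError(f"{unc_path} is not under {self.path}")
        folder_name = unc_path[len(prefix):].split("\\")[0].lower()
        return list(self.folders[folder_name].targets)


webfarm = Namespace(r"\\Domain\Webfarm")
webfarm.add_folder("Site1", [r"\\contentserver1\Site1", r"\\contentserver2\Site1"])
for target in webfarm.targets_for(r"\\Domain\Webfarm\Site1"):
    print(target)
```

The client only ever sees the namespace path and the folder name; the list of targets stays behind the scenes.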
Active Directory comes into play too with domain-based namespaces but management
is still done from DFS Management.
Redundancy
Here's where it gets fun. To have everything fully redundant in the event that a
server fails, every part of this needs to be mirrored. I'll discuss the various
levels of redundancy here.
Namespace
The namespace server holds the metadata for the namespace. Be sure that this doesn't
depend on a single server. The data stored here is often pretty small unless you
have hundreds or thousands of folders in the namespace, so a dedicated server isn't
necessarily required for this role as long as the namespace server can always respond
quickly to any queries. The namespace servers can be the same servers as your content
if you want.
To create a mirrored copy of the namespace, in the DFS Management tool, right-click
on the Namespace and click on "Add Namespace Computer". Here you can point to an
existing share on a different server or create a new share.
Folder Target
DFS masks which server is used for the folder target. To fully use DFS in this situation,
you will need to point to multiple folder targets. In my situation, I want to have
one server always used as long as it's available. I don't want to hit a random server
because there could be data integrity issues. DFS replication is good, but it doesn't
handle data locking or data write-through. This means that there could be a delay
from when something is written on disk until it has replicated to all other servers.
For that reason, I only want to fail over when absolutely necessary.
To achieve this there are a few things that are necessary:
- The failback hotfix mentioned above needs to be installed.
- All webfarm nodes need to be running Windows Server 2003 SP1 or later.
- The caching duration for the folders needs to be changed. The default is 1800 seconds
(30 minutes) which is too long for our situation. That means that fewer requests
are made to the namespace server, but it also means that the failback could take
up to 30 minutes after the primary server is back online. You can update this by
right-clicking on the folder in "DFS Management", going to properties and then the
Referrals tab. Make sure to do this on each new folder. You can also change the
cache duration on the namespace, but the default is already 300 seconds (5 minutes).
- In the Referrals tab of the namespace properties, check the "Clients fail back to
preferred targets" checkbox.
- In the Referrals tab of the folder properties, check the "Clients fail back to preferred
targets" checkbox.
- On the properties of the primary folder target, in the Advanced tab, enable "Override
referral ordering" and select "First among all targets".
- On the properties of the backup folder targets, in the Advanced tab, enable "Override
referral ordering" and select "Last among all targets".
Now you have a primary/backup server configuration that will always use the primary
server as long as it is available.
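To make the effect of these settings concrete, here is a minimal Python sketch of the selection behavior they are meant to produce. It is a simplified model under my own assumptions, not the actual DFS client algorithm: the names Target, choose_target, FIRST, DEFAULT and LAST are hypothetical stand-ins for the referral ordering options, and the availability check is just a function you pass in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-ins for "First among all targets", the default
# ordering, and "Last among all targets".
FIRST, DEFAULT, LAST = 0, 1, 2


@dataclass(frozen=True)
class Target:
    path: str
    ordering: int = DEFAULT


def choose_target(
    targets: List[Target],
    is_up: Callable[[Target], bool],
    current: Optional[Target] = None,
    fail_back: bool = True,
) -> Optional[Target]:
    """Pick the target a client should use on its next referral refresh."""
    # Without failback, a client sticks with whatever target still works.
    if current is not None and not fail_back and is_up(current):
        return current
    # With failback, always prefer the best-ordered target that is available.
    for target in sorted(targets, key=lambda t: t.ordering):
        if is_up(target):
            return target
    return None


primary = Target(r"\\contentserver1\Site1", FIRST)
backup = Target(r"\\contentserver2\Site1", LAST)
status = {primary.path: False, backup.path: True}  # primary is down
chosen = choose_target([primary, backup], lambda t: status[t.path], current=primary)
print(chosen.path)
```

In this model the failback checkbox corresponds to fail_back=True: once the primary is available again, the next refresh returns it even though the backup is still working.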
Active Directory
The Active Directory part of things is done automatically and apart from the steps
mentioned already, doesn't need any extra configuration. Just be sure to have redundant
domain controllers in your Active Directory environment.
Links and Paths
There is a growing list of links and paths that can be used for testing purposes.
Let me summarize them here assuming that the folder is called Site1 and the Folder
Targets are also given the same name.
Using the DFS path directly: (DFS level)
\\domain\webfarm\Site1
Accessing directly using the first namespace server: (namespace level)
\\namespaceserver1\webfarm\Site1
Accessing directly using the second namespace server: (namespace level)
\\namespaceserver2\webfarm\Site1
Accessing content directly on primary server without using DFS: (folder target level)
\\contentserver1\Site1
Accessing content directly on second server without using DFS: (folder target level)
\\contentserver2\Site1
Notice that it’s the DFS path (\\domain\webfarm\Site1) which will be used
on the web servers and for most uses. It will always be the same, regardless of
how the namespace servers or folder targets change over time. The other paths are
for testing and troubleshooting and could change over time.
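If you have many folders to check, the paths above can be generated rather than typed. This is a small Python sketch using the example names from this section; the function names unc and troubleshooting_paths are my own, and it assumes the folder targets are shared under the same name as the folder.

```python
from typing import List, Tuple


def unc(*parts: str) -> str:
    """Join parts into a UNC path, e.g. unc("a", "b") gives \\\\a\\b."""
    return "\\\\" + "\\".join(parts)


def troubleshooting_paths(
    domain: str,
    root: str,
    folder: str,
    namespace_servers: List[str],
    content_servers: List[str],
) -> List[Tuple[str, str]]:
    """List every path that reaches the same content, labeled by level."""
    paths = [("DFS level", unc(domain, root, folder))]
    paths += [("namespace level", unc(s, root, folder)) for s in namespace_servers]
    paths += [("folder target level", unc(s, folder)) for s in content_servers]
    return paths


for level, path in troubleshooting_paths(
    "domain", "webfarm", "Site1",
    ["namespaceserver1", "namespaceserver2"],
    ["contentserver1", "contentserver2"],
):
    print(f"{level}: {path}")
```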
Content Replication
With R2, DFS replication uses what is called Remote Differential Compression (RDC),
which sends only the changed portions of a file rather than the entire file across
the wire. This is especially handy when replicating across a wide area network, but
it's also good for this situation.
If you set up two or more folder targets using DFS Management, the wizard should
have asked you if you want to set up replication, but if you did things in a different
order, you can set it up manually after the fact. This can be done using the DFS
Management tool as well.
Changes aren't replicated between the servers immediately, so DFS doesn't work well
for transactional data where both servers need to be 100% in sync within a couple
of seconds of each other. But for a website situation that is mostly read-intensive,
DFS works great.
You have a few topology options, but in our situation we'll use full mesh, which
means that every server replicates with every other server. This means that in a
failure situation, the content changes made on the backup server will push back to
the primary server when it is online again.
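As a rough illustration of what full mesh implies, the Python sketch below lists the one-way replication connections between members. It is only a model of the topology, not something DFS Management exposes, and the function name full_mesh is hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def full_mesh(members: List[str]) -> List[Tuple[str, str]]:
    """Return every (sender, receiver) pair: each member replicates to all others."""
    return list(permutations(members, 2))


for sender, receiver in full_mesh(["contentserver1", "contentserver2"]):
    print(f"{sender} -> {receiver}")
```

With two content servers this gives one connection in each direction, which is why changes made on the backup during an outage flow back to the primary. The count grows as n * (n - 1), so three members would need six connections.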
How Good Is It?
DFS failovers are pretty impressive. If the primary content server becomes unavailable,
DFS will fail over to the backup content server within a few seconds. In
this webfarm situation, almost every time that the primary server fails, the HTTP
protocol will retry for a few seconds until IIS is able to serve up a successful
page.
This means that there is effectively zero downtime if the primary content server
fails. The only issue I ran into in testing was when a page that uses master pages
or web controls was half loaded at the moment the primary server failed. ASP.NET
could potentially process half of the page and fail processing the rest. But this
is pretty rare and I would say that the failover is as close to perfect as can be.
A failure of the namespace server is even smoother, resulting in no noticeable downtime
or slowness.
File Change Notification in ASP.NET
There is one thing to keep in mind during a failover and failback situation. ASP.NET
and IIS use what is called File Change Notification (FCN) to let IIS know of any
changes to files. For example, if you add a new .dll to your /bin folder, ASP.NET
will recycle the AppDomain and reload and recompile some of the site. During a failure,
although the switchover is smooth, it does take a few seconds, which is abrupt enough
for IIS and ASP.NET to reestablish the File Change Notification handle using the
different content server.
The issue comes with the failback. The failback is so smooth that the File Change
Notification isn't updated back to the restored server. This means that if you make
any changes to ASP.NET files on the restored content server, the changes aren't
noticed by IIS and ASP.NET. Even deleting the entire /bin folder won't be recognized
by ASP.NET if the site was visited and cached while running on the backup server.
Static pages don't have this issue, but the caching in ASP.NET makes this a problem.
At the time of this writing, I'm working with Microsoft Product Support Services
(PSS) to try to find a good solution for this. To resolve it, simply recycle the
app pool of the site(s) and it will start to function normally again. So, this isn't
necessarily a show-stopper but it is something to keep in mind with the failover/failback.
Caching and DFS
DFS client computers (webfarm nodes in this case) cache the DFS information for
the length of time that you specify, as I mentioned already. This shouldn't be too
low or you will have too much traffic to the Namespace server, but it shouldn't
be too high or changes to the namespace and failbacks to a restored server will
take a long time to be noticed. It is up to your environment what you want to set
this at, but in every situation, it's important to know that there is some caching
that takes place.
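A back-of-the-envelope calculation can help when picking a value. The Python sketch below rests on a simplifying assumption of mine: every web node refreshes every cached folder referral once per cache duration, so the request figure is an upper bound rather than a measurement. The function name and the node and folder counts are made up for illustration.

```python
from typing import Dict


def referral_tradeoff(cache_seconds: int, nodes: int, folders: int) -> Dict[str, float]:
    """Estimate both sides of the cache duration trade-off."""
    return {
        # A client may keep using its cached referral until it expires.
        "worst_case_failback_seconds": float(cache_seconds),
        # Upper bound: each node refreshes each folder once per cache period.
        "referral_requests_per_hour": nodes * folders * 3600 / cache_seconds,
    }


for seconds in (1800, 300, 60):
    print(seconds, referral_tradeoff(seconds, nodes=4, folders=10))
```

Under these assumptions, dropping from the 1800 second default to 300 seconds cuts the worst-case failback delay from 30 minutes to 5 minutes while multiplying referral traffic by six.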
Make sure to keep in mind that adding a new folder to your DFS namespace won't be
noticed immediately. You can force the DFS client cache to be cleared by running
dfsutil /PktFlush on the client (each web node in this case). dfsutil.exe is a tool
that is available in the \support\tools folder of the Windows Server 2003 installation
CD. I simply copy that file to C:\Windows\System32 and I can run dfsutil from the
command prompt.
When setting up new sites, make sure to wait until the new site has been recognized
by all of the webfarm nodes, or force a cache flush from all of the nodes before
attempting to set up or update the site.
Backups of the Namespace
Make sure to make regular backups of your Namespace. This can be done easily using
DFSUtil. Simply export to an .xml file on a regular basis and have your backup process
back up that file. An example of the syntax needed is:
dfsutil /root:\\OW\webfarm /export:c:\NameSpaceBackups\webfarmroot.xml
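If you want each export to land in its own dated file, a small wrapper can build the command line for you. This is a sketch in Python that only uses the /root: and /export: switches shown above; the file naming scheme and the function names are my own assumptions, and it presumes dfsutil is on the PATH of the machine where it runs.

```python
import datetime
import subprocess
from pathlib import PureWindowsPath
from typing import List, Optional


def export_command(
    root: str, backup_dir: str, today: Optional[datetime.date] = None
) -> List[str]:
    """Build a dfsutil export command that writes to a dated .xml file."""
    today = today or datetime.date.today()
    # \\OW\webfarm becomes OW_webfarm so it is safe to use in a file name.
    name = root.strip("\\").replace("\\", "_")
    target = PureWindowsPath(backup_dir) / f"{name}_{today:%Y%m%d}.xml"
    return ["dfsutil", f"/root:{root}", f"/export:{target}"]


def run_export(root: str, backup_dir: str) -> None:
    """Run the export; raises CalledProcessError if dfsutil reports a failure."""
    subprocess.run(export_command(root, backup_dir), check=True)


print(" ".join(export_command(r"\\OW\webfarm", r"c:\NameSpaceBackups")))
```

Scheduling run_export daily and letting your normal backup process pick up the folder keeps a history of namespace exports instead of a single overwritten file.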
I did run into something when importing the namespace. I received the following
error:
System error 1168 has occurred.
Element not found.
After some research and stumbling through it, I found out that I was using the domain
name 'orcsweb.com' instead of NetBIOS name 'OW' in the UNC path, which the import
didn't like. OW is used by DFS in this case. The export worked with either name,
but the import only worked with \\OW\ which is what was in the exported XML file.
Links and Resources
Here are a number of resources that I've found helpful:
Microsoft DFS Landing page
DFS hotfixes, post R2
DFSUtil Examples
Whitepaper on Designing Distributed File Systems
There is a lot to consider with DFS and I've only scratched the surface,
but I hope that this has been helpful to cover a few common configuration settings
that are required for configuring DFS on Windows Server 2003 R2 in a webfarm situation.
Scott Forsyth is Director of IT at
ORCS Web, Inc. - a company that provides managed complex hosting for
clients who develop and deploy their applications on Microsoft Windows platforms.