DFS Replication Monitoring
For the past two years I’ve been working, on and off, on a way to easily monitor my DFS replication tree. I just want to make sure backlogs aren’t getting stuck and so forth, and the Microsoft tools are pretty unwieldy for that. This sent me on a journey into DFS, how it works, and how I can better monitor it. Read on to see how it went!
Version 2.6 release: An update to this post: I’ve recently released version 2.6 of DFS Monitor with History, which now saves the data in XML format. Read more about it here.
My first attempt was the DFS Replication Monitor written in VBScript. You can find it here. It’s a good script and recently updated based on some Spiceworks community feedback. But it’s just a snapshot view of DFS as it was at the moment the script ran. It also shells out to DFSRDIAG.exe, so it requires some setup before it will run properly.
Time for a better solution, and I think this new script, DFS Monitor With History, is the answer. It was an interesting experience getting this script written! A lot of work, a lot of testing, then a lot more work and a lot more testing! I want to thank my friend ChristopherO for helping me out with the testing; it was his environment that kept breaking the script that had been working perfectly!
I’m calling this script version 2.0, but in reality there were a couple of completely different scripts in between DFS Monitoring Widget and this one. Since they will never see the light of day, this one gets the clean slate that is 2.0! This new script had to accomplish several goals:
- Had to be easier to set up than DFS Monitoring Widget
- Eliminate using DFSRDIAG.exe. I knew this was possible when I ran across this script: Get-DFSRBacklog.ps1. To be honest, my first rewrite borrowed heavily from it, and I still use the general structure, but with some pretty heavy rewriting (see the sketch just after this list for the core of the technique).
- Increase the reliability–I’ll get into this later.
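Boiled down, the Get-DFSRBacklog.ps1 technique is: ask the receiving member for its version vector, then ask the sending member how many files it still owes against that vector. Here’s a minimal sketch of that idea; $SendingMember, $ReceivingMember, and $FolderName are placeholders I’ve made up, and the real script discovers groups and members from WMI as well:

```powershell
# Minimal sketch: DFSR backlog straight from WMI, no DFSRDIAG.exe required.
# $SendingMember, $ReceivingMember and $FolderName are placeholders.
$ns = "root\MicrosoftDFS"

# The receiving member's view of the replicated folder
$inFolder = Get-WmiObject -Namespace $ns -ComputerName $ReceivingMember `
    -Query "SELECT * FROM DfsrReplicatedFolderInfo WHERE ReplicatedFolderName='$FolderName'"

# Ask the receiver for its version vector...
$vv = $inFolder.GetVersionVector().VersionVector

# ...then ask the sender how many files it still has to send against that vector
$outFolder = Get-WmiObject -Namespace $ns -ComputerName $SendingMember `
    -Query "SELECT * FROM DfsrReplicatedFolderInfo WHERE ReplicatedFolderName='$FolderName'"
($outFolder.GetOutboundBacklogFileCount($vv)).BacklogFileCount
```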
Goal #1: Make it easy
As I’ve been using PowerShell more and more, it soon became apparent that the original way I was saving data to SQL was perfectly fine, and even has some performance benefits believe it or not, but made setup pretty difficult. I felt this was not only a barrier to using it but kind of defeats the purpose of a script, which is supposed to be a quick, easy way to get something done. So saving the data had to be easy, if not transparent to the user. Then I discovered the Import-Csv cmdlet. With one simple command I could load an entire set of data into an array and manipulate it any way I wanted, and Export-Csv would then save the data back into a comma-separated values (CSV) file. Goal #1 accomplished. Of course, it’s never as easy as all that, but I’ll spare you the details!
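The whole load/save cycle boils down to something like this. A minimal sketch, assuming a file name and columns I’ve invented; the actual script keeps more fields:

```powershell
# Load the entire history into an array of objects (file name is hypothetical)
$History = @(Import-Csv "C:\Scripts\DFSHistory.csv")

# Append this run's results (columns here are illustrative, not the script's)
$History += New-Object PSObject -Property @{
    RunDate = (Get-Date).ToString("s")
    Group   = "RG01/Public"
    Backlog = 42
}

# Prune anything older than the retention window, then save it all back
$History = $History | Where-Object { [datetime]$_.RunDate -gt (Get-Date).AddDays(-14) }
$History | Export-Csv "C:\Scripts\DFSHistory.csv" -NoTypeInformation
```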
On to Goal #2: Begone, foul DFSRDIAG.exe
Reading and playing with Get-DFSRBacklog, I soon figured out how to get my script to read the backlog count, and even the list of backlogged files, directly out of WMI. I wrote the script out, spent several hours working on a bug that turned out to be a PEBKAC, and it all looked good. I was getting good data over and over again, so I was pretty happy. Then I handed it over to ChrisO and bang! Several things happened. First, it was slow, slow, slow. Chris has over 25 remote sites, some of them over very slow links (DSL, baby), and the script was crawling. What’s worse, several of the sites were failing to communicate at all, causing the script to return nothing!
How to handle this? I went back to the drawing board, really studied what I was trying to do, and started seeing some patterns. Instead of making multiple WMI calls, why not make one big one and gather all the data at once? This helped a lot, and my sites ran much quicker, but poor ChrisO was still slow and still getting WMI failures. Then ChrisO asked about multi-threading, and he even found a great link on how to do it! This was very interesting, but would require a massive rewrite. A couple of days later I had a working version of the script using multi-threading, and what a difference! My own sites went from taking about 15 minutes to produce results to less than 5 minutes. I couldn’t wait to see how ChrisO’s sites did.
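(As an aside: the "one big call" consolidation amounts to paying the slow-link round trip once per server instead of once per folder. Something like this sketch, with $Server standing in for the script’s real loop variable:)

```powershell
# One WMI round trip per server: grab every replicated folder at once...
$AllFolders = Get-WmiObject -Namespace "root\MicrosoftDFS" `
    -ComputerName $Server -Class DfsrReplicatedFolderInfo

# ...then filter locally, instead of issuing one remote query per folder
$AllFolders | Where-Object { $_.ReplicatedFolderName -eq "Public" }
```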
Massive improvement for Chris too! Success! Time to get a cold drink, put my feet up, and bask in my brilliance. Or maybe not. The script now ran through all of his sites in less than 10 minutes, but now building the Google visualization was taking hours! What?! I had rewritten this part too, to be more reliable, and it had been working fine on my sites for weeks. WTF?
So: Chris had set his data to save for 14 days, running every hour, for about 45 different replication group/folder pairs, and since each of those has an incoming and an outgoing side, double that. That’s a little over 30,000 records (45 × 2 × 24 × 14). To build the Google visualization I have to query those records for every distinct date and hour (that’s 24 × 14 of them) and every distinct replication group/folder (another 45), so that’s about 15,000 queries. When I used the PowerShell Measure-Command cmdlet I found each query was taking about 1.5 seconds to run, which meant the whole Google visualization was going to take about six hours to build! For a script you want to run hourly, that’s not such a good number.
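If you’ve never used it, this is roughly how Measure-Command gets pointed at a problem like this (variable names here are mine, not the script’s):

```powershell
# Time one of the per-group lookups against the loaded history
# ($Records, $Date and $Group are placeholders for the script's own variables)
$Elapsed = Measure-Command {
    $Records | Where-Object { $_.RunDate -eq $Date -and $_.Group -eq $Group }
}
$Elapsed.TotalSeconds   # ~1.5 seconds against ~30,000 records in Chris' case
```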
The fix here turned out to be simple and effective. Since I was querying against both date and replication group/folder, I simply did a query on the date alone first. Then, when looking for a specific folder, I only had to query against about 45 records instead of 30,000. Suddenly that query was down to something like 0.0025 seconds, and the whole build time for the Google visualization dropped to under 15 minutes.
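In rough pseudo-PowerShell (variable names mine), the change looks like this:

```powershell
foreach ($Date in $Dates) {
    # One pass per date/time slices ~30,000 records down to the few dozen for that hour
    $Slice = $Records | Where-Object { $_.RunDate -eq $Date }

    foreach ($Group in $Groups) {
        # Each group lookup now scans the small slice, not the whole history
        $Row = $Slice | Where-Object { $_.Group -eq $Group }
        # ...emit $Row into the Google visualization's data table
    }
}
```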
Goal #3: Reliability
One thing I kept running into, though not consistently, was reliability. I use PING to test whether a server is even there before I start making WMI calls against it. Why not use the Test-Connection cmdlet, you ask? I would prefer that too, but to be honest, PING gives me more information! So I PING. You probably should too. But occasionally PING would fail, because networks aren’t 100% reliable; sorry if I burst a bubble there, but it’s true. I decided not to let this bother me too much, since this script is meant to run hourly and will pick up any missed data on the next run; honestly, this data isn’t so important that a missed run is a big deal. But I had to modify the script to deal with missed PINGs.
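The check itself is nothing fancy; a sketch, with a made-up server name:

```powershell
# Shell out to ping.exe so the raw output (TTL, latency, etc.) is there to log
$Raw = ping.exe -n 2 FS01    # "FS01" is a placeholder server name

if ($LASTEXITCODE -ne 0) {
    # Network hiccup: note it and skip the WMI calls; the next hourly run catches up
    Write-Warning "FS01 did not answer PING; skipping until next run"
}
```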
Another fact of life is that WMI calls don’t always work, and they return a big fat nothing when that happens. I ended up writing a custom WMI function that, if a call fails, tries up to three times; almost always the next call (or the one after) works. If it still isn’t working after that, the function reports the failure and moves on. I also ended up rewriting the multi-threading to use this new function, so that was a lot more rewriting and testing!
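Boiled down, the retry wrapper is just a loop. A minimal sketch of the idea; the function name and parameter list here are mine, not the script’s:

```powershell
function Get-WmiWithRetry {
    # Hypothetical name and parameters; sketches the retry idea described above
    param([string]$Computer, [string]$Namespace, [string]$Query, [int]$Tries = 3)

    for ($i = 1; $i -le $Tries; $i++) {
        $Result = Get-WmiObject -ComputerName $Computer -Namespace $Namespace `
                                -Query $Query -ErrorAction SilentlyContinue
        if ($Result) { return $Result }   # a failed call usually succeeds on retry
        Start-Sleep -Seconds 2
    }
    Write-Warning "WMI query failed $Tries times on $Computer; moving on"
    return $null
}
```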
Next thing that isn’t as reliable as it could be? Multi-threading. PowerShell does multi-threading with something called jobs. The idea is that you use the Start-Job cmdlet to submit a block of script to run as a background process. You can use Get-Job to monitor its progress, and Receive-Job to retrieve any object it might be returning. As you may know, PowerShell is all about objects; just about every cmdlet you run returns some form of object, and if it doesn’t, it should! In general, any script you write should also return an object, even if it’s just a string saying “Completed!” Of course, this script doesn’t, but hey, I did say “in general.” Back on subject: you can then use Remove-Job to clean up after yourself and remove the job from existence.

Ryan, who I linked to earlier, had a great way of checking whether all the jobs you’ve submitted have finished, using the Count property. Most of the time this works, but there are plenty of times it just doesn’t. The failure is very intermittent, and getting it to happen in testing was just too hard. What does that mean? Basically, I wrote the code to recognize when it’s having a problem retrieving data from a background job and to try again. If it still can’t get the data after three tries, it gives up and moves on. This goes back to the idea that since the data isn’t critical on any given run, we can live with a zero count every now and then. Trying to fully solve this was more work than it was worth!
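Stripped of that retry logic, the job plumbing looks roughly like this (server names are placeholders; the real script wraps the Receive-Job step in the same try-three-times pattern):

```powershell
# Fire off one background job per server
"FS01", "FS02", "FS03" | ForEach-Object {
    Start-Job -Name $_ -ArgumentList $_ -ScriptBlock {
        param($Server)
        # Per-server WMI work goes here; whatever this outputs is the job's result
        Get-WmiObject -Namespace "root\MicrosoftDFS" `
            -Class DfsrReplicatedFolderInfo -ComputerName $Server
    } | Out-Null
}

# Wait for everything to finish -- the Count trick
while (@(Get-Job -State Running).Count -gt 0) { Start-Sleep -Seconds 5 }

# Collect the results and clean up after ourselves
$Results = Get-Job | Receive-Job
Get-Job | Remove-Job
```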
The script has now been working for about 2 weeks at both my sites and ChrisO’s so I feel pretty safe releasing it to the wild. I figure if it can survive Chris’ environment it can handle most anything–and now I’ve jinxed myself.
If you need a link to download DFS Monitor With History, just click here.