Distributed Computing and PowerShell
A great little script request came across Spiceworks last week, something I've been looking forward to for a long time but never really thought I'd get a chance to do. Time to unlock the power of Remoting in PowerShell and dive into true Distributed Computing: not multi-threading, but Distributed Computing!
Scoping out a Script
It was a simple enough request: a user has a rendering program that takes advantage of distributed computing to chunk out pieces of a picture and have PCs that aren't that busy do the work. The problem is the program isn't too good at keeping track of things and often leaves orphaned renders on the distributed PCs. We just need to hit those PCs, locate the render file, and copy it back up to the server where a manual "re-stitch" can be done.
At first glance we could just take a list of the computer names, loop through them, and use Get-ChildItem to scan the directory on each one, copying anything found to the file server. That would work just fine, but it would be pretty slow, especially since these files can get as big as 500 MB. How do we speed it up? The next thought is multi-threading, and that would work too, but multi-threading big file copies isn't really going to save all that much time: all of the data still has to flow through the PC running the script, so it would be faster but still not as efficient as I'd like.
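Roughly, that first serial version would look something like this (a sketch only; the computer list, the .tga filter, and the paths here are placeholder assumptions, not the actual script):

# Serial sketch: scan each PC's render folder over the admin share,
# one computer at a time, and copy anything found back to the server.
$Computers  = Get-Content .\computers.txt       # hypothetical list of PC names
$SearchPath = 'c$\Renders'                      # hypothetical render folder
$CopyPath   = '\\RenderServer\Renders'          # hypothetical destination share

foreach ($Computer in $Computers)
{
    Get-ChildItem "\\$Computer\$SearchPath" -Filter *.tga -Recurse |
        Copy-Item -Destination $CopyPath
}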
The next thought, then, is why have the PC running the script do all the work? Why not have each individual PC scan its own files and copy them up to the server? This seems much better! But it's not without flaws either. While we'd have some great distributed computing going on, all of these PCs (as many as thirty!) would be copying their files at the same time. Now we've placed the burden on the render server, and unless it's a server truly designed for massive file serving we're going to run into the same problem the multi-threaded script would have.
That means we’ll have to throttle how many distributed computers can be working on this task at a time. This will take some thinking.
Invoke-Command
I knew I was going to have to use Invoke-Command, since we're using Remoting to accomplish this task, and it turns out that Invoke-Command has a cool little parameter called -AsJob. With this parameter PowerShell will automatically turn the command into a PowerShell job and submit it into the background. This is the point where the whole script fell into place for me. I've done lots of multi-threading with jobs, including throttling how many background jobs can run at a time. By controlling how many jobs get run, we control how many remote computers are copying files at the same time.
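Here's the shape of that throttle pattern from the multi-threading posts (a minimal sketch; $MaxJobs, $Scriptblock, and the other names are stand-ins for the real script):

$MaxJobs = 5    # hypothetical cap on simultaneous background jobs

foreach ($Computer in $Computers)
{
    # Wait for a free slot before submitting another job
    while ((Get-Job -State Running).Count -ge $MaxJobs)
    {
        Start-Sleep -Seconds 2
    }
    Start-Job -ScriptBlock $Scriptblock -ArgumentList $FileName, $SearchPath, $CopyPath
}
Get-Job | Wait-Job | Receive-Job    # collect the output once everything finishes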
But if we're going to use Remoting, we need to make sure our network is set up for it. First, all of the computers in question have to have PowerShell installed. On Vista and higher that's no problem because PowerShell comes pre-installed, but XP will need to have it installed. Luckily we can accomplish this pretty easily with WSUS, Group Policy, or logon scripts.
Next we need to make sure Remoting is enabled on all of the machines, and luckily I had already written a "How-To" on Spiceworks covering exactly that (link here).
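If you'd rather do it by hand, it boils down to one elevated command per machine (Group Policy is the tidier way to push it out at scale):

# Run from an elevated prompt on each remote PC
Enable-PSRemoting -Force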
Last, we need the multi-threading code I've already talked about here. As it turned out, I ended up changing ONE LINE of code to turn the script from a multi-threading script into a Distributed Computing one! One line.
The only change I had to make was swapping Start-Job (see the "Submit the Job" section of the Multi-Threading Revisited post) for Invoke-Command, like so:
Invoke-Command -ComputerName $Computer.Name -ScriptBlock $Scriptblock -ArgumentList $FileName,$SearchPath,$CopyPath -AsJob
The -ComputerName parameter submits the job to that computer, while the -ScriptBlock and -ArgumentList parameters remain the same. Last, you add the -AsJob parameter to make the command a background job, and all of the other code we've used before to monitor and control background jobs works exactly like it did before. And you've done it. You're now using Distributed Computing with PowerShell, and it simply couldn't have been easier.
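For reference, the scriptblock each remote PC runs looks roughly like this (a sketch; the Param names match the -ArgumentList above, but the search logic is a placeholder for the real script):

# Runs ON the remote PC: find the orphaned render locally
# and push it up to the server's share.
$Scriptblock = {
    Param ($FileName, $SearchPath, $CopyPath)
    Get-ChildItem -Path $SearchPath -Filter $FileName -Recurse |
        Copy-Item -Destination $CopyPath
}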
When Can You Use It?
For me this was the biggest problem. I've been wanting to write a script like this ever since the potential of Remoting hit me several months ago. But to be honest, I've just never had a workload that lent itself to this kind of work. Sure, I could farm out Active Directory updates to several computers in the IT department, but it would be more for show and tell than actual needed workload relief. I was so glad when this script request came along that I could finally give it a go. And to discover as I wrote it that I already had all of the control mechanisms written, and that they didn't even require any adaptation, was amazing!
If you come up with a great way to use Distributed Computing at your workplace, let me know. I'd love to hear about it!
Pretty cool, Martin.
In the case of the file upload problem, you might want to look into BITS (the BITS module ships starting with PS 2.0): http://technet.microsoft.com/en-us/library/dd819415.aspx and http://technet.microsoft.com/en-us/library/dd819420.aspx. Starting the upload on the remote computer using BITS will push it into the background on that remote computer, which provides some additional (automatic) load management for the destination server.
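Something like this inside the remote scriptblock, roughly (a sketch of the idea; the paths are placeholders, and an -Asynchronous BITS job would additionally need Complete-BitsTransfer when it finishes):

# Sketch: hand the copy off to BITS so it manages bandwidth itself
Import-Module BitsTransfer
Start-BitsTransfer -Source 'C:\Renders\orphan.tga' `
                   -Destination '\\RenderServer\Renders\orphan.tga' `
                   -Priority Low -DisplayName 'RenderCopy'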
Thanks Art! Sounds like the beginnings of another post!! Frankly, I could use some ideas 🙂
I never got around to fully playing around with it, mostly because we didn't have any large, computationally expensive tasks. I only realized later that I would never get the new managers to approve the purchase orders for software that would generate the data I'd want to crunch, and somehow I think they might fire me if I suddenly opted to make the majority of our computers ready for my experimentation, regardless of whatever lame CPOE crap they're running, ha.
But story aside, if you know about Apache Hadoop, and by extension MapReduce, then it's a little easier:
1. A PowerShell module for Map/Reduce, PsMapRedux:
* http://geekswithblogs.net/dwdii/archive/2012/03/28/mapredux---powershell-and-big-data.aspx
This one has the code sample, the initial motivation, and an overview.
* https://github.com/dwdii/PsMapRedux
The GitHub link. It hasn't been touched much lately, as it covers the idea rather fully and works fine.
* http://en.wikipedia.org/wiki/MapReduce
* http://webmapreduce.sourceforge.net/docs/User_Guide/sect-User_Guide-Introduction-What_is_Map_Reduce.html
* http://ayende.com/blog/4435/map-reduce-a-visual-explanation
In case you didn't know about Map/Reduce: the first is the Wiki, the second is a pretty good separate overview that covers some other material, and the third is my favorite, as it graphically walks through a common task. (There's also a toy word-count sketch at the end of this comment.)
2. A PowerShell module providing an implementation of LINQ within PowerShell (hopefully you already know of it).
** http://josheinstein.com/blog/2010/02/linq-for-powershell/
A collection of functions that work much like LINQ.
** http://en.wikipedia.org/wiki/Language_Integrated_Query
A quick intro Wiki in case you haven't known about or used LINQ before.
3. A PowerShell module for working with HDInsight, Microsoft's Hadoop-based service in the Apache Hadoop hierarchy.
*** http://blogs.msdn.com/b/cindygross/archive/2012/08/23/how-to-install-the-powershell-cmdlets-for-apache-hadoop-based-services-for-windows.aspx
This one is somewhat newer and still needs someone with serious time on their hands to dig into it.
*** http://blogs.msdn.com/b/carlnol/archive/2013/06/07/managing-your-hdinsight-cluster-with-powershell.aspx
This is a more thorough explanation and talks about managing your HDInsight cluster through PowerShell.
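And the toy promised above: the classic word count expressed as a map phase and a reduce phase in plain PowerShell. This is just the shape of the pattern, not PsMapRedux's actual API:

# Map phase: split lines into words and emit (word, 1) pairs
$lines  = 'the quick brown fox', 'the lazy dog', 'the fox'
$mapped = $lines | ForEach-Object { $_ -split '\s+' } |
          ForEach-Object { [pscustomobject]@{ Key = $_; Count = 1 } }

# Reduce phase: group the pairs by word and sum the counts
$mapped | Group-Object Key | ForEach-Object {
    [pscustomobject]@{ Word = $_.Name; Total = ($_.Group | Measure-Object Count -Sum).Sum }
}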