Getting Directory Information Fast
By now, you may have noticed I’m always on the lookout for better-performing code. This has turned out to be a good habit now that I’m working at athena health, as the sheer scale of things is so much larger than at places I’ve worked in the past. One piece I’ve never been able to speed up, though, is iterating through folders and files. Nicolas1847, a PowerShell scripter on Spiceworks, has come up with an ingenious method to get simple directory information using Robocopy (of all things), and a colleague at athena health likes to shell out to CMD.exe and use the old DIR command. But are they faster? And if so, which one?
What’s the Problem?
This might be your first question, and it’s a really good one! The core problem is that Get-ChildItem isn’t the most performant cmdlet in your PowerShell toolkit. Let me outline a couple of problems with it.
- It uses the .NET file handling classes, which don’t support long file names, so nothing more than 260 characters in length.
- Too much data! This isn’t really a problem so much as a consequence of what the cmdlet is doing. When the file system was originally created, many of the object types that we use every day didn’t exist. So the dates in the file system object are not .NET date/time objects, but some other format. That means Get-ChildItem has to convert them, and there are about 6 different dates that have to be converted. And then there are a number of other properties that have to be populated. All of this takes time, and when you start working with truly massive file structures, speed starts to become an issue.
So if we want to limit how much data we’re getting–usually because we don’t need all of it–we have a couple of techniques that will get us very basic information about directory structures very quickly.
Robocopy is a long-standing utility. Finally, in Windows Server 2008 (and Windows Vista on the desktop), Microsoft started packaging Robocopy with Windows instead of making you download it separately in a toolkit. Robocopy is an amazing copy utility that will copy file structures, mirror two folders (even deletions), maintain ACLs and a whole host of other things, including simply listing the files present in a folder structure. But what it returns is a text stream, so we’ll have to parse that and break it up into a custom object we can use.
The most powerful tool for parsing text is Regular Expressions, which I’ve found myself using more and more lately! One of the nice abilities of Regular Expressions (RegEx) is that you can capture a group of text that matches your criteria. Not only that, you can give that group a name, making it much easier to locate the information you wanted. To do that you start your group with an opening parenthesis, a question mark and then your group name within angle brackets: (?<groupname>text criteria).
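As a small illustration of named groups, here is a minimal sketch using PowerShell’s -match operator, which fills the automatic $Matches table with one entry per group name. The sample line and the group names Size and Name are made up for the example.

```powershell
# Made-up sample text; any line with a number and a file name would do.
$line = 'size=2048 name=report.txt'

# (?<Size>...) and (?<Name>...) are named capture groups.
if ($line -match 'size=(?<Size>\d+) name=(?<Name>\S+)') {
    # $Matches is populated by -match; named groups are keyed by their name.
    $size = [int64]$Matches['Size']
    $name = $Matches['Name']
}
```

After this runs, $size and $name hold the captured values, with the size already converted to a number.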
Here’s the code to parse the Robocopy text and return our custom object. You can read about the different Robocopy parameters used here.
The script runs Robocopy with a ton of switches, most of which turn off things we don’t need, like the summary, header, retries, etc. After we’ve captured the output of Robocopy into $RoboCopyList, we simply go through it line by line looking for RegEx matches (which at this point should be nearly every line) and extract the folder, date, file size and name of the file, which I then assign to a custom object (transforming the date and file size into their proper types).
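A minimal sketch of the approach just described, not the original script. It assumes Robocopy is run in list-only mode with /NC /NDL /BYTES /FP /TS, so that each line holds a byte count, a yyyy/MM/dd HH:mm:ss timestamp and a full path. The function name ConvertFrom-RobocopyListing and the property names are hypothetical.

```powershell
# Hypothetical helper: turns Robocopy list-only output lines into objects.
# Assumed line layout: <whitespace><bytes><whitespace><date time><whitespace><full path>
function ConvertFrom-RobocopyListing {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [string]$Line
    )
    begin {
        # Named groups: size, timestamp, then the full path split at its
        # last backslash into folder and file name.
        $pattern = '^\s*(?<Size>\d+)\s+' +
                   '(?<Date>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+' +
                   '(?<Folder>.*)\\(?<Name>[^\\]+)$'
        $culture = [System.Globalization.CultureInfo]::InvariantCulture
    }
    process {
        # Lines that do not look like a file entry are silently skipped.
        if ($Line -match $pattern) {
            # [pscustomobject] needs PowerShell 3.0 or later.
            [pscustomobject]@{
                Folder        = $Matches['Folder']
                Name          = $Matches['Name']
                Length        = [int64]$Matches['Size']
                LastWriteTime = [datetime]::ParseExact(
                    $Matches['Date'], 'yyyy/MM/dd HH:mm:ss', $culture)
            }
        }
    }
}

# Example call on Windows. NULL is just a dummy destination, since /L only lists:
# robocopy 'C:\Temp' NULL /L /S /NJH /NJS /NC /NDL /BYTES /FP /TS /R:0 /W:0 |
#     ConvertFrom-RobocopyListing
```

Splitting the path inside the same pattern keeps everything in one match per line. If your Robocopy build prints dates in a different layout, the Date group and the ParseExact format string are the two places to adjust.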
There’s Dir and then there’s Dir
The second option is to shell out to CMD and run the old-fashioned DIR command. Remember that we can’t just use DIR in PowerShell, since that’s actually an alias for Get-ChildItem, which would pretty much kill the whole point of these tests! DIR requires a couple of extra switches too, to make sure we get the output as close to what we want as possible. Extracting the folder took a couple of extra hoops to jump through, since DIR places the path on a separate line of text. I had to do an extra RegEx match to extract the folder name and then go through all of the files in that folder.
Here’s the code that parses the text returned from DIR, and you can read about the various parameters that go with DIR here.
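A minimal sketch of the two-step matching described above, not the original script. It assumes English (US) DIR output, with "Directory of <path>" header lines and file lines such as "01/15/2014  03:45 PM  1234 app.log", and it assumes /-C is used so sizes have no thousands separators. The function name ConvertFrom-DirListing is hypothetical.

```powershell
# Hypothetical helper: parses the text printed by CMD's DIR command.
# Assumes US-style dates (MM/dd/yyyy hh:mm AM/PM) and sizes without commas.
function ConvertFrom-DirListing {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [string]$Line
    )
    begin {
        $folder  = $null
        $culture = [System.Globalization.CultureInfo]::InvariantCulture
        $folderPattern = '^\s*Directory of (?<Folder>.+)$'
        $filePattern   = '^(?<Date>\d{2}/\d{2}/\d{4})\s+' +
                         '(?<Time>\d{2}:\d{2} [AP]M)\s+' +
                         '(?<Size>\d+)\s+(?<Name>.+)$'
    }
    process {
        if ($Line -match $folderPattern) {
            # Remember the folder; the file lines that follow belong to it.
            $folder = $Matches['Folder']
        }
        elseif ($Line -match $filePattern) {
            $stamp = '{0} {1}' -f $Matches['Date'], $Matches['Time']
            # [pscustomobject] needs PowerShell 3.0 or later.
            [pscustomobject]@{
                Folder        = $folder
                Name          = $Matches['Name']
                Length        = [int64]$Matches['Size']
                LastWriteTime = [datetime]::ParseExact(
                    $stamp, 'MM/dd/yyyy hh:mm tt', $culture)
            }
        }
    }
}

# Example call on Windows (/s recurse, /a-d skip directories, /-c no thousands
# separators, /t:w use the last-written time):
# cmd /c dir 'C:\Temp' /s /a-d /-c /t:w | ConvertFrom-DirListing
```

Because DIR formats dates according to the system locale, the two patterns and the ParseExact format string would need adjusting on machines that are not set to US English.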
Just to be thorough, and to have a control, here is the code I used for Get-ChildItem (quite a bit simpler, isn’t it?!)
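A control along these lines might look like the sketch below, which is not the original snippet. The function name Get-FileListing and the choice of output properties are assumptions, picked to line up with what the text-parsing techniques return.

```powershell
# Hypothetical control: the same four properties, straight from Get-ChildItem.
function Get-FileListing {
    param([string]$Path)

    Get-ChildItem -Path $Path -Recurse |
        Where-Object { -not $_.PSIsContainer } |   # files only; works on PowerShell 2.0
        Select-Object @{ Name = 'Folder'; Expression = { $_.DirectoryName } },
                      Name, Length, LastWriteTime
}

# Example:
# Get-FileListing -Path 'C:\Temp'
```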
And the Winner is…
To get a good measurement of which technique is faster, I’ll have to wrap each one in the Measure-Command cmdlet and run them a few times. The winner?
Get-ChildItem under PowerShell 4.0 is surprisingly competitive here, but it is clearly the slowest technique. DIR is the fastest, and on average is around 50 milliseconds faster than Robocopy in just about every test I ran (although Robocopy will pull out a win every now and then). Let’s run it against the entire C: drive of my laptop and see what we get:
Pretty much as expected, right? DIR continues to take the lead with Robocopy fast on its heels and Get-ChildItem making a surprisingly strong push. What I have observed though, is that Get-ChildItem will bog down when you get up over the 5000 mark, but that was with PowerShell 2.0 so perhaps the new 4.0 cmdlet is a bit better?
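If you want to repeat the comparison on your own folders, the timing approach can be sketched roughly as follows. The helper name Measure-AverageMilliseconds and the default of five runs are assumptions for illustration, not the harness used for the numbers above.

```powershell
# Hypothetical helper: runs a script block several times with Measure-Command
# and returns the average elapsed time in milliseconds.
function Measure-AverageMilliseconds {
    param(
        [scriptblock]$ScriptBlock,
        [int]$Runs = 5
    )
    $times = foreach ($i in 1..$Runs) {
        (Measure-Command -Expression $ScriptBlock).TotalMilliseconds
    }
    ($times | Measure-Object -Average).Average
}

# Examples (Windows):
# Measure-AverageMilliseconds { Get-ChildItem 'C:\Temp' -Recurse }
# Measure-AverageMilliseconds { cmd /c dir 'C:\Temp' /s /a-d /-c }
```

For a fair comparison, the script block for DIR or Robocopy should include the text parsing as well as the command itself, since building the objects is part of the cost.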
As is always the case with PowerShell, it depends. Clearly Get-ChildItem is the easiest to work with, and it provides the greatest amount of information to you as a scripter. In almost all cases this is what you’ll use because on smaller directory structures the performance differences are minor–I believe 100 milliseconds is the bare minimum that most people can even perceive–and the ease of running the cmdlet and the information it provides easily make it the favorite.
But if you are working with gigantic file structures and only need the most basic of information then one technique you can use to speed up your script is the DIR command and RegEx to parse the text for you. It’s a bit more involved but it can provide you with just about all the same information that Get-ChildItem can and it will do it MUCH faster. You can even get the creation dates and last accessed dates.
Robocopy is a great alternative, but it has some limitations. It’s not the fastest (by a fraction), but the output is a little easier to deal with than DIR’s. It’s also limited in the information it gives you: basically you can pull out folder, name, size and last written date, and that’s all. If you’re already using Robocopy in your code, DIR isn’t enough of a performance improvement to justify a rewrite, but if you’re starting from scratch I don’t see any reason to use Robocopy when you can use DIR instead. Also, if you’re running your script on an older system, Robocopy may not be there for you and would require an install (it’s scary how many Windows 2003 servers are still out there and going strong!).