More on String Performance
Was reading another blog from Tome Tanasovski and he was doing a simple series on learning Powershell. One of his suggestions was taking a large dictionary file and finding all of the five letter words and locating the Palindromes. I thought this would be interesting, and another opportunity to look at performance and Powershell.
Locating 5 Letter Words
Assuming you got the dictionary text from Tome’s web page (here if you didn’t), how do we locate all of the 5 letter words? A common property of any string is length and we can simply check that property against every word in the dictionary. But as we discovered in this post, using Powershell to do a massive amount of string searches isn’t very efficient and I wondered if we could write a RegEx that would perform better? My RegEx foo is not very strong, but I managed to come up with “^…..$” where the carot designates the beginning of the string, and the “.” means any character, repeated 5 times comes and then a “$” to tell RegEx to look no further. I then fired up Expresso and discovered I could also notate it like so: “^.{5}$”.
One other thing, when using the Get-Content cmdlet you don’t have to load the full results into a variable. In fact, there will definitely be times when you don’t want to do it. You just want to scroll through the file, get the data you want and get out again. In Powershell you do this by piping the results of Get-Content into another cmdlet. Powershell will now take an entry from Get-Content and send it down the line before loading the next line, and so on.
$Count = 0 Measure-Command { Get-Content C:\Utils\dictionary.txt | Where { $_.Length -eq 5 } | % { $Count++} } Write-Host "Number 5 letter words found: $Count" $Count = 0 Measure-Command { [regex]$Search = "^.{5}$" Get-Content C:\Utils\dictionary.txt | Where { $_ -match $Search }| % { $Count++} } Write-Host "Number 5 letter words found: $Count"
And the results are:
Days : 0 Hours : 0 Minutes : 0 Seconds : 12 Milliseconds : 502 Ticks : 125025185 TotalDays : 0.000144705075231481 TotalHours : 0.00347292180555556 TotalMinutes : 0.208375308333333 TotalSeconds : 12.5025185 TotalMilliseconds : 12502.5185 Number 5 letter words found: 8636 Days : 0 Hours : 0 Minutes : 0 Seconds : 12 Milliseconds : 63 Ticks : 120636685 TotalDays : 0.000139625792824074 TotalHours : 0.00335101902777778 TotalMinutes : 0.201061141666667 TotalSeconds : 12.0636685 TotalMilliseconds : 12063.6685 Number 5 letter words found: 8636
There are 80,368 records in the dictionary folder, and it took both versions of the code about 12 seconds to go through them all and locate the 5 letter words. Using RegEx I was able to shave off about half a second off of the search, so if we extrapolate that out to 1,000,000 records we’d end up saving about 25 seconds, not too bad. Still, not a massive improvement and most scripts you run won’t have this many records so using the .Count property is perfectly valid.
Reverse the string
There are 2 ways to reverse the string, one is a simple loop where you loop through the string backwards and reconstruct it, and another is to use a .NET method called Reverse(). The problem with reverse is that it is a method that only works on arrays, not strings so we have to take our string and convert it to an array, reverse it and then convert it back. Here’s an example:
$a = "This is a Test" $arr = $a -split "" [array]::Reverse($arr) $a = $arr -join ""
But I wasn’t sure if that was really faster or not. So time to use the dictionary.txt again and reverse my 8600 5 letter words and measure the results! I decided to load that into a variable outside of the Measure-Command cmdlet, this will give us a clearer idea of how the variable reverse strategies work without adding the dictionary load into the mix. Here’s the testing script I came up with.
[regex]$Search = "^.{5}$" $dic = Get-Content C:\Utils\dictionary.txt | Where { $_ -match $Search } $Number = 0 Measure-Command { $dic | % { $Palindrome = $_ -split "" [array]::Reverse($Palindrome) $Palindrome = $Palindrome -join "" If ($_ -eq $Palindrome) { $Number ++ } } } Write-Host "`n$Number palindromes found" $Number = 0 Measure-Command { $dic | % { $Palindrome = "" $Word = $_ $Word.Length..1 | % { $Palindrome += $Word[$_ - 1] } If ($Word -eq $Palindrome) { $Number ++ } } } Write-Host "`n$Number of palindromes found"
And here’s the results:
Days : 0 Hours : 0 Minutes : 0 Seconds : 0 Milliseconds : 813 Ticks : 8134303 TotalDays : 9.4147025462963E-06 TotalHours : 0.000225952861111111 TotalMinutes : 0.0135571716666667 TotalSeconds : 0.8134303 TotalMilliseconds : 813.4303 24 palindromes found Days : 0 Hours : 0 Minutes : 0 Seconds : 3 Milliseconds : 535 Ticks : 35352974 TotalDays : 4.09177939814815E-05 TotalHours : 0.000982027055555556 TotalMinutes : 0.0589216233333333 TotalSeconds : 3.5352974 TotalMilliseconds : 3535.2974 24 of palindromes found
Pretty impressive results! The [array]::Reverse() method managed to reverse those 8600 strings in about three-quarters of a second, while looping through the string and rebuilding it took a whopping 3 and a half seconds!
Fun little palindrome exercise, and got to learn a little bit more about RegEx and performance using .NET methods.
Happy to inspire a fun diversion into perf. I’m fairly certain it’s the += that is adding the overhead, but I have not had the time to play with it myself.
One note about your post: the code you have posted is losing the quotes “” – they’re coming up as "
sorry, writing & quot is showing " 🙂 I meant to say they’re coming up as & quot (without the space).
LOL, no problem. Yes, WordPress always does that to me and haven’t figured out a fix yet (support didn’t even bother responding). I usually have to go back and re-edit and completely forgot to! Looks fine in preview too, which is a pain!