The Surly Admin

Father, husband, IT Pro, cancer survivor

Read Text Files Faster than Get-Content

This was a fun little script I threw together after a particular conversation came up at Spiceworks.  If you’ve worked with PowerShell for long, you’ve used Get-Content to read a file.  99% of the time it’s fine and you just continue on with life.  This blog post is about that 1% when Get-Content is SLOW.  The .NET System.IO.StreamReader class is where people turn to speed things up, so I decided to create a function around it that works much like Get-Content does.  This is its story.

Specifications

The overriding requirement of this script was speed.  If it’s not fast then what’s the point?  Second, I wanted to emulate a lot of the functionality of Get-Content.  Get-Content is an amazing cmdlet, but because it’s trying to be so many things to so many different providers it ends up being a bit slow, especially for text files.  So I don’t want to emulate everything it can do, just the things most people use.

Parameters

The parameters I ended up settling on are Path (obviously), TotalCount (read only the first “x” lines of the text file), Tail (read the last “x” lines of the text file) and Raw (read all the lines and output a single string).  I also wanted to fully support the pipeline, including taking input from Get-ChildItem.  Here’s the Param section I ended up with:


[CmdletBinding(DefaultParameterSetName="Normal")]
Param (
    [Parameter(ParameterSetName="Normal",Position=0,ValueFromPipeline)]
    # Path also has to belong to the Raw set, otherwise -Path and -Raw
    # share no parameter set and cannot be used together
    [Parameter(ParameterSetName="Raw",Position=0,ValueFromPipeline)]
    [string[]]$Path,
    [Parameter(ParameterSetName="Pipe",ValueFromPipelineByPropertyName)]
    [string]$FullName,
    [Parameter(ParameterSetName="Raw")]
    [Parameter(ParameterSetName="Pipe")]
    [switch]$Raw,
    [Parameter(ParameterSetName="Normal")]
    [Parameter(ParameterSetName="Pipe")]
    [ValidateScript({ $_ -gt 0 })]
    [int64]$TotalCount,
    [Parameter(ParameterSetName="Normal")]
    [Parameter(ParameterSetName="Pipe")]
    [ValidateScript({ $_ -gt 0 })]
    [int64]$Tail
)

I used parameter sets because some of these parameters shouldn’t be used together.  Raw with Tail or TotalCount, for example: the idea of Raw is to read the whole file in one fell swoop, so it doesn’t make sense to support limiting the lines.  I also wanted to accept paths from the pipeline, whether as plain strings (read from a text file, say) or as file objects, and I wanted to keep the Path parameter name.  In PowerShell certain parameter names have settled into common use, and Path is the accepted one for a file path.  That meant I had to separate Path and FullName into different parameter sets.  FullName is needed to pipe paths from Get-ChildItem.
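To see how the two pipeline routes bind, here is a minimal sketch.  Show-BoundPath is a hypothetical helper that is not part of the script, and the log path is made up; it only reports which parameter set PowerShell chose for each pipeline item.

```powershell
# Hypothetical helper for illustration only; not part of Get-ContentFast
function Show-BoundPath {
    [CmdletBinding(DefaultParameterSetName = "Normal")]
    Param (
        [Parameter(ParameterSetName = "Normal", Position = 0, ValueFromPipeline)]
        [string[]]$Path,

        [Parameter(ParameterSetName = "Pipe", ValueFromPipelineByPropertyName)]
        [string]$FullName
    )
    Process {
        # $PSCmdlet.ParameterSetName holds the set chosen for this item
        If ($PSCmdlet.ParameterSetName -eq "Pipe") {
            "Pipe: $FullName"
        }
        Else {
            ForEach ($Item in $Path) { "Normal: $Item" }
        }
    }
}

# A plain string binds to -Path (the path is made up and need not exist)
"C:\Logs\app.log" | Show-BoundPath

# File objects carry a FullName property, so they bind by property name
Get-ChildItem -File | Show-BoundPath
```

A string has no FullName property, so it can only land in the Normal set, while a file object is not a string and gets picked up through its FullName property instead.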

Begin/Process/End and living in the Moment

This script really just lives in the moment and the current pipeline item, so I ended up not using the Begin or End script blocks and just stuck with Process.  There’s a little validation in there, but the real moment of truth is opening the file using System.IO.StreamReader.  I won’t go into why it’s better, since better writers than I already have, but here’s that key piece of code:


Try {
    Write-Verbose "Reading $PathName"
    $File = New-Object System.IO.StreamReader -ArgumentList $PathName
}
Catch {
    Write-Error "Unable to read $PathName because $($Error[0])"
    Exit 999
}

This includes a little bit of error trapping using Try/Catch.  The next part is where things start getting a little weird, and remember I jumped through these hoops all in the name of speed.
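As a standalone illustration of the StreamReader pattern, here is a small sketch that reads a throwaway temp file line by line.  The file contents, the $Lines variable, and the Finally cleanup are assumptions added for the demo, not taken from the script.

```powershell
# Demo file with a blank line in the middle (contents are made up)
$PathName = [System.IO.Path]::GetTempFileName()
Set-Content -Path $PathName -Value "first", "", "third"

$File = $null
$Lines = New-Object -TypeName System.Collections.ArrayList
Try {
    $File = New-Object -TypeName System.IO.StreamReader -ArgumentList $PathName
    # ReadLine() returns $null only at end of file. A blank line comes back
    # as an empty string, which PowerShell treats as false, so the loop
    # compares against $null instead of testing the line itself.
    While ($null -ne ($Line = $File.ReadLine())) {
        $Lines.Add($Line) | Out-Null
    }
}
Finally {
    # Release the file handle even if reading throws
    If ($File) { $File.Dispose() }
}

Remove-Item -Path $PathName
$Lines.Count
```

The explicit $null comparison matters for files that contain empty lines, and disposing the reader in Finally keeps the file from staying locked if something goes wrong mid-read.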


If ($Raw)
{
    Write-Output $File.ReadToEnd()
}
Else
{
    $RawData = New-Object -TypeName System.Collections.ArrayList
    Switch ($true)

First is the Raw parameter.  Luckily this one is easy: one of the methods StreamReader supports is ReadToEnd(), which reads the whole file into a single string with one command.  Keeping with the pipeline theme and living in the moment, I just write this out to the pipeline and I’m done.

Next come TotalCount and Tail.  One thing I wanted was to be able to use TotalCount and Tail together while still staying in the pipeline moment.  So I needed an If/Then structure that allowed me to do TotalCount (or not) and Tail (or not), and if neither was used, just read the file and output it.  I went with Switch and a lesser-known feature of it.  Normally you give Switch a value to test, and each line after that is a value to match followed by a script block to run, but you don’t have to do it that way.  Instead I hand Switch a value that is always true, in this case $true, and use script blocks as the conditions; a condition script block matches whenever it evaluates to true.  Since Switch tests every condition (unless you use Break to exit the Switch), I get the If/Then structure I need.
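Here is a stripped-down sketch of that Switch pattern with the file reading removed.  Test-SwitchBranches is a hypothetical name used only for this demo; it returns the names of the branches that fire.

```powershell
# Hypothetical demo function; it only reports which Switch branches run
function Test-SwitchBranches {
    Param (
        [int64]$TotalCount,
        [int64]$Tail
    )
    Switch ($true) {
        # A script block condition matches when it evaluates to true,
        # so any non-zero value triggers its branch
        { $TotalCount } { "TotalCount" }
        { $Tail }       { "Tail" }
        # Default runs only when no other condition matched
        Default         { "Default" }
    }
}

Test-SwitchBranches -TotalCount 3           # TotalCount
Test-SwitchBranches -TotalCount 3 -Tail 2   # TotalCount, then Tail
Test-SwitchBranches                         # Default
```

Because there is no Break in the first two branches, both can run in the same call, which is what makes combining TotalCount and Tail possible.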

I also define a System.Collections.ArrayList since I’ll have to store the entire file in memory if Tail is used.  ArrayList is a much faster alternative to the typical PowerShell array when adding items, because += on an array creates a new array and copies every element each time.
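A quick side-by-side of the two ways to collect lines, as a sketch with made-up sample strings:

```powershell
# ArrayList grows in place. Add() returns the index of the new element,
# so the result is piped to Out-Null to keep it out of the pipeline.
$RawData = New-Object -TypeName System.Collections.ArrayList
$RawData.Add("line one") | Out-Null
$RawData.Add("line two") | Out-Null

# A PowerShell array is fixed-size: each += allocates a new array
# and copies the existing elements into it
$Array = @()
$Array += "line one"
$Array += "line two"
```

Both end up holding the same two items; the difference only shows up as the collection gets large, since the copy on every += makes the cost grow with each added line.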


Switch ($true)
{
    {$TotalCount}
    {
        $Count = 0
        # Compare against $null so a blank line ("" is false) doesn't end the loop early
        While ($null -ne ($Line = $File.ReadLine()))
        {
            Write-Output $Line
            $RawData.Add($Line) | Out-Null
            $Count++
            If ($Count -eq $TotalCount)
            {
                Break
            }
        }
        If ($Count -eq 0)
        {
            Write-Warning "$Path was empty"
            Break
        }
    }

First up is TotalCount.  I just read through the lines and output each one to the pipeline.  Once I hit the right number of lines (or the end of the file) I break out of the While loop that’s reading the file.  Thanks to @concentrateddon for the loop structure.  Notice I’m also saving all the lines I read.  I won’t need them if Tail wasn’t used, but if Tail was used I’ll need them later.

The obvious question at this point is why not just read the whole file in and then do the first “x” lines and the last “x” lines?  That’d certainly be easier, but it comes back to the primary purpose of the script. Speed.  By outputting the lines as I go the results come out immediately, no waiting at all.  What if you only wanted the first 3 lines of a 3 million line log file?  Who wants to wait for 3 million lines to load and then only output the first three?!


Switch ($true)
{
    {$TotalCount}
    { ... }
    {$Tail}
    {
        # Compare against $null so a blank line ("" is false) doesn't end the loop early
        While ($null -ne ($Line = $File.ReadLine()))
        {
            $RawData.Add($Line) | Out-Null
        }
        If ($Tail -gt $RawData.Count)
        {
            Write-Warning "Tail = $Tail is larger than file size, reading whole file"
            $Tail = $RawData.Count
        }
        # Start $Tail items before the end and walk to the last index
        For ($i = ($RawData.Count - 1) - ($Tail - 1); $i -le $RawData.Count - 1; $i++)
        {
            Write-Output $RawData[$i]
        }
    }

Having just said that, for Tail we need to read the whole file in so we can go backwards and output just the last “x” lines, even if they overlap with the first “x” lines from TotalCount.  That’s all this little bit of code does, plus a little error checking to keep things working.  Don’t want someone saying -Tail 99 on a 4 line text file!
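The index arithmetic in the Tail loop is easier to follow with concrete numbers.  This sketch fills an ArrayList with ten made-up lines and applies the same loop bounds; Get-LastItems is a hypothetical name for the demo.

```powershell
# Hypothetical demo function applying the Tail loop bounds to an in-memory list
function Get-LastItems {
    Param (
        [System.Collections.ArrayList]$RawData,
        [int64]$Tail
    )
    # Clamp Tail so the start index never goes below zero
    If ($Tail -gt $RawData.Count) { $Tail = $RawData.Count }

    # With 10 items and Tail = 3: last index is 9, start is 9 - (3 - 1) = 7
    For ($i = ($RawData.Count - 1) - ($Tail - 1); $i -le $RawData.Count - 1; $i++) {
        $RawData[$i]
    }
}

$Demo = New-Object -TypeName System.Collections.ArrayList
1..10 | ForEach-Object { $Demo.Add("line $_") | Out-Null }

(Get-LastItems -RawData $Demo -Tail 3) -join ", "    # line 8, line 9, line 10
```

Asking for more items than the list holds simply returns the whole list, which mirrors the -Tail 99 on a 4 line file case.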


Switch ($true)
{
    {$TotalCount}
    { ... }
    {$Tail}
    { ... }
    Default
    {
        # Compare against $null so a blank line ("" is false) doesn't end the loop early
        While ($null -ne ($Line = $File.ReadLine()))
        {
            Write-Output $Line
        }
        Break
    }

Now, assuming TotalCount and Tail weren’t used, we still want to just read the file and output it as fast as possible.  So here I went with the same read loop and sent the output straight to the pipeline.  Do not pass Go, do not collect $200.

Conclusion

That’s the end of the script.  It supports getting a file path into the script in just about every imaginable way, and it supports TotalCount, Tail and Raw.  If any interest is shown I might include -Wait, as that shouldn’t be too difficult.  Famous last words!

If you’re interested in the script you can find it over on GitHub:  Get-ContentFast.ps1

June 1, 2015 | Powershell - Performance
