Guy’s Scripting Ezine 131: PowerShell Cleans Files and Calculates Word Frequency

PowerShell Cleans Files and Calculates Word Frequency

It’s my prediction that at some time during their career, all IT professionals will at least flirt with PowerShell.  On the surface, the purpose of this ezine is to introduce you to PowerShell’s ability to manipulate files.  Underneath the surface, my hidden agenda is to persuade you to learn by playing with a PowerShell script or two.

This Week’s Secret

With PowerShell I could choose examples to impress and bamboozle you; alternatively, I could select examples that are elegant but simple.  As my aim is always to get people started, I am going to err on the side of simple.  However, don’t let my short examples lull you into believing that PowerShell is not a powerful ‘grown-up’ scripting language.

When you see a PowerShell script, my advice is to dissect its contents by asking the following questions.  Are there any | pipeline symbols?  If so, use them to break the script into sections and to trace the flow of the commands.  Next ask, ‘Are there any brackets?’  Remember that the style of bracket is highly significant: (parentheses, for compulsory arguments and grouped expressions), {curly braces, for script blocks} or [square brackets, for array indexes and, in Get-Help syntax, optional items].
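
As a small illustration of that dissection, here is a one-liner of my own (the numbers are just an assumption for the demo, they are not part of this week's scripts).  It has two pipeline symbols, so it breaks into three sections, and it uses all three styles of bracket.

```powershell
# Section 1: parentheses group the expression that builds the numbers 1 to 10.
# Section 2: curly braces hold the script block that Where-Object runs for each item.
# Section 3: Sort-Object reverses the order of whatever survives the filter.
$evens = (1..10) | Where-Object { $_ % 2 -eq 0 } | Sort-Object -Descending

# Square brackets index into the resulting array.
"Largest even number `t= " + $evens[0]
"All even numbers `t= " + ($evens -join ", ")
```

Try deleting the last section (from the second | onwards) and see how the output changes.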

If scripting were a game of golf, then VBScript would be a links course with wide open but roughish fairways, whereas PowerShell would be a designer course with beautifully manicured, if narrow, tree-lined fairways.  My point is that with PowerShell, if you stay on the fairway (copy and paste other people’s code) it’s easy, but if you get into the rough (try writing your own code) it can be hard going to hack your way back onto the fairway.  By the way, PowerShell has a fleet of excellent caddies; to call for their advice on any hole, sorry cmdlet, try get-help verb-noun, for example get-help get-content.

This Week’s Mission

Sooner or later all scripting languages need to manipulate files.  For example, we need to locate files and then read and write data.  Other tasks include creating, copying and moving files.  Continuing my theme of breaking PowerShell into bite-sized chunks, get into the habit of analysing each script for verbs such as get, set, new, find, split, join and out.  Nouns associated with the aforementioned verbs include file, location, content and item.  You may already know that PowerShell’s basic commands are verb-noun pairs, for example, set-location, get-content and out-file.  The good news, in my experience, is that learning as few as 10 verbs and 10 nouns is enough for about 90% of your scripts.  The skill is selecting the correct combination for the particular job and then researching the .methods, .properties or -parameters.  For example, new-object and get-member are cmdlets, .Split is a method and -replace is an operator.
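
Here is a minimal sketch of that research process.  The $sample string is my own invention for the demo; the cmdlets, the .Split method and the -replace operator are the ones mentioned above.

```powershell
# Get-Command lists cmdlets by verb and noun; here, three 'Get' cmdlets.
$getCommands = Get-Command -Verb Get -Noun Content, Item, Location
$getCommands | Select-Object -ExpandProperty Name

# Get-Member reveals which methods a string object offers.
$sample = "set-location get-content out-file"
$sample | Get-Member -MemberType Method -Name Split, Replace

# .Split() is a method of the string; -replace is a PowerShell operator.
$pairs = $sample.Split(" ")
$nounsOnly = $sample -replace "\w+-", ""

"Verb-noun pairs `t= " + $pairs.Length
"Nouns only `t `t= " + $nounsOnly
```

Note how every command follows the same verb-noun rhythm, which makes it easy to guess the name of a cmdlet you have never used.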

Another area of consistency is that all PowerShell nouns are singular, so it’s always file, content and location, never: fileS, contentS or locationS.

Ah yes, This Week’s Mission is to get a file (from the web), read the contents, count the words, then list the dozen most frequent words.  I cannot claim that it has a major ‘real world’ use; nevertheless, it illustrates the capabilities of PowerShell and is also handy for analysing keywords on my pages.  Once we ‘get’ the web file, we manipulate the stream of text; for example, we can remove all the <tags> and break the text stream into individual words.  Finally we will index the words and obtain a list of the dozen most common words.

Example 1 – Open an .htm file and count the characters

This PowerShell example creates a net.webclient object and downloads the page as one long string.  The code then interrogates the string’s .length property and displays the number of characters.  As usual, I want to split the project into easy stages, get each section working and only then assemble them into the final script.

Pre-requisites

  1. You need to have already installed a local copy of PowerShell.
  2. If you have no internet access, then amend the $URL variable to a local file, for example:
    $URL = "C:\boot.ini"

Instructions

  1. Save the following script with a .ps1 extension, for example file1.ps1
  2. Launch PowerShell and navigate to the folder where you saved the .ps1 file
  3. To call for your script file, type at the command line PS> .\file1.ps1
    N.B. The rhythm of the command is:  dot backslash filename.
     

cls
# — You may like to check or amend the $URL variables  —
$URL = "https://computerperformance.co.uk/powershell/index.htm"
#
$web = New-Object net.webclient
$doc = $web.DownloadString($URL)
"Page analysed `t `t= " +$URL
"Total characters `t= " +$doc.length

Learning Points

Note 1:  PowerShell’s escape character is the tiny ` (backtick).  On my keyboard this symbol is created by pressing the top left key, next to the 1.  Some call this character a grave accent, as indeed does Word’s symbol checker.  My main use for this character is to call for a tab in the output with `t.  Another use, which you will see in Example 2, is to ‘escape’ the double quotes in the $WordBreakers variable.
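
To see the backtick at work on its own, here is a short sketch.  The variable names and sample text are my own assumptions, chosen purely for illustration.

```powershell
# `t inserts a tab and `n inserts a new line, but only inside double quotes.
$tabbed   = "Name`tValue"
$twoLines = "First line`nSecond line"

# `" places a literal double quote inside a double-quoted string.
$quoted = "He said `"hello`""

# Single-quoted strings are literal, so the backtick and t stay as typed.
$literal = 'No tab here: `t'

$tabbed
$twoLines
$quoted
$literal
```

The last line is worth a second look: in single quotes the backtick loses its special powers.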

Note 2: In PowerShell you just introduce a variable with a dollar sign, for example $URL.  No further declaration is compulsory; however, you can declare the scope of variables.
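
A minimal sketch of variables and scope follows.  The names $count, $limit and Test-LocalScope are hypothetical, invented just for this demonstration.

```powershell
# No declaration needed: assigning a value creates the variable.
$count = 12

# Optionally, a type in square brackets constrains the variable;
# here the string "5" is converted to the integer 5.
[int]$limit = "5"

function Test-LocalScope {
    # Assigning inside a function creates a new local variable by default,
    # so the $count outside the function is left untouched.
    $count = 99
    "Inside the function `t= " + $count
}

Test-LocalScope
"Outside the function `t= " + $count
"Limit plus one `t `t= " + ($limit + 1)
```

If you do want a function to change the outer variable, that is where scope prefixes such as $script: or $global: come in.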

Note 3: Good news.  You do not have to explicitly open and close files because PowerShell takes care of that automatically.  This one command both fetches the file and makes its contents available for processing:
$doc = $web.DownloadString($URL)
Incidentally, get-content also takes care of opening the file.
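
For a local file, a rough equivalent using get-content might look like the sketch below.  The file name ezine131-sample.txt and its two lines of text are my own assumptions; the script creates the file in your temp folder and deletes it afterwards.

```powershell
# Build a path to a hypothetical sample file in the temp folder.
$path = Join-Path ([System.IO.Path]::GetTempPath()) "ezine131-sample.txt"

# Set-Content creates the file and writes two lines; no explicit open or close.
Set-Content -Path $path -Value "PowerShell reads files.", "No explicit open or close."

# Get-Content returns an array with one string per line.
$lines = Get-Content -Path $path
$text  = $lines -join "`n"

"Lines read `t `t= " + $lines.Length
"Total characters `t= " + $text.Length

# Tidy up the sample file.
Remove-Item -Path $path
```

Unlike DownloadString, which returns one long string, get-content hands you the file line by line, hence the -join.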

Example 2 – Replace all the html <tags> with a blank space

Example 2 builds on Example 1 by taking the text stream and removing unwanted HTML formatting tags such as <p> and <h3>.  Observe how I have deliberately chosen two different ways of replacing text: the -replace operator, which takes a regular expression, and the .replace() method, which takes literal text.  Removing the blank items left behind by the split proved more troublesome, but I solved it with -match.

Pre-requisites

  1. You need to have already installed a local copy of PowerShell.
  2. If you have no internet access, then amend the $URL variable to a local file, for example:
    $URL = "C:\boot.ini"

Instructions

  1. Save the following script with a .ps1 extension, for example file2.ps1
  2. Launch PowerShell and navigate to the folder where you saved the .ps1 file
  3. To call for your script file, type at the command line PS> .\file2.ps1
    N.B. The rhythm of the command is:  dot backslash filename.
     

cls
# — Three $variables you may like to check —
$WordBreakers = " `",.-=:;"
$Keyword = "PowerShell"
$URL = "https://computerperformance.co.uk/powershell/index.htm"
#
$web = New-Object net.webclient
$doc = $web.DownloadString($URL)
"Page analysed `t `t= " +$URL
"Total characters `t= " +$doc.length
#
# —- ‘Cleaning the document’
$doc = $doc -replace "\<[^<]*\>", " "
"Removed HTML <tags> `t= " +$doc.length
$doc = $doc.replace("&nbsp", "")
"Removed &nbsp `t `t= " +$doc.length
# .ToCharArray() makes sure each symbol in $WordBreakers is a separate delimiter
$words = $doc.split($WordBreakers.ToCharArray())
"Words after split `t= " +$words.length
$words = $words | ?{$_ -match '[a-z]'}
"Removed blank spaces `t= " +$words.length
# $doc | get-member

Learning Points

Note 1:  To appreciate how I researched the available methods, remove the hash # on the last line
$doc | get-member

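Note 2:  The problems I faced with ‘cleaning’ this HTML document were: a) Removing HTML formatting.  b) Removing unnecessary blank items.  The filter "\<[^<]*\>" deserves a closer look.  What it does is remove each pair of angled brackets along with the enclosed HTML tag.  The backslashes simply mark < and > as literal characters.  The caret ^ (hat) symbol, when it is the first character inside square brackets, means ‘not’, so [^<] matches any single character except an opening angled bracket.  The star * then repeats that match zero or more times.  A slightly tighter pattern is "<[^>]*>", which stops at the first closing angled bracket.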
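
To watch the filter at work without downloading anything, try this sketch.  The $html and $tricky strings are made-up samples, not text from my pages.

```powershell
# A made-up fragment of HTML for testing.
$html = "<h3>Title</h3><p>PowerShell&nbsp;rocks</p>"

# -replace takes a regular expression; each tag becomes a single space.
$noTags = $html -replace "\<[^<]*\>", " "
"After -replace `t= [" + $noTags + "]"

# .Replace() takes literal text; here it deletes the characters &nbsp
# (the trailing semi-colon is left for the word breakers to deal with).
$noNbsp = $noTags.Replace("&nbsp", "")
"After .Replace() `t= [" + $noNbsp + "]"

# The difference matters: a dot is a wildcard to -replace, but plain text to .Replace().
$regexDot   = "a.b" -replace ".", "-"
$literalDot = "a.b".Replace(".", "-")

# Where the text itself contains a >, the tighter pattern is the safer choice.
$tricky = "<p>5 > 3</p>"
$loose  = $tricky -replace "<[^<]*>", " "
$tight  = $tricky -replace "<[^>]*>", " "
"Loose pattern `t= [" + $loose + "]"
"Tight pattern `t= [" + $tight + "]"
```

The square brackets in the output are only there so that you can see the leading and trailing spaces.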

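Note 3:  Probably the most important method is .split.  Trace how the $WordBreakers variable controls the symbols used to break the stream into individual words.  Perhaps I have used too many; you could try simplifying $WordBreakers.  Then again, the list contains no line break or tab, so you could also experiment with adding `n and `t.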
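
Here is a small sketch of the split-then-filter technique on a single sentence.  The $text sentence is my own sample; $WordBreakers is the same set of symbols as in the script.

```powershell
$WordBreakers = " `",.-=:;"
$text = "PowerShell: simple, elegant - and powerful."

# Split on every individual symbol; adjacent symbols leave empty strings behind.
$allPieces = $text.Split($WordBreakers.ToCharArray())
"Pieces after split `t= " + $allPieces.Length

# Keep only the pieces that contain at least one letter.
$words = $allPieces | Where-Object { $_ -match '[a-z]' }
"Real words `t `t= " + $words.Length

# An alternative for the empty strings: ask .Split() not to return them.
$noEmpties = $text.Split($WordBreakers.ToCharArray(), [System.StringSplitOptions]::RemoveEmptyEntries)
"Without empties `t= " + $noEmpties.Length
```

The -match filter does a little more than RemoveEmptyEntries, because it also discards pieces made up solely of digits or stray symbols.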

Note 4:  Once again I have used the ` backtick (grave) escape character.  The most important use here is to escape the double quotes inside $WordBreakers, for example `".  Put another way, if you remove this ` you get an error and PowerShell halts.  In Example 3 you will also see `n12; this is not a typo, but an instruction to add a line break before 12.

Example 3 – Find the dozen most popular words in a file

The final section of the script indexes the file and this enables us to calculate the most popular 12 words.  For this to work, we need to create a hash-table which indexes each word and keeps a running count of the instances of each word.

Pre-requisites

  1. You need to have already installed a local copy of PowerShell.
  2. If you have no internet access, then amend the $URL variable to a local file, for example:
    $URL = "C:\boot.ini"

Instructions

  1. Save the following script with a .ps1 extension, for example file3.ps1
  2. Launch PowerShell and navigate to the folder where you saved the .ps1 file
  3. To call for your script file, type at the command line PS> .\file3.ps1
    N.B. The rhythm of the command is:  dot backslash filename.

cls
# — Three $variables you may like to check —
$WordBreakers = " `",.-=:;"
$Keyword = "PowerShell"
$URL = "https://computerperformance.co.uk/powershell/index.htm"
#
$web = New-Object net.webclient
$doc = $web.DownloadString($URL)
"Page analysed `t `t= " +$URL
"Total characters `t= " +$doc.length
#
# —- ‘Cleaning the document’
$doc = $doc -replace "\<[^<]*\>", " "
"Removed HTML <tags> `t= " +$doc.length
$doc = $doc.replace("&nbsp", "")
"Removed &nbsp `t `t= " +$doc.length
# .ToCharArray() makes sure each symbol in $WordBreakers is a separate delimiter
$words = $doc.split($WordBreakers.ToCharArray())
"Words after split `t= " +$words.length
$words = $words | ?{$_ -match '[a-z]'}
"Removed blank spaces `t= " +$words.length
#
# —– Indexing the unique words ————
$words | foreach {$hash=@{}} {$hash[$_] +=1}
$freq =$hash.psbase.keys | sort {$hash[$_]}
"`n12 Most popular words: `n"
$num =-1; While ($num -gt -13) {$freq[$num]; $num +=-1}
"`nFrequency keyword " +$Keyword +" `t= " +$hash["$Keyword"]

Learning Points

Note 1:  Here is where we create the hash-table.
$words | foreach {$hash=@{}} {$hash[$_] +=1}

What happens is the stream of data, $words, is piped into a table where each word is indexed and counted.  The first pair of braces runs once to create the empty hash-table; the second pair runs for every word and adds one to that word’s count.

Note 2:  Once we have the hash table we can sort into word frequency with:
$freq =$hash.psbase.keys | sort {$hash[$_]}

Note 3: To let you into a secret this line gave me the most trouble:
$num =-1; While ($num -gt -13) {$freq[$num]; $num +=-1}

$num =-1 means the most popular word.  The reason it starts at negative one is that a negative index counts back from the end of the array, and because sort arranges the words in ascending order of frequency, the last item, $freq[-1], is the most popular.  My problem was getting the simple logic of the ‘While’ loop to count backwards -1, -2 etc.  The loop stops when $num reaches -13, so it lists exactly twelve words.  In the end I took this one line, made a new script and tested it until it produced the desired sequence.

Note 4:  In passing, observe and admire the brackets: (parentheses, compulsory around the While condition) and {braces, enclosing a script block}.  Talking of punctuation, it’s confession time: in my original script I took my eye off the semi-colon, a fatal mistake.  However, all is corrected in the above Example 3 script.
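
If you would like to test the indexing logic in isolation, here is a sketch with a tiny hand-made word list (the eight words are my own assumption).  It uses the same hash-table idea as Example 3, then shows Group-Object as an alternative technique, which is not what my script uses.

```powershell
# A tiny hand-made word list instead of a downloaded page.
$words = "the", "cat", "and", "the", "dog", "and", "the", "bird"

# Build the hash-table: each word is a key, its count is the value.
$hash = @{}
foreach ($word in $words) { $hash[$word] += 1 }

# Sort the keys by their counts, lowest first.
$freq = $hash.Keys | Sort-Object { $hash[$_] }

# Negative indexes count back from the end, so -1 is the most frequent word.
"Most popular word `t= " + $freq[-1] + " (" + $hash[$freq[-1]] + ")"
"Second most popular `t= " + $freq[-2] + " (" + $hash[$freq[-2]] + ")"

# Alternative technique: Group-Object does the counting for you.
$top = $words | Group-Object | Sort-Object Count -Descending | Select-Object -First 2
$top | ForEach-Object { $_.Name + " `t= " + $_.Count }
```

With only eight words it is easy to check the counts by eye before trusting the script on a whole web page.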

Guy’s Challenges

Challenge 1:  Adjust $URL to analyse a different file.

Challenge 2:  Try different ‘Breakers’ in the .Split() argument.

Challenge 3:  Experiment with employing one of the replace methods to remove other words, for example "The".
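
As a nudge for Challenge 3, here is one possible sketch; treat it as a starting point rather than the answer.  The $doc sentence is a made-up sample, and the \b word-boundary markers are my suggestion for avoiding damage to words that merely contain ‘the’.

```powershell
# A made-up sentence containing 'The', 'the', 'Then' and 'theory'.
$doc = "The cat sat on the mat. Then the theory held."

# -replace ignores case, and \b restricts the match to whole words only.
$cleaned = $doc -replace "\bthe\b", ""
"With -replace `t= " + $cleaned

# .Replace() is literal and case-sensitive: it misses 'The' and mangles 'theory'.
$mangled = $doc.Replace("the", "")
"With .Replace() `t= " + $mangled
```

Compare the two lines of output to decide which replace method suits your own list of unwanted words.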

Summary of PowerShell and Handling Files

This is an introduction to PowerShell and files.  There are many aspects of handling files, for instance reading and writing, as well as creating, copying and moving.  However, these scripts focus on reading the document, breaking the text stream into words and then calculating the dozen most popular words.
