
Calculating Word Frequency from the Command-line

Word frequency calculations can be very helpful. I was poking around for a simple command-line tool that would calculate word frequency and came across a great solution (with lots of help from "GNU textutils: Putting the Tools Together" and this page, which helps describe alias syntax).

So here's a beast of a command that can be used to calculate word frequencies from the command-line on a file, file.txt:

tr '[A-Z]' '[a-z]' < file.txt | tr -cd '[A-Za-z0-9_ \012]' | tr -s '[ ]' '\012' | sort | uniq -c | sort -nr

You can place it in your .cshrc or .tcshrc file (or whatever) as an alias:

alias wordfreq "tr '[A-Z]' '[a-z]' < \!^ | tr -cd '[A-Za-z0-9_ \012]' | tr -s '[ ]' '\012' | sort | uniq -c | sort -nr"

Note: The above is all one line. Note that the whole alias is wrapped in double quotes, with the single-quoted tr arguments inside them. Also note the \!^: the !^ tells the shell to insert the first argument after the alias right there in the command, and the backslash (\) keeps the shell from doing something funny with the exclamation mark when the alias is defined. The pipeline lowercases the text, strips all punctuation (underscores are kept), puts one word per line, and then uses the sort and uniq commands to do the dirty work.
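To make the stages easier to follow, here is a minimal sketch of the same pipeline run on a made-up sentence instead of a file. The sample text and the variable names (sample, result) are mine, for illustration only, and I've written the tr sets without the surrounding brackets, which I'm assuming your tr accepts (GNU and BSD tr both do).

```shell
# Hypothetical sample text, used only for illustration.
sample='The cat sat. The CAT ran! The end?'

# Stage 1: fold upper case to lower case.
# Stage 2: delete everything except letters, digits, underscores,
#          spaces and newlines (this is what strips the punctuation).
# Stage 3: turn each run of spaces into a single newline, one word per line.
# Stage 4: sort so identical words sit next to each other, count each
#          group with uniq -c, then sort the counts in descending order.
result=$(printf '%s\n' "$sample" \
  | tr 'A-Z' 'a-z' \
  | tr -cd 'A-Za-z0-9_ \012' \
  | tr -s ' ' '\012' \
  | sort | uniq -c | sort -nr)

printf '%s\n' "$result"
```

The first two lines of output should show "the" with a count of 3 and "cat" with a count of 2. The order of the words that appear only once depends on how your sort breaks ties.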

So, for example, I can use this command on a (somewhat groomed) file that has the names of all the counties (and states where needed) in which machine-related problems were reported in the recent election:

%> wordfreq problem_counties.txt

    188 philadelphia
    130 orleans
     94 palm_beach
     81 franklin_oh
     70 cook
     69 broward
     68 cuyahoga
     62 miamidade
     59 kings_ny
     55 manhattan
     52 wayne_mi
     42 bernalillo
     41 los_angeles
     34 mahoning
     31 lucas
     31 allegheny
     30 mercer_pn
     30 maricopa
     30 harris
     26 orange
    [...]

UPDATE [2004-11-28 10:23]: For some reason or another, the alias above refuses to redirect output to a file. That is, the command:

%> wordfreq file.txt > output.txt

doesn't work: the output still goes to the terminal window (stdout). I have a feeling this is because one of the commands in there is native to the shell, so output redirection doesn't work. However, if you pipe it to one more dummy command (like cat), it works:

%> wordfreq file.txt | cat > output.txt

This allows you to use it on the command line and pipe it to head or more and when you like what you see, you can pipe to cat and redirect to a file.

If anyone has any clue why it won't redirect normally (even if you put the | cat in the alias itself), I'd love to know more.
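For anyone on sh or bash rather than csh/tcsh, the same pipeline can be wrapped in a shell function instead of an alias. This is only a sketch and not something I've used in place of the alias above: it reuses the wordfreq name, takes the file as "$1" (the counterpart of \!^), and writes the tr sets without the surrounding brackets, which I'm assuming your tr accepts. It doesn't explain the tcsh behaviour, but in a Bourne-style shell a redirection on a function call applies to the function's whole output.

```shell
# wordfreq: the alias's pipeline written as a POSIX sh function.
# "$1" is the input file, playing the role of \!^ in the csh alias.
wordfreq() {
    tr 'A-Z' 'a-z' < "$1" \
      | tr -cd 'A-Za-z0-9_ \012' \
      | tr -s ' ' '\012' \
      | sort | uniq -c | sort -nr
}

# Usage (file names are placeholders):
#   wordfreq file.txt > output.txt
#   wordfreq file.txt | head
```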