In this interlude we will develop a program that reads a text and prints the most frequent words in that text. As in the previous interlude, the program here is quite simple, but it uses some more advanced features, such as iterators and anonymous functions.
The main data structure of our program is a table that maps each word found in the text to its frequency counter. With this data structure, the program has three main tasks:
Read the text, counting the number of occurrences of each word.
Sort the list of words in descending order of frequencies.
Print the first n entries in the sorted list.
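To make this data structure concrete, consider a tiny invented input such as "the cat saw the dog". After reading it, counter would hold the equivalent of the following table (shown here only as an illustration):

-- counter built from the sample text "the cat saw the dog"
local counter = {the = 2, cat = 1, saw = 1, dog = 1}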
To read the text, we iterate over all its lines and, for each line, over all its words. For each word that we read, we increment its respective counter:
local counter = {}
for line in io.lines() do
  for word in string.gmatch(line, "%w+") do
    counter[word] = (counter[word] or 0) + 1
  end
end
Here, we describe a “word” using the pattern '%w+', that is, one or more alphanumeric characters.
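To see the pattern at work, the following self-contained snippet (the sample sentence is made up for illustration) prints the words that the inner loop would see for one line:

-- each match of "%w+" is a maximal run of alphanumeric characters;
-- spaces and punctuation act as separators
for word in string.gmatch("Hello, world! Hello again...", "%w+") do
  io.write(word, "\n")    --> Hello, world, Hello, again (one per line)
end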
The next step is to sort the list of words. However, as the attentive reader may have noticed already, we do not have a list of words to sort! Nevertheless, it is easy to create one, using the words that appear as keys in table counter:
local words = {}    -- list of all words found in the text
for w in pairs(counter) do
  words[#words + 1] = w
end
Once we have the list, we can sort it using table.sort:
table.sort(words, function (w1, w2)
  return counter[w1] > counter[w2] or
         counter[w1] == counter[w2] and w1 < w2
end)
Remember that the order function must return true when w1 must come before w2 in the result. Words with larger counters come first; words with equal counters come in alphabetical order.
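As a quick sanity check of this order function, the next snippet sorts a small hand-made example (the words and counts are invented): lua has the largest counter and comes first, while a and the are tied and therefore appear in alphabetical order.

local counter = {lua = 5, the = 3, a = 3}   -- invented counts
local words = {"the", "a", "lua"}
table.sort(words, function (w1, w2)
  return counter[w1] > counter[w2] or
         counter[w1] == counter[w2] and w1 < w2
end)
print(table.concat(words, " "))    --> lua a the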
Figure 11.1, “Word-frequency program” presents the complete program.
Figure 11.1. Word-frequency program
local counter = {}
for line in io.lines() do
  for word in string.gmatch(line, "%w+") do
    counter[word] = (counter[word] or 0) + 1
  end
end

local words = {}    -- list of all words found in the text
for w in pairs(counter) do
  words[#words + 1] = w
end

table.sort(words, function (w1, w2)
  return counter[w1] > counter[w2] or
         counter[w1] == counter[w2] and w1 < w2
end)

-- number of words to print
local n = math.min(tonumber(arg[1]) or math.huge, #words)

for i = 1, n do
  io.write(words[i], "\t", counter[words[i]], "\n")
end
The last loop prints the result, that is, the first n words and their respective counters. The program assumes that its first argument is the number of words to be printed; if there is no argument, tonumber(arg[1]) results in nil, so n defaults to the total number of words and the program prints them all.
As an example, we show the result of applying this program over this book:
$ lua wordcount.lua 10 < book.of
the 5996
a 3942
to 2560
is 1907
of 1898
in 1674
we 1496
function 1478
and 1424
x 1266
Exercise 11.1: When we apply the word-frequency program to a text, usually the most frequent words are uninteresting small words like articles and prepositions. Change the program so that it ignores words with fewer than four letters.
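One possible way to do that (a sketch, not the only solution) is to test the length of each word before counting it:

for word in string.gmatch(line, "%w+") do
  if #word >= 4 then    -- ignore words with fewer than four letters
    counter[word] = (counter[word] or 0) + 1
  end
end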
Exercise 11.2: Repeat the previous exercise but, instead of using length as the criterion for ignoring a word, the program should read from a text file a list of words to be ignored.
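A possible sketch for this exercise, assuming the words to be ignored live in a file named stopwords.txt (one word per line; the file name is only an example), is to load them into a set before counting:

local ignored = {}
for w in io.lines("stopwords.txt") do
  ignored[w] = true
end
-- in the counting loop, count a word only if it is not in the set:
--   if not ignored[word] then
--     counter[word] = (counter[word] or 0) + 1
--   end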