We begin our analysis of "Walden" by computing a list of all words appearing in the text, together with the number of occurrences of each word:
walden = sorted_list('walden.txt')
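The function sorted_list is not defined in this section. The following is only a rough sketch of what such a function might do, assuming it returns (word, count) pairs ordered from most to least frequent, which is how the list is indexed in the code here. The names word_counts and sorted_list_sketch, the lowercasing, and the tokenization rule are all assumptions made for illustration, so counts produced this way may differ from those of the actual function.

```python
import re
from collections import Counter


def word_counts(text):
    """Return (word, count) pairs from a string, most frequent first.

    Assumption: a word is a run of letters, optionally followed by an
    apostrophe and more letters (e.g. "don't"); case is ignored.
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words).most_common()


def sorted_list_sketch(path):
    """Hypothetical stand-in for sorted_list: read a file and count its words."""
    with open(path, encoding='utf-8') as f:
        return word_counts(f.read())


# Small in-memory example, so no file is needed to try it out.
print(word_counts("The cat and the hat. The end."))
```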
The length of this list gives the number of unique words in the text:
print('Number of unique words in Walden: {}'.format(len(walden)))
The next computation shows that almost half of these words appear in the text only once:
uniques = [w for w in walden if w[1] == 1]
print('Number of words appearing only once: {}'.format(len(uniques)))
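To check a claim such as "almost half" directly, the share of single-occurrence words can be computed as a fraction of all distinct words. The sketch below uses a hypothetical helper, share_of_singletons, and a small made-up list in place of the real data; it assumes the same (word, count) layout as the list above.

```python
def share_of_singletons(word_counts):
    """Return the fraction of distinct words whose count is exactly 1."""
    if not word_counts:
        return 0.0
    singles = sum(1 for _, count in word_counts if count == 1)
    return singles / len(word_counts)


# Toy data for illustration only; with the real list, pass `walden` instead.
sample = [('the', 3), ('pond', 2), ('loon', 1), ('axe', 1)]
print('Share of words appearing only once: {:.1%}'.format(share_of_singletons(sample)))
```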
At the other end of the spectrum are words that appear in the text hundreds or even thousands of times. The 10 most frequently occurring words are:
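print('rank  word      occurrences')
print('----  --------  -----------')
# walden is ordered from most to least frequent, so the first 10 entries are the top 10
for rank, (word, count) in enumerate(walden[:10], start=1):
    print('{:4}  {:8}  {}'.format(rank, word, count))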