CS 261 Assignment #5

Due Saturday, March 25th by 11:59pm
Not accepted after 3/27 at Class Time

Introduction

For this assignment you'll revisit the word frequency program we worked with in lab #6, and use Binary Search to make it more efficient. As you'll recall, for each word we encountered from the text file, we did a linear search to see if it was in our list of known words. We could use Binary Search to do those checks more efficiently if we kept the list in alphabetical order. When we encountered a new word we'd need to insert it at just the right spot to keep the list ordered, rather than adding at the end, but Binary Search could help there as well! When the search ends, even if it didn't find the target word, it has zeroed in on where the word should go.
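To get a feel for the difference, the sketch below counts how many words each kind of search examines. It is only an illustration: the class and method names (SearchCostDemo, linearProbes, binaryProbes) are made up for this example, the "words" are synthetic, and the binary search here is a simple iterative one rather than the recursive version from class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the WordFrequencies project.
class SearchCostDemo {
    // Number of words a linear scan examines before it finds target (or runs out of words).
    static int linearProbes(List<String> words, String target) {
        int probes = 0;
        for (String w : words) {
            probes++;
            if (w.equals(target)) {
                break;
            }
        }
        return probes;
    }

    // Number of words a binary search examines on an alphabetically sorted list.
    static int binaryProbes(List<String> sorted, String target) {
        int lo = 0;
        int hi = sorted.size() - 1;
        int probes = 0;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            probes++;
            int cmp = target.compareTo(sorted.get(mid));
            if (cmp == 0) {
                break;              // found it
            } else if (cmp < 0) {
                hi = mid - 1;       // target sorts before the middle word
            } else {
                lo = mid + 1;       // target sorts after the middle word
            }
        }
        return probes;
    }

    public static void main(String[] args) {
        // Synthetic, already-sorted "words": w0000, w0001, ..., w1023.
        List<String> words = new ArrayList<>();
        for (int i = 0; i < 1024; i++) {
            words.add(String.format("w%04d", i));
        }
        String target = "w1023";    // last word: the worst case for a linear scan
        System.out.println("linear probes: " + linearProbes(words, target));
        System.out.println("binary probes: " + binaryProbes(words, target));
    }
}
```

With 1024 words the linear scan has to look at every word to reach the last one, while a binary search never needs more than 11 looks, since each look halves the range still under consideration.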

The diagram below shows how our revised program would process "the dog bit the cat". After reading the first three words, we would have scenario (a), where the list of words is in alphabetical order ("bit", "dog", "the"), and their corresponding counters are each 1. When we get to the fourth word, "the", we discover (via Binary Search) that it's already in the list of words, at position 2. The counter at position 2 would therefore be updated to reflect the fact that "the" has now been seen twice, giving us scenario (b). When we come across the word "cat", we discover it's not in our list of known words, so it gets added to the list at position 1, between "bit" and "dog", which keeps the list ordered. (Its counter is added at the same position in the list of counters.)


[Figure: Snapshots of the list of words and their frequencies as "the dog bit the cat" is processed.]
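The walkthrough above can be traced in code. This is a minimal sketch, not the required WordStorage implementation: the class name SortedCountDemo and the method record are invented for the example, and it uses the standard library's Collections.binarySearch as a stand-in for the recursive Binary Search from class. When the word is missing, Collections.binarySearch returns -(insertion point) - 1, which is how the sketch recovers the insertion position.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class, not part of the WordFrequencies project.
class SortedCountDemo {
    // Parallel lists: counts.get(i) is the frequency of words.get(i).
    // The words list is assumed to be in alphabetical order on entry and stays that way.
    static void record(List<String> words, List<Integer> counts, String word) {
        int pos = Collections.binarySearch(words, word);
        if (pos >= 0) {
            // Known word: bump the counter at the same position.
            counts.set(pos, counts.get(pos) + 1);
        } else {
            // New word: a negative result encodes the insertion point as -(point) - 1.
            int insertAt = -(pos + 1);
            words.add(insertAt, word);
            counts.add(insertAt, 1);
        }
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        for (String w : "the dog bit the cat".split(" ")) {
            record(words, counts, w);
            System.out.println(words + " " + counts);
        }
        // The last three lines printed correspond to the three snapshots:
        // [bit, dog, the] [1, 1, 1]
        // [bit, dog, the] [1, 1, 2]
        // [bit, cat, dog, the] [1, 1, 1, 2]
    }
}
```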

The Assignment

You'll finish implementing a class, WordStorage, that keeps track of the unique words encountered and their frequencies, then modify the WordCounter class from lab to make use of your new, more efficient word-counting code. You will also write up an analysis of the Big-O complexity of some of the key code in WordStorage. The specifics are given below.
  1. Start by downloading the WordFrequencies project. It's a BlueJ project, since it's based on the lab project, but you should feel free to work in Eclipse instead of BlueJ if you prefer. The only code in WordStorage at the moment is a copy of our recursive Binary Search code from class, but the documentation for the finished code is online and describes the methods you're required to implement. The WordCounter class contains some code from our lab project.
  2. The WordStorage class will contain the list of words and the list of counters used to determine the word frequencies. If the constructor is passed true as an input it will create ArrayLists for both; otherwise it will create LinkedLists. This makes it easy to experiment with how the choice of list affects the performance of the program, if you choose to.
  3. The processWord method in WordStorage should take a word and update the lists appropriately: If the word is already in the list, the corresponding counter should be incremented. Otherwise the word should be added to the list and a new counter, starting at 1, inserted at the corresponding position. For full credit, binary search should be used both to check whether a word is in the list and to find the insertion position if it turns out that the word is not in the list. This will require maintaining the word list in alphabetical order.
  4. The maxFrequency method should return the largest value in the list of counters. For full credit, it should use an iterator to traverse the list of counters so that it's as efficient as possible.
  5. The mostFrequentWord method should return the word that occurred most frequently. For full credit, any list traversals performed when finding the word should be done via iterators as well. In the case of a tie it doesn't matter which word you return.
  6. The size method should return the number of unique words in the collection, and toString should return a string reporting the most frequently occurring word and its frequency (or a message indicating that no words have been processed).
  7. You'll need to update the countWords method in the WordCounter class so that it makes use of the WordStorage class to process the words.
  8. Finally, once you've got everything working, add a comment to the bottom of the WordStorage class that presents an analysis of the Big-O complexity of the processWord method. You don't need to give a T(n) function — just an overall Big-O estimate — but for full credit you should justify your answer. Doing it justice will take a paragraph or two rather than just a sentence. (You'll need to factor in the complexity of the methods called by processWord, and discuss the impact of the type of lists holding the words and counts, for example.)
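
The constructor flag from step 2 and the iterator-based traversals from steps 4 and 5 can be sketched as follows. This is an illustration under assumptions, not the documented WordStorage code: the class name IteratorTraversalDemo and the helpers makeList, largest and wordWithLargest are hypothetical, and returning 0 (or null) for empty lists is a choice made for this example only. The reason to prefer an iterator is that calling get(i) in a loop on a LinkedList has to walk the chain of nodes on every call, while an iterator simply steps from each item to the next on either kind of list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical demo class, not part of the WordFrequencies project.
class IteratorTraversalDemo {
    // Mirrors the constructor flag in step 2: true gives an ArrayList, false a LinkedList.
    static <T> List<T> makeList(boolean useArrayList) {
        if (useArrayList) {
            return new ArrayList<T>();
        }
        return new LinkedList<T>();
    }

    // Largest counter, found in a single pass. Returns 0 for an empty list (an assumption).
    static int largest(List<Integer> counters) {
        int max = 0;
        Iterator<Integer> it = counters.iterator();
        while (it.hasNext()) {
            int count = it.next();
            if (count > max) {
                max = count;
            }
        }
        return max;
    }

    // Word paired with the largest counter, walking both lists in lockstep.
    // Returns null for empty lists (an assumption); on a tie the earliest word wins.
    static String wordWithLargest(List<String> words, List<Integer> counters) {
        Iterator<String> wordIt = words.iterator();
        Iterator<Integer> countIt = counters.iterator();
        String best = null;
        int max = 0;
        while (wordIt.hasNext() && countIt.hasNext()) {
            String word = wordIt.next();
            int count = countIt.next();
            if (count > max) {
                max = count;
                best = word;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> words = makeList(false);       // LinkedList
        List<Integer> counters = makeList(false);   // LinkedList
        words.addAll(Arrays.asList("bit", "cat", "dog", "the"));
        counters.addAll(Arrays.asList(1, 1, 1, 2));
        System.out.println(largest(counters));                  // prints 2
        System.out.println(wordWithLargest(words, counters));   // prints the
    }
}
```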

Notes on Insertions

As mentioned in the introduction, Binary Search can determine whether a word is already in the list, but it can also help determine where to insert the word if it isn't there. For example, after processing "the dog bit the cat" in the scenario above, imagine the next word is "zoo". A Binary Search on the list will return position 3, the spot currently occupied by "the". That's where "zoo" would be if it were in the list: to the right of "dog". The search got us to the right neighborhood, but we still need to figure out whether to insert "zoo" at position 3 and push the current item to the right, or to insert "zoo" at position 4. In this case "zoo" comes after "the" alphabetically, so it belongs to the right, at position 4, but that won't always be the case. For example, a Binary Search would return position 0 for both "bone" and "bat". The former should be inserted at position 1, but the latter at position 0. The Binary Search gets us close, but you need to compare the word being inserted with the word that's currently at the position where Binary Search stopped to decide exactly where to insert.
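
The final comparison described above can be sketched like this. The search shown is a hypothetical iterative stand-in that returns the last position it examined when the word is missing. The recursive Binary Search supplied in WordStorage may stop at a different position in some cases, so check its behavior before relying on the same adjustment. The names InsertPositionDemo, searchStop and insertionIndex are invented for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of the WordFrequencies project.
class InsertPositionDemo {
    // Stand-in search: returns the index of target if present, otherwise the
    // last position examined (or -1 if the list is empty).
    static int searchStop(List<String> sorted, String target) {
        int lo = 0;
        int hi = sorted.size() - 1;
        int mid = -1;
        while (lo <= hi) {
            mid = (lo + hi) / 2;
            int cmp = target.compareTo(sorted.get(mid));
            if (cmp == 0) {
                return mid;
            } else if (cmp < 0) {
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        return mid;
    }

    // Where a word that is not yet in the list should be inserted.
    static int insertionIndex(List<String> sorted, String word) {
        int stop = searchStop(sorted, word);
        if (stop < 0) {
            return 0;   // empty list: the word becomes the first entry
        }
        // If the new word sorts after the word where the search stopped,
        // it goes one slot to the right; otherwise it takes that slot.
        if (word.compareTo(sorted.get(stop)) > 0) {
            return stop + 1;
        }
        return stop;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("bit", "cat", "dog", "the");
        for (String w : new String[] {"zoo", "bone", "bat"}) {
            System.out.println(w + ": search stops at " + searchStop(words, w)
                    + ", insert at " + insertionIndex(words, w));
        }
        // zoo: search stops at 3, insert at 4
        // bone: search stops at 0, insert at 1
        // bat: search stops at 0, insert at 0
    }
}
```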

Grading

This assignment will be graded out of a total of 100 points.

Submitting

Before submitting, test your code thoroughly and double-check that there are comments (including @param and @return directives) above each method. Don't forget to do the Big-O analysis for processWord! When you're convinced everything is ready to go, zip up your project folder and submit it via the Canvas submission tool for Assignment #5.


Brad Richards, 2023