CS 261 Lab #6

February 28th

Goals:

This week we'll tie together some related loose ends: You'll write some code that can use either ArrayLists or LinkedLists when processing collections of words, investigate the impact of your choice of lists on the complexity of list-based operations, and write some alternative code using iterators for practice.

Partners

In each lab this semester you will work with a randomly assigned partner. (I'll have Zoom randomly set up breakout rooms.) Please be kind in your interactions with your partners! Keep in mind that students in this class have a range of previous programming experience, and that some have been college students for longer than others. We're all in this together, and you have something to learn from your partner, no matter who they are or what their previous experiences have been. I expect that group members will collaborate and work together on each step of the lab.

Introduction

The program you'll be working with this week determines the frequency of words in the text read from a file. The basic idea is to keep a list of unique words we've encountered so far, as well as a corresponding list of integers. Each time we read a word from the file, we'll look to see if it's in our list of already encountered words. If so, we'll increment the corresponding counter for that word. If not, we'll add the new word to our list of unique words, and set up a counter for it as well. The diagrams below show the progression as the sentence "the dog bit the cat" is processed. After reading the first three words, we would have scenario (a), where each of the three words has been added to the list of words, and each of the corresponding counters is 1 since each word has occurred once so far. When we get to the fourth word, "the", we would notice that it's already in the list of words, at position 0. The counter at position 0 would therefore be updated to reflect the fact that "the" has now been seen twice, giving us scenario (b). When we come across the word "cat", we discover it's not in our list of known words, so it gets added to the end of the list, and a corresponding counter is added to the end of the word counts.

Snapshots of the list of words and their frequencies as "the dog bit the cat" is processed.
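
To make the bookkeeping concrete, here is a minimal sketch of the parallel-list idea on the example sentence. The class name ParallelListsSketch and the helper tally are hypothetical and are not part of the lab project. The sketch deliberately leans on the built-in indexOf and on set, so it illustrates the data layout rather than the code you'll write in the exercises.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParallelListsSketch {
    // Hypothetical helper: tallies the input words into two parallel lists,
    // so that counts.get(i) is the number of times words.get(i) was seen.
    static void tally(List<String> input, List<String> words, List<Integer> counts) {
        for (String word : input) {
            int index = words.indexOf(word);   // built-in search, for illustration only
            if (index == -1) {
                // First sighting: add the word and a matching counter.
                words.add(word);
                counts.add(1);
            } else {
                // Seen before: bump the counter at the same position.
                counts.set(index, counts.get(index) + 1);
            }
        }
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        tally(Arrays.asList("the", "dog", "bit", "the", "cat"), words, counts);
        System.out.println(words);    // [the, dog, bit, cat]
        System.out.println(counts);   // [2, 1, 1, 1]
    }
}
```

The word at position i always pairs with the counter at position i, which is why a word's index in one list can be used directly in the other.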

Exercise 1: Finishing the Word Frequency Code

Take a moment to introduce yourself to your partner(s). After social pleasantries are complete, pick one member of the team to be the "typer". They'll share their screen while editing the lab code in BlueJ. (I think BlueJ works better for these interactions than Eclipse. It's easier to see when sharing screens, and makes it easier to quickly test individual methods than Eclipse does.) Group members should contribute equally while working through the problems below and discuss all code to be written, though only the "typer" will be able to edit code. Resist the temptation to have both members work simultaneously in BlueJ — you're much more likely to "drift apart" over the course of the lab if you do so. The goal here is to have a partner who's engaged on exactly the same step of the lab as you are.
Start by downloading the lab project and extracting its contents. Open it in BlueJ and peruse the code to see what's there. The WordCounter class contains starter code for finding word frequencies, though at the moment it just reads through each word in a particular text file without actually calculating word frequencies. (Run main — it processes 9,618 words from Moby Dick, but currently reports it found 0 unique words.)
Finish the definition of indexOfMatch. It should return the index at which a particular string is found in a list of strings, or -1 if the string is not found. Do not use the built-in indexOf as part of your solution. You can write this with a while loop, or with a for loop that returns once a match is found, but you should use a counter to keep track of the index and call the list's get() method as you check for matches. Later we'll use this to look for matches in our list of unique words. (Note: Remember that you shouldn't compare strings with == since it doesn't actually look at the contents of the two strings.)
Finish the definition of indexOfLargest. It takes a list of Integers, and returns the index of the largest value in the list (or -1 if the list is empty). As in indexOfMatch, use a loop and the list's get() method to consider each value in the list. Later we'll use this to find the largest number in the list of word counts, as part of determining the most common word.
Now that you've got the building blocks you need, finish writing the body of the while loop in the countWords method. There's a to-do list there in comments that you can use to guide your work. You can use indexOfMatch to determine if a word is already in our list of unique words.
Run the main method and test your code before proceeding. It won't return immediately, but if it takes longer than about 30 seconds, stop the program and edit the main method so that it uses "chapter1.txt" instead of "chapters1-3.txt". My code found 3,040 unique words in "chapters1-3.txt", and the winning word occurred 514 times. Once you've got it working, write down the number of milliseconds the program took to run so we can compare that to the results of the experiments below.
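
The note above about == is easy to trip over, so here is a small sketch of the difference. The class StringCompareSketch and the helper sameContents are hypothetical names. The call to new String(...) stands in for a word created at run time, such as one read from a file.

```java
class StringCompareSketch {
    // Hypothetical helper: compares two strings by their contents.
    static boolean sameContents(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String fromFile = new String("whale");   // stands in for a word read at run time
        String known = "whale";

        // == asks whether both variables refer to the very same object.
        System.out.println(fromFile == known);               // false
        // equals asks whether the characters match.
        System.out.println(sameContents(fromFile, known));   // true
    }
}
```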

Exercise 2: Investigating Performance

Before we do any computational experiments, go back and look at the code in indexOfMatch and indexOfLargest. They were written so that they can take any kind of list: anything that implements the List interface! But what is their complexity? You don't need to do a detailed T(n) estimate (unless you find it helpful), but see if you can predict the big-O complexity of each method (indexOfMatch and indexOfLargest) when passed a list of size n. Will it be the same for LinkedLists as for ArrayLists? Why or why not? If you're not sure, raise a virtual hand and talk it over with the TA or instructor before proceeding.
At the moment, the countWords method is using LinkedLists to store the words and counters. Change them both to be ArrayLists and re-run your program. (No other changes should be required.) Record the number of milliseconds it takes to run. Do these results make sense in terms of your analysis in the previous step? If you're not sure, raise a virtual hand and talk it over with the TA or instructor before proceeding.
Run two more experiments to see what happens if words is a LinkedList and wordCounts is an ArrayList, and vice versa. Together with the two runs above, that makes four separate tests in total. Record the execution times for these variations as well. Think about why the results look the way they do before proceeding.
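
If you'd like to try the same kind of measurement on a smaller scale, the sketch below times a get()-based loop over each kind of list. Everything in it (the class TimingSketch, the helpers fill and sumWithGet, and the list size of 20,000) is an assumption made for illustration, not part of the lab project. Timings vary from machine to machine and run to run, so treat the numbers as rough.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class TimingSketch {
    // Hypothetical helper: adds the values 0..n-1 to the given list and returns it.
    static List<Integer> fill(List<Integer> list, int n) {
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return list;
    }

    // Sums a list by calling get() at every position. Because the parameter
    // type is List, the same code accepts an ArrayList or a LinkedList.
    static long sumWithGet(List<Integer> values) {
        long total = 0;
        for (int i = 0; i < values.size(); i++) {
            total += values.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 20_000;   // assumed size; adjust to taste
        List<List<Integer>> candidates = new ArrayList<>();
        candidates.add(fill(new ArrayList<>(), n));
        candidates.add(fill(new LinkedList<>(), n));

        for (List<Integer> list : candidates) {
            long start = System.nanoTime();
            long total = sumWithGet(list);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(list.getClass().getSimpleName()
                    + ": sum = " + total + ", " + elapsedMs + " ms");
        }
    }
}
```

Both runs compute the same sum, so any difference in the reported times comes from the list implementation rather than from the work being done.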

Exercise 3: Bring on the Iterators

Both indexOfMatch and indexOfLargest work their way through a list sequentially, doing a .get() at each position. From our discussions in class we know that calling get on a linked list is O(n), so traversing an entire list that way is O(n^2)! We can do better if we use iterators: Finish the definition of indexOfMatch_iter, and make use of the iterator that's being created in the first line of the method. (Look back at the code from class if you want to see an example of an iterator in use.) You'll still need to keep track of a counter yourself so that you know the position at which a match is found (e.g. with a for loop), but use .next() as you inspect the values in the list.
Change countWords so that it uses indexOfMatch_iter instead of the original indexOfMatch when searching for matches. Re-run your experiments with both lists being LinkedLists, and then both being ArrayLists. How do these results compare to your original experiments? Why?
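
As a reminder of what iterator-based traversal looks like, here is a sketch on a different problem: finding the position of the first negative number in a list. The class IteratorSketch and the method indexOfFirstNegative are hypothetical and are not part of the lab code. The pattern of pairing hasNext() and next() with a hand-maintained counter is the part that carries over.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

class IteratorSketch {
    // Hypothetical example: returns the index of the first negative value,
    // or -1 if there is none. The iterator supplies the values in order;
    // the loop variable i tracks the position, since the iterator does not
    // report it.
    static int indexOfFirstNegative(List<Integer> values) {
        Iterator<Integer> it = values.iterator();
        for (int i = 0; it.hasNext(); i++) {
            if (it.next() < 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Integer> values = new LinkedList<>(Arrays.asList(3, 7, -2, 5, -9));
        System.out.println(indexOfFirstNegative(values));   // 2
    }
}
```

Note that next() is called exactly once per pass through the loop. Calling it twice in one pass would silently skip an element.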

At the end of the lab, look back on the execution times you recorded and summarize (for yourselves) what you learned. What were the main take-aways from the experiments? If you had to re-design this program from scratch, which kinds of lists would you use? Where is that decision most important?

Extras

If you've got extra time, consider trying the following:

Write an iterator-based version of indexOfLargest and see how much of a difference it makes.
When updating a word count, try using the set method rather than removing the old count and adding back the new. See how much of a difference that makes.
Modify countWords so that it removes punctuation from words before looking for matches. (Otherwise things like "whale", "whale.", and "whale!" get treated as separate words.)
Do finer-grained timing tests. Can you tell how much time is spent in indexOfMatch vs. the calls to remove and add, for example?
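
For the punctuation extra, one possible approach is sketched below. It assumes that stripping every ASCII punctuation character is acceptable, and the class PunctuationSketch and the helper stripPunctuation are hypothetical names. It relies on String.replaceAll with the regular-expression class \p{Punct}.

```java
class PunctuationSketch {
    // Hypothetical helper: removes all ASCII punctuation characters from a word.
    // Letters, digits, and whitespace are left untouched.
    static String stripPunctuation(String word) {
        return word.replaceAll("\\p{Punct}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuation("whale."));   // whale
        System.out.println(stripPunctuation("whale!"));   // whale
        System.out.println(stripPunctuation("don't"));    // dont
    }
}
```

As the last line shows, this also removes apostrophes and hyphens inside words. Whether that's what you want is a design decision worth discussing with your partner.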

Brad Richards, 2023