Assignment 0 - Programming in Scala

Overview
Part 1: Getting set up
Part 2: Counting words
Part 3: Removing stopwords
Part 4: Word count distribution
Part 5: Counting n-grams
Part 6: Reading a data file

Due: Tuesday, September 10, noon

This course will require you to complete several non-trivial programming assignments. As such, to be successful in this class you must be comfortable programming. The primary purpose of this assignment is for you to ensure that you have at least a minimum level of programming competence.

All of the programming assignments for the rest of the course will be more difficult than this one. If you find this assignment excessively difficult, then this course may not be for you.

If you have questions, or want to discuss whether you think the course is a good fit for you, please do not hesitate to talk to me.

Overview

In this assignment, you will write code for reading texts from a file and calculating basic statistics about them. This code will be reused in subsequent programming assignments, since all future assignments will require working with texts.

In the root of your repository, create a file called Assignment0_README.md that contains:

A short overview of what you’ve done
A list of files relevant to this assignment
Any commands needed to demonstrate your programs

Send an email by noon on the due date to both me (dhg@cs.utexas.edu) and Lewis (lewfish@cs.utexas.edu) when your code is checked in, pushed, and ready for grading. The subject of your email should be:

nlpclass-fall2013 a0 completed lastname firstname

You are highly encouraged to make your code as modular as possible, to facilitate reuse. Functions like reading a file, cleaning it up, and counting things are going to be used all the time in this course. Having easy-to-call functions for these operations in your code will serve you well.

You may discuss programming assignments with your classmates. And Google is an invaluable programming asset (in school and in the real world), so use it well. But avoid looking for exact answers on Google or having classmates give you exact solutions. And please don’t post these assignments to StackOverflow asking people to do your work for you. The point of these programming assignments is to reinforce the material; copying and pasting from the internet defeats that purpose. The goal is to learn something, to practice programming, to develop critical thinking skills, and, hopefully, to have fun.

Part 1: Getting set up

Follow the instructions on the Assignment Requirements and Scala Environment Setup pages. At a minimum, you’ll need to:

Download and install Scala
Download and install SBT
Create a GitHub account and register as a student
Create a PRIVATE GitHub repository for your code with the name
```
nlpclass-fall2013-lastname-firstname
```
and clone it to your computer
Add me (GitHub username dhgarrette) and Lewis (GitHub username lewfish) each as a “collaborator”
Create a Scala project in (the root of) your repository with nlpclass-fall2013 as a dependency

All of your code for this assignment will be located in a package called nlp.a0. This means that there should be a folder called a0 contained in a folder called nlp. So you should have this:

nlpclass-fall2013-lastname-firstname/src/main/scala/nlp/a0

For an example of this setup, see the program Example, which is set up in a similar way (though with a different package name). It can be run as:

$ cd nlpclass-fall2013-lastname-firstname
$ sbt "run-main nlpclass.Example"
[...]
This is an example Scala program.
[...]

Part 2: Counting words

Download Alice’s Adventures in Wonderland from Project Gutenberg.

Write an application that does the following:

Takes a file path as a command-line argument.
Removes all punctuation and numbers and lowercases all words.
Prints the total number of words in the book.
Prints the number of distinct words in the book.
Prints each of the top 10 most frequent words along with its count and its percentage of the total.

The application should be in an object called WordCount in a package called nlp.a0. I should be able to run your program with something like this, and get this exact output (excluding sbt garbage):

$ cd nlpclass-fall2013-lastname-firstname
$ sbt "run-main nlp.a0.WordCount /Users/dhg/texts/alice.txt"
Total number of words: 30419
Number of distinct words: 3007
Top 10 words:
the            1818    5.98
and            940     3.09
to             809     2.66
a              690     2.27
of             631     2.07
it             610     2.01
she            553     1.82
i              545     1.79
you            481     1.58
said           462     1.52

Part 3: Removing stopwords

Stopwords are extremely frequent non-content words such as determiners, pronouns, and prepositions. You’ll notice that the top 10 words in the book are all stopwords. Because they are frequent in nearly every document, stopwords generally don’t tell us much about the content of any particular document.

Here, you will extend your program to allow for word counting that ignores stopwords. Update your program to:

Take a file of stopwords as a command-line parameter with the option --stopwords FILE. (For example: this one).
If the list of stopwords is present, then skip them in your top-10 display (but don’t exclude them from your total count).
Ensure that if the stopwords option is not present, the program runs as in Part 2.
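
One possible way to handle the optional flag is sketched below. The names `StopwordOptionSketch`, `parseArgs`, and `topK` are hypothetical, and the sketch assumes that the input file is the first argument left over once `--stopwords FILE` has been removed. Stopwords are filtered only when ranking, so the totals are untouched.

```scala
object StopwordOptionSketch {

  /** Splits the arguments into (input path, optional stopword-file path). */
  def parseArgs(args: Seq[String]): (String, Option[String]) = {
    val flagIndex = args.indexOf("--stopwords")
    if (flagIndex >= 0 && flagIndex + 1 < args.size) {
      val stopPath = args(flagIndex + 1)
      // Whatever remains after removing the flag and its value is the input path.
      val rest = args.take(flagIndex) ++ args.drop(flagIndex + 2)
      (rest.head, Some(stopPath))
    } else {
      (args.head, None) // no flag given: behave as in Part 2
    }
  }

  /** Top-k words by count, skipping stopwords; an empty set reproduces the unfiltered ranking. */
  def topK(counts: Map[String, Int], stopwords: Set[String], k: Int): Vector[(String, Int)] =
    counts.toVector
      .filterNot { case (word, _) => stopwords.contains(word) }
      .sortBy { case (word, count) => (-count, word) }
      .take(k)
}
```

Passing `Set.empty` when no stopword file is given means the same ranking code serves both cases.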

I should be able to run your program with something like this, and get this exact output (excluding sbt garbage):

$ sbt "run-main nlp.a0.WordCount alice.txt --stopwords english.stop"
Total number of words: 30419
Number of distinct words: 3007
Top 10 words:
alice          403     1.32
gutenberg      93      0.31
project        87      0.29
queen          75      0.25
thought        74      0.24
time           71      0.23
king           63      0.21
turtle         59      0.19
began          58      0.19
tm             57      0.19

This list is a bit more interesting since it shows us words that are actually relevant to the book.

Part 4: Word count distribution

The distributions of words in a document are always highly skewed: a few words appear with very high frequencies, but most words appear very few times. To get an idea of the shape of things, write a program called WordFreqFreq that prints the ten most frequent frequencies and the five least frequent frequencies.

Your output should obviously be sorted by frequency frequency, but for frequencies with the same frequency frequency, you should sort by frequency. Confused yet?
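
To make the ordering concrete, here is a small sketch on toy counts. The names `FreqFreqSketch`, `freqOfFreqs`, and `sortedPairs` are hypothetical, and the tie-breaking direction (smaller frequency first) is my reading of the rule, so compare it with the expected output.

```scala
object FreqFreqSketch {

  /** Maps each frequency to the number of distinct words that occur with that frequency. */
  def freqOfFreqs(wordCounts: Map[String, Int]): Map[Int, Int] =
    wordCounts.values.toVector
      .groupBy(identity)
      .map { case (freq, sameFreq) => freq -> sameFreq.size }

  /** (frequency, frequency frequency) pairs: most common frequency first, ties by smaller frequency. */
  def sortedPairs(freqFreqs: Map[Int, Int]): Vector[(Int, Int)] =
    freqFreqs.toVector.sortBy { case (freq, freqFreq) => (-freqFreq, freq) }
}
```

For the toy counts `Map("a" -> 1, "b" -> 1, "c" -> 2, "d" -> 5, "e" -> 3)`, two words appear once and one word each appears 2, 3, and 5 times, so the sorted pairs are `(1,2), (2,1), (3,1), (5,1)`.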

I should be able to run your program with something like this, and get this exact output (excluding sbt garbage):

$ sbt "run-main nlp.a0.WordFreqFreq /Users/dhg/texts/alice.txt"
Top 10 most frequent frequencies:
1331 words appear 1 time
467 words appear 2 times
264 words appear 3 times
176 words appear 4 times
101 words appear 5 times
74 words appear 8 times
72 words appear 6 times
66 words appear 7 times
38 words appear 9 times
36 words appear 10 times

Bottom 5 most frequent frequencies:
1 word appears 631 times
1 word appears 690 times
1 word appears 809 times
1 word appears 940 times
1 word appears 1818 times

Note: your output needs to be grammatical.
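
For instance, number agreement can be handled with a tiny helper like the one below; `GrammarSketch` and `describe` are hypothetical names, and this is only one way to do it.

```scala
object GrammarSketch {

  /** Builds one output line with singular or plural forms chosen from the two numbers. */
  def describe(freqFreq: Int, freq: Int): String = {
    val subject = if (freqFreq == 1) "word appears" else "words appear"
    val times = if (freq == 1) "time" else "times"
    s"$freqFreq $subject $freq $times"
  }
}
```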

So 44% of the words in the book appear only once (1331 out of 3007).

I plotted a graph of the frequency distribution: [figure not included]

Part 5: Counting n-grams

An n-gram is a sequence of n words. We will be talking a lot more about n-grams later in this course, but for now we’re just going to count them.

In future exercises, I’ll be giving you a trait and asking you to implement it. To make sure that this makes sense, here is a simple example.

In the nlpclass-fall2013 jar that your project should have as a dependency, there is a trait nlpclass.NGramCountingToImplement. It looks like this:

trait NGramCountingToImplement {

  /**
   * Given a vector of tokens, return a mapping from ngrams 
   * to their counts.
   */
  def countNGrams(tokens: Vector[String]): Map[Vector[String], Int]

}

Your task is to implement this trait. You should create a file that looks like this:

package nlp.a0

import nlpclass.NGramCountingToImplement

class NGramCounting(n: Int) extends NGramCountingToImplement {

  def countNGrams(tokens: Vector[String]): Map[Vector[String], Int] = {
     ???  // Your code here
  }

}

and implement the method countNGrams. Hint: See Vector.sliding in the API.
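
To show what sliding does without implementing the trait, here is a small sketch; `SlidingSketch` and `windows` are hypothetical names. One detail worth knowing: when the input is non-empty but shorter than the window size, sliding yields a single shorter window, which this sketch filters out.

```scala
object SlidingSketch {

  /** All contiguous windows of exactly n tokens, in order of appearance. */
  def windows(tokens: Vector[String], n: Int): Vector[Vector[String]] =
    tokens
      .sliding(n)          // an iterator over consecutive windows
      .filter(_.size == n) // drop the partial window produced for short inputs
      .toVector
}
```

For `Vector("the", "white", "rabbit", "ran")` and `n = 3`, this yields the two windows `the white rabbit` and `white rabbit ran`. Turning the windows into counts is the part left for you.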

I’m going to test your class like this:

$ sbt console
scala> val aliceText = ...
scala> val counts = new nlp.a0.NGramCounting(3).countNGrams(aliceText)
scala> counts(Vector("the", "white", "rabbit"))
res0: Int = 21

Now write a program called CountTrigrams that prints the top 10 most frequent trigrams along with their counts. I should be able to run your program with something like this, and get this exact output (excluding sbt garbage):

$ sbt "run-main nlp.a0.CountTrigrams /Users/dhg/texts/alice.txt"
project gutenberg tm            57
the mock turtle                 53
i don t                         31
the march hare                  30
the project gutenberg           29
said the king                   29
the white rabbit                21
said the hatter                 21
said to herself                 19
said the mock                   19

Just for fun: of the 25,774 distinct trigrams, 23,294 (90.4%) appear only once, and more than 99.9% appear 12 times or fewer!

Part 6: Reading a data file

During this class, there are times when we will need to read data files that appear in particular formats. Thus, as a final exercise, we will write code to read one such file (that will be used in a future assignment).

The data will be in the form of files like this:

Outlook=Sunny,Temperature=Hot,Humidity=High,Wind=Weak,No
Outlook=Sunny,Temperature=Hot,Humidity=High,Wind=Strong,No
Outlook=Overcast,Temperature=Hot,Humidity=High,Wind=Weak,Yes
Outlook=Rain,Temperature=Mild,Humidity=High,Wind=Weak,Yes
Outlook=Rain,Temperature=Cool,Humidity=Normal,Wind=Weak,Yes
Outlook=Rain,Temperature=Cool,Humidity=Normal,Wind=Strong,No
Outlook=Overcast,Temperature=Cool,Humidity=Normal,Wind=Strong,Yes
Outlook=Sunny,Temperature=Mild,Humidity=High,Wind=Weak,No
Outlook=Sunny,Temperature=Cool,Humidity=Normal,Wind=Weak,Yes
Outlook=Rain,Temperature=Mild,Humidity=Normal,Wind=Weak,Yes
Outlook=Sunny,Temperature=Mild,Humidity=Normal,Wind=Strong,Yes
Outlook=Overcast,Temperature=Mild,Humidity=High,Wind=Strong,Yes
Outlook=Overcast,Temperature=Hot,Humidity=Normal,Wind=Weak,Yes
Outlook=Rain,Temperature=Mild,Humidity=High,Wind=Strong,No

Each line of the file represents a training instance for a classification task. Lines consist of a series of comma-separated fields. Each field (except the last) is in the format FEATURE=VALUE. The last field is a label for the instance.
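
As an illustration of this format, the sketch below parses a single line into its feature-value pairs and label. `InstanceParsingSketch`, `Instance`, and `parseLine` are hypothetical names, not part of the nlpclass-fall2013 jar. The pairs are kept in a Vector rather than a Map so that a feature occurring more than once on a line is not lost.

```scala
object InstanceParsingSketch {

  /** One training instance: its (feature, value) pairs in file order, plus its label. */
  case class Instance(features: Vector[(String, String)], label: String)

  /** Parses a line of comma-separated FEATURE=VALUE fields followed by a label. */
  def parseLine(line: String): Instance = {
    val fields = line.trim.split(",").toVector
    // Every field except the last is FEATURE=VALUE; the last field is the label.
    val features = fields.init.map { field =>
      field.split("=", 2) match {
        case Array(feature, value) => feature -> value
        case _                     => sys.error(s"Malformed field: $field")
      }
    }
    Instance(features, fields.last)
  }
}
```

Counting values per feature and label can then be built on top of the parsed instances.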

Your job is to write a program called CountFeatures that takes a file path and prints a list, for each feature, of feature values and their counts, broken down by label, with everything sorted alphabetically. If the above data were found in a file called data1.txt, I should be able to run your program with something like this, and get this exact output (excluding sbt garbage):

$ sbt "run-main nlp.a0.CountFeatures /Users/dhg/texts/data1.txt"
No      5
Yes     9

Humidity
    No
        High            4
        Normal          1
    Yes
        High            3
        Normal          6
Outlook
    No
        Rain            2
        Sunny           3
    Yes
        Overcast        4
        Rain            3
        Sunny           2
Temperature
    No
        Cool            1
        Hot             2
        Mild            2
    Yes
        Cool            3
        Hot             2
        Mild            4
Wind
    No
        Strong          3
        Weak            2
    Yes
        Strong          3
        Weak            6

Critically, your program must be flexible enough to handle files with any features and any labels, as long as they conform to the same format of comma-separated FEATURE=VALUE pairs followed by a label. You can assume that no feature, value, or label will ever contain either a comma or an equals sign.

If I run it with a file data2.txt that contains:

word=loved,word=film,word=loved,word=actor,pos=loved,pos=loved,positive
word=film,word=bad,word=plot,word=worst,neg=bad,neg=worst,negative
word=worst,word=film,word=dumb,word=film,neg=worst,neg=dumb,negative
word=car,word=chase,word=fight,word=scene,word=film,neutral
word=great,word=plot,word=best,word=film,pos=great,pos=best,positive
word=best,word=actor,word=bad,word=plot,pos=best,neg=bad,negative
word=hated,word=terrible,word=film,neg=hated,neg=terrible,negative

Then I should be able to run your program with something like this, and get this exact output (excluding sbt garbage):

$ sbt "run-main nlp.a0.CountFeatures /Users/dhg/texts/data2.txt"
negative 4
neutral 1
positive 2

neg
    negative
        bad             2
        dumb            1
        hated           1
        terrible        1
        worst           2
pos
    negative
        best            1
    positive
        best            1
        great           1
        loved           2
word
    negative
        actor           1
        bad             2
        best            1
        dumb            1
        film            4
        hated           1
        plot            2
        terrible        1
        worst           2
    neutral
        car             1
        chase           1
        fight           1
        film            1
        scene           1
    positive
        actor           1
        best            1
        film            2
        great           1
        loved           2
        plot            1