STAT 39000: Project 5 — Fall 2020
Motivation: Becoming comfortable stringing together commands and navigating files in a terminal is important for every data scientist. By learning the basics of a few useful tools, you will be able to quickly understand and manipulate files in ways that are not possible with tools like Microsoft Office or Google Sheets.
Context: We’ve been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.
Scope: grep, regular expression basics, UNIX utilities, redirection, piping
You can find useful examples that walk you through relevant material in The Examples Book:
It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
Don’t forget the very useful documentation shortcut ? for R code. To use, simply type ? in the console, followed by the name of the function you are interested in. In the Terminal, you can use the man command to check the documentation of bash code.
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/amazon/amazon_fine_food_reviews.csv
A public sample of the data can be found here: amazon_fine_food_reviews.csv
All questions should be answered using the full dataset located on Scholar. You may use the public sample of the data to experiment with your solutions before running them on the full dataset.
Here are three videos that might also be useful, as you work on Project 5:
Questions
Question 1
What is the Id of the most helpful review, according to the highest HelpfulnessNumerator?
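You can always pipe output to head to see only the first few lines. If sort comes before head in the pipeline, you may see messages such as "sort: write failed: standard output: Broken pipe" and "sort: write error". This is because head stops reading once it has printed the lines it needs, which closes the pipe while sort is still writing. The output you see is still correct.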
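As a small illustration of sorting and then keeping only the top line, here is a sketch on made-up data. The three rows and their two-column layout are assumptions for the example, not the layout of the real dataset.

```shell
# Toy data (made up for illustration): an id and a count on each line.
# Sort numerically (n) in reverse (r) on the 2nd comma-separated field,
# then keep only the first line of the sorted output.
top=$(printf '%s\n' '1,3' '2,10' '3,7' | sort -t, -k2,2nr | head -n 1)
echo "$top"   # prints 2,10
```

Real CSV files may contain quoted text fields with embedded commas, so splitting on every comma is only safe when you have checked that the columns you need are not affected.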
- Line of UNIX commands used to solve the problem.
- The Id of the most helpful review.
Question 2
Some entries under the Summary column appear more than once. Calculate the proportion of unique summaries over the total number of summaries. Use two lines of UNIX commands to find the numerator and the denominator, and manually calculate the proportion.
To further clarify what we mean by unique, if we had the following vector in R, c("a", "b", "a", "c"), its unique values are c("a", "b", "c").
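The same idea can be sketched in the terminal with the four toy values from the R vector, one per line. This is only an illustration on made-up input; the variable names are hypothetical.

```shell
# The toy values a, b, a, c, one per line.
values=$(printf '%s\n' a b a c)

# Denominator: the total number of lines.
total=$(printf '%s\n' "$values" | wc -l | tr -d ' ')

# Numerator: sort -u keeps one copy of each distinct line.
unique=$(printf '%s\n' "$values" | sort -u | wc -l | tr -d ' ')

echo "$unique unique out of $total"   # prints 3 unique out of 4
```

Note that uniq on its own only collapses adjacent duplicate lines, which is why the input is sorted first (sort -u does both steps at once).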
- Two lines of UNIX commands used to solve the problem.
- The proportion of unique Summary values.
Question 3
Use a chain of UNIX commands, piped in a sequence, to create a frequency table of Score.
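A frequency table is commonly built by sorting the values and then counting runs of identical lines. Here is a minimal sketch on six made-up scores; extracting the right column from the real file is left out on purpose.

```shell
# Six made-up scores, one per line.
scores=$(printf '%s\n' 5 1 5 4 5 1)

# sort groups identical values together; uniq -c prefixes each
# distinct value with the number of times it occurs.
freq=$(printf '%s\n' "$scores" | sort | uniq -c)
echo "$freq"
```

Each output line holds a count followed by the value, so here the value 5 appears with a count of 3. The amount of leading whitespace in front of the counts differs between systems.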
- The line of UNIX commands used to solve the problem.
- The frequency table.
Question 4
Who is the user with the highest number of reviews? There are two columns you could use to answer this question, but which column do you think would be most appropriate and why?
You may need to pipe the output to |
To create the frequency table, read through the
|
- The line of UNIX commands used to solve the problem.
- The frequency table.
Question 5
Anecdotally, there seems to be a tendency to leave reviews when we feel strongly (either positive or negative) about a product. For the user with the highest number of reviews (i.e., the user identified in question 4), would you say that they follow this pattern of extremes? Let’s consider 5 star reviews to be strongly positive and 1 star reviews to be strongly negative. Let’s consider anything in between neither strongly positive nor negative.
You may find the solution to problem (3) useful.
- The line of UNIX commands used to solve the problem.
Question 6
Find the most helpful review with a Score of 5. Then (separately) find the most helpful review with a Score of 1. As before, we are considering the most helpful review to be the review with the highest HelpfulnessNumerator.
You can use multiple lines to solve this problem.
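One possible pattern is to filter rows on one column first and then sort the survivors on another. The sketch below uses made-up rows with a hypothetical three-column layout (id, score, helpful votes), which is an assumption for the example and not the layout of the real file.

```shell
# Made-up rows: id,score,helpful_votes
rows=$(printf '%s\n' '10,5,2' '11,1,9' '12,5,8' '13,1,4')

# Keep rows whose 2nd field is 5, sort by the 3rd field
# (numeric, descending), and take the first line.
best5=$(printf '%s\n' "$rows" | awk -F, '$2 == 5' | sort -t, -k3,3nr | head -n 1)

# Same pipeline for rows whose 2nd field is 1.
best1=$(printf '%s\n' "$rows" | awk -F, '$2 == 1' | sort -t, -k3,3nr | head -n 1)

echo "$best5"   # prints 12,5,8
echo "$best1"   # prints 11,1,9
```

awk is used here because it can test a single field; a plain grep for a digit would also match that digit in other columns.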
- The lines of UNIX commands used to solve the problem.
- The ProductId values of both requested reviews.
Question 7
For only the two ProductId values from the previous question, create a new dataset called scores.csv that contains the ProductId and Score from all reviews for these two items.
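Writing the result of a pipeline to a file uses the > redirection operator. The sketch below shows the general shape on made-up rows; the product IDs P1 and P3, the three-column layout, and the temporary output file are all hypothetical stand-ins.

```shell
# Made-up rows: id,product,score
rows=$(printf '%s\n' '1,P1,5' '2,P2,3' '3,P3,1' '4,P1,4')

# A temporary file stands in for the real output file.
outfile=$(mktemp)

# grep -E keeps rows for either product, cut keeps fields 2 and 3,
# and > writes the result to the file (replacing any old contents).
printf '%s\n' "$rows" | grep -E ',(P1|P3),' | cut -d, -f2,3 > "$outfile"

cat "$outfile"
```

Using >> instead of > would append to the file rather than overwrite it, which matters if you run the command more than once.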
- The line of UNIX commands used to solve the problem.
OPTIONAL QUESTION
Use R to load up scores.csv into a new data.frame called dat. Create a histogram for each product's Score. Compare the most helpful review Score with those given in the histogram. Based on this comparison, point out some curiosities about the product that may be worth exploring. For example, if a product receives many high scores, but has a super helpful review that gives the product 1 star, I may tend to wonder if the product is not as great as it seems to be.
- R code used to create the histograms.
- One histogram for each ProductId.
- 1-2 sentences describing the curious pattern that you would like to further explore.