TDM 10100: Project 5 — Fall 2022
Motivation: R differs from other programming languages in that it typically works best with vectorized functions and the apply suite rather than loops.
Insider Knowledge
Apply Functions: an alternative to loops. You can use apply() and its variants (e.g., mapply(), sapply(), lapply(), vapply(), rapply(), and tapply()) to manipulate pieces of data from data.frames, lists, arrays, and matrices in a repetitive way. The apply() functions allow for flexibility in crossing data in multiple ways that can be awkward to express with a loop.
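To make the comparison concrete, here is a minimal sketch that computes the same result with a for loop, with sapply(), and with plain vectorized arithmetic. The vector `amounts` is made up for illustration and is not part of the project dataset.

```r
# Made-up data for illustration only (not from the project dataset).
amounts <- c(25, 100, 250, 50)

# Loop version: pre-allocate the result, then fill it one element at a time.
doubled_loop <- numeric(length(amounts))
for (i in seq_along(amounts)) {
  doubled_loop[i] <- amounts[i] * 2
}

# sapply version: apply the same function to every element.
doubled_sapply <- sapply(amounts, function(x) x * 2)

# Vectorized version: arithmetic in R already works element-wise.
doubled_vec <- amounts * 2

# All three approaches give the same result.
identical(doubled_loop, doubled_sapply)
identical(doubled_loop, doubled_vec)
```

For a simple operation like this, the vectorized form is usually the shortest. The apply functions become more useful when the work to be done on each piece is more involved.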
Context: In this project, we will focus on efficient ways of processing data in R.
Scope: r, data.frames, recycling, factors, if/else, for loops, apply suite
Dataset(s)
The following questions will use the following dataset(s) on Anvil:
/anvil/projects/tdm/data/election/escaped2020sample.txt
Helpful Hint
Both txt and csv files store information in plain text. In csv files, the fields are separated by commas. In txt files, the fields can be separated by commas, semicolons, tabs, or other characters.
The fields in this txt file are separated by the | character, so we can read it with read.csv() by adding sep="|" (see code below)
myDF <- read.csv("/anvil/projects/tdm/data/election/escaped2020sample.txt", sep="|")
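To see what the sep argument does without touching the full file, the sketch below reads a tiny pipe-delimited sample from a string. The sample rows are made up, and the CITY and STATE column names are assumptions for illustration.

```r
# Hypothetical pipe-delimited sample (not taken from the real file).
sample_text <- "NAME|CITY|STATE|TRANSACTION_AMT
DOE, JANE|WEST LAFAYETTE|IN|50
ROE, RICHARD|CHICAGO|IL|125"

# read.csv() defaults to sep = ","; setting sep = "|" splits on pipes instead,
# so the comma inside each NAME value is kept as part of the field.
toyDF <- read.csv(text = sample_text, sep = "|")

dim(toyDF)     # 2 rows, 4 columns
names(toyDF)
toyDF$NAME
```

With the default separator, the comma inside each name would be treated as a field boundary, which is one reason to check which delimiter a file uses before reading it.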
Questions
ONE
Read the dataset escaped2020sample.txt into a data.frame called myDF. The dataset contains contribution information for the 2020 election year.
The dataset has a column named TRANSACTION_DT, which is set up in the [month].[day].[year] format.
We want to organize the dates in chronological order.
When working with dates, it is important to use tools built specifically for this purpose (rather than using string manipulation, for example). We've provided you with the code below. The provided code uses the lubridate package, an excellent package that hides away many common issues that occur when working with dates. Feel free to check out the official cheatsheet if you'd like to learn more about the package.
library(lubridate, warn.conflicts = FALSE)
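As a small illustration of why date tools matter, the sketch below parses a few made-up date strings with mdy() and compares sorting the raw text with sorting the parsed dates. It assumes the lubridate package is installed. The strings are invented and only mimic the [month].[day].[year] layout.

```r
library(lubridate, warn.conflicts = FALSE)

# Made-up date strings in [month].[day].[year] form.
toy_dates <- c("03.01.2020", "12.25.2019", "01.15.2020")

# mdy() parses month-day-year strings into Date objects.
parsed <- mdy(toy_dates)
class(parsed)   # "Date"

# Sorting the raw strings orders them as text, not as dates.
sort(toy_dates)

# Sorting the Date objects gives chronological order.
sort(parsed)
```

Sorted as text, the December 2019 entry lands last because "12" comes after "01" and "03". Sorted as dates, it correctly comes first.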
- Use the mdy function (from the lubridate library) on the column TRANSACTION_DT to create a new column named newdates.
- Using tapply, sum the values in the TRANSACTION_AMT column according to the values in the newdates column.
- Plot the dates on the x-axis and the totals found in the previous step on the y-axis.
Helpful Hint
tapply() helps us to compute statistical measures such as mean, median, minimum, maximum, sum, etc… for data that is split into groups. tapply() is most helpful when we need to break up a vector into groups, and compute a function on each of the groups.
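Here is a minimal sketch of the split-and-compute pattern described above, using made-up amounts and dates rather than the project data. The variable names `amt`, `date`, and `daily_total` are chosen only for this example.

```r
# Made-up amounts and dates for illustration only.
amt  <- c(10, 20, 5, 40, 25)
date <- as.Date(c("2020-01-02", "2020-01-01", "2020-01-02",
                  "2020-01-03", "2020-01-01"))

# tapply(values, groups, function): split amt by date, then sum each group.
daily_total <- tapply(amt, date, sum)
daily_total

# The group labels are stored as character names on the result.
names(daily_total)
```

Because the names of the result are character strings, they would need to be converted back (for example with as.Date()) before being treated as dates again.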
- Code used to solve this problem.
- Output from running the code.
TWO
The plot that we just created in question one shows us that the majority of the data collected is found in the years 2018-2020. So we will focus on the year 2019.
- Create a new dataframe that contains only the data for dates in the range 01/01/2019 to 05/15/2019.
- Plot the new dataframe.
- What do you notice about the data?
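One common way to keep only the rows in a date range is logical indexing. The sketch below shows the idea on a made-up data.frame. The column name `newdates` mirrors question one, while `toyDF`, `start_date`, `end_date`, and `toyDF2019` are names invented for this example.

```r
# Made-up data; `newdates` mirrors the column name created in question one.
toyDF <- data.frame(
  newdates = as.Date(c("2018-12-31", "2019-01-01", "2019-03-10",
                       "2019-05-15", "2019-05-16")),
  TRANSACTION_AMT = c(10, 20, 30, 40, 50)
)

start_date <- as.Date("2019-01-01")
end_date   <- as.Date("2019-05-15")

# TRUE for rows whose date falls inside the range (both ends included).
in_range <- toyDF$newdates >= start_date & toyDF$newdates <= end_date

# Logical indexing keeps only the rows marked TRUE.
toyDF2019 <- toyDF[in_range, ]
nrow(toyDF2019)   # 3
```

Comparing Date objects in this way works because the dates were parsed first. The same comparison on the raw strings would not be reliable. Missing dates would produce NA in `in_range`, so real data may need an extra check with is.na().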
- Code used to solve this problem.
- Output from running the code.
- Answers to the questions above.
THREE
Let's look at the donations by city and state.
- Find the sum of the total donations contributed in each state.
- Create a new column that pastes together the city and state.
- Find the total donation amount for each city/state location. Do you notice anything suspicious in the output? How do you think that occurred?
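The sketch below illustrates how paste() combines two columns row by row, and how the combined label can then be used as a grouping variable. The data are made up, and the column names CITY, STATE, and `citystate` are assumptions for this example.

```r
# Made-up data; the CITY and STATE column names are assumptions.
toyDF <- data.frame(
  CITY  = c("LAFAYETTE", "LAFAYETTE", "LAFAYETTE", "CHICAGO"),
  STATE = c("IN", "LA", "IN", "IL"),
  TRANSACTION_AMT = c(10, 20, 30, 40)
)

# paste() is vectorized: it joins the two columns row by row.
toyDF$citystate <- paste(toyDF$CITY, toyDF$STATE, sep = ", ")

# Sum the amounts within each city/state group.
location_total <- tapply(toyDF$TRANSACTION_AMT, toyDF$citystate, sum)
location_total
```

Grouping on the combined label keeps cities with the same name in different states apart, which grouping on the city alone would not do.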
- Code used to solve this problem.
- Output from running the code.
- Answers to the questions above.
FOUR
Let's take a look at who is donating.
- Find the type of data that is in the NAME column.
- Split up the names in the NAME column to extract the first names of the donors. (This will not be perfect, but it is our first attempt.)
- How much money is donated (altogether) by people named Mary?
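As one possible starting point, the sketch below splits a few made-up names with strsplit() and picks out pieces with sapply(). It assumes names are written as "LAST, FIRST MIDDLE", which is an assumption about the layout and may not hold for every row of the real data.

```r
# Made-up names, assuming a "LAST, FIRST MIDDLE" layout.
toy_names <- c("SMITH, MARY A", "JONES, ROBERT", "LEE, MARY")
toy_amt   <- c(100, 50, 25)

class(toy_names)   # "character"

# Split each name at the comma-plus-space; strsplit() returns a list.
parts <- strsplit(toy_names, ", ")

# Take the piece after the comma, then keep only its first word.
after_comma <- sapply(parts, function(p) p[2])
first_name  <- sapply(strsplit(after_comma, " "), function(p) p[1])
first_name

# Total amount for one first name.
sum(toy_amt[first_name == "MARY"])   # 125
```

Names that do not contain a comma would yield NA in `first_name` under this approach, so the comparison in the last line may need to handle missing values on real data.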
- Code used to solve this problem.
- Output from running the code.
- Answers to the questions above.
FIVE
Employment status
- Using a barplot or dotchart, show the total amount of donations made by EMPLOYED vs NOT EMPLOYED individuals.
- What is the category of occupation that donates the most money?
- Plot something that you find interesting about the employment and/or occupation columns.
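A grouped total can be passed straight to barplot(). The sketch below shows this on made-up data. The column name `status` is hypothetical and is not the name of any column in the real dataset.

```r
# Made-up data; `status` is a hypothetical column name.
toyDF <- data.frame(
  status = c("EMPLOYED", "NOT EMPLOYED", "EMPLOYED",
             "NOT EMPLOYED", "EMPLOYED"),
  TRANSACTION_AMT = c(100, 20, 50, 30, 75)
)

# Total amount within each group.
totals <- tapply(toyDF$TRANSACTION_AMT, toyDF$status, sum)
totals

# One bar per group, labeled by the group names.
barplot(totals,
        main = "Total donations by group (toy data)",
        ylab = "Total amount")
```

On the real data, the rows of interest would first need to be identified in whichever column records employment, so treat this only as a picture of the plotting step.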
- Code used to solve this problem.
- Output from running the code.
- 1-2 sentences explaining what you chose to plot and why.
- Answers to the questions above.
Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, we recommend downloading your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.