TDM 20100: Project 6 — 2022
Motivation: awk
is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like awk
.
Context: This is the second of three projects where we introduce awk
. awk
is a powerful tool that can be used to perform a variety of the tasks that we’ve previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.
Scope: awk, UNIX utilities
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/craigslist/vehicles_clean.txt
-
/anvil/projects/tdm/data/donorschoose/Donations.csv
-
/anvil/projects/tdm/data/whin/weather.csv
Questions
Question 1
Use awk
to determine how many columns and rows are in the following dataset: /anvil/projects/tdm/data/craigslist/vehicles_clean.txt
.
Make sure the output is formatted as follows.
rows: 12345 columns: 12345
-
Code used to solve this problem.
-
Output from running the code.
Question 2
What are the possible "conditions" of the vehicles being sold: /anvil/projects/tdm/data/craigslist/vehicles_clean.txt
? Use awk
to answer this question. How many cars of each condition are in the dataset? Make sure to format the output as follows.
Condition Number of cars --------- -------------- AAA 12345 bb 99999
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Use awk
to determine the years (for example, 2020, 2021, etc) of the donations in the dataset: /anvil/projects/tdm/data/donorschoose/Donations.csv
?
The |
-
Code used to solve this problem.
-
Output from running the code.
Question 4
Use awk
to determine the total donations (in dollars) by year: /anvil/projects/tdm/data/donorschoose/Donations.csv
?
Use printf
and the unix column
utility to format the output as follows.
Year Donations in dollars 2020 $1234.56 2021 $9999.99
-
Code used to solve this problem.
-
Output from running the code.
Question 5
Use awk
to determine the average temperature_high
by month: /anvil/projects/tdm/data/whin/weather.csv
. Make sure the output is sorted by month (you can use sort
for that). If you are feeling adventurous, try and use awk
to output a horizontal bar plot using just ascii and awk
. This last part is not required, but could be fun.
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |