STAT 19000: Project 4 — Spring 2022
Motivation: Up until this point we’ve utilized bits and pieces of the pandas library to perform various tasks. In this project we will formally introduce pandas and numpy, and utilize their capabilities to solve data-driven problems.
Context: By now you’ll have had some limited exposure to pandas. This is the first in a three-project series that covers some of the main components of both the numpy and pandas libraries. We will take a two-project intermission to learn about functions, and then continue.
Scope: python, pandas
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/stackoverflow/unprocessed/2021.csv
Questions
Question 1
The following is an example showing how to time creating a dictionary in two different ways.
from block_timer.timer import Timer
# time each way of creating an empty dict inside its own Timer block
with Timer(title="Using dict to declare a dict") as t1:
    my_dict = dict()
with Timer(title="Using {} to declare a dict") as t2:
    my_dict = {}
# or if you need more fine-tuned values
print(t1.elapsed)
print(t2.elapsed)
There are a variety of ways to store, read, and write data. The most common is probably still csv data. csv data is simple and easy to understand; however, it is a poor format for reading, writing, and storing data. It is slow to read. It is slow to write. It takes up a lot of space.
Luckily, there are some other great options!
Check out the pandas documentation showing the various methods used to read and write data: pandas.pydata.org/docs/reference/io.html
Read the 2021.csv file into a pandas DataFrame called my_df. Use the Timer to time writing my_df out to /scratch/brown/ALIAS/2021.csv, /scratch/brown/ALIAS/2021.parquet, and /scratch/brown/ALIAS/2021.feather.
Make sure to replace "ALIAS" with your Purdue alias.
Use f-strings to print how much faster writing the parquet format was than the csv format, as a percentage.
Use f-strings to print how much faster writing the feather format was than the csv format, as a percentage.
You should now have 3 files in your $SCRATCH directory: 2021.csv, 2021.parquet, and 2021.feather.
Use the Timer to time reading in the 2021.csv, 2021.feather, and 2021.parquet files into pandas DataFrames called my_df.
Use f-strings to print how much faster reading the parquet format was than the csv format, as a percentage.
Use f-strings to print how much faster reading the feather format was than the csv format, as a percentage.
Round percentages to 1 decimal place. See here for examples on how to do this using f-strings.
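One possible reading of "how much faster, as a percentage" is the share of the csv time that the other format saved. The sketch below assumes that definition and uses made-up timings in place of the Timer's elapsed values; the helper name percent_faster is hypothetical. Because the result is a ratio, the unit of the elapsed values does not matter.

```python
def percent_faster(baseline_seconds: float, other_seconds: float) -> float:
    """Return the share of the baseline time that was saved, as a percentage."""
    return (baseline_seconds - other_seconds) / baseline_seconds * 100


# Hypothetical elapsed times, standing in for values such as t1.elapsed
csv_seconds = 2.0
parquet_seconds = 0.5

pct = percent_faster(csv_seconds, parquet_seconds)

# The :.1f format specifier rounds to 1 decimal place inside the f-string
message = f"Writing parquet was {pct:.1f}% faster than writing csv."
print(message)
```

If your instructor defines "faster" as a speed-up ratio instead (csv time divided by the other time), only the body of the helper would change.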
Finally, how much space does each file take up? Use f-strings to print the size in MB.
There are a couple of options on how to get file size, here.
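As a rough sketch of two standard-library options, the example below writes a throwaway file to a temporary directory and reads its size back. The file name and contents are made up for illustration, and it assumes 1 MB means 10**6 bytes (divide by 1024**2 instead if you want MiB).

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for one of the files in your scratch directory
    sample_path = Path(tmp_dir) / "sample.csv"
    sample_path.write_bytes(b"x" * 2_500_000)

    # Option 1: os.path.getsize returns the size in bytes
    size_bytes = os.path.getsize(sample_path)

    # Option 2: pathlib exposes the same number through stat()
    size_bytes_alt = sample_path.stat().st_size

    # Assumption: 1 MB = 10**6 bytes
    size_mb = size_bytes / 1_000_000
    size_line = f"{sample_path.name}: {size_mb:.1f} MB"
    print(size_line)
```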
- Code used to solve this problem.
- Output from running the code.
Question 2
If you haven’t already, please check out and walk through the 10 minute intro to pandas. It is a really great way to get started using pandas.
A method is a function that is associated with a particular class. For example, mean is a method of the pandas DataFrame object.
# myDF is an object of class DataFrame
# mean is a method of the DataFrame class
myDF.mean()
Typically, when using pandas, you will be working with either a DataFrame or a Series. The DataFrame class is what you would normally think of when you think about a data frame. A Series is essentially 1 column or row of data. In pandas, both Series and DataFrames have methods that perform various operations.
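To make the DataFrame and Series distinction concrete, here is a small sketch on a made-up DataFrame (toy_df and its columns are illustrative, not part of the survey data). Selecting one column or one row gives a Series, and calling mean on the whole DataFrame returns a Series of per-column means.

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    "age": [20, 30, 40],
    "salary": [50000.0, 72000.0, 91000.0],
})

age_column = toy_df["age"]    # selecting one column gives a Series
first_row = toy_df.iloc[0]    # selecting one row also gives a Series
column_means = toy_df.mean()  # DataFrame method: one mean per column

print(type(toy_df).__name__, type(age_column).__name__, type(first_row).__name__)
print(column_means)
print(age_column.mean())      # Series method: a single number
```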
Use indexing and the value_counts method to get and print the count of Gender for survey respondents from Indiana.
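The general pattern of indexing followed by value_counts can be sketched on toy data as below. The column name US_State is an assumption made for this example; check my_df.columns for the actual name of the column that holds the respondent's state.

```python
import pandas as pd

# Toy data; "US_State" is an assumed column name, "Gender" comes from the question
toy_df = pd.DataFrame({
    "US_State": ["Indiana", "Ohio", "Indiana", "Indiana", "Texas"],
    "Gender": ["Man", "Woman", "Woman", "Man", "Man"],
})

# Boolean mask: True for the rows we want to keep
is_indiana = toy_df["US_State"] == "Indiana"

# .loc selects the masked rows and one column, value_counts tallies each value
gender_counts = toy_df.loc[is_indiana, "Gender"].value_counts()
print(gender_counts)
```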
Next, use the plot method to generate a plot. Use the rot option of the plot method to rotate the x-labels so they are displayed vertically.
- Code used to solve this problem.
- Output from running the code.
Question 3
Let’s figure out whether or not YearsCode is associated with ConvertedCompYearly. Get an array of unique values for the YearsCode column. As you will notice, there are some options that are not numeric values! In fact, when we read in the data, because of these values ("Less than 1 year", "More than 50 years", etc.), pandas was unable to choose an appropriate data type for that column of data, and set it to "object". Use the following code to convert the column to strings.
my_df['YearsCode'] = my_df['YearsCode'].astype("str")
Great! Now that column contains strings. Use the replace method with regex=True to replace all non-numeric characters with nothing!
my_df["YearsCode"] = my_df['YearsCode'].replace("[^0-9]", "", regex=True)
Next, use the astype method to convert the column to "int64".
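One thing to watch for: if a respondent left YearsCode blank, stripping non-digit characters can leave an empty string, and converting an empty string to "int64" raises an error. The sketch below shows one possible workaround on a toy Series. It swaps in pd.to_numeric and the nullable "Int64" dtype, which is a different technique from the plain astype("int64") call described above, and it uses .str.replace rather than replace for the same regex step. Dropping the rows with missing values first is another reasonable option.

```python
import pandas as pd

# Toy values mimicking the kinds of entries described above, plus one missing value
years = pd.Series(["5", "Less than 1 year", "More than 50 years", None, "12"])

# Keep only the digits: "Less than 1 year" -> "1", "More than 50 years" -> "50"
digits_only = years.astype("str").str.replace("[^0-9]", "", regex=True)

# errors="coerce" turns anything that is not a number (such as "") into a missing value;
# the nullable "Int64" dtype can hold integers alongside missing values
cleaned = pd.to_numeric(digits_only, errors="coerce").astype("Int64")
print(cleaned)
```

Note that this cleaning maps "Less than 1 year" to 1 and "More than 50 years" to 50, which is worth keeping in mind when interpreting the plot.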
Finally, use the plot method to plot YearsCode on the x-axis and ConvertedCompYearly on the y-axis. Use the kind argument to make it a "scatter" plot and set logy=True, so large salaries don’t ruin our plot.
Write 1-2 sentences with any observations you may have.
- Code used to solve this problem.
- Output from running the code.
Question 4
Check out the LanguageHaveWorkedWith column. It contains a semi-colon separated list of languages that the respondent has worked with. Pretty cool.
How many times is each language listed? If you get stuck, refer to the hints below. What languages have you worked with from this list?
You can start by converting the column to strings.
This function can be used to "flatten" a list of lists.
Output
[1, 2, 3, 4, 5, 6]
You can apply any of the Python string methods to an entire column of strings in
Check out the
You could use a dict to count each of the languages, or, since this is a
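The hints above (split the strings, flatten the list of lists, count with a dict) can be sketched with the standard library as follows. The responses list is made-up data standing in for the LanguageHaveWorkedWith column, and Counter is used here as a dict subclass that does the tallying.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for the semi-colon separated values in the column
responses = ["Python;R;SQL", "Python;JavaScript", "R"]

# Split each string into a list of languages: a list of lists
split_lists = [response.split(";") for response in responses]

# Flatten the list of lists into one long list
all_languages = list(chain.from_iterable(split_lists))

# Counter is a dict subclass that maps each language to its count
language_counts = Counter(all_languages)
print(language_counts.most_common())
```

A pandas-only route, if you prefer it, would be to chain .str.split(";"), .explode(), and .value_counts() on the column after handling missing values.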
- Code used to solve this problem.
- Output from running the code.
Question 5
pandas really helps out when it comes to working with data in Python. This is a really cool dataset, so use your newfound skills to do a mini-analysis. Your mini-analysis should include 1 or more graphics, along with some interesting observation you made while exploring the data.
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.