TDM 30100: Project 5 — 2023
Motivation: Documentation is one of the most critical parts of a project. Many tools are designed specifically to help document a project, and each has its own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can’t go wrong with tools like Sphinx or pdoc.
Context: This is the third project in a 3-project series where we explore thoroughly documenting Python code, while solving data-driven problems.
Scope: Python, documentation
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_authors.json
- /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_series.json
- /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json
- /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_reviews_dedup.json
Questions
The listed datasets are fairly large, and interesting! They are JSON-formatted data. Each row of a single JSON file can be read in and processed individually. Take a look at a single row.
%%bash
head -n 1 /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json
This is nice, because you can individually process a single row. Anytime you can do something like this, it is easy to break a problem into smaller pieces and speed up processing. The following demonstrates how you can read in a single line and process it.
import json

with open("/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json") as f:
    for line in f:
        print(line)
        parsed = json.loads(line)
        print(f"{parsed['isbn']=}")
        print(f"{parsed['num_pages']=}")
        break
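The same line-by-line idea can be wrapped in a generator, so that only the rows you actually ask for are read from disk. The following is a minimal sketch, not a required part of the project: the names iter_json_lines and first_rows are made up for illustration, and it assumes every non-empty line of the file is one complete JSON object.

```python
import json
from itertools import islice
from typing import Iterator, List


def iter_json_lines(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the file at `path`."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def first_rows(path: str, n: int) -> List[dict]:
    """Return the first `n` parsed rows without reading the rest of the file."""
    return list(islice(iter_json_lines(path), n))
```

Because the generator is lazy, calling first_rows(path, 3) on a very large file stops reading after the third row, which is handy while you are still exploring the data.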
In this project, the overall goal will be to implement functions that perform certain operations, write the best docstrings you can, and use your choice of pdoc or sphinx to generate a pretty set of documentation.
Begin this project by choosing a tool, pdoc or sphinx, and setting up a firstname-lastname-project05.py module that will host your Python functions. In addition, create a Jupyter Notebook that will be used to test out your functions and generate your documentation. At the end of this project, your deliverable will be your .ipynb notebook and either a series of screenshots that captures your documentation, or a PDF created by exporting the resulting webpage of documentation.
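To give a feel for what a detailed docstring can look like, here is a small sketch. It uses Google-style sections (Args, Returns, Raises, Example), which is only one of several common conventions; Sphinx can read this style through its sphinx.ext.napoleon extension. The function name parse_book_line and the sample ISBN are invented for illustration and are not part of the assignment.

```python
import json


def parse_book_line(line: str) -> dict:
    """Parse one line of a Goodreads books file into a small dictionary.

    Args:
        line: A single line of JSON text, as read from the file.

    Returns:
        A dictionary with the keys ``isbn`` and ``num_pages``. A key that is
        missing from the input is mapped to ``None``.

    Raises:
        json.JSONDecodeError: If ``line`` is not valid JSON.

    Example:
        >>> parse_book_line('{"isbn": "1234567890", "num_pages": "256"}')
        {'isbn': '1234567890', 'num_pages': '256'}
    """
    parsed = json.loads(line)
    # .get() returns None for a missing key instead of raising KeyError
    return {"isbn": parsed.get("isbn"), "num_pages": parsed.get("num_pages")}
```

Whichever tool you choose, the sections that matter most to a reader are usually the same: what goes in, what comes out, what can go wrong, and one short example.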
Question 1 (1 pt)
- In your Jupyter Notebook, run the example code above and make sure you understand the selected data.
Question 2 (2 pts)
- Write a function called scrape_image_from_url that accepts a URL (as a string) and returns a bytes object of the data.
- Write a function called display_image_from_bytes that displays the image directly, without saving the image to disk.
Make sure scrape_image_from_url cleans up after itself and doesn’t leave any image files on the filesystem.
scrape_image_from_url
- Create a variable with a temporary file name using the uuid package.
- Use the requests package to get the response.
    import requests
    response = requests.get(url, stream=True)
    # then the first argument to copyfileobj will be response.raw
- Open the file and use the shutil package's copyfileobj function to copy response.raw to the file.
- Open the file and read the contents into a bytes object. You can verify that an object is a bytes object by checking that type(my_object) outputs bytes.
- Use os.remove to remove the image file.
- Return the bytes object.
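The file-handling part of these steps can be tried out without any network access. The sketch below is an illustration under stated assumptions, not the expected solution: the helper name stream_to_bytes is hypothetical, an in-memory io.BytesIO object stands in for response.raw, and placing the file in the system temporary directory is a choice made here rather than a requirement.

```python
import io
import os
import shutil
import tempfile
import uuid


def stream_to_bytes(stream) -> bytes:
    """Copy a binary stream to a uniquely named file, read it back, delete the file."""
    # uuid4 gives a random name, so two calls never write to the same file
    filename = os.path.join(tempfile.gettempdir(), f"{uuid.uuid4()}.tmp")
    try:
        with open(filename, "wb") as f:
            shutil.copyfileobj(stream, f)  # with requests, `stream` would be response.raw
        with open(filename, "rb") as f:
            data = f.read()
    finally:
        if os.path.exists(filename):
            os.remove(filename)  # clean up even if the copy or read failed
    return data


# Stand-in for a download: a few arbitrary bytes, not a real image
fake_download = io.BytesIO(b"\x89PNG not a real image")
result = stream_to_bytes(fake_download)
print(type(result))  # <class 'bytes'>
```

Putting os.remove inside a finally block is one way to make sure the function cleans up after itself even when something goes wrong partway through.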
display_image_from_bytes
- Convert the byte data into a readable image format using an image processing library.
- Open and display this readable image.
You can verify your function works by running the following:
import shutil
import requests
import os
import uuid
import hashlib
url = 'https://images.gr-assets.com/books/1310220028m/5333265.jpg'
my_bytes = scrape_image_from_url(url)
m = hashlib.sha256()
m.update(my_bytes)
print(m.hexdigest())
display_image_from_bytes(my_bytes)
ca2d4506088796d401f0ba0a72dda441bf63ca6cc1370d0d2d1d2ab949b00d02
(image)
Question 3 (2 pts)
- Write a Python function called top_reviewers that reads the file Goodreads_reviews_parsed.json and returns the IDs of the top 5 users who have provided the most reviews.
The following shows how to test the function
filename = "Goodreads_reviews_parsed.json"
print(top_reviewers(filename))
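One common way to count occurrences while streaming through a file is collections.Counter. The sketch below shows the counting pattern only and rests on two assumptions that you should check against the actual file: that each line is one JSON object, and that the reviewer is stored under a key named user_id. The function name top_users_by_review_count is hypothetical.

```python
import json
from collections import Counter
from typing import List


def top_users_by_review_count(filename: str, n: int = 5) -> List[str]:
    """Return the IDs of the `n` users with the most reviews, most active first."""
    counts = Counter()
    with open(filename, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            review = json.loads(line)
            counts[review["user_id"]] += 1  # assumed key name
    # most_common returns (user_id, count) pairs sorted by count, descending
    return [user_id for user_id, _ in counts.most_common(n)]
```

Only one line is held in memory at a time, so the approach scales to files that are too large to load in one go; the Counter itself grows with the number of distinct users.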
Question 4 (2 pts)
- Create a new function that does something interesting with one or more of these datasets. Just like all the previous functions, make sure to include detailed and clear docstrings.
Question 5 (1 pt)
- Generate your final documentation, and assemble and submit your deliverables:
  - Screenshots and/or a PDF exported from your resulting documentation web page
Project 05 Assignment Checklist
- Jupyter .ipynb file with your code, comments, and outputs for the assignment
  - firstname-lastname-project05.ipynb
- Screenshots and/or a PDF exported from your resulting documentation web page, to show your outputs
- An HTML file (.htm) with your newest set of documentation
- Submit files through Gradescope
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.