建立co-actor网络,完成SQLite, 以及D3等环境搭建。
Q1 Collect data from TMDb to build a co-actor network
Build a co-actor network of highly rated movies using information from The Movie Database (TMDb). Through data collection and analysis, create a graph showing the relationships between actors based on their highly rated movies.
Follow the instructions found in Q1.py to complete the Graph class, the TMDbAPIUtils class, and the one global function. The Graph class will serve as a re-usable way to represent and write out your collected graph data. The TMDbAPIUtils class will be used to work with the TMDb API for data retrieval.
Tasks
- Implementation of the Graph class according to the instructions in Q1.py.
- a. The graph is undirected, thus {a, b} and {b, a} refer to the same undirected edge in the graph; keep only either {a, b} or {b, a} in the Graph object. A node’s degree is the number of (undirected) edges incident on it. In/ out-degrees are not defined for undirected graphs.
- Producing correct nodes.csv and edges.csv. a. If an actor’s name has comma characters (“,”), remove those characters before writing that name into the CSV files.
Q2 SQLite
Construct a TMDb database in SQLite, partition it, and combine information within tables to answer questions.
You will modify the given Q2.py file by adding SQL statements to it. We suggest testing your SQL locally on your computer using interactive tools to speed up testing and debugging, such as DB Browser for SQLite.
Important Notes
- If the final output asks for a decimal column, format it to two places using printf(). Do NOT use the ROUND() function, as in rare cases, it works differently on different platforms. If you need to sort that column, be sure you sort it using the actual decimal value and not the string returned by printf.
- A sample class has been provided to show example SQL statements; you can turn off this output by changing the global variable SHOW from True to False.
- In this question, you must only use INNER JOIN when performing a join between two tables, except for part g. Other types of joins may result in incorrect results.
Tasks
- Create tables and import data.
- a. Create two tables (via two separate methods, part_ai_1 and part_ai_2, in Q2.py) named movies and movie_cast with columns having the indicated data types:
- i. movies
- id (integer)
- title (text)
- score (real)
- ii. movie_cast
- movie_id (integer)
- cast_id (integer)
- cast_name (text)
- birthday (text)
- popularity (real)
- i. movies
- b. Import the provided movies.csv file into the movies table and movie_cast.csv into the movie_cast table
- i. Write Python code that imports the .csv files into the individual tables. This will include looping though the file and using the ‘INSERT INTO’ SQL command. Make sure you use paths relative to the Q2 directory.
- c. Vertical Database Partitioning. Database partitioning may help speed up queries. Create a new table cast_bio from the movie_cast table. Be sure that the values are unique when inserting into the new cast_bio table. Read this page for an example of vertical database partitioning.
- i. cast_bio
- cast_id (integer)
- cast_name (text)
- birthday (text)
- popularity (real)
- i. cast_bio
- a. Create two tables (via two separate methods, part_ai_1 and part_ai_2, in Q2.py) named movies and movie_cast with columns having the indicated data types:
- Create indexes. Create the following indexes. Indexes increase data retrieval speed; though the speed improvement may be negligible for this small database, it is significant for larger databases.
- a. movie_index for the id column in movies table
- b. cast_index for the cast_id column in movie_cast table
- c. cast_bio_index for the cast_id column in cast_bio table
- Calculate a proportion. Find the proportion of movies with a score between 7 and 20 (both limits inclusive). The proportion should be calculated as a percentage.
- a. Output format and example value: 7.70
- Find the most prolific actors. List 5 cast members with the highest number of movie appearances that have a popularity ] 10. Sort the results by the number of appearances in descending order, then by cast_name in alphabetical order.
- a. Output format and example row values (cast_name,appearance_count): Harrison Ford,2
- List the 5 highest-scoring movies. In the case of a tie, prioritize movies with fewer cast members. Sort the result by score in descending order, then by number of cast members in ascending order, then by movie name in alphabetical order.
- a. Output format and example values (movie_title,score,cast_count): Star Wars: Holiday Special,75.01,12 Games,58.49,33
- Get high scoring actors. Find the top ten cast members who have the highest average movie scores. Sort the output by average_score in descending order, then by cast_name alphabetically.
- a. Exclude movies with score < 25 before calculating average_score.
- b. Include only cast members who have appeared in three or more movies with score >= 25.
- i. Output format and example value (cast_id,cast_name,average_score): 8822,Julia Roberts,53.00
- Creating views. Create a view (virtual table) called good_collaboration that lists pairs of actors who have had a good collaboration as defined here. Each row in the view describes one pair of actors who appeared in at least 2 movies together AND the average score of these movies is >= 40.
For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lower numeric value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There should not be any “self-pair” where cast_member_id1 is the same as cast_member_id2. Remember that creating a view will not produce any output, so you should test your view with a few simple select statements during development. One such test has already been added to the code as part of the auto-grading. NOTE: No any code that creates a ‘TEMP’ or ‘TEMPORARY’ view that you may have used for testing.
Optional Reading: Why create views? - Find the best collaborators. Get the 5 cast members with the highest average scores from the good_collaboration view, and call this score the collaboration_score. This score is the average of the average_movie_score corresponding to each cast member, including actors in cast_member_id1 as well as cast_member_id2.
- a. Order your output by collaboration_score in descending order, then by cast_name alphabetically.
- b. Output format and example values(cast_id,cast_name,collaboration_score): 2,Mark Hamil,99.32 1920,Winoa Ryder,88.32
- SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTS documentation).
- a. Import movie overview data from the movie_overview.csv into a new FTS table called movie_overview with the schema: movie_overview id (integer) overview (text)
NOTE: Create the table using fts3 or fts4 only. Also note that keywords like NEAR, AND, OR, and NOT are case-sensitive in FTS queries. - b. Count the number of movies whose overview field contains the word ‘fight’. Matches are not case sensitive. Match full words, not word parts/sub-strings.
- i. Example: Allowed: ‘FIGHT’, ‘Fight’, ‘fight’, ‘fight.’ Disallowed: ‘gunfight’, ‘fighting’, etc.
- ii. Output format and example value: 12
- c. Count the number of movies that contain the terms ‘space’ and ‘program’ in the overview field with no more than 5 intervening terms in between. Matches are not case sensitive. As you did in h(i)(1), match full words, not word parts/sub-strings.
- i. Example: Allowed: ‘In Space there was a program’, ‘In this space program’ Disallowed: ‘In space you are not subjected to the laws of gravity. A program.’
- ii. Output format and example value:
- a. Import movie overview data from the movie_overview.csv into a new FTS table called movie_overview with the schema: movie_overview id (integer) overview (text)
Q3-D3 Visualizing Wildlife Trafficking by Species
This will utilize a dataset provided to ensure the global trade of wildlife is both legal and sustainable.
Using species-related data, you will build a bar chart to visualize the most frequently illegally trafficked species between 2015 and 2023. Using D3, you will get firsthand experience with how interactive plots can make data more visually appealing, engaging, and easier to parse.
Important Notes
- Setup an HTTP server to run your D3 visualizations as discussed in the D3 lecture (OMS students: watch lecture video. Campus students: see lecture PDF.). The easiest way is to use http.server for Python 3.x. Run your local HTTP server in the hw1-skeleton/Q3 folder.
- We have provided sections of skeleton code and comments to help you complete the implementation. While you do not need to remove them, you need to write additional code to make things work.
- All d3*.js files are provided in the lib folder and referenced using relative paths in your html file. For example, since the file “Q3/Q3.html” uses d3. The 3 files that are referenced are:
- a. lib/d3/d3.min.js
- b. lib/d3-dsv/d3-dsv.min.js
- c. lib/d3-fetch/d3-fetch.min.js
- In your html / js code, use a relative path to read the dataset file. Since Q3 requires reading data from the q3.csv file, the path must be “q3.csv”.
- Load the data from q3.csv using D3 fetch methods. We recommend d3.dsv().
- VERY IMPORTANT: Use the Margin Convention guide to specify chart dimensions and layout.
Tasks
Q3.html: When run in a browser, should display a horizontal bar plot with the following specifications:
- The bar plot must display one bar for each of the five most trafficked species by count. Each bar’s length corresponds to the number of wildlife trafficking incidents involving that species between 2015 and 2023, represented by the ‘count’ column in our dataset.
- The bars must have the same fixed thickness, and there must be some space between the bars, so they do not overlap.
- The plot must have visible X and Y axes that scale according to the generated bars. That is, the axes are driven by the data that they are representing. They must not be hard-coded. The x-axis must be a [g] element having the id: “x_axis” and the y-axis must be a [g] element having the id: “y_axis”.
- Set x-axis label to ‘Count’ and y-axis label to ‘Species’. The x-axis label must be a [text] element having the id: “x_axis_label” and the y-axis label must be a [text] element having the id: “y_axis_label”.
- Use a linear scale for the X-axis to represent the count (recommended function: d3.scaleLinear()). Only display ticks and labels at every 500 interval. The X-axis must be displayed below the plot.
- Use a categorical scale for the Y-axis to represent the species names (recommended function: d3.scaleBand()). Order the species names from greatest to least on ‘Count’ and limit the output to the top 5 species. The Y-axis must be displayed to the left of the plot.
- Set the HTML title tag and display a title for the plot. Those two titles are independent of each other and need to be set separately. Set the HTML title tag (i.e., [title] Wildlife Trafficking Incidents per Species (2015 to 2023)[/title]). Position the title “Wildlife Trafficking Incidents per Species (2015 to 2023)” above the bar plot. The title must be a [text] element having the id: “title”.
- Add username TTX598 to the area beneath the bottom-right of the plot. The username must be a [text] element having the id: “credit”
- Fill each bar with a unique color.