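The homework for Georgia Tech's Data and Visual Analytics course is just as hard as it was when the course ran last semester. The workload is huge: SQLite, Python, JavaScript, and third-party software are all involved.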
Part 1: Collecting and visualizing Twitter data
Q1
You will use the Twitter REST API to retrieve (1) followers, (2) followers of followers, (3) friends and (4) friends of friends of a user on Twitter (a Twitter friend is someone you follow and a Twitter follower is someone who follows you).
a. The Twitter REST API allows developers to retrieve data from Twitter. It uses the OAuth mechanism to authenticate developers who request access to data. Here’s how you can set up your own developer account to get started:
Check the developer agreement checkbox and click on ‘Create your Twitter application’. Once your request is approved, you can click ‘Keys and Access Tokens’ to view your ‘API key’ and ‘API secret’. You will also need to generate your access token by clicking the ‘Create my access token’ button. After this step, you are ready to make authenticated API calls to fetch data.
Important notes and hints:
- Twitter limits how fast you can make API calls. For example, the limit while making GET calls for friends is 15 requests per 15 minutes.
- Refer to the rate limits chart for different API calls.
- Set appropriate timeout intervals in the code while making requests.
- An API endpoint may return different results for the same request.
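One way to respect a limit such as 15 requests per 15 minutes is to count calls and sleep until the window resets. The following is a minimal sketch, not the boilerplate's method: `paced_calls` and its `fetch` argument are hypothetical names, and `fetch` stands in for whatever function actually issues the authenticated API request.

```python
import time


def paced_calls(items, fetch, max_calls=15, window_seconds=15 * 60,
                sleep=time.sleep, clock=time.monotonic):
    """Call fetch(item) for each item, making at most max_calls per window.

    `fetch` is a hypothetical stand-in for the real API request function.
    `sleep` and `clock` are injectable so the pacing can be tested quickly.
    """
    results = []
    window_start = clock()
    calls_in_window = 0
    for item in items:
        if calls_in_window == max_calls:
            # Budget used up: wait out the rest of the window, then reset.
            remaining = window_seconds - (clock() - window_start)
            if remaining > 0:
                sleep(remaining)
            window_start = clock()
            calls_in_window = 0
        results.append(fetch(item))
        calls_in_window += 1
    return results
```

The defaults mirror the friends limit quoted above; other endpoints may need different values, so check the rate limits chart before reusing them.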
b. Search for followers of the Twitter screen name “PoloChau”. Use the API to retrieve the first 10 followers. Further, for each of them, use the API to find their 10 followers.
- Read the documentation for getting followers of a Twitter user.
- Your code will write the results to followers.csv.
- Grading distribution is given in the boilerplate code.
Note: followerscreenname represents the source and username represents the target for an edge in a directed graph. You will be adding these column headers to the CSV file in a later question.
c. Search for friends of the Twitter screen name “PoloChau”. Use the API to retrieve the first 10 friends. Further, for each of the 10 friends, use the API to find their 10 friends.
- Read the documentation for getting friends of a Twitter user.
- Your code will write the results to friends.csv.
- Grading distribution is given in the boilerplate code.
Note: username represents the source and friendscreenname represents the target for an edge in a directed graph. You will be adding these column headers to the CSV file in a later question.
If a user has fewer than 10 followers or friends, the API will return as many as it can find. Your code should be flexible to work with whatever data the API endpoint returns.
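The two-level crawl for followers can be sketched as below. This is an illustration under assumptions rather than the boilerplate's structure: `fetch_followers` is a hypothetical callable that returns a list of screen names for a given user, and `collect_follower_edges` and `write_edges` are made-up helper names. Each edge is written as (follower, user), matching the source/target order described for followers.csv.

```python
import csv
import io


def collect_follower_edges(root, fetch_followers, limit=10):
    """Return (follower, user) edges for root's followers and their followers.

    `fetch_followers` is a hypothetical function: screen name -> list of
    follower screen names. It would wrap the real API call.
    """
    edges = []
    first_level = fetch_followers(root)[:limit]
    for follower in first_level:
        edges.append((follower, root))
    for user in first_level:
        # Slicing copes with users who have fewer than `limit` followers.
        for follower in fetch_followers(user)[:limit]:
            edges.append((follower, user))
    return edges


def write_edges(edges, handle):
    """Write one 'source,target' line per edge, without a header row."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerows(edges)
```

The friends crawl would have the same shape with the tuple order reversed, since there the user is the source and the friend is the target.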
Q2
Visualize the network of friends and followers obtained previously using Gephi.
Note: Make sure your system fulfils all requirements for running Gephi.
a. Go through the Gephi quickstart guide.
b. Insert Source, Target as the first line in both followers.csv and friends.csv. Each line in both files now represents a directed edge with the format source, target. Import all the edges contained in these files using Data Laboratory.
Note: Remember to check the “create missing nodes” option while importing since we don’t have an explicit nodes file.
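Adding the header line can be done by hand in a text editor, or with a few lines of code. Here is a small sketch; `add_header` is a hypothetical helper, and the exact header text `Source,Target` (without a space after the comma) is an assumption about what Gephi's importer expects.

```python
from pathlib import Path


def add_header(csv_path, header="Source,Target"):
    """Prepend a header line to a CSV file unless it is already there."""
    path = Path(csv_path)
    text = path.read_text(encoding="utf-8")
    if text.splitlines()[:1] == [header]:
        return  # Header already present; running twice must not duplicate it.
    path.write_text(header + "\n" + text, encoding="utf-8")
```

Calling `add_header("followers.csv")` and `add_header("friends.csv")` once each would prepare both files for import.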
c. Visualize the graph and submit a snapshot of a visually meaningful view of this graph.
Here are some general guidelines for a visually meaningful graph:
- Keep edge crossing to a minimum, and avoid as much node overlap as possible.
- Keep the graph compact and symmetric if possible.
- Whenever possible, show node labels. If showing all node labels creates too much visual complexity, try showing only those for the “important” nodes.
- Use colors, sizes, thicknesses, etc. to convey information.
- Use nodes’ spatial positions to convey information (e.g., “clusters” or groups).
Experiment with Gephi’s features, such as graph layouts, changing node size and color, edge thickness, etc. The objective of this task is to familiarize yourself with Gephi, so it is a fairly open-ended task.
d. Using Gephi’s built-in functions, compute and report the following metrics for your graph:
- Average node degree
- Diameter of the graph
- Average path length
Briefly explain the intuitive meaning of each metric in your own words. You will learn about these metrics in the “graphs” lectures.
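To build intuition for the three metrics, it can help to compute them by hand on a tiny graph. The sketch below treats the graph as undirected and connected, which is a simplifying assumption; Gephi's own numbers for a directed graph may be defined differently, so this is for understanding only and not a substitute for Gephi's output. The function names are made up for this example.

```python
from collections import deque


def bfs_distances(adjacency, start):
    """Shortest hop counts from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def graph_metrics(edges):
    """Return (average degree, diameter, average path length).

    Edges are treated as undirected; the graph needs at least one edge.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    # Average degree: mean number of neighbors per node.
    avg_degree = sum(len(nbrs) for nbrs in adjacency.values()) / len(adjacency)
    # Collect shortest-path lengths between all reachable pairs of nodes.
    lengths = []
    for node in adjacency:
        for other, d in bfs_distances(adjacency, node).items():
            if other != node:
                lengths.append(d)
    diameter = max(lengths)  # Longest of all the shortest paths.
    avg_path_length = sum(lengths) / len(lengths)
    return avg_degree, diameter, avg_path_length
```

For the chain a-b-c-d, `graph_metrics([("a", "b"), ("b", "c"), ("c", "d")])` gives an average degree of 1.5 and a diameter of 3 (the walk from a to d).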
Part 2: Using SQLite
The following questions help refresh your memory about SQL or get you started with SQLite, which is a lightweight, serverless embedded database that can easily handle up to multiple GBs of data. SQLite is great for building prototypes and sharing data (all data is stored in a single cross-platform file).
a. Import data: Create an SQLite database called rt.db.
Note: You can use SQLite’s built-in feature to import data from files (https://www.sqlite.org/cli.html#section_3 : .separator STRING and .import FILE TABLE)
b. Build indexes: Create two indexes that will speed up subsequent join operations:
An index called movies_primary_index in the movies table for the movie_id attribute
An index called movies_secondary_index in ratings table for the movie_id attribute
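The two indexes could be created as in the following sketch, run here through Python's `sqlite3` module against an in-memory database. The table layouts in `demo_connection` are assumptions for illustration (only `movie_id` in both tables comes from the text above); the real column lists come from the import step.

```python
import sqlite3


def demo_connection():
    """Create an in-memory database with assumed, simplified table layouts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE movies (movie_id INTEGER, movie_name TEXT, genre TEXT)"
    )
    conn.execute(
        "CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL)"
    )
    return conn


def build_indexes(conn):
    """Index the join column on both sides of the movies/ratings join."""
    conn.execute(
        "CREATE INDEX IF NOT EXISTS movies_primary_index ON movies (movie_id)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS movies_secondary_index ON ratings (movie_id)"
    )
```

The same two `CREATE INDEX` statements can be typed directly into the `sqlite3` command-line shell opened on rt.db.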
c. Find the total number of movies that are reviewed by at least 500 reviewers and with average ratings >= 3.5.
Output format:
movie_count
d. Finding most reviewed movies: List all the movies with at least 2500 reviews. Sort the movies by the review count (high to low) then by their names (alphabetical order) for those who may have the same review counts.
Output format:
movie_id, movie_name, review_count
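Questions c through f share one pattern: join movies to ratings, group by movie, filter the groups with HAVING, then sort on two keys. A minimal sketch of that pattern for the most-reviewed case follows, with the threshold passed as a parameter so it can be tried on toy data. The table layouts and the names `toy_database` and `most_reviewed` are assumptions, not part of the assignment.

```python
import sqlite3

MOST_REVIEWED_SQL = """
SELECT m.movie_id, m.movie_name, COUNT(*) AS review_count
FROM movies AS m
JOIN ratings AS r ON r.movie_id = m.movie_id
GROUP BY m.movie_id, m.movie_name
HAVING COUNT(*) >= ?
ORDER BY review_count DESC, m.movie_name ASC
"""


def toy_database():
    """Tiny in-memory stand-in for rt.db (assumed column layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movies (movie_id INTEGER, movie_name TEXT)")
    conn.execute(
        "CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL)"
    )
    conn.executemany(
        "INSERT INTO movies VALUES (?, ?)", [(1, "B"), (2, "A"), (3, "C")]
    )
    conn.executemany(
        "INSERT INTO ratings VALUES (?, ?, ?)",
        [(10, 1, 4.0), (11, 1, 3.0), (10, 2, 5.0), (12, 2, 2.0), (10, 3, 1.0)],
    )
    return conn


def most_reviewed(conn, min_reviews):
    """Movies with at least min_reviews reviews, most reviewed first."""
    return conn.execute(MOST_REVIEWED_SQL, (min_reviews,)).fetchall()
```

On the toy data, movies A and B are tied at two reviews each, so the second sort key decides and A is listed before B.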
e. Finding best films: Find the top 10 movies (highest average ratings). Sort the movies by their average ratings (high to low) then by their names (alphabetical order).
Output format:
movie_id, movie_name, avg_rating
f. Finding the best movies with the most reviews: Find the top 8 movies with the highest average ratings that are rated by at least 1000 users. Sort the results by the movies’ average rating (from high to low), then by the movies’ names (alphabetical order), and then genres (alphabetical order).
Output format:
movie_name, avg_rating, review_count, movie_genre
g. Creating views: Create a view (virtual table) called common_interests from the data, such that: for each movie with exactly 10 reviews, show its reviewers in pairs, for all unique reviewer combinations. User IDs should be ranked in ascending order, and within a pair, the first user ID should be strictly smaller than the second ID. For example, movie M has 10 reviews, rated by reviewers 1,2,3,4,5,6,7,8,9,10. You would show “(1, 2, M)”, “(1, 3, M)”, …, “(1, 10, M)”, “(2, 3, M)”, … , “(2, 10, M)”, etc. This example has 45 such pairs.
The view should have the format:
common_interests(user_id1, user_id2, movie_name)
Full points will only be awarded for queries that use joins.
Note: Remember that creating a view will produce no output, so you should test your view with a few simple select statements during development.
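One possible shape for such a view is a self-join of the ratings table, where the condition `r1.user_id < r2.user_id` keeps each pair once and in ascending order. The sketch below is an illustration on assumed table layouts, and it also assumes each user reviews a given movie at most once; `create_common_interests` is a hypothetical helper name.

```python
import sqlite3

VIEW_SQL = """
CREATE VIEW common_interests AS
SELECT r1.user_id AS user_id1, r2.user_id AS user_id2, m.movie_name
FROM ratings AS r1
JOIN ratings AS r2
  ON r1.movie_id = r2.movie_id AND r1.user_id < r2.user_id
JOIN movies AS m ON m.movie_id = r1.movie_id
WHERE r1.movie_id IN (
    SELECT movie_id FROM ratings GROUP BY movie_id HAVING COUNT(*) = 10
)
ORDER BY user_id1, user_id2
"""


def create_common_interests(conn):
    """Create the view; this produces no output by itself."""
    conn.execute(VIEW_SQL)


def toy_database():
    """Movie M has exactly 10 reviewers (1..10); movie N has only 3."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movies (movie_id INTEGER, movie_name TEXT)")
    conn.execute(
        "CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL)"
    )
    conn.executemany("INSERT INTO movies VALUES (?, ?)", [(1, "M"), (2, "N")])
    rows = [(user, 1, 3.0) for user in range(1, 11)]
    rows += [(user, 2, 4.0) for user in range(1, 4)]
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
    return conn
```

A quick check during development is `SELECT COUNT(*) FROM common_interests`, which on the toy data should give the 45 pairs from the example above, since movie N is excluded for having only 3 reviews.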
h. Calculate the total number of such pairs created from the view made in part g.
Output format:
common_interest_count
i. SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTS documentation).
Import the movie overview data from movieoverview.txt into a new FTS table (in rt.db) called movie_overview with the schema:
movie_overview(id integer, name text, year integer, overview text, popularity decimal)
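An FTS table is created as a virtual table and queried with MATCH instead of LIKE. The sketch below uses the FTS4 module, which is an assumption: the text does not say which FTS version to use, and the module must be compiled into your SQLite build. The helper names and sample rows are made up for illustration; the real rows come from movieoverview.txt.

```python
import sqlite3


def build_overview_table(conn, rows):
    """Create the FTS table with the given schema and load rows into it.

    FTS4 is assumed here; declared column types are accepted but ignored
    by the FTS module.
    """
    conn.execute(
        "CREATE VIRTUAL TABLE movie_overview USING fts4("
        "id integer, name text, year integer, overview text, popularity decimal)"
    )
    conn.executemany(
        "INSERT INTO movie_overview VALUES (?, ?, ?, ?, ?)", rows
    )


def search_overview(conn, term):
    """Names of movies whose overview contains the given word."""
    cursor = conn.execute(
        "SELECT name FROM movie_overview WHERE overview MATCH ? ORDER BY name",
        (term,),
    )
    return [name for (name,) in cursor]


SAMPLE_ROWS = [  # Hypothetical data for illustration only.
    (1, "Alpha", 2001, "A space adventure", 7.5),
    (2, "Beta", 2002, "A quiet drama", 5.0),
    (3, "Gamma", 2003, "Space pirates attack", 6.1),
]
```

With the sample rows, `search_overview(conn, "space")` finds both Alpha and Gamma because the default tokenizer ignores ASCII case. A query like `overview LIKE '%space%'` would return the same movies here, but it has to scan every row, while MATCH can look the word up in the full-text index.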
j. Explain your understanding of FTS performance in comparison with a SQL ‘like’ query and why FTS may perform better (hint: try SQLite’s EXPLAIN command). Write down your explanation in fewer than 50 words in “fts.txt”.
Part 3: D3 Warmup and Tutorial
- Go through the D3 tutorial here.
- Complete steps 01-09 (Complete through “09. The power of data()”).
- This is a simple and important tutorial which lays the groundwork for Homework 2.
Note: We recommend using Mozilla Firefox or Google Chrome, since they have relatively robust built-in developer tools.
Part 4: OpenRefine
a. Watch the videos on OpenRefine’s homepage for an overview of its features. Download OpenRefine (latest release: 2.6-rc.2).
b. Import Dataset:
- Launch OpenRefine. It opens in a browser (127.0.0.1:3333).
- Download the dataset
- Choose “Create Project” > This Computer > “menu.csv”. Click “Next”.
- You will now see a preview of the dataset. Click “Create Project” in the upper right corner.
c. Clean/Refine the data:
Note: OpenRefine maintains a log of all changes, so you can undo them. See the “Undo/Redo” button in the upper left corner.