C++代写:CSE100 Six Degrees of Kevin Bacon

代写算法类作业,需要用到BFS和Union-Find两个算法。这个作业比较坑的地方是需要对算法进行优化,否则计算不出结果。

Overview

The assignment is sub-divided under the following headings:

  • Part 0: Survey and Consent Form
  • Part 1: Checkpoint: Pathfinder
    • You will write a program to play (and win) the generalized Kevin Bacon trivia game. Your program will take as input ANY two actors/actresses and find the shortest sequence of shared movies between them. Part of this program (unweighted shortest path) is due at the checkpoint deadline.
  • Part 2: Final Submission: Pathfinder (Complete) + Actor Connections — Deadline: Monday, November 28
    • In addition to completing the weighted shortest path version of Pathfinder, you will write a program that starts with one actor in the graph and adds new movies step-by-step to find the earliest movie year to connect the start actor to the destination actor.
  • Part 3: Final Submission: Additional Extension – Deadline: Monday, November 28
    • You will write a program that will compute some other interesting statistic or action on a different graph of your choosing. Particularly complex extensions will earn a star point

Abstract

You will explore the “Six Degrees of Kevin Bacon” game. Specifically, you will design and implement the data structures and algorithms necessary to explore degrees of separation between Hollywood actors that act in the same movies. The main focus of this assignment is to design and implement efficient graph-based data structures and use them to explore large graphs.

What is the “Six Degrees of Kevin Bacon” game?
“Six Degrees of Kevin Bacon” is a parlor game based on the “six degrees of separation” concept, which posits that any two people on Earth are six or fewer acquaintance-links apart. That idea eventually morphed into this parlor game wherein movie buffs challenge each other to find the shortest path between an arbitrary actor and prolific Hollywood actor Kevin Bacon. It rests on the assumption that any individual involved in the Hollywood, California, film industry can be linked through his or her film roles to Kevin Bacon within six steps. The game requires a group of players to try to connect any such individual to Kevin Bacon as quickly as possible and in as few links as possible. It can also be described as a trivia game based on the concept of the small world phenomenon.

Link coming up Soon!!

Part 1: Checkpoint: Pathfinder (Unweighted)

We have provided you a tab-separated file movie_casts.tsv that contains the majority of actors/actresses found in IMDb and the movies they have played in. Specifically, the file looks like this (“TAB” denotes a single tab character):

Actor/Actress<TAB>Movie<TAB>Year
50 CENT<TAB>BEEF<TAB>2003
50 CENT<TAB>BEFORE I SELF DESTRUCT<TAB>2009
50 CENT<TAB>THE MC: WHY WE DO IT<TAB>2005
50 CENT<TAB>CAUGHT IN THE CROSSFIRE<TAB>2010
50 CENT<TAB>THE FROZEN GROUND<TAB>2013
50 CENT<TAB>BEEF III<TAB>2005
50 CENT<TAB>LAST VEGAS<TAB>2013
50 CENT<TAB>GUN<TAB>2010
...

The first column contains the name of the actor/actress, the second column contains the name of a movie they played in, and the last column contains the year the movie was made. Each line defines a single actor→movie relationship in this manner (except for the first line, which is the header). You may assume that actor→movie relationships will be grouped by actor name, but do not assume they will be sorted.

Note that multiple movies made in different years can have the same name, so use movie year as well as title when checking if two are the same. Also note that some actors have a “(I)” appended to their name - so “Kevin Bacon” is really “Kevin Bacon (I)”. Make sure you DO NOT format the names of actors or movies beyond what is given in the tab-separated input file. In other words, each actor’s name should be taken exactly as the actor’s name appears in the movie_casts.tsv file: you do not have to (and should not) mess with it. During grading, the actor’s name in the test file will match the actor’s name in the movie_casts.tsv file.

In your graph, each actor/actress will define a single node. Two nodes (i.e., actors) will be connected by an undirected edge if the corresponding actors played in the same movie.

Multiple undirected edges can exist between the same two nodes (which would imply that the two actors played in multiple movies together). Once you load the movie_casts.tsv file, you should expect to find 11,794 actors or nodes, 14,252 movies, and 4,016,412 directed edges - note that if we implement our graph with directed edges, every undirected edge will be represented by two directed edges. You may NOT use any pre-built data structures, like the Boost Graph Library (BGL), besides what is provided in the C++ STL data structures.

For this part of PA4, you will write a program called pathfinder (in pathfinder.cpp) to find paths between actors. It will take 4 command-line arguments:

  1. The first argument is the name of a text file containing the movie casts in the same format as movie_casts.tsv. This file is quite large (6.4M), so you should create smaller versions to test your implementation as a first step.
  2. The second argument is a lower-case character equal to u or w - u means “build the graph with unweighted edges”, while w means “build the graph with weighted edges” (where weights are computed with the formula described later).
  3. The third argument is the name of a text file containing the pairs of actors for which you will find paths. The first line of the file is a header, and each row contains the names of two actors separated by a single tab character.
  4. The fourth argument is the name for your output text file, which will contain the shortest path between each pair of actors given in the input pairs file in argument 3. The first line of the output will be a header, and each row will contain a path for the corresponding pair of actors in the input pairs file (in the same order). Each path will be formatted as follows: ()–[#@]–>()–[#@]–>()….etc where the movie listed between each pair of actors is one where they both had a role. Note that any given pair of actors may have played in multiple movies together (like Matt Damon and Ben Affleck), and if we are interested in a shortest weighted path, you must display the particular movie between each pair of actors that yields the minimum total path weight when combined with all other movies in the path.
    Examples of the input pairs and output paths files are given in test_pairs.tsv and out_paths_unweighted.tsv, respectively. So calling your program with:
> ./pathfinder movie_casts.tsv u test_pairs.tsv out_paths_unweighted.tsv

where test_pairs.tsv contains:

Actor1/Actress1 Actor2/Actress2
BACON, KEVIN (I)<TAB>HOUNSOU, DJIMON
BACON, KEVIN (I)<TAB>KIDMAN, NICOLE
BACON, KEVIN (I)<TAB>WILLIS, BRUCE
BACON, KEVIN (I)<TAB>GIAMATTI, PAUL
HOUNSOU, DJIMON<TAB>50 CENT

should produce an output file out_paths_unweighted.tsv containing the following (although the particular movies may not match, the total path weights should match your output):

(actor)--[movie#@year]-->(actor)--...
(BACON, KEVIN (I))--[ELEPHANT WHITE#@2011]-->(HOUNSOU, DJIMON)
(BACON, KEVIN (I))--[SUPER#@2010]-->(MCKAY, COLE S.)--[FAR AND AWAY#@1992]-->(KIDMAN, NICOLE)
(BACON, KEVIN (I))--[SUPER#@2010]-->(MORENO, DARCEL WHITE)--[LAY THE FAVORITE#@2012]-->(WILLIS, BRUCE)
(BACON, KEVIN (I))--[A FEW GOOD MEN#@1992]-->(MOORE, DEMI)--[DECONSTRUCTING HARRY#@1997]-->(GIAMATTI, PAUL)
(HOUNSOU, DJIMON)--[IN AMERICA#@2002]-->(MARTINEZ, ADRIAN (I))--[MORNING GLORY#@2010]-->(50 CENT)

For the CHECKPOINT submission you are only required to have the unweighted portion of pathfinder working. This means we will test your program with all 4 arguments, except the second argument will always be a u. The weighted portion will be graded on the FINAL submission This is all that will be graded for the checkpoint submission.

The complete pathfinder program (as described below) will be graded at the final submission, so even if you don’t get the “unweighted edges” version of your program working for the checkpoint, you still need to get the whole thing working (weighted path find and unweighted) for the final submission.

Files Required:

  1. pathfinder.cpp
  2. Makefile (supporting at least make pathfinder)
  3. Any additional files needed to execute Part 1

Part 2: Final Submission: Pathfinder (Complete) + Actor Connections

In this part, the first thing you will do is complete pathfinder by implementing the “weighted edges” version of your program (which is needed for pathfinder for the Final Submission). In this version of your program, you can treat unweighted edges as weighted edges with a weight of 1 (i.e., a “dummy” weight), while truly weighted edges will have a weight equal to the age of the movie (because we will want to choose newer movies over older movies when connecting two actors). If we are defining an edge between two actors that played in a movie made in year Y, then the weight of that edge will be:

weight = 1 + (2015 - Y)

Note that we are using 2015 instead of 2016, which is because the dataset only contains movies released in 2015 and earlier. Don’t accidentally use 2016! Calling your program with:

> ./pathfinder movie_casts.tsv w test_pairs.tsv out_paths_weighted.tsv

should produce an output file out_paths_weighted.tsv containing the following (although the particular movies may not match, the total path weights should match your output):

(actor)--[movie#@year]-->(actor)--...
(BACON, KEVIN (I))--[ELEPHANT WHITE#@2011]-->(HOUNSOU, DJIMON)
(BACON, KEVIN (I))--[R.I.P.D.#@2013]-->(HUSS, TOBY (I))--[LITTLE BOY#@2015]-->(CHAPLIN, BEN)--[CINDERELLA#@2015]-->(MARTIN, BARRIE (II))--[PADDINGTON#@2014]-->(KIDMAN, NICOLE)
(BACON, KEVIN (I))--[R.I.P.D.#@2013]-->(BELTRAN, JONNY)--[THE WEDDING RINGER#@2015]-->(ROGERS, MIMI (I))--[CAPTIVE#@2015]-->(WILLIS, BRUCE)
(BACON, KEVIN (I))--[R.I.P.D.#@2013]-->(HOWARD, ROSEMARY (II))--[THE AMAZING SPIDER-MAN 2#@2014]-->(GIAMATTI, PAUL)
(HOUNSOU, DJIMON)--[THE VATICAN TAPES#@2015]-->(SCOTT, DOUGRAY)--[TAKEN 3#@2014]-->(HARVEY, DON (I))--[THE PRINCE#@2014]-->(50 CENT)

In this part of the assignment, you will implement a program called actorconnections. For a given movie database and a list of actor pairs, the actorconnections program should answer the following query for every actor pair (X, Y) in the given list: “After which year did actors X and Y become connected?”

By connected, we mean that there exists a path between actors X and Y in the equivalent movie graph (similar to that constructed in Part 1) with the exception that the movie graph under consideration only includes movies that were made until (i.e., before and including) a certain year.

actorconnections will take 4 command-line arguments:

  1. The first argument is the name of a text file containing the movie casts in the same format as movie_casts.tsv. Again, this file is quite large (6.4M), so you should create smaller versions to test your implementation as a first step.
  2. The second argument is the name of a text file containing the names of actor pairs on each line separated, with the two actor names are tab-separated (same format as test_pairs.tsv).
  3. The third argument is the name of your output text file, which should contain in each line an actor pair followed by the year (tab-separated) after which the corresponding actor pair became connected (you will do all actor pairs specified in the file from step 2, one on each line). If two actors are never connected or if one or both of the actors is not in the movie cast file given to you, simply append 9999 in the corresponding line of the output file. To further clarify, if the second argument was a file containing the actor pair “BLANCHETT, CATE” and “REEVES, KEANU” and they only became connected after adding a movie made in 1997, your program should output the actor pair and 1997 in their line of the output file. If they never became connected even after adding all the movies from in the movie cast file to your graph, you should output 9999 on that line.
  4. The fourth argument should be specified as either bfs or ufind. This option determines which algorithm will be used in your program. If the fourth argument is not given, by default your algorithm should use the union-find data structure (i.e., the equivalent of specifying ufind as the fourth argument). We will test your code with both flags.

Your output text file should look like the following:

Actor1<TAB>Actor2<TAB>Year
BLANCHETT, CATE<TAB>REEVES, KEANU<TAB>1997
KNAPP, DANIEL<TAB>WILLIS, BRUCE<TAB>9999
...

Calling your program with:

> ./refactorconnections movie_casts.tsv test_pairs.tsv out_connections_bfs.tsv ufind

should run your code (using the union-find algorithm) to produce an output file out_connections_bfs.tsv containing the following:

Actor1<TAB>Actor2<TAB>Year
BACON, KEVIN (I)<TAB>HOUNSOU, DJIMON<TAB>1992
BACON, KEVIN (I)<TAB>KIDMAN, NICOLE<TAB>1991
BACON, KEVIN (I)<TAB>WILLIS, BRUCE<TAB>1990
BACON, KEVIN (I)<TAB>GIAMATTI, PAUL<TAB>1992
HOUNSOU, DJIMON<TAB>50 CENT<TAB>2003

We would like you to implement the actorconnections program using both BFS and union-find and allow the user to select between the two by specifying the appropriate value for the fourth argument to your executable:

  1. BFS: To answer queries about the connection between actor pairs using BFS, we recommend you start with an empty graph containing only actor names and incrementally add movies in increasing order of the year of the movie. Every time you add a new set of movies made in a specific year, actors that were not connected before may become connected, which can be determined by running BFS on the updated graph.
  2. Union-Find: Alternatively, the disjoint-set (i.e., “union-find”) data structure allows you to keep track of all connected sets of actors without maintaining the corresponding graph structure. You might still consider adding movies incrementally, and if a movie creates a path between two actors that were not connected before, two disjoint sets would be merged into a single set in your union-find data structure. You should be able to then query your data structure about the connectivity of any specific actor pairs. The performance of your implementation will naturally depend on the efficiency of your Union-Find data structure. We will go over these topics in lecture as well.
    We will not test you on the corner case where both the actors are the same. You can handle it however you want.

Once you have completed and tested both implementations, compare the run time of each implementation on a file containing actor pair pairs that you generate yourself. The file should be in the same format as test_pairs.tsv. See how the run times compare when you repeat the same query multiple times. For example, in your input file (specified as the second argument), have the same actor pair appear 100 times and calculate the time it took to answer all 100 queries using your BFS implementation and then using your union-find implementation. To time your code, you can use the timing classes given to you in PA2. Analyze the timing results for the two implementations and write your analysis in the file Report.pdf.
In addition, answer the following questions in Report.pdf:

  1. Which implementation is better and by how much?
  2. When does the union-find data structure significantly outperform BFS (if at all)?
  3. What arguments can you provide to support your observations?

Files Required:

  1. pathfinder.cpp
  2. actorconnections.cpp
  3. extension.cpp
  4. Makefile (supporting at least make pathfinder, make actorconnections, and make extension)
  5. extension.txt
  6. Report.pdf
  7. Any additional files needed to execute Parts 1-3

Part 3: Final Submission -: Additional Extension

In Part 3, you should choose any additional extension in which you report some interesting graph properties or solve some interesting problem on a data set different from the kevin bacon movie data set that we provided. There are a number of extensions you can make to this assignment, and rather than choosing a data set and a problem for you, we’ve left it open for you to decide what you want to do. Do you want to pull in data from movie reviews to augment your weighting so you prefer more highly rated movies? Do you want to analyze friend circles from the facebook ego network and make interesting conclusions? Do you want to provide k friend recommendations to a user based on the number of mutual/common friends? Do you want to study on how tweets evolve over time?
Pretty much any cool non-trivial graph extension is fair game here. If you are concerned about whether or not your idea is “non-trivial” enough, ask on Piazza. It is important that you solve a problem that is considerably different from path-finder / actorConnections. You may not solve the same problem on a different graph.

These videos give you some leads and classes of questions that you might answer using graphs. This extension is really about your interests and your creativity. Feel free to come up with a completely new idea/ solve similar problems on other datasets/ tweak the ideas described in the videos. We really encourage you to keep looking and research. Google is your friend here.

You need to implement your idea in a program called extension (in a file called extension.cpp) and it is worth 3 points and potentially a star point. For full credit, you will need to provide a short write-up in the file extension.txt which includes the problem you solved, how you solved it, and how you tested it. Note that this file will be in plain text format. Include in the write-up how the grader should run your program. If this is not provided, you will receive a 0 for this part. You must also include a tsv/csv/data file for a smaller version of the graph for us to test the program. The data file(s) that you submit must be small and must contain no more than 100 vertices. Also provide a link to where we can download the entire data set in extension.txt. If you did some pre-processing to the data set, you must include that code as well in your submission and tell. You must clearly give all steps to get the data set and reproduce the results that you have provided. DO NOT PUSH THE ENTIRE DATA SET TO GIT.

REMEMBER TO ADD THE LINE !fileName TO THE .gitignore FILE. (where the fileName is the fileName of the smaller version of the graph on which we can test your code).

Since this is a challenging extension that requires a substantial amount of work, the extension may be deemed Star Point worthy. Again, we won’t say yes or no for any particular problem. The graders will deem it Star Point worthy based on the difficulty of the problem attempted, correctness and quality of write-up/testing.

Super challenging, complex extensions may even receive more than 1 star point.

Guidelines to Submit the Assignment

When you have completed the milestone you were working on (either Checkpoint or Final), you MUST submit it. Before submitting, please insure that any main methods you implement return 0 upon completion. You will do this by (1) pushing your changes to GITHUB (i.e. push the required files ONLY) and (2) Submitting on Vocareum, in that order.

For detailed instructions look at the ppt.

Format

We are giving very specific instructions on how to format your programs’ output because our auto-grader will parse your output in these formats, and any deviation from these exact formats will cause the autograder to take off points. There will be no special attention given to submissions that do not output results in the correct format. Although we will still give partial credit to the correctness of your results, if you do not follow the exact formatting described here, you are at risk of losing all the points for that portion of the assignment. NO EXCEPTIONS.