Python代写:CSC108H Exploring the Twitterverse

探索Twitterverse信息,完成相关函数的开发。

Twitterverse

Goals of this Assignment

  • Practice working with structured files.
  • Practice building and using dictionaries.
  • Write unittests for a function.
  • Use top-down design to break a problem down into subtasks and implement helper functions to complete those tasks.

Note the last goal in the list – in this assignment, you will be doing significantly more design of functions than in the previous ones. We will give you a smaller number of functions to write, and you will write helper functions to solve the subtasks of each of those functions.

Introduction: The Twitterverse

This handout page explains the problem you will be solving, the data you will be working with, and the tasks you need to complete. Please read it carefully – you will need to read all of the information here to complete the assignment successfully.

Twitter is a social networking website where users can post very short messages known as “tweets”. Each Twitter user can choose to “follow” other users, which means that they see those users’ tweets. A Twitter user sees the tweets of users they are “following”, and their tweets are seen by their
“followers” (the users who follow them).

All the “follow” connections define a network among Twitter users, and it’s quite interesting to look for patterns in the connections. Tools like Twiangulate (http://twiangulate.com/search/) let you explore questions like “what connections do two users have in common?”. In this assignment, you’ll write functions that let you ask questions (or “queries”) about a Twitter dataset.

While it is possible to get data directly from Twitter, we have simplified things for you and provided the data already stored in a file.

Your Tasks

For this assignment, you are required to:

  1. Write the 6 required functions described later in this handout.
  2. Write helper functions as needed. All helper functions should have complete docstrings, but there are no other requirements for your helper functions.
  3. Write unittests for one of the required functions, all_followers.

This is your first experience designing a program of this size. You will have to do far more planning than in the previous two assignments. It is likely you will need to break most of the required functions down into subtasks and write helper functions for those subtasks.

The Twitter Data File

A Twitter data file contains a series of one or more user profiles, one after the other. Each user profile has the following elements, in this order:

  • A line containing a non-blank, non-empty username. You may assume that usernames are unique (i.e. a single username will not occur more than once in the file), and that usernames do not contain any whitespace. Usernames are case sensitive.
  • A line for the user’s actual name. If they did not provide a name, this line will be blank.
  • A line for the user’s location, or a blank line if they did not provide one.
  • A line for the URL of a website, or a blank line if they did not provide one.
  • Zero or more lines for the user’s bio, then a line with nothing but the keyword ENDBIO on it. This marks the end of the bio, and is not considered part of it. (You may assume that no bio has the string ENDBIO within it.) If the user did not provide a bio, the ENDBIO line will come immediately after the website line, with no blank line in between.
  • Zero or more lines each containing a single username of someone that this user is following, then a line with the keyword END on it. (You may assume that no one has END as their username.) A user cannot follow themselves. You may assume that every user that appears in this section of the file has a user profile in the Twitter data file.

A blank line is one that contains only whitespace characters. For this assignment, whitespace characters include spaces, tabs, and newlines.

Notice that the keywords ( ENDBIO and END ) act as separators in this file. All of their letters are capitalized, and the keywords contain no punctuation. You may assume that no keyword separators will appear as any part of the content of a user profile.

Examples

Here is a sample user profile that might occur among many in a file:

tomCruise
Tom Cruise
Los Angeles, CA http://www.tomcruise.com
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
ENDBIO
katieH
NicoleKidman
END

The file data.txt in the provided starter files is a small example of a complete Twitter data file (and was made by hand). The file rdata.txt is a much larger example (and is made from real data extracted from Twitter). These should help you confirm your understanding of the file format and will be useful in testing your program.

Cycles in the data

Although a user cannot follow themselves, there can be “cycles” such as this: user A can be following B who is following A. This is the shortest possible cycle. Of course, cycles can be longer.

The Query File

Note that the word “query” just means “question”. In computer science, we use it to specify a request for information.

For this assignment, a query will be provided in a file. Below we will review the high-level parts of the query, look at an example, and then describe the query file format.

Overview

A query has three components: a search specification, a filter specification, and a sorting specification.

The search specification describes how to generate a list of Twitter usernames, starting with an initial username (a list of length one) and then finding their followers or people they are following, then people that are those people’s followers or who they are following, and so on. When processing the search specification, don’t try to do anything to avoid cycles. For instance, if the search specification says to find the people who user A is following, and from there the people they are following, you could find yourself back at user A. Don’t try to avoid that.

After processing the search specification, we have a list of Twitter usernames. Its length could be zero. For example, if the initial username is ‘dianelynnhorton’ and the search specification contains a single ‘followers’ keyword, then the number of results of the search will be zero if ‘dianelynnhorton’ has no followers.

The filter specification describes how to filter the list of usernames produced by the search specification. The filtering can be based on

  • whether or not they are following a particular user,
  • whether or not a particular user is their follower,
  • whether or not their name contains a particular string (case-insensitive), or
  • whether or not their location contains a particular string (case-insensitive).

After processing the filter specification, we have a possibly reduced list of usernames.

Once the search results have been found and filtered, the sorting specification describes how the results should be sorted. Prior to the sorting step, the order of usernames does not matter.

Example query

Here is an example query:

SEARCH
tomCruise
following
following
following
FILTER
following c
location-includes a
SORT
popularity

The search specification in this particular query has four steps.

  1. Start with a list containing the username to start the search from; i.e. [‘tomCruise’]. Let’s call that list L1.
  2. The search keyword ‘following’ says to replace each username p in L1 with the usernames of the users who p is following. This yields a new list, L2.
  3. For the next ‘following’ keyword, we start with L2 and repeat the same operation as in the previous step, yielding another list, L3.
  4. For the final ‘following’ keyword, we start with L3 and repeat that operation one last time, yielding list L4.

Notice that each step yields a list of zero or more usernames that is the input to the next step. There should be no duplicates in the final results list. Duplicates should be removed after each step.

The Twitter data file diagram_data.txt in the provided files contains the follower/following relationships as represented by this diagram.

For those relationships, the search specification above would yield this list of usernames: [‘i’, ‘j’, ‘h’, ‘k’, ‘tomCruise’] . Make sure that you can see how the four lists, ending with this final one, are generated. Notice that the final list contains the users you can get to in three “steps” of the “following” relationship, starting from ‘tomCruise’ .

The final list generated by the search specification becomes the input to the filter specification. For our current example, the filter specification says that the list should be filtered in this way: a user should be kept only if they are following user ‘c’ and has a location that includes the string ‘a’ . Based on the query above, the final filtered list would be [‘tomCruise’] .

The sorting specification says to order the users according to their popularity.

Now let’s look in detail at the exact format and meaning of a query.

Overall format of a query

The format of the query data file is as described below, in this order:

  • A line containing the keyword SEARCH.
  • The search specification.
  • A line containing the keyword FILTER.
  • The filter specification.
  • A line containing the keyword SORT.
  • The sorting specification.

Notice that the keywords above all act as separators in this file and are in all capital letters. There are other keywords (such as following which can be part of a search specification) that are not separators; they are not capitalized.

Format and meaning of the search specification

The format of a search specification is as shown below, in this order:

  • A line containing a single username
  • Zero or more lines, in any order, each containing one of:
    • the keyword followers
    • the keyword following

Each step in the search specification involves creating a new list of users by going through each user in the current list and finding their followers or who they are following. Importantly, at each step, we want to create the new results list by replacing each user from the current list with the users that they are following or who are their followers.

The possible search operations for a given user are:

  • followers: this will get all usernames for users who are following the given user (i.e. that have the given username in the value list paired with the key ‘following’ in their data in the Twitterverse dictionary – more on this later).
  • following : this will get all usernames for users who the given user is following (i.e. the value list paired with the key ‘following’ in the data associated with the given user in the Twitterverse dictionary).

Note that, if the list of usernames is [‘A’] , and we perform a ‘followers’ or ‘following’ operation,
‘A’ will not be in the resulting list, because users cannot follow themselves. If ‘A’ is one of multiple usernames in the current list, then ‘A’ can appear in the results list, but only if one of the other usernames produces ‘A’ from the search operation.

The final list of results should not contain any duplicates. Duplicates should be removed after each operation is performed. The order of the final list of results from the search operation does not matter.

Format and meaning of the filter specification

The format of a filter specification is:

  • Zero or more lines, in any order, each containing one of the following:
    • the keyword name-includes, a space, and a string to match
    • the keyword location-includes, a space, and a string to match
    • the keyword bio-includes, a space, and a string to match
    • the keyword follower, a space, and a string which is a valid username
    • the keyword following, a space, and a string which is a valid username

Each of the lines has exactly one space within it (the space separating the keyword from the next string), and each keyword will appear at most once per filter specification.

The filters are “additive” in that each filter step should be applied to the list that resulted from the previous filter step. That is, you start with the list from the search specification, filter it based on the first filter step to get a new list L1, filter L1 in the second filter step to get L2, and so on.

The ‘following’ and ‘follower’ filters will add users in the new list if they are following a particular user, or if a particular user is their follower. These filters are defined as:

  • The ‘following’ filter adds only users who are following provided username.
  • The ‘follower’ filter adds only users who the provided username is following.

The filters that have ‘includes’ as part of the keyword will do a simple substring search on the string representing the relevant data for the user in the Twitterverse dictionary. This means that if the given string occurs anywhere in a user’s name (for a ‘location-includes’ filter) or a user’s location (for a ‘location-includes’ filter), then that user is to be kept in the list. The substring search should be case-insensitive. For example, if the filter specifies users whose locations include the string “USA”, then users with location “USA”, “Modesto, California, USA”, “kansas, usa” or “Musala, Bulgaria” would be kept, and users with location “United States of America” would be excluded. This is far from perfect, but don’t try to improve on it.

The order of the final list of results from the filter operation does not matter.

Format and meaning of the sorting specification

The output from your program must include every user whose username is still in the list after the search and filter specifications have been processed.

The sorting specification describes how the final list of results should be ordered. It consists of one line containing one of these keywords: ‘username’ , ‘name’ , ‘popularity’ .

The users in the final results list must be ordered as indicated by the keyword of the sorting specification. Output that is sorted by username or name should be in alphabetical order, so that strings starting with ‘a’ are at the beginning of the output. Although usernames are unique, names are not. If any users have the same name, sort them by username. When sorting users by popularity, the most popular users should come first. A user’s popularity is defined to be the number of followers they have. (Not the number of people they are following!) If any users are tied for popularity, sort them by username.

We can use Python’s built-in sort and/or string comparison for sorting by username. However, we’ll have to find another way to sort by username name or popularity. There is more information on sorting is later in the “Sorting your output” section.

Data Structures

To increase the readability of our code, we will use dictionaries to represent both the Twitter data and the query in our program. In this section, we’ll look at the format of the data and query dictionaries.

The Twitterverse Data Dictionary

The type of the Twitterverse dictionary is Dict[str, Dict[str, object]]. (We use the type object here to indicate that the values in the inner dictionary can have different types.) More specifically, each key in the Twitterverse dictionary is the username of a Twitter user, and the value associated with that key is a dictionary containing additional information about the user. Each of these inner dictionaries will contain the keys the ‘name’, ‘location’, ‘web’, ‘bio’ and ‘following’. The value associated with key is a list of zero or more strings, and the values associated with the rest of the keys are strings (and may be the empty string).

For example, if the following user information is included in our Twitter data file…

tomCruise
Tom Cruise
Los Angeles, CA
http://www.tomcruise.com
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
ENDBIO
katieH
NicoleKidman
END

… then this key/value pair will be in our Twitterverse dictionary.

'tomCruise': {'name': 'Tom Cruise',
      'bio': 'Official TomCruise.com crew tweets. We love you guys!\nVisit us at Facebook!',
      'location': 'Los Angeles, CA',
      'web': 'http://www.tomcruise.com',
      'following': ['katieH', 'NicoleKidman']}

Notice that the newlines at the end of each line are removed from the data stored in the Twitterverse dictionary, with the exception of the bio information, which needs to keep its inner newlines so that we could reconstruct the bio format the user chose. In general, you should remove leading and trailing whitespace from each line.

The usernames in the following list should be in the same order as in the data file.

And another example: if the following user information is included in our Twitter data file (note the blank lines for name, location, and web, and zero lines for bio and following)…

quietTweeter


ENDBIO
END

… then this key/value pair will be in our Twitterverse dictionary.

'quietTweeter': {'name': '',
         'bio': '',
    'location': '',
         'web': '',
   'following': []}

Note that dictionaries are not ordered, so don’t worry if your dictionary output is in a different order, as long as the items are the same.

The Query Dictionary

We will also use dictionaries to represent queries in our program. The query dictionary contains three items; the key ‘search’, whose value is a search specification dictionary, the key ‘filter’, whose value is a filter specification dictionary , and the key ‘sorting’, whose value is a sorting specification

The search specification dictionary contains the keys ‘username’ and ‘operations’, whose values are the string representing the username to start at, and a list of strings representing the operations to perform, respectively. Note that the list that is the value associated with the key ‘operations’ may be empty, if there is only one line between the SEARCH and FILTER keywords in the query file. If there is only one line between SEARCH and FILTER, then it represents the username to start at, and there are no operations to be performed in the search step. The operations must be performed in the order that they are in the list.

The filter specification is a dictionary with strings representing filter keywords as keys, and the strings to match as values. Note that this dictionary may be empty if there are no filters to apply, and that each filter keyword will appear at most once in the filter section of a query file.

The sorting specification is a single string value representing the order to sort the results in.

For example, if our query file looks like this…

SEARCH
tomCruise
following
following
following
FILTER
following
KatieH
location-includes USA
SORT
popularity

… then the query dictionary will have these key/value pairs.

{ 'search': {'username': 'tomCruise', 'operations': ['following', 'following', 'following']},
  'filter': {'following': 'KatieH', 'location-includes': 'USA'},
  'sorting': {'sort-by': popularity'}}

The main program

Once you have correctly implemented the functions in twitterverse_functions.py, execution of the main program (twitterverse_program.py) will:

  1. Read a data file and produce the Twitterverse data dictionary.
  2. Read a query file and produce the query dictionary.
  3. Compute the results of the search operations.
  4. Apply the filters to the search results.
  5. Sort the final results.

You may try running this file on the provided data and query files, as well as your own files that you have created for testing. Note that just running the your code. twitterverse_program.py is NOT sufficient to test

Required Testing ( unittest )

Write and submit a set of unittests for the all_followers function in the file test_all_followers.py . You should create Twitterverse dictionaries as inputs for your test cases inside the test_all_followers.py file. Do NOT call open in your test file, or anywhere else in your code.

Files to Download

All of the files included in the download for the assignment are listed in this section. These files must all be placed in the same folder.

Please download the Assignment 3 files and extract the zip archive. The following paragraphs explain the files you have been given.

  • Python Code
    • The starter code for this assignment is in the file twitterverse_functions.py. This is the primary file you will be editing.
    • The main program file is twitterverse_program.py. It is complete and must not be changed.
    • The starter code for the required unittests is in the file test_all_followers.py. You will edit and submit this file as well.
    • We are providing a checker as usual: a3_checker.py. It is complete and must not be changed.
  • Sample Twitter Data and Sample Queries
    • We have provided a number of text files containing Twitter data, and queries on that data. All files in the provided zip file that contain data in their filename are Twitter data files, and all files in the provided zip file that contain query in their filename are query files. Put these files in the same directory as your other A3 files.
    • The query files and the file query1.txt, query2.txt and query3.txt are queries on the data in data.txt, and the file query4.txt is a query on the data in rdata.txt.

We are providing a type-check module that can be used to test whether your functions in twitterverse_functions.py a3_checker.py have the correct parameter and return types. To use the type checker, place in the same folder (directory) as your twitterverse_functions.py and run it.

If the type-checks pass: the output will tell you that the typechecker passed (and what it means for the typechecker to pass!). If the typechecker passes, then the parameters and return types match the assignment specification for each of the functions.

If the type-checks fail: Look carefully at the message provided. One or more of your parameter or return types does not match the assignment specification. Fix your code and re-run the tests. Make sure the tests pass before submitting.

We will do a much more thorough job of testing your code once you hand it in, so be sure that you have thoroughly tested it yourself. With the functions in this assignment, there are many more possible cases to test (and cases where your code could go wrong). If you want to get a great mark on the correctness of your functions, do a great job of testing your functions under all possible conditions. Then we won’t be able to find any errors that you haven’t already fixed!