CSCI1300: Working with real data



In this assignment you will have the opportunity to apply what you’ve learned this semester about programming to an actual problem and actual data. For this assignment we will use social media data collected during the 2014 Carlton Complex Wildfire in Eastern Washington State. This data set was part of my dissertation research on the integration of public social media communication into emergency response.
The development of information and communication technologies (ICTs) has changed how members of the public communicate and share information with each other during crisis and disaster events. Researchers in the field of crisis informatics look at social media communications for insight into how these technologies are reshaping the information space surrounding a disaster and providing new ways for the public to participate in both the sharing of information and the response. My research focuses on the challenges faced by emergency responders as they work to leverage these channels as part of their emergency communications plan, and on the solutions being developed to support the monitoring of an often complex and unwieldy information space as events unfold.
I work with an innovative group of emergency responders who are part of the social media in emergency management community (SMEM) that have pioneered a new form of digital volunteerism within the emergency response community called a Virtual Operational Support Team (VOST). Members of VOST teams have a mix of social media communication skills and training in public information work and emergency response protocols. During a disaster, a VOST team extends the resources of the emergency response team on the ground coordinating public social media communications and gathering relevant situational awareness information for the incident management team.
This dataset was taken from the 2014 Carlton Complex Wildfire. The fire started on July 14th, when a lightning storm moved through the Methow Valley in Eastern Washington State. On July 17th, adverse weather conditions caused the fire to grow explosively overnight from approximately forty-nine thousand acres to over a hundred and fifty thousand acres. This rate of fire growth was nearly unprecedented, and the fire burned through the towns of Pateros and Twisp, resulting in large-scale evacuations and the destruction of over 300 homes. The fire also destroyed critical infrastructure, resulting in widespread power and cellular outages in many places for over a week. The data set for this fire starts on July 17th, when Portland NIMO, a federal Type I team, was assigned to the fire and the NIMO VOST was activated, and runs until July 27th, when the team stood down. The fire ultimately grew to 256 thousand acres, making it the largest wildfire in Washington State history at that time (it was eclipsed by the 2015 Okanogan Complex in the same area a year later).
As a researcher on CU’s Project EPIC, my role on the VOST was to provide analytical support to the public information team on the ground using data collected through the Twitter API. I developed Python scripts that expanded the links to embedded content and massaged the data in useful ways for analysis in Tableau, a data visualization tool. At the end of each day, I worked on a comprehensive summary that was forwarded to the public information team as a reference for the morning briefing the following day.
Twitter is a particularly interesting platform for analysis during a disaster because the Twitter stream can show you what is relevant in the moment across a wide variety of sources. The ability to retweet information reinforces its currency and acts as a recommendation to others who are in a Twitterer’s network or who are following the conversation. In addition, the ability to embed links and media provides visibility into what is being shared across multiple social media platforms simultaneously.

Data Set Description

The full dataset for this fire contains over 24 thousand tweets and related information. I have created multiple data extract files from this dataset so that you can work with information on a more manageable scale.
As part of the analysis, we coded the most commonly occurring sources of information (Twitter accounts and URL domains) using the following values:

Source Category: Description
Business/org: Business or organization (e.g. Sierra Club).
Business/org - local: Local business or organization (e.g. Methow Chamber of Commerce).
EM/Fire Tweeter: Any account used primarily to share information about fire/disaster but not an official account (e.g. personal account for fire personnel, member of SMEM community).
Fundraising: Account used primarily for fundraising purposes (e.g. GoFundMe). This is important for the public information team because they are on the lookout for potential fraud.
Individual: Account belonging to an individual who is either not from the geographic region or whose location we can’t determine.
Individual - local: Individual local to the fire.
Individual - pnw: Individual from the Pacific Northwest Region but no evidence that they are local.
Individual - WA: Resident of Washington State but no evidence that they are local to the fire.
Media: Official account for a media source or personal account for media personnel.
News Tweeter: Accounts that mass tweet links to trending news topics. It is often a fine line between these types of accounts and spam. These have been excluded from the extracts for the project.
Official - Civic: Official agency not related to emergency response organizations. Typically official city government agencies and personal accounts for civic figures (e.g. the Mayor).
Official EM/Fire: Official accounts used to share public information surrounding the disaster.
Official Other: Official organizations that don’t fall into response or civic organizations.
Social Media: Social media sources (e.g. YouTube, Instagram).
Spam/Other: Sources resulting either from noise related to search terms or from accounts that are hijacking trending hashtags.
VOST: Personal accounts for individual VOST team members.
Unknown: A tweet that has not yet been coded.

Tweet extracts

Each row in the tweet extracts is an individual tweet and contains the following columns:

Column Name: Description
Row: Row identifier in data set.
Text: The content of the tweet (max length 120 characters).
Original Tweet: Link to the original tweet.
Local Date: The date translated to local time zone (format month/day).
Local Hour: Hour translated to local time zone.
Local Minute: Minute translated to local time zone.
Is Retweet: Boolean value, true if tweet is a retweet, false otherwise.
Retweet Count: Number of times tweet was retweeted at end of data collection.
Screen Name: User screen name on Twitter.
User Class: Indicates Twitter account type (see table above).
User ID: Unique ID for user on Twitter.
User Link: Link to user account on Twitter.
Coordinates: Latitude/longitude values for geocoded tweets.
URL: Fully expanded link for embedded content. If you see a hyperlink in a tweet then this is the link to it.
URL Domain: The URL domain.
URL Domain Class: Source classification for domain (e.g. media, official em/fire).
Media Screen Name: Twitter source for embedded content.
Media URL: Link to embedded content. If you see a photo/video in the tweet then this is the link to it.
Media URL User Class: User class for source.

Individual Tweet Extracts Include:

  • allTweets.csv : All tweets in the collection
  • geocodedTweets.csv : All tweets in the collection that were geocoded
  • individualLocalTweets.csv : Sources most likely to contain individual and local info
  • offlEMandFireTweets.csv : Tweets coming from official sources and EM/Fire Tweeter accounts
  • noRetweets.csv : All original tweets, no retweets

NOTE: Spam/Other and News Tweeter sources have been filtered out of these extracts.

Other data files:

  • twitterers.csv : All Twitter accounts (user class, user ID, User Link & Records)
  • domains.csv : All domains (URL domain, URL domain class & Records)
  • URLs.csv : All expanded URLs (URL, URL domain, URL domain class, & Records)
  • socialMediaURLs.csv : All social media links (URL, URL domain & Records)
  • mentions.csv : All mentions (Mention, Mention User Class, & Records)
  • offlMentions.csv : Mentions of official accounts by User Class

What Your Program Needs to Do

In this project, your program needs to extract interesting information from the data and display it for the user. Some ideas for interesting information include:

  • Create a bounding box and compute what percentage of geocoded tweets fall within this area (you can create multiple bounding boxes, e.g. 50 miles, 100 miles, etc.). A bounding box is a rectangular area defined by a north and south latitude and an east and west longitude. The coordinates for the center of the fire are (48.211 latitude, -120.103 longitude). I will provide you with the code to calculate the east, west, north, and south boundaries for a bounding box. Unless, of course, you want to take this on yourself, and then we will applaud your efforts!
  • Look at links to social media to see what platforms were the most popular for sharing information (e.g. YouTube, Facebook, Instagram, etc.). What were the most popular posts?
  • What sorts of media do people tend to embed in their tweets and what are the most popular sources of information?
  • Who tweeted the most (top ten vs. top per user class) and what class of account are they?
  • What sources were mentioned and retweeted the most during the fire?
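
For the bounding box idea above, a rough sketch of one possible approach follows. It is not the code that will be provided for the assignment. It assumes that one degree of latitude is about 69 miles and that one degree of longitude shrinks by the cosine of the latitude, which is only an approximation. The names BoundingBox, makeBoundingBox, isInside, and percentInside are hypothetical.

```cpp
#include <cmath>

// Hypothetical helper type: the four edges of a bounding box, in degrees.
struct BoundingBox {
    double north;
    double south;
    double east;
    double west;
};

// Assumption of this sketch: one degree of latitude is roughly 69 miles, and
// one degree of longitude is roughly 69 * cos(latitude) miles.
const double MILES_PER_DEGREE = 69.0;
const double PI = 3.14159265358979323846;

// Build a box whose edges are about `miles` away from the center point.
BoundingBox makeBoundingBox(double centerLat, double centerLon, double miles) {
    double latDelta = miles / MILES_PER_DEGREE;
    double lonDelta = miles / (MILES_PER_DEGREE * std::cos(centerLat * PI / 180.0));
    BoundingBox box;
    box.north = centerLat + latDelta;
    box.south = centerLat - latDelta;
    box.east = centerLon + lonDelta;
    box.west = centerLon - lonDelta;
    return box;
}

// True if the point lies on or inside the edges of the box.
bool isInside(const BoundingBox& box, double lat, double lon) {
    return lat <= box.north && lat >= box.south &&
           lon <= box.east && lon >= box.west;
}

// Percentage of the first `count` points that fall inside the box.
double percentInside(const BoundingBox& box, const double lats[],
                     const double lons[], int count) {
    if (count <= 0) {
        return 0.0;  // avoid dividing by zero when there are no points
    }
    int inside = 0;
    for (int i = 0; i < count; i++) {
        if (isInside(box, lats[i], lons[i])) {
            inside++;
        }
    }
    return 100.0 * inside / count;
}
```

A call such as makeBoundingBox(48.211, -120.103, 50.0) would give the box for the 50 mile case, and percentInside could then be run over the coordinates read from geocodedTweets.csv.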

Start with a description of what your program does

There is no COG for this assignment; the TAs will be grading everyone’s project by hand. Your TA needs to know what your program does when they run it. The first thing your program needs to do is print a welcome message to the user that concisely explains the program’s functionality. For example, your program might print something like:

Welcome. This program calculates the percentage of geocoded tweets that
fall within a specified distance from the 2014 Carlton Complex Wildfire.

Get user input from at least one menu

There should be a menu in your introduction that asks for input from the user. You are welcome to add additional menus if you need additional input from the user. For example, after displaying the welcome message you could display a menu to ask the user to specify the distance:

Enter the distance:
1) Within 50 miles
2) Within 100 miles
3) Within 200 miles
4) Within 500 miles
5) Within 1000 miles

Present results and ask for another query

Using the input from the user, display the results in a neatly formatted message such as:

The percentage of geocoded tweets that fell within 50 miles: 52%

After displaying the message, your program needs to ask the user if they would like to perform any more calculations. If the user says Yes, you should display the first menu again. If the user says No, you should display an exit message and exit the program. The details of the exit message are described below.

A final message

After the user selects No, and you exit your loop, you need to print another message to the user. In this message, briefly explain the easiest, hardest, and most and least enjoyable portions of this project. Then, exit the program.

Implementation Details/ Technical Requirements

Store data from the files in an object

Your program needs to have at least one class. A technical requirement of this project is that you create a class to support the functionality of the program. The class(es) you create will depend on the problem and data you are working with. For instance, if you are working with individual tweets you may need a Tweet class. If you are working with geocoded tweets you may also want a Geocode class that stores the latitude/longitude data.
The first thing your program needs to do, even before displaying the welcome message, is read in the data from the data files. Data should be read in from the files and stored in the appropriate variables in your class to support what your program does. You should structure your program to read in all data only one time.
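
As an illustration only, a Tweet class and a Geocode class along these lines might look like the sketch below. The choice of member variables is an assumption based on a few of the extract columns; your own classes should hold whatever your program actually needs.

```cpp
#include <string>

// Hypothetical Geocode class: stores one latitude/longitude pair.
class Geocode {
public:
    Geocode() : latitude(0.0), longitude(0.0) {}
    Geocode(double lat, double lon) : latitude(lat), longitude(lon) {}

    double getLatitude() const { return latitude; }
    void setLatitude(double lat) { latitude = lat; }
    double getLongitude() const { return longitude; }
    void setLongitude(double lon) { longitude = lon; }

private:
    double latitude;
    double longitude;
};

// Hypothetical Tweet class holding a few of the extract columns.
class Tweet {
public:
    Tweet() : retweetCount(0) {}

    std::string getScreenName() const { return screenName; }
    void setScreenName(const std::string& name) { screenName = name; }
    std::string getUserClass() const { return userClass; }
    void setUserClass(const std::string& cls) { userClass = cls; }
    int getRetweetCount() const { return retweetCount; }
    void setRetweetCount(int count) { retweetCount = count; }
    Geocode getLocation() const { return location; }
    void setLocation(const Geocode& place) { location = place; }

private:
    std::string screenName;
    std::string userClass;
    int retweetCount;
    Geocode location;
};
```

All member variables are private here, and other code reads or changes them only through the get and set methods.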

Other requirements:

  1. All variables in your class need to be private and accessed through public methods. For example, if one of the class variables is latitude then you will need getLatitude() and setLatitude() methods.
  2. You need at least three objects. For example if you create a class Tweet, then you need at least three instances of Tweet in your program.
  3. You are welcome to generate new data files to support your program’s functionality. For example, if you are working with the URL domains extract, you may want to limit your analysis to domains that occur at least 25 times. The data in these sub-extracts is sorted by count, so you can import the .csv file into Excel and delete the rows that fall below 25. You can also write a program or talk to us about the specific slice of data you are interested in.
  4. If you store data in an array, you can create an array that is larger than you need and leave some of it unused. Look at the arrays in the AppleFarmer class for an example of what you might do for this assignment. You will need to keep track of how much of the array is used. The technique for doing this is the same as using the CurrentDay variable in AppleFarmer.
  5. The easiest way to read the .csv files is to use getline() for each line in the file and then use stringstream to parse the line. There are examples of how to do both of these things in notes provided on the Moodle.
  6. When you submit your program, include all data files you used in your project directory.
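
Requirements 4 and 5 can be combined as in the following sketch, which reads rows with getline(), splits each row with stringstream, and keeps a count of how many array slots are in use. It assumes a file shaped like domains.csv with a header row, three columns in the order URL domain, URL domain class, Records, and no commas inside a field. Check the real files before relying on any of that. The names DomainRecord and loadDomains are hypothetical.

```cpp
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical class for one row of a domains-style extract.
class DomainRecord {
public:
    DomainRecord() : count(0) {}
    DomainRecord(const std::string& d, const std::string& c, int n)
        : domain(d), domainClass(c), count(n) {}

    std::string getDomain() const { return domain; }
    std::string getDomainClass() const { return domainClass; }
    int getCount() const { return count; }

private:
    std::string domain;
    std::string domainClass;
    int count;
};

// Read rows into a fixed-size array and return how many slots were filled.
// Assumes a header row and three comma-separated fields with no embedded commas.
int loadDomains(std::istream& in, DomainRecord records[], int capacity) {
    std::string line;
    std::getline(in, line);  // skip the header row (assumed to be present)
    int used = 0;
    while (used < capacity && std::getline(in, line)) {
        // Drop a trailing carriage return left by Windows-style line endings.
        if (!line.empty() && line[line.size() - 1] == '\r') {
            line.erase(line.size() - 1);
        }
        if (line.empty()) {
            continue;  // ignore blank lines
        }
        std::stringstream ss(line);
        std::string domain, domainClass, countText;
        std::getline(ss, domain, ',');
        std::getline(ss, domainClass, ',');
        std::getline(ss, countText, ',');
        records[used] = DomainRecord(domain, domainClass,
                                     std::atoi(countText.c_str()));
        used++;
    }
    return used;
}
```

To read from a real file, open a std::ifstream (from the fstream header) on the file name and pass it to loadDomains along with an array that is larger than the number of rows you expect. The returned count plays the same role as the used-size variable described in requirement 4.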