This assignment continues our dive into the world of real data sets and the interesting questions that can be asked and answered with data science.
To do this assignment, you will need to know the following concepts:
- File handling - the csv file has to be read
- HashMaps and ArrayLists - once the file is read, we have to store the data in a data structure that will make data retrieval easy
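As a minimal sketch of these two concepts working together, the following reads comma-separated text into an ArrayList of HashMaps, one map per row, keyed by column name. The class name CsvSketch, the method readRows, and the sample column names are illustrative assumptions rather than part of the supplied files. The sketch also assumes that no field contains a quoted comma.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CsvSketch {
    /**
     * Reads comma-separated rows. The first line is treated as the header,
     * and every later line becomes a map from column name to field value.
     * Assumption: fields never contain commas inside quotes.
     */
    static List<Map<String, String>> readRows(BufferedReader reader) {
        List<Map<String, String>> rows = new ArrayList<>();
        try {
            String headerLine = reader.readLine();
            if (headerLine == null) {
                return rows; // empty input
            }
            String[] header = headerLine.split(",", -1);
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                // The -1 limit keeps trailing empty fields instead of dropping them.
                String[] fields = line.split(",", -1);
                Map<String, String> row = new HashMap<>();
                for (int i = 0; i < header.length && i < fields.length; i++) {
                    row.put(header[i], fields[i]);
                }
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; for the real file, wrap a
        // java.io.FileReader for flights.csv in the BufferedReader instead.
        String sample = "UniqueCarrier,TailNum,Cancelled\nAA,N775AJ,0\nDL,,1\n";
        List<Map<String, String>> rows =
                readRows(new BufferedReader(new StringReader(sample)));
        System.out.println(rows.size());
        System.out.println(rows.get(0).get("TailNum"));
    }
}
```

Keying each row by column name costs some memory compared with a dedicated class per flight, but it keeps the reading code independent of which columns a given question needs.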
If you have ever taken a flight, you are likely to have been concerned about whether or not your flight is going to be delayed. The Bureau of Transportation Statistics provides publicly available data sets on every aircraft that has taken off in the United States. The csv that we provided has data from 2016.
We would like you to answer questions using this data. Are you frustrated with the airport that is closest to you? Do you think others couldn’t possibly have it worse than you? What is the likelihood that the flight you have to take to visit your family during Thanksgiving will get there on time? Here’s the assignment that will help answer these questions using your newly acquired programming skills.
Using the flights.csv file that we have supplied, we want you to write Java code to read the file and then answer the following questions. We have also supplied another small csv file that has the details on the cancellation codes.
Please do not create one giant main method to answer these. Remember to make your code DRY.
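One possible way to stay DRY, sketched here under assumptions: several of the questions amount to tallying rows by some column and then picking the key with the largest tally, so that logic can live in two small reusable methods instead of being repeated per question. The names TallySketch, countBy, and keyWithMax are hypothetical, and the row representation (a map from column name to value) is an assumption about how you chose to store the data.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TallySketch {
    /** Counts how many rows carry each non-empty value in the given column. */
    static Map<String, Integer> countBy(List<Map<String, String>> rows, String column) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, String> row : rows) {
            String key = row.get(column);
            if (key == null || key.isEmpty()) {
                continue; // ignore rows with no value in this column
            }
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Returns the key with the largest count, or null for an empty map.
     * Ties go to whichever key is iterated first; the assignment does not
     * specify a tie-breaking rule, so this is an arbitrary choice.
     */
    static String keyWithMax(Map<String, Integer> counts) {
        String best = null;
        int bestCount = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

With helpers like these, each question's method shrinks to a filter step followed by a call such as keyWithMax(countBy(rows, someColumn)).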
There is a specific manner in which we want you to answer these questions. Use the FormattedOutput.java class to write your answers to a file. The file must be named answers.txt, and it must be submitted along with the actual Java code.
- Which carrier has the highest percentage of cancelled flights? Output the 2-letter Carrier ID and the chance of a cancelled flight, as a percentage (Example: AA,1.22%).
- What’s the most common cause of cancellations? Output the one-letter code.
- Which plane (tail number) flew the furthest (most miles)? Output the complete tail number (Example: N775AJ).
- Which airport is the busiest by total number of flights in and out? Output the numeric OriginAirportID (Example: 12478).
- You need planes to put people on! Which airport is the biggest “source” of airplanes? Use the difference between arrivals and departures to compute this value. Output the OriginAirportID (Example: 12478).
- Which airport is the biggest “sink” of airplanes? Again, use the difference between arrivals and departures, outputting the OriginAirportID (Example: 12478).
- How many American Airlines (Unique Carrier ID ‘AA’) flights were delayed by 60 minutes or more? If a flight was delayed departing and arriving, only count that as 1. Output an integer.
- What was the largest delay that was made up (arrived early/on time)? Output the Day of Month (the number), departure delay (as a number), and the tail-number. Example: (10,30,N947JB).
- Come up with a question of your own and answer it!
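For the “source” and “sink” questions, a rough sketch of the bookkeeping may help. It follows the definition given in the clarifications at the end of this document: a source has the greatest (departures - arrivals) value, and a sink has the greatest (arrivals - departures) value. Representing each flight as an {origin, destination} pair of airport IDs is an assumption made for brevity, and the names NetFlowSketch, netDepartures, biggestSource, and biggestSink are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class NetFlowSketch {
    /**
     * Computes departures minus arrivals for every airport.
     * Each flight is an int pair: {originAirportId, destinationAirportId}.
     */
    static Map<Integer, Integer> netDepartures(int[][] flights) {
        Map<Integer, Integer> net = new HashMap<>();
        for (int[] flight : flights) {
            net.merge(flight[0], 1, Integer::sum);   // one more departure
            net.merge(flight[1], -1, Integer::sum);  // one more arrival
        }
        return net;
    }

    /** Airport with the greatest (departures - arrivals) value. */
    static int biggestSource(Map<Integer, Integer> net) {
        return extreme(net, 1);
    }

    /** Airport with the greatest (arrivals - departures) value. */
    static int biggestSink(Map<Integer, Integer> net) {
        return extreme(net, -1);
    }

    // sign = 1 finds the maximum net value, sign = -1 finds the minimum.
    private static int extreme(Map<Integer, Integer> net, int sign) {
        if (net.isEmpty()) {
            throw new IllegalArgumentException("no flights");
        }
        Integer bestAirport = null;
        long bestValue = Long.MIN_VALUE;
        for (Map.Entry<Integer, Integer> entry : net.entrySet()) {
            long value = (long) sign * entry.getValue();
            if (value > bestValue) {
                bestValue = value;
                bestAirport = entry.getKey();
            }
        }
        return bestAirport;
    }
}
```

Keeping a single net count per airport, rather than separate arrival and departure maps, means both questions can be answered from one pass over the data.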
We have provided a smaller data set flights_small.csv that is used for our validation tests.
Please use this data set and a corresponding unit test that we have written to validate that your code is working as we expect it to.
To validate your solution:
- Use your code to read flights_small.csv.
- Write your answers to answers.txt. Please use our FormattedOutput.java file for this.
- Run our ValidationTest.java file as a unit test and ensure it passes.
Submit all of your code. Make sure that you have run your code on the full input file called “flights.csv”.
Please also submit a formatted output file named answers.txt that has the answers to all eight questions.
To be extra vigilant about this, the autograder includes a unit test that deliberately fails and reminds you to run your code on the main flights.csv.
If, upon running the autograder, you see an error message that says “org.opentest4j.AssertionFailedError: Reminder to submit your code with the proper data set, flights.csv! This test will always fail, just as a reminder”, remember that this is just a reminder and does not indicate any failure in your code. We just want to be extra sure that you do not get a 0 simply because you submitted your answers using the smaller csv file.
- Clarification on “sink” and “source” of planes. There are several discussions in the forum about what to count as “sink” or “source”.
- A: “Source” means the greatest (departures - arrivals) value while “sink” means the greatest (arrivals - departures) value.
- How to round the response for question 1? The example for the question rounds the answer to two digits after the decimal point. The JUnit test, however, does not even perform rounding and only extracts up to one decimal place. (Or maybe it isn’t a rounding issue, but something else.)
- A: Yes, the correct answer is 1.2903225806451613%. Please do not round the value; output the full decimal value.
- How to deal with erroneous data? Some flights are flagged as cancelled, yet have ArrTime/DepTime. Some flights have a DepTime, but no ArrTime. Similarly, in flights_small.csv, some flights are not flagged as cancelled (Cancelled == 0), but they have a CancellationCode.
- A: Erroneous data like this is a common issue in data science. Flights that do not have complete information should be ignored altogether. Flights that were cancelled but still departed should be treated like any other cancellation.
- Some flights are marked as cancelled but seem to have taken off, or were diverted but did not land. How should we handle those?
- A: Unless a flight took off and landed where it was supposed to land, do not include it in calculations regarding delays.
- When NOT to include cancelled and diverted flights (not yet confirmed; waiting for the TA’s confirmation):
- 1) Do not include cancelled flights (cancelled == 1) for anything except questions #1 and #2.
- 2) Do not include diverted flights (diverted == 1) for questions 7 and 8, because those deal with delays and Arvind said “_Unless a flight took off and landed where it was supposed to land, do not include it in calculations regarding delays._”
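The filtering rules for the delay questions can be sketched as a single predicate that is applied before any counting. This is only an illustration of one reading of the notes above, which are themselves marked as unconfirmed. The Flight class, its field names, and the methods countsForDelays and countDelayed are hypothetical, and delays are assumed to be whole minutes.

```java
import java.util.List;

class DelayFilterSketch {
    /** Hypothetical holder for the few fields the delay questions need. */
    static final class Flight {
        final String carrier;
        final boolean cancelled;
        final boolean diverted;
        final int depDelay; // departure delay in minutes (assumed unit)
        final int arrDelay; // arrival delay in minutes (assumed unit)

        Flight(String carrier, boolean cancelled, boolean diverted,
               int depDelay, int arrDelay) {
            this.carrier = carrier;
            this.cancelled = cancelled;
            this.diverted = diverted;
            this.depDelay = depDelay;
            this.arrDelay = arrDelay;
        }
    }

    /** Only flights that took off and landed where planned count for delays. */
    static boolean countsForDelays(Flight flight) {
        return !flight.cancelled && !flight.diverted;
    }

    /**
     * Counts a carrier's flights delayed by at least the threshold on
     * departure or arrival. A flight delayed on both is counted once,
     * because the two conditions are joined with a logical OR.
     */
    static int countDelayed(List<Flight> flights, String carrier, int threshold) {
        int count = 0;
        for (Flight flight : flights) {
            if (!countsForDelays(flight) || !flight.carrier.equals(carrier)) {
                continue;
            }
            if (flight.depDelay >= threshold || flight.arrDelay >= threshold) {
                count++;
            }
        }
        return count;
    }
}
```

Putting the exclusion rule in one method means that if the TA’s confirmation changes it, only countsForDelays needs to be edited.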