CS177 Project Part 2: Data Collection and Distribution Selection
In this part of the project, I want you to do the following:
- Collect some empirical data about passenger loading and unloading
times from "the real world"
- Find a suitable theoretical distribution family for representing
this data and choose its parameters via the MLE method
- Test your theoretical distribution against the empirical data for
"goodness of fit"
1. Data collection
Choose one or more locations in Riverside that are popular locations
for passenger pickup/dropoff, such as:
- The curb in front of the Material Science & Engineering
building along Aberdeen Drive
- The bus stop area in front of Sproul Hall along West Campus Drive
- The main entrance to the Mission Inn Hotel in downtown Riverside
along Mission Inn Avenue
- The main entrance to the Riverside Medical Clinic building at
7117 Brockton Avenue
- Some other location(s) of your own choosing. I recommend places
like mall entrances, health care facilities, restaurants, movie
theaters, etc. The Metrolink station in downtown Riverside would be a
great location IF you pay attention to the train schedule....
Plan to spend at least one hour doing data collection, although it
doesn't need to be a single block of time. It is important to do it
when there is a significant amount of activity at the target
location. NOTE: you can share raw data with your classmates, so
it is a good idea to coordinate your schedules so different people
aren't taking the same measurements at the same time. (Why bother?)
For each vehicle, you will want to record information such as the
following:
- its arrival time at the target location
- whether its purpose was to pick up or drop off passengers
- the number of passengers who entered/left the vehicle
- whether the driver was one of the passengers (sometimes the
driver gets out of the car and walks onto campus, while a passenger
shifts into the driver's seat and drives the car away)
- whether there were any packages or luggage involved, or just
people
- whether any of the people was elderly or disabled (requiring
assistance from others, needed crutches, a cane, walker or wheelchair,
etc)
- its departure time from the target location
- figure out something sensible to do about vehicles waiting at the
curb for a long time until their passengers arrive, rather than
swooping in to pick up a passenger already waiting for them at the
rendezvous site
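One convenient way to keep these fields consistent is a small tabular record per vehicle. The field names below are my own illustration, not a required format; the key derived quantity is the dwell (service) time, departure minus arrival.

```python
from datetime import datetime

# Hypothetical per-vehicle record layout; these field names are an
# illustration, not a format required by the assignment.
FIELDS = ["arrival", "departure", "purpose", "n_passengers",
          "driver_is_passenger", "luggage", "needs_assistance"]

def dwell_seconds(record):
    """Dwell (service) time at the curb: departure minus arrival."""
    fmt = "%H:%M:%S"
    t0 = datetime.strptime(record["arrival"], fmt)
    t1 = datetime.strptime(record["departure"], fmt)
    return (t1 - t0).total_seconds()

example = {"arrival": "10:02:15", "departure": "10:03:40",
           "purpose": "pickup", "n_passengers": 2,
           "driver_is_passenger": False, "luggage": True,
           "needs_assistance": False}
print(dwell_seconds(example))  # 85.0
```

Recording times to the second is plenty of resolution for curbside dwell times.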
Ideally, you would like hundreds of data samples, so that when you
split up the data into categories (pickup vs. dropoff, how many
passengers involved, etc) you still have plenty of data for each
specific category. In particular, remember that some of the
"goodness of fit" tests do NOT allow you to use the same set of
empirical data for MLE parameter fitting and testing, so you really
need twice as much as you think you do.
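Because those tests require the testing data to be disjoint from the fitting data, one simple sketch of a random half/half split (plain Python, assuming each observation is a single dwell time):

```python
import random

def split_half(samples, seed=0):
    """Shuffle the observations and split them into a fitting half
    (used for MLE) and a held-out testing half (used for the
    goodness-of-fit test)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

fit_half, test_half = split_half(range(100))
print(len(fit_half), len(test_half))  # 50 50
```

Shuffling before splitting matters: dwell times recorded back-to-back may be correlated, so a first-half/second-half split would not give two comparable samples.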
2. Choosing a theoretical distribution
Apply the techniques we talked about in lecture to pick a suitable
theoretical distribution family, then use Maximum Likelihood to
optimize the parameters of the distribution. Since you are allowed to
share data with other students, it is instructive to see whether the
same theoretical distribution fits equally well for data collected by
different individuals, or obtained from different locations.
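As a sketch of the MLE step, scipy's distribution objects perform maximum-likelihood fitting via `.fit()`. The gamma family and the synthetic sample below are placeholder assumptions; substitute whatever family the lecture techniques point you to, and your own empirical dwell times.

```python
import numpy as np
from scipy import stats

# Placeholder dwell times in seconds -- substitute your empirical data.
rng = np.random.default_rng(1)
dwell = rng.gamma(shape=2.0, scale=30.0, size=200)

# scipy's .fit() maximizes the likelihood over the free parameters;
# fixing loc=0 (floc=0) is natural for a nonnegative service time.
shape, loc, scale = stats.gamma.fit(dwell, floc=0)
print(f"gamma MLE: shape={shape:.2f}, scale={scale:.2f}")
```

With a couple hundred samples the recovered parameters should land close to the ones used to generate the placeholder data.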
3. Test for "goodness of fit" using (at least) Chi-Squared and K-S
test
How many distributions do you need to model the data? Does one size
fit all, or do you need different parameters for each location? What
about the type of passengers, the presence of luggage, etc.? Is this
something that can be modeled as an independent random selection, or
is there evidence that these parameters are correlated (e.g., a car
with many passengers is likely to be followed by another car with
many passengers)?
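Both tests are available in scipy. The sketch below assumes a gamma fit and synthetic placeholder halves (substitute your own fitting/testing split); the chi-squared test uses equal-probability bins under the fitted model, and its `ddof=2` subtracts one degree of freedom for each MLE-estimated parameter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
fit_half = rng.gamma(2.0, 30.0, size=150)   # placeholder fitting data
test_half = rng.gamma(2.0, 30.0, size=150)  # placeholder held-out data

# MLE on the fitting half only
shape, loc, scale = stats.gamma.fit(fit_half, floc=0)
fitted = stats.gamma(shape, loc=loc, scale=scale)

# K-S test: compare the held-out empirical CDF against the fitted CDF
ks_stat, ks_p = stats.kstest(test_half, fitted.cdf)

# Chi-squared test: k equal-probability bins under the fitted model
k = 10
edges = fitted.ppf(np.linspace(0.0, 1.0, k + 1))  # edges run from 0 to inf
observed = np.array([np.sum((test_half >= lo) & (test_half < hi))
                     for lo, hi in zip(edges[:-1], edges[1:])])
expected = np.full(k, len(test_half) / k)
# ddof=2 accounts for the two parameters (shape, scale) estimated by MLE
chi2_stat, chi2_p = stats.chisquare(observed, expected, ddof=2)

print(f"K-S: D={ks_stat:.3f} p={ks_p:.3f}; chi-squared p={chi2_p:.3f}")
```

Note that strictly speaking the K-S critical values assume the parameters were not estimated from the data being tested, which is exactly why testing on a held-out half is the safer procedure.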
4. Final thoughts
What do you think your data tells you about how to handle a much
larger scale pickup/dropoff point? Instead of one car at a time,
suppose you have a location (such as an airport terminal) where a
dozen cars are trying to pick up and drop off passengers
simultaneously. Which
features from your data do you think will be preserved in this larger
system? Which ones are going to change dramatically?