The data used for prediction is the New York City taxi data from January 2016 to June 2016, together with New York City weather data for the same period.
The taxi data records the pickup time down to the second. Because traffic conditions can depend on the day of the week and the hour of the day, I split the pickup time into weekday and hour and treated both as categorical features. I also found a New York City weather dataset that can be merged with the taxi data by date, so I kept the pickup date as well.
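A minimal sketch of this time decomposition with pandas is shown here. The column name `pickup_datetime`, the output column names, and the three sample timestamps are assumptions made for illustration, not values taken from the original data.

```python
import pandas as pd

# Hypothetical sample rows; `pickup_datetime` is an assumed column name.
trips = pd.DataFrame({
    "pickup_datetime": [
        "2016-01-01 00:15:30",
        "2016-03-14 17:45:02",
        "2016-06-30 08:05:59",
    ],
})


def add_time_features(df, time_col="pickup_datetime"):
    """Split a timestamp column into weekday, hour, and date columns."""
    out = df.copy()
    ts = pd.to_datetime(out[time_col])
    out["pickup_weekday"] = ts.dt.dayofweek   # Monday = 0 ... Sunday = 6
    out["pickup_hour"] = ts.dt.hour           # 0 ... 23
    out["pickup_date"] = ts.dt.normalize()    # midnight of the same day, kept as a merge key
    return out


trips = add_time_features(trips)
```

Keeping the date as a normalized timestamp rather than a string makes a later merge with a date-indexed table less error-prone.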
The taxi data also records the pickup and drop-off locations, so I first plotted the location data on a map.
Some points fall very far from New York City. To remove these outliers, I defined the area between latitude 38 and 42 and longitude -78 and -70 as the main area for trips from or to New York City and dropped everything outside it. After removing the geospatial outliers, I used k-means to cluster the points and kept the cluster labels as features.
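The filtering and clustering step could look roughly like the sketch below, which uses scikit-learn's `KMeans`. The bounding box comes from the text above. The column names, the synthetic coordinates, the number of clusters (5), and the choice to fit one model on pickup and drop-off points stacked together are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

PICKUP = ["pickup_latitude", "pickup_longitude"]      # assumed column names
DROPOFF = ["dropoff_latitude", "dropoff_longitude"]   # assumed column names


def in_bounding_box(lat, lon):
    """True where a point lies within latitude 38..42 and longitude -78..-70."""
    return lat.between(38, 42) & lon.between(-78, -70)


# Synthetic points scattered around Manhattan, purely for illustration.
rng = np.random.default_rng(0)
n = 200
trips = pd.DataFrame({
    "pickup_latitude": rng.normal(40.75, 0.03, n),
    "pickup_longitude": rng.normal(-73.98, 0.03, n),
    "dropoff_latitude": rng.normal(40.75, 0.03, n),
    "dropoff_longitude": rng.normal(-73.98, 0.03, n),
})
# Inject one obvious geospatial outlier (far outside the box).
trips.loc[0, PICKUP] = [51.5, -0.1]

# Keep only trips whose pickup AND drop-off are inside the box.
mask = (
    in_bounding_box(trips["pickup_latitude"], trips["pickup_longitude"])
    & in_bounding_box(trips["dropoff_latitude"], trips["dropoff_longitude"])
)
trips = trips[mask].copy()

# Fit on all points together so pickup and drop-off labels share one set of clusters.
coords = np.vstack([trips[PICKUP].to_numpy(), trips[DROPOFF].to_numpy()])
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(coords)

trips["pickup_cluster"] = kmeans.predict(trips[PICKUP].to_numpy())
trips["dropoff_cluster"] = kmeans.predict(trips[DROPOFF].to_numpy())
```

Sharing one fitted model between pickups and drop-offs means that label 3, say, refers to the same area in both columns. Fitting two separate models would also work but gives two unrelated label sets.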
Another feature derived from the geospatial data is the trip distance, which is a numerical feature.
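The text does not say how the distance was computed. One common option, sketched here as an assumption, is the haversine (great-circle) distance between the pickup and drop-off coordinates. The function name and the Earth radius constant are illustrative choices.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees.

    Works on scalars and on NumPy arrays or pandas Series of equal length.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# One degree of latitude is roughly 111 km.
one_degree = haversine_km(40.0, -74.0, 41.0, -74.0)
```

This is a straight-line distance, so it understates the distance actually driven along the street grid, but it is cheap to compute and usually tracks trip length well enough to serve as a feature.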
The data still contained outliers: trip distances over 300 km or under 0.2 km, and trip durations over 3 million seconds or under 60 seconds. After removing them with a quantile-based cutoff, the distributions of distance and duration look plausible for taxi trips.
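A quantile-based trim can be sketched as follows. The exact quantiles used are not stated above, so the 1st and 99th percentiles here are assumptions, as are the column name `trip_duration` and the sample values.

```python
import pandas as pd


def trim_by_quantile(df, column, lower=0.01, upper=0.99):
    """Keep rows whose value in `column` lies between the two quantiles (inclusive)."""
    lo, hi = df[column].quantile([lower, upper])
    return df[df[column].between(lo, hi)]


# Hypothetical durations in seconds: 1..99 plus one extreme value.
trips = pd.DataFrame({"trip_duration": list(range(1, 100)) + [3_000_000]})
trimmed = trim_by_quantile(trips, "trip_duration")
```

In this sample the trim drops the smallest value and the single extreme one and keeps everything in between. The same helper could be applied to the distance column in turn.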
Without the outliers, it is easier to find correlations between the other features and trip duration. I used a series of box plots to look for them. Vendor ID is not relevant, while store_and_fwd_flag shows some correlation. Weekday is relevant: trip durations on Friday, Saturday, and Sunday are shorter, and those on the other weekdays are longer. Pickup hour is also relevant.
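The numbers behind such a box plot (the quartiles of trip duration per category) can also be inspected directly. The sketch below is one way to do that with pandas. The column names and the sample durations are hypothetical and only illustrate the shape of the output.

```python
import pandas as pd

# Hypothetical durations in seconds for two weekdays (0 = Monday, 5 = Saturday).
trips = pd.DataFrame({
    "pickup_weekday": [0, 0, 0, 0, 5, 5, 5, 5],
    "trip_duration": [600, 800, 1000, 1200, 400, 500, 600, 700],
})


def duration_quartiles(df, feature, target="trip_duration"):
    """Return the 25%, 50%, and 75% quantiles of `target` for each value of `feature`."""
    return df.groupby(feature)[target].quantile([0.25, 0.5, 0.75]).unstack()


summary = duration_quartiles(trips, "pickup_weekday")
```

Each row of `summary` corresponds to one box in the plot, so a feature whose rows barely differ (as described for vendor ID) is unlikely to help the model.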
The weather data contains temperature, snow depth, and precipitation. To simplify it, I reduced these to two binary features: have snow and have rain. After adding the weather data, I still needed to check the correlation between these features and trip duration.
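One way to derive the two flags and attach them to the trips is sketched below. The column names, the made-up weather values, and the rule "any value above zero counts" are assumptions, since the thresholds actually used are not given in the text.

```python
import pandas as pd

# Hypothetical daily weather rows; values are made up for illustration.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-23", "2016-02-05", "2016-05-10"]),
    "precipitation": [2.3, 0.5, 0.0],
    "snow_depth": [6.0, 0.0, 0.0],
})

# Assumed rule: any positive amount sets the flag.
weather["have_rain"] = (weather["precipitation"] > 0).astype(int)
weather["have_snow"] = (weather["snow_depth"] > 0).astype(int)

# Hypothetical trips, one per weather day.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-01-23 09:10:00",
        "2016-02-05 18:30:00",
        "2016-05-10 07:45:00",
    ]),
})
trips["pickup_date"] = trips["pickup_datetime"].dt.normalize()

# A left merge keeps every trip even if a weather row is missing for its date.
merged = trips.merge(
    weather[["date", "have_rain", "have_snow"]],
    left_on="pickup_date",
    right_on="date",
    how="left",
).drop(columns="date")
```

With a left merge, trips on dates that have no weather record end up with missing flags, which makes gaps in the weather data easy to spot before modelling.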
The features used in the final prediction are pickup weekday, pickup hour, pickup cluster, drop-off cluster, have snow, and have rain. I tried linear regression and random forest regression, which gave R2 scores of 0.539 and 0.739 respectively, so I used random forest regression to predict on the test data.
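The comparison of the two models can be sketched with scikit-learn as follows. This runs on synthetic data, so its scores will not match the 0.539 and 0.739 reported above. The column names, the one-hot encoding, the train/test split, the hyperparameters, and the way the synthetic target is built are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for the six features named in the text.
X = pd.DataFrame({
    "pickup_weekday": rng.integers(0, 7, n),
    "pickup_hour": rng.integers(0, 24, n),
    "pickup_cluster": rng.integers(0, 5, n),
    "dropoff_cluster": rng.integers(0, 5, n),
    "have_snow": rng.integers(0, 2, n),
    "have_rain": rng.integers(0, 2, n),
})

# Synthetic target (seconds): longer between different clusters, in rush hours, and in snow.
y = (
    600
    + 300 * (X["pickup_cluster"] != X["dropoff_cluster"])
    + 200 * X["pickup_hour"].isin([8, 9, 17, 18])
    + 100 * X["have_snow"]
    + rng.normal(0, 50, n)
)

# One-hot encode the categorical features so the linear model does not treat them as ordered.
categorical = ["pickup_weekday", "pickup_hour", "pickup_cluster", "dropoff_cluster"]
X_encoded = pd.get_dummies(X, columns=categorical, dtype=int)

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.25, random_state=0
)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record its R2 score on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
```

Scoring both models on the same held-out split keeps the comparison fair. A random forest can pick up interactions, such as a pickup and drop-off cluster pair, that a plain linear model on one-hot features cannot represent.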