As our submission for Data Digital Tools & Big Data II, we analyzed the New York taxi data.

We were given data on taxi trips in New York in 2016.
The data contains the following information:

  • the test data
    • id
    • passenger_count
    • pickup_longitude
    • pickup_latitude
    • dropoff_longitude
    • dropoff_latitude
    • trip_duration
    • pick_month
    • pick_day
    • pick_hour
    • drop_hour
    • trip_distance

From this data, I derived new columns:

  • new data columns
    • pick_month
    • pick_day
    • pick_hour
    • drop_hour
    • trip_distance
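
As a rough illustration of how columns like these could be derived, here is a minimal sketch using only the Python standard library. It assumes the raw data has a pickup timestamp in the form "YYYY-MM-DD HH:MM:SS" and that trip_distance is the great-circle (haversine) distance in kilometres between the pickup and dropoff coordinates. Neither detail is stated above, and the names `haversine_km` and `derive_columns` as well as the sample trip are hypothetical.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def derive_columns(pickup_time, trip_duration,
                   pickup_lon, pickup_lat, dropoff_lon, dropoff_lat):
    """Build the derived columns for one trip.

    pickup_time is assumed to look like '2016-03-14 17:24:55';
    trip_duration is in seconds.
    """
    pickup = datetime.strptime(pickup_time, "%Y-%m-%d %H:%M:%S")
    dropoff = pickup + timedelta(seconds=trip_duration)
    return {
        "pick_month": pickup.month,
        "pick_day": pickup.day,
        "pick_hour": pickup.hour,
        "drop_hour": dropoff.hour,
        "trip_distance": haversine_km(pickup_lon, pickup_lat,
                                      dropoff_lon, dropoff_lat),
    }


# Made-up example trip, not taken from the data set.
row = derive_columns("2016-03-14 17:24:55", 455,
                     -73.982, 40.768, -73.965, 40.766)
```

The haversine distance is a straight-line approximation, so it will usually underestimate the distance actually driven.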

I removed some columns and cleaned the data.

Reading the data

Data matrix: I compared the values in each column.
The distribution of passenger count.

The number of taxi pickups per day.

The count drops on the 31st, presumably because not every month has a 31st day.

The number of taxi pickups per hour.

People use taxis most between 18:00 and 21:00.
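
The per-hour counts behind a plot like this can be computed by tallying the pick_hour column. The sketch below uses a short made-up list of hours in place of the real column, so the numbers are illustrative only.

```python
from collections import Counter

# Hypothetical pick_hour values for a handful of trips (not real data).
pick_hours = [8, 18, 19, 19, 20, 18, 3, 19, 21, 19]

# Number of pickups for each hour of the day.
pickups_per_hour = Counter(pick_hours)

# Total pickups in the 18-21 window, inclusive.
evening_pickups = sum(pickups_per_hour[h] for h in range(18, 22))

# The single busiest hour and its pickup count.
busiest_hour, busiest_count = pickups_per_hour.most_common(1)[0]
```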

Linear regression

First, I tried simple linear regression.

I used “drop hour” as the input (X) and “trip duration” as the target (Y).
R2 = -1.18832232287591

I used “trip distance” as the input (X) and “trip duration” as the target (Y).
R2 = 0.02318791130349851
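
To make the R2 values easier to read, here is a small sketch of simple linear regression and the R2 score in plain Python. R2 is 1 minus the ratio of the model's squared error to the squared error of always predicting the mean, so it turns negative whenever the fitted line predicts worse than the mean. That usually happens when the score is computed on data the line was not fitted to. Whether the scores above were computed on a held-out set is an assumption, and the helper names and numbers below are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r2_score(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up training data lying exactly on y = 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(train_x, train_y)
r2_train = r2_score(train_y, [slope * x + intercept for x in train_x])

# Made-up held-out data that does not follow the fitted line.
test_x = [1.0, 2.0, 3.0]
test_y = [10.0, 9.0, 11.0]
r2_test = r2_score(test_y, [slope * x + intercept for x in test_x])
```

Here `r2_train` is a perfect score while `r2_test` is far below zero, because on the held-out points the line does worse than simply predicting their mean.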

Linear regression did not give a good score, so I tried the k-nearest neighbors algorithm next.

K-nearest neighbors

The K-NN algorithm works by comparing an input data point to the K closest data points in the training set, where K is a positive integer specified by the user. The algorithm then predicts the class (in the case of classification) or the value (in the case of regression) of the input data point based on the classes or values of the K nearest neighbors.
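
The description above can be sketched in a few lines of plain Python. This is a minimal illustration using Euclidean distance and unweighted neighbors, not the implementation used for the score reported in this section. The function names and the toy rows are hypothetical.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Indices of the k training points closest to query (Euclidean distance)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    return order[:k]


def knn_classify(train_X, train_y, query, k):
    """Classification: majority vote among the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k):
    """Regression: mean target value of the k nearest neighbors."""
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / k


# Toy rows of (trip_distance, pick_hour); all values are made up.
train_X = [(1.0, 8), (1.2, 9), (5.0, 18), (5.5, 19), (6.0, 20)]
labels = [1, 1, 2, 2, 2]                # e.g. a class label per trip
durations = [300, 360, 900, 1000, 1100]  # e.g. a numeric target per trip

query = (5.2, 18)
predicted_label = knn_classify(train_X, labels, query, k=3)
predicted_duration = knn_regress(train_X, durations, query, k=3)
```

Because the distance mixes features with different units, the feature with the larger numeric range tends to dominate, so scaling the features before fitting is often worth trying.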

I used “Trip_distance” and “Pick up hour” as the input features (X) and “passenger count” as the target (Y).

knn.score = 0.7202628926879543

*Still being written; to be continued.