Machine Learning for Hackers Chapter 1, Part 2: Cleaning date and location data
Introduction
In the previous post, I loaded the raw UFO data into a Pandas data frame after cleaning up some irregularities in the text file. Since we’re ultimately concerned with analyzing UFO sightings over time and space, the next step is to clean those variables and prepare them for analysis and visualization.
Some Python techniques to note in this part are:
- As in the last part, Python string methods come in really handy and offer a simple, expressive solution to a lot of problems.
- When those aren’t enough, Python has a pretty straightforward set of functions for implementing regular expressions.
- The map() method in Pandas can be used to “vectorize” functions along a Series (i.e. a data frame column) and is similar to R’s apply. In general, using a NumPy ufunc (vectorized function) is preferable, but not all operations can be expressed in ufuncs. This is especially true for non-numeric operations, such as for strings or dates.
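To make the ufunc versus map() distinction concrete, here is a minimal sketch. The Series below hold made-up toy values, not the UFO data, and the names durations and locations are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the UFO data set.
durations = pd.Series([4.0, 9.0, 16.0])
locations = pd.Series([" Iowa City, IA", "Milwaukee, WI ", "Shelton, WA"])

# Numeric work: a NumPy ufunc operates on the whole Series at once.
roots = np.sqrt(durations)

# String work: map() calls an ordinary Python function on each element.
cleaned = locations.map(str.strip)
states = cleaned.map(lambda s: s.split(", ")[-1])

print(roots.tolist())
print(states.tolist())
```

The ufunc call never loops in Python, which is why it is usually preferred when it applies. The map() calls fall back to per-element Python functions, which is slower but works for arbitrary string logic.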
Cleaning dates: mapping and subsetting.
The first two columns of the data are dates in YYYYMMDD format, and Pandas imported them as integers. R has a function, as.Date, that will operate on a vector ...
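On the Python side, one way to turn integer dates like these into date objects is to map a small conversion function over the column. The sketch below assumes eight-digit year-month-day integers and uses made-up sample values. The helper name int_to_date is hypothetical and not necessarily what the original code uses.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample values; assumes eight-digit YYYYMMDD integers.
raw_dates = pd.Series([19951009, 19950101, 20010704])


def int_to_date(value):
    """Convert an integer such as 19951009 into a datetime object."""
    return datetime.strptime(str(value), "%Y%m%d")


# map() applies the conversion to each element of the Series.
parsed = raw_dates.map(int_to_date)

print(parsed)
```

Note that strptime raises a ValueError for any value that does not match the format, so a column containing malformed entries would need those handled before or during the mapping.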