Machine Learning for Hackers Chapter 1, Part 2: Cleaning date and location data

Introduction

In the previous post, I loaded the raw UFO data into a Pandas data frame after cleaning up some irregularities in the text file. Since we're ultimately concerned with analyzing UFO sightings over time and space, the next step is to clean those variables and prepare them for analysis and visualization.

Some Python techniques to note in this part are:

  • As in the last part, Python string methods come in really handy and provide a simple, expressive solution to a lot of problems.
  • When those aren't enough, Python's re module provides a straightforward set of functions for working with regular expressions.
  • The map() method in Pandas can be used to “vectorize” functions along a Series (i.e. a data frame column) and is similar to R’s apply. In general, using a NumPy ufunc (vectorized function) is preferable, but not all operations can be expressed in ufuncs. This is especially true for non-numeric operations, such as for strings or dates.

Cleaning dates: mapping and subsetting

The first two columns of the data are dates in YYYYMMDD format, and Pandas imported them as integers. R has a function, as.Date, that will operate on a vector of date strings, converting them to numeric dates. In Python, the strptime function in the datetime module performs the same role, but it is not vectorized the way as.Date is. (Note that R also has a strptime that converts date strings to POSIXlt objects.) Therefore, we have to use the map method.

def ymd_convert(x):
    '''
    Convert dates in the imported UFO data.
    Clean entries will look like YYYYMMDD. If they're not clean, return NA.
    '''
    try:
        cnv_dt = dt.datetime.strptime(str(x), '%Y%m%d')
    except ValueError:
        cnv_dt = np.nan

    return cnv_dt

ufo['date_occurred'] = ufo['date_occurred'].map(ymd_convert)
ufo['date_reported'] = ufo['date_reported'].map(ymd_convert)

Notice that map here is like R's apply function (this is a little confusing, since Python also has a built-in apply function that is not like R's). Since Series, the columns of a Pandas data frame, are just NumPy ndarrays underneath, only NumPy ufuncs will operate on them in a vectorized (fast, elementwise) fashion. Base Python functions, and any more complicated functions you build from them, have to be explicitly mapped. This is a little different from R, where, since the fundamental object in the language is the vector, functions are more likely vectorized than not. Nonetheless, NumPy ufuncs cover the gamut of mathematical operations, and for other cases the map method is easy enough to use.
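To see the difference concretely, here's a minimal sketch with a toy Series (not the UFO data): a NumPy ufunc operates on the Series elementwise on its own, while an arbitrary Python function has to be pushed through map.

import numpy as np
import pandas as pd

s = pd.Series([1, 4, 9, 16])

# A NumPy ufunc works elementwise on the Series directly.
roots = np.sqrt(s)

# A plain Python function has no ufunc equivalent, so it has to be mapped.
labels = s.map(lambda x: 'big' if x > 5 else 'small')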

Then we just get rid of the rows where one date or the other isn't in proper YYYYMMDD format.

# Get rid of the rows that couldn't be converted to datetime.
ufo = ufo[(notnull(ufo['date_reported'])) &
          (notnull(ufo['date_occurred']))]

The subsetting of the data frame is done by indexing it with a boolean vector. Since the df[ ] operation with a boolean vector returns the rows where the vector is True, this keeps only the records where both dates converted cleanly.

One can also subset an R data frame this way. R, though, also has a subset function, with the syntax:

ufo = ufo[!is.na(ufo[ , 'date.reported']) & !is.na(ufo[ , 'date.occurred']), ]

being equivalent to:

ufo = subset(ufo, subset = !is.na(date.reported) & !is.na(date.occurred))

The general subset syntax is: df.new = subset(df.orig, subset = condition, select = columns). Since subset evaluates the variables referenced in the subset and select arguments within df.orig, there's no need to call them as df.orig[ , 'var'] or df.orig$var. There are other useful commands that work like this: with, within, and transform, for example.

I find the subset function in R more expressive and easier to read than the boolean masking method, and I miss there being a Pandas equivalent.

Cleaning locations: string functions and regular expressions

Cleaning the date variables was relatively easy. Locations are trickier, and the authors don't do a particularly thorough job of it. (No knock on them; reading several pages of text cleaning would be deadly boring, and they're just illustrating some techniques.) I'll suggest a slightly better method that will pick up some extra data, but even that could probably be improved if we were concerned about getting every bit of information out of this dataset.

The authors assume that valid U.S. locations are going to be in “City, ST” format (e.g., “Iowa City, IA”). Anything else is going to be dropped as either an international record, or not worth cleaning.

They write a function that takes a location record and checks that it fits this pattern by seeing if R’s strsplit function splits it into two elements at a comma. If so, the function returns a vector containing the two elements, otherwise it returns a vector with two NAs (though not quite, see the note below). They then use R’s lapply to apply the function elementwise, and collect the resulting vectors in a list. Then there are some tricks to get the list into an Nx2 matrix, and then put each column of the matrix into a variable in the data frame as USCity and USState.

Note: the authors wrap strsplit in tryCatch assuming that the former will throw an error if there are no commas in the string. My testing shows that’s not the case, and strsplit will just return the original string. The tryCatch wrapper doesn’t have any effect, and that line of code doesn’t appear to drop locations without commas as the authors intend. This isn’t really a problem, since they later subset on records with valid U.S. states, and that ultimately drops the no-comma location records.

It’s easy to write a similar function in Python, using the split method of string objects.

def get_location(l):
    split_location = l.split(',')
    clean_location = [x.strip() for x in split_location]
    if len(split_location) != 2:
        clean_location = ['', '']
    return clean_location

This is a near-direct translation of the authors' get.location function. Note that the strip method and the list comprehension replace the gsub call the authors use to remove leading and trailing white space from the extracted city and state.

But a quick look at the data shows that there are lots of valid U.S. locations that will get dropped with this method. Specifically, the city part of the location contains commas in many records, so the split method will return more than two elements and we will drop those records as invalid. Let's check out some cases with the following code:

multi_commas = ufo['location'].map(lambda x: x.count(',') > 1)
print 'Number of entries w/ multiple commas', sum(multi_commas)
print ufo['location'][multi_commas][:10]

This returns:

Number of entries w/ multiple commas 1055
1473 Aquaduct (near, over desert, before entering California), CA
1985 Redding (northeast of, out over Millville, approximately), CA
2108 Farmington (SE of, deserted area, Hwy 44), NM
2160 Stouthill (community, nearest city 30 miles, TN), TN
2242 Highway 71 between Clearmont, Missouri and Maryville, Missou, MO
2257 Bayfield (near, Lake Superior, south shore), WI
2287 Unidentified object sig, (VIC, Australia),
2297 Garfield, (VIC, Australia),
2384 Northeast Cape AFS, St Lawrence Island,, AK
2458 Flisa, Solør, Hedemark (Norway),

So there are over a thousand location records with more than one comma, and out of the first ten, seven are valid U.S. locations.

To save these records, I’ll try another method, using regular expressions to search for locations that end with “, ST”-type patterns. Since we’re going to ultimately use map to check this pattern for every row in the data, I’ll compile the pattern first, which typically speeds up repeated searches.

us_state_pattern = re.compile(', [A-Z][A-Z]$', re.IGNORECASE)

Then, I’ll create a function that takes a location record as input, and applies the regex search to it.

def get_location2(l):
    strip_location = l.strip()
    us_state_search = us_state_pattern.search(strip_location)
    if us_state_search is None:
        clean_location = ['', '']
    else:
        us_city = strip_location[:us_state_search.start()]
        us_state = strip_location[us_state_search.start() + 2:]
        clean_location = [us_city, us_state]

    return clean_location

To follow this, note that if the regex pattern isn't found, the search method returns None; otherwise it returns a match object with several useful attributes and methods. One of them is start(), which gives the position in the string where the matched pattern starts. To extract the city, we just take all the characters in the string up to start(). The state will start 2 characters later (since we don't want the comma or space in front). The function, like the previous one, returns a two-element list with either a city and a state, or two blanks for records that didn't match the pattern.
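As a quick check (not part of the authors' code), here's what the function returns on a few of the location strings we saw above:

print get_location2('Iowa City, IA')
# ['Iowa City', 'IA']

print get_location2('Farmington (SE of, deserted area, Hwy 44), NM')
# ['Farmington (SE of, deserted area, Hwy 44)', 'NM']

print get_location2('Flisa, Solør, Hedemark (Norway),')
# ['', '']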

I again use map to apply this function elementwise to the location column:

location_lists = ufo['location'].map(get_location2)

This returns a series of two-element lists. I use list comprehensions to extract the first and second elements out to individual lists, which I assign to us_city and us_state variables in the data frame. It sounds complicated, but in Python it’s just two fairly readable lines of code:

ufo['us_city'] = [city for city, st in location_lists]
ufo['us_state'] = [st.lower() for city, st in location_lists]

The last step in cleaning the location data is to weed out any locations that fit the “City, ST” pattern but are not in U.S. states (Canadian provinces, for example). The authors do this in a straightforward way by making a list of the 50 U.S. states and using R’s match function to see where the U.S. state variable matches a state in the list. They then subset the data frame to records where there is a match.

Note: The authors leave D.C. out of the list of states. It looks like there are about 90 records with D.C. in the state column. Unfortunately, a couple of these aren’t Washington, D.C., but are South American “Distrito Capitals.” I’ll add D.C. into the list and subsequent analyses, keeping in mind there are a few false positives. (This may be true for other states as well; like I said at the start, this cleaning isn’t 100% accurate.)
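The code below checks us_state against a us_states list that isn't shown in this post; here's a minimal version of it, assuming lowercase two-letter abbreviations (to match the st.lower() call above) and with 'dc' added:

# The 50 U.S. state abbreviations, in lowercase, plus D.C.
us_states = ['ak', 'al', 'ar', 'az', 'ca', 'co', 'ct', 'dc', 'de', 'fl',
             'ga', 'hi', 'ia', 'id', 'il', 'in', 'ks', 'ky', 'la', 'ma',
             'md', 'me', 'mi', 'mn', 'mo', 'ms', 'mt', 'nc', 'nd', 'ne',
             'nh', 'nj', 'nm', 'nv', 'ny', 'oh', 'ok', 'or', 'pa', 'ri',
             'sc', 'sd', 'tn', 'tx', 'ut', 'va', 'vt', 'wa', 'wi', 'wv', 'wy']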

NumPy has an equivalent to the match function, though the name is a little more awkward: in1d. Below, I assign a blank string to any record in us_state that doesn't have a match in the state list, then drop those records out of the data.

# Blank out the city and state for records whose state isn't in the U.S. list,
# then keep only the records with a non-blank state.
state_mask = np.in1d(ufo['us_state'].tolist(), us_states)
ufo['us_state'][~state_mask] = ''
ufo['us_city'][~state_mask] = ''

ufo_us = ufo[ufo['us_state'] != '']

The tolist converts the Series to a plain list before it's passed to in1d, and in1d returns a NumPy boolean array that we can then use to index the data frame columns.
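For a quick illustration of what in1d does (with made-up values, not the UFO data): it checks each element of its first argument for membership in the second, and returns a boolean array the same length as the first.

np.in1d(['ca', 'qc', 'ny'], ['ca', 'ny', 'tx'])
# array([ True, False,  True], dtype=bool)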

And that’s that. In the next post I’ll start exploring the data graphically.
