In the previous post, I loaded the raw UFO data into a Pandas data frame after cleaning up some irregularities in the text file. Since we’re ultimately concerned with analyzing UFO sightings over time and space, the next step is to clean those variables and prepare them for analysis and visualization.
Some Python techniques to note in this part are:
- Like in the last part, Python string methods are going to come in really handy, and provide a simple, expressive solution to a lot of problems.
- When those aren’t enough, Python has a pretty straightforward set of functions for implementing regular expressions.
- The `map()` method in Pandas can be used to “vectorize” functions along a Series (i.e., a data frame column) and is similar to R’s `apply`. In general, using a NumPy ufunc (vectorized function) is preferable, but not all operations can be expressed as ufuncs. This is especially true of non-numeric operations, such as those on strings or dates.
Cleaning dates: mapping and subsetting
The first two columns of the data are dates in `YYYYMMDD` format, and Pandas imported them as integers. R has a function, `as.Date`, that will operate on a vector of date strings, converting them to numeric dates. In Python, the `strptime` function in the `datetime` module performs the same job, but it is not vectorized the way `as.Date` is. (Note that R also has a `strptime` that converts date strings to POSIX-class objects.) Therefore, we have to use the `map` method of the Series:
```python
import datetime as dt  # imported in the previous post; repeated here for clarity
import numpy as np

def ymd_convert(x):
    '''
    Convert dates in the imported UFO data.
    Clean entries will look like YYYYMMDD. If they're
    not clean, return NA.
    '''
    try:
        cnv_dt = dt.datetime.strptime(str(x), '%Y%m%d')
    except ValueError:
        cnv_dt = np.nan
    return cnv_dt

ufo['date_occurred'] = ufo['date_occurred'].map(ymd_convert)
ufo['date_reported'] = ufo['date_reported'].map(ymd_convert)
```
The `map` here is like R’s `apply` function (this is a little confusing, since Python also has an `apply` that is not like R’s). Since Series objects (the columns of Pandas data frames) are just NumPy `ndarray`s underneath, only NumPy ufuncs will operate on them in a vectorized (fast, elementwise) fashion. Base Python functions, and any more complicated functions you create from them, will have to be explicitly mapped. This is a little different from R, where, since the fundamental object in the language is the vector, functions are more likely vectorized than not. Nonetheless, NumPy ufuncs do cover the gamut of mathematical operations, and for other cases, the `map` pattern is easy enough to implement.
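To make the distinction concrete, here’s a minimal sketch on a made-up Series (not the UFO data): a ufunc like `np.log` operates on the Series elementwise as-is, while a plain Python operation, like string formatting, has to go through `map`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 10.0, 100.0])

np.log(s)                    # ufunc: fast, vectorized over the whole Series
s.map(lambda x: '%.1f' % x)  # base Python string formatting: needs map
```

(As an aside, more recent versions of Pandas also ship a vectorized `pd.to_datetime` function whose `errors='coerce'` option would handle the date conversion above, bad entries and all; I note it only as an alternative, not what’s used here.)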
Then we just get rid of the rows where one date or the other couldn’t be converted:
```python
from pandas import notnull

# Get rid of the rows that couldn't be conformed to datetime.
ufo = ufo[notnull(ufo['date_reported']) & notnull(ufo['date_occurred'])]
```
The subsetting of the data frame is done by indexing it with a boolean vector. Since the `df[ ]` operation returns rows, indexing with the boolean vector keeps only the rows where the vector is `True`.
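As a toy illustration of the mechanism, on made-up data rather than the UFO frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
mask = pd.notnull(df['a'])  # boolean Series: [True, False, True]
df[mask]                    # keeps only rows 0 and 2
```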
One can also subset an R data frame this way. R, though, also has a `subset` function, with the syntax:

```r
df.new = subset(df.orig, subset = condition, select = columns)
```

so that:

```r
ufo = ufo[!is.na(ufo[ , 'date.reported']) & !is.na(ufo[ , 'date.occurred']), ]
```

is equivalent to:

```r
ufo = subset(ufo, subset = !is.na(date.reported) & !is.na(date.occurred))
```

Since `subset` looks for the variables referenced in the `subset` and `select` arguments in the `df.orig` environment, there’s no need to call `df.orig[ , 'var']` or `df.orig$var`. There are other useful commands that work like this: `transform`, for example.
I find the `subset` function in R more expressive and easier to read than the boolean masking method, and I miss there being a Pandas equivalent.
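For what it’s worth, later versions of Pandas added a `query` method that gets part of the way there, letting you reference columns by bare name inside a string expression. A rough sketch, assuming a Pandas version with `query` (the `notnull` method calls require the Python engine):

```python
# Roughly the same subsetting as the boolean mask above.
ufo = ufo.query('date_reported.notnull() and date_occurred.notnull()',
                engine='python')
```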
Cleaning locations: string functions and regular expressions
Cleaning the date variables was relatively easy. Locations are trickier, and the authors don’t do a particularly thorough job of it. (No knock on them: reading several pages of text cleaning would be deadly boring, and they’re just illustrating some techniques.) I’ll suggest a slightly better method that will pick up some extra data, but even that could probably be improved if we were concerned about getting every bit of information out of this dataset.
The authors assume that valid U.S. locations are going to be in “City, ST” format (e.g., “Iowa City, IA”). Anything else is going to be dropped as either an international record, or not worth cleaning.
They write a function that takes a location record and checks that it fits this pattern, by seeing if R’s `strsplit` function splits it into two elements at a comma. If so, the function returns a vector containing the two elements; otherwise it returns a vector with two NAs (or not quite, see the note below). They then use R’s `lapply` to apply the function elementwise, and collect the resulting vectors in a list. Then there are some tricks to get the list into an Nx2 matrix, and to put each column of the matrix into a city and a state variable in the data frame.
Note: the authors wrap `strsplit` in `tryCatch`, assuming that the former will throw an error if there are no commas in the string. My testing shows that’s not the case, and `strsplit` will just return the original string. The `tryCatch` wrapper doesn’t have any effect, and that line of code doesn’t appear to drop locations without commas as the authors intend. This isn’t really a problem, since they later subset on records with valid U.S. states, and that ultimately drops the no-comma location records.
It’s easy to write a similar function in Python, using the `split` method of string objects.
```python
def get_location(l):
    split_location = l.split(',')
    clean_location = [x.strip() for x in split_location]
    if len(split_location) != 2:
        clean_location = ['', '']
    return clean_location
```
This is a near-direct translation of the authors’ R function. The `strip` method and the list comprehension replace the function the authors use to remove leading and trailing white space from the extracted city and state.
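A couple of illustrative calls (expected results in the comments):

```python
get_location('Iowa City, IA')
# ['Iowa City', 'IA']

get_location('Aquaduct (near, over desert, before entering California), CA')
# ['', ''] -- more than two comma-separated pieces, so the record is thrown away
```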
But a quick look at the data shows that there are lots of valid U.S. locations that will get dropped with this method. Specifically, the city part of the location contains commas in many records, so the split method will return more than two elements, and we’ll drop those records as invalid. Let’s check out some cases with the following code:
```python
multi_commas = ufo['location'].map(lambda x: x.count(',') > 1)
print 'Number of entries w/ multiple commas', sum(multi_commas)
print ufo['location'][multi_commas][:10]
```
```
Number of entries w/ multiple commas 1055
1473    Aquaduct (near, over desert, before entering California), CA
1985    Redding (northeast of, out over Millville, approximately), CA
2108    Farmington (SE of, deserted area, Hwy 44), NM
2160    Stouthill (community, nearest city 30 miles, TN), TN
2242    Highway 71 between Clearmont, Missouri and Maryville, Missou, MO
2257    Bayfield (near, Lake Superior, south shore), WI
2287    Unidentified object sig, (VIC, Australia),
2297    Garfield, (VIC, Australia),
2384    Northeast Cape AFS, St Lawrence Island,, AK
2458    Flisa, Solør, Hedemark (Norway),
```
So there are over a thousand location records with more than one comma, and out of the first ten, seven are valid U.S. locations.
To save these records, I’ll try another method, using regular expressions to search for locations that end with “, ST”-type patterns. Since we’re ultimately going to use `map` to check this pattern for every row in the data, I’ll compile the pattern first, which typically speeds up repeated searches.
```python
import re

us_state_pattern = re.compile(', [A-Z][A-Z]$', re.IGNORECASE)
```
Then, I’ll create a function that takes a location record as input, and applies the regex search to it.
```python
def get_location2(l):
    strip_location = l.strip()
    us_state_search = us_state_pattern.search(strip_location)
    if us_state_search is None:
        clean_location = ['', '']
    else:
        us_city = strip_location[:us_state_search.start()]
        us_state = strip_location[us_state_search.start() + 2:]
        clean_location = [us_city, us_state]
    return clean_location
```
To follow this, note that if the regex pattern isn’t found, then the `search` method returns `None`; otherwise it returns a match object with several useful attributes. One of them is `start`, which indicates where in the string the pattern starts. To extract the city, we just take all the characters in the string up to `start`. The state will start two characters later (since we don’t want the comma or space in front). The function, like the previous one, finally returns a two-element list with either a city and a state, or two blanks for records that didn’t match the pattern.
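A few quick checks show how this handles both the clean cases and the multi-comma cities from above (expected results in the comments):

```python
get_location2('Iowa City, IA')
# ['Iowa City', 'IA']

get_location2('Farmington (SE of, deserted area, Hwy 44), NM')
# ['Farmington (SE of, deserted area, Hwy 44)', 'NM']

get_location2('Garfield, (VIC, Australia),')
# ['', ''] -- no trailing ', ST' pattern, so it's still (rightly) dropped
```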
I again use `map` to apply this function elementwise to the location column:
```python
location_lists = ufo['location'].map(get_location2)
```
This returns a series of two-element lists. I use list comprehensions to extract the first and second elements out to individual lists, which I assign to the `us_city` and `us_state` variables in the data frame. It sounds complicated, but in Python it’s just two fairly readable lines of code:
```python
ufo['us_city'] = [city for city, st in location_lists]
ufo['us_state'] = [st.lower() for city, st in location_lists]
```
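An alternative, if you find it clearer, is to unzip the list of pairs with `zip`; this is just a sketch of an equivalent approach, not what the analysis above uses:

```python
# Same result via zip: unpack the pairs into two tuples.
cities, states = zip(*location_lists)
ufo['us_city'] = list(cities)
ufo['us_state'] = [st.lower() for st in states]
```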
The last step in cleaning the location data is to weed out any locations that fit the “City, ST” pattern but are not U.S. states (Canadian provinces, for example). The authors do this in a straightforward way, by making a list of the 50 U.S. states and using R’s `match` function to see where the U.S. state variable matches a state in the list. They then subset the data frame to the records where there is a match.
Note: The authors leave D.C. out of the list of states. It looks like there are about 90 records with D.C. in the state column. Unfortunately, a couple of these aren’t Washington, D.C., but are South American “Distrito Capitals.” I’ll add D.C. into the list and subsequent analyses, keeping in mind there are a few false positives. (This may be true for other states as well; like I said at the start, this cleaning isn’t 100% accurate.)
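For reference, here’s what that list can look like in Python. This is an illustrative definition (the abbreviations are lowercase to match the lower-cased `us_state` column, and `'dc'` is included as discussed):

```python
us_states = ['al', 'ak', 'az', 'ar', 'ca', 'co', 'ct', 'dc', 'de', 'fl',
             'ga', 'hi', 'ia', 'id', 'il', 'in', 'ks', 'ky', 'la', 'ma',
             'md', 'me', 'mi', 'mn', 'mo', 'ms', 'mt', 'nc', 'nd', 'ne',
             'nh', 'nj', 'nm', 'nv', 'ny', 'oh', 'ok', 'or', 'pa', 'ri',
             'sc', 'sd', 'tn', 'tx', 'ut', 'va', 'vt', 'wa', 'wi', 'wv',
             'wy']
```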
NumPy has an equivalent to the `match` function, though the name is a little more awkward: `in1d`. Below, I assign a blank string to any record in `us_state` that doesn’t have a match in the state list, then drop those records out of the data.
```python
# Blank out any state (and its city) that isn't in the U.S. state list,
# then keep only the records with a non-blank state.
ufo['us_state'][~np.in1d(ufo['us_state'].tolist(), us_states)] = ''
ufo['us_city'][~np.in1d(ufo['us_state'].tolist(), us_states)] = ''
ufo_us = ufo[ufo['us_state'] != '']
```
The `tolist` is necessary because `in1d` wants a list or plain array argument rather than a Pandas Series; it returns a NumPy boolean array, which the data frame’s `[ ]` accepts for indexing.
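To see `in1d` in isolation, with made-up values:

```python
np.in1d(['ca', 'on', 'zz'], us_states)
# array([ True, False, False]) -- 'on' (Ontario) and 'zz' aren't U.S. states
```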
And that’s that. In the next post I’ll start exploring the data graphically.