K-means clustering
In this post I'll show how to cluster geographic data using K-means. First we need to import some libraries and get the dataset. For now I'm not going to worry about using anything other than the lat/long values; in a later post I'll get to more complex analysis.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
import requests
import geopandas as gp
from shapely.geometry import Point
I'll be using this L.A. Crime dataset.
r=requests.get('https://data.lacity.org/resource/7fvc-faax.geojson')
data=r.json()
features=data['features']
This particular dataset requires a little bit of cleaning before going any further. It happens that some of the coordinates are assigned to 0,0. Since I'm merely using the dataset as an example for clustering, I'm going to remove those points. In other situations it may be better to handle them differently.
obj = []
for i in features:
    coords = i['geometry']['coordinates']
    lon = coords[0]
    lat = coords[1]
    # Keep the point only if it isn't sitting at (0, 0)
    if lat != 0 or lon != 0:
        obj.append((lon, lat))
obj
df=pd.DataFrame(obj, columns=['lon','lat'])
geometry = [Point(xy) for xy in zip(df.lon, df.lat)]
crs = {'init': 'epsg:4326'}
gdf = gp.GeoDataFrame(df,crs=crs, geometry=geometry)
Next I'm going to project the data into NAD83 / California zone 5 (EPSG:26945).
gdf = gdf.to_crs({'init': 'epsg:26945'})
As always, before going forward I should plot my data to ensure things look okay. Try to imagine how you would divide up the clusters so you can compare your intuition with K-means.
f, ax = plt.subplots(1, figsize=(20, 15))
ax.set_title("LA Crime", fontsize=40)
ax.set_axis_off()
gdf.plot(ax=ax, color='black')
plt.show()
Everything looks good, so now let's do some clustering. We'll need to extract the x/y values from the geometry field and construct a NumPy array.
a = gdf['geometry'].apply(lambda p: p.x)
b = gdf['geometry'].apply(lambda p: p.y)
X = np.column_stack((a, b))
Next I'm going to use the elbow method to determine the optimal number of clusters. At some point, increasing the number of clusters only yields marginal gains or a loss of insight; if we selected for the lowest possible variance, each point would end up in its own cluster. What we're looking for is the point on the graph where the within-cluster sum of squares (WCSS) stops dropping sharply (it should look like an elbow).
wcss = []
for i in range(1, 14):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 14), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
It would appear the optimal number of clusters is 4, so that's what we'll go with. This is only a suggestion; depending on the problem, it may make sense not to heed the elbow method's guidance, for example if we were using K-means as a classifier on 3 known categories while the elbow method suggests 2.
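As an optional sanity check (not part of the original workflow), scikit-learn's silhouette score gives a rough second opinion on how well separated the clusters are for each candidate number of clusters:

from sklearn.metrics import silhouette_score

# Higher silhouette scores suggest better-separated clusters (only defined for k >= 2).
for k in range(2, 8):
    labels = KMeans(n_clusters=k, init='k-means++', random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))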
kmeans = KMeans(n_clusters = 4, init = 'k-means++', random_state = 5, max_iter=400)
y_kmeans = kmeans.fit_predict(X)
k=pd.DataFrame(y_kmeans, columns=['cluster'])
gdf=gdf.join(k)
Our data has been clustered and joined back to the GeoDataFrame. Now we're ready to plot the results.
f, ax = plt.subplots(1, figsize=(20, 15))
ax.set_title("LA Crime", fontsize=40)
ax.set_axis_off()
gdf.plot(column='cluster',cmap='Set2', ax=ax, markersize=80)
plt.show()
Do you agree with the results? Are they what you expected? Perhaps we should try increasing the number of clusters to see what happens.
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 5, max_iter=400)
y_kmeans = kmeans.fit_predict(X)
k5=pd.DataFrame(y_kmeans, columns=['cluster5'])
gdf=gdf.join(k5)
f, ax = plt.subplots(1, figsize=(20, 15))
ax.set_title("LA Crime", fontsize=40)
ax.set_axis_off()
gdf.plot(column='cluster5',cmap='Set2', ax=ax, markersize=80)
plt.show()
What do you think: is K-means the right tool for the job? What is the optimal number of clusters? There's no simple answer. Problems are often domain specific; perhaps crime data would benefit from a higher number of clusters, or maybe clustering simply isn't a useful lens here. K-means tends to favor "round" clusters, which is perhaps why the middle section is being divided in half rather than producing one large oval cluster. In a later post I'll revisit this dataset to see what additional insights can be gained by using different algorithms and increasing the dimensionality. I'll leave you with a famous quote about statistics.
"Essentially, all models are wrong, but some are useful"- George Box
Computational Geometry Part I: Convex Hull
Computational geometry (CG) is the study of algorithms which relate to geometry, and it often serves as the bedrock for many GIS functionalities. On the surface, problems in CG can look quite simple, yet writing code for them can quickly become a daunting (but fun) challenge. Thankfully many problems in CG have already been solved, and there are mature libraries full of solutions; CGAL is probably the most prominent. It has Python bindings and is used in applications like QGIS. That being said, you don't need to be able to solve CG problems from scratch to understand and use the existing concepts and solutions.
A common problem in CG might be: which points are in which polygons? To the human eye such a question seems trivial, yet coding it may not come as intuitively. For this intro I'm going to focus on one of the most fundamental concepts in CG: the convex hull. Imagine you had some nails in a board and stretched a rubber band around them; the shape it takes is the convex hull. It is the smallest convex region that bounds a set of spatial features (points, lines, or polygons).
from shapely.geometry import MultiPoint
mp=MultiPoint([(-30, 100), (30, -20), (-30, 2),(-52, 42), (-53,12), (71,-25), (-11, 50) ,(-31, 40)])
mp
mp.convex_hull
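Shapely does the heavy lifting for us here, but to demystify it a little, here's a minimal from-scratch sketch of one standard convex hull algorithm, Andrew's monotone chain (illustrative only, and not necessarily how Shapely computes hulls internally):

def cross(o, a, b):
    # Cross product of vectors OA and OB; positive means a counter-clockwise turn.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    # Andrew's monotone chain: sort the points, then build the lower and upper hulls.
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # Concatenate the two halves, dropping the duplicated endpoints.
    return lower[:-1] + upper[:-1]

convex_hull([(-30, 100), (30, -20), (-30, 2), (-52, 42), (-53, 12), (71, -25), (-11, 50), (-31, 40)])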
Any general purpose GIS software (ArcMap, QGIS, ...) should have support for producing convex hulls. Reasons for using this functionality include converting points to polygons and creating simpler shapes to speed up processing. I hope this information is useful, as I plan to continue adding small posts about various algorithms in machine learning and computational geometry.
Geocoding with the Google Maps API
import requests
import folium
You're going to have to get an API key if you don't already have one by going to the Google Geocoder API. After that we need to construct our URL to send the request.
The first part of the URL will appear as such; go ahead and try clicking on it to see what happens.
https://maps.googleapis.com/maps/api/geocode/json?address=time+square+new+york+city&key
When making a request you should have an API key, but be careful not to make too many requests or you might hit a query limit.
api_key = ''  # Place your key here
url="https://maps.googleapis.com/maps/api/geocode/json?address=time+square+new+york+city&key="
r = requests.get(url+api_key)
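As an aside, a slightly more robust way to build the same request (optional, same result) is to let Requests assemble and URL-encode the query string for you:

# Requests handles the URL encoding of the address and key for us.
params = {'address': 'Times Square, New York City', 'key': api_key}
r = requests.get('https://maps.googleapis.com/maps/api/geocode/json', params=params)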
Now that we have the response object r, we should check the status code. It looks like we have a 200, which is good. For more information you can refer to a list of HTTP status codes.
r.status_code
Now we're going to use Requests' built-in JSON decoder.
rjson=r.json()
rjson
Now all we need to do is access the information we want from the JSON and do something with it. In this case I'm going to make a simple map with a marker for Times Square.
lat = rjson['results'][0]['geometry']['location']['lat']
lng = rjson['results'][0]['geometry']['location']['lng']
address = rjson['results'][0]['formatted_address']
map_1 = folium.Map(location=[lat, lng],
                   zoom_start=14,
                   tiles='Stamen Toner')
folium.Marker([lat, lng], popup=address).add_to(map_1)
map_1
Well, that's all folks! Geocoding can be pretty easy, but when dealing with large batches of data and funky addresses, it can quickly become a headache. For more information on usage limits, go to the Usage Limits page. Always make sure to validate the data, or at least be aware that it may not always be accurate.
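Since batch geocoding is where the headaches usually start, here's a minimal sketch of a helper that geocodes a list of addresses with basic error handling; the address list and sleep interval are just placeholders, and you should check the current API terms and quotas before running anything like this at scale.

import time

def geocode(address, api_key):
    # Returns (lat, lng), or None if the request fails or the address can't be resolved.
    r = requests.get('https://maps.googleapis.com/maps/api/geocode/json',
                     params={'address': address, 'key': api_key})
    if r.status_code != 200:
        return None
    results = r.json().get('results', [])
    if not results:
        return None
    loc = results[0]['geometry']['location']
    return loc['lat'], loc['lng']

addresses = ['Times Square, New York City', 'Union Station, Los Angeles']  # placeholder list
for addr in addresses:
    print(addr, geocode(addr, api_key))
    time.sleep(0.1)  # be polite and stay under the query limit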
Where are all the GIS jobs?
So you want to know where the GIS jobs are? Well, here's a map for you. I collected the data from GIS Jobs Clearinghouse and produced a simple heatmap using QGIS.
The Ghost Map
Some would argue geospatial analysis didn't come into being until computers were available; I would argue otherwise. Note I'm not using the term GIS in this context, because it's too broad and ambiguous. John Snow is credited as the father of modern epidemiology (not the Game of Thrones character), though arguably he is also the grandfather of geospatial analysis. In 1854 John Snow was investigating a cholera outbreak in Soho, London. To do this he plotted death locations on a map and discovered a trend: everything was near the Broad Street pump. At the time the widely held belief was that the disease was transmitted via "foul air"; Snow disagreed, believing it was due to contaminated water, and his map helped him prove it. Today we still have Snow's original map and data, so why not replicate his study using more modern tools?
Here's Dr. Snow's original map; if you'd like to read more about his work, go check out the Wikipedia article.
All my work was done using QGIS. In this first map I generated Voronoi polygons from the pump locations. Each polygon contains exactly one pump, and every location inside that polygon is closer to its pump than to any other. Next I did a point-in-polygon analysis to get the death counts for each polygon.
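The original analysis was done entirely in QGIS, but if you wanted to reproduce the Voronoi and point-in-polygon steps in Python, a rough sketch with Shapely (1.8+) might look like this; the pump and death coordinates below are placeholders, not the real Snow data.

from shapely.geometry import MultiPoint, Point
from shapely.ops import voronoi_diagram

pumps = MultiPoint([(0, 0), (3, 1), (1, 4)])  # placeholder pump locations
deaths = [Point(0.5, 0.2), Point(2.8, 1.1), Point(1.2, 3.7), Point(0.1, 0.9)]  # placeholder deaths

cells = voronoi_diagram(pumps)
for cell in cells.geoms:
    # Point-in-polygon count of deaths falling inside each pump's Voronoi cell.
    count = sum(cell.contains(d) for d in deaths)
    print(cell.bounds, count)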
This is just a basic heat map, and it's clear that the majority of the fatalities are around the Broad Street pump. The evidence appears to be rather incriminating; I think it's safe to remove the Broad Street pump's handle now.
Making a map with Python
import os
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gp
import pyproj
Method 1: Folium (based upon Leaflet.js)
import folium
m = folium.Map(
    location=[36.5236, -87.6750],
    tiles='Stamen Toner',
    zoom_start=4
)
m
Now let's change the basemap and add some vector layers.
m = folium.Map(
    location=[36.5236, -87.6750],
    tiles='Mapbox Bright',
    zoom_start=4
)
folium.Marker([45.3288, -100.6625], popup="<i>I'm a marker!</i>").add_to(m)
folium.CircleMarker(
    location=[37.5215, -112.6261],
    radius=50,
    popup="I'm a circle",
    color='#3186cc',
    fill=True,
    fill_color='#3186cc'
).add_to(m)
m
Using Geopandas
While Folium looks nice, I tend to prefer using GeoPandas. Being built upon pandas means it's lightning fast and the syntax is familiar, so you can easily manipulate fields and process data quickly.
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gp
usa=gp.read_file('united-states/us-albers-counties.json')
plt.style.use('fivethirtyeight')
It's really easy to plot data using Geopandas.
usa.plot()
Okay, so we may need to clean this up some: doing away with the pesky axes, enlarging the map, and creating a choropleth map.
f, ax = plt.subplots(1, figsize=(15, 10))
ax.set_title("US counties")
ax.set_axis_off()
plt.axis('equal');
usa.plot(ax=ax)
plt.show()
Before proceeding let's take a quick look at our data.
usa.head()
unemployment=pd.read_csv('unemployment.csv')
unemployment.head()
Now I'm going to merge our two datasets together to create a choropleth map.
usa['FIPStxt']=pd.to_numeric(usa['fips'].str[1:])
US_jobs = usa.merge(unemployment, on='FIPStxt')
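Before plotting, a quick optional check with pandas' merge indicator shows how many counties actually found a match in the unemployment table:

# Optional: the _merge column reports which rows matched on both sides.
check = usa.merge(unemployment, on='FIPStxt', how='left', indicator=True)
print(check['_merge'].value_counts())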
Let's go ahead and plot everything to make sure the merge was successful.
f, ax = plt.subplots(1, figsize=(15, 10))
ax.set_title("US unemployment rate 2016 by counties" , fontsize=17)
ax.set_axis_off()
plt.axis('equal')
US_jobs.plot( column='Unemployment_rate_2016' ,cmap='OrRd', ax=ax, edgecolor='grey')
plt.show()
So it turns out that not everything made it through the merge. Let's pick a state that did make it through, focus on that, and do some number crunching. I'm going to go with California.
CA_jobs=US_jobs.loc[US_jobs['iso_3166_2'] == 'CA']
f, ax = plt.subplots(1, figsize=(15, 10))
ax.set_title("California unemployment rate 2016 by counties", fontsize=17)
ax.set_axis_off()
plt.axis('equal')
CA_jobs.plot( column='Unemployment_rate_2016' ,cmap='OrRd', ax=ax, edgecolor='grey')
plt.show()
Great! Let's dig a little further and do some graphing with Matplotlib. First I'm going to list all the columns so we know which column headers to pick.
list(CA_jobs.columns.values)
Now I'm going to make a time series plot of the unemployment rates from 2007 to 2016 for all the California counties. Given that there are 58 of them, I'll forgo adding county labels.
f, ax = plt.subplots(1, figsize=(20, 25))
ax.set_title("California unemployment time series by counties", fontsize=20)
CA_jobs2 = CA_jobs.filter(regex='Unemployment_rate')
plt.plot(pd.Series(list(range(2007,2017))), CA_jobs2.T)
plt.show()
Well, this is the end of my first blog post for now. I might come back and add some more map-making methods later on, but for now the plan is to keep updating things and add more posts. I hope the material here has been informative. Please feel free to contact me about any potential improvements. Thank you for reading my blog, now go make some awesome maps!