In this tutorial, we want to give you the opportunity to do some exploratory data analysis on your own. For this purpose, we will use a real-world dataset of crime in Chicago. The full dataset is accessible at https://catalog.data.gov/dataset/crimes-2001-to-present-398a4, but we also provide a subset of the data in compressed form at http://www.dbs.ifi.lmu.de/Lehre/KDD/WS1819/tutorials/ChicagoCrime2017.csv.xz.
You do not need to unpack the file, as pandas
can also read from compressed CSV files.
The meaning of each column is described at https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2 .
If you need any help, feel free to contact your tutor or the assistants.
In this tutorial, we will need the following packages. Please refer to the Python introduction when you need to know how to install Python packages.
%matplotlib inline
import numpy
import pandas
from matplotlib import pyplot as plt
import seaborn
from pandas.plotting import parallel_coordinates
# Set default figure size
from matplotlib import rcParams
rcParams['figure.figsize'] = (15, 15)
# For wide screens
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
Here, please insert the path of the downloaded file. A relative path will be relative to the notebook.
file_path = 'ChicagoCrime2017.csv.xz'
Use pandas.read_csv to read the CSV file into a pandas DataFrame.
df = pandas.read_csv(file_path)
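pandas infers the compression from the '.xz' file extension. If you ever work with a file whose extension does not reveal the compression, you can also state it explicitly; a minimal sketch of the equivalent call:
# Equivalent read with the compression stated explicitly.
df = pandas.read_csv(file_path, compression='xz')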
df.info() displays type and shape information about the table.
df.info()
df.head() shows the first few rows of the table.
df.head()
For some rows there were missing values; they have been replaced by NaN in the parser.
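If you want to know how many rows are affected before dropping them, a quick check (a small sketch using standard pandas calls):
# Count the rows that contain at least one missing value.
df.isna().any(axis=1).sum()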
We remove these rows using the following line.
df.dropna(how='any', inplace=True)
From the output of df.info() above, you can see that the columns were converted to one of the following types: bool, float64, int64, object.
In the following, we will pre-process the data for the visualizations.
Perform the following operations on the columns:
The following columns are identifiers, which we keep as they are (without type conversion): 'ID' and 'Case Number'. (The ID is unique per row, whereas the Case Number can be used to find multiple crimes connected to one case.) We set ID as the index and drop Case Number.
# TODO: SOLUTION
df.set_index('ID', drop=True, inplace=True)
del df['Case Number']
df.head()
As loaded, the table has three time-related attributes: 'Year', 'Updated On', and 'Date'. As we are only looking at the year 2017, the 'Year' column can be dropped. The 'Updated On' column is also of no use for us.
Taking a closer look at 'Date', we find that this column carries not only information about the day, but also the time. We will create four columns from it: month, day, weekday, hour.
You can use the following snippet to parse the date as given into a datetime column.
datetime_column = pandas.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p')
Consult https://pandas.pydata.org/pandas-docs/stable/categorical.html#string-and-datetime-accessors for information on how to work with columns of type datetime.
# TODO: SOLUTION
del df['Year']
del df['Updated On']
del df['Date']
df['month'] = datetime_column.dt.month
df['day'] = datetime_column.dt.day
df['weekday'] = datetime_column.dt.day_name().astype('category')  # .dt.weekday_name in older pandas versions
df['hour'] = datetime_column.dt.hour
df.head()
The given data provides location information on several levels. Beat, Community Area, District, and Ward have been parsed as float datatype; for visually more pleasing labels, you can first convert them back to integers before making them categorical.
Apply the following operations on the location-related columns:
Column | Operation |
---|---|
'Longitude' | Keep |
'Latitude' | Keep |
'Location' | Drop |
'X Coordinate' | Drop |
'Y Coordinate' | Drop |
'Block' | Categorical |
'Beat' | Categorical |
'Community Area' | Categorical |
'District' | Categorical |
'Ward' | Categorical |
Hints:
- Use df.drop(columns=<column_name>, inplace=True) to remove a column from a table (working in-place).
- Use df[<column_name>] = df[<column_name>].astype('category') to convert a column to categorical data type.
# TODO: SOLUTION
del df['Location']
del df['X Coordinate']
del df['Y Coordinate']
df['Block'] = df['Block'].astype('category')
for categorical_column in ['Beat', 'Community Area', 'District', 'Ward']:
    df[categorical_column] = df[categorical_column].astype(numpy.int32).astype('category')
df.head()
The crime itself is described using the following columns. Apply the following operations on the crime-related columns:
Column | Operation |
---|---|
'IUCR' | Drop |
'Primary Type' | Categorical |
'Description' | Categorical |
'FBI Code' | Drop |
'Arrest' | Categorical |
'Domestic' | Categorical |
'Location Description' | Categorical |
# TODO: SOLUTION
del df['IUCR']
del df['FBI Code']
for categorical_column in ['Primary Type', 'Description', 'Arrest', 'Domestic', 'Location Description']:
    df[categorical_column] = df[categorical_column].astype('category')
df.head()
Now that we have finished preprocessing the data, you can show some information about the resulting data frame using df.head() and df.info().
In addition, you can use df.describe() to obtain some measures of central tendency and dispersion for the numeric columns.
Can you already see something insightful from these simple statistics?
# TODO: SOLUTION
df.head()
# TODO: SOLUTION
df.info()
# TODO: SOLUTION
df.describe()
# Something to see:
# hour:
# 25%-percentile = 9
# 75%-percentile = 18
# => 50% of the crimes happen between 9:00 and 18:00, i.e. during the day
# month: judging from the quantiles alone, crimes look uniformly distributed over the months
Use df.groupby(by=columns).size() to count how often certain combinations of values in the given columns occur, e.g. use columns=['Primary Type'] to get a list of the most frequent primary crime types.
df.groupby(by=['Primary Type', 'Description']).size().sort_values(ascending=False)
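For example, grouping by the primary type alone (a minimal variant of the line above) lists the most frequent primary crime types:
# Count crimes per primary type, most frequent first.
df.groupby(by=['Primary Type']).size().sort_values(ascending=False)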
To visualize the distribution of categorical attributes, we will first use equi-width histograms. Create such histograms for all categorical attributes.
What form do these distributions exhibit? Are there distinct peaks? Can you guess reasons for these? What insights do you gain? How useful are the plots for the region-information attributes?
Hints:
- Use seaborn.catplot(data=df, x=<column_name>, kind='count') to create a count plot.
- For attributes with many distinct values, such as Ward, you can control the order of the bars via the order=<list-of-values> keyword of seaborn.catplot. What are useful options for ordering the values?
# TODO: SOLUTION
for cat_var in ['month', 'hour', 'weekday', 'Domestic', 'Arrest', 'District', 'Ward', 'Location Description']:
    if cat_var == 'weekday':
        # Sort by week
        order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    elif cat_var in ['District', 'Ward', 'Location Description']:
        # Sort by frequency
        order = df.groupby(by=cat_var).size().sort_values(ascending=False).index
    else:
        # Do not sort explicitly
        order = None
    seaborn.catplot(data=df, x=cat_var, kind='count', order=order)
    plt.xticks(rotation=90)
# Discoveries:
# - month: bimodal distribution; peaks at January and July (=> maybe vacation time)
# - hour: most crimes during (late) daytime
# - weekday: slightly more crime on weekends
# - most crimes are non-domestic
# - most crimes do not lead to an arrest
# - There are some districts and wards with more crime than others.
# - The plot for 'Ward' is already hard to read.
# - Sorting increases readability.
Next, we look at the quantiles of the numeric attributes.
Hints:
- Use df.quantile(q=<p-values>), where <p-values> is a list of the quantiles you want to compute. The return value is a dataframe whose index contains the $p$ values, and whose columns contain the corresponding quantiles for each numeric(!) column of the original dataframe.
- To generate num equi-distant values between $0$ and $1$, you can use numpy.linspace(start=0.0, stop=1.0, num=num).
- Plot the quantiles using seaborn.scatterplot(data=quantiles, x=<p-values>, y=<column_name>).
Questions:
# TODO: SOLUTION
ps = numpy.linspace(start=0.0, stop=1.0, num=100)
quantiles = df.quantile(q=ps)
columns = quantiles.columns
quantiles['p'] = ps
for column in columns:
    plt.subplots()
    seaborn.scatterplot(data=quantiles, x=ps, y=column)
# Discoveries:
# - Non-uniform distribution of locations; BUT: spatial data should be visualised on a map
# - Steps for discrete attributes
Another option to display the distribution of data values of one attribute is the boxplot. It also allows for easy comparison between the distributions in different subsamples.
Tasks:
- Create a boxplot of 'hour' for different values of 'Primary Type'. (Is there a "typical" time for some kinds of crime?)
- Create a boxplot of 'hour' for different values of 'weekday'.
- Use the order keyword to pass some explicit order. What are meaningful orderings? Do they facilitate understanding the plot?
Hint: Use plt.xticks(rotation=90) to rotate the tick labels on the x-axis.
# TODO: SOLUTION
plt.subplots()
group = df.groupby(by='Primary Type').agg({'hour': 'median'}).sort_values(by='hour', ascending=False)
order = group.index
seaborn.boxplot(data=df, x='Primary Type', y='hour', order=order)
plt.xticks(rotation=90)
plt.tight_layout()
plt.subplots()
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
seaborn.boxplot(data=df, x='weekday', y='hour', order=order)
plt.xticks(rotation=90)
plt.tight_layout()
plt.subplots()
group = df.groupby(by='Location Description').agg({'hour': 'median'}).sort_values(by='hour', ascending=False)
order = group.index
seaborn.boxplot(data=df, x='Location Description', y='hour', order=order)
plt.xticks(rotation=90)
plt.tight_layout()
# Observations
# - Prostitution is concentrated in the evening
# - Public indecency is quite narrow around noon; BUT: Also small sample size
# - Most categories cover the complete data range with their 25%-75% quantile.
# - On weekends crime starts earlier
Compared to a boxplot, violin plots replace the box by a kernel density estimate of the data distribution. Thereby, they reveal more details about the distribution of the data. You can create violin plots using
seaborn.violinplot(data=df, x=<column_name_a>, y=<column_name_a>)
Create a violin plot for x='Primary Type', y='hour'; restrict 'Primary Type' to the 5 most common types.
Hint: You can obtain the counts per type using df.groupby(by='Primary Type').agg({'Primary Type': 'count'}).
# TODO: SOLUTION
for bw in [None, 1.0, 0.1, 0.01]:
    plt.subplots()
    order = df.groupby(by='Primary Type').agg({'Primary Type': 'count'}).sort_values(by='Primary Type', ascending=False).index
    order = order[:5]
    seaborn.violinplot(data=df, x='Primary Type', y='hour', order=order, bw=bw)
    plt.xticks(rotation=90)
    plt.tight_layout()
# Observation:
# - Deceptive Practice more spiky; three strong peaks
# - 5am minimum in all types
# - bandwidth:
# - too large: too much smoothing; no details
# - too small: problems with discrete hours
You can create a scatter plot matrix of the data using
seaborn.pairplot(data=data, kind='scatter', diag_kind='scatter')
Hints:
- Use only a sample of the data, obtained with df.sample(frac=<fraction>), e.g. frac=0.01.
- There is a keyword hue to which you can pass the name of a (categorical) column. Afterwards, the points are coloured by this attribute.
- You can use plot_kws={<key>: <value>} to pass arguments to the plot function, e.g. plot_kws={'alpha': .2} to plot the points with transparency.
- You can use vars=<list-of-column-names> to give an explicit order, or to use only a subset of the columns. Experiment with different orderings.
to give an explicit order, or use only a subset of the columns. Experiment with different orderings.seaborn.pairplot(data=df.sample(frac=.001), kind='scatter', diag_kind='scatter', plot_kws={'alpha': .2} )
When scattering many points, the scatter matrix may get cluttered, and it is hard to detect areas where many points lie, as they overlap. To cope with this situation, one can replace the scatter plots in each cell by 2-D histograms.
For this we will use seaborn.PairGrid
. A pair grid defines the "matrix" as known from the scatter plot, and provides interfaces to map different visualizations to the specific cells.
For easier usage, we provide you with the following method, which plots an equi-width 2-D histogram into a cell.
Moreover, it determines the number of bins in $x$ and $y$ direction such that it does not exceed the number of different values, in order to allow displaying categorical variables.
Be aware that the order in that case does not necessarily convey any meaning!
def hist2d(x, y, color, label, max_bins=50, **kwargs):
    x_bins = numpy.unique(x).size
    x_bins = min(x_bins, max_bins)
    y_bins = numpy.unique(y).size
    y_bins = min(y_bins, max_bins)
    bins = (x_bins, y_bins)
    plt.hist2d(x=x, y=y, bins=bins)
Next, create the pair grid g and map the method hist2d to the cells using g.map(hist2d). As before, you can pass the keyword argument vars to the PairGrid constructor to select only a subset of columns.
You may also want to start with only a fraction of the data (i.e. using df.sample(frac=<fraction>) as value for the data parameter).
However, you should be able to run the visualization even for the full dataset.
Take a closer look at the cell (longitude, latitude). What can you observe? Also compare the plot against the shape of Chicago, e.g. from OpenStreetMap or Wikipedia.
Hint: Use plt.subplots_adjust(wspace=0.02, hspace=0.02) to compact the grid.
# TODO: SOLUTION
frac = 0.01
selection = df.sample(frac=frac)
g = seaborn.PairGrid(data=selection)
g.map(hist2d)
plt.subplots_adjust(wspace=0.02, hspace=0.02)
# Observations:
# - location reflects shape of Chicago.
# - there is a clear crime hotspot at the harbour
# - one can re-discover previous observations, e.g. the 5 am crime minimum
We will also take a look into a (simple) pixel-oriented technique.
Here, we have one subplot for each attribute, and each data point maps to a single pixel per window, where the colour indicates the value.
First, we create a (n_features x n_features)
matrix of subplots by
figure, axes = plt.subplots(nrows=n_features, ncols=n_features)
In row $r$, we display the whole dataset sorted by the $r$th feature. In order to avoid artifacts by having internal presorting in the data set, shuffle the rows first. You can do both using
sorted_df = df.sample(frac=1.0).sort_values(by=df.columns[r])
To visualize the $c$th feature in column $c$ (of the subplots matrix), we have to bring the values into a rectangular array.
We use width=512, and infer the height by height = int(numpy.ceil(n_samples / width)).
You can use the following method to bring a column from a data frame into a rectangular array.
def prepare_cell(df_column, width=512):
    n_samples = len(df_column)
    height = int(numpy.ceil(n_samples / width))
    buffer = numpy.empty(shape=(height, width))
    buffer[:] = float('nan')
    if hasattr(df_column, 'cat'):
        df_column = df_column.cat.codes
    buffer.flat[:n_samples] = df_column.values
    # buffer = buffer.reshape((height, width))
    return buffer
Finally, you can use axes[r, c].imshow(buffer, aspect='auto')
to show the data in buffer
in row $r$, column $c$ of the matrix.
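To get a feeling for a single cell before building the full matrix, you could, for example, sort by 'hour' and render only the 'Primary Type' column (an illustrative combination, not prescribed by the exercise):
# Shuffle, sort by 'hour', and display 'Primary Type' as one pixel-oriented cell.
sorted_df = df.sample(frac=1.0).sort_values(by='hour')
buffer = prepare_cell(df_column=sorted_df['Primary Type'])
plt.imshow(buffer, aspect='auto')
plt.xticks([])
plt.yticks([])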
Questions:
Hints:
- Use axes[r, c].set_xticks([]) and axes[r, c].set_yticks([]) to disable tick labels on the axes.
- Use axes[r, 0].set_ylabel(df.columns[r]) and axes[-1, c].set_xlabel(df.columns[c]) to add x and y labels to your matrix.
- Use plt.subplots_adjust(hspace=.02, wspace=.02) to adjust the space between the cells.
# TODO: SOLUTION
n_features = len(df.columns)
f, axes = plt.subplots(nrows=n_features, ncols=n_features)
for r, primary_column in enumerate(df.columns):
    sorted_df = df.sample(frac=1.0).sort_values(by=primary_column)
    for i, column in enumerate(df.columns):
        buffer = prepare_cell(df_column=sorted_df[column])
        axes[r, i].imshow(buffer, aspect='auto')
        axes[r, i].set_xticks([])
        axes[r, i].set_yticks([])
    axes[r, 0].set_ylabel(primary_column)
    axes[-1, r].set_xlabel(primary_column)
plt.subplots_adjust(hspace=.02, wspace=.02)
# Observations:
# - We can see the hierarchy in Primary Type and Description
# - Block, Beat, District, Ward, Community Area are all different resolutions of the region hierarchy. We can clearly see this relation in the pattern.
# - The regions are located in some latitude/longitude area
# - There are some crime types (primary type and description) for which crimes are more often classified as Domestic
# - There are some patterns between Description (i.e. high-resolution crime type) and Location Description, indicating that some crime types occur at typical locations
# - There are some patterns between Arrest and Primary Type, i.e. there are some crime types that typically lead to arrests
# ---
# - If we disable sorting, we cannot see any patterns besides random-looking noise.