In this tutorial, we want to give you the opportunity to do some exploratory data analysis on your own. For this purpose, we will use a real-world dataset of crime in Chicago. The full dataset is accessible at https://catalog.data.gov/dataset/crimes-2001-to-present-398a4, but we also provide a subset of the data in compressed form at http://www.dbs.ifi.lmu.de/Lehre/KDD/WS1819/tutorials/ChicagoCrime2017.csv.xz.
You do not need to unpack the file, as pandas can also read from compressed CSV files.
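For example, a minimal sketch of reading the compressed file directly (the explicit compression argument is optional, since pandas infers it from the .xz extension anyway):
import pandas
# pandas detects the compression from the file extension (.xz, .gz, .bz2, .zip);
# passing compression='xz' explicitly just makes this behaviour visible.
df = pandas.read_csv('ChicagoCrime2017.csv.xz', compression='xz')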
The meaning of each column is described at https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2.
If you need any help, feel free to contact your tutor or the assistants.
In this tutorial, we will need the following packages. Please refer to the Python introduction when you need to know how to install Python packages.
%matplotlib inline
import numpy
import pandas
from matplotlib import pyplot as plt
import seaborn
from pandas.plotting import parallel_coordinates
# Set default figure size
from matplotlib import rcParams
rcParams['figure.figsize'] = (15, 15)
# For wide screens
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
Here, please insert the path of the downloaded file. A relative path will be relative to the notebook.
file_path = 'ChicagoCrime2017.csv.xz'
Use pandas.read_csv to read the CSV file into a pandas DataFrame.
df = pandas.read_csv(file_path)
df.info() displays type and shape information about the table.
df.info()
df.head() shows the first rows.
df.head()
Some rows contain missing values, which have been replaced by NaN by the parser.
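Before removing them, you can check how many rows are affected (a quick optional check):
# Count the rows that contain at least one missing value
df.isna().any(axis=1).sum()
We remove these rows using the following line.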
df.dropna(how='any', inplace=True)
From the output of df.info() above, you can see that the columns were converted to one of the following types: bool, float64, int64, object.
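If you want to tabulate this yourself, a one-liner sketch:
# Count how many columns were parsed as each dtype
df.dtypes.value_counts()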
In the following, we will pre-process the data for the visualizations.
Perform the following operations on the columns:
The columns 'ID' and 'Case Number' are identifiers (the ID is unique per row, whereas the Case Number can be used to find multiple crimes connected to one case). We set the ID as index and drop Case Number.
# TODO: SOLUTION
df.set_index('ID', drop=True, inplace=True)
del df['Case Number']
df.head()
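Since we rely on the ID being unique per row, a quick sanity check on the new index does not hurt (an optional sketch):
# The ID column is claimed to be unique per row;
# if so, the new index must not contain duplicates.
assert df.index.is_unique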
As loaded, the table has three time-related attributes: 'Year', 'Updated On', and 'Date'. As we are only looking at the year 2017, the 'Year' column can be dropped. The 'Updated On' column is also of no use for us. Taking a closer look at 'Date', we find that this column carries not only information about the day, but also the time. We will create four columns from it: month, day, weekday, hour.
You can use the following snippet to parse the date as given into a datetime.
datetime_column = pandas.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p')
Consult https://pandas.pydata.org/pandas-docs/stable/categorical.html#string-and-datetime-accessors for information on how to work with columns of type datetime.
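For example, the .dt accessor exposes the individual datetime components (a minimal sketch; note that recent pandas versions provide .dt.day_name() in place of the deprecated .dt.weekday_name):
# Access the components of the parsed dates via the .dt accessor
datetime_column.dt.month.head()       # month as integer (1-12)
datetime_column.dt.day_name().head()  # weekday name as string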
# TODO: SOLUTION
del df['Year']
del df['Updated On']
del df['Date']
df['month'] = datetime_column.dt.month
df['day'] = datetime_column.dt.day
df['weekday'] = datetime_column.dt.day_name().astype('category')  # day_name() supersedes the deprecated .dt.weekday_name
df['hour'] = datetime_column.dt.hour
df.head()
The given data provides location information on several levels. Beat, Community Area, District, and Ward have been parsed as float datatype; for visually more pleasing labels, you can first convert them back to integers before making them categorical. Apply the following operations on the location-related columns:
Column | Operation |
---|---|
'Longitude' | Keep |
'Latitude' | Keep |
'Location' | Drop |
'X Coordinate' | Drop |
'Y Coordinate' | Drop |
'Block' | Categorical |
'Beat' | Categorical |
'Community Area' | Categorical |
'District' | Categorical |
'Ward' | Categorical |
Hints:
- Use df.drop(columns=<column_name>, inplace=True) to remove a column from a table (working in-place).
- Use df[<column_name>] = df[<column_name>].astype('category') to convert a column to categorical data type.
# TODO: SOLUTION
del df['Location']
del df['X Coordinate']
del df['Y Coordinate']
df['Block'] = df['Block'].astype('category')
for categorical_column in ['Beat', 'Community Area', 'District', 'Ward']:
df[categorical_column] = df[categorical_column].astype(numpy.int32).astype('category')
df.head()
The crime is described using the following columns. Apply the following operations on the crime-related columns:
Column | Operation |
---|---|
'IUCR' | Drop |
'Primary Type' | Categorical |
'Description' | Categorical |
'FBI Code' | Drop |
'Arrest' | Categorical |
'Domestic' | Categorical |
'Location Description' | Categorical |
# TODO: SOLUTION
del df['IUCR']
del df['FBI Code']
for categorical_column in ['Primary Type', 'Description', 'Arrest', 'Domestic', 'Location Description']:
df[categorical_column] = df[categorical_column].astype('category')
df.head()
Now that we have finished preprocessing the data, you can show some information about the resulting data frame using df.head() and df.info(). In addition, you can use df.describe() to obtain some measures of central tendency and dispersion for the numeric columns.
Can you already see something insightful from these simple statistics?
# TODO: SOLUTION
df.head()
# TODO: SOLUTION
df.info()
# TODO: SOLUTION
df.describe()
# Something to see:
# hour:
#   25%-quantile = 9
#   75%-quantile = 18
#   => 50% of the crimes happen between 9:00 and 18:00 (during the day)
# month: judging from the quantiles alone, the distribution over months looks perfectly uniform
Use df.groupby(by=columns).size() to count how often certain combinations of values in the given columns occur, e.g. use columns=['Primary Type'] to get a list of the most frequent primary crime types.
df.groupby(by=['Primary Type', 'Description']).size().sort_values(ascending=False)
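For a single column, value_counts yields the same counts, already sorted by frequency:
# Equivalent to groupby(...).size().sort_values(ascending=False) for one column
df['Primary Type'].value_counts().head(10)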
To visualize the distribution of categorical attributes, we will first use equi-width histograms. Create such histograms for all categorical attributes.
What form do these distributions exhibit? Are there distinct peaks? Can you guess reasons for these? What insights do you gain? How useful are the plots for the region information attributes?
Hints:
- Use seaborn.catplot(data=df, x=<column_name>, kind='count').
- For attributes with many values, such as Ward, consider passing an explicit ordering via the order=<list-of-values> keyword of seaborn.catplot. What are useful options for ordering the values?
# TODO: SOLUTION
for cat_var in ['month', 'hour', 'weekday', 'Domestic', 'Arrest', 'District', 'Ward', 'Location Description']:
if cat_var == 'weekday':
# Sort by week
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
elif cat_var in ['District', 'Ward', 'Location Description']:
# Sort by frequency
order = df.groupby(by=cat_var).size().sort_values(ascending=False).index
else:
# Do not sort explicitly
order = None
seaborn.catplot(data=df, x=cat_var, kind='count', order=order)
plt.xticks(rotation=90)
# Discoveries:
# - month: bimodal distribution; peaks in January and July (=> maybe vacation time)
# - hour: most crimes during (late) daytime
# - weekday: slightly more crime on weekends
# - most crimes are non-domestic
# - most crimes do not lead to an arrest
# - there are some districts and wards with more crime than others
# - 'Ward' is already hardly readable
# - sorting increases readability
Next, we inspect the distribution of the numeric attributes using quantile plots.
Hints:
- Use df.quantile(q=<p-values>), where p-values is a list of the quantiles you want to compute. The return value is a dataframe whose index contains the $p$ values, and whose columns contain the corresponding quantiles for each numeric(!) column of the original dataframe.
- To generate num equi-distant values between $0$ and $1$, you can use numpy.linspace(start=0.0, stop=1.0, num=num).
- Use seaborn.scatterplot(data=quantiles, x=<p-values>, y=<column_name>) to plot the quantiles.
Questions:
# TODO: SOLUTION
ps = numpy.linspace(start=0.0, stop=1.0, num=100)
quantiles = df.quantile(q=ps)
columns = quantiles.columns
quantiles['p'] = ps
for column in columns:
plt.subplots()
    seaborn.scatterplot(data=quantiles, x='p', y=column)
# Discoveries:
# - Non-uniform distribution of location; BUT: spatial data should be visualized on a map
# - Steps for discrete attributes
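As a side note, the quantile plot is essentially the inverse of the empirical CDF, so for a single column you can also draw it directly from the sorted values (a sketch using the 'hour' column):
# The p-th quantile is (approximately) the sorted value at position p * n
hours = numpy.sort(df['hour'].values)
ps_full = numpy.linspace(start=0.0, stop=1.0, num=len(hours))
plt.subplots()
plt.step(ps_full, hours)
plt.xlabel('p')
plt.ylabel('hour')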
Another option to display the distribution of values of one attribute is the boxplot. It also allows for easy comparison between the distributions of different subsamples.
Tasks:
- Create a boxplot of 'hour' for different values of 'Primary Type'. (Is there a "typical" time for some kinds of crime?)
- Create a boxplot of 'hour' for different values of 'weekday'.
- Use the order keyword to pass some explicit order. What are meaningful orderings? Do they facilitate understanding the plot?
Hint: Use plt.xticks(rotation=90) to rotate the tick labels on the x-axis.
# TODO: SOLUTION
plt.subplots()
group = df.groupby(by='Primary Type').agg({'hour': 'median'}).sort_values(by='hour', ascending=False)
order = group.index
seaborn.boxplot(data=df, x='Primary Type', y='hour', order=order)
plt.xticks(rotation=90)
plt.tight_layout()
plt.subplots()
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
seaborn.boxplot(data=df, x='weekday', y='hour', order=order)
plt.xticks(rotation=90)
plt.tight_layout()
plt.subplots()
group = df.groupby(by='Location Description').agg({'hour': 'median'}).sort_values(by='hour', ascending=False)
order = group.index
seaborn.boxplot(data=df, x='Location Description', y='hour', order=order)
plt.xticks(rotation=90)
plt.tight_layout()
# Observations
# - Prostitution is concentrated in the evening
# - Public indecency is quite narrow around noon; BUT: Also small sample size
# - For most categories, the 25%-75% quantile range covers the complete value range
# - On weekends crime starts earlier
Compared to a boxplot, violin plots replace the box by a kernel density estimate of the data distribution. Thereby, they reveal more details about the distribution of the data. You can create violin plots using
seaborn.violinplot(data=df, x=<column_name_a>, y=<column_name_a>)
Create a violin plot for x='Primary Type', y='hour'; restrict 'Primary Type' to the 5 most common types.
Hint: You can count crimes per type using df.groupby(by='Primary Type').agg({'Primary Type': 'count'}).
# TODO: SOLUTION
for bw in [None, 1.0, 0.1, 0.01]:  # None falls back to the default bandwidth estimate
plt.subplots()
order = df.groupby(by='Primary Type').agg({'Primary Type': 'count'}).sort_values(by='Primary Type', ascending=False).index
order = order[:5]
seaborn.violinplot(data=df, x='Primary Type', y='hour', order=order, bw=bw)
plt.xticks(rotation=90)
plt.tight_layout()
# Observations:
# - Deceptive Practice is more spiky; three strong peaks
# - 5am minimum in all types
# - bandwidth:
# - too large: too much smoothing; no details
# - too small: problems with discrete hours