Knowledge Discovery and Data Mining I - Winter Semester 2018/19

  • Lecturer: Prof. Dr. Thomas Seidl
  • Assistants: Max Berrendorf, Julian Busch

Tutorial 2: Explorative Data Analysis

In this tutorial, we want to give you the opportunity to do some explorative data analysis on your own. For this purpose, we will use a real-world dataset of crimes in Chicago. The full dataset is accessible at https://catalog.data.gov/dataset/crimes-2001-to-present-398a4, but we also provide a subset of the data in compressed form at http://www.dbs.ifi.lmu.de/Lehre/KDD/WS1819/tutorials/ChicagoCrime2017.csv.xz. You do not need to unpack the file, as pandas can also read compressed CSV files.
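As a quick check that pandas indeed handles compressed CSV files transparently, here is a minimal sketch with made-up data and a hypothetical file name (tiny.csv.xz); pandas infers the compression from the file extension on both write and read:

```python
import pandas

# Write a tiny frame to an xz-compressed CSV and read it back;
# compression='infer' (the default) picks xz from the '.xz' suffix.
df = pandas.DataFrame({'ID': [1, 2], 'Primary Type': ['THEFT', 'BATTERY']})
df.to_csv('tiny.csv.xz', index=False)

restored = pandas.read_csv('tiny.csv.xz')
print(restored.equals(df))  # True
```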

The meaning of each column is described at https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2 .

If you need any help, feel free to contact your tutor or the assistants.

0. Imports

In this tutorial, we will need the following packages. Please refer to the Python introduction if you need to know how to install Python packages.

In [1]:
%matplotlib inline
import numpy
import pandas
from matplotlib import pyplot as plt
import seaborn
from pandas.plotting import parallel_coordinates
In [2]:
# Set default figure size
from matplotlib import rcParams
rcParams['figure.figsize'] = (15, 15)

# For wide screens
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

1. Load data

Here, please insert the path to the downloaded file. A relative path is interpreted relative to the notebook.

In [3]:
file_path = 'ChicagoCrime2017.csv.xz'

Use pandas.read_csv to read the CSV file into a pandas DataFrame.

In [4]:
df = pandas.read_csv(file_path)

df.info() displays type and shape information about the table.

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 267715 entries, 0 to 267714
Data columns (total 22 columns):
ID                      267715 non-null int64
Case Number             267715 non-null object
Date                    267715 non-null object
Block                   267715 non-null object
IUCR                    267715 non-null object
Primary Type            267715 non-null object
Description             267715 non-null object
Location Description    266731 non-null object
Arrest                  267715 non-null bool
Domestic                267715 non-null bool
Beat                    267715 non-null int64
District                267714 non-null float64
Ward                    267714 non-null float64
Community Area          267715 non-null float64
FBI Code                267715 non-null object
X Coordinate            265004 non-null float64
Y Coordinate            265004 non-null float64
Year                    267715 non-null int64
Updated On              267715 non-null object
Latitude                265004 non-null float64
Longitude               265004 non-null float64
Location                265004 non-null object
dtypes: bool(2), float64(7), int64(3), object(10)
memory usage: 41.4+ MB

df.head() shows the first rows.

In [6]:
df.head()
Out[6]:
ID Case Number Date Block IUCR Primary Type Description Location Description Arrest Domestic ... Ward Community Area FBI Code X Coordinate Y Coordinate Year Updated On Latitude Longitude Location
0 11094370 JA440032 09/21/2017 12:15:00 AM 072XX N CALIFORNIA AVE 1122 DECEPTIVE PRACTICE COUNTERFEIT CHECK CURRENCY EXCHANGE True False ... 50.0 2.0 10 1156443.0 1947707.0 2017 03/01/2018 03:52:35 PM 42.012293 -87.699714 (42.012293397, -87.699714109)
1 11118031 JA470589 10/12/2017 07:14:00 PM 055XX W GRAND AVE 1345 CRIMINAL DAMAGE TO CITY OF CHICAGO PROPERTY JAIL / LOCK-UP FACILITY True False ... 29.0 19.0 14 1138788.0 1913480.0 2017 03/01/2018 03:52:35 PM 41.918712 -87.765511 (41.918711651, -87.76551063)
2 11134189 JA491697 10/30/2017 11:52:00 AM 043XX S TALMAN AVE 4651 OTHER OFFENSE SEX OFFENDER: FAIL REG NEW ADD APARTMENT True False ... 12.0 58.0 26 1159425.0 1875711.0 2017 03/01/2018 03:52:35 PM 41.814670 -87.690727 (41.814669784, -87.690726696)
3 11156462 JA521389 09/29/2017 06:45:00 PM 055XX W BELMONT AVE 1110 DECEPTIVE PRACTICE BOGUS CHECK CURRENCY EXCHANGE True False ... 30.0 19.0 11 1138653.0 1920720.0 2017 03/01/2018 03:52:35 PM 41.938581 -87.765831 (41.938581442, -87.765830579)
4 11164874 JA531910 12/01/2017 06:20:00 AM 022XX W CHICAGO AVE 0265 CRIM SEXUAL ASSAULT AGGRAVATED: OTHER STREET True False ... 32.0 24.0 02 1161264.0 1905292.0 2017 03/01/2018 03:52:35 PM 41.895805 -87.683159 (41.895805099, -87.683159355)

5 rows × 22 columns

Some rows contained missing values, which have been replaced by NaN in the parser. We remove these rows using the following line.

In [7]:
df.dropna(how='any', inplace=True)
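If you want to see how many values are actually missing before (or instead of) dropping, df.isna() can be used; a small sketch on toy data (column names as in the dataset, values made up):

```python
import pandas

# Toy frame mimicking the situation: NaNs scattered over a few columns
toy = pandas.DataFrame({
    'District': [24.0, None, 9.0],
    'Latitude': [42.01, 41.92, None],
    'Arrest': [True, False, True],
})

# NaN count per column ...
missing_per_column = toy.isna().sum()
# ... and the number of rows that dropna(how='any') would remove
rows_with_any_nan = toy.isna().any(axis=1).sum()

print(missing_per_column)
print(rows_with_any_nan)  # 2
```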

2. Preprocessing

From the output of df.info() above, you can see that the columns were converted to one of the following types: bool, float64, int64, object. In the following, we will pre-process the data for the visualizations.

Perform the following operations on the columns:

ID columns

The following columns are identifiers: 'ID' and 'Case Number' (the ID is unique per row, whereas the Case Number can be used to find multiple crimes connected to one case). We set the ID as index, and drop Case Number.

In [8]:
# TODO: SOLUTION
df.set_index('ID', drop=True, inplace=True)
del df['Case Number']
df.head()
Out[8]:
Date Block IUCR Primary Type Description Location Description Arrest Domestic Beat District Ward Community Area FBI Code X Coordinate Y Coordinate Year Updated On Latitude Longitude Location
ID
11094370 09/21/2017 12:15:00 AM 072XX N CALIFORNIA AVE 1122 DECEPTIVE PRACTICE COUNTERFEIT CHECK CURRENCY EXCHANGE True False 2411 24.0 50.0 2.0 10 1156443.0 1947707.0 2017 03/01/2018 03:52:35 PM 42.012293 -87.699714 (42.012293397, -87.699714109)
11118031 10/12/2017 07:14:00 PM 055XX W GRAND AVE 1345 CRIMINAL DAMAGE TO CITY OF CHICAGO PROPERTY JAIL / LOCK-UP FACILITY True False 2515 25.0 29.0 19.0 14 1138788.0 1913480.0 2017 03/01/2018 03:52:35 PM 41.918712 -87.765511 (41.918711651, -87.76551063)
11134189 10/30/2017 11:52:00 AM 043XX S TALMAN AVE 4651 OTHER OFFENSE SEX OFFENDER: FAIL REG NEW ADD APARTMENT True False 922 9.0 12.0 58.0 26 1159425.0 1875711.0 2017 03/01/2018 03:52:35 PM 41.814670 -87.690727 (41.814669784, -87.690726696)
11156462 09/29/2017 06:45:00 PM 055XX W BELMONT AVE 1110 DECEPTIVE PRACTICE BOGUS CHECK CURRENCY EXCHANGE True False 2514 25.0 30.0 19.0 11 1138653.0 1920720.0 2017 03/01/2018 03:52:35 PM 41.938581 -87.765831 (41.938581442, -87.765830579)
11164874 12/01/2017 06:20:00 AM 022XX W CHICAGO AVE 0265 CRIM SEXUAL ASSAULT AGGRAVATED: OTHER STREET True False 1221 12.0 32.0 24.0 02 1161264.0 1905292.0 2017 03/01/2018 03:52:35 PM 41.895805 -87.683159 (41.895805099, -87.683159355)

Time

As loaded, the table has three time-related attributes: 'Year', 'Updated On', and 'Date'. As we are only looking at the year 2017, the 'Year' column can be dropped. The 'Updated On' column is also of no use to us. Taking a closer look at 'Date', we find that this column carries not only information about the day, but also the time. We will create four columns from it: month, day, weekday, hour. You can use the following snippet to parse the given date format into a datetime.

In [9]:
datetime_column = pandas.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p')

Consult https://pandas.pydata.org/pandas-docs/stable/categorical.html#string-and-datetime-accessors for information on how to work with columns of type datetime.

In [10]:
# TODO: SOLUTION
del df['Year']
del df['Updated On']
del df['Date']

df['month'] = datetime_column.dt.month
df['day'] = datetime_column.dt.day
df['weekday'] = datetime_column.dt.weekday_name.astype('category')
df['hour'] = datetime_column.dt.hour

df.head()
Out[10]:
Block IUCR Primary Type Description Location Description Arrest Domestic Beat District Ward ... FBI Code X Coordinate Y Coordinate Latitude Longitude Location month day weekday hour
ID
11094370 072XX N CALIFORNIA AVE 1122 DECEPTIVE PRACTICE COUNTERFEIT CHECK CURRENCY EXCHANGE True False 2411 24.0 50.0 ... 10 1156443.0 1947707.0 42.012293 -87.699714 (42.012293397, -87.699714109) 9 21 Thursday 0
11118031 055XX W GRAND AVE 1345 CRIMINAL DAMAGE TO CITY OF CHICAGO PROPERTY JAIL / LOCK-UP FACILITY True False 2515 25.0 29.0 ... 14 1138788.0 1913480.0 41.918712 -87.765511 (41.918711651, -87.76551063) 10 12 Thursday 19
11134189 043XX S TALMAN AVE 4651 OTHER OFFENSE SEX OFFENDER: FAIL REG NEW ADD APARTMENT True False 922 9.0 12.0 ... 26 1159425.0 1875711.0 41.814670 -87.690727 (41.814669784, -87.690726696) 10 30 Monday 11
11156462 055XX W BELMONT AVE 1110 DECEPTIVE PRACTICE BOGUS CHECK CURRENCY EXCHANGE True False 2514 25.0 30.0 ... 11 1138653.0 1920720.0 41.938581 -87.765831 (41.938581442, -87.765830579) 9 29 Friday 18
11164874 022XX W CHICAGO AVE 0265 CRIM SEXUAL ASSAULT AGGRAVATED: OTHER STREET True False 1221 12.0 32.0 ... 02 1161264.0 1905292.0 41.895805 -87.683159 (41.895805099, -87.683159355) 12 1 Friday 6

5 rows × 21 columns

Location

The given data provides location information on several levels:

  • Longitude, Latitude: Geo position
  • Location = tuple (Latitude, Longitude) (redundant)
  • X Coordinate, Y Coordinate: position in State Plane Illinois East NAD 1983 projection (redundant)
  • Block, Beat, Community Area, District, Ward: All regions in Chicago. They follow the hierarchy Block < Beat < Community Area < Ward < District.
  • Due to missing values in the original data, Beat, Community Area, District, Ward have been parsed as float datatype. For visually more pleasing labels, you can first convert them back to integers before making them categorical.

Apply the following operations on the location-related columns:

Column Operation
'Longitude' Keep
'Latitude' Keep
'Location' Drop
'X Coordinate' Drop
'Y Coordinate' Drop
'Block' Categorical
'Beat' Categorical
'Community Area' Categorical
'District' Categorical
'Ward' Categorical


In [11]:
# TODO: SOLUTION
del df['Location']
del df['X Coordinate']
del df['Y Coordinate']

df['Block'] = df['Block'].astype('category')
for categorical_column in ['Beat', 'Community Area', 'District', 'Ward']:
    df[categorical_column] = df[categorical_column].astype(numpy.int32).astype('category')
df.head()
Out[11]:
Block IUCR Primary Type Description Location Description Arrest Domestic Beat District Ward Community Area FBI Code Latitude Longitude month day weekday hour
ID
11094370 072XX N CALIFORNIA AVE 1122 DECEPTIVE PRACTICE COUNTERFEIT CHECK CURRENCY EXCHANGE True False 2411 24 50 2 10 42.012293 -87.699714 9 21 Thursday 0
11118031 055XX W GRAND AVE 1345 CRIMINAL DAMAGE TO CITY OF CHICAGO PROPERTY JAIL / LOCK-UP FACILITY True False 2515 25 29 19 14 41.918712 -87.765511 10 12 Thursday 19
11134189 043XX S TALMAN AVE 4651 OTHER OFFENSE SEX OFFENDER: FAIL REG NEW ADD APARTMENT True False 922 9 12 58 26 41.814670 -87.690727 10 30 Monday 11
11156462 055XX W BELMONT AVE 1110 DECEPTIVE PRACTICE BOGUS CHECK CURRENCY EXCHANGE True False 2514 25 30 19 11 41.938581 -87.765831 9 29 Friday 18
11164874 022XX W CHICAGO AVE 0265 CRIM SEXUAL ASSAULT AGGRAVATED: OTHER STREET True False 1221 12 32 24 02 41.895805 -87.683159 12 1 Friday 6

Crime Description

The crime is described using the following columns:

  • 'IUCR', 'Primary Type', 'Description': description of the type of crime. 'Description' is a categorisation at a finer hierarchy level than 'Primary Type'. 'IUCR' is equivalent to 'Description', but encodes the category as a number instead of a textual description.
  • 'FBI Code': A different categorisation of the crime (cf. http://gis.chicagopolice.org/clearmap_crime_sums/crime_types.html)
  • 'Arrest', 'Domestic': Boolean flags indicating whether an arrest was made, and if the incident was classified as domestic violence.
  • 'Location Description': A textual description of the location.

Apply the following operations on the crime-description columns:

Column Operation
'IUCR' Drop
'Primary Type' Categorical
'Description' Categorical
'FBI Code' Drop
'Arrest' Categorical
'Domestic' Categorical
'Location Description' Categorical
In [12]:
# TODO: SOLUTION
del df['IUCR']
del df['FBI Code']
    
for categorical_column in ['Primary Type', 'Description', 'Arrest', 'Domestic', 'Location Description']:
    df[categorical_column] = df[categorical_column].astype('category')
df.head()
Out[12]:
Block Primary Type Description Location Description Arrest Domestic Beat District Ward Community Area Latitude Longitude month day weekday hour
ID
11094370 072XX N CALIFORNIA AVE DECEPTIVE PRACTICE COUNTERFEIT CHECK CURRENCY EXCHANGE True False 2411 24 50 2 42.012293 -87.699714 9 21 Thursday 0
11118031 055XX W GRAND AVE CRIMINAL DAMAGE TO CITY OF CHICAGO PROPERTY JAIL / LOCK-UP FACILITY True False 2515 25 29 19 41.918712 -87.765511 10 12 Thursday 19
11134189 043XX S TALMAN AVE OTHER OFFENSE SEX OFFENDER: FAIL REG NEW ADD APARTMENT True False 922 9 12 58 41.814670 -87.690727 10 30 Monday 11
11156462 055XX W BELMONT AVE DECEPTIVE PRACTICE BOGUS CHECK CURRENCY EXCHANGE True False 2514 25 30 19 41.938581 -87.765831 9 29 Friday 18
11164874 022XX W CHICAGO AVE CRIM SEXUAL ASSAULT AGGRAVATED: OTHER STREET True False 1221 12 32 24 41.895805 -87.683159 12 1 Friday 6

Now that we have finished preprocessing the data, you can display some information about the resulting data frame using df.head() and df.info(). In addition, you can use df.describe() to obtain some measures of central tendency and dispersion for the numeric columns.

Can you already see something insightful from these simple statistics?

In [13]:
# TODO: SOLUTION
df.head()
Out[13]:
Block Primary Type Description Location Description Arrest Domestic Beat District Ward Community Area Latitude Longitude month day weekday hour
ID
11094370 072XX N CALIFORNIA AVE DECEPTIVE PRACTICE COUNTERFEIT CHECK CURRENCY EXCHANGE True False 2411 24 50 2 42.012293 -87.699714 9 21 Thursday 0
11118031 055XX W GRAND AVE CRIMINAL DAMAGE TO CITY OF CHICAGO PROPERTY JAIL / LOCK-UP FACILITY True False 2515 25 29 19 41.918712 -87.765511 10 12 Thursday 19
11134189 043XX S TALMAN AVE OTHER OFFENSE SEX OFFENDER: FAIL REG NEW ADD APARTMENT True False 922 9 12 58 41.814670 -87.690727 10 30 Monday 11
11156462 055XX W BELMONT AVE DECEPTIVE PRACTICE BOGUS CHECK CURRENCY EXCHANGE True False 2514 25 30 19 41.938581 -87.765831 9 29 Friday 18
11164874 022XX W CHICAGO AVE CRIM SEXUAL ASSAULT AGGRAVATED: OTHER STREET True False 1221 12 32 24 41.895805 -87.683159 12 1 Friday 6
In [14]:
# TODO: SOLUTION
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 264199 entries, 11094370 to 11069602
Data columns (total 16 columns):
Block                   264199 non-null category
Primary Type            264199 non-null category
Description             264199 non-null category
Location Description    264199 non-null category
Arrest                  264199 non-null category
Domestic                264199 non-null category
Beat                    264199 non-null category
District                264199 non-null category
Ward                    264199 non-null category
Community Area          264199 non-null category
Latitude                264199 non-null float64
Longitude               264199 non-null float64
month                   264199 non-null int64
day                     264199 non-null int64
weekday                 264199 non-null category
hour                    264199 non-null int64
dtypes: category(11), float64(2), int64(3)
memory usage: 17.4 MB
In [15]:
# TODO: SOLUTION
df.describe()

# Something to see:
# hour: 
#  25%-percentile = 9
#  75%-percentile = 18
#  => 50% of the crimes happen between 9 and 18 (at day)
# month: from the quantiles alone: nearly uniform over the months
Out[15]:
Latitude Longitude month day hour
count 264199.000000 264199.000000 264199.000000 264199.000000 264199.000000
mean 41.844366 -87.670226 6.550275 15.663398 13.202128
std 0.085974 0.058687 3.368554 8.810503 6.502804
min 41.644606 -87.928909 1.000000 1.000000 0.000000
25% 41.769477 -87.712332 4.000000 8.000000 9.000000
50% 41.865143 -87.664035 7.000000 16.000000 14.000000
75% 41.906221 -87.627760 9.000000 23.000000 19.000000
max 42.022671 -87.524529 12.000000 31.000000 23.000000

Bonus Task

Use df.groupby(by=columns).size() to count how often certain combinations of values in the given columns occur, e.g. use columns=['Primary Type'] to get a list of the most frequent primary crime types.

In [16]:
df.groupby(by=['Primary Type', 'Description']).size().sort_values(ascending=False)
Out[16]:
Primary Type                      Description                                
THEFT                             $500 AND UNDER                                 24259
BATTERY                           DOMESTIC BATTERY SIMPLE                        23792
                                  SIMPLE                                         16141
THEFT                             OVER $500                                      15124
CRIMINAL DAMAGE                   TO PROPERTY                                    13802
                                  TO VEHICLE                                     13511
ASSAULT                           SIMPLE                                         12695
THEFT                             FROM BUILDING                                  10441
                                  RETAIL THEFT                                   10438
MOTOR VEHICLE THEFT               AUTOMOBILE                                      9829
BURGLARY                          FORCIBLE ENTRY                                  7479
                                  UNLAWFUL ENTRY                                  4567
ROBBERY                           ARMED: HANDGUN                                  4546
OTHER OFFENSE                     TELEPHONE THREAT                                3891
CRIMINAL TRESPASS                 TO LAND                                         3846
DECEPTIVE PRACTICE                CREDIT CARD FRAUD                               3827
                                  FINANCIAL IDENTITY THEFT OVER $ 300             3671
WEAPONS VIOLATION                 UNLAWFUL POSS OF HANDGUN                        3594
ROBBERY                           STRONGARM - NO WEAPON                           3563
ASSAULT                           AGGRAVATED: HANDGUN                             2890
OTHER OFFENSE                     HARASSMENT BY TELEPHONE                         2739
                                  HARASSMENT BY ELECTRONIC MEANS                  2644
THEFT                             POCKET-PICKING                                  2223
NARCOTICS                         POSS: HEROIN(WHITE)                             2221
BATTERY                           AGGRAVATED: HANDGUN                             2039
                                  AGGRAVATED: OTHER DANG WEAPON                   1871
DECEPTIVE PRACTICE                FRAUD OR CONFIDENCE GAME                        1759
OTHER OFFENSE                     VIOLATE ORDER OF PROTECTION                     1554
ASSAULT                           AGGRAVATED:KNIFE/CUTTING INSTR                  1464
OTHER OFFENSE                     OTHER VEHICLE OFFENSE                           1445
                                                                                 ...  
MOTOR VEHICLE THEFT               ATT: TRUCK, BUS, MOTOR HOME                        2
OTHER OFFENSE                     MONEY LAUNDERING                                   2
PUBLIC PEACE VIOLATION            SELL/ADVERTISE FIREWORKS                           2
                                  INTERFERE W/ EMERGENCY EQUIP                       2
WEAPONS VIOLATION                 DEFACE IDENT MARKS OF FIREARM                      1
PROSTITUTION                      PANDERING                                          1
ASSAULT                           AGG PRO.EMP: OTHER FIREARM                         1
NON-CRIMINAL                      CONCEALED CARRY LICENSE REVOCATION                 1
MOTOR VEHICLE THEFT               THEFT/RECOVERY: CYCLE, SCOOTER, BIKE NO VIN        1
WEAPONS VIOLATION                 UNLAWFUL SALE/DELIVERY OF FIREARM AT SCHOOL        1
NARCOTICS                         MANU/DELIVER:HEROIN(BLACK TAR)                     1
NON-CRIMINAL                      FOUND PASSPORT                                     1
INTERFERENCE WITH PUBLIC OFFICER  OFFICIAL MISCONDUCT                                1
NON-CRIMINAL                      GUN OFFENDER NOTIFICATION-NO CONTACT               1
BATTERY                           AGGRAVATED PO: OTHER FIREARM                       1
PROSTITUTION                      PATRONIZING A PROSTITUTE                           1
                                  CAB OPERATION                                      1
OTHER OFFENSE                     INTERFERE W/ HIGHER EDUCATION                      1
                                  ARSONIST: DUTY TO REGISTER                         1
DECEPTIVE PRACTICE                INSURANCE FRAUD                                    1
OTHER OFFENSE                     ANIMAL FIGHTING                                    1
DECEPTIVE PRACTICE                UNIDENTIFIABLE RECORDING SOUND                     1
NON-CRIMINAL                      NOTIFICATION OF STALKING - NO CONTACT ORDER        1
GAMBLING                          GAME/AMUSEMENT DEVICE                              1
OFFENSE INVOLVING CHILDREN        SALE TOBACCO PRODUCTS TO MINOR                     1
CRIM SEXUAL ASSAULT               ATTEMPT AGG: KNIFE/CUT INSTR                       1
INTIMIDATION                      EDUCATIONAL INTIMIDAITON                           1
                                  INTIMIDATION OF LAW ENFORCEMENT OFFICIAL           1
LIQUOR LAW VIOLATION              MINOR MISREPRESENT AGE                             1
WEAPONS VIOLATION                 USE OF METAL PIERCING BULLETS                      1
Length: 329, dtype: int64

Basic Visualization

Histograms

To visualize the distribution of categorical attributes, we will first use equi-width histograms. Create one for each categorical attribute.

What form do these distributions exhibit? Are there distinct peaks? Can you guess reasons for them? What insights do you gain? How useful are the plots for the region attributes?

Hint:

  • You can create a histogram using

seaborn.catplot(data=df, x=<column_name>, kind='count')

  • Do not create histograms for region attributes smaller than Ward
  • You can also pass an explicit order using the order=<list-of-values> keyword of seaborn.catplot. What are useful options for ordering the values?
In [17]:
# TODO: SOLUTION
for cat_var in ['month', 'hour', 'weekday', 'Domestic', 'Arrest', 'District', 'Ward', 'Location Description']:
    if cat_var == 'weekday':
        # Sort by week
        order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    elif cat_var in ['District', 'Ward', 'Location Description']:
        # Sort by frequency
        order = df.groupby(by=cat_var).size().sort_values(ascending=False).index
    else:
        # Do not sort explicitly
        order = None
    seaborn.catplot(data=df, x=cat_var, kind='count', order=order)
    plt.xticks(rotation=90)
    
# Discoveries:
# - month: bimodal distribution; peaks at January and July (=> maybe vacation time)
# - hour: most crimes during (late) daytime
# - weekday: slightly more crime on weekends
# - most crimes are non-domestic
# - most crimes do not lead to an arrest
# - Some districts and wards have more crime than others
# - 'Ward' is already hard to read.
# - Sorting increases readability.

Quantile Plots

  • To create quantile plots, you can use df.quantile(q=<p-values>), where p-values is a list of quantiles you want to compute. The return value is a dataframe whose index contains the $p$ values and whose columns contain the corresponding quantiles for each numeric(!) column of the original dataframe.
  • To get num equi-distant values between $0$ and $1$, you can use numpy.linspace(start=0.0, stop=1.0, num=num).
  • To create the plot, use seaborn.scatterplot(data=quantiles, x=<p-values>, y=<column_name>)

Questions:

  • What do you observe here?
  • What benefit does a quantile plot offer compared to histograms?
  • How does the quantile plot behave for discrete attributes?
In [18]:
# TODO: SOLUTION
ps = numpy.linspace(start=0.0, stop=1.0, num=100)
quantiles = df.quantile(q=ps)
columns = quantiles.columns
quantiles['p'] = ps
for column in columns:
    plt.subplots()
    seaborn.scatterplot(data=quantiles, x=ps, y=column)

# Discoveries:
# - Non-uniform distribution of location; BUT: spatial data should be visualised on a map
# - Steps for discrete attributes

Boxplot

Another option to display the distribution of the values of one attribute is the boxplot. It also allows for easy comparison between the distributions of different subsamples.

Tasks:

  • Compare the distribution of 'hour' for different values of 'Primary Type'. (Is there a "typical" time for some kinds of crime?)
  • Compare the distribution of 'hour' for different values of 'weekday'.
  • Again, you can use the order keyword to pass some explicit order. What are meaningful orderings? Do they facilitate understanding the plot?
  • Think about/Try out more comparisons. Is there something you can discover?
  • How meaningful are boxplots for small samples?

Hint:

  • You can use plt.xticks(rotation=90) to rotate the tick labels on the x-axis.
In [19]:
# TODO: SOLUTION
plt.subplots()
group = df.groupby(by='Primary Type').agg({'hour': 'median'}).sort_values(by='hour', ascending=False)
order = group.index
seaborn.boxplot(data=df, x='Primary Type', y='hour', order=order)
plt.xticks(rotation=90)
plt.tight_layout()

plt.subplots()
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
seaborn.boxplot(data=df, x='weekday', y='hour', order=order)
plt.xticks(rotation=90)
plt.tight_layout()

plt.subplots()
group = df.groupby(by='Location Description').agg({'hour': 'median'}).sort_values(by='hour', ascending=False)
order = group.index
seaborn.boxplot(data=df, x='Location Description', y='hour', order=order)
plt.xticks(rotation=90)
plt.tight_layout()

# Observations
# - Prostitution is concentrated in the evening
# - Public indecency is quite narrow around noon; BUT: Also small sample size
# - Most categories cover the complete data range with their 25%-75% quantile.
# - On weekends crime starts earlier

Violin Plot

Compared to a boxplot, violin plots replace the box by a kernel density estimate of the data distribution. Thereby, they reveal more details about the distribution of the data. You can create violin plots using

seaborn.violinplot(data=df, x=<column_name_a>, y=<column_name_b>)

Create a violin plot for

  • x='Primary Type', y='hour'; restrict 'Primary Type' to the 5 most common types.
  • Explicitly set the bandwidth of the kernel to the values 1.0, 0.1 and 0.01. What do you observe?

Hint:

  • You can count the absolute frequency of a crime type using

df.groupby(by='Primary Type').agg({'Primary Type': 'count'})
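Equivalently, Series.value_counts() yields the same counts, already sorted descending; a small sketch on toy data (real 'Primary Type' labels, made-up frequencies):

```python
import pandas

toy = pandas.DataFrame({
    'Primary Type': ['THEFT', 'BATTERY', 'THEFT', 'ASSAULT', 'THEFT'],
})

# groupby + size counts occurrences per type ...
counts_groupby = toy.groupby(by='Primary Type').size().sort_values(ascending=False)

# ... while value_counts gives the same numbers without a groupby
counts_vc = toy['Primary Type'].value_counts()

print(counts_vc.index[0])  # 'THEFT', the most frequent type
```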

In [20]:
# TODO: SOLUTION
for bw in [None, 1.0, 0.1, 0.01]:
    plt.subplots()
    order = df.groupby(by='Primary Type').agg({'Primary Type': 'count'}).sort_values(by='Primary Type', ascending=False).index
    order = order[:5]
    seaborn.violinplot(data=df, x='Primary Type', y='hour', order=order, bw=bw)
    plt.xticks(rotation=90)
    plt.tight_layout()

# Observation:
# - Deceptive Practice more spiky; three strong peaks
# - 5am minimum in all types
# - bandwidth:
#   - too large: too much smoothing; no details
#   - too small: problems with discrete hours

Advanced Visualization

Scatterplot Matrix

  • To create a scatterplot matrix, you can use
    seaborn.pairplot(data=data, kind='scatter', diag_kind='scatter')
  • As this visualization is quite compute-intensive, you can call it on a subsample of the data using df.sample(frac=<fraction>), e.g. frac=0.01.
  • pairplot takes an additional keyword argument hue to which you can pass the name of a (categorical) column. The points are then coloured by this attribute.
  • You can use plot_kws={<key>: <value>} to pass arguments to the plot function, e.g. plot_kws={'alpha': .2} to plot the points with transparency.
  • You can use vars=<list-of-column-names> to give an explicit order, or use only a subset of the columns. Experiment with different orderings.
In [21]:
seaborn.pairplot(data=df.sample(frac=.001), kind='scatter', diag_kind='scatter', plot_kws={'alpha': .2} )
Out[21]:
<seaborn.axisgrid.PairGrid at 0x7fd295a34400>

When scattering many points, the scatter matrix may become cluttered, and it is hard to detect areas where many points lie, as they overlap. To cope with this situation, one can replace the scatter plot in each cell by a 2-D histogram.

For this we will use seaborn.PairGrid. A pair grid defines the "matrix" known from the scatterplot matrix and provides interfaces to map different visualizations to the individual cells. For easier usage, we provide you with the following method, which plots an equi-width 2-D histogram into a cell. Moreover, it caps the number of bins in $x$ and $y$ direction at the number of distinct values, in order to allow displaying categorical variables. Be aware that in that case the order does not necessarily convey any meaning!

In [22]:
def hist2d(x, y, color, label, max_bins=50, **kwargs):
    # Use at most max_bins bins per axis, but never more than the
    # number of distinct values (this allows categorical variables)
    x_bins = min(numpy.unique(x).size, max_bins)
    y_bins = min(numpy.unique(y).size, max_bins)
    plt.hist2d(x=x, y=y, bins=(x_bins, y_bins))

Next, create the PairGrid g and map the method hist2d to the cells using g.map(hist2d). As before, you can pass the keyword argument vars to the PairGrid constructor to select only a subset of columns. You may also want to start with only a fraction of the data (i.e. using df.sample(frac=<fraction>) as the value for the data parameter). However, you should be able to run the visualization even on the full dataset.

  • Look at the cell (longitude, latitude). What can you observe? Also compare the plot against the shape of Chicago, e.g. from OpenStreetMap or Wikipedia.
  • Look at the cells of the region hierarchy. What can you discover here?
  • What else can you see? Compare it against what you already saw in e.g. the histograms.

Hint:

  • You can use plt.subplots_adjust(wspace=0.02, hspace=0.02) to compact the grid.
In [23]:
# TODO: SOLUTION
frac = 0.01
selection = df.sample(frac=frac)
g = seaborn.PairGrid(data=selection)
g.map(hist2d)
plt.subplots_adjust(wspace=0.02, hspace=0.02)

# Observations:
# - location reflects shape of Chicago.
# - there is a clear crime hotspot at the harbour
# - one can re-discover previous observations, e.g. the 5 am crime minimum

Pixel-Oriented Visualization

We will also take a look at a (simple) pixel-oriented technique. Here, we have one subplot for each attribute, and each data point maps to a single pixel per window, where the colour indicates the value. First, we create an (n_features x n_features) matrix of subplots by

figure, axes = plt.subplots(nrows=n_features, ncols=n_features)

In row $r$, we display the whole dataset sorted by the $r$th feature. In order to avoid artifacts caused by internal presorting of the dataset, shuffle the rows first. You can do both using

sorted_df = df.sample(frac=1.0).sort_values(by=df.columns[r])

To visualize the $c$th feature in column $c$ (of the subplot matrix), we have to bring the values into a rectangular array. We use width=512, and infer the height by height = int(numpy.ceil(n_samples / width)). You can use the following method to bring a column of a data frame into such a rectangular array.

In [24]:
def prepare_cell(df_column, width=512):
    n_samples = len(df_column)
    height = int(numpy.ceil(n_samples / width))
    # Allocate a (height, width) buffer; cells without data stay NaN
    buffer = numpy.empty(shape=(height, width))
    buffer[:] = float('nan')
    # Categorical columns are represented by their integer codes
    if hasattr(df_column, 'cat'):
        df_column = df_column.cat.codes
    buffer.flat[:n_samples] = df_column.values
    return buffer
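For a quick sanity check of what prepare_cell produces, here is a small sketch (the function is repeated so the sketch is self-contained; a tiny width keeps the result readable):

```python
import numpy
import pandas

def prepare_cell(df_column, width=512):
    # Pad the column values into a (height, width) array; trailing cells stay NaN
    n_samples = len(df_column)
    height = int(numpy.ceil(n_samples / width))
    buffer = numpy.empty(shape=(height, width))
    buffer[:] = float('nan')
    if hasattr(df_column, 'cat'):
        df_column = df_column.cat.codes
    buffer.flat[:n_samples] = df_column.values
    return buffer

# 5 values at width=4 give a 2 x 4 array with 3 trailing NaNs
cell = prepare_cell(pandas.Series([10, 20, 30, 40, 50]), width=4)
print(cell.shape)  # (2, 4)
```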

Finally, you can use axes[r, c].imshow(buffer, aspect='auto') to show the data in buffer in row $r$, column $c$ of the matrix.

Questions:

  • Is there something you can observe? How do you recognise whether there is a pattern in the data?
  • Disable the sorting again. What is the effect on the readability of the visualization?

Hints:

  • Use axes[r, c].set_xticks([]), axes[r, c].set_yticks([]) to disable tick labels on the axes.
  • Use axes[r, 0].set_ylabel(df.columns[r]) and axes[-1, c].set_xlabel(df.columns[c]) to add x and y labels to your matrix.
  • Use plt.subplots_adjust(hspace=.02, wspace=.02) to adjust the space between the cells.
In [25]:
# TODO: SOLUTION
n_features = len(df.columns)
f, axes = plt.subplots(nrows=n_features, ncols=n_features)
for r, primary_column in enumerate(df.columns):
    sorted_df = df.sample(frac=1.0).sort_values(by=primary_column)
    for i, column in enumerate(df.columns):
        buffer = prepare_cell(df_column=sorted_df[column])
        axes[r, i].imshow(buffer, aspect='auto')
        axes[r, i].set_xticks([])
        axes[r, i].set_yticks([])
    axes[r, 0].set_ylabel(primary_column)
    axes[-1, r].set_xlabel(primary_column)
plt.subplots_adjust(hspace=.02, wspace=.02)

# Observations:
# - We can see the hierarchy in Primary Type and Description
# - Block, Beat, District, Ward, Community Area are all different resolutions of the region hierarchy. We can clearly see this relation in the pattern.
# - The regions are located in certain latitude/longitude areas
# - There are some crime types (primary type and description) for which crimes are more often classified as Domestic
# - There are some patterns between Description (i.e. the high-resolution crime type) and Location Description, indicating that some crime types occur at typical locations
# - There are some patterns between Arrest and Primary Type, i.e. there are some crime types that typically lead to arrests
# ---
# - If we disable sorting, we cannot see any patterns besides random-looking noise.