In this tutorial, we want to give you the opportunity to do some exploratory data analysis on your own. For this purpose, we will use a real-world dataset of crime in Chicago. The full dataset is accessible at https://catalog.data.gov/dataset/crimes-2001-to-present-398a4, but we also provide a subset of the data in compressed form at http://www.dbs.ifi.lmu.de/Lehre/KDD/WS1819/tutorials/ChicagoCrime2017.csv.xz.
You do not need to unpack the file, as pandas
can also read from compressed CSV files.
The meaning of each column is described at https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2 .
If you need any help, feel free to contact your tutor or the assistants.
In this tutorial, we will need the following packages. Please refer to the Python introduction when you need to know how to install Python packages.
%matplotlib inline
import numpy
import pandas
from matplotlib import pyplot as plt
import seaborn
from pandas.plotting import parallel_coordinates
# Set default figure size
from matplotlib import rcParams
rcParams['figure.figsize'] = (15, 15)
# For wide screens
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
Here, please insert the path of the downloaded file. A relative path will be relative to the notebook.
file_path = 'ChicagoCrime2017.csv.xz'
Use pandas.read_csv to read the CSV file into a pandas DataFrame.
df = pandas.read_csv(file_path)
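pandas infers the compression from the '.xz' file extension. If you ever work with a file whose extension does not reveal the compression, you can also state it explicitly; a minimal sketch of the equivalent call:
# Equivalent read with the compression stated explicitly.
df = pandas.read_csv(file_path, compression='xz')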
df.info() displays type and shape information about the table.
df.info()
df.head() shows the first few rows of the table.
df.head()
For some rows there were missing values; they have been replaced by NaN in the parser.
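If you want to know how many rows are affected before dropping them, a quick check (a small sketch using standard pandas calls):
# Count the rows that contain at least one missing value.
df.isna().any(axis=1).sum()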
We remove these rows using the following line.
df.dropna(how='any', inplace=True)
From the output of df.info() above, you can see that the columns were converted to one of the following types: bool, float64, int64, object.
In the following, we will pre-process the data for the visualizations.
Perform the following operations on the columns:
The following columns are identifiers, which we keep as they are (without type conversion): 'ID' and 'Case Number'. (The ID is unique per row, whereas the Case Number can be used to find multiple crimes connected to one case.) We set ID as the index and drop Case Number.
# TODO: SOLUTION
df.set_index('ID', drop=True, inplace=True)
del df['Case Number']
df.head()
As loaded, the table has three time-related attributes: 'Year', 'Updated On', and 'Date'. As we are only looking at the year 2017, the 'Year' column can be dropped. The 'Updated On' column is also of no use for us.
Taking a closer look at 'Date', we find that this column carries not only information about the day, but also the time. We will create four columns from it: month, day, weekday, hour.
You can use the following snippet to parse the date as given into a datetime column.
datetime_column = pandas.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p')
Consult https://pandas.pydata.org/pandas-docs/stable/categorical.html#string-and-datetime-accessors for information on how to work with columns of type datetime.
# TODO: SOLUTION
del df['Year']
del df['Updated On']
del df['Date']
df['month'] = datetime_column.dt.month
df['day'] = datetime_column.dt.day
df['weekday'] = datetime_column.dt.day_name().astype('category')  # .dt.weekday_name in older pandas versions
df['hour'] = datetime_column.dt.hour
df.head()
The given data provides location information on several levels. Beat, Community Area, District, and Ward have been parsed as float datatype; for visually more pleasing labels, you can first convert them back to integers before making them categorical.
Apply the following operations on the location-related columns:
Column | Operation |
---|---|
'Longitude' | Keep |
'Latitude' | Keep |
'Location' | Drop |
'X Coordinate' | Drop |
'Y Coordinate' | Drop |
'Block' | Categorical |
'Beat' | Categorical |
'Community Area' | Categorical |
'District' | Categorical |
'Ward' | Categorical |
Hints:
- Use df.drop(columns=<column_name>, inplace=True) to remove a column from a table (working in-place).
- Use df[<column_name>] = df[<column_name>].astype('category') to convert a column to categorical data type.
# TODO: SOLUTION
del df['Location']
del df['X Coordinate']
del df['Y Coordinate']
df['Block'] = df['Block'].astype('category')
for categorical_column in ['Beat', 'Community Area', 'District', 'Ward']:
    df[categorical_column] = df[categorical_column].astype(numpy.int32).astype('category')
df.head()
The crime itself is described using the following columns. Apply the following operations on the crime-related columns:
Column | Operation |
---|---|
'IUCR' | Drop |
'Primary Type' | Categorical |
'Description' | Categorical |
'FBI Code' | Drop |
'Arrest' | Categorical |
'Domestic' | Categorical |
'Location Description' | Categorical |
# TODO: SOLUTION
del df['IUCR']
del df['FBI Code']
for categorical_column in ['Primary Type', 'Description', 'Arrest', 'Domestic', 'Location Description']:
    df[categorical_column] = df[categorical_column].astype('category')
df.head()
Now that we have finished preprocessing the data, you can show some information about the resulting data frame using df.head() and df.info().
In addition, you can use df.describe() to obtain some measures of central tendency and dispersion for the numeric columns.
Can you already see something insightful from these simple statistics?
# TODO: SOLUTION
df.head()
# TODO: SOLUTION
df.info()
# TODO: SOLUTION
df.describe()
# Something to see:
# hour:
# 25%-percentile = 9
# 75%-percentile = 18
# => 50% of the crimes happen between 9:00 and 18:00, i.e. during the day
# month: judging from the quantiles alone, crimes look uniformly distributed over the months
Use df.groupby(by=columns).size() to count how often certain combinations of values in the given columns occur, e.g. use columns=['Primary Type'] to get a list of the most frequent primary crime types.
df.groupby(by=['Primary Type', 'Description']).size().sort_values(ascending=False)
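For example, grouping by the primary type alone (a minimal variant of the line above) lists the most frequent primary crime types:
# Count crimes per primary type, most frequent first.
df.groupby(by=['Primary Type']).size().sort_values(ascending=False)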
To visualize the distribution of categorical attributes, we will first use equi-width histograms. Create such histograms for all categorical attributes.
What form do these distributions exhibit? Are there distinct peaks? Can you guess reasons for these? What insights do you gain? How useful are the plots for the region-information attributes?
Hints:
- Use seaborn.catplot(data=df, x=<column_name>, kind='count') to create a count plot.
- For attributes with many distinct values, such as Ward, you can control the order of the bars via the order=<list-of-values> keyword of seaborn.catplot. What are useful options for ordering the values?
# TODO: SOLUTION
for cat_var in ['month', 'hour', 'weekday', 'Domestic', 'Arrest', 'District', 'Ward', 'Location Description']:
    if cat_var == 'weekday':
        # Sort by week
        order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    elif cat_var in ['District', 'Ward', 'Location Description']:
        # Sort by frequency
        order = df.groupby(by=cat_var).size().sort_values(ascending=False).index
    else:
        # Do not sort explicitly
        order = None
    seaborn.catplot(data=df, x=cat_var, kind='count', order=order)
    plt.xticks(rotation=90)
# Discoveries:
# - month: bimodal distribution; peaks at January and July (=> maybe vacation time)
# - hour: most crimes during (late) daytime
# - weekday: slightly more crime on weekends
# - most crimes are non-domestic
# - most crimes do not lead to an arrest
# - There are some districts and wards with more crime than others.
# - The plot for 'Ward' is already hard to read.
# - Sorting increases readability.
Next, we look at the quantiles of the numeric attributes.
Hints:
- Use df.quantile(q=<p-values>), where <p-values> is a list of the quantiles you want to compute. The return value is a dataframe whose index contains the $p$ values, and whose columns contain the corresponding quantiles for each numeric(!) column of the original dataframe.
- To generate num equi-distant values between $0$ and $1$, you can use numpy.linspace(start=0.0, stop=1.0, num=num).
- Plot the quantiles using seaborn.scatterplot(data=quantiles, x=<p-values>, y=<column_name>).
Questions:
# TODO: SOLUTION
ps = numpy.linspace(start=0.0, stop=1.0, num=100)
quantiles = df.quantile(q=ps)
columns = quantiles.columns
quantiles['p'] = ps
for column in columns:
    plt.subplots()
    seaborn.scatterplot(data=quantiles, x=ps, y=column)
# Discoveries:
# - Non-uniform distribution of locations; BUT: spatial data should be visualised on a map
# - Steps for discrete attributes
Another option to display the distribution of data values of one attribute is the boxplot. It also allows for easy comparison between the distributions in different subsamples.
Tasks:
- Create a boxplot of 'hour' for different values of 'Primary Type'. (Is there a "typical" time for some kinds of crime?)
- Create a boxplot of 'hour' for different values of 'weekday'.
- Use the order keyword to pass some explicit order. What are meaningful orderings? Do they facilitate understanding the plot?
Hint: Use plt.xticks(rotation=90) to rotate the tick labels on the x-axis.
# TODO: SOLUTION
plt.subplots()
group = df.groupby(by='Primary Type').agg({'hour': 'median'}).sort_values(by='hour', ascending=False)
order = group.index
seaborn.boxplot(data=df, x='Primary Type', y='hour', order=order)
plt.xticks(rotation=90)
plt.tight_layout()
plt.subplots()
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
seaborn.boxplot(data=df, x='weekday', y='hour', order=order)
plt.xticks(rotation=90)
plt.tight_layout()
plt.subplots()
group = df.groupby(by='Location Description').agg({'hour': 'median'}).sort_values(by='hour', ascending=False)
order = group.index
seaborn.boxplot(data=df, x='Location Description', y='hour', order=order)
plt.xticks(rotation=90)
plt.tight_layout()
# Observations
# - Prostitution is concentrated in the evening
# - Public indecency is quite narrow around noon; BUT: Also small sample size
# - Most categories cover the complete data range with their 25%-75% quantile.
# - On weekends crime starts earlier
Compared to a boxplot, violin plots replace the box by a kernel density estimate of the data distribution. Thereby, they reveal more details about the distribution of the data. You can create violin plots using
seaborn.violinplot(data=df, x=<column_name_a>, y=<column_name_a>)
Create a violin plot for x='Primary Type', y='hour'; restrict 'Primary Type' to the 5 most common types.
Hint: You can obtain the counts per type using df.groupby(by='Primary Type').agg({'Primary Type': 'count'}).
# TODO: SOLUTION
for bw in [None, 1.0, 0.1, 0.01]:
    plt.subplots()
    order = df.groupby(by='Primary Type').agg({'Primary Type': 'count'}).sort_values(by='Primary Type', ascending=False).index
    order = order[:5]
    seaborn.violinplot(data=df, x='Primary Type', y='hour', order=order, bw=bw)
    plt.xticks(rotation=90)
    plt.tight_layout()
# Observation:
# - Deceptive Practice more spiky; three strong peaks
# - 5am minimum in all types
# - bandwidth:
# - too large: too much smoothing; no details
# - too small: problems with discrete hours
You can create a scatter plot matrix of the data using
seaborn.pairplot(data=data, kind='scatter', diag_kind='scatter')
Hints:
- Use only a sample of the data, obtained with df.sample(frac=<fraction>), e.g. frac=0.01.
- There is a keyword hue to which you can pass the name of a (categorical) column. Afterwards, the points are coloured by this attribute.
- You can use plot_kws={<key>: <value>} to pass arguments to the plot function, e.g. plot_kws={'alpha': .2} to plot the points with transparency.
- You can use vars=<list-of-column-names> to give an explicit order, or to use only a subset of the columns. Experiment with different orderings.
to give an explicit order, or use only a subset of the columns. Experiment with different orderings.seaborn.pairplot(data=df.sample(frac=.001), kind='scatter', diag_kind='scatter', plot_kws={'alpha': .2} )
When scattering many points, the scatter matrix may get cluttered, and it is hard to detect areas where many points lie, as they overlap. To cope with this situation, one can replace the scatter plots in each cell by 2-D histograms.
For this we will use seaborn.PairGrid
. A pair grid defines the "matrix" as known from the scatter plot, and provides interfaces to map different visualizations to the specific cells.
For easier usage, we provide you with the following method, which plots an equi-width 2-D histogram into a cell.
Moreover, it determines the number of bins in $x$ and $y$ direction such that it does not exceed the number of different values, in order to allow displaying categorical variables.
Be aware that the order in that case does not necessarily convey any meaning!
def hist2d(x, y, color, label, max_bins=50, **kwargs):
    x_bins = numpy.unique(x).size
    x_bins = min(x_bins, max_bins)
    y_bins = numpy.unique(y).size
    y_bins = min(y_bins, max_bins)
    bins = (x_bins, y_bins)
    plt.hist2d(x=x, y=y, bins=bins)
Next, create the pair grid g and map the method hist2d to the cells using g.map(hist2d). As before, you can pass the keyword argument vars to the PairGrid constructor to select only a subset of columns.
You may also want to start with only a fraction of the data (i.e. using df.sample(frac=<fraction>) as value for the data parameter).
However, you should be able to run the visualization even for the full dataset.
Take a closer look at the cell (longitude, latitude). What can you observe? Also compare the plot against the shape of Chicago, e.g. from OpenStreetMap or Wikipedia.
Hint: Use plt.subplots_adjust(wspace=0.02, hspace=0.02) to compact the grid.
# TODO: SOLUTION
frac = 0.01
selection = df.sample(frac=frac)
g = seaborn.PairGrid(data=selection)
g.map(hist2d)
plt.subplots_adjust(wspace=0.02, hspace=0.02)
# Observations:
# - location reflects shape of Chicago.
# - there is a clear crime hotspot at the harbour
# - one can re-discover previous observations, e.g. the 5 am crime minimum
We will also take a look into a (simple) pixel-oriented technique.
Here, we have one subplot for each attribute, and each data point maps to a single pixel per window, where the colour indicates the value.
First, we create a (n_features x n_features)
matrix of subplots by
figure, axes = plt.subplots(nrows=n_features, ncols=n_features)
In row $r$, we display the whole dataset sorted by the $r$th feature. In order to avoid artifacts by having internal presorting in the data set, shuffle the rows first. You can do both using
sorted_df = df.sample(frac=1.0).sort_values(by=df.columns[r])
To visualize the $c$th feature in column $c$ (of the subplots matrix), we have to bring the values into a rectangular array.
We use width=512, and infer the height by height = int(numpy.ceil(n_samples / width)).
You can use the following method to bring a column from a data frame into a rectangular array.
def prepare_cell(df_column, width=512):
    n_samples = len(df_column)
    height = int(numpy.ceil(n_samples / width))
    buffer = numpy.empty(shape=(height, width))
    buffer[:] = float('nan')
    if hasattr(df_column, 'cat'):
        df_column = df_column.cat.codes
    buffer.flat[:n_samples] = df_column.values
    # buffer = buffer.reshape((height, width))
    return buffer
Finally, you can use axes[r, c].imshow(buffer, aspect='auto')
to show the data in buffer
in row $r$, column $c$ of the matrix.
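To get a feeling for a single cell before building the full matrix, you could, for example, sort by 'hour' and render only the 'Primary Type' column (an illustrative combination, not prescribed by the exercise):
# Shuffle, sort by 'hour', and display 'Primary Type' as one pixel-oriented cell.
sorted_df = df.sample(frac=1.0).sort_values(by='hour')
buffer = prepare_cell(df_column=sorted_df['Primary Type'])
plt.imshow(buffer, aspect='auto')
plt.xticks([])
plt.yticks([])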
Questions:
Hints:
- Use axes[r, c].set_xticks([]) and axes[r, c].set_yticks([]) to disable tick labels on the axes.
- Use axes[r, 0].set_ylabel(df.columns[r]) and axes[-1, c].set_xlabel(df.columns[c]) to add x and y labels to your matrix.
- Use plt.subplots_adjust(hspace=.02, wspace=.02) to adjust the space between the cells.
# TODO: SOLUTION
n_features = len(df.columns)
f, axes = plt.subplots(nrows=n_features, ncols=n_features)
for r, primary_column in enumerate(df.columns):
    sorted_df = df.sample(frac=1.0).sort_values(by=primary_column)
    for i, column in enumerate(df.columns):
        buffer = prepare_cell(df_column=sorted_df[column])
        axes[r, i].imshow(buffer, aspect='auto')
        axes[r, i].set_xticks([])
        axes[r, i].set_yticks([])
    axes[r, 0].set_ylabel(primary_column)
    axes[-1, r].set_xlabel(primary_column)
plt.subplots_adjust(hspace=.02, wspace=.02)
# Observations:
# - We can see the hierarchy in Primary Type and Description
# - Block, Beat, District, Ward, Community Area are all different resolutions of the region hierarchy. We can clearly see this relation in the pattern.
# - The regions are located in some latitude/longitude area
# - There are some crime types (primary type and description) for which crimes are more often classified as Domestic
# - There are some patterns between Description (i.e. high-resolution crime type) and Location Description, indicating that some crime types occur at typical locations
# - There are some patterns between Arrest and Primary Type, i.e. there are some crime types that typically lead to arrests
# ---
# - If we disable sorting, we cannot see any patterns besides random-looking noise.