T&T Crime Data Acquisition and Forecasting - Python
Kevin Folkes
2024-06-24

Project Outline:¶
- Locate and select data from the TTPS site using Selenium automation.
- Web-scrape/import the data into the notebook.
- Clean and prepare the data for use.
- Train an ML model on 80% of the data, then evaluate it against the held-out 20%.
- Forecast the coming months.
In [73]:
# import libraries needed:
## Forecasting
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.theta import ThetaForecaster
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error
from sktime.utils.plotting import plot_series
# Prophet
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
# Automation
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains
# Web Scraping
from bs4 import BeautifulSoup
# Pandas
import pandas as pd
import numpy as np
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML
import warnings
warnings.filterwarnings("ignore")
In [7]:
# set up the WebDriver:
driver = webdriver.Chrome()
# open the webpage:
url = 'https://www.ttps.gov.tt/Stats/Crime-Totals-By-Month'
driver.get(url)
# wait for the page to load:
time.sleep(2)
In [8]:
# locate and select the offence dropdown:
offence_dropdown = driver.find_element(By.ID, 'dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox2_B-1')
ActionChains(driver).move_to_element(offence_dropdown).click(offence_dropdown).perform()
# wait for the page to load:
time.sleep(1)
# from the dropdown, select the offence desired:
murder_offence = driver.find_element(By.ID, 'dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox2_DDD_L_LBI8T0')
ActionChains(driver).move_to_element(murder_offence).click(murder_offence).perform()
# wait for the page to load:
time.sleep(1)
# Locate and select the table tab to display tabular data:
table = driver.find_element(By.ID, 'dnn_ctr453_TTPSCrimeMonthTotals_ASPxPageControl1_T1T')
ActionChains(driver).move_to_element(table).click(table).perform()
# wait for the page to load:
time.sleep(2)
In [9]:
# Get a list of all year IDs in the year dropdown:
# find the container where the year dropdown data is stored:
container = driver.find_element(By.ID, 'dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBT')
# get all the td elements in the container:
td_elements = container.find_elements(By.TAG_NAME, 'td')
# extract the IDs from the td tags:
year_ids = [td.get_attribute('id') for td in td_elements if td.get_attribute('id')]
# print all the IDs collected:
for year_id in year_ids:
    print(year_id)
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI0T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI1T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI2T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI3T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI4T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI5T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI6T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI7T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI8T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI9T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI10T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI11T0
Selenium Automation & Web Scraping:¶
- Here, we're combining the automation and web scraping to store the tabular data in a list.
- Selenium will navigate the page, then Beautiful Soup will parse the page and bring in the data.
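The Beautiful Soup half of that hand-off can be sketched in isolation. Below is a minimal example using made-up HTML in place of `driver.page_source`; the table id and cell values are illustrative only, not the TTPS page:

```python
from bs4 import BeautifulSoup

# made-up HTML standing in for driver.page_source:
page_source = """
<table id="example_table">
  <tr><td> Jan </td><td> 38 </td></tr>
  <tr><td> Feb </td><td> 46 </td></tr>
</table>
"""

soup = BeautifulSoup(page_source, 'html.parser')
table = soup.find('table', id='example_table')

rows = []
for tr in table.find_all('tr'):
    # strip whitespace from each cell, as in the main loop below:
    rows.append([td.text.strip() for td in tr.find_all('td')])

print(rows)  # [['Jan', '38'], ['Feb', '46']]
```

The same `find('table', id=...)` / `find_all('tr')` / `find_all('td')` pattern is what the scraping loop applies to the live page source.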
In [14]:
# create an empty list to store the data in:
table_info = []
# loop through each year in the year dropdown, generate the report, then web-scrape the table info.
# store the table info in the empty list created above
for report_year in year_ids:
    # find the year dropdown:
    year_dropdown = driver.find_element(By.ID, 'dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_I')
    # select the dropdown:
    ActionChains(driver).move_to_element(year_dropdown).click(year_dropdown).perform()
    # wait for the page to load
    time.sleep(2)
    # get the year entry in the dropdown:
    year_select = driver.find_element(By.ID, report_year)
    # extract the year text:
    year = year_select.text
    # select the year:
    ActionChains(driver).move_to_element(year_select).click(year_select).perform()
    # wait for the page to load
    time.sleep(2)
    # find the generate-report button:
    generate = driver.find_element(By.ID, 'dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxButton1_CD')
    # click generate report:
    ActionChains(driver).move_to_element(generate).click(generate).perform()
    time.sleep(3)
    # here we switch to web scraping with Beautiful Soup:
    # get the page source:
    page_source = driver.page_source
    # parse with bs4
    soup = BeautifulSoup(page_source, 'html.parser')
    # locate the table:
    page_table = soup.find('table', id='dnn_ctr453_TTPSCrimeMonthTotals_ASPxPageControl1_ASPxGridView1_DXMainTable')
    # there are no tagged table headers, so we skip that step and create our own headers later.
    # get the table rows:
    rows = page_table.find_all('tr')  # [3:-1] could skip rows here, but we keep all rows and clean them in the data-cleaning section.
    # loop through all the rows, extract the info, and store it in the list:
    for row in rows:
        cells = row.find_all('td')
        # strip each cell's text of whitespace:
        cell_data = [cell.text.strip() for cell in cells]
        # add the year this info was taken from:
        cell_data.append(year)
        # add the row data to the list:
        table_info.append(cell_data)
    # print the year when the data has been extracted and stored:
    print(f'All data in {year}, has been extracted and stored in the list: table_info')
All data in 2013, has been extracted and stored in the list: table_info
All data in 2014, has been extracted and stored in the list: table_info
All data in 2015, has been extracted and stored in the list: table_info
All data in 2016, has been extracted and stored in the list: table_info
All data in 2017, has been extracted and stored in the list: table_info
All data in 2018, has been extracted and stored in the list: table_info
All data in 2019, has been extracted and stored in the list: table_info
All data in 2020, has been extracted and stored in the list: table_info
All data in 2021, has been extracted and stored in the list: table_info
All data in 2022, has been extracted and stored in the list: table_info
All data in 2023, has been extracted and stored in the list: table_info
All data in 2024, has been extracted and stored in the list: table_info
In [16]:
driver.close()
In [19]:
# transfer the collected table info to a dataframe:
crime_df = pd.DataFrame(table_info)
crime_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191 entries, 0 to 190
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       191 non-null    object
 1   1       191 non-null    object
 2   2       167 non-null    object
 3   3       12 non-null     object
 4   4       12 non-null     object
dtypes: object(5)
memory usage: 7.6+ KB
In [21]:
crime_df.head()
Out[21]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | Month | Month | Total | Total | 2013 |
1 | Month | 2013 | None | None | None |
2 | Total | 2013 | None | None | None |
3 | Jan | 38 | 2013 | None | None |
4 | Feb | 46 | 2013 | None | None |
In [23]:
# We can see that there are two unnecessary columns; they will be removed.
# Column names will be changed.
# The df will be cleaned by removing all None, 'Month' & 'Sum' entries.
# remove unwanted columns and reorder:
crime_df = crime_df[[2, 0, 1]]
# rename columns:
crime_df.rename(columns={0: 'Month', 1: 'Total', 2: 'Year'}, inplace=True)
# filter the df to only the data needed (remove rows with 'Total', 'Month' & 'Sum'):
crime_df = crime_df[crime_df['Month'] != 'Month']
crime_df = crime_df[crime_df['Month'] != 'Total']
crime_df = crime_df[~crime_df['Total'].str.contains('Sum')]
# reset the index:
crime_df.reset_index(drop=True, inplace=True)
# the last row appears to have no complete date, so remove it since it would create an outlier:
crime_df = crime_df.iloc[:-1]
In [25]:
crime_df.tail()
Out[25]:
Year | Month | Total | |
---|---|---|---|
137 | 2024 | Jun | 51 |
138 | 2024 | Jul | 66 |
139 | 2024 | Aug | 68 |
140 | 2024 | Sep | 58 |
141 | 2024 | Oct | 55 |
In [27]:
# Having the Year and the Month in separate columns and in different formats isn't helpful.
# create a mapping for months:
month_map = {
'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'
}
# Add month numbers to df
crime_df['Month_Num'] = crime_df['Month'].map(month_map)
# combine month number and year to df as a pandas datetime object.
crime_df['date'] = pd.to_datetime(crime_df['Month_Num'].astype(str) + crime_df['Year'])
#format the month_year column
crime_df['Month_Year'] = crime_df['date'].dt.strftime('%m-%Y')
# Create a copy of the df for Prophet forecasting (we will use two types of models)
crime_df_fb = crime_df.copy()
#apply the month_year as the index
crime_df.index = crime_df['Month_Year']
# drop all other columns except totals:
crime_df = crime_df[['Total']]
# change the index to a period type with a monthly frequency:
crime_df.index = pd.PeriodIndex(crime_df.index, freq='M')
# change the total to int type.
crime_df['Total'] = crime_df['Total'].astype(int)
crime_df.head()
Out[27]:
Total | |
---|---|
Month_Year | |
2013-01 | 38 |
2013-02 | 46 |
2013-03 | 18 |
2013-04 | 19 |
2013-05 | 33 |
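The month-mapping step above hinges on pandas parsing strings like `'012013'` into dates. A small sketch of the same idea, using an explicit `format` string (an assumption on my part; the cell above relies on pandas' default parser) and the `dt.to_period` route to a monthly period index:

```python
import pandas as pd

# abbreviated month map; the full map is in the cell above
month_map = {'Jan': '01', 'Feb': '02'}

# build '012013'-style strings, then parse with an explicit format to avoid ambiguity:
s = pd.Series(['Jan', 'Feb']).map(month_map) + '2013'
dates = pd.to_datetime(s, format='%m%Y')

# convert to monthly periods, as the notebook does via pd.PeriodIndex(..., freq='M'):
periods = dates.dt.to_period('M')
print(list(periods.astype(str)))  # ['2013-01', '2013-02']
```

Passing `format='%m%Y'` is slightly safer than relying on inference, since compact numeric strings can be read in more than one way.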
In [30]:
crime_df_cleaned = crime_df.copy()
# split the df into 80/20 for training and testing:
crime_df_cleaned_train, crime_df_cleaned_test = temporal_train_test_split(crime_df_cleaned, train_size=0.8)
# set the forecast horizon:
fh = ForecastingHorizon(crime_df_cleaned_test.index, is_relative=False)
#fh = ForecastingHorizon([i for i in range(1,12)])
# use the Theta Forecaster model and set the seasonal period to 12 (yearly)
forecaster = ThetaForecaster(sp=12)
# Train the model on the data
forecaster.fit(crime_df_cleaned_train)
# Predict the 20% test data
crime_pred = forecaster.predict(fh)
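Before trusting the Theta results, a baseline helps put the error numbers in context. Here is a seasonal-naive forecast (each test month repeats the value from 12 months earlier), sketched with plain pandas on made-up monthly counts; the numbers are illustrative, not the TTPS data:

```python
import pandas as pd
import numpy as np

# made-up monthly counts over three years:
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(20, 70, size=36),
              index=pd.period_range('2013-01', periods=36, freq='M'))

# same 80/20-style split idea: hold out the last 12 months
train, test = y.iloc[:-12], y.iloc[-12:]

# seasonal-naive prediction: repeat the last full year of training values
pred = train.iloc[-12:].to_numpy()

# MAPE of the baseline
mape = np.mean(np.abs((test.to_numpy() - pred) / test.to_numpy())) * 100
print(f'baseline MAPE: {mape:.2f}%')
```

Any fitted forecaster worth keeping should beat the seasonal-naive number on the same split.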
In [32]:
# plot the actual series against the test-period predictions:
plot_series(crime_df_cleaned, crime_pred)
Out[32]:
(<Figure size 1600x400 with 1 Axes>, <Axes: ylabel='Total'>)
In [34]:
# get the mean absolute percentage error:
mape = mean_absolute_percentage_error(crime_df_cleaned_test, crime_pred)
print(f'Mean Absolute Percentage Error: {round(mape*100, 2)}%')
Mean Absolute Percentage Error: 17.29%
In [36]:
# Forecast into the future:
forecaster = ThetaForecaster(sp=12)
# relative horizon: the next 11 monthly steps
fh = ForecastingHorizon(list(range(1, 12)))
# Train the model on all of the data
forecaster.fit(crime_df_cleaned)
# Predict the future data
crime_pred = forecaster.predict(fh)
# plot the historical data against the forecast:
plot_series(crime_df_cleaned, crime_pred)
Out[36]:
(<Figure size 1600x400 with 1 Axes>, <Axes: ylabel='Total'>)
Prophet Model:¶
In [39]:
# Remove unwanted columns:
crime_df_fb = crime_df_fb[['Month_Year', 'Total']]
# change Month_Year column to datetime
crime_df_fb['Month_Year'] = pd.to_datetime(crime_df_fb['Month_Year'])
# change Total column to integer
crime_df_fb['Total'] = crime_df_fb['Total'].astype(int)
# Rename columns:
crime_df_fb.rename(columns= {'Month_Year' : 'ds', 'Total': 'y'}, inplace=True)
crime_df_fb['ds'] = crime_df_fb['ds'] + pd.offsets.MonthEnd(0)
crime_df_fb.head()
Out[39]:
ds | y | |
---|---|---|
0 | 2013-01-31 | 38 |
1 | 2013-02-28 | 46 |
2 | 2013-03-31 | 18 |
3 | 2013-04-30 | 19 |
4 | 2013-05-31 | 33 |
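The `pd.offsets.MonthEnd(0)` step above snaps each `ds` value to the end of its month, and leaves dates that are already at month end unchanged. A quick check of that behavior:

```python
import pandas as pd

# a mid-month date rolls forward to that month's end:
snapped = pd.Timestamp('2013-02-01') + pd.offsets.MonthEnd(0)
print(snapped)  # 2013-02-28 00:00:00

# a date already at month end is unchanged:
unchanged = pd.Timestamp('2013-01-31') + pd.offsets.MonthEnd(0)
print(unchanged)  # 2013-01-31 00:00:00
```

This matters for the later filter `crime_df_fb['ds'] < '2024-09-30'`, since month-end stamps compare cleanly against month-end cutoffs.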
In [41]:
train_df = crime_df_fb[(crime_df_fb['ds'] < '2024-09-30')]
In [43]:
lockdowns = pd.DataFrame([
{'holiday': 'lockdown_1', 'ds': '2020-03-31', 'lower_window': 0, 'ds_upper': '2021-09-30'},
])
for t_col in ['ds', 'ds_upper']:
lockdowns[t_col] = pd.to_datetime(lockdowns[t_col])
lockdowns['upper_window'] = (lockdowns['ds_upper'] - lockdowns['ds']).dt.days
lockdowns
Out[43]:
holiday | ds | lower_window | ds_upper | upper_window | |
---|---|---|---|---|---|
0 | lockdown_1 | 2020-03-31 | 0 | 2021-09-30 | 548 |
In [87]:
#lockdowns2 = pd.DataFrame([
# {'holiday': 'lockdown_1', 'ds': '2021-03-01', 'lower_window': 0, 'ds_upper': '2021-09-30'},
#])
#for t_col in ['ds', 'ds_upper']:
# lockdowns2[t_col] = pd.to_datetime(lockdowns2[t_col])
#lockdowns2['upper_window'] = (lockdowns2['ds_upper'] - lockdowns2['ds']).dt.days
#lockdowns2
In [47]:
train_df.reset_index(drop=True, inplace=True)
train_df
Out[47]:
ds | y | |
---|---|---|
0 | 2013-01-31 | 38 |
1 | 2013-02-28 | 46 |
2 | 2013-03-31 | 18 |
3 | 2013-04-30 | 19 |
4 | 2013-05-31 | 33 |
... | ... | ... |
135 | 2024-04-30 | 37 |
136 | 2024-05-31 | 53 |
137 | 2024-06-30 | 51 |
138 | 2024-07-31 | 66 |
139 | 2024-08-31 | 68 |
140 rows × 2 columns
In [49]:
crime_df_fb
Out[49]:
ds | y | |
---|---|---|
0 | 2013-01-31 | 38 |
1 | 2013-02-28 | 46 |
2 | 2013-03-31 | 18 |
3 | 2013-04-30 | 19 |
4 | 2013-05-31 | 33 |
... | ... | ... |
137 | 2024-06-30 | 51 |
138 | 2024-07-31 | 66 |
139 | 2024-08-31 | 68 |
140 | 2024-09-30 | 58 |
141 | 2024-10-31 | 55 |
142 rows × 2 columns
In [51]:
# Post Covid test
post_covid = crime_df_fb[crime_df_fb['ds'] >= '2021-10-01']
post_covid.reset_index(inplace=True, drop = True)
post_covid = post_covid.head(33)
post_covid.tail()
Out[51]:
ds | y | |
---|---|---|
28 | 2024-02-29 | 42 |
29 | 2024-03-31 | 57 |
30 | 2024-04-30 | 37 |
31 | 2024-05-31 | 53 |
32 | 2024-06-30 | 51 |
In [53]:
# set the model seasonality
m = Prophet( holidays=lockdowns, seasonality_mode='multiplicative', changepoint_prior_scale=0.05, changepoint_range=0.95)
#m2 = Prophet(seasonality_mode='multiplicative')
# train the model on the data
m.fit(train_df)
#m2.fit(post_covid)
# set the predicted dates amount
future = m.make_future_dataframe( periods=12, freq='M')
#future2 = m2.make_future_dataframe( periods=12, freq='M')
future.tail()
14:00:18 - cmdstanpy - INFO - Chain [1] start processing
14:00:18 - cmdstanpy - INFO - Chain [1] done processing
14:00:18 - cmdstanpy - INFO - Chain [1] start processing
14:00:19 - cmdstanpy - INFO - Chain [1] done processing
Out[53]:
ds | |
---|---|
147 | 2025-04-30 |
148 | 2025-05-31 |
149 | 2025-06-30 |
150 | 2025-07-31 |
151 | 2025-08-31 |
In [55]:
# predict the next 12 months:
forecast = m.predict(future)
#forecast2 = m2.predict(future2)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
Out[55]:
ds | yhat | yhat_lower | yhat_upper | |
---|---|---|---|---|
147 | 2025-04-30 | 51.835601 | 42.992737 | 61.228994 |
148 | 2025-05-31 | 55.602831 | 46.429305 | 64.305231 |
149 | 2025-06-30 | 50.303951 | 41.353531 | 58.727845 |
150 | 2025-07-31 | 60.545914 | 52.267125 | 70.046841 |
151 | 2025-08-31 | 63.852228 | 54.833038 | 73.447367 |
In [85]:
# add residuals column:
forecast['residuals'] = crime_df_fb['y'] - forecast['yhat']
#forecast2['residuals'] = post_covid['y'] - forecast2['yhat']
# calculate the absolute percentage error per row:
forecast['absolute_percentage_error'] = abs(forecast['residuals'] / crime_df_fb['y']) * 100
#forecast2['absolute_percentage_error'] = abs(forecast2['residuals'] / post_covid['y']) * 100
# calculate the MAPE:
mape = forecast['absolute_percentage_error'].mean()
#mape2 = forecast2['absolute_percentage_error'].mean()
round(mape,2)
Out[85]:
13.38
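One subtlety in the residual cell above: `crime_df_fb['y'] - forecast['yhat']` lines rows up by index label, not by position, and `forecast` has more rows (the future periods) than `crime_df_fb`. A small demonstration of that alignment rule on toy series:

```python
import pandas as pd

actual = pd.Series([38, 46, 18], index=[0, 1, 2])
predicted = pd.Series([40.0, 44.0, 20.0, 55.0], index=[0, 1, 2, 3])

# subtraction aligns on index labels; labels present in only
# one series produce NaN rather than raising an error:
residuals = actual - predicted
print(residuals.tolist())  # [-2.0, 2.0, -2.0, nan]
```

The NaN rows for future dates are then silently skipped by `.mean()`, which is why the MAPE above reflects only the historical period.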
In [61]:
# plot forecast
fig1 = m.plot(forecast)
In [63]:
# plot components
fig1 = m.plot_components(forecast)
In [67]:
# Python
from prophet.plot import add_changepoints_to_plot
fig = m.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), m, forecast)
In [93]:
# Interactive Plot
HTML(plot_plotly(m, forecast).to_html())
Out[93]:
In [155]:
print(f'Mean Absolute Percentage Error: {round(mape, 2)}%')
Mean Absolute Percentage Error: 13.49%
Next Steps to Explore:¶
- Identify and remove outliers from the data.
- Add a regressor for the Covid-19 period.
- Cross-validate the data and fine-tune the metrics.
- Lower the MAPE to make the predictions more accurate.
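The cross-validation step can be prototyped before wiring up Prophet's own `cross_validation` helper (in `prophet.diagnostics`). Below is a rolling-origin sketch on made-up monthly data, with a seasonal-naive stand-in model; origins, horizon, and data are all illustrative assumptions:

```python
import numpy as np
import pandas as pd

# made-up monthly counts over four years:
rng = np.random.default_rng(1)
y = pd.Series(rng.integers(20, 70, size=48),
              index=pd.period_range('2013-01', periods=48, freq='M'))

horizon = 6  # forecast 6 months per fold
fold_mapes = []
# expanding window: cut the series at several origins, forecast the next `horizon` months
for cutoff in range(24, len(y) - horizon + 1, horizon):
    train, test = y.iloc[:cutoff], y.iloc[cutoff:cutoff + horizon]
    # seasonal-naive stand-in: repeat the matching months from the last full training year
    pred = train.iloc[-12:-12 + horizon].to_numpy()
    fold_mapes.append(np.mean(np.abs((test.to_numpy() - pred) / test.to_numpy())) * 100)

# one MAPE per fold; averaging them gives a more stable error estimate
print([round(m, 1) for m in fold_mapes])
```

Swapping the stand-in for a real model (re-fitting Prophet or Theta at each cutoff) gives the same per-fold error structure, which is what makes the averaged MAPE more trustworthy than a single train/test split.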