T&T Crime Data Acquisition and Forecasting - Python
Kevin Folkes
2024-06-24

Project Outline:¶
- Locate and select data from the TTPS site using Selenium automation.
- Web-scrape/import the data into the notebook.
- Clean and prepare the data for use.
- Train an ML model on 80% of the data, then evaluate it against the held-out 20%.
- Forecast the coming months.
In [73]:
# import libraries needed:
## Forecasting
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.theta import ThetaForecaster
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error
from sktime.utils.plotting import plot_series
# Prophet
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
# Automation
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains
# Web Scraping
from bs4 import BeautifulSoup
# Pandas
import pandas as pd
import numpy as np
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML
import warnings
warnings.filterwarnings("ignore")
In [7]:
# set up the WebDriver:
driver = webdriver.Chrome()
# open the webpage:
url = 'https://www.ttps.gov.tt/Stats/Crime-Totals-By-Month'
driver.get(url)
# wait for the page to load:
time.sleep(2)
In [8]:
# locate and select the offence dropdown:
offence_dropdown = driver.find_element(By.ID, 'dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox2_B-1')
ActionChains(driver).move_to_element(offence_dropdown).click(offence_dropdown).perform()
# wait for the page to load:
time.sleep(1)
# from the dropdown, select the offence desired:
murder_offence = driver.find_element(By.ID, 'dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox2_DDD_L_LBI8T0')
ActionChains(driver).move_to_element(murder_offence).click(murder_offence).perform()
# wait for the page to load:
time.sleep(1)
# Locate and select the table tab to display tabular data:
table = driver.find_element(By.ID, 'dnn_ctr453_TTPSCrimeMonthTotals_ASPxPageControl1_T1T')
ActionChains(driver).move_to_element(table).click(table).perform()
# wait for the page to load:
time.sleep(2)
In [9]:
# Get a list of all year IDs in the year dropdown:
# find the container where the year dropdown data is stored:
container = driver.find_element(By.ID, 'dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBT')
# get all the td elements in the container:
td_elements = container.find_elements(By.TAG_NAME, 'td')
# extract the IDs from the td tags:
year_ids = [td.get_attribute('id') for td in td_elements if td.get_attribute('id')]
# print all the IDs collected:
for year_id in year_ids:
    print(year_id)
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI0T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI1T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI2T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI3T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI4T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI5T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI6T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI7T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI8T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI9T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI10T0
dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_DDD_L_LBI11T0
Selenium Automation & Web Scraping:¶
- Here, we're combining the automation and web scraping to store the tabular data in a list.
- Selenium will navigate the page, then Beautiful Soup will parse the page and bring in the data.
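The Beautiful Soup half of that hand-off can be sketched in isolation. Below is a minimal example using made-up HTML in place of `driver.page_source`; the table id and cell values are illustrative only, not the TTPS page:

```python
from bs4 import BeautifulSoup

# made-up HTML standing in for driver.page_source:
page_source = """
<table id="example_table">
  <tr><td> Jan </td><td> 38 </td></tr>
  <tr><td> Feb </td><td> 46 </td></tr>
</table>
"""

soup = BeautifulSoup(page_source, 'html.parser')
table = soup.find('table', id='example_table')

rows = []
for tr in table.find_all('tr'):
    # strip whitespace from each cell, as in the main loop below:
    rows.append([td.text.strip() for td in tr.find_all('td')])

print(rows)  # [['Jan', '38'], ['Feb', '46']]
```

The same `find('table', id=...)` / `find_all('tr')` / `find_all('td')` pattern is what the scraping loop applies to the live page source.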
In [14]:
# create an empty list to store the data in:
table_info = []
# loop through each year in the year dropdown, generate the report, then web-scrape the table info.
# store the table info in the empty list created above
for report_year in year_ids:
    # find the year dropdown:
    year_dropdown = driver.find_element(By.ID, 'dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxComboBox3_I')
    # select the dropdown:
    ActionChains(driver).move_to_element(year_dropdown).click(year_dropdown).perform()
    # wait for the page to load
    time.sleep(2)
    # get the year entry in the dropdown:
    year_select = driver.find_element(By.ID, report_year)
    # extract the year text:
    year = year_select.text
    # select the year:
    ActionChains(driver).move_to_element(year_select).click(year_select).perform()
    # wait for the page to load
    time.sleep(2)
    # find the generate-report button:
    generate = driver.find_element(By.ID, 'dnn_ctr453_TTPSCrimeMonthTotals_ASPxRoundPanel1_ASPxButton1_CD')
    # click generate report:
    ActionChains(driver).move_to_element(generate).click(generate).perform()
    time.sleep(3)
    # here we switch to web scraping with Beautiful Soup:
    # get the page source:
    page_source = driver.page_source
    # parse with bs4
    soup = BeautifulSoup(page_source, 'html.parser')
    # locate the table:
    page_table = soup.find('table', id='dnn_ctr453_TTPSCrimeMonthTotals_ASPxPageControl1_ASPxGridView1_DXMainTable')
    # there are no tagged table headers, so we skip that step and create our own headers later.
    # get the table rows:
    rows = page_table.find_all('tr')  # [3:-1] could skip rows here, but we keep all rows and clean them in the data-cleaning section.
    # loop through all the rows, extract the info, and store it in the list:
    for row in rows:
        cells = row.find_all('td')
        # strip each cell's text of whitespace:
        cell_data = [cell.text.strip() for cell in cells]
        # add the year this info was taken from:
        cell_data.append(year)
        # add the row data to the list:
        table_info.append(cell_data)
    # print the year when the data has been extracted and stored:
    print(f'All data in {year}, has been extracted and stored in the list: table_info')
All data in 2013, has been extracted and stored in the list: table_info
All data in 2014, has been extracted and stored in the list: table_info
All data in 2015, has been extracted and stored in the list: table_info
All data in 2016, has been extracted and stored in the list: table_info
All data in 2017, has been extracted and stored in the list: table_info
All data in 2018, has been extracted and stored in the list: table_info
All data in 2019, has been extracted and stored in the list: table_info
All data in 2020, has been extracted and stored in the list: table_info
All data in 2021, has been extracted and stored in the list: table_info
All data in 2022, has been extracted and stored in the list: table_info
All data in 2023, has been extracted and stored in the list: table_info
All data in 2024, has been extracted and stored in the list: table_info
In [16]:
driver.close()
In [19]:
# transfer the collected table info to a dataframe:
crime_df = pd.DataFrame(table_info)
crime_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191 entries, 0 to 190
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       191 non-null    object
 1   1       191 non-null    object
 2   2       167 non-null    object
 3   3       12 non-null     object
 4   4       12 non-null     object
dtypes: object(5)
memory usage: 7.6+ KB
In [21]:
crime_df.head()
Out[21]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | Month | Month | Total | Total | 2013 |
1 | Month | 2013 | None | None | None |
2 | Total | 2013 | None | None | None |
3 | Jan | 38 | 2013 | None | None |
4 | Feb | 46 | 2013 | None | None |
In [23]:
# We can see that there are two unnecessary columns; they will be removed.
# Column names will be changed.
# The df will be cleaned by removing all None, 'Month' & 'Sum' entries.
# remove unwanted columns and reorder:
crime_df = crime_df[[2, 0, 1]]
# rename columns:
crime_df.rename(columns={0: 'Month', 1: 'Total', 2: 'Year'}, inplace=True)
# filter the df to only the data needed (remove rows with 'Total', 'Month' & 'Sum'):
crime_df = crime_df[crime_df['Month'] != 'Month']
crime_df = crime_df[crime_df['Month'] != 'Total']
crime_df = crime_df[~crime_df['Total'].str.contains('Sum')]
# reset the index:
crime_df.reset_index(drop=True, inplace=True)
# the last row appears to have no complete date, so remove it since it would create an outlier:
crime_df = crime_df.iloc[:-1]
In [25]:
crime_df.tail()
Out[25]:
Year | Month | Total | |
---|---|---|---|
137 | 2024 | Jun | 51 |
138 | 2024 | Jul | 66 |
139 | 2024 | Aug | 68 |
140 | 2024 | Sep | 58 |
141 | 2024 | Oct | 55 |
In [27]:
# Having the Year and the Month in separate columns and in different formats isn't helpful.
# create a mapping for months:
month_map = {
'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'
}
# Add month numbers to df
crime_df['Month_Num'] = crime_df['Month'].map(month_map)
# combine month number and year to df as a pandas datetime object.
crime_df['date'] = pd.to_datetime(crime_df['Month_Num'].astype(str) + crime_df['Year'])
#format the month_year column
crime_df['Month_Year'] = crime_df['date'].dt.strftime('%m-%Y')
# Create a copy of the df for Prophet forecasting (we will use two types of models)
crime_df_fb = crime_df.copy()
#apply the month_year as the index
crime_df.index = crime_df['Month_Year']
# drop all other columns except totals:
crime_df = crime_df[['Total']]
# change the index to a period type with a monthly frequency:
crime_df.index = pd.PeriodIndex(crime_df.index, freq='M')
# change the total to int type.
crime_df['Total'] = crime_df['Total'].astype(int)
crime_df.head()
Out[27]:
Total | |
---|---|
Month_Year | |
2013-01 | 38 |
2013-02 | 46 |
2013-03 | 18 |
2013-04 | 19 |
2013-05 | 33 |
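The month-mapping step above hinges on pandas parsing strings like `'012013'` into dates. A small sketch of the same idea, using an explicit `format` string (an assumption on my part; the cell above relies on pandas' default parser) and the `dt.to_period` route to a monthly period index:

```python
import pandas as pd

# abbreviated month map; the full map is in the cell above
month_map = {'Jan': '01', 'Feb': '02'}

# build '012013'-style strings, then parse with an explicit format to avoid ambiguity:
s = pd.Series(['Jan', 'Feb']).map(month_map) + '2013'
dates = pd.to_datetime(s, format='%m%Y')

# convert to monthly periods, as the notebook does via pd.PeriodIndex(..., freq='M'):
periods = dates.dt.to_period('M')
print(list(periods.astype(str)))  # ['2013-01', '2013-02']
```

Passing `format='%m%Y'` is slightly safer than relying on inference, since compact numeric strings can be read in more than one way.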
In [30]:
crime_df_cleaned = crime_df.copy()
# split the df into 80/20 for training and testing:
crime_df_cleaned_train, crime_df_cleaned_test = temporal_train_test_split(crime_df_cleaned, train_size=0.8)
# set the forecast horizon:
fh = ForecastingHorizon(crime_df_cleaned_test.index, is_relative=False)
#fh = ForecastingHorizon([i for i in range(1,12)])
# use the Theta Forecaster model and set the seasonal period to 12 (yearly)
forecaster = ThetaForecaster(sp=12)
# Train the model on the data
forecaster.fit(crime_df_cleaned_train)
# Predict the 20% test data
crime_pred = forecaster.predict(fh)
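Before trusting the Theta results, a baseline helps put the error numbers in context. Here is a seasonal-naive forecast (each test month repeats the value from 12 months earlier), sketched with plain pandas on made-up monthly counts; the numbers are illustrative, not the TTPS data:

```python
import pandas as pd
import numpy as np

# made-up monthly counts over three years:
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(20, 70, size=36),
              index=pd.period_range('2013-01', periods=36, freq='M'))

# same 80/20-style split idea: hold out the last 12 months
train, test = y.iloc[:-12], y.iloc[-12:]

# seasonal-naive prediction: repeat the last full year of training values
pred = train.iloc[-12:].to_numpy()

# MAPE of the baseline
mape = np.mean(np.abs((test.to_numpy() - pred) / test.to_numpy())) * 100
print(f'baseline MAPE: {mape:.2f}%')
```

Any fitted forecaster worth keeping should beat the seasonal-naive number on the same split.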
In [32]:
# plot the actual series against the test-period predictions:
plot_series(crime_df_cleaned, crime_pred)
Out[32]:
(<Figure size 1600x400 with 1 Axes>, <Axes: ylabel='Total'>)
In [34]:
# get the mean absolute percentage error:
mape = mean_absolute_percentage_error(crime_df_cleaned_test, crime_pred)
print(f'Mean Absolute Percentage Error: {round(mape*100, 2)}%')
Mean Absolute Percentage Error: 17.29%
In [36]:
# Forecast into the future:
forecaster = ThetaForecaster(sp=12)
# relative horizon: the next 11 monthly steps
fh = ForecastingHorizon(list(range(1, 12)))
# Train the model on all of the data
forecaster.fit(crime_df_cleaned)
# Predict the future data
crime_pred = forecaster.predict(fh)
# plot the historical data against the forecast:
plot_series(crime_df_cleaned, crime_pred)
Out[36]:
(<Figure size 1600x400 with 1 Axes>, <Axes: ylabel='Total'>)
Prophet Model:¶
In [39]:
# Remove unwanted columns:
crime_df_fb = crime_df_fb[['Month_Year', 'Total']]
# change Month_Year column to datetime
crime_df_fb['Month_Year'] = pd.to_datetime(crime_df_fb['Month_Year'])
# change Total column to integer
crime_df_fb['Total'] = crime_df_fb['Total'].astype(int)
# Rename columns:
crime_df_fb.rename(columns= {'Month_Year' : 'ds', 'Total': 'y'}, inplace=True)
crime_df_fb['ds'] = crime_df_fb['ds'] + pd.offsets.MonthEnd(0)
crime_df_fb.head()
Out[39]:
ds | y | |
---|---|---|
0 | 2013-01-31 | 38 |
1 | 2013-02-28 | 46 |
2 | 2013-03-31 | 18 |
3 | 2013-04-30 | 19 |
4 | 2013-05-31 | 33 |
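The `pd.offsets.MonthEnd(0)` step above snaps each `ds` value to the end of its month, and leaves dates that are already at month end unchanged. A quick check of that behavior:

```python
import pandas as pd

# a mid-month date rolls forward to that month's end:
snapped = pd.Timestamp('2013-02-01') + pd.offsets.MonthEnd(0)
print(snapped)  # 2013-02-28 00:00:00

# a date already at month end is unchanged:
unchanged = pd.Timestamp('2013-01-31') + pd.offsets.MonthEnd(0)
print(unchanged)  # 2013-01-31 00:00:00
```

This matters for the later filter `crime_df_fb['ds'] < '2024-09-30'`, since month-end stamps compare cleanly against month-end cutoffs.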
In [41]:
train_df = crime_df_fb[(crime_df_fb['ds'] < '2024-09-30')]
In [43]:
lockdowns = pd.DataFrame([
{'holiday': 'lockdown_1', 'ds': '2020-03-31', 'lower_window': 0, 'ds_upper': '2021-09-30'},
])
for t_col in ['ds', 'ds_upper']:
lockdowns[t_col] = pd.to_datetime(lockdowns[t_col])
lockdowns['upper_window'] = (lockdowns['ds_upper'] - lockdowns['ds']).dt.days
lockdowns
Out[43]:
holiday | ds | lower_window | ds_upper | upper_window | |
---|---|---|---|---|---|
0 | lockdown_1 | 2020-03-31 | 0 | 2021-09-30 | 548 |
In [87]:
#lockdowns2 = pd.DataFrame([
# {'holiday': 'lockdown_1', 'ds': '2021-03-01', 'lower_window': 0, 'ds_upper': '2021-09-30'},
#])
#for t_col in ['ds', 'ds_upper']:
# lockdowns2[t_col] = pd.to_datetime(lockdowns2[t_col])
#lockdowns2['upper_window'] = (lockdowns2['ds_upper'] - lockdowns2['ds']).dt.days
#lockdowns2
In [47]:
train_df.reset_index(drop=True, inplace=True)
train_df
Out[47]:
ds | y | |
---|---|---|
0 | 2013-01-31 | 38 |
1 | 2013-02-28 | 46 |
2 | 2013-03-31 | 18 |
3 | 2013-04-30 | 19 |
4 | 2013-05-31 | 33 |
... | ... | ... |
135 | 2024-04-30 | 37 |
136 | 2024-05-31 | 53 |
137 | 2024-06-30 | 51 |
138 | 2024-07-31 | 66 |
139 | 2024-08-31 | 68 |
140 rows × 2 columns
In [49]:
crime_df_fb
Out[49]:
ds | y | |
---|---|---|
0 | 2013-01-31 | 38 |
1 | 2013-02-28 | 46 |
2 | 2013-03-31 | 18 |
3 | 2013-04-30 | 19 |
4 | 2013-05-31 | 33 |
... | ... | ... |
137 | 2024-06-30 | 51 |
138 | 2024-07-31 | 66 |
139 | 2024-08-31 | 68 |
140 | 2024-09-30 | 58 |
141 | 2024-10-31 | 55 |
142 rows × 2 columns
In [51]:
# Post Covid test
post_covid = crime_df_fb[crime_df_fb['ds'] >= '2021-10-01']
post_covid.reset_index(inplace=True, drop = True)
post_covid = post_covid.head(33)
post_covid.tail()
Out[51]:
ds | y | |
---|---|---|
28 | 2024-02-29 | 42 |
29 | 2024-03-31 | 57 |
30 | 2024-04-30 | 37 |
31 | 2024-05-31 | 53 |
32 | 2024-06-30 | 51 |
In [53]:
# set the model seasonality
m = Prophet( holidays=lockdowns, seasonality_mode='multiplicative', changepoint_prior_scale=0.05, changepoint_range=0.95)
#m2 = Prophet(seasonality_mode='multiplicative')
# train the model on the data
m.fit(train_df)
#m2.fit(post_covid)
# set the predicted dates amount
future = m.make_future_dataframe( periods=12, freq='M')
#future2 = m2.make_future_dataframe( periods=12, freq='M')
future.tail()
14:00:18 - cmdstanpy - INFO - Chain [1] start processing
14:00:18 - cmdstanpy - INFO - Chain [1] done processing
14:00:18 - cmdstanpy - INFO - Chain [1] start processing
14:00:19 - cmdstanpy - INFO - Chain [1] done processing
Out[53]:
ds | |
---|---|
147 | 2025-04-30 |
148 | 2025-05-31 |
149 | 2025-06-30 |
150 | 2025-07-31 |
151 | 2025-08-31 |
In [55]:
# predict the next 12 months:
forecast = m.predict(future)
#forecast2 = m2.predict(future2)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
Out[55]:
ds | yhat | yhat_lower | yhat_upper | |
---|---|---|---|---|
147 | 2025-04-30 | 51.835601 | 42.992737 | 61.228994 |
148 | 2025-05-31 | 55.602831 | 46.429305 | 64.305231 |
149 | 2025-06-30 | 50.303951 | 41.353531 | 58.727845 |
150 | 2025-07-31 | 60.545914 | 52.267125 | 70.046841 |
151 | 2025-08-31 | 63.852228 | 54.833038 | 73.447367 |
In [85]:
# add residuals column:
forecast['residuals'] = crime_df_fb['y'] - forecast['yhat']
#forecast2['residuals'] = post_covid['y'] - forecast2['yhat']
# calculate the absolute percentage error per row:
forecast['absolute_percentage_error'] = abs(forecast['residuals'] / crime_df_fb['y']) * 100
#forecast2['absolute_percentage_error'] = abs(forecast2['residuals'] / post_covid['y']) * 100
# calculate the MAPE:
mape = forecast['absolute_percentage_error'].mean()
#mape2 = forecast2['absolute_percentage_error'].mean()
round(mape,2)
Out[85]:
13.38
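One subtlety in the residual cell above: `crime_df_fb['y'] - forecast['yhat']` lines rows up by index label, not by position, and `forecast` has more rows (the future periods) than `crime_df_fb`. A small demonstration of that alignment rule on toy series:

```python
import pandas as pd

actual = pd.Series([38, 46, 18], index=[0, 1, 2])
predicted = pd.Series([40.0, 44.0, 20.0, 55.0], index=[0, 1, 2, 3])

# subtraction aligns on index labels; labels present in only
# one series produce NaN rather than raising an error:
residuals = actual - predicted
print(residuals.tolist())  # [-2.0, 2.0, -2.0, nan]
```

The NaN rows for future dates are then silently skipped by `.mean()`, which is why the MAPE above reflects only the historical period.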
In [61]:
# plot forecast
fig1 = m.plot(forecast)
In [63]:
# plot components
fig1 = m.plot_components(forecast)
In [67]:
# Python
from prophet.plot import add_changepoints_to_plot
fig = m.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), m, forecast)
In [93]:
# Interactive Plot
HTML(plot_plotly(m, forecast).to_html())
Out[93]:
In [155]:
print(f'Mean Absolute Percentage Error: {round(mape, 2)}%')
Mean Absolute Percentage Error: 13.49%
Next Steps to Explore:¶
- Identify and remove outliers from the data.
- Add a regressor for the Covid-19 period.
- Cross-validate the data and fine-tune the metrics.
- Lower the MAPE to make the predictions more accurate.
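The cross-validation step can be prototyped before wiring up Prophet's own `cross_validation` helper (in `prophet.diagnostics`). Below is a rolling-origin sketch on made-up monthly data, with a seasonal-naive stand-in model; origins, horizon, and data are all illustrative assumptions:

```python
import numpy as np
import pandas as pd

# made-up monthly counts over four years:
rng = np.random.default_rng(1)
y = pd.Series(rng.integers(20, 70, size=48),
              index=pd.period_range('2013-01', periods=48, freq='M'))

horizon = 6  # forecast 6 months per fold
fold_mapes = []
# expanding window: cut the series at several origins, forecast the next `horizon` months
for cutoff in range(24, len(y) - horizon + 1, horizon):
    train, test = y.iloc[:cutoff], y.iloc[cutoff:cutoff + horizon]
    # seasonal-naive stand-in: repeat the matching months from the last full training year
    pred = train.iloc[-12:-12 + horizon].to_numpy()
    fold_mapes.append(np.mean(np.abs((test.to_numpy() - pred) / test.to_numpy())) * 100)

# one MAPE per fold; averaging them gives a more stable error estimate
print([round(m, 1) for m in fold_mapes])
```

Swapping the stand-in for a real model (re-fitting Prophet or Theta at each cutoff) gives the same per-fold error structure, which is what makes the averaged MAPE more trustworthy than a single train/test split.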