Building a Data Warehouse with SQL Server
Welcome to the my the SQL Serve Data Warehouse Project!¶
As a Data Analyst expanding into Data Engineering, this project showcases my approach to building a modern data warehouse using SQL Server. It covers ETL processes, data modeling, and analytics, applying best practices I've learned along the way. I'm excited to continue growing in this space and sharing my journey!
Due to the nature of a Data Warehouse, not all SQL scripts are displayed on this page. This serves as an introductory overview of the project.
For full details, including all SQL DDL and Stored Procedure scripts, please visit the GitHub repository: 🔗 SQL Data Warehouse Project
Project Layout:¶
Data Architecture:
Designing A Modern Data Warehouse Useing the Medallion Architecture Bronze, Silver, and Gold layers.
ETL Pipelines:
Etracting, Transforming and Loading data from source system into warehose.
Requirements:¶
Create a modern data warehouse with SQL Server to integrate sales data, supporting analytical reporting and data-driven decision-making.
Design Layers & Data Architecture¶
Before writing any scripts or code, it’s crucial to map out the data layers.
This will provide a clear understanding of the next steps and the responsibilities of each layer.
At this stage, it's important to stick strictly to the intended purpose of each layer.
For instance, the bronze layer is not meant for data transformations. Regardless of how messy the data may appear, I must refrain from performing any data cleaning tasks at this stage.
Create Database and Schemas:¶
Before creating databases and schemas, a Naming Conventions document was established to ensure consistency throughout the project.
You can find this document in the GitHub repository: 🔗 Naming Conventions Document
USE master;
GO
-- Drop and recreate the 'DataWarehouse' database
IF EXISTS (SELECT 1 FROM sys.databases WHERE name = 'DataWarehouse')
BEGIN
ALTER DATABASE DataWarehouse SET SINGLE_USER WITH ROLLBACK IMMMEDIATE;
DROP DATABASE DataWarehouse
END
GO
-- Create the 'DataWarehouse' database
CREATE DATABASE DataWarehouse;
GO
USE DataWarehouse;
GO
-- Create Schemas
CREATE SCHEMA bronze;
GO
CREATE SCHEMA silver;
GO
CREATE SCHEMA gold;
Data Integration¶
The source system files are provided in CSV format.
After reviewing the data, it was determined that there are two main sources: CRM (Customer Relationship Management) and ERP (Enterprise Resource Planning). The Data Integration Mapping provides a high-level overview of how these datasets relate to each other, helping to better prepare for data extraction and/or transformation all layeres.
Bronze Layer:¶
Bronze Layer Overview:
The Bronze layer is where data is extracted from the source systems into the data warehouse.
- Data State: Raw and unprocessed (no transformations, column names remain unchanged from the source).
- Target Objects: Tables within the bronze schema.
- Load Method: Full Load (Truncate & Insert).
- Target Audience: Data Engineers.
This layer serves as the foundation for further processing in the Silver and Gold layers.
Bronze DDL Code Snip:
DROP TABLE IF EXISTS bronze.crm_cust_info;
CREATE TABLE bronze.crm_cust_info (
cst_id INT,
cst_key NVARCHAR(50),
cst_firstname NVARCHAR(50),
cst_lastname NVARCHAR(50),
cst_marital_status NVARCHAR(50),
cst_gndr NVARCHAR(50),
cst_created_date DATE
);
DROP TABLE IF EXISTS bronze.crm_prd_info;
CREATE TABLE bronze.crm_prd_info(
prd_id INT,
prd_key NVARCHAR(50),
prd_nm NVARCHAR(50),
prd_cost INT,
prd_line NVARCHAR(50),
prd_start_dt DATETIME,
prd_end_dt DATETIME
);
DROP TABLE IF EXISTS bronze.crm_sales_details;
CREATE TABLE bronze.crm_sales_details (
sls_ord_num NVARCHAR(50),
sls_prd_key NVARCHAR(50),
sls_cust_id INT,
sls_order_dt INT,
sls_ship_dt INT,
sls_due_dt INT,
sls_sales INT,
sls_quantity INT,
sls_price INT
);
--------------snip----------------
Bronze Stored Procedure Code Snip:
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
DECLARE @start_time DATETIME, @end_time DATETIME, @batch_start_time DATETIME, @batch_end_time DATETIME;
SET @batch_start_time = GETDATE();
BEGIN TRY
PRINT '==============================================================';
PRINT 'Loading Bronze Layer';
PRINT '==============================================================';
PRINT '**************************************************************';
PRINT 'Loading CRM Tables';
PRINT '**************************************************************';
SET @start_time = GETDATE();
PRINT '>> Truncating Table: bronze.crm_cust_info';
TRUNCATE TABLE bronze.crm_cust_info;
PRINT '>> Inserting Data Into: bronze.crm_cust_info';
BULK INSERT bronze.crm_cust_info
FROM 'D:\Data Analytics\Projects\Data Warehouse\source files\sql-data-warehouse-project\datasets\source_crm\cust_info.csv'
WITH(
FIRSTROW = 2,
FIELDTERMINATOR = ',',
TABLOCK
);
SET @end_time = GETDATE();
PRINT '>> Load Duration: ' + CAST(DATEDIFF(second, @start_time, @end_time) AS NVARCHAR) + ' seconds';
PRINT '----------------------------------------------------------------';
--------------snip----------------
Bronze Data Quality Checks After Load Code Snip:
-- Check for Null or Duplicates in Primary Key
-- Expectation: No Result
SELECT prd_id, COUNT(*)
FROM [bronze].[crm_prd_info]
GROUP BY prd_id
HAVING COUNT(*) > 1 OR prd_id IS NULL;
-- Checking for white space
-- Expectation: No Results
SELECT prd_nm
FROM [bronze].[crm_prd_info]
WHERE prd_nm != TRIM(prd_nm)
--Checking for nulls or negative numbers
--Expectations: No Results
SELECT prd_cost
FROM [bronze].[crm_prd_info]
WHERE prd_cost < 0 OR prd_cost IS NULL -- there are nulls
-- Low Cardinality
SELECT DISTINCT prd_line
FROM [bronze].[crm_prd_info]
--------------snip----------------
The end result is that all source files are successfully loaded into the bronze schema in their raw, unprocessed form. These tables can now be queried within MS SQL Server and further transformed, cleansed, and loaded into the silver layer.
Silver Layer:¶
Silver Layer Overview:
The Silver layer is where data is cleaned and extracted from the Bronze layer into the Silver schema tables within the data warehouse.
- Data State: Cleaned and standardized (column names remain unchanged).
- Target Objects: Tables within the silver schema.
- Load Method: Full Load (Truncate & Insert).
- Target Audience: BI Analysts, Data Analysts, Data Scientists, Data Engineers.
This layer provides a structured and standardized version of the Bronze layer, serving as the foundation for further transformations in the Gold layer.
Silver DDL Code Snip:
USE DataWarehouse;
GO
DROP TABLE IF EXISTS silver.crm_cust_info;
CREATE TABLE silver.crm_cust_info (
cst_id INT,
cst_key NVARCHAR(50),
cst_firstname NVARCHAR(50),
cst_lastname NVARCHAR(50),
cst_marital_status NVARCHAR(50),
cst_gndr NVARCHAR(50),
cst_created_date DATE,
dwh_create_date DATETIME2 DEFAULT GETDATE()
);
DROP TABLE IF EXISTS silver.crm_prd_info;
CREATE TABLE silver.crm_prd_info(
prd_id INT,
cat_id NVARCHAR(50),
prd_key NVARCHAR(50),
prd_nm NVARCHAR(50),
prd_cost INT,
prd_line NVARCHAR(50),
prd_start_dt DATE,
prd_end_dt DATE,
dwh_create_date DATETIME2 DEFAULT GETDATE()
);
DROP TABLE IF EXISTS silver.crm_sales_details;
CREATE TABLE silver.crm_sales_details (
sls_ord_num NVARCHAR(50),
sls_prd_key NVARCHAR(50),
sls_cust_id INT,
sls_order_dt INT,
sls_ship_dt INT,
sls_due_dt INT,
sls_sales INT,
sls_quantity INT,
sls_price INT,
dwh_create_date DATETIME2 DEFAULT GETDATE()
);
--------------snip----------------
Silver Stored Procedure Code Snip:
CREATE OR ALTER PROCEDURE silver.load_silver AS
BEGIN
DECLARE @start_time DATETIME, @end_time DATETIME, @batch_start_time DATETIME, @batch_end_time DATETIME;
SET @batch_start_time = GETDATE()
BEGIN TRY
PRINT '=============================================================================================';
PRINT 'Loading Silver Layer';
PRINT '=============================================================================================';
PRINT '*********************************************************************************************';
PRINT 'Loading CRM Tables';
PRINT '*********************************************************************************************';
-- Loading cleaned crm_cust_info to the silver ddl
SET @start_time = GETDATE();
TRUNCATE TABLE silver.crm_cust_info;
PRINT 'Inserting Data Into: silver.crm_cust_info';
INSERT INTO silver.crm_cust_info (
[cst_id]
,[cst_key]
,[cst_firstname]
,[cst_lastname]
,[cst_marital_status]
,[cst_gndr]
,[cst_created_date])
SELECT
[cst_id]
,[cst_key]
,TRIM([cst_firstname]) AS [cst_firstname]
,TRIM([cst_lastname]) AS [cst_lastname]
,CASE
WHEN UPPER(TRIM([cst_marital_status])) = 'S' THEN 'Single'
WHEN UPPER(TRIM([cst_marital_status])) = 'M' THEN 'Married'
ELSE 'n/a'
END as [cst_marital_status] -- normnalize marital status values to readable format.
,CASE
WHEN UPPER(TRIM([cst_gndr])) = 'F' THEN 'Female'
WHEN UPPER(TRIM([cst_gndr])) = 'M' THEN 'Male'
ELSE 'n/a'
END as [cst_gndr]-- normnalize gender status values to readable format.
,[cst_created_date]
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY cst_id ORDER BY cst_created_date DESC) flag_last
FROM [DataWarehouse].[bronze].[crm_cust_info]
WHERE cst_id IS NOT NULL) as t
WHERE flag_last = 1; -- Select most recent record per customer.
SET @end_time = GETDATE();
PRINT '>> Load Duration: ' + CAST(DATEDIFF(SECOND, @start_time, @end_time) AS NVARCHAR) + ' seconds';
PRINT '------------------------------------------------------------------------------------------------';
--------------snip----------------
Silver Data Quality Checks After Load Code Snip:
-- ============================================ ERP_CUST_AZ12 ================================================= --
-- Check for NAS prefix from the cid column or if another prefix is populated.
SELECT cid
FROM [DataWarehouse].silver.[erp_cust_az12]
WHERE cid LIKE 'NAS%' OR LEN(cid) >10
-- Check the extremes of the bdate that are beyond the current year
-- Ensure that there's not birthdate in the future (beyond the current year)
SELECT [bdate]
FROM [DataWarehouse].silver.[erp_cust_az12]
WHERE [bdate] < '1925-01-01' OR [bdate] > GETDATE()
ORDER BY 1
-- Low cardinality Check for gender column
-- Ensure distinct values are unrelated from each other (f, female etc.)
SELECT DISTINCT [gen]
FROM [DataWarehouse].silver.[erp_cust_az12];
-- Final Review of table:
SELECT TOP (100) *
FROM [DataWarehouse].[silver].[erp_cust_az12];
--------------snip----------------
The end result of the Silver layer is that all tables from the Bronze schema are now cleaned, standardized and loaded into the silver schema, making them ready for further transformations to become business-ready in the Gold layer.
Gold Layer:¶
Gold Layer Overview:
The Gold layer is where data is transformed and prepared for business-ready analysis and reporting.
- Data State: Cleaned, transformed, and business-ready.
- Target Objects: Views within the Gold schema.
- Load Method: None (Views)
- Data Model: Star Schema (Aggreations, Business Logic & Rules)
- Target Audience: BI Analysts, Data Analysts, Data Scientists, Business Users .
This layer provides curated, business-ready data, enriched with business rules and logic for advanced analysis and reporting.
Gold View Creation Script Code Snip:
/*==========================================Customers============================================*/
DROP VIEW IF EXISTS gold.dim_customers;
GO
-- create dim_customer view
CREATE VIEW gold.dim_customers AS
SELECT
ROW_NUMBER() OVER(ORDER BY [cst_id]) AS customer_key
,ci.[cst_id] AS customer_id
,ci.[cst_key] AS customer_number
,ci.[cst_firstname] AS first_name
,ci.[cst_lastname] AS last_name
,cl.cntry AS country
,ci.[cst_marital_status] AS marital_status
,CASE
WHEN ci.cst_gndr != 'n/a' THEN cst_gndr -- CRM is the master for gender info.
ELSE COALESCE(bd.gen , 'n/a')
END AS gender
,bd.bdate AS birth_date
,ci.[cst_created_date] AS create_date
FROM [DataWarehouse].[silver].[crm_cust_info] AS ci
LEFT JOIN silver.erp_cust_az12 as bd
ON ci.cst_key = bd.cid
LEFT JOIN silver.erp_loc_a101 as cl
ON ci.cst_key = cl.cid;
GO
/*==========================================Products============================================*/
DROP VIEW IF EXISTS gold.dim_products;
GO
-- create dim_products view
CREATE VIEW gold.dim_products AS
SELECT
ROW_NUMBER() OVER(ORDER BY p.[prd_start_dt], p.[prd_nm]) AS product_key
,p.[prd_id] AS product_id
,p.[prd_key] AS product_number
,p.[prd_nm] AS product_name
,p.[cat_id] AS category_id
,COALESCE(c.cat, 'n/a') AS category_name
,COALESCE(c.subcat, 'n/a') AS subcategory
,COALESCE(c.maintenance, 'n/a') AS maintenance
,p.[prd_cost] AS product_cost
,p.[prd_line] AS prodcut_line
,p.[prd_start_dt] AS start_date
FROM [DataWarehouse].[silver].[crm_prd_info] as p
LEFT JOIN silver.erp_px_cat_g1v2 as c
ON p.cat_id = c.id
WHERE p.prd_end_dt IS NULL;-- filter out historical data
GO
--------------snip----------------
Gold Data Quality Checks After Load Code Snip:
/*==========================================Customers============================================*/
-- Check for duplicated customer_id rows
-- Expectation: No result
SELECT customer_id, COUNT(*)
FROM [gold].[dim_customers]
GROUP BY customer_id
HAVING COUNT(*) > 1
-- Check distcint values of gender columns
-- Expectation: Only Distinct and undrelated values should appear.(Male, Female, n/a)
SELECT DISTINCT gender
FROM [gold].[dim_customers]
-- Final Review:
SELECT *
FROM [gold].[dim_customers]
/*==========================================Products============================================*/
-- Check for duplicated product_id rows
-- Expectation: No result
SELECT product_id, COUNT(*)
FROM [gold].[dim_products]
GROUP BY product_id
HAVING COUNT(*) > 1
-- Check for NULL values in category_name, subcategory and maintenance columns
-- Expectation: No Result
SELECT *
FROM [gold].[dim_products]
WHERE category_name IS NULL OR subcategory IS NULL OR maintenance IS NULL;
-- Final Review:
SELECT *
FROM [gold].[dim_products];
/*==========================================Products============================================*/
-- Check if fact view joins correctly with dim views
-- Expectation 1 : all columns from gold views (fact_sales, dim_customer & dim_products)
-- Expectation 2 : No result
SELECT *
FROM gold.fact_sales as f
LEFT JOIN gold.dim_customers c
ON f.customer_key = c.customer_key
LEFT JOIN gold.dim_products as p
ON f.product_key = p.product_key
WHERE c.customer_key IS NULL -- Check if dim_customer has any customer_key not found in fact_sales.
OR
p.product_key IS NULL-- Check if dim_products has any product_key not found in fact_sales.
Star Schema Diagram:
Final Update to Data Flow Diagram:
Data Catalog:
A Data Catalog improves data discovery, governance, quality, collaboration, and efficiency. It helps users quickly find and trust data, ensures compliance, promotes knowledge sharing, and boosts productivity by reducing time spent searching for data. Ultimately, it enables better decision-making and self-service analytics.
The Data Catalog for this project can be found 🔗HERE
END OF PROJECT:¶
Thank you for taking the time to view this project! This was my first end-to-end attempt at building a data warehouse, and throughout the process, I’ve gained valuable insights—especially on the importance of proper documentation. When done correctly, it makes the entire project more structured, efficient, and even enjoyable.
As a Data Analyst, part of my role is to break down complex tasks into manageable steps and work toward a solution. Documentation played a key role in making that possible.
A special thanks to Baraa Khatib Salkini for his content and in-depth guidance on this project. The dataset was provided by him, and you can find more about his work 🔗 here.
Next Steps:
As data roles continue to evolve and overlap, venturing into Data Engineering has been an exciting experience. It’s definitely something I plan to learn more about and incorporate into my work whenever possible.
That said, I still enjoy being a Data Analyst, and now that this project has produced fully business-ready data, the next step is to extract meaningful insights—which I look forward to doing in my upcoming projects. 🚀