Data Warehouse - Data Architecture (dark).drawio.png

Welcome to the my the SQL Serve Data Warehouse Project!¶

As a Data Analyst expanding into Data Engineering, this project showcases my approach to building a modern data warehouse using SQL Server. It covers ETL processes, data modeling, and analytics, applying best practices I've learned along the way. I'm excited to continue growing in this space and sharing my journey!

Due to the nature of a Data Warehouse, not all SQL scripts are displayed on this page. This serves as an introductory overview of the project.

For full details, including all SQL DDL and Stored Procedure scripts, please visit the GitHub repository: 🔗 SQL Data Warehouse Project

Project Layout:¶

Data Architecture:
Designing A Modern Data Warehouse Useing the Medallion Architecture Bronze, Silver, and Gold layers.

ETL Pipelines:
Etracting, Transforming and Loading data from source system into warehose.

Requirements:¶

Create a modern data warehouse with SQL Server to integrate sales data, supporting analytical reporting and data-driven decision-making.

Data Sources: Import data from two source systems (ERP and CRM) provided as CSV files.

Data Quality: Cleanse and resolve data quality issues prior to analysis.

Integration: Combine both sources into a single, user-friendly data model designed for analytical queries.

Scope: Focus on the latest dataset only; historization of data is not required.

Documentation: Provide clear documentation of the data model to support both business stakeholders and analytics teams.

Design Layers & Data Architecture¶

Before writing any scripts or code, it’s crucial to map out the data layers.
This will provide a clear understanding of the next steps and the responsibilities of each layer.

At this stage, it's important to stick strictly to the intended purpose of each layer.
For instance, the bronze layer is not meant for data transformations. Regardless of how messy the data may appear, I must refrain from performing any data cleaning tasks at this stage.

Data Architecture - Design Layers.drawio.png

Create Database and Schemas:¶

Before creating databases and schemas, a Naming Conventions document was established to ensure consistency throughout the project.
You can find this document in the GitHub repository: 🔗 Naming Conventions Document

In [ ]:


  
  USE master;
  GO
  
  -- Drop and recreate the 'DataWarehouse' database
  IF EXISTS (SELECT 1 FROM sys.databases WHERE name = 'DataWarehouse')
  BEGIN 
  	ALTER DATABASE DataWarehouse SET SINGLE_USER WITH ROLLBACK IMMMEDIATE;
  	DROP DATABASE DataWarehouse
  END 
  GO
  
  -- Create the 'DataWarehouse' database
  CREATE DATABASE DataWarehouse;
  GO
  
  USE DataWarehouse;
  GO
  
  -- Create Schemas 
  CREATE SCHEMA bronze;
  GO
  
  CREATE SCHEMA silver;
  GO
  
  CREATE SCHEMA gold;

Data Integration¶

The source system files are provided in CSV format.

After reviewing the data, it was determined that there are two main sources: CRM (Customer Relationship Management) and ERP (Enterprise Resource Planning). The Data Integration Mapping provides a high-level overview of how these datasets relate to each other, helping to better prepare for data extraction and/or transformation all layeres.

Bronze Layer:¶

Bronze Layer Overview:

The Bronze layer is where data is extracted from the source systems into the data warehouse.

Data State: Raw and unprocessed (no transformations, column names remain unchanged from the source).
Target Objects: Tables within the bronze schema.
Load Method: Full Load (Truncate & Insert).
Target Audience: Data Engineers.

This layer serves as the foundation for further processing in the Silver and Gold layers.

Bronze DDL Code Snip:

In [ ]:

 
      
  DROP TABLE IF EXISTS bronze.crm_cust_info;
  CREATE TABLE bronze.crm_cust_info (
  	cst_id INT, 
  	cst_key NVARCHAR(50), 
  	cst_firstname NVARCHAR(50),
  	cst_lastname NVARCHAR(50),
  	cst_marital_status NVARCHAR(50),
  	cst_gndr NVARCHAR(50),
  	cst_created_date DATE
  );
  
  
  DROP TABLE IF EXISTS bronze.crm_prd_info;
  CREATE TABLE bronze.crm_prd_info(
  
  	prd_id INT,
  	prd_key NVARCHAR(50),
  	prd_nm NVARCHAR(50),
  	prd_cost INT,
  	prd_line NVARCHAR(50),
  	prd_start_dt DATETIME,
  	prd_end_dt DATETIME
  );
  
  
  DROP TABLE IF EXISTS bronze.crm_sales_details;
  CREATE TABLE bronze.crm_sales_details (
  
  	sls_ord_num NVARCHAR(50),
  	sls_prd_key NVARCHAR(50),
  	sls_cust_id INT,
  	sls_order_dt INT,
  	sls_ship_dt INT,
  	sls_due_dt INT,
  	sls_sales INT,
  	sls_quantity INT,
  	sls_price INT
  );
  
  --------------snip----------------

Bronze Stored Procedure Code Snip:

In [ ]:


  
  CREATE OR ALTER PROCEDURE bronze.load_bronze AS
  
  BEGIN
  	DECLARE @start_time DATETIME, @end_time DATETIME, @batch_start_time DATETIME, @batch_end_time DATETIME;
  	SET @batch_start_time = GETDATE();
  	BEGIN TRY
  		PRINT '==============================================================';
  		PRINT 'Loading Bronze Layer';
  		PRINT '==============================================================';
  
  
  		PRINT '**************************************************************';
  		PRINT 'Loading CRM Tables';
  		PRINT '**************************************************************';
  
  		SET @start_time = GETDATE();
  		PRINT '>> Truncating Table: bronze.crm_cust_info';
  		TRUNCATE TABLE bronze.crm_cust_info;
  
  		PRINT '>> Inserting Data Into: bronze.crm_cust_info';
  		BULK INSERT bronze.crm_cust_info
  		FROM 'D:\Data Analytics\Projects\Data Warehouse\source files\sql-data-warehouse-project\datasets\source_crm\cust_info.csv'
  		WITH(
  			FIRSTROW = 2,
  			FIELDTERMINATOR = ',',
  			TABLOCK
  		);
  		SET @end_time = GETDATE();
  		PRINT '>> Load Duration: ' + CAST(DATEDIFF(second, @start_time, @end_time) AS NVARCHAR) + ' seconds';
  		PRINT '----------------------------------------------------------------';
  
  --------------snip----------------

Bronze Data Quality Checks After Load Code Snip:

In [ ]:


  
    -- Check for Null or Duplicates in Primary Key
    -- Expectation: No Result
    SELECT prd_id, COUNT(*)
    FROM [bronze].[crm_prd_info]
    GROUP BY prd_id
    HAVING COUNT(*) > 1 OR prd_id IS NULL;
  
  
  -- Checking for white space
  -- Expectation: No Results
  
  SELECT prd_nm
  FROM [bronze].[crm_prd_info]
  WHERE prd_nm != TRIM(prd_nm)
  
  
  --Checking for nulls or negative numbers
  --Expectations: No Results
  
  SELECT prd_cost
  FROM [bronze].[crm_prd_info]
  WHERE prd_cost < 0 OR prd_cost IS NULL -- there are nulls
  
  
  -- Low Cardinality 
  SELECT DISTINCT prd_line
  FROM [bronze].[crm_prd_info]
  
  --------------snip----------------

The end result is that all source files are successfully loaded into the bronze schema in their raw, unprocessed form. These tables can now be queried within MS SQL Server and further transformed, cleansed, and loaded into the silver layer.

Silver Layer:¶

Silver Layer Overview:

The Silver layer is where data is cleaned and extracted from the Bronze layer into the Silver schema tables within the data warehouse.

Data State: Cleaned and standardized (column names remain unchanged).
Target Objects: Tables within the silver schema.
Load Method: Full Load (Truncate & Insert).
Target Audience: BI Analysts, Data Analysts, Data Scientists, Data Engineers.

This layer provides a structured and standardized version of the Bronze layer, serving as the foundation for further transformations in the Gold layer.

Silver DDL Code Snip:

In [ ]:


  
  USE DataWarehouse;
  GO
  
  
  DROP TABLE IF EXISTS silver.crm_cust_info;
  CREATE TABLE silver.crm_cust_info (
  	cst_id INT, 
  	cst_key NVARCHAR(50), 
  	cst_firstname NVARCHAR(50),
  	cst_lastname NVARCHAR(50),
  	cst_marital_status NVARCHAR(50),
  	cst_gndr NVARCHAR(50),
  	cst_created_date DATE,
  	dwh_create_date DATETIME2 DEFAULT GETDATE()
  );
  
  
  DROP TABLE IF EXISTS silver.crm_prd_info;
  CREATE TABLE silver.crm_prd_info(
  
  	prd_id INT,
  	cat_id NVARCHAR(50),
  	prd_key NVARCHAR(50),
  	prd_nm NVARCHAR(50),
  	prd_cost INT,
  	prd_line NVARCHAR(50),
  	prd_start_dt DATE,
  	prd_end_dt DATE,
  	dwh_create_date DATETIME2 DEFAULT GETDATE()
  
  );
  
  
  DROP TABLE IF EXISTS silver.crm_sales_details;
  CREATE TABLE silver.crm_sales_details (
  
  	sls_ord_num NVARCHAR(50),
  	sls_prd_key NVARCHAR(50),
  	sls_cust_id INT,
  	sls_order_dt INT,
  	sls_ship_dt INT,
  	sls_due_dt INT,
  	sls_sales INT,
  	sls_quantity INT,
  	sls_price INT,
  	dwh_create_date DATETIME2 DEFAULT GETDATE()
  );
  
  --------------snip----------------

Silver Stored Procedure Code Snip:

In [ ]:


  
  CREATE OR ALTER PROCEDURE silver.load_silver AS 
  BEGIN
  	DECLARE @start_time DATETIME, @end_time DATETIME, @batch_start_time DATETIME, @batch_end_time DATETIME;
  	SET @batch_start_time = GETDATE()
  	BEGIN TRY
  		PRINT '=============================================================================================';
  		PRINT 'Loading Silver Layer';
  		PRINT '=============================================================================================';
  
  		PRINT '*********************************************************************************************';
  		PRINT 'Loading CRM Tables';
  		PRINT '*********************************************************************************************';
  
  
  		-- Loading cleaned crm_cust_info to the silver ddl
  
  		SET @start_time = GETDATE();
  		TRUNCATE TABLE silver.crm_cust_info;
  		PRINT 'Inserting Data Into: silver.crm_cust_info';
  		INSERT INTO silver.crm_cust_info (
  			[cst_id]
  			,[cst_key]
  			,[cst_firstname]
  			,[cst_lastname]
  			,[cst_marital_status]
  			,[cst_gndr]
  			,[cst_created_date])
  
  		SELECT 
  			[cst_id]
  		   ,[cst_key]
  		   ,TRIM([cst_firstname]) AS [cst_firstname]
  		   ,TRIM([cst_lastname]) AS [cst_lastname]
  		   ,CASE
  				WHEN UPPER(TRIM([cst_marital_status])) = 'S' THEN 'Single'
  				WHEN UPPER(TRIM([cst_marital_status])) = 'M' THEN 'Married'
  				ELSE 'n/a'
  			END as [cst_marital_status] -- normnalize marital status values to readable format.
  		   ,CASE
  				WHEN UPPER(TRIM([cst_gndr])) = 'F' THEN 'Female'
  				WHEN UPPER(TRIM([cst_gndr])) = 'M' THEN 'Male'
  				ELSE 'n/a'
  			END as [cst_gndr]-- normnalize gender status values to readable format.
  		   ,[cst_created_date]
  
  		FROM (
  		SELECT *, ROW_NUMBER() OVER(PARTITION BY cst_id ORDER BY cst_created_date DESC) flag_last
  		FROM [DataWarehouse].[bronze].[crm_cust_info]
  		WHERE cst_id IS NOT NULL) as t
  		WHERE flag_last = 1; -- Select most recent record per customer.
  		SET @end_time = GETDATE();
  		PRINT '>> Load Duration: ' + CAST(DATEDIFF(SECOND, @start_time, @end_time) AS NVARCHAR) + ' seconds';
  		PRINT '------------------------------------------------------------------------------------------------';
  
  --------------snip----------------

Silver Data Quality Checks After Load Code Snip:

In [ ]:


  
  -- ============================================ ERP_CUST_AZ12 ================================================= --
  
  
  -- Check for NAS prefix from the cid column or if another prefix is populated.
  
  SELECT cid
   FROM [DataWarehouse].silver.[erp_cust_az12]
   WHERE cid LIKE 'NAS%' OR LEN(cid) >10
  
  
  
  -- Check the extremes of the bdate that are beyond the current year
   -- Ensure that there's not birthdate in the future (beyond the current year)
   SELECT [bdate]
   FROM [DataWarehouse].silver.[erp_cust_az12]
   WHERE [bdate] < '1925-01-01' OR [bdate] > GETDATE()
   ORDER BY 1
  
  
  
  -- Low cardinality Check for gender column
  -- Ensure distinct values are unrelated from each other (f, female etc.)
  SELECT DISTINCT	 [gen]
  FROM [DataWarehouse].silver.[erp_cust_az12];
  
  
  
  -- Final Review of table:
  SELECT TOP (100) *
  FROM [DataWarehouse].[silver].[erp_cust_az12];
  
  --------------snip----------------

The end result of the Silver layer is that all tables from the Bronze schema are now cleaned, standardized and loaded into the silver schema, making them ready for further transformations to become business-ready in the Gold layer.

Gold Layer:¶

Gold Layer Overview:

The Gold layer is where data is transformed and prepared for business-ready analysis and reporting.

Data State: Cleaned, transformed, and business-ready.
Target Objects: Views within the Gold schema.
Load Method: None (Views)
Data Model: Star Schema (Aggreations, Business Logic & Rules)
Target Audience: BI Analysts, Data Analysts, Data Scientists, Business Users .

This layer provides curated, business-ready data, enriched with business rules and logic for advanced analysis and reporting.

Gold View Creation Script Code Snip:

In [ ]:


  
  /*==========================================Customers============================================*/
  
  DROP VIEW IF EXISTS gold.dim_customers;
  GO
  -- create dim_customer view
  CREATE VIEW gold.dim_customers AS
  
  
  SELECT
  	 ROW_NUMBER() OVER(ORDER BY [cst_id]) AS customer_key
  	,ci.[cst_id] AS customer_id
  	,ci.[cst_key] AS customer_number
  	,ci.[cst_firstname] AS first_name
  	,ci.[cst_lastname] AS last_name
  	,cl.cntry AS country
  	,ci.[cst_marital_status] AS marital_status
  	,CASE 
  		WHEN ci.cst_gndr != 'n/a' THEN cst_gndr -- CRM is the master for gender info.
  		ELSE COALESCE(bd.gen , 'n/a')
  		END AS gender
  	,bd.bdate AS birth_date
  	,ci.[cst_created_date] AS create_date
  	
  
  FROM [DataWarehouse].[silver].[crm_cust_info] AS ci
  	LEFT JOIN  silver.erp_cust_az12 as bd
  		ON ci.cst_key = bd.cid
  	LEFT JOIN  silver.erp_loc_a101 as cl
  		ON ci.cst_key = cl.cid;
  GO
  
  /*==========================================Products============================================*/
  
  DROP VIEW IF EXISTS gold.dim_products;
  GO
  -- create dim_products view
  CREATE VIEW gold.dim_products AS 
  
  
  SELECT 
  	 ROW_NUMBER() OVER(ORDER BY p.[prd_start_dt], p.[prd_nm]) AS product_key
  	,p.[prd_id] AS product_id
  	,p.[prd_key] AS product_number
  	,p.[prd_nm] AS product_name
      ,p.[cat_id] AS category_id
  	,COALESCE(c.cat, 'n/a') AS category_name
  	,COALESCE(c.subcat, 'n/a') AS subcategory
  	,COALESCE(c.maintenance, 'n/a') AS maintenance
      ,p.[prd_cost] AS product_cost
      ,p.[prd_line] AS prodcut_line
      ,p.[prd_start_dt] AS start_date
  
  FROM [DataWarehouse].[silver].[crm_prd_info] as p
  	LEFT JOIN silver.erp_px_cat_g1v2 as c
  	ON p.cat_id = c.id
  WHERE p.prd_end_dt IS NULL;-- filter out historical data
  GO
  
  --------------snip----------------

Gold Data Quality Checks After Load Code Snip:

In [ ]:


  
  /*==========================================Customers============================================*/
  
  
  -- Check for duplicated customer_id rows 
  -- Expectation: No result
  SELECT customer_id, COUNT(*)
  FROM [gold].[dim_customers]
  GROUP BY customer_id 
  HAVING COUNT(*) > 1
  
  
  
  -- Check distcint values of gender columns
  -- Expectation: Only Distinct and undrelated values should appear.(Male, Female, n/a)
  SELECT DISTINCT gender
  FROM [gold].[dim_customers]
  
  
  -- Final Review: 
  SELECT *
  FROM [gold].[dim_customers]
  
  
  /*==========================================Products============================================*/
  
  
  -- Check for duplicated product_id rows 
  -- Expectation: No result
  SELECT product_id, COUNT(*)
  FROM [gold].[dim_products]
  GROUP BY product_id 
  HAVING COUNT(*) > 1
  
  
  -- Check for NULL values in category_name, subcategory and maintenance columns
  -- Expectation: No Result
  SELECT *
  FROM [gold].[dim_products]
  WHERE category_name IS NULL OR subcategory IS NULL OR maintenance IS NULL;
  
  
  -- Final Review: 
  SELECT *
  FROM [gold].[dim_products];
  
  
  /*==========================================Products============================================*/
  
  -- Check if fact view joins correctly with dim views
  -- Expectation 1 : all columns from gold views (fact_sales, dim_customer & dim_products)
  -- Expectation 2 : No result
  SELECT *
  FROM gold.fact_sales as f 
  LEFT JOIN gold.dim_customers c
  	ON f.customer_key = c.customer_key
  LEFT JOIN gold.dim_products as p
  	ON f.product_key = p.product_key
  WHERE c.customer_key IS NULL -- Check if dim_customer has any customer_key not found in fact_sales.
  OR 
  p.product_key IS NULL-- Check if dim_products has any product_key not found in fact_sales.

Star Schema Diagram:

Final Update to Data Flow Diagram:

Data Catalog:

A Data Catalog improves data discovery, governance, quality, collaboration, and efficiency. It helps users quickly find and trust data, ensures compliance, promotes knowledge sharing, and boosts productivity by reducing time spent searching for data. Ultimately, it enables better decision-making and self-service analytics.

The Data Catalog for this project can be found 🔗HERE

END OF PROJECT:¶

Thank you for taking the time to view this project! This was my first end-to-end attempt at building a data warehouse, and throughout the process, I’ve gained valuable insights—especially on the importance of proper documentation. When done correctly, it makes the entire project more structured, efficient, and even enjoyable.

As a Data Analyst, part of my role is to break down complex tasks into manageable steps and work toward a solution. Documentation played a key role in making that possible.

A special thanks to Baraa Khatib Salkini for his content and in-depth guidance on this project. The dataset was provided by him, and you can find more about his work 🔗 here.

Next Steps:

As data roles continue to evolve and overlap, venturing into Data Engineering has been an exciting experience. It’s definitely something I plan to learn more about and incorporate into my work whenever possible.

That said, I still enjoy being a Data Analyst, and now that this project has produced fully business-ready data, the next step is to extract meaningful insights—which I look forward to doing in my upcoming projects. 🚀

Kevin's Portfolio

Building a Data Warehouse with SQL Server

Welcome to the my the SQL Serve Data Warehouse Project!¶

Project Layout:¶

Requirements:¶

Design Layers & Data Architecture¶

Create Database and Schemas:¶

Data Integration¶

Bronze Layer:¶

Silver Layer:¶

Gold Layer:¶

END OF PROJECT:¶