Analyzing User Behaviour on Instagram - Python
2023-05-05

Project Layout:¶
This project is a take on Hi Counselor's Analysing User Behaviour on Instagram Using Python with Jupyter Notebook and MSSQL (Microsoft SQL Server) instead of their platform and MySQL. Additionally, all queries and SQL scripts are done directly in the notebook using python.
Parts:¶
The project is split into 2 parts.
- Part 1: Simple Data cleaning with Python.
- Part 2: Connecting, Reading & Writing to a Database using Python & Creating Visualizations Using Python.
Tasks¶
Each part of the project will have a subset of tasks. While parts 1 & 2 tasks are taken directly from Hi Counselor's project , Part 2 also contains python visualizations with Seaborn from the applicable queries.
Constraints¶
The datasets used are third-party however, the source is unknown.
The data is not current as the years are from 2016 - 2017.
2 Questions/Task from the original project was dropped:
- Total Posts by users (longer version of SELECT COUNT()FROM photos) - This was deemed redundant as it is answered in Part 2, task 5.*
- Find the percentage of our users who have either never commented on a photo or have commented on photos before. - This was dropped as the task is not clear. Users who have never commented on a photo or users who have commented on photos are all users, which is answered in the result of part 2, task 5.
- Total Posts by users (longer version of SELECT COUNT()FROM photos) - This was deemed redundant as it is answered in Part 2, task 5.*
# Importing librariesfor the entire project.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyodbc
import warnings
# Turning off warnings
warnings.filterwarnings('ignore')
# Setting option to display all rows when a df is colled
pd.set_option('display.max_rows', None)
# Setting the theme for the seaborn plots
sns.set_theme(style='dark')
# Importing all necessary csv's as dataframes and storing them in a dictionary.
# placing the names of the files to import into a list.
files = ['comments.csv', 'follows.csv', 'likes.csv',
'photo_tags.csv', 'photos.csv','tags.csv','users.csv']
# placing the names to use for each dataframe into a list.
dataframes = ['comments', 'follows', 'likes',
'photo_tags','photos', 'tags', 'users']
# create an empty dictionary to store the dataframes and their names
dfs = {}
# Looping through the file and names and importing each file into a dataframe
for file, name in zip(files, dataframes):
df = pd.read_csv(file)
dfs[name] = df
Task 1:
Rename columns, only these columns are allowed in the dataset
1. id
2. comment_text
3. user_id
4. photo_id
5. created_at
# Get a list of the columns in the dataframe
dfs['comments'].columns
Index(['id', 'comment', 'User id', 'Photo id', 'created Timestamp', 'posted date', 'emoji used', 'Hashtags used count'], dtype='object')
# Remove unwanted columns and store the dataframe outside of the dictionary
comments = dfs['comments'][['id', 'comment', 'User id', 'Photo id', 'created Timestamp']]
# Rename the columns to the desired names in outlined in task 1
comments.rename(columns={
'comment': 'comment_text',
'User id': 'user_id',
'Photo id': 'photo_id',
'created Timestamp': 'created_at'
}, inplace=True)
Task 2:
Rename columns, only these columns are allowed in the dataset
1. follower_id
2. followee_id
3. created_at
# Get a list of the columns in the dataframe
dfs['follows'].columns
Index(['follower', 'followee ', 'created time', 'is follower active', 'followee Acc status'], dtype='object')
# Remove unwanted columns and store the dataframe outside of the dictionary.
follows = dfs['follows'][['follower', 'followee ', 'created time']]
# Rename the columns to the desired names in outlined in task 2
follows.rename(columns={
'follower': 'follower_id',
'followee ': 'followee_id',
'created time': 'created_at'
}, inplace=True)
Task 3:
Rename columns, only these columns are allowed in the dataset
1. user_id
2. photo_id
3. created_at
# Get a list of the columns in the dataframe
dfs['likes'].columns
Index(['user ', 'photo', 'created time', 'following or not', 'like type'], dtype='object')
# Remove unwanted columns and store the dataframe outside of the dictionary.
likes = dfs['likes'][['user ', 'photo', 'created time']]
# Rename the columns to the desired names in outlined in task 3
likes.rename(columns={
'user ': 'user_id',
'photo': 'photo_id',
'created time': 'created_at'
}, inplace=True)
Task 4:
Rename columns, only these columns are allowed in the dataset
1. photo_id
2. tag_id
# Get a list of the columns in the dataframe
dfs['photo_tags'].columns
Index(['photo', 'tag ID', 'user id'], dtype='object')
# Remove unwanted columns and store the dataframe outside of the dictionary
photo_tags = dfs['photo_tags'][['photo', 'tag ID']]
# Rename the columns to the desired names in outlined in task 4.
photo_tags.rename(columns={
'photo': 'photo_id',
'tag ID': 'tag_id'
}, inplace=True)
Task 5:
Rename columns, only these columns are allowed in the dataset
1. id
2. image_url
3. user_id
4. created_date
# Get a list of the columns in the dataframe
dfs['photos'].columns
Index(['id', 'image link', 'user ID', 'created dat', 'Insta filter used', 'photo type'], dtype='object')
# Remove unwanted columns and store the dataframe outside of the dictionary
photos = dfs['photos'][['id', 'image link', 'user ID', 'created dat']]
# Rename the columns to the desired names in outlined in task 5.
photos.rename(columns={
'image link': 'image_url',
'user ID': 'user_id',
'created dat': 'created_date'
}, inplace=True)
Task 6:
rename columns, only these columns are allowed in the dataset
1. id
2. tag_name
3. created_at
# Get a list of the columns in the dataframe
dfs['tags'].columns
Index(['id', 'tag text', 'created time', 'location'], dtype='object')
# Remove unwanted columns and store the dataframe outside of the dictionary
tags = dfs['tags'][['id', 'tag text', 'created time']]
# Rename the columns to the desired names in outlined in task 6
tags.rename(columns={
'tag text': 'tag_name',
'created time': 'created_at'
}, inplace=True)
Task 7:
rename columns, only these columns are allowed in the dataset
1. id
2. username
3. created_at
dfs['users'].columns
Index(['id', 'name', 'created time', 'private/public', 'post count', 'Verified status'], dtype='object')
# Remove unwanted columns and store the dataframe outside of the dictionary
users = dfs['users'][['id', 'name', 'created time']]
# Rename the columns to the desired names in outlined in task 7
users.rename(columns={
'name': 'username',
'created time': 'created_at'
}, inplace=True)
# Changing the date format to a python date time so SQL can read the data when uploading into the database
comments['created_at'] = pd.to_datetime(comments['created_at'], format='%d-%m-%Y %H:%M')
follows['created_at'] = pd.to_datetime(follows['created_at'], format='%d-%m-%Y %H:%M')
likes['created_at'] = pd.to_datetime(likes['created_at'], format='%d-%m-%Y %H:%M')
photos['created_date'] = pd.to_datetime(photos['created_date'], format='%d-%m-%Y %H:%M')
tags['created_at'] = pd.to_datetime(tags['created_at'], format='%d-%m-%Y %H:%M')
users['created_at'] = pd.to_datetime(users['created_at'], format='%d-%m-%Y %H:%M')
Part 2: Connecting, Reading & Writing to a Database using Python¶
In this section, data manipulation will be done using SQL with python, while we can do the same by just using the pandas dataframes already created, the intent here is to utilize SQL queries.
# Setting up SQL Connection with MSSQL
# Windows authentications is used with MSSQL therefore no username and password is necessary.
conxn = pyodbc.connect(
"Driver={SQL Server Native Client 11.0};"
"Server=FRIDAY\SQLEXPRESS;"
"Database=Hi_Counrselor;"
"trusted_connection=yes;"
)
# Setting cursor to be able to read and write queries
cursor = conxn.cursor()
# Creating a list of CREATE TABLE SQL queries that will create a table for each dataframe.
# These tables will be stored in the database outlined in the conxn variable (Hi_Counselor)
create_table_query = ["""
DROP TABLE IF EXISTS [Hi_Counselor].[dbo].[comments]
CREATE TABLE[Hi_Counselor].[dbo].[comments]
(
[id] INT NOT NULL,
[comment_text] NVARCHAR(MAX) NULL,
[user_id] INT NOT NULL,
[photo_id] INT NULL,
[created_at] DATETIME NOT NULL
)
""",
"""
DROP TABLE IF EXISTS [Hi_Counselor].[dbo].[follows]
CREATE TABLE[Hi_Counselor].[dbo].[follows]
(
[follower_id] INT NOT NULL,
[followee_id] INT NULL,
[created_at] DATETIME NOT NULL
)
""",
"""
DROP TABLE IF EXISTS [Hi_Counselor].[dbo].[likes]
CREATE TABLE[Hi_Counselor].[dbo].[likes]
(
[user_id] INT NOT NULL,
[photo_id] INT NULL,
[created_at] DATETIME NOT NULL
)
""",
"""
DROP TABLE IF EXISTS [Hi_Counselor].[dbo].[photo_tags]
CREATE TABLE[Hi_Counselor].[dbo].[photo_tags]
(
[photo_id] INT NOT NULL,
[tag_id] INT NULL
)
""",
"""
DROP TABLE IF EXISTS [Hi_Counselor].[dbo].[photos]
CREATE TABLE[Hi_Counselor].[dbo].[photos]
(
[id] INT NOT NULL,
[image_url] NVARCHAR(MAX) NOT NULL,
[user_id] INT NOT NULL,
[created_date] DATETIME NOT NULL
)
""",
"""
DROP TABLE IF EXISTS [Hi_Counselor].[dbo].[tags]
CREATE TABLE[Hi_Counselor].[dbo].[tags]
(
[id] INT NOT NULL,
[tag_name] NVARCHAR(MAX) NOT NULL,
[created_at] DATETIME NOT NULL
)
""",
"""
DROP TABLE IF EXISTS [Hi_Counselor].[dbo].[users]
CREATE TABLE[Hi_Counselor].[dbo].[users]
(
[id] INT NOT NULL,
[username] NVARCHAR(MAX) NOT NULL,
[created_at] DATETIME NOT NULL
)
"""
]
# Creatinng all 7 tables in the database using a for loop
for table in create_table_query:
cursor.execute(table)
cursor.commit()
# Tables are empty and still needs data inserted into them.
# Creating a list of INSERT INTO queries to have all the dataframes data inserted into their tabels.
sql_insert_query = [
"""
INSERT INTO [Hi_Counselor].[dbo].[comments]
(
[id],
[comment_text],
[user_id],
[photo_id],
[created_at]
)
VALUES
(
?,?,?,?,?
)
""",
"""
INSERT INTO [Hi_Counselor].[dbo].[follows]
(
[follower_id],
[followee_id],
[created_at]
)
VALUES
(
?,?,?
)
""",
"""
INSERT INTO [Hi_Counselor].[dbo].[likes]
(
[user_id],
[photo_id],
[created_at]
)
VALUES
(
?,?,?
)
""",
"""
INSERT INTO [Hi_Counselor].[dbo].[photo_tags]
(
[photo_id],
[tag_id]
)
VALUES
(
?,?
)
""",
"""
INSERT INTO [Hi_Counselor].[dbo].[photos]
(
[id],
[image_url],
[user_id],
[created_date]
)
VALUES
(
?,?,?,?
)
""",
"""
INSERT INTO [Hi_Counselor].[dbo].[tags]
(
[id],
[tag_name],
[created_at]
)
VALUES
(
?,?,?
)
""",
"""
INSERT INTO [Hi_Counselor].[dbo].[users]
(
[id],
[username],
[created_at]
)
VALUES
(
?,?,?
)
"""
]
# Creating a list of dataframes to use in a for loop that inserts the data into the tables.
dfs = [comments, follows, likes, photo_tags, photos, tags, users]
# looping through the list of dataframes and converting it to records (a list of lists)
for i, df in enumerate(dfs):
query = sql_insert_query[i]
records = df.values.tolist()
# inserting each row of the df into their respective tables.
cursor.executemany(query, records)
cursor.commit()
# to correct the date column in each table, an update query will be used convert it to the corrected format.
# create a list of update queries for each table.
update_query = [
"""
UPDATE [Hi_Counselor].[dbo].[comments]
SET [created_at] = CONVERT(nvarchar(16), [created_at], 101) + ' ' + CONVERT(nvarchar(16), [created_at], 108)
""",
"""
UPDATE [Hi_Counselor].[dbo].[follows]
SET [created_at] = CONVERT(nvarchar(16), [created_at], 101) + ' ' + CONVERT(nvarchar(16), [created_at], 108)
""",
"""
UPDATE [Hi_Counselor].[dbo].[likes]
SET [created_at] = CONVERT(nvarchar(16), [created_at], 101) + ' ' + CONVERT(nvarchar(16), [created_at], 108)
""",
"""
UPDATE [Hi_Counselor].[dbo].[photos]
SET [created_date] = CONVERT(nvarchar(16), [created_date], 101) + ' ' + CONVERT(nvarchar(16), [created_date], 108)
""",
"""
UPDATE [Hi_Counselor].[dbo].[tags]
SET [created_at] = CONVERT(nvarchar(16), [created_at], 101) + ' ' + CONVERT(nvarchar(16), [created_at], 108)
""",
"""
UPDATE [Hi_Counselor].[dbo].[users]
SET [created_at] = CONVERT(nvarchar(16), [created_at], 101) + ' ' + CONVERT(nvarchar(16), [created_at], 108)
"""
]
# Loop through the list and run each query
for query in update_query:
cursor.execute(query)
cursor.commit()
# defining a function to read each query into a dataftame.
def send_query(query):
"""Takes a query string from a variable and returns the results in a dataframe"""
conxn = pyodbc.connect(
"Driver={SQL Server Native Client 11.0};"
"Server=FRIDAY\SQLEXPRESS;"
"Database=Hi_Counrselor;"
"trusted_connection=yes;"
)
result = pd.read_sql(query, conxn)
return result
Task 1:
Find the 5 oldest users.
task1 = """
SELECT TOP 5 id,
username,
created_at
FROM Hi_Counselor.dbo.users
ORDER BY 3 ASC
"""
send_query(task1)
id | username | created_at | |
---|---|---|---|
0 | 80 | Darby_Herzog | 2016-05-06 00:14:00 |
1 | 67 | Emilio_Bernier52 | 2016-05-06 13:04:00 |
2 | 63 | Elenor88 | 2016-05-08 01:30:00 |
3 | 95 | Nicole71 | 2016-05-09 17:30:00 |
4 | 38 | Jordyn.Jacobson2 | 2016-05-14 07:56:00 |
df_q1 = send_query(task1)
ax0 = sns.lineplot(x='username', y='created_at', data=df_q1, marker='o', markerfacecolor='g')
ax0.set(title='5 Oldest Users Registration Date', ylabel='Registered Date', xlabel='Usernames')
plt.xticks(rotation=45)
plt.show()
Task 2:
- What day of the week do most users register on?
- We need to figure out when to schedule an ad campgain
task2 = """
SELECT DATENAME(WEEKDAY, created_at) as day_of_week, COUNT(id) as registered_users
FROM Hi_Counselor.dbo.users
GROUP BY DATENAME(WEEKDAY, created_at)
ORDER BY 2 DESC
"""
send_query(task2)
day_of_week | registered_users | |
---|---|---|
0 | Sunday | 16 |
1 | Thursday | 16 |
2 | Friday | 15 |
3 | Monday | 14 |
4 | Tuesday | 14 |
5 | Wednesday | 13 |
6 | Saturday | 12 |
df_q2 =send_query(task2)
ax1 = sns.barplot(x='day_of_week', y='registered_users', data=df_q2)
ax1.set(title='User Registration by Day of the Week')
[Text(0.5, 1.0, 'User Registration by Day of the Week')]
Task 3:
- We want to target our inactive users with an email campaign.
Find the users who have never posted a photo.
task3 = """
SELECT users.id, users.username, COUNT(DISTINCT photos.user_id) AS number_of_photos
FROM Hi_Counselor.dbo.photos
FULL OUTER JOIN Hi_Counselor.dbo.users
ON photos.user_id = users.id
GROUP BY users.id, users.username
HAVING COUNT( photos.user_id) = 0
ORDER BY users.id
"""
send_query(task3)
id | username | number_of_photos | |
---|---|---|---|
0 | 5 | Aniya_Hackett | 0 |
1 | 7 | Kasandra_Homenick | 0 |
2 | 14 | Jaclyn81 | 0 |
3 | 21 | Rocio33 | 0 |
4 | 24 | Maxwell.Halvorson | 0 |
5 | 25 | Tierra.Trantow | 0 |
6 | 34 | Pearl7 | 0 |
7 | 36 | Ollie_Ledner37 | 0 |
8 | 41 | Mckenna17 | 0 |
9 | 45 | David.Osinski47 | 0 |
10 | 49 | Morgan.Kassulke | 0 |
11 | 53 | Linnea59 | 0 |
12 | 54 | Duane60 | 0 |
13 | 57 | Julien_Schmidt | 0 |
14 | 66 | Mike.Auer39 | 0 |
15 | 68 | Franco_Keebler64 | 0 |
16 | 71 | Nia_Haag | 0 |
17 | 74 | Hulda.Macejkovic | 0 |
18 | 75 | Leslie67 | 0 |
19 | 76 | Janelle.Nikolaus81 | 0 |
20 | 80 | Darby_Herzog | 0 |
21 | 81 | Esther.Zulauf61 | 0 |
22 | 83 | Bartholome.Bernhard | 0 |
23 | 89 | Jessyca_West | 0 |
24 | 90 | Esmeralda.Mraz57 | 0 |
25 | 91 | Bethany20 | 0 |
Task 4:
- We're running a new contest to see who can get the most likes on a single photo. WHO WON?
task4 = """
SELECT TOP 1 likes.photo_id, count(likes.user_id) AS number_of_likes, photos.user_id, users.username
FROM Hi_Counselor.dbo.likes
JOIN Hi_Counselor.dbo.photos ON Hi_Counselor.dbo.likes.photo_id=Hi_Counselor.dbo.photos.id
JOIN Hi_Counselor.dbo.users ON Hi_Counselor.dbo.photos.user_id=Hi_Counselor.dbo.users.id
GROUP BY likes.photo_id, photos.user_id, users.username
ORDER BY 2 DESC
"""
send_query(task4)
photo_id | number_of_likes | user_id | username | |
---|---|---|---|---|
0 | 145 | 48 | 52 | Zack_Kemmer93 |
Task 5:
- Our Investors want to know...How many times does the average user post? (total number of photos/total number of users)
task5 = """
SELECT DISTINCT(COUNT(photos.id)) as total_photos ,
COUNT(DISTINCT(users.id)) as total_users,
ROUND(CAST(COUNT(DISTINCT photos.id) as DECIMAL(5,2))/CAST(COUNT(DISTINCT users.id) as DECIMAL(5,2)),2) as average_photos_per_user
FROM Hi_Counselor.dbo.photos
FULL OUTER JOIN Hi_Counselor.dbo.users ON Hi_Counselor.dbo.users.id=Hi_Counselor.dbo.photos.user_id
"""
send_query(task5)
total_photos | total_users | average_photos_per_user | |
---|---|---|---|
0 | 257 | 100 | 2.57 |
Task 6:
- user ranking by postings higher to lower
task6 = """
SELECT users.username, COUNT(photos.id) as total_posts
FROM Hi_Counselor.dbo.photos
JOIN Hi_Counselor.dbo.users ON Hi_Counselor.dbo.users.id=Hi_Counselor.dbo.photos.user_id
GROUP BY users.username
ORDER BY 2 DESC
"""
send_query(task6)
username | total_posts | |
---|---|---|
0 | Eveline95 | 12 |
1 | Clint27 | 11 |
2 | Cesar93 | 10 |
3 | Delfina_VonRueden68 | 9 |
4 | Aurelie71 | 8 |
5 | Jaime53 | 8 |
6 | Donald.Fritsch | 6 |
7 | Florence99 | 5 |
8 | Harrison.Beatty50 | 5 |
9 | Colten.Harris76 | 5 |
10 | Adelle96 | 5 |
11 | Alexandro35 | 5 |
12 | Janet.Armstrong | 5 |
13 | Josianne.Friesen | 5 |
14 | Justina.Gaylord27 | 5 |
15 | Kathryn80 | 5 |
16 | Kenton_Kirlin | 5 |
17 | Mariano_Koch3 | 5 |
18 | Travon.Waters | 5 |
19 | Yvette.Gottlieb91 | 5 |
20 | Zack_Kemmer93 | 5 |
21 | Rick29 | 4 |
22 | Seth46 | 4 |
23 | Tabitha_Schamberger11 | 4 |
24 | Malinda_Streich | 4 |
25 | Andre_Purdy85 | 4 |
26 | Annalise.McKenzie16 | 4 |
27 | Billy52 | 4 |
28 | Dario77 | 4 |
29 | Harley_Lind18 | 4 |
30 | Irwin.Larson | 4 |
31 | Dereck65 | 4 |
32 | Gus93 | 4 |
33 | Elenor88 | 4 |
34 | Emilio_Bernier52 | 3 |
35 | Frederik_Rice | 3 |
36 | Gerard79 | 3 |
37 | Alek_Watsica | 3 |
38 | Arely_Bogan63 | 3 |
39 | Keenan.Schamberger60 | 3 |
40 | Maya.Farrell | 3 |
41 | Norbert_Carroll35 | 3 |
42 | Presley_McClure | 3 |
43 | Milford_Gleichner42 | 2 |
44 | Nicole71 | 2 |
45 | Tomas.Beatty93 | 2 |
46 | Sam52 | 2 |
47 | Willie_Leuschke | 2 |
48 | Jordyn.Jacobson2 | 2 |
49 | Lennie_Hartmann40 | 2 |
50 | Kaley9 | 2 |
51 | Javonte83 | 2 |
52 | Aracely.Johnston98 | 2 |
53 | Alysa22 | 2 |
54 | Hailee26 | 2 |
55 | Ressie_Stanton46 | 2 |
56 | Imani_Nicolas17 | 1 |
57 | Granville_Kutch | 1 |
58 | Erick5 | 1 |
59 | Aiyana_Hoeger | 1 |
60 | Darwin29 | 1 |
61 | Delpha.Kihn | 1 |
62 | Damon35 | 1 |
63 | Jaylan.Lakin | 1 |
64 | Jayson65 | 1 |
65 | Karley_Bosco | 1 |
66 | Katarina.Dibbert | 1 |
67 | Kelsi26 | 1 |
68 | Kenneth64 | 1 |
69 | Yazmin_Mills95 | 1 |
70 | Meggie_Doyle | 1 |
71 | Rafael.Hickle2 | 1 |
72 | Odessa2 | 1 |
73 | Peter.Stehr0 | 1 |
df_q3 = send_query(task6)
ax2 = sns.histplot(data=df_q3, x='total_posts', kde=True, bins=10)
results = send_query(task6)
results[['total_posts']].sum()
total_posts 257 dtype: int64
Task 7:
- Total numbers of users who have posted at least one time
# Taks 8:
task7 = """
WITH CTE AS(
SELECT COUNT(users.username) as users_count, COUNT(photos.id) as total_posts
FROM Hi_Counselor.dbo.photos
JOIN Hi_Counselor.dbo.users ON Hi_Counselor.dbo.users.id=Hi_Counselor.dbo.photos.user_id
GROUP BY users.username
)
SELECT COUNT(users_count) Total_ursers_posted
FROM CTE
"""
send_query(task7)
Total_ursers_posted | |
---|---|
0 | 74 |
Task 8:
- A brand wants to know which hashtags to use in a post. What are the top 5 most commonly used hashtags?
task8 = """
SELECT TOP 5 tags.tag_name, photo_tags.tag_id, COUNT(photo_tags.tag_id) AS number_of_times_tag_is_used
From Hi_Counselor.dbo.photo_tags
JOIN Hi_Counselor.dbo.tags ON Hi_Counselor.dbo.photo_tags.tag_id = Hi_Counselor.dbo.tags.id
GROUP BY photo_tags.tag_id, tags.tag_name
ORDER BY 3 DESC
"""
send_query(task8)
tag_name | tag_id | number_of_times_tag_is_used | |
---|---|---|---|
0 | smile | 21 | 59 |
1 | beach | 20 | 42 |
2 | party | 17 | 39 |
3 | fun | 13 | 38 |
4 | concert | 18 | 24 |
df_q4 = send_query(task8)
ax3 = sns.barplot(data=df_q4, x='number_of_times_tag_is_used', y='tag_name')
Task 9:
- Find users who have liked every single photo on the site
task9 = """
SELECT users.username, COUNT(likes.photo_id) AS number_of_likes
FROM Hi_Counselor.dbo.likes
JOIN Hi_Counselor.dbo.users ON users.id = likes.user_id
GROUP BY users.username
HAVING COUNT(likes.photo_id) = 257
ORDER BY 1
"""
send_query(task9)
username | number_of_likes | |
---|---|---|
0 | Aniya_Hackett | 257 |
1 | Bethany20 | 257 |
2 | Duane60 | 257 |
3 | Jaclyn81 | 257 |
4 | Janelle.Nikolaus81 | 257 |
5 | Julien_Schmidt | 257 |
6 | Leslie67 | 257 |
7 | Maxwell.Halvorson | 257 |
8 | Mckenna17 | 257 |
9 | Mike.Auer39 | 257 |
10 | Nia_Haag | 257 |
11 | Ollie_Ledner37 | 257 |
12 | Rocio33 | 257 |
Task 10:
- Find users who have never commented on a photo
task10 = """
SELECT users.username, COUNT(comments.user_id) AS number_of_comments
FROM Hi_Counselor.dbo.comments
FULL OUTER JOIN Hi_Counselor.dbo.users ON users.id = comments.user_id
GROUP BY users.username
HAVING COUNT(comments.user_id) = 0
ORDER BY 1
"""
send_query(task10)
username | number_of_comments | |
---|---|---|
0 | Aurelie71 | 0 |
1 | Bartholome.Bernhard | 0 |
2 | Cesar93 | 0 |
3 | Clint27 | 0 |
4 | Darby_Herzog | 0 |
5 | David.Osinski47 | 0 |
6 | Delfina_VonRueden68 | 0 |
7 | Donald.Fritsch | 0 |
8 | Esmeralda.Mraz57 | 0 |
9 | Esther.Zulauf61 | 0 |
10 | Eveline95 | 0 |
11 | Florence99 | 0 |
12 | Franco_Keebler64 | 0 |
13 | Hulda.Macejkovic | 0 |
14 | Jaime53 | 0 |
15 | Jessyca_West | 0 |
16 | Kasandra_Homenick | 0 |
17 | Kenton_Kirlin | 0 |
18 | Linnea59 | 0 |
19 | Mariano_Koch3 | 0 |
20 | Morgan.Kassulke | 0 |
21 | Pearl7 | 0 |
22 | Tierra.Trantow | 0 |
Task 11:
- Find total users who have either never commented on a photo or have commented on every photo.
- Find the percentage of our users who have either never commented on a photo or have commented on every photo.
task11_a = """
SELECT users.username, COUNT(comments.user_id) AS number_of_comments
FROM Hi_Counselor.dbo.comments
FULL OUTER JOIN Hi_Counselor.dbo.users ON users.id = comments.user_id
GROUP BY users.username
HAVING COUNT(comments.user_id) = 0 OR COUNT(comments.user_id) = 257
ORDER BY 2,1
"""
send_query(task11_a)
username | number_of_comments | |
---|---|---|
0 | Aurelie71 | 0 |
1 | Bartholome.Bernhard | 0 |
2 | Cesar93 | 0 |
3 | Clint27 | 0 |
4 | Darby_Herzog | 0 |
5 | David.Osinski47 | 0 |
6 | Delfina_VonRueden68 | 0 |
7 | Donald.Fritsch | 0 |
8 | Esmeralda.Mraz57 | 0 |
9 | Esther.Zulauf61 | 0 |
10 | Eveline95 | 0 |
11 | Florence99 | 0 |
12 | Franco_Keebler64 | 0 |
13 | Hulda.Macejkovic | 0 |
14 | Jaime53 | 0 |
15 | Jessyca_West | 0 |
16 | Kasandra_Homenick | 0 |
17 | Kenton_Kirlin | 0 |
18 | Linnea59 | 0 |
19 | Mariano_Koch3 | 0 |
20 | Morgan.Kassulke | 0 |
21 | Pearl7 | 0 |
22 | Tierra.Trantow | 0 |
23 | Aniya_Hackett | 257 |
24 | Bethany20 | 257 |
25 | Duane60 | 257 |
26 | Jaclyn81 | 257 |
27 | Janelle.Nikolaus81 | 257 |
28 | Julien_Schmidt | 257 |
29 | Leslie67 | 257 |
30 | Maxwell.Halvorson | 257 |
31 | Mckenna17 | 257 |
32 | Mike.Auer39 | 257 |
33 | Nia_Haag | 257 |
34 | Ollie_Ledner37 | 257 |
35 | Rocio33 | 257 |
task11_b = """
WITH CTE AS(
SELECT users.id, users.username, COUNT(comments.user_id) AS number_of_comments,
CASE
WHEN COUNT(comments.user_id) = 0 or COUNT(comments.user_id) = 257 THEN 1
WHEN COUNT(comments.user_id) != 0 AND COUNT(comments.user_id) != 257 THEN 0
END AS condition
FROM Hi_Counselor.dbo.comments
FULL OUTER JOIN Hi_Counselor.dbo.users ON Hi_Counselor.dbo.users.id = comments.user_id
GROUP BY users.username, users.id)
SELECT COUNT(username) as total_users, SUM(condition) as bots_celebs_accounts, SUM(CAST(condition AS DECIMAL))/CAST(COUNT(username) AS DECIMAL)*100 AS bots_celebs_accounts_pct
FROM CTE;
"""
send_query(task11_b)
total_users | bots_celebs_accounts | bots_celebs_accounts_pct | |
---|---|---|---|
0 | 100 | 36 | 36.0 |
Task 12:
- Find users who have ever commented on a photo.
- Find the total of users who have ever commented on a photo
task12_a = """
SELECT users.id, users.username, COUNT(comments.user_id) AS number_of_comments
FROM Hi_Counselor.dbo.comments
FULL OUTER JOIN Hi_Counselor.dbo.users ON Hi_Counselor.dbo.users.id = comments.user_id
GROUP BY users.username, users.id
HAVING COUNT(comments.user_id) != 0
ORDER BY 3
"""
send_query(task12_a)
id | username | number_of_comments | |
---|---|---|---|
0 | 11 | Justina.Gaylord27 | 49 |
1 | 16 | Annalise.McKenzie16 | 52 |
2 | 18 | Odessa2 | 53 |
3 | 42 | Maya.Farrell | 54 |
4 | 48 | Granville_Kutch | 55 |
5 | 52 | Zack_Kemmer93 | 56 |
6 | 32 | Irwin.Larson | 56 |
7 | 85 | Milford_Gleichner42 | 57 |
8 | 38 | Jordyn.Jacobson2 | 58 |
9 | 13 | Alexandro35 | 58 |
10 | 61 | Jayson65 | 58 |
11 | 62 | Ressie_Stanton46 | 58 |
12 | 47 | Harrison.Beatty50 | 59 |
13 | 28 | Dario77 | 59 |
14 | 19 | Hailee26 | 60 |
15 | 9 | Gus93 | 60 |
16 | 44 | Seth46 | 60 |
17 | 65 | Adelle96 | 60 |
18 | 78 | Colten.Harris76 | 60 |
19 | 92 | Frederik_Rice | 61 |
20 | 8 | Tabitha_Schamberger11 | 61 |
21 | 33 | Yvette.Gottlieb91 | 61 |
22 | 6 | Travon.Waters | 62 |
23 | 10 | Presley_McClure | 63 |
24 | 93 | Willie_Leuschke | 63 |
25 | 73 | Jaylan.Lakin | 63 |
26 | 37 | Yazmin_Mills95 | 63 |
27 | 72 | Kathryn80 | 64 |
28 | 4 | Arely_Bogan63 | 64 |
29 | 30 | Kaley9 | 65 |
30 | 98 | Imani_Nicolas17 | 65 |
31 | 31 | Aiyana_Hoeger | 66 |
32 | 2 | Andre_Purdy85 | 66 |
33 | 55 | Meggie_Doyle | 66 |
34 | 27 | Darwin29 | 67 |
35 | 82 | Aracely.Johnston98 | 67 |
36 | 3 | Harley_Lind18 | 67 |
37 | 20 | Delpha.Kihn | 67 |
38 | 22 | Kenneth64 | 67 |
39 | 35 | Lennie_Hartmann40 | 67 |
40 | 39 | Kelsi26 | 67 |
41 | 40 | Rafael.Hickle2 | 68 |
42 | 12 | Dereck65 | 68 |
43 | 79 | Katarina.Dibbert | 68 |
44 | 94 | Damon35 | 68 |
45 | 95 | Nicole71 | 68 |
46 | 46 | Malinda_Streich | 68 |
47 | 56 | Peter.Stehr0 | 68 |
48 | 99 | Alek_Watsica | 68 |
49 | 97 | Tomas.Beatty93 | 68 |
50 | 50 | Gerard79 | 69 |
51 | 69 | Karley_Bosco | 69 |
52 | 70 | Erick5 | 69 |
53 | 26 | Josianne.Friesen | 69 |
54 | 100 | Javonte83 | 70 |
55 | 43 | Janet.Armstrong | 72 |
56 | 60 | Sam52 | 72 |
57 | 87 | Rick29 | 74 |
58 | 96 | Keenan.Schamberger60 | 75 |
59 | 84 | Alysa22 | 76 |
60 | 67 | Emilio_Bernier52 | 76 |
61 | 15 | Billy52 | 77 |
62 | 63 | Elenor88 | 80 |
63 | 17 | Norbert_Carroll35 | 83 |
64 | 24 | Maxwell.Halvorson | 257 |
65 | 41 | Mckenna17 | 257 |
66 | 36 | Ollie_Ledner37 | 257 |
67 | 14 | Jaclyn81 | 257 |
68 | 21 | Rocio33 | 257 |
69 | 5 | Aniya_Hackett | 257 |
70 | 57 | Julien_Schmidt | 257 |
71 | 54 | Duane60 | 257 |
72 | 75 | Leslie67 | 257 |
73 | 76 | Janelle.Nikolaus81 | 257 |
74 | 71 | Nia_Haag | 257 |
75 | 66 | Mike.Auer39 | 257 |
76 | 91 | Bethany20 | 257 |
ax5 = sns.histplot(data=send_query(task12_a), x='number_of_comments', kde=True)
ax5.set(title='Histogram of Number of Comments by User',
ylabel='Number of comments', xlabel='Comment Number Buckets')
[Text(0.5, 1.0, 'Histogram of Number of Comments by User'), Text(0, 0.5, 'Number of comments'), Text(0.5, 0, 'Comment Number Buckets')]
task12_b = """
Select COUNT(username) as total_users_with_comments
FROM (
SELECT users.id, users.username, COUNT(comments.user_id) AS number_of_comments
FROM Hi_Counselor.dbo.comments
FULL OUTER JOIN Hi_Counselor.dbo.users ON Hi_Counselor.dbo.users.id = comments.user_id
GROUP BY users.username, users.id
HAVING COUNT(comments.user_id) != 0)
subquery
"""
send_query(task12_b)
total_users_with_comments | |
---|---|
0 | 77 |