SQL on Marketing Analytics using Dvd rental
Problem Statement: We have been tasked by the DVD Rental Co marketing team to help them generate the analytical inputs required to drive their very first customer email campaign.
The marketing team expects their personalised emails to drive increased sales and engagement from the DVD Rental Co customer base.
The main initiative is to share insights about each customer’s viewing behaviour to demonstrate DVD Rental Co’s relentless focus on customer experience.
The insights requested by the marketing team include key statistics about each customer’s top 2 categories and favourite actor. There are also 3 personalised recommendations based off each customer’s previous viewing history as well as titles which are popular with other customers.
Perform an anti join to check which column values exist in
dvd_rentals.rental
but not indvd_rentals.inventory
-- how many foreign keys only exist in the left table and not in the right? SELECT COUNT(DISTINCT rental.inventory_id) FROM dvd_rentals.rental WHERE NOT EXISTS ( SELECT inventory_id FROM dvd_rentals.inventory WHERE rental.inventory_id = inventory.inventory_id );
Count=0
Now check the right side table using the same process:
dvd_rentals.inventory
-- how many foreign keys only exist in the right table and not in the left? -- note the table reference changes SELECT COUNT(DISTINCT inventory.inventory_id) FROM dvd_rentals.inventory WHERE NOT EXISTS ( SELECT inventory_id FROM dvd_rentals.rental WHERE rental.inventory_id = inventory.inventory_id );
Count= 1
There seems to be a single value which is not showing up - let’s investigate which film it is:
SELECT * FROM dvd_rentals.inventory WHERE NOT EXISTS ( SELECT inventory_id FROM dvd_rentals.rental WHERE rental.inventory_id = inventory.inventory_id );
| inventory_id | film_id | store_id | last_update | | --- | --- | --- | --- | | 5 | 1 | 2 | 2006-02-15 05:09:17 |
Conclusion: although there is a single
inventory_id
record which is missing from thedvd_rentals.rental
table - there might be no issues with this discrepancy as it seems that some inventory might just never be rented out to customers at the retail rental stores.Finally - let’s confirm that both left and inner joins do not differ at all when we look at the resulting row counts from the joint tables:
DROP TABLE IF EXISTS left_rental_join; CREATE TEMP TABLE left_rental_join AS SELECT rental.customer_id, rental.inventory_id, inventory.film_id FROM dvd_rentals.rental LEFT JOIN dvd_rentals.inventory ON rental.inventory_id = inventory.inventory_id; DROP TABLE IF EXISTS inner_rental_join; CREATE TEMP TABLE inner_rental_join AS SELECT rental.customer_id, rental.inventory_id, inventory.film_id FROM dvd_rentals.rental INNER JOIN dvd_rentals.inventory ON rental.inventory_id = inventory.inventory_id; -- Output SQL ( SELECT 'left join' AS join_type, COUNT(*) AS record_count, COUNT(DISTINCT inventory_id) AS unique_key_values FROM left_rental_join ) UNION ( SELECT 'inner join' AS join_type, COUNT(*) AS record_count, COUNT(DISTINCT inventory_id) AS unique_key_values FROM inner_rental_join );
| join_type | record_count | unique_key_values | | --- | --- | --- | | inner join | 16044 | 4580 | | left join | 16044 | 4580 |
We perform this same analysis for all of our tables within our core tables and concluded that the distribution for each of the join keys are as expected and are similar to what we see for these first 2 tables.
Category Insights
Create Base Dataset
We first created a complete_joint_dataset
which joins multiple tables together after analysing the relationships between each table to confirm if there was a one-to-many, many-to-one or a many-to-many relationship for each of the join columns.
We also included the rental_date
column to help us split ties for rankings which had the same count of rentals at a customer level - this helps us prioritise film categories which were more recently viewed.
DROP TABLE IF EXISTS complete_joint_dataset;
CREATE TEMP TABLE complete_joint_dataset AS
SELECT
rental.customer_id,
inventory.film_id,
film.title,
category.name AS category_name,
-- also included rental_date for sorting purposes
rental.rental_date
FROM dvd_rentals.rental
INNER JOIN dvd_rentals.inventory
ON rental.inventory_id = inventory.inventory_id
INNER JOIN dvd_rentals.film
ON inventory.film_id = film.film_id
INNER JOIN dvd_rentals.film_category
ON film.film_id = film_category.film_id
INNER JOIN dvd_rentals.category
ON film_category.category_id = category.category_id;
SELECT * FROM complete_joint_dataset limit 10;
customer_id | film_id | title | category_id | category_name |
---|---|---|---|---|
130 | 80 | BLANKET BEVERLY | 8 | Family |
459 | 333 | FREAKY POCUS | 12 | Music |
408 | 373 | GRADUATE LORD | 3 | Children |
333 | 535 | LOVE SUICIDES | 11 | Horror |
222 | 450 | IDOLS SNATCHERS | 3 | Children |
549 | 613 | MYSTIC TRUMAN | 5 | Comedy |
269 | 870 | SWARM GOLD | 11 | Horror |
239 | 510 | LAWLESS VISION | 2 | Animation |
126 | 565 | MATRIX SNOWMAN | 9 | Foreign |
399 | 396 | HANGING DEEP | 7 | Drama |
Category Counts
We then created a follow-up table which uses the complete_joint_dataset
to aggregate our data and generate a rental_count
and the latest rental_date
for our ranking purposes downstream.
DROP TABLE IF EXISTS category_counts;
CREATE TEMP TABLE category_counts AS
SELECT
customer_id,
category_name,
COUNT(*) AS rental_count,
MAX(rental_date) AS latest_rental_date
FROM complete_joint_dataset
GROUP BY
customer_id,
category_name;
SELECT *
FROM category_counts
WHERE customer_id = 1
ORDER BY
rental_count DESC,
latest_rental_date DESC;
customer_id | category_name | rental_count | latest_rental_date |
---|---|---|---|
1 | Classics | 6 | 2005-08-19 09:55:16 |
1 | Comedy | 5 | 2005-08-22 19:41:37 |
1 | Drama | 4 | 2005-08-18 03:57:29 |
1 | Animation | 2 | 2005-08-22 20:03:46 |
1 | Sci-Fi | 2 | 2005-08-21 23:33:57 |
1 | New | 2 | 2005-08-19 13:56:54 |
1 | Action | 2 | 2005-08-17 12:37:54 |
1 | Music | 2 | 2005-07-09 16:38:01 |
1 | Sports | 2 | 2005-07-08 07:33:56 |
1 | Family | 1 | 2005-08-02 18:01:38 |
1 | Documentary | 1 | 2005-08-01 08:51:04 |
1 | Foreign | 1 | 2005-07-28 16:18:23 |
1 | Travel | 1 | 2005-07-11 10:13:46 |
1 | Games | 1 | 2005-07-08 03:17:05 |
Total Counts
We will then use this category_counts
table to generate our total_counts
table.
DROP TABLE IF EXISTS total_counts;
CREATE TEMP TABLE total_counts AS
SELECT
customer_id,
SUM(rental_count) AS total_count
FROM category_counts
GROUP BY
customer_id;
SELECT *
FROM total_counts
LIMIT 5;
customer_id | total_count |
184 | 23 |
87 | 30 |
477 | 22 |
273 | 35 |
550 | 32 |
Top Categories
We can also use a simple DENSE_RANK
window function to generate a ranking of categories for each customer.
We will also split arbitrary ties by preferencing the category which had the most recent latest_rental_date
value we generated in the category_counts
for this exact purpose. To further prevent any ties - we will also sort the category_name
in alphabetical (ascending) order just in case!
DROP TABLE IF EXISTS top_categories;
CREATE TEMP TABLE top_categories AS
WITH ranked_cte AS (
SELECT
customer_id,
category_name,
rental_count,
DENSE_RANK() OVER (
PARTITION BY customer_id
ORDER BY
rental_count DESC,
latest_rental_date DESC,
category_name
) AS category_rank
FROM category_counts
)
SELECT * FROM ranked_cte
WHERE category_rank <= 2;
SELECT *
FROM top_categories
LIMIT 5;
customer_id | category_name | rental_count | category_rank |
1 | Classics | 6 | 1 |
1 | Comedy | 5 | 2 |
2 | Sports | 5 | 1 |
2 | Classics | 4 | 2 |
3 | Action | 4 | 1 |
Average Category Count
Next we will need to use the category_counts
table to generate the average aggregated rental count for each category rounded down to the nearest integer using the FLOOR
function
DROP TABLE IF EXISTS average_category_count;
CREATE TEMP TABLE average_category_count AS
SELECT
category_name,
FLOOR(AVG(rental_count)) AS category_average
FROM category_counts
GROUP BY category_name;
SELECT *
FROM average_category_count
ORDER BY
category_average DESC,
category_name;
category_name | category_average |
---|---|
Action | 2 |
Animation | 2 |
Classics | 2 |
Documentary | 2 |
Drama | 2 |
Family | 2 |
Foreign | 2 |
Games | 2 |
New | 2 |
Sci-Fi | 2 |
Sports | 2 |
Children | 1 |
Comedy | 1 |
Horror | 1 |
Music | 1 |
Travel | 1 |
Top Category Percentile
Now we need to compare each customer’s top category rental_count
to all other DVD Rental Co customers - we do this using a combination of a LEFT JOIN
and a PERCENT_RANK
window function ordered by descending rental count to show the required top N%
customer insight.
We will also use a CASE WHEN
to replace a 0
ranking value to 1
as it doesn’t make sense for the customer to be in the Top 0%
!
DROP TABLE IF EXISTS top_category_percentile;
CREATE TEMP TABLE top_category_percentile AS
WITH calculated_cte AS (
SELECT
top_categories.customer_id,
top_categories.category_name AS top_category_name,
top_categories.rental_count,
category_counts.category_name,
top_categories.category_rank,
PERCENT_RANK() OVER (
PARTITION BY category_counts.category_name
ORDER BY category_counts.rental_count DESC
) AS raw_percentile_value
FROM category_counts
LEFT JOIN top_categories
ON category_counts.customer_id = top_categories.customer_id
)
SELECT
customer_id,
category_name,
rental_count,
category_rank,
CASE
WHEN ROUND(100 * raw_percentile_value) = 0 THEN 1
ELSE ROUND(100 * raw_percentile_value)
END AS percentile
FROM calculated_cte
WHERE
category_rank = 1
AND top_category_name = category_name;
SELECT *
FROM top_category_percentile
LIMIT 10;
customer_id | category_name | rental_count | percentile |
323 | Action | 7 | 1 |
506 | Action | 7 | 1 |
151 | Action | 6 | 1 |
410 | Action | 6 | 1 |
126 | Action | 6 | 1 |
51 | Action | 6 | 1 |
487 | Action | 6 | 1 |
363 | Action | 6 | 1 |
209 | Action | 6 | 1 |
560 | Action | 6 | 1 |