SQL on Marketing Analytics using Dvd rental

Problem Statement: We have been tasked by the DVD Rental Co marketing team to help them generate the analytical inputs required to drive their very first customer email campaign.

The marketing team expects their personalised emails to drive increased sales and engagement from the DVD Rental Co customer base.

The main initiative is to share insights about each customer’s viewing behaviour to demonstrate DVD Rental Co’s relentless focus on customer experience.

The insights requested by the marketing team include key statistics about each customer’s top 2 categories and favourite actor. There are also 3 personalised recommendations based off each customer’s previous viewing history as well as titles which are popular with other customers.

  1. Perform an anti join to check which column values exist in dvd_rentals.rental but not in dvd_rentals.inventory

     -- how many foreign keys only exist in the left table and not in the right?
     SELECT
       COUNT(DISTINCT rental.inventory_id)
     FROM dvd_rentals.rental
     WHERE NOT EXISTS (
       SELECT inventory_id
       FROM dvd_rentals.inventory
       WHERE rental.inventory_id = inventory.inventory_id
     );
    

    Count=0

  2. Now check the right side table using the same process: dvd_rentals.inventory

     -- how many foreign keys only exist in the right table and not in the left?
     -- note the table reference changes
     SELECT
       COUNT(DISTINCT inventory.inventory_id)
     FROM dvd_rentals.inventory
     WHERE NOT EXISTS (
       SELECT inventory_id
       FROM dvd_rentals.rental
       WHERE rental.inventory_id = inventory.inventory_id
     );
    

    Count= 1

  3. There seems to be a single value which is not showing up - let’s investigate which film it is:

     SELECT *
     FROM dvd_rentals.inventory
     WHERE NOT EXISTS (
       SELECT inventory_id
       FROM dvd_rentals.rental
       WHERE rental.inventory_id = inventory.inventory_id
     );
    

    | inventory_id | film_id | store_id | last_update | | --- | --- | --- | --- | | 5 | 1 | 2 | 2006-02-15 05:09:17 |

    Conclusion: although there is a single inventory_id record which is missing from the dvd_rentals.rental table - there might be no issues with this discrepancy as it seems that some inventory might just never be rented out to customers at the retail rental stores.

  4. Finally - let’s confirm that both left and inner joins do not differ at all when we look at the resulting row counts from the joint tables:

     DROP TABLE IF EXISTS left_rental_join;
     CREATE TEMP TABLE left_rental_join AS
     SELECT
       rental.customer_id,
       rental.inventory_id,
       inventory.film_id
     FROM dvd_rentals.rental
     LEFT JOIN dvd_rentals.inventory
       ON rental.inventory_id = inventory.inventory_id;
    
     DROP TABLE IF EXISTS inner_rental_join;
     CREATE TEMP TABLE inner_rental_join AS
     SELECT
       rental.customer_id,
       rental.inventory_id,
       inventory.film_id
     FROM dvd_rentals.rental
     INNER JOIN dvd_rentals.inventory
       ON rental.inventory_id = inventory.inventory_id;
    
     -- Output SQL
     (
       SELECT
         'left join' AS join_type,
         COUNT(*) AS record_count,
         COUNT(DISTINCT inventory_id) AS unique_key_values
       FROM left_rental_join
     )
     UNION
     (
       SELECT
         'inner join' AS join_type,
         COUNT(*) AS record_count,
         COUNT(DISTINCT inventory_id) AS unique_key_values
       FROM inner_rental_join
     );
    

    | join_type | record_count | unique_key_values | | --- | --- | --- | | inner join | 16044 | 4580 | | left join | 16044 | 4580 |

    We perform this same analysis for all of our tables within our core tables and concluded that the distribution for each of the join keys are as expected and are similar to what we see for these first 2 tables.

Category Insights

Create Base Dataset

We first created a complete_joint_dataset which joins multiple tables together after analysing the relationships between each table to confirm if there was a one-to-many, many-to-one or a many-to-many relationship for each of the join columns.

We also included the rental_date column to help us split ties for rankings which had the same count of rentals at a customer level - this helps us prioritise film categories which were more recently viewed.

DROP TABLE IF EXISTS complete_joint_dataset;
CREATE TEMP TABLE complete_joint_dataset AS
SELECT
  rental.customer_id,
  inventory.film_id,
  film.title,
  category.name AS category_name,
  -- also included rental_date for sorting purposes
  rental.rental_date
FROM dvd_rentals.rental
INNER JOIN dvd_rentals.inventory
  ON rental.inventory_id = inventory.inventory_id
INNER JOIN dvd_rentals.film
  ON inventory.film_id = film.film_id
INNER JOIN dvd_rentals.film_category
  ON film.film_id = film_category.film_id
INNER JOIN dvd_rentals.category
  ON film_category.category_id = category.category_id;
SELECT * FROM complete_joint_dataset limit 10;
customer_idfilm_idtitlecategory_idcategory_name
13080BLANKET BEVERLY8Family
459333FREAKY POCUS12Music
408373GRADUATE LORD3Children
333535LOVE SUICIDES11Horror
222450IDOLS SNATCHERS3Children
549613MYSTIC TRUMAN5Comedy
269870SWARM GOLD11Horror
239510LAWLESS VISION2Animation
126565MATRIX SNOWMAN9Foreign
399396HANGING DEEP7Drama

Category Counts

We then created a follow-up table which uses the complete_joint_dataset to aggregate our data and generate a rental_count and the latest rental_date for our ranking purposes downstream.

DROP TABLE IF EXISTS category_counts;
CREATE TEMP TABLE category_counts AS
SELECT
  customer_id,
  category_name,
  COUNT(*) AS rental_count,
  MAX(rental_date) AS latest_rental_date
FROM complete_joint_dataset
GROUP BY
  customer_id,
  category_name;
SELECT *
FROM category_counts
WHERE customer_id = 1
ORDER BY
    rental_count DESC,
    latest_rental_date DESC;
customer_idcategory_namerental_countlatest_rental_date
1Classics62005-08-19 09:55:16
1Comedy52005-08-22 19:41:37
1Drama42005-08-18 03:57:29
1Animation22005-08-22 20:03:46
1Sci-Fi22005-08-21 23:33:57
1New22005-08-19 13:56:54
1Action22005-08-17 12:37:54
1Music22005-07-09 16:38:01
1Sports22005-07-08 07:33:56
1Family12005-08-02 18:01:38
1Documentary12005-08-01 08:51:04
1Foreign12005-07-28 16:18:23
1Travel12005-07-11 10:13:46
1Games12005-07-08 03:17:05

Total Counts

We will then use this category_counts table to generate our total_counts table.

DROP TABLE IF EXISTS total_counts;
CREATE TEMP TABLE total_counts AS
SELECT
  customer_id,
  SUM(rental_count) AS total_count
FROM category_counts
GROUP BY
  customer_id;
SELECT *
FROM total_counts
LIMIT 5;
customer_idtotal_count
18423
8730
47722
27335
55032

Top Categories

We can also use a simple DENSE_RANK window function to generate a ranking of categories for each customer.

We will also split arbitrary ties by preferencing the category which had the most recent latest_rental_date value we generated in the category_counts for this exact purpose. To further prevent any ties - we will also sort the category_name in alphabetical (ascending) order just in case!

DROP TABLE IF EXISTS top_categories;
CREATE TEMP TABLE top_categories AS
WITH ranked_cte AS (
  SELECT
    customer_id,
    category_name,
    rental_count,
    DENSE_RANK() OVER (
      PARTITION BY customer_id
      ORDER BY
        rental_count DESC,
        latest_rental_date DESC,
        category_name
    ) AS category_rank
  FROM category_counts
)
SELECT * FROM ranked_cte
WHERE category_rank <= 2;
SELECT *
FROM top_categories
LIMIT 5;
customer_idcategory_namerental_countcategory_rank
1Classics61
1Comedy52
2Sports51
2Classics42
3Action41

Average Category Count

Next we will need to use the category_counts table to generate the average aggregated rental count for each category rounded down to the nearest integer using the FLOOR function

DROP TABLE IF EXISTS average_category_count;
CREATE TEMP TABLE average_category_count AS
SELECT
  category_name,
  FLOOR(AVG(rental_count)) AS category_average
FROM category_counts
GROUP BY category_name;
SELECT *
FROM average_category_count
ORDER BY
  category_average DESC,
  category_name;
category_namecategory_average
Action2
Animation2
Classics2
Documentary2
Drama2
Family2
Foreign2
Games2
New2
Sci-Fi2
Sports2
Children1
Comedy1
Horror1
Music1
Travel1

Top Category Percentile

Now we need to compare each customer’s top category rental_count to all other DVD Rental Co customers - we do this using a combination of a LEFT JOIN and a PERCENT_RANK window function ordered by descending rental count to show the required top N% customer insight.

We will also use a CASE WHEN to replace a 0 ranking value to 1 as it doesn’t make sense for the customer to be in the Top 0%!

DROP TABLE IF EXISTS top_category_percentile;
CREATE TEMP TABLE top_category_percentile AS
WITH calculated_cte AS (
SELECT
  top_categories.customer_id,
  top_categories.category_name AS top_category_name,
  top_categories.rental_count,
  category_counts.category_name,
  top_categories.category_rank,
  PERCENT_RANK() OVER (
    PARTITION BY category_counts.category_name
    ORDER BY category_counts.rental_count DESC
  ) AS raw_percentile_value
FROM category_counts
LEFT JOIN top_categories
  ON category_counts.customer_id = top_categories.customer_id
)
SELECT
  customer_id,
  category_name,
  rental_count,
  category_rank,
  CASE
    WHEN ROUND(100 * raw_percentile_value) = 0 THEN 1
    ELSE ROUND(100 * raw_percentile_value)
  END AS percentile
FROM calculated_cte
WHERE
  category_rank = 1
  AND top_category_name = category_name;
SELECT *
FROM top_category_percentile
LIMIT 10;
customer_idcategory_namerental_countpercentile
323Action71
506Action71
151Action61
410Action61
126Action61
51Action61
487Action61
363Action61
209Action61
560Action61