Disney Movie Recommendation System

Python
Big Data
This Python project uses techniques such as Cosine Similarity and TF-IDF Vectorisation to choose similar Disney movie titles, based on their descriptions.
Author

Lisa Lucky

Published

April 24, 2025

This project is adapted from Kevin Kibe’s Medium article, which details the use of Cosine Similarity and TF-IDF Vectorisation for book recommendations [1]. Using a Disney movies dataset from Kaggle [2], which includes titles, descriptions and directors, I adapted a function definition from the Medium article so that it returns five similar movie titles for a given input title.

For the same dataset, many Kaggle users had built recommenders with TensorFlow, while others used TF-IDF vectorisers (enter “recommendation” in the search bar for notebooks). Simplifying these previous works and drawing on the Medium article, I was able to build a recommendation system for the Disney catalogue. My process is detailed below, with explanations alongside the code.
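To make the two techniques concrete before turning to the real data, here is a minimal sketch on three made-up descriptions. The sentences and the names `toy_descriptions`, `toy_vectoriser` and `toy_similarity` are assumptions for illustration only and are not part of the Disney dataset. TF-IDF turns each text into a weighted word vector, and cosine similarity then scores how closely two vectors point in the same direction.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical descriptions, invented for illustration only.
toy_descriptions = [
    "A princess sails across the ocean to save her island.",
    "A young sailor crosses the ocean in search of a lost island.",
    "A robot cleans up an abandoned planet.",
]

# Turn each description into a TF-IDF vector, ignoring English stop words.
toy_vectoriser = TfidfVectorizer(stop_words="english")
toy_tfidf = toy_vectoriser.fit_transform(toy_descriptions)

# Square matrix: entry [i, j] is the similarity between descriptions i and j.
toy_similarity = cosine_similarity(toy_tfidf)
print(toy_similarity.round(2))
```

The first two descriptions share the words "ocean" and "island", so they score above zero against each other, while the third shares no words with the first and scores zero.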

Importing libraries needed

To begin, the necessary libraries were imported.

Code
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Overview on dataset

Code
df = pd.read_csv('for-download/disney-titles.csv')
df.sample(5)
|      | show_id | type    | title                                             | director      | cast                                              | country       | date_added         | release_year | rating | duration  | listed_in                             | description                                       |
|------|---------|---------|---------------------------------------------------|---------------|---------------------------------------------------|---------------|--------------------|--------------|--------|-----------|---------------------------------------|---------------------------------------------------|
| 1114 | s1115   | TV Show | Origins: The Journey of Humankind                 | NaN           | Mark Monroe, Jason Silva                          | United States | November 12, 2019  | 2016         | TV-14  | 1 Season  | Docuseries, Historical                | Hosted by Jason Silva, this eight-part series ... |
| 554  | s555    | Movie   | The Olympic Elk                                   | James Algar   | Winston Hibler                                    | United States | April 24, 2020     | 1952         | TV-G   | 27 min    | Animals & Nature, Documentary, Family | Olympic elk trek toward the fertile grazing gr... |
| 153  | s154    | Movie   | Shark vs. Surfer                                  | Phil Stebbing | Billy Lloyd                                       | NaN           | July 23, 2021      | 2020         | TV-14  | 44 min    | Animals & Nature, Documentary         | Shark vs. Surfer visits shark-infested surf sp... |
| 222  | s223    | TV Show | Disney Special Agent Oso: Three Healthy Steps ... | NaN           | Sean Astin, Meghan Strange, Phill Lewis, Amber... | NaN           | May 14, 2021       | 2011         | TV-Y   | 1 Season  | Action-Adventure, Animation, Kids     | Oso and his friends watch a kid as they demons... |
| 96   | s97     | TV Show | Dog: Impossible                                   | NaN           | Matt Beisner                                      | United States | September 22, 2021 | 2019         | TV-PG  | 2 Seasons | Animals & Nature, Docuseries, Family  | Matt Beisner uses unique approaches to modifyi... |

Info on dataset

Code
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1450 entries, 0 to 1449
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       1450 non-null   object
 1   type          1450 non-null   object
 2   title         1450 non-null   object
 3   director      977 non-null    object
 4   cast          1260 non-null   object
 5   country       1231 non-null   object
 6   date_added    1447 non-null   object
 7   release_year  1450 non-null   int64 
 8   rating        1447 non-null   object
 9   duration      1450 non-null   object
 10  listed_in     1450 non-null   object
 11  description   1450 non-null   object
dtypes: int64(1), object(11)
memory usage: 136.1+ KB

Null values in dataset, necessary for cleaning

Code
df.isnull().sum()
show_id           0
type              0
title             0
director        473
cast            190
country         219
date_added        3
release_year      0
rating            3
duration          0
listed_in         0
description       0
dtype: int64

All null values dropped, simplifying dataset

Code
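# Drop all rows with missing values to simplify the dataset (GeeksforGeeks, 2025):
# https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/
# Note: dropna() returns a new DataFrame. The result is only displayed here, so df
# itself is unchanged; write df = df.dropna() to keep the cleaned rows.
df.dropna()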
|      | show_id | type  | title                                          | director         | cast                                              | country                                | date_added         | release_year | rating | duration | listed_in                                     | description                                       |
|------|---------|-------|------------------------------------------------|------------------|---------------------------------------------------|----------------------------------------|--------------------|--------------|--------|----------|-----------------------------------------------|---------------------------------------------------|
| 2    | s3      | Movie | Ice Age: A Mammoth Christmas                   | Karen Disher     | Raymond Albert Romano, John Leguizamo, Denis L... | United States                          | November 26, 2021  | 2011         | TV-G   | 23 min   | Animation, Comedy, Family                     | Sid the Sloth is on Santa's naughty list.         |
| 5    | s6      | Movie | Becoming Cousteau                              | Liz Garbus       | Jacques Yves Cousteau, Vincent Cassel             | United States                          | November 24, 2021  | 2021         | PG-13  | 94 min   | Biographical, Documentary                     | An inside look at the legendary life of advent... |
| 9    | s10     | Movie | A Muppets Christmas: Letters To Santa          | Kirk R. Thatcher | Steve Whitmire, Dave Goelz, Bill Barretta, Eri... | United States                          | November 19, 2021  | 2008         | G      | 45 min   | Comedy, Family, Musical                       | Celebrate the holiday season with all your fav... |
| 12   | s13     | Movie | The Pixar Story                                | Leslie Iwerks    | Stacy Keach, John Lasseter, Brad Bird, John Mu... | United States                          | November 19, 2021  | 2007         | G      | 91 min   | Documentary, Family                           | A groundbreaking company forever changes the f... |
| 19   | s20     | Movie | Enchanted                                      | Kevin Lima       | Amy Adams, Patrick Dempsey, James Marsden, Tim... | United States                          | November 12, 2021  | 2007         | PG     | 110 min  | Comedy, Family, Fantasy                       | An animated princess winds up in the real worl... |
| ...  | ...     | ...   | ...                                            | ...              | ...                                               | ...                                    | ...                | ...          | ...    | ...      | ...                                           | ...                                               |
| 1445 | s1446   | Movie | X-Men Origins: Wolverine                       | Gavin Hood       | Hugh Jackman, Liev Schreiber, Danny Huston, wi... | United States, United Kingdom          | June 4, 2021       | 2009         | PG-13  | 108 min  | Action-Adventure, Family, Science Fiction     | Wolverine unites with legendary X-Men to fight... |
| 1446 | s1447   | Movie | Night at the Museum: Battle of the Smithsonian | Shawn Levy       | Ben Stiller, Amy Adams, Owen Wilson, Hank Azar... | United States, Canada                  | April 2, 2021      | 2009         | PG     | 106 min  | Action-Adventure, Comedy, Family              | Larry Daley returns to rescue some old friends... |
| 1447 | s1448   | Movie | Eddie the Eagle                                | Dexter Fletcher  | Tom Costello, Jo Hartley, Keith Allen, Dickon ... | United Kingdom, Germany, United States | December 18, 2020  | 2016         | PG-13  | 107 min  | Biographical, Comedy, Drama                   | True story of Eddie Edwards, a British ski-jum... |
| 1448 | s1449   | Movie | Bend It Like Beckham                           | Gurinder Chadha  | Parminder Nagra, Keira Knightley, Jonathan Rhy... | United Kingdom, Germany, United States | September 18, 2020 | 2003         | PG-13  | 112 min  | Buddy, Comedy, Coming of Age                  | Despite the wishes of their traditional famili... |
| 1449 | s1450   | Movie | Captain Sparky vs. The Flying Saucers          | Mark Waring      | Charlie Tahan                                     | United States                          | April 1, 2020      | 2012         | TV-G   | 2 min    | Action-Adventure, Animals & Nature, Animation | View one of Sparky's favorite home movies.        |

818 rows × 12 columns
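Two details of `dropna()` are easy to miss, so here is a small sketch on a made-up three-row frame (`toy_df` and its values are hypothetical, not taken from the Disney data). First, `dropna()` returns a new DataFrame and leaves the original untouched unless the result is assigned. Second, the `subset` parameter limits the check to chosen columns, which can keep far more rows when only text columns matter.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the Disney data.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Someone", np.nan, "Someone Else"],
    "description": ["First.", "Second.", np.nan],
})

toy_df.dropna()            # result is not assigned, so toy_df is unchanged
print(len(toy_df))         # still 3 rows

# Keep only rows that are complete in every column.
all_complete = toy_df.dropna()

# Keep rows that are complete in the text columns only.
text_complete = toy_df.dropna(subset=["title", "description"])

print(len(all_complete), len(text_complete))
```

In the `df.info()` output above, `title` and `description` have no missing values, so a subset-based drop on those two columns would keep all 1,450 rows instead of 818.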

Creating the TF-IDF vector values

Code
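# Build TF-IDF vectors from single words and word pairs, ignoring English stop words.
# Note: the vectoriser is fitted on the title column here; pass df['description']
# instead to compare movies by their descriptions.
vector = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), min_df=0.0, stop_words='english')
tfidf_matrix = vector.fit_transform(df['title'])

# Pairwise cosine similarity between all rows (square matrix, 1.0 on the diagonal)
the_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(the_matrix)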
[[1.         0.07856161 0.07072431 ... 0.         0.         0.        ]
 [0.07856161 1.         0.09737137 ... 0.         0.         0.        ]
 [0.07072431 0.09737137 1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]

Resetting the index back to default

Code
df = df.reset_index()  # keep the old index as a column and restore the default 0..n-1 index
indices = pd.Series(df['description'])
indices[:5]
0     Join Mickey and the gang as they duck the halls!
1    Santa Claus passes his magic bag to a new St. ...
2            Sid the Sloth is on Santa's naughty list.
3                 This is real life, not just fantasy!
4    A three-part documentary from Peter Jackson ca...
Name: description, dtype: object

Recommendation function (Input = Moana)

Code
# Return the values of the title column as a list, so the input title can be checked
# against it (pandas.pydata, n.d.):
# https://pandas.pydata.org/docs/reference/api/pandas.Series.to_list.html
give_title = df['title'].tolist()

def recommend(title, cosine_sim=the_matrix):
    if title not in give_title:
        return "Unable to find title."

    recommended_movies = []
    # Search for title in title column to get the index
    idx = df.index[df['title'] == title].tolist()[0]
    # Rank all rows by their similarity to the chosen title, highest first
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)
    # Skip the first entry (assumed to be the title itself) and keep the next five
    top_5_indices = list(score_series.iloc[1:6].index)

    for i in top_5_indices:
        recommended_movies.append(df['title'].iloc[i])  # Recommend based on title

    return recommended_movies

print(recommend("Moana"))

# Final block fixed and debugged using Google Colab AI assistant (Gemini).
['James and the Giant Peach', 'Jack', 'Iron Will', 'Iron Man Armored Adventures', 'Iron Man (Series)']
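When a title shares no terms with any other row, every score apart from its own is zero, and the five titles returned are then arbitrary; this may be what happens with a one-word title such as "Moana". The sketch below is one possible variant, not the method from the Medium article: it compares descriptions instead of titles, removes the queried row explicitly, and drops zero scores. The catalogue, `recommend_by_description` and every other name in it are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue; titles and descriptions are invented for illustration.
toy_catalogue = pd.DataFrame({
    "title": ["Island Voyage", "Ocean Quest", "Robot Cleanup", "Space Robot"],
    "description": [
        "A princess sails the ocean to save her island.",
        "A sailor crosses the ocean to find a lost island.",
        "A lonely robot cleans an abandoned planet.",
        "A robot explores a distant planet in space.",
    ],
})

# Fit TF-IDF on the descriptions and compute all pairwise similarities.
desc_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_catalogue["description"])
desc_similarity = cosine_similarity(desc_tfidf)

# Map each title to its row position (keeps the first row if a title repeats).
title_to_position = pd.Series(range(len(toy_catalogue)), index=toy_catalogue["title"])
title_to_position = title_to_position[~title_to_position.index.duplicated()]

def recommend_by_description(title, top_n=5):
    if title not in title_to_position.index:
        return "Unable to find title."
    position = int(title_to_position[title])
    scores = pd.Series(desc_similarity[position])
    scores = scores.drop(position)   # remove the queried row explicitly
    scores = scores[scores > 0]      # ignore rows that share no terms
    best = scores.sort_values(ascending=False).head(top_n).index
    return toy_catalogue["title"].iloc[list(best)].tolist()

print(recommend_by_description("Island Voyage"))
```

With this variant the list can be shorter than five, which signals that few genuinely similar entries exist instead of padding the result with unrelated titles.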

Insights and Reflections
  • I learned how to use Cosine Similarity and TF-IDF Vectorisation. Similar to the idea of the SQL project, I leveraged information about Disney movies to build a recommendation system.

  • Recommendation systems are essential in today’s business world: through systems such as these, companies can better understand their customers and how to attract them [3]. This project shows my ability to follow trends and understand the customer’s mind. A significant component of Disney’s success is its ability to recommend suitable content to customers and keep their attention [4]. Recognising this allows me to apply what I have learned at university to add value to a company.

Links to other coding sources used in the project

  • TF-IDF and Cosine Similarity help

  • Dealing with missing data in Pandas

  • Use of code blocks in Quarto

  • Pandas to_list() explanation

  • The reset in Pandas

Footnotes

  1. Code adapted from article

  2. Link to Kaggle dataset

  3. Disney+ and their recommendation system

  4. Disney+ and their recommendation system

Copyright 2025, Lisa Lucky