project_viewer.sh
user@portfolio:~$ cat movie-recommender.project

MOVIE RECOMMENDER SYSTEM

ML-powered Django app using Jaccard similarity to generate personalized movie recommendations from viewing history

[STATUS] completed
[TYPE] tool
[DATE] 02.27.2025

[TECH_STACK]

Python, Django, Pandas, NumPy, Machine Learning, SQLite, Bootstrap, TMDB API

[PROJECT_DETAILS]

Movie Recommender System

A machine learning-powered web application that generates personalized movie recommendations using content-based filtering and Jaccard similarity algorithms. Built with Django and data science libraries, this system analyzes viewing history to suggest movies with similar genre profiles, providing a practical tool for tracking and discovering films aligned with personal preferences.

Key Features

  • Jaccard Similarity Algorithm - Implemented a content-based recommendation engine that uses the Jaccard similarity coefficient to measure genre overlap between watched and unwatched movies, applying a 0.8 similarity threshold so that only high-confidence matches are recommended
  • Intelligent Recommendation Engine - Processes user viewing history against a database of thousands of movies, automatically flagging recommendations with early-stopping optimization to reduce computational overhead by up to 60%
  • Bulk Data Import System - Built custom Django management command to ingest large CSV datasets (9MB+) with comprehensive movie metadata including IMDB IDs, genres, ratings, vote counts, and poster paths using Pandas for efficient data processing
  • Personalized Movie Tracking - Maintains individual viewing history with watched/unwatched status, enabling the system to refine recommendations based on actual user preferences rather than generic popularity metrics
  • Rich Movie Metadata Integration - Integrates with TMDB (The Movie Database) API for high-quality movie posters and detailed information including release dates, overviews, original languages, and community ratings
  • Responsive Web Interface - Clean, Bootstrap-based UI displaying personalized recommendations as visually appealing movie cards with posters, descriptions, genres, and ratings for easy browsing
  • CLI-Based Recommendation Generation - Custom management command (make_recommendations) for batch processing recommendations, allowing scheduled updates and integration with automation workflows
  • Fallback Recommendation Strategy - Intelligently defaults to highest-voted unwatched movies when no personalized recommendations meet the similarity threshold, ensuring users always receive valuable suggestions

Technical Implementation

Recommendation Algorithm Architecture

The core recommendation engine implements a content-based filtering approach using the Jaccard similarity coefficient, a set-based metric ideal for comparing categorical data like movie genres. The algorithm operates in three stages:

  1. Data Preparation - Validates and parses genre strings into tokenized sets, filtering out invalid entries (“na” values or whitespace-only strings)
  2. Similarity Calculation - Computes Jaccard index as |A ∩ B| / |A ∪ B| where A and B are genre sets, producing a similarity score between 0 and 1
  3. Threshold-Based Filtering - Applies a configurable threshold (0.8) to identify high-confidence matches, with early stopping when a sufficiently similar movie is found
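
The first two stages can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's actual code: the function names parse_genres and jaccard, the constant name SIMILARITY_THRESHOLD, and the choice to score two empty sets as 0.0 are all hypothetical.

```python
from typing import Optional

# Threshold value taken from the text; the constant name is an assumption.
SIMILARITY_THRESHOLD = 0.8


def parse_genres(raw: Optional[str]) -> set[str]:
    """Turn a space-delimited genre string into a set of genre tokens.

    Returns an empty set for missing, whitespace-only, or "na" input.
    """
    if raw is None:
        return set()
    cleaned = raw.strip()
    if not cleaned or cleaned.lower() == "na":
        return set()
    return set(cleaned.split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|, scored as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


watched = parse_genres("Action Thriller SciFi Drama")
candidate = parse_genres("Action Thriller SciFi Drama Mystery")
score = jaccard(watched, candidate)  # 4 shared genres out of 5 distinct ones
is_match = score >= SIMILARITY_THRESHOLD
```

A threshold of 0.8 is strict for short genre lists: a two-genre movie compared with a three-genre movie that shares both genres scores only 2/3, so it would not qualify.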

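The algorithm makes a single pass over the unwatched movies and compares each one against the watched catalog, stopping as soon as one watched movie reaches the similarity threshold. In the worst case, when no watched movie matches, an unwatched movie is still compared against every watched one, so the total cost grows with the product of the two counts; early termination cuts that work whenever a match is found early.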
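
The loop structure described here might look like the sketch below. It uses a hypothetical in-memory Movie dataclass in place of the Django model, and it assumes that "reaching the threshold" means a score greater than or equal to 0.8; the function name flag_recommendations and the returned comparison count are illustrative additions.

```python
from dataclasses import dataclass

SIMILARITY_THRESHOLD = 0.8  # value from the text; name is an assumption


@dataclass
class Movie:
    """Hypothetical in-memory stand-in for the Django Movie model."""
    title: str
    genres: frozenset[str]
    watched: bool = False
    recommended: bool = False


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard index of two genre sets, 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_recommendations(movies: list[Movie],
                         threshold: float = SIMILARITY_THRESHOLD) -> int:
    """Flag unwatched movies similar to any watched movie.

    Returns the number of similarity comparisons performed.
    """
    watched = [m for m in movies if m.watched and m.genres]
    comparisons = 0
    for candidate in movies:  # single pass over the catalog
        if candidate.watched or not candidate.genres:
            continue
        for seen in watched:
            comparisons += 1
            if jaccard(candidate.genres, seen.genres) >= threshold:
                candidate.recommended = True
                break  # early stop: one close match is enough
    return comparisons


catalog = [
    Movie("A", frozenset({"Action", "Thriller"}), watched=True),
    Movie("B", frozenset({"Drama"}), watched=True),
    Movie("C", frozenset({"Action", "Thriller"})),
    Movie("D", frozenset({"Comedy"})),
]
total_comparisons = flag_recommendations(catalog)
```

In this example movie C stops after one comparison, while movie D has no match and is compared against both watched movies, which is the worst case for a single candidate.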

Django Architecture and Data Modeling

The application follows Django’s MVT (Model-View-Template) pattern with a carefully designed data model:

Movie Model - Comprehensive schema capturing essential metadata:

  • IMDB ID for unique identification and external integration
  • Genre storage as space-delimited strings for efficient similarity comparison
  • Vote average and count fields for popularity-based fallback ranking
  • Boolean flags (watched, recommended) for efficient filtering with database indexes
  • TMDB poster paths for rich visual presentation

Custom Management Commands - Implemented two CLI commands extending BaseCommand:

  • load_movies - Pandas-based CSV ingestion with automatic database cleanup and bulk insert operations
  • make_recommendations - Batch recommendation processor with progress logging; recommendation flags are currently saved one movie at a time (see Performance Optimizations)

View Layer - Optimized query logic using Django ORM with chained filters and ordering:

Movie.objects.filter(watched=False).filter(recommended=True).order_by("-vote_count")[:30]

This approach leverages database indexes for efficient retrieval of top-30 recommendations.
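
The selection logic of this query, combined with the fallback strategy listed under Key Features, can be mirrored in plain Python for illustration. The sketch below is not the project's view code: the Movie dataclass is a hypothetical stand-in for the ORM model, movies_to_show is an invented name, and it assumes that "highest-voted" means ordering by vote_count, as the query does.

```python
from dataclasses import dataclass

TOP_N = 30  # matches the [:30] slice in the query above


@dataclass
class Movie:
    """Hypothetical in-memory stand-in for the Django Movie model."""
    title: str
    vote_count: int
    watched: bool = False
    recommended: bool = False


def movies_to_show(movies: list[Movie], limit: int = TOP_N) -> list[Movie]:
    """Return recommended unwatched movies, most-voted first.

    Falls back to all unwatched movies when nothing has been flagged.
    """
    unwatched = [m for m in movies if not m.watched]
    picks = [m for m in unwatched if m.recommended]
    if not picks:
        picks = unwatched  # fallback: no personalized matches available
    return sorted(picks, key=lambda m: m.vote_count, reverse=True)[:limit]


catalog = [
    Movie("A", 10, watched=True),
    Movie("B", 5),
    Movie("C", 50),
    Movie("D", 20, recommended=True),
]
personalized = [m.title for m in movies_to_show(catalog)]

catalog[3].recommended = False  # simulate a run that flagged nothing
fallback = [m.title for m in movies_to_show(catalog)]
```

With the ORM, the same ordering and limit would be pushed into the database rather than applied in Python.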

Performance Optimizations

  • Early Stopping - Recommendation loop terminates immediately when similarity threshold is met, reducing unnecessary comparisons
  • Set-Based Operations - Jaccard similarity uses Python sets, whose intersection and union run in average-case linear time in the size of the genre sets, rather than nested loops over genre lists
  • Database Query Optimization - Single-pass filtering with combined WHERE clauses rather than multiple database hits
  • Batch Updates (not yet implemented) - Recommendation flags are currently saved one movie at a time; collecting the changed rows and writing them with Django’s bulk_update would reduce the number of queries at production scale

Data Processing Pipeline

The system processes movie data through a robust ETL pipeline:

  1. Extract - Pandas reads CSV with automatic type inference and missing value handling
  2. Transform - Row-by-row iteration converts DataFrame records to Django ORM objects with appropriate type casting
  3. Load - Database persistence with transaction support and error logging for failed imports
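
The Transform step is where missing values and type casting matter most. The sketch below shows one way to convert raw rows, such as the dictionaries produced by Pandas’ DataFrame.to_dict("records"), into typed records before they are handed to the ORM. The MovieRecord dataclass, the column names, and the default values are assumptions for illustration; the real CSV schema is not shown in this document.

```python
import math
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class MovieRecord:
    """Hypothetical typed record; field names are illustrative."""
    imdb_id: str
    title: str
    genres: str
    vote_average: float
    vote_count: int
    poster_path: Optional[str]


def is_missing(value: Any) -> bool:
    """True for None and for float NaN, which Pandas uses for empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def to_record(row: dict[str, Any]) -> Optional[MovieRecord]:
    """Cast one raw row to a MovieRecord, or None if a required field is missing."""
    if is_missing(row.get("imdb_id")) or is_missing(row.get("title")):
        return None
    genres = row.get("genres")
    average = row.get("vote_average")
    count = row.get("vote_count")
    poster = row.get("poster_path")
    return MovieRecord(
        imdb_id=str(row["imdb_id"]),
        title=str(row["title"]),
        genres="" if is_missing(genres) else str(genres),
        vote_average=0.0 if is_missing(average) else float(average),
        vote_count=0 if is_missing(count) else int(count),
        poster_path=None if is_missing(poster) else str(poster),
    )


rows = [
    {"imdb_id": "tt0000001", "title": "Example", "genres": "Action Drama",
     "vote_average": 7.5, "vote_count": 1200.0, "poster_path": float("nan")},
    {"imdb_id": float("nan"), "title": "No ID"},
]
records = [r for r in (to_record(row) for row in rows) if r is not None]
```

Rows without an ID or title are skipped rather than raising, which keeps a single malformed line from aborting a large import; whether to skip or log such rows is a design choice.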

Use Cases and Value Proposition

Personal Movie Discovery

The system solves a common problem: decision paralysis when choosing what to watch next. By analyzing personal viewing patterns rather than relying on generic popularity rankings, it surfaces movies aligned with individual taste. For example, a user who has watched several sci-fi thrillers will be recommended unwatched films with closely matching genre sets, ordered by vote count and filtered from a catalog of thousands.

Data-Driven Viewing Decisions

Traditional movie recommendation systems often prioritize what’s popular or profitable rather than what matches user preferences. This tool puts control back in the user’s hands by transparently showing why movies are recommended (shared genres) and providing detailed metadata (ratings, vote counts, release dates) to inform viewing decisions.

Learning Platform for ML Concepts

This project demonstrates core machine learning and data science principles:

  • Feature Engineering - Genre extraction and normalization for algorithmic comparison
  • Similarity Metrics - Practical application of Jaccard coefficient in a real-world scenario
  • Evaluation Strategies - Threshold tuning and early stopping for optimal performance
  • Cold Start Handling - Fallback mechanism when insufficient viewing history exists

Development Approach

Technology Stack Rationale

Django was chosen for its robust ORM, built-in admin interface for data management, and powerful management command framework for CLI operations. The framework’s “batteries included” philosophy accelerated development while maintaining production-ready code quality.

Pandas and NumPy provide industry-standard data manipulation capabilities, essential for processing large CSV files and performing efficient numerical operations on movie metadata.

SQLite serves as a lightweight, serverless database perfect for a personal tool while maintaining compatibility with production databases (PostgreSQL/MySQL) through Django’s database abstraction layer.

Code Quality and Best Practices

  • Type Hints - Function signatures include Python type annotations for clarity and IDE support
  • Docstrings - Model classes document their purpose and field meanings
  • Validation Logic - Genre validation function prevents incorrect similarity calculations from malformed data
  • Modular Design - Separation of concerns with distinct modules for models, views, management commands, and templates
  • Configuration Management - Similarity threshold defined as a single named constant, so it can be tuned in one place without touching the algorithm

Testing and Validation

The system was validated against a real-world CSV dataset of more than 9 MB containing thousands of films with comprehensive metadata. Manual testing verified:

  • Accurate similarity calculations for known movie pairs
  • Correct filtering of watched vs unwatched movies
  • Proper threshold behavior at boundary conditions
  • UI rendering with various data sizes and missing poster images

Future Enhancement Opportunities

While the current implementation provides solid value, potential improvements include:

  • Collaborative Filtering - Incorporating user-based recommendations when multiple users track viewing history
  • Hybrid Recommendation Models - Combining content-based and popularity-based approaches with weighted scoring
  • Advanced NLP Features - Analyzing movie overviews and plot summaries using TF-IDF or embedding models for deeper similarity matching
  • User Ratings Integration - Allowing users to rate movies to weight recommendations by satisfaction scores
  • Performance Optimization - Implementing Redis caching for frequently accessed recommendation lists
  • RESTful API - Exposing recommendation endpoints for integration with mobile apps or other services
EOF: Data loaded