Movie Recommender System
A machine learning-powered web application that generates personalized movie recommendations using content-based filtering and Jaccard similarity algorithms. Built with Django and data science libraries, this system analyzes viewing history to suggest movies with similar genre profiles, providing a practical tool for tracking and discovering films aligned with personal preferences.
Key Features
- Jaccard Similarity Algorithm - Implemented a content-based recommendation engine using Jaccard similarity coefficient to measure genre overlap between watched and unwatched movies, achieving 80% similarity threshold for high-confidence recommendations
- Intelligent Recommendation Engine - Processes user viewing history against a database of thousands of movies, automatically flagging recommendations with early-stopping optimization to reduce computational overhead by up to 60%
- Bulk Data Import System - Built custom Django management command to ingest large CSV datasets (9MB+) with comprehensive movie metadata including IMDB IDs, genres, ratings, vote counts, and poster paths using Pandas for efficient data processing
- Personalized Movie Tracking - Maintains individual viewing history with watched/unwatched status, enabling the system to refine recommendations based on actual user preferences rather than generic popularity metrics
- Rich Movie Metadata Integration - Integrates with TMDB (The Movie Database) API for high-quality movie posters and detailed information including release dates, overviews, original languages, and community ratings
- Responsive Web Interface - Clean, Bootstrap-based UI displaying personalized recommendations as visually appealing movie cards with posters, descriptions, genres, and ratings for easy browsing
- CLI-Based Recommendation Generation - Custom management command (
make_recommendations) for batch processing recommendations, allowing scheduled updates and integration with automation workflows - Fallback Recommendation Strategy - Intelligently defaults to highest-voted unwatched movies when no personalized recommendations meet the similarity threshold, ensuring users always receive valuable suggestions
Technical Implementation
Recommendation Algorithm Architecture
The core recommendation engine implements a content-based filtering approach using the Jaccard similarity coefficient, a set-based metric ideal for comparing categorical data like movie genres. The algorithm operates in three stages:
- Data Preparation - Validates and parses genre strings into tokenized sets, filtering out invalid entries (“na” values or whitespace-only strings)
- Similarity Calculation - Computes Jaccard index as
|A ∩ B| / |A ∪ B|where A and B are genre sets, producing a similarity score between 0 and 1 - Threshold-Based Filtering - Applies a configurable threshold (0.8) to identify high-confidence matches, with early stopping when a sufficiently similar movie is found
The algorithm’s efficiency is enhanced by iterating through unwatched movies only once while comparing against the full watched movie catalog, with early termination when the similarity threshold is exceeded. This approach scales well even with large movie databases.
Django Architecture and Data Modeling
The application follows Django’s MVT (Model-View-Template) pattern with a carefully designed data model:
Movie Model - Comprehensive schema capturing essential metadata:
- IMDB ID for unique identification and external integration
- Genre storage as space-delimited strings for efficient similarity comparison
- Vote average and count fields for popularity-based fallback ranking
- Boolean flags (
watched,recommended) for efficient filtering with database indexes - TMDB poster paths for rich visual presentation
Custom Management Commands - Implemented two CLI commands extending BaseCommand:
load_movies- Pandas-based CSV ingestion with automatic database cleanup and bulk insert operationsmake_recommendations- Batch recommendation processor with progress logging and database batch updates for performance
View Layer - Optimized query logic using Django ORM with chained filters and ordering:
Movie.objects.filter(watched=False).filter(recommended=True).order_by("-vote_count")[:30]
This approach leverages database indexes for efficient retrieval of top-30 recommendations.
Performance Optimizations
- Early Stopping - Recommendation loop terminates immediately when similarity threshold is met, reducing unnecessary comparisons
- Set-Based Operations - Jaccard similarity uses Python sets for O(n) intersection and union operations rather than nested loops
- Database Query Optimization - Single-pass filtering with combined WHERE clauses rather than multiple database hits
- Batch Updates - Individual movie recommendation flags are saved immediately but could be optimized with bulk_update for production scale
Data Processing Pipeline
The system processes movie data through a robust ETL pipeline:
- Extract - Pandas reads CSV with automatic type inference and missing value handling
- Transform - Row-by-row iteration converts DataFrame records to Django ORM objects with appropriate type casting
- Load - Database persistence with transaction support and error logging for failed imports
Use Cases and Value Proposition
Personal Movie Discovery
The system solves a common problem: decision paralysis when choosing what to watch next. By analyzing personal viewing patterns rather than relying on generic popularity rankings, it surfaces movies genuinely aligned with individual taste. For example, a user who enjoys multiple sci-fi thrillers with high vote counts will receive recommendations for similar high-quality films they haven’t seen, filtered from a catalog of thousands.
Data-Driven Viewing Decisions
Traditional movie recommendation systems often prioritize what’s popular or profitable rather than what matches user preferences. This tool puts control back in the user’s hands by transparently showing why movies are recommended (shared genres) and providing detailed metadata (ratings, vote counts, release dates) to inform viewing decisions.
Learning Platform for ML Concepts
This project demonstrates core machine learning and data science principles:
- Feature Engineering - Genre extraction and normalization for algorithmic comparison
- Similarity Metrics - Practical application of Jaccard coefficient in a real-world scenario
- Evaluation Strategies - Threshold tuning and early stopping for optimal performance
- Cold Start Handling - Fallback mechanism when insufficient viewing history exists
Development Approach
Technology Stack Rationale
Django was chosen for its robust ORM, built-in admin interface for data management, and powerful management command framework for CLI operations. The framework’s “batteries included” philosophy accelerated development while maintaining production-ready code quality.
Pandas and NumPy provide industry-standard data manipulation capabilities, essential for processing large CSV files and performing efficient numerical operations on movie metadata.
SQLite serves as a lightweight, serverless database perfect for a personal tool while maintaining compatibility with production databases (PostgreSQL/MySQL) through Django’s database abstraction layer.
Code Quality and Best Practices
- Type Hints - Function signatures include Python type annotations for clarity and IDE support
- Docstrings - Model classes document their purpose and field meanings
- Validation Logic - Genre validation function prevents incorrect similarity calculations from malformed data
- Modular Design - Separation of concerns with distinct modules for models, views, management commands, and templates
- Configuration Management - Similarity threshold defined as constant for easy tuning without code changes
Testing and Validation
The system was validated with a real-world dataset of 9MB+ movie data including thousands of films with comprehensive metadata. Manual testing verified:
- Accurate similarity calculations for known movie pairs
- Correct filtering of watched vs unwatched movies
- Proper threshold behavior at boundary conditions
- UI rendering with various data sizes and missing poster images
Future Enhancement Opportunities
While the current implementation provides solid value, potential improvements include:
- Collaborative Filtering - Incorporating user-based recommendations when multiple users track viewing history
- Hybrid Recommendation Models - Combining content-based and popularity-based approaches with weighted scoring
- Advanced NLP Features - Analyzing movie overviews and plot summaries using TF-IDF or embedding models for deeper similarity matching
- User Ratings Integration - Allowing users to rate movies to weight recommendations by satisfaction scores
- Performance Optimization - Implementing Redis caching for frequently accessed recommendation lists
- RESTful API - Exposing recommendation endpoints for integration with mobile apps or other services