Predicting the Popularity of YouTube Cooking Videos

Project description: We focused our project on cooking videos specifically because we wanted to avoid highly polarizing topics, making it more likely that a video's popularity reflects its content rather than its subject. We used the YouTube API and youtube-dl to gather the data, a CNN to transform each thumbnail into a numeric feature, and PCA to transform the vectorized text into informative numeric features. Finally, we developed a popularity metric that is a PCA transform of a video's likes, favorites, and views. One of the key difficulties was splitting the datasets to ensure that there was no leakage. To do this, the CNN was trained on a separate dataset that was not used to train the final model.
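To make the popularity metric concrete, here is a minimal sketch of how likes, favorites, and views could be collapsed into a single score with PCA. It is not the project's actual code: the function name `popularity_score`, the choice to standardize each column first, and the choice to keep only the first principal component are all assumptions made for illustration.

```python
import numpy as np

def popularity_score(likes, favorites, views):
    """Project standardized (likes, favorites, views) onto their first principal component.

    Hypothetical helper: standardizing and keeping one component are assumptions.
    """
    X = np.column_stack([likes, favorites, views]).astype(float)

    # Standardize each column so that views (usually the largest counts)
    # do not dominate the component purely because of scale.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    Z = (X - X.mean(axis=0)) / std

    # The rows of Vt from the SVD of the centered data are the principal axes.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    pc1 = Vt[0]

    # The sign of a principal axis is arbitrary; orient it so that
    # more views means a higher score.
    if pc1[2] < 0:
        pc1 = -pc1

    return Z @ pc1
```

A call such as `popularity_score(df["likes"], df["favorites"], df["views"])` (with `df` a hypothetical table of video statistics) would return one score per video, centered on zero.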

Figure 1: Training Results of the CNN

The CNN's training and validation mean squared error during training on the video thumbnails.

CNN Model Definition | CNN Model Training | CNN Model Testing
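Because the CNN's output becomes a feature of the final model, the videos it is trained on should not reappear in the final model's data. The sketch below shows one way such a disjoint split could be made by video ID. The function name `split_video_ids`, the `cnn_fraction` parameter, and the fixed seed are hypothetical and not taken from the project.

```python
import random

def split_video_ids(video_ids, cnn_fraction=0.5, seed=0):
    """Split video IDs into two disjoint groups: CNN training and final model.

    Hypothetical helper; the fraction and seed are illustrative defaults.
    """
    # Deduplicate first so the same video cannot land in both groups.
    ids = sorted(set(video_ids))

    # A seeded shuffle keeps the split reproducible across runs.
    rng = random.Random(seed)
    rng.shuffle(ids)

    cut = int(len(ids) * cnn_fraction)
    cnn_ids, final_ids = ids[:cut], ids[cut:]

    # Sanity check: no video is shared between the two groups.
    assert not set(cnn_ids) & set(final_ids)
    return cnn_ids, final_ids
```

Splitting on the video ID rather than on rows matters when a video can appear more than once in the raw data, since a row-level split could then place copies of the same thumbnail on both sides.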

Figure 2: Feature Importance

Many of the features in this plot are engineered from the existing data, and the full description of each feature can be seen in Table B1 of Appendix B in our report.

Figure 3: Linear Regression Base Model

The output of a base linear regression model on the test dataset. We can see that there are two outliers severely skewing the results.

Figure 4: Ensemble Stacking Regressor

The stacking regressor performed significantly better on the test dataset than the base linear regression model.
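For readers unfamiliar with stacking, the following is a minimal sketch using scikit-learn's `StackingRegressor`. The base learners, the Ridge meta-learner, and the synthetic data are all assumptions chosen for illustration; they are not the project's actual models or dataset, and the numbers the sketch produces say nothing about the results in Figures 3 and 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): a linear term plus a nonlinear term.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Base learners produce out-of-fold predictions (cv=5), and the
# final estimator is fit on those predictions.
stack = StackingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X_train, y_train)

# Baseline for comparison: plain linear regression on the same split.
baseline = LinearRegression().fit(X_train, y_train)

stack_mse = mean_squared_error(y_test, stack.predict(X_test))
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))
print(f"stacking MSE: {stack_mse:.3f}, baseline MSE: {baseline_mse:.3f}")
```

Fitting the meta-learner on out-of-fold predictions, rather than on predictions for data the base learners have already seen, is what keeps the stacked model from simply rewarding whichever base learner overfits most.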

Final Modeling Training and Analysis | Final Dataset Construction | Full Report