Topic Analysis and Word2Vec Model of the Yelp Dataset

Project description: This project has four major steps. The first was preprocessing the data with spaCy, which included tokenization, lemmatization, punctuation removal, and named entity recognition. The second was phrase modeling with gensim, which created bigrams and trigrams from the cleaned unigram text. The third was topic modeling, where we first searched for the optimal number of topics to extract; the results are shown in Figure 1. We settled on 30 topics for the gensim LDA model trained on the full dataset, and Figure 2 shows the resulting topics and terms (best viewed interactively). Finally, a gensim Word2Vec model was trained on the full corpus of 41 million sentences to capture semantic similarity and embedding-space relationships between terms. The embedding is shown in Figure 3 (also best viewed interactively).
Preprocessing Code | Preprocessing Analysis
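
To make the phrase modeling step more concrete, here is a minimal pure-Python sketch of how bigram detection of this kind can work. It is not the project's code and does not call gensim; it loosely mirrors the default scoring formula documented for gensim's Phrases model, (count(a, b) - min_count) * vocabulary size / (count(a) * count(b)). The function names, the toy sentences, and the threshold value are all assumptions made for illustration.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1):
    """Score every adjacent token pair in a list of tokenized sentences.

    Hypothetical helper: a simplified stand-in for a phrase model,
    not the gensim implementation.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)
    return {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }


def merge_bigrams(tokens, scores, threshold):
    """Join adjacent tokens whose pair score exceeds the threshold."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = (tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if pair is not None and scores.get(pair, 0.0) > threshold:
            merged.append(pair[0] + "_" + pair[1])
            i += 2  # skip the second half of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy, already-lemmatized sentences (invented for this example).
sentences = [
    ["ice", "cream", "be", "great"],
    ["love", "ice", "cream"],
    ["service", "be", "slow"],
    ["ice", "cream", "and", "service", "be", "great"],
]

scores = score_bigrams(sentences, min_count=1)
phrased = [merge_bigrams(tokens, scores, threshold=1.5) for tokens in sentences]
print(phrased[1])
```

Running the same two functions a second time over the merged output is one way to obtain trigrams, since a merged bigram such as `ice_cream` can then pair with a neighboring token.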

Figure 1: Search for the Optimal Number of Topics

Figure 2: Visualization of the Extracted Topics Using pyLDAvis

Figure 3: Word2Vec Embedding
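
Semantic similarity between terms in a Word2Vec embedding is typically measured with cosine similarity between their vectors. The sketch below shows that calculation in plain Python on a handful of invented three-dimensional vectors. The vectors, the terms, and the `most_similar` helper are hypothetical and are not taken from the trained model, which would use far higher-dimensional vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(term, vectors, topn=2):
    """Rank every other term by cosine similarity to `term` (hypothetical helper)."""
    query = vectors[term]
    ranked = sorted(
        (
            (other, cosine_similarity(query, vec))
            for other, vec in vectors.items()
            if other != term
        ),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:topn]


# Invented toy vectors, for illustration only.
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "burger": [0.6, 0.5, 0.1],
    "parking": [0.0, 0.1, 0.9],
}

for neighbor, similarity in most_similar("pizza", toy_vectors):
    print(f"{neighbor}: {similarity:.3f}")
```

With these toy values, the food terms rank closest to `pizza` while `parking` sits far away, which is the kind of neighborhood structure an embedding plot is meant to reveal.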