In this article, we are going to look into an unsupervised machine learning model, which aims to cluster documents into topics – the algorithm is called Latent Dirichlet Allocation (LDA).
Essentially, we can assume that documents written about similar things use a similar group of words. LDA tries to describe every document in terms of a set of topics. LDA says:
- Documents are probability distributions over latent topics, which means that every document contains a number of topics
- Topics are probability distributions over words, which means that each topic has words that are most frequently associated with it
The latent topics in a document can be found by searching for groups of words that regularly occur together.
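To make the two distributions a bit more concrete, here is a small sketch with made-up numbers. The vocabulary, the two topics and the probabilities are all hypothetical and only there to show the shapes involved; this is not how LDA is fitted, just what its two outputs look like.

```python
import numpy as np

# Hypothetical toy numbers, chosen only to illustrate the two distributions.
vocab = ["goal", "match", "vote", "party"]

# Each row is one topic: a probability distribution over the vocabulary.
topic_word = np.array([
    [0.50, 0.40, 0.05, 0.05],   # a "sport"-like topic
    [0.05, 0.05, 0.50, 0.40],   # a "politics"-like topic
])

# Each row is one document: a probability distribution over the topics.
doc_topic = np.array([
    [0.9, 0.1],   # mostly topic 0
    [0.5, 0.5],   # an even mix of both topics
])

# Word distribution we would expect for each document under this mixture view.
doc_word = doc_topic @ topic_word

print(doc_word.round(3))
print(doc_word.sum(axis=1))  # each row is still a probability distribution
```

The first document leans heavily on "goal" and "match" because it is mostly topic 0, while the second spreads its weight evenly across all four words.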
First off, we need to do the usual – import the libraries we need and read in our data.
    import spacy
    import pandas as pd

    nlp = spacy.load('en_core_web_sm')
    df = pd.read_csv('textdoc.csv')
Next, we need to bring in our count vectorizer. I’ve described vectorization in this article. Within the vectorizer definition, we also remove the stop words (like ‘and’ or ‘the’). In the brackets I have defined max_df (max document frequency) as 0.9 (90%). That means words which appear in more than 90% of the documents are discarded, as they probably aren’t critical to the key topics of any one document.
I’ve also defined min_df (min document frequency), which we could set to a percentage too, but in this case I have set it to the minimum number of documents a word should appear in. If a word appears in 2 or more documents, it’s probably not just a misspelled word.
We then build our vocabulary and our Document Term Matrix by applying that vectorizer to the article column of our dataframe.
    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')
    document_term_matrix = cv.fit_transform(df['Article'])
Next, we import the LDA library and create an instance of the LDA model. Here, I have set n_components to 10. That means, I want the model to find 10 distinct topics within the corpus of documents.
    from sklearn.decomposition import LatentDirichletAllocation

    lda = LatentDirichletAllocation(n_components=10, random_state=42)
Now, let’s fit the model we created to our document term matrix:

    lda.fit(document_term_matrix)
If we now take a look at lda.components_, we will see it has 10 items, because we set the model to generate 10 topics.
Let’s look into a single topic, so we can see what happens. In the below, we select topic 0, sort its array of word weights and keep the index positions of the 10 highest-weighted words. The index positions it outputs match the index positions of the count vectorizer’s vocabulary, which means we can pull the words themselves.
    single_topic = lda.components_[0]
    top_ten = single_topic.argsort()[-10:]
So let’s loop through the top 10 and print out the associated words:
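    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = cv.get_feature_names_out()
    for index in top_ten:
        print(feature_names[index])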
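One detail that is easy to miss: argsort() sorts in ascending order, so slicing the last 10 positions gives the top words with the strongest one printed last. The sketch below uses a made-up weight array and word list (both hypothetical) to show the ordering and how to reverse it.

```python
import numpy as np

# Hypothetical words and weights standing in for one row of lda.components_.
words = np.array(["tax", "goal", "vote", "team", "match"])
weights = np.array([0.1, 5.0, 2.5, 0.3, 4.0])

# argsort returns the indices that would sort the array from low to high,
# so the last three entries are the three largest weights, weakest first.
ascending = weights.argsort()[-3:]

# Reverse the slice to list the strongest word first.
descending = ascending[::-1]

print(words[ascending])
print(words[descending])
```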
So that topic has some words associated with it. Let’s see what the other topics identified by the model look like. In the below, we’re looping through each of the 10 rows of lda.components_ and extracting the feature names for the top 15 terms in each.
    feature_names = cv.get_feature_names_out()
    for topic_index, topic in enumerate(lda.components_):
        print('FOR TOPIC ' + str(topic_index))
        # index positions of the 15 highest-weighted terms for this topic
        top_words = [feature_names[i] for i in topic.argsort()[-15:]]
        print(top_words)
Now, we can add the topic each article has been assigned back to the original dataframe.
    topic_results = lda.transform(document_term_matrix)
    # take the index of the highest topic probability as the assigned topic
    df['topic'] = topic_results.argmax(axis=1)
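Here is a small sketch of what argmax(axis=1) is doing, using a made-up probability array in place of the real topic_results. The numbers, the toy dataframe and the extra topic_prob column are all hypothetical; keeping the winning probability next to the topic number is just one way to see how confident each assignment is.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic probabilities: 3 documents, 4 topics.
toy_results = np.array([
    [0.02, 0.91, 0.05, 0.02],   # clearly topic 1
    [0.40, 0.10, 0.45, 0.05],   # topic 2, but only just
    [0.25, 0.25, 0.25, 0.25],   # a tie: argmax returns the first index
])

toy_df = pd.DataFrame({"Article": ["doc a", "doc b", "doc c"]})

# Column index of the largest value in each row.
toy_df["topic"] = toy_results.argmax(axis=1)

# The probability that won, as a rough confidence indicator.
toy_df["topic_prob"] = toy_results.max(axis=1)

print(toy_df)
```

The second and third rows show why the topic number alone can be misleading: both get a single label even though the model is far from certain.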
If we look at the topic_results output, we can understand what is happening here:
In the above, each nested list is related to one article and has a length of 10. That is, for each article, we have the likelihood of it belonging to each topic; the highest number determines which topic is most likely. These are probabilities, and in scientific notation they are pretty hard to read, but if you convert the value for document 0, index 1, you get 0.911263140, hence 91.1%. We can neaten it up like below:
    import numpy as np

    np.round(topic_results, 2)
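As a quick sketch of the conversion, the row below puts the 0.911263140 value at index 1 and fills the other positions with made-up numbers (they are hypothetical, not the real output). It also shows an alternative I find handy: leaving the data untouched and only changing how NumPy prints it.

```python
import numpy as np

# Index 1 is the value quoted in the text; the other three are made up.
row = np.array([3.7e-03, 9.11263140e-01, 8.5e-02, 3.686e-05])

# Rounding changes the stored values.
rounded = np.round(row, 2)
as_percent = np.round(row * 100, 1)
print(rounded)
print(as_percent)

# Display-only alternative: suppress scientific notation while printing,
# without modifying the underlying array.
with np.printoptions(suppress=True, precision=4):
    print(row)
```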
Remembering that the index starts from 0, you can see that we expect row 1 to be associated with topic 1, and the same for row 2:
We can add the likelihoods of each topic to our analysis, which helps us to see clearly whether a document is made up of two major topics.
    import numpy as np

    clean_topics = np.round(topic_results, 2)
    topicx = pd.DataFrame(clean_topics)
    result = pd.concat([df, topicx], axis=1, sort=False)
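If you want to flag the two-topic documents automatically, something like the sketch below could work. The probabilities, the topic_N column names and the 0.30 cut-off are all hypothetical choices of mine. Passing index=toy_df.index is a small safety measure so that concat lines the rows up even if the dataframe index is not 0, 1, 2 and so on.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"Article": ["doc a", "doc b", "doc c"]})

# Hypothetical document-topic probabilities: 3 documents, 4 topics.
toy_results = np.array([
    [0.02, 0.91, 0.05, 0.02],
    [0.40, 0.10, 0.45, 0.05],
    [0.05, 0.05, 0.10, 0.80],
])

# Named columns are easier to read than 0, 1, 2, ...
topic_cols = [f"topic_{i}" for i in range(toy_results.shape[1])]
probs = pd.DataFrame(np.round(toy_results, 2), columns=topic_cols, index=toy_df.index)
result = pd.concat([toy_df, probs], axis=1)

# Second-highest probability in each row, and the topic it belongs to.
second_best = np.sort(toy_results, axis=1)[:, -2]
result["second_topic"] = toy_results.argsort(axis=1)[:, -2]

# Hypothetical rule: call a document "mixed" if its runner-up topic is at least 0.30.
result["mixed"] = second_best >= 0.30

print(result)
```

Only the middle document is flagged here, since its two strongest topics sit at 0.45 and 0.40.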