This is the 10th article in my series of articles on Python for NLP. In my previous article, I explained how the StanfordCoreNLP library can be used to perform different NLP tasks.
In this article, we will explore the Gensim library, which is another extremely useful NLP library for Python. Gensim was primarily developed for topic modeling. However, it now supports a variety of other NLP tasks such as converting words to vectors (word2vec), document to vectors (doc2vec), finding text similarity, and text summarization.
In this article and the next in the series, we will see how the Gensim library can be used to perform these tasks.
Installing Gensim
If you use the pip installer to install your Python libraries, you can use the following command to install the Gensim library:
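The standard pip command (add `--user` or use a virtual environment depending on your setup):

```shell
pip install gensim
```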
Alternatively, if you use the Anaconda distribution of Python, you can execute the following command to install the Gensim library:
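A typical conda command; the `conda-forge` channel here is an assumption, as the package is also published on Anaconda's default channels:

```shell
conda install -c conda-forge gensim
```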
Let's now see how we can perform different NLP tasks using the Gensim library.
Creating Dictionaries
Statistical algorithms work with numbers; however, natural languages contain data in the form of text. Therefore, a mechanism is needed to convert words to numbers. Similarly, after applying different types of processing to the numbers, we need to convert the numbers back to text.
One way to achieve this type of functionality is to create a dictionary that assigns a numeric ID to every unique word in the document. The dictionary can then be used to find the numeric equivalent of a word and vice versa.
Creating Dictionaries using In-Memory Objects
It is super easy to create dictionaries that map words to IDs using Python's Gensim library. Look at the following script:
In the script above, we first import the gensim library along with the corpora module from the library. Next, we have some text (the first part of the first paragraph of the Wikipedia article on Artificial Intelligence) stored in the text variable.
To create a dictionary, we need a list of words from our text (also known as tokens). In the following line, we split our document into sentences and then the sentences into words.
We are now ready to create our dictionary. To do so, we can use the Dictionary object of the corpora module and pass it the list of tokens.
Finally, to print the contents of the newly created dictionary, we can use the token2id object of the Dictionary class. The output of the script above looks like this:
The output shows each unique word in our text along with the numeric ID that the word has been assigned. The word or token is the key of the dictionary and the ID is the value. You can also see the ID assigned to an individual word using the following script:
In the script above, we pass the word 'study' as the key to our dictionary. In the output, you should see the corresponding value i.e. the ID of the word 'study', which is 40.
Similarly, you can use the following script to find the key or word for a specific ID.
To print the tokens and their corresponding IDs we used a for-loop. However, you can directly print the tokens and their IDs by printing the dictionary, as shown here:
The output is as follows:
The output might not be as clear as the one printed using the loop, although it still serves its purpose.
Let's now see how we can add more tokens to an existing dictionary using a new document. Look at the following script:
In the script above we have a new document that contains the second part of the first paragraph of the Wikipedia article on Artificial Intelligence. We split the text into tokens and then simply call the add_documents method to add the tokens to our existing dictionary. Finally, we print the updated dictionary on the console.
The output of the code looks like this:
You can see that now we have 65 tokens in our dictionary, while previously we had 45 tokens.
Creating Dictionaries using Text Files
In the previous section, we had in-memory text. What if we want to create a dictionary by reading a text file from the hard drive? To do so, we can use the simple_preprocess method from the gensim.utils module. The advantage of this method is that it reads the text file line by line and returns the tokens from each line, so you don't have to load the complete text file into memory in order to create a dictionary.
Before executing the next example, create a file 'file1.txt' and add the following text to the file (this is the first half of the first paragraph of the Wikipedia article on Global Warming).
Now let's create a dictionary that will contain tokens from the text file 'file1.txt':
In the script above we read the text file 'file1.txt' line-by-line using the simple_preprocess method. The method returns the tokens in each line of the document. The tokens are then used to create the dictionary. In the output, you should see the tokens and their corresponding IDs, as shown below:
Similarly, we can create a dictionary by reading multiple text files. Create another file 'file2.txt' and add the following text to the file (the second part of the first paragraph of the Wikipedia article on Global Warming):
Save the 'file2.txt' in the same directory as the 'file1.txt'.
The following script reads both the files and then creates a dictionary based on the text in the two files:
In the script above we have a method ReturnTokens, which takes the path of the directory containing 'file1.txt' and 'file2.txt' as its only parameter. Inside the method, we iterate through all the files in the directory and read each file line by line. The simple_preprocess method creates tokens for each line, and the tokens for each line are returned to the calling function via the yield keyword.
In the output, you should see the following tokens along with their IDs:
Creating Bag of Words Corpus
Dictionaries contain mappings between words and their corresponding numeric values. Bag of words corpora in the Gensim library are based on dictionaries and contain the ID of each word along with the frequency of occurrence of the word.
Creating Bag of Words Corpus from In-Memory Objects
Look at the following script:
In the script above, we have text that we split into tokens. Next, we initialize a Dictionary object from the corpora module. The object contains a method doc2bow, which basically performs two tasks: it assigns a numeric ID to each new word in the document (when updating is enabled), and it counts the frequency of occurrence of each word.
The output of the above script looks like this:
The output might not make sense to you. Let me explain it. The first tuple (0,1) basically means that the word with ID 0 occurred 1 time in the text. Similarly, (25, 3) means that the word with ID 25 occurred three times in the document.
Let's now print the word and the frequency count to make things clear. Add the following lines of code at the end of the previous script:
The output looks like this:
From the output, you can see that the word 'intelligence' appears three times. Similarly, the word 'that' appears twice.
Creating Bag of Words Corpus from Text Files
Like dictionaries, we can also create a bag of words corpus by reading a text file. Look at the following code:
In the script above, we created a bag of words corpus using 'file1.txt'. In the output, you should see the words from the first paragraph of the Global Warming article on Wikipedia.
The output shows that words like 'of', 'the', 'by', and 'and' occur twice.
Similarly, you can create a bag of words corpus using multiple text files, as shown below:
The output of the script above looks like this:
Creating TF-IDF Corpus
The bag of words approach works fine for converting text to numbers. However, it has one drawback: it assigns a score to a word based only on its occurrence in a particular document. It doesn't take into account that the word might also occur frequently in other documents. TF-IDF resolves this issue.
The term frequency is calculated as:
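One common way to write it (the classic textbook definition; Gensim's default model uses the raw count with vector normalization instead):

```
TF(w, d) = (number of occurrences of word w in document d)
           / (total number of terms in document d)
```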
And the Inverse Document Frequency is calculated as:
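A common formulation, matching the base-2 logarithm that Gensim's TfidfModel uses by default:

```
IDF(w) = log2( (total number of documents)
               / (number of documents containing word w) )
```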
Using the Gensim library, we can easily create a TF-IDF corpus:
To find the TF-IDF values, we can use the TfidfModel class from the models module of the Gensim library. We simply have to pass the bag of words corpus as a parameter to the constructor of the TfidfModel class. In the output, you will see all of the words in the three sentences, along with their TF-IDF values:
Downloading Built-In Gensim Models and Datasets
Gensim comes with a variety of built-in datasets and word embedding models that can be directly used.
To download a built-in model or dataset, we can use the downloader module from the gensim library. We can then call its load method to download the desired package. Look at the following code:
With the commands above, we download the 'glove-wiki-gigaword-100' word embedding model, which is trained on Wikipedia text and is 100-dimensional. Let's try to find the words similar to 'toyota' using our word embedding model. Use the following code to do so:
In the output, you should see the following results:
You can see that all the results are very relevant to the word 'toyota'. The number in each pair is the cosine similarity score; a higher score means the word is more relevant.
Conclusion
The Gensim library is one of the most popular Python libraries for NLP. In this article, we briefly explored how the Gensim library can be used to perform tasks like dictionary and corpus creation. We also saw how to download built-in Gensim models and datasets. In our next article, we will see how to perform topic modeling via the Gensim library.