In this second session, we will introduce how embeddings are used in neural approaches to natural language processing (NLP). NLP is an academic discipline that is broadly concerned with producing algorithms that 'do stuff' with language. This might involve making algorithms that aid linguistic analysis, or it might involve making algorithms that allow the simulation of conversation.
Currently, neural approaches -- that is, approaches that make use of neural nets -- are the dominant paradigm in NLP. Neural nets are algorithms that mimic how the brain works by having layers of 'neurons', where each neuron is defined by some mathematical function, and these layers repeatedly transform the algorithm's inputs. By 'training' them on data, we can find the neuron parameters that best enable the neural net to perform some desired task. We won't go into how neural nets work here; suffice it to say that what is called 'artificial intelligence' today -- ChatGPT, Claude, DeepSeek etc. -- consists of massive NLP neural nets designed to simulate conversation.
We will go over two topics: word embeddings and embedding-based topic modelling.
We will learn how to generate and use word embeddings in Python through Word2Vec. However, before getting hands-on experience with Word2Vec, we should build an intuition about what embeddings in NLP are.
In NLP, embeddings are vector representations of linguistic items. Pretty much any kind of linguistic item can in principle be represented with a vector: word occurrences (tokens), words, sentences, paragraphs, documents, etc.
A vector is a sequence of numbers that encodes some useful information about whatever it is one wants to represent. We are all familiar from secondary school maths with 2-dimensional vectors like coordinates. A coordinate like (2, 3) might represent a location in space, and might encode information about how to reach the location -- move 2 steps to the right and 3 steps forwards.
Vectors can have any number of dimensions. In NLP, vector representations of linguistic items often have hundreds of dimensions, and just as coordinates can encode information about how to arrive at a location, NLP vector representations can encode information about the meaning of linguistic items.
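To make this concrete, here is a minimal sketch of what a 2-dimensional coordinate and a higher-dimensional embedding look like in Python. The numbers are made up for illustration:

#### a 2-dimensional vector, like a coordinate from school maths
coordinate = [2, 3]

#### a made-up 5-dimensional 'embedding' -- real NLP embeddings work the same
#### way, just with hundreds of dimensions
embedding = [0.12, -0.98, 0.33, 0.07, -0.41]

print(len(coordinate))   #### 2 dimensions
print(len(embedding))    #### 5 dimensions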
How are embeddings made to encode useful linguistic information? This is where the design of the neural network generating embeddings comes in. Neural network architecture, training procedure and training datasets are all carefully designed to enable production of useful linguistic embeddings.
Word2Vec is a small neural network developed by researchers at Google in 2013. Their paper detailing Word2Vec can be found here. Word2Vec produces word embeddings -- embedding representations of words. Word embeddings are sometimes called static embeddings in order to distinguish them from embedding representations of word occurrences or tokens. We will train a Word2Vec model on the Gutenberg dataset and then play around a bit with the word embeddings produced through this training.
To get started with Word2Vec, make sure you have navigated into your Python project directory and have activated your virtual environment:
#### navigate to the directory containing your virtual environment
cd C:\Users\YourUsername\Documents\Projects\your_python_project
#### activate your virtual environment
#### Mac/Linux
source env1/bin/activate
#### Windows
env1\Scripts\activate
Then make sure you have installed the gensim package:
pip install gensim
Now, we will create a new script static_embeddings.py:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
from gensim.models import Word2Vec
import gensim
import json
import os
def main():
    pass

if __name__ == '__main__':
    main()
All we've done so far is set up the proper layout of the script and imported all the packages we will need. The packages from nltk and re will enable us to conduct preprocessing. The packages from gensim give us a Python implementation of Word2Vec. Finally, json and os allow us to load/save JSON files and create/manage directories. Run this script to see if all packages get imported successfully.
Before training Word2Vec on our dataset, some preprocessing is needed. Word2Vec does not handle punctuation well, so we will have to get rid of it. We'll also remove any numbers, to deal with potential page numbers, list markers etc.
A note on stopwords: For any text in any language, there will always be words that are present as a consequence of following grammatical rules -- words such as 'the', 'a', 'that' etc. These are called stopwords. For many NLP procedures, removal of stopwords is necessary. But for other procedures, stopword removal is not necessary and can even worsen the quality of embeddings. For the procedures we are learning about, Word2Vec and BERTopic, stopword removal is not necessary, but we'll write some code for stopword removal anyway just so that you know how to do it.
Let's add in a preprocessing function that does all this:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
from gensim.models import Word2Vec
import gensim
import json
import os

def preprocess(d, separators=['\n\n', '. ', '.--'], titles=['mr', 'mrs', 'ms', 'dr']):
    #### split into sentences
    sentences = []
    for k in d:
        text = d[k]['content']
        # #### examine the text
        # print(repr(text))
        # quit()
        #### lower case
        text = text.lower()
        #### replace sentence boundaries with a single marker
        for sep in separators:
            text = text.replace(sep, '<SEP>')
        #### handle titles
        for title in titles:
            text = text.replace(f'{title}<SEP>', f'{title} ')
        #### split into sentences
        sentences += text.split('<SEP>')
    #### clean sentences
    processed = []
    # sentences = sentences[:10]
    token_count = 0
    for s in sentences:
        #### remove new lines
        s = s.replace('\n', ' ')
        #### tokenize
        tokens = word_tokenize(s)
        # print(tokens)
        # quit()
        #### remove non alphabetic characters
        #### remove stop words
        stop_words = set(stopwords.words('english'))
        # print(stop_words)
        # quit()
        cleaned = []
        for t in tokens:
            t = re.sub(r'[^a-z]', '', t)
            #### skip tokens that are empty after cleaning
            if not t: continue
            # if t in stop_words: continue
            cleaned.append(t)
        processed.append(cleaned)
        token_count += len(cleaned)
        # print('####')
        # print(s)
        # quit()
    print(f'Total number of tokens: {token_count}')
    return processed
def main():
    #### load the gutenberg dictionary saved in the last session
    with open(f'{os.getcwd()}/data/gutenberg_data.json', 'r') as f:
        d = json.load(f)
    sentences = preprocess(d)

if __name__ == '__main__':
    main()
This function has two central loops. The first loop splits text into sentences. This is more complicated than you might think. The grammatical indicators we use to parse long stretches of text into individual sentences, and which we interpret so easily when reading, are actually pretty complicated. We need to tell Python to split sentences at full stops, while ignoring the full stops in honorific titles like 'Mr. ' or 'Ms. ', to take into account the separation between paragraphs, and so on.
To deal with this, the code takes a list of the grammatical indicators that might mark boundaries between sentences (the separators argument) and replaces each of them with a single placeholder string, '<SEP>' (the loop variable sep takes each separator in turn). Then, we can just call text.split('<SEP>') to split the text into sentences.
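To see the trick in isolation, here is a small sketch, using a made-up snippet of text, of how replacing separators with '<SEP>' and then splitting works:

text = 'mr. smith went home. it was late.--he slept.\n\nthe next day was sunny.'

separators = ['\n\n', '. ', '.--']
titles = ['mr', 'mrs', 'ms', 'dr']

#### replace every separator with the same marker
for sep in separators:
    text = text.replace(sep, '<SEP>')
#### undo the split after titles like 'mr. '
for title in titles:
    text = text.replace(f'{title}<SEP>', f'{title} ')

print(text.split('<SEP>'))
#### ['mr smith went home', 'it was late', 'he slept.', 'the next day was sunny.']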
You'll notice that the grammatical indicators we group under '<SEP>' are passed into the function as an argument, but as an argument that looks a bit different from the kinds of function arguments covered in the first session. Unlike those, it has an x=y form. These kinds of arguments are called keyword arguments.
Keyword arguments provide default values that are used if you don't specify them when calling the function. Unlike regular arguments, which must be provided in the correct order, keyword arguments can be omitted entirely and the function will use their default values instead. For example, if you call the preprocess(d, separators=['\n\n', '. ', '.--'], titles=['mr', 'mrs', 'ms', 'dr']) function as simply preprocess(d), it will automatically use the default separators and titles lists. If you want to change which grammatical indicators are used for preprocessing, you can redefine the value of the separators keyword argument when calling the function, e.g. preprocess(d, separators=[';', ', ']).
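As a quick illustration of how keyword arguments behave in general, here is a toy function that is not part of our script, just a demonstration:

def greet(name, greeting='hello', punctuation='!'):
    print(f'{greeting} {name}{punctuation}')

greet('ada')                           #### hello ada!  (defaults used)
greet('ada', greeting='good morning')  #### good morning ada!
greet('ada', punctuation='?')          #### hello ada?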
The second loop cleans each sentence by removing newline characters, tokenizing the sentence into individual words with word_tokenize(), stripping non-alphabetic characters from each token, and (optionally) removing stopwords.
You'll notice a commented-out line for removing stopwords (# if t in stop_words: continue). Let's uncomment it to see how it handles stopword removal -- we'll re-comment it for Word2Vec training. The line a bit above it, stop_words = set(stopwords.words('english')), gives us a set from nltk containing stopwords (don't worry too much about what sets are -- they are data structures a little like lists, in that they can store multiple values). The line if t in stop_words: continue checks whether a token in a sentence is present within the stopwords set. If it is, it's a stopword, and it is not included in the final dataset.
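If you want to see this in isolation, here is a small sketch of filtering stopwords out of a tokenized sentence with nltk, using a made-up sentence and assuming the nltk.download() calls at the top of our script have already been run:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

sentence = 'it was a truth universally acknowledged that he wanted a wife'
tokens = word_tokenize(sentence)
filtered = [t for t in tokens if t not in stop_words]

print(filtered)
#### something like: ['truth', 'universally', 'acknowledged', 'wanted', 'wife']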
Once this function is defined, we call it in the main() block.
Now that we have done some preprocessing, we can train the Word2Vec model. Luckily, gensim makes this very easy to do:
def main():
    #### load the gutenberg dictionary saved in the last session
    with open(f'{os.getcwd()}/data/gutenberg_data.json', 'r') as f:
        d = json.load(f)
    #### preprocess data
    sentences = preprocess(d)
    #### initialise and train w2v model
    model = Word2Vec(sentences=sentences, vector_size=300, workers=4)
    model.train(sentences, total_examples=model.corpus_count, epochs=3)
    #### save model
    model_dir = f'{os.getcwd()}/models'
    if not os.path.exists(model_dir):
        os.makedirs(model_dir)
    model_name = 'gutenberg_w2v'   #### any filename will do
    model.save(f'{model_dir}/{model_name}')

if __name__ == '__main__':
    main()
All we need to train the model are these two lines of code. The first line essentially defines what kind of model we want: it sets up a Word2Vec model to be trained in the way specified by the parameter arguments you can see -- trained on the sentences specified by the sentences keyword argument, and producing embeddings with 300 dimensions as specified by the vector_size parameter.
The second line trains the model set up in the first line.
Finally, we save the trained model so we can use its embeddings in the future if we want.
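Saving means we can reload the trained model in a later script without retraining. A minimal sketch, assuming you saved the model under the models directory with a model_name of your choosing, as above:

from gensim.models import Word2Vec
import os

#### load a previously saved model (path assumes the saving code above)
model = Word2Vec.load(f'{os.getcwd()}/models/gutenberg_w2v')

#### access the 300-dimensional embedding for a single word
#### (assuming 'king' appears in the training data)
vector = model.wv['king']
print(vector.shape)   #### (300,)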
By producing word embeddings, we have encoded relations of meaning into a vector space, such that the embeddings of two words that we would recognise as having similar meanings will occupy positions in that space that are close to each other. So we can expect that embeddings that are close together -- closeness we can measure using Euclidean distance or cosine similarity -- will represent words that are similar in meaning.
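As a small sketch of measuring this closeness directly, you could add the following inside main() after the training code above (the word pairs are just examples, and numpy is assumed to be available, as it is installed with gensim):

#### cosine similarity between two word embeddings, via gensim
print(model.wv.similarity('king', 'queen'))
print(model.wv.similarity('king', 'marriage'))

#### the same calculation done by hand with numpy
import numpy as np
v1, v2 = model.wv['king'], model.wv['queen']
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))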
We can ask our Word2Vec model to give us the words that are closest to a given word in embedding space with the .wv.most_similar() method. Let's add in some code that will let us look at the results of using this method, and so judge how well Word2Vec has encoded relations of meaning similarity into vector space:
def main():
    #### load the gutenberg dictionary saved in the last session
    with open(f'{os.getcwd()}/data/gutenberg_data.json', 'r') as f:
        d = json.load(f)
    #### preprocess data
    sentences = preprocess(d)
    #### initialise and train w2v model
    model = Word2Vec(sentences=sentences, vector_size=300, workers=4)
    model.train(sentences, total_examples=model.corpus_count, epochs=3)
    #### save model
    model_dir = f'{os.getcwd()}/models'
    if not os.path.exists(model_dir):
        os.makedirs(model_dir)
    model_name = 'gutenberg_w2v'   #### any filename will do
    model.save(f'{model_dir}/{model_name}')
    #### explore similar words
    words = ['marriage', 'family', 'man', 'woman', 'love', 'king', 'queen']
    for w in words:
        print('\n####')
        print(w)
        print(model.wv.most_similar(w, topn=10))

if __name__ == '__main__':
    main()
Run the code. How well has Word2Vec captured relations of similar meanings for these words?
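A classic way to probe word embeddings further is vector arithmetic: the most_similar() method also accepts positive and negative word lists, which lets us ask questions like "which words are closest to 'king' minus 'man' plus 'woman'?". On a corpus as small as ours the results may be noisy, but as a sketch you could add this to main() after the loop above:

#### vector arithmetic: which words are closest to king - man + woman?
print(model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=5))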
Topic modelling refers to any procedure designed to automatically uncover the topics expressed in some dataset of texts. One of the most popular topic modelling procedures at the moment is BERTopic, which is what we'll be learning about for the rest of the sessions.
BERTopic is based on a large language model from the pre-ChatGPT generation, BERT. BERT was designed by researchers at Google, and represented the cutting edge of NLP from 2018 until November 2022, when ChatGPT was released. Like Word2Vec, BERT is designed to create high quality linguistic embeddings, which can then be used for any number of tasks.
Unlike Word2Vec, it does not produce embedding representations of individual words. It produces representations of word occurrences, also known as token embeddings. So if a text has 10 occurrences of 'phone', it will not produce a single embedding representing 'phone', as Word2Vec would. It will produce 10 embeddings, one for each occurrence of 'phone'. This means that, compared to Word2Vec, BERT can capture the contextually dependent character of meaning much more effectively. It would be able, for example, to distinguish between occurrences of 'phone' that refer to iPhones and occurrences that refer to Android phones. BERT is also a pre-trained model, meaning that it was trained by Google on (at the time) a vast amount of text before being released to the public, and so is available by default as an already-trained model that can then be trained further, or fine-tuned, on other datasets. This means that BERT embeddings already encode a lot of linguistic information out of the box, and can be used effectively without having to train/fine-tune the model oneself. All this made BERT the best language model of its time.
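To make the contrast concrete, here is a hedged sketch of how one might peek at BERT's token embeddings directly, using the transformers package (not used elsewhere in this session, but installed as a dependency of sentence_transformers; the sentences are made up). Each occurrence of 'phone' gets its own 768-dimensional embedding, and the two embeddings differ because their contexts differ:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert = AutoModel.from_pretrained('bert-base-uncased')

sentences = ['i dropped my phone and cracked its screen',
             'please phone me when you arrive']

phone_id = tokenizer.convert_tokens_to_ids('phone')
for s in sentences:
    inputs = tokenizer(s, return_tensors='pt')
    with torch.no_grad():
        outputs = bert(**inputs)
    #### find the position of the 'phone' token and pull out its embedding
    position = inputs['input_ids'][0].tolist().index(phone_id)
    token_embedding = outputs.last_hidden_state[0, position]
    print(s, token_embedding.shape)   #### torch.Size([768])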
BERTopic makes use of this to create an effective topic modelling procedure. It creates embedding representations of whole documents from BERT token embeddings, and then clusters these document embeddings. Documents similar in meaning get clustered together, and are taken to represent documents that express an underlying topic. To give an idea of what these topics are, BERTopic produces key term representations of each topic, which consist of the most characteristic terms of each cluster of documents.
Let's write a script to use BERTopic on the Gutenberg dataset:
import pandas as pd
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import json
import os
from tqdm import tqdm

def preprocess(d):
    docs = []
    for k in d:
        text = d[k]['content'].split('\n\n')
        for doc in text:
            #### remove new lines
            doc = doc.replace('\n', ' ').strip()
            docs.append(doc)
    return docs

def main():
    #### set up directory for saving our bertopic model
    model_dir = f'{os.getcwd()}/models'
    if not os.path.exists(model_dir):
        os.makedirs(model_dir)
    model_name = 'gutenberg_bertopic'
    #### load gutenberg dictionary from last session
    data_dir = f'{os.getcwd()}/data'
    with open(f'{data_dir}/gutenberg_data.json', 'r') as f:
        d = json.load(f)
    #### split texts into docs
    docs = preprocess(d)
    # print(len(docs))
    #### embed docs
    sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = sentence_model.encode(docs, show_progress_bar=True)
    #### parameter tuning
    umap_model = UMAP(n_neighbors=15, n_components=10, metric='cosine')
    hdbscan_model = HDBSCAN(min_cluster_size=10)
    #### train model
    model = BERTopic(
        embedding_model=sentence_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        verbose=True
    ).fit(docs, embeddings)
    #### save model and topic info
    model.save(f'{model_dir}/{model_name}', serialization='pickle')
    topic_info = model.get_topic_info()
    topic_info.to_csv(f'{os.getcwd()}/{model_name}_info.csv', index=False)

if __name__ == '__main__':
    main()
You'll notice the preprocessing here is much more minimal than for Word2Vec. All we are doing is splitting each text into paragraphs. We do this because BERTopic truncates texts that are too long. That isn't a problem when analysing social media content, which tends to be very short, but for longer texts like the fiction in the Gutenberg dataset it needs to be managed, and splitting into paragraphs is how we manage it here.
Furthermore, unlike with Word2Vec, getting rid of punctuation is not a good idea for BERTopic, since it makes use of punctuation to create high-quality document embeddings.
The meat of the script comes after we have set up a directory for saving a trained BERTopic model, loaded the Gutenberg dictionary and preprocessed our data:
# Create document embeddings with BERT
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = sentence_model.encode(docs, show_progress_bar=True)
# Configure dimensionality reduction and clustering
umap_model = UMAP(n_neighbors=15, n_components=10, metric='cosine')
hdbscan_model = HDBSCAN(min_cluster_size=10)
# Train BERTopic model
model = BERTopic(
    embedding_model=sentence_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True
).fit(docs, embeddings)
There are three main steps here:
1. Create document embeddings with SentenceTransformer(). This class uses a BERT model to construct document embeddings from token embeddings. After creating an instance of the SentenceTransformer() class, we call its .encode() method to get our instance to produce document embeddings.
2. Configure dimensionality reduction and clustering. We create the UMAP and HDBSCAN models, with the parameters we want, that BERTopic will use to reduce the dimensionality of the document embeddings and to cluster them.
3. Train the BERTopic model. We create an instance of the BERTopic() class, and then call the .fit() method to train the model. In this context, 'training' refers to the dimension reduction and clustering processes. We do not actually do any fine-tuning of the BERT model being used -- we rely entirely on the default pre-trained token embeddings that come out of the box.

Something to get familiar with here are the steps of parameter selection and training, and the general pattern they follow is illustrated in the sketch after this paragraph. In Python, most of the time the hard work of implementing a complex machine learning process has already been done by somebody else. We just import the classes that they wrote into our script, organise our data, and then input our data into the imported classes. The only extra work we have to do is specify the parameters with which to conduct training -- where 'training' refers to any process of fitting a model to data (even simple linear regression can be thought of as a trainable machine learning algorithm) -- and then call whatever method is used to execute training.
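Since even simple linear regression fits this pattern, here is a minimal sketch of the import / configure / fit workflow, using scikit-learn (typically installed alongside bertopic as a dependency) and made-up numbers:

from sklearn.linear_model import LinearRegression

#### made-up data: y is simply 2 * x
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

model = LinearRegression()     #### choose the model (and, in general, its parameters)
model.fit(X, y)                #### 'training': fit the model to the data
print(model.predict([[5]]))    #### roughly [10.]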
Finally, let's examine the topics we've produced:
#### save model and topic info
model.save(f'{model_dir}/{model_name}', serialization='pickle')
topic_info = model.get_topic_info()
topic_info.to_csv(f'{os.getcwd()}/{model_name}_info.csv', index=False)
We save our trained BERTopic model in case we want to use it in the future, and get information about our topics with the .get_topic_info() method. We save this topic information in our project directory as a .csv file. Open it and see what you make of the topics.
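If you want to dig further than the .csv, here is a hedged sketch of how you might reload the saved model later and inspect individual topics (the paths assume the directories and model name used above):

from bertopic import BERTopic
import os

#### reload the model saved above
model = BERTopic.load(f'{os.getcwd()}/models/gutenberg_bertopic')

#### key terms (and their weights) for topic 0; topic -1 is the 'outlier' topic
print(model.get_topic(0))

#### overview of all topics, the same information we saved to the .csv
print(model.get_topic_info().head())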