Hello everyone. This session covers the basics of Python and text preprocessing. A GitHub repository containing the code used for this session can be found here.
The first thing to do when getting started with coding in Python is to understand how the work environment -- the tools used for coding and running scripts -- is set up. There are lots of options for such tools, e.g. integrated development environments and notebooks, but before exploring those options, one should know the most basic set up possible, for which you need just 2 things:

1. A text editor for writing your scripts.
2. A command line (terminal) with Python installed for running them.

We'll be using this minimal set up for the rest of my sessions.
Now, if you've followed the setup for this session, you should already have Python installed and have selected a text editor. The next thing to do is to set up a virtual environment. Knowing how to manage virtual environments is useful because dealing with multiple Python projects can get messy.
One project might require an older version of a particular package, while another project you are working on might require a newer version of the same package. Instead of upgrading/reverting package versions every time you switch between Python projects, virtual environments let you work with different versions of packages simultaneously.
To set up a virtual environment, follow these steps:

1. Create (or select) a folder for your Python project, for example:

C:\Users\YourUsername\Documents\Projects\my_python_project

2. Open your command line and use the cd command to navigate to the folder you created in (1), for example:

cd C:\Users\YourUsername\Documents\Projects\my_python_project

3. Create a virtual environment called env1 with the command:

python -m venv env1

You now have a virtual environment called env1 in the folder you selected/created in (1). You should get into the habit of navigating to the directory of your environment and activating it whenever you want to work on a Python project with the following commands:
#### navigate to the directory containing your virtual environment
cd C:\Users\YourUsername\Documents\Projects\your_python_project
#### activate your virtual environment
#### Mac/Linux
source env1/bin/activate
#### Windows
env1\Scripts\activate
To deactivate your virtual environment, simply enter:
deactivate
You are now in a position to write your first script! Using your text editor, create a file called first_script.py
in the same directory containing your virtual environment. With your text editor, enter the following line into first_script.py:
print('hello')
Save your file and, in the command line (ensuring you are in the same directory as first_script.py), enter the command:
python first_script.py
You should get the following output in your command line:
hello
Congratulations on your first script! Note the workflow here: create script with text editor, save as .py
file, run the file in command line. This is the fundamental process of using Python.
If we want to make scripts that can do a bit more than just print text, we need to make sure our code is organised so that:

- it is easy to read, and
- it is easy to maintain and update.

Scripts can get pretty complicated very quickly. Without keeping things organised, scripts that can do more complex tasks can become a nightmare to read through and maintain.
The basic tool for ensuring good organisation is the function. Functions allow us to package code we know we are going to use repeatedly in a neat way, and to enable that bit of repeatable code to work with many different inputs. For example, suppose in our script we know we are going to be printing a lot of similar stuff. Instead of rewriting the various combinations of what we want to print repeatedly, we can define a single function that gives us a basic template of what we want to print:
def greetings(name, country):
    print(f'Hello, my name is {name} and I am from {country}.')
Here, we give the name and country of our greeting as arguments, i.e. as inputs to the function. Now, whenever we want our script to introduce a different person, instead of repeatedly typing out variations of the greeting, we can simply call the function with different arguments:
greetings('John', 'Great Britain')
greetings('Monica', 'Italy')
Note the formatting used to define the function. After the first line defining function arguments, the rest of the function is indented 4 spaces to the right. Remember this indentation format -- whenever one indents in Python, one indents with 4 spaces.
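For example, here is a minimal sketch of the difference the indent makes (the function name is made up for illustration):

def correct():
    print('hello')  # the body is indented 4 spaces under the def -- this works

correct()

# def broken():
# print('hello')  # without the indent, Python raises an IndentationError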
You'll notice that if we do not put the arguments of greetings() in quotation marks, the code will not work. This is because Python is structured around the idea of types of values. In Python, values can be:

- str (strings): Text values like 'John' or "Hello world"
- int (integers): Whole numbers like 42 or -7
- float (floating point): Decimal numbers like 3.14 or -0.001
- bool (boolean): True or False values
- list, dict, tuple: Different ways to organise collections of values

When we put text in quotation marks, we're creating a string value. Without the quotation marks, Python would think we're trying to refer to a variable called John, which doesn't exist in our code.
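A quick way to check any value's type is the built-in type() function:

print(type('John'))      # <class 'str'>
print(type(42))          # <class 'int'>
print(type(3.14))        # <class 'float'>
print(type(True))        # <class 'bool'>
print(type(['a', 'b']))  # <class 'list'>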
Variables are names that refer to stored values. Think of them as labeled containers that hold data. For example:
name = "John"
age = 25
height = 1.85
countries_visited = ["France", "Japan", "Brazil"]
Here, name, age, height, and countries_visited are variables that store different types of values. We can use these variables in our functions:
person_name = "Monica"
person_country = "Italy"
greetings(person_name, person_country)
This will produce the same output as greetings('Monica', 'Italy')
because the variables contain those string values.
Understanding value types and variables is crucial because Python treats different types differently. For example, the +
operator adds numbers but concatenates strings:
result1 = 5 + 10 # result1 equals 15
result2 = "5" + "10" # result2 equals "510"
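Mixing the two types raises an error rather than guessing what you meant; to combine them, convert explicitly:

# result3 = "5" + 10      # TypeError: can only concatenate str (not "int") to str
result3 = "5" + str(10)   # convert the int to a str first -- result3 equals "510"
result4 = 5 + int("10")   # or convert the str to an int -- result4 equals 15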
To ensure organisation, it's a good idea to make sure your scripts always have this structure:
#### write imports at the beginning, e.g.:
import os

#### write functions and classes here, e.g.:
def function(arg):
    return arg * 2

def main():
    #### call functions and classes here, e.g.:
    function(2)

if __name__ == '__main__':
    main()
Do not worry about what imports are for now -- we'll cover those in 5 minutes. The key parts to understand in this structure are:

- The main() function contains the code that actually runs when you execute the script.
- The if __name__ == '__main__': line is a special Python construct that ensures your code only runs when the script is executed directly (not when imported by another script).

Let's modify our first script to follow this pattern:
def greetings(name, country):
    print(f'Hello, my name is {name} and I am from {country}.')

def main():
    greetings('John', 'Great Britain')
    greetings('Monica', 'Italy')

if __name__ == '__main__':
    main()
When you run this script, it will produce exactly the same output as before, but is now organised in a way that makes it much easier to maintain and update, since now you know where functions are defined, where functions are called and where imports are performed. This structure will become especially valuable as your scripts grow larger and more complex.
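To see why the if __name__ == '__main__' guard matters, imagine a second, hypothetical script saved in the same directory that imports our first one:

#### second_script.py -- a hypothetical companion script
import first_script  # first_script's guarded main() does NOT run on import

first_script.greetings('Ada', 'England')  # but its functions are available for reuse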
In Python scripts, we can import other Python scripts -- known as packages -- written by other people. This makes life much easier. It means that we do not have to, for example, write a neural network from scratch every time we want to do some machine learning. We can just import an appropriate network that someone else has written.
Before we can import a package, we need to install it into our virtual environment. There are multiple 'package managers' that enable you to install packages. In my sessions, we will only be using the default package manager, pip
.
Let us use pip
to install nltk
, the Natural Language Toolkit. This package will help us learn the basics of using text as data:
pip install nltk
Now, create a new python script in your Python project directory entitled preprocessing.py
using your text editor. In the way just discussed, add in the main
function and if __name__ == '__main__':
block to keep things organised. We can use pass
as a placeholder before adding in function calls to main
. At the very top of the script, let us import nltk
:
import nltk
def main():
    pass #### placeholder -- we can remove this when we start adding stuff to main()

if __name__ == '__main__':
    main()
We will be using the Gutenberg dataset from nltk
to learn text preprocessing. With nltk
, we sometimes need to download additional resources to import nltk
packages that don't come with the initial nltk
installation. We can use the nltk.download()
function to download these additional resources. We will use this function to download the resources needed to import the Gutenberg dataset:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

def main():
    pass #### placeholder -- we can remove this when we start adding stuff to main()

if __name__ == '__main__':
    main()
When we use from nltk.corpus import gutenberg
, we're using a more specific import syntax that lets us import just one particular component from a package. This differs from import nltk
, which imports the entire package. The advantage of using from x import y
is that we can directly use the imported component without having to prefix it with the package name. So, we can now write gutenberg
instead of having to prefix with nltk.corpus
all the time: nltk.corpus.gutenberg
. This keeps our code concise.
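The difference is easiest to see side by side (assuming the gutenberg resource has been downloaded as above):

import nltk
print(nltk.corpus.gutenberg.fileids())  # full prefix required

from nltk.corpus import gutenberg
print(gutenberg.fileids())              # same list, shorter name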
So we've managed to import some classic literary texts from the Gutenberg dataset to practise preprocessing. But we have no concrete understanding of the data. What does it look like? The Gutenberg dataset is stored as a class. We will not go into the details of classes now -- if we have time, we will go through them in one of the other 2 sessions. For now, it is enough to say that a class is a way of bundling together a collection of related functions. These functions are known as the methods of a class.
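As a minimal illustration (the class name Greeter is made up for this example), here is how a class bundles related functions together:

class Greeter:
    def __init__(self, name):  # runs when a Greeter is created
        self.name = name

    def greet(self):  # a method: a function bundled into the class
        print(f'Hello from {self.name}!')

g = Greeter('Monica')  # create an instance of the class
g.greet()              # call one of its methods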
So, Gutenberg texts are stored in the class gutenberg
, and we can access this data through the methods of gutenberg
. One of these methods is .fileids()
, which returns a list of all available texts in the Gutenberg corpus. Let's print the output of this method to see what's available:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

def main():
    print(gutenberg.fileids())

if __name__ == '__main__':
    main()
As you can see, we get a list of filenames representing classic texts like 'austen-emma.txt', 'shakespeare-macbeth.txt', and 'melville-moby_dick.txt'. Let's examine these texts in more detail by writing a for loop and using the .raw()
method:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

def main():
    # print(gutenberg.fileids())
    for file_id in gutenberg.fileids():
        print('####')
        print(gutenberg.raw(file_id)[:100])

if __name__ == '__main__':
    main()
With this for loop, we iterate through each filename in gutenberg.fileids()
, and print the result of inputting each filename into the gutenberg.raw()
method. We only print the first 100 characters of each result, using slicing (which we will cover in the next section), to keep things legible. We also print '####' at each iteration so we can see more clearly where one iteration ends and another begins.
For loops are essential in Python, and allow you to perform operations on every element of any kind of data structure automatically, as opposed to, for example, manually typing out a print statement for every filename in gutenberg.fileids()
.
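The same pattern works on any sequence. For example, with a plain list:

names = ['John', 'Monica', 'Ada']
for name in names:
    print(f'Hello, {name}!')  # runs once per element of the list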
This process of using print statements and loops to inspect the contents of a dataset is an important step: it gives you a more intuitive understanding of the dataset's overall structure and contents.
You'll notice that I've added # at the beginning of one of the lines in the main function in the code above. The # symbol is how we create comments in Python. Any text that follows a # on a line is ignored by Python when running the script. Comments are useful for temporarily disabling code without deleting it, and for leaving notes for yourself or other programmers who might read your code later. I'll be commenting out code that we have already gone through and no longer need.
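For example:

# this whole line is a comment and is ignored when the script runs
print('hello')      # a comment can also follow code on the same line
# print('disabled') # commenting out a line stops it from running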
Through the previous section, we have gained some familiarity with lists. We can tell a value is a list by looking at its brackets -- if it is enclosed with square brackets []
, it is a list. We can use slicing and indexing to look at parts of lists. For example, we can modify our code printing the filenames of the Gutenberg dataset so that gutenberg.fileids()
only prints the first 3 filenames:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

def main():
    print(gutenberg.fileids()[:3])

if __name__ == '__main__':
    main()
Let's go through some examples to understand the logic of indexing and slicing:

- my_list[0]: Access the first element (indexing starts at 0)
- my_list[-1]: Access the last element
- my_list[2:5]: Get elements from index 2 up to (but not including) index 5
- my_list[:3]: Get the first three elements
- my_list[3:]: Get all elements from index 3 to the end

This slicing syntax works on other sequence types too, like strings:
sentence = "Python is amazing"
print(sentence[0]) # 'P'
print(sentence[:6]) # 'Python'
print(sentence[10:]) # 'amazing'
The other basic data structure you ought to be familiar with is the dictionary, which is enclosed by curly brackets {} and organised in terms of key-value pairs. Let's write a function converting Gutenberg data into a dictionary to understand what this means:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

def convert_to_dict():
    d = {}
    for file_id in gutenberg.fileids():
        d[file_id] = {}
        d[file_id]['title'] = file_id
        d[file_id]['content'] = gutenberg.raw(file_id)
        d[file_id]['word_count'] = len(gutenberg.words(file_id))
    return d

def main():
    # print(gutenberg.fileids())
    # for file_id in gutenberg.fileids():
    #     print('####')
    #     print(gutenberg.raw(file_id)[:100])
    d = convert_to_dict()
    print(d['austen-emma.txt'])

if __name__ == '__main__':
    main()
Let's go over what this function is doing. First, we create an empty dictionary d = {}
that we'll populate with our Gutenberg data.
We then iterate through each file ID in the Gutenberg corpus. For each text, we create a nested dictionary structure with several pieces of information:
d[file_id] = {} # initialise empty dict for each file
d[file_id]['title'] = file_id # file title
d[file_id]['content'] = gutenberg.raw(file_id) # store the full text
d[file_id]['word_count'] = len(gutenberg.words(file_id)) # count words
If you print d['austen-emma.txt']
in the main function, you'll be able to see the structured information we have stored in d for 'Emma'.
The string ID 'austen-emma.txt'
we enter into the square brackets of d['austen-emma.txt']
is the key of this dictionary, and what is returned by d['austen-emma.txt']
is the value of the dictionary. In this case our value is another, second dictionary (hence its structure is 'nested'), the values of which we can access using the second dictionary's keys:
print(d['austen-emma.txt']['title'])
print(d['austen-emma.txt']['word_count'])
A couple of things to remember about dictionaries:

- Keys must be unique within a dictionary, and they must be immutable values (strings are the most common choice).
- Values can be of any type, including lists and other dictionaries, which is what makes nesting possible.
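For example, accessing a key that does not exist raises a KeyError, while the .get() method returns a fallback value instead (the dictionary here is a made-up example):

entry = {'title': 'austen-emma.txt'}
print(entry['title'])                  # 'austen-emma.txt'
print(entry.get('author'))             # None -- no KeyError
print(entry.get('author', 'unknown'))  # 'unknown' -- an explicit fallback
# print(entry['author'])               # this line would raise a KeyError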
Dictionaries are a very common way of storing text data, especially if you are working with APIs and web scraping, so it is important to get familiar with them. An important part of using them is saving them to and loading them from your computer's storage. The standard way of doing this is using the JSON file format, which is designed for storing dictionaries and lists. The advantage of JSON is that it is used across many programming languages, so you don't have to worry about compatibility when sharing data in JSON format.
The package necessary for saving and loading with JSON comes preinstalled with Python, so we can just import it without having to install it first with pip
. Let us modify our preprocessing
script to save and load the dictionary of Gutenberg texts we just created:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
import json
import os

def convert_to_dict():
    d = {}
    for file_id in gutenberg.fileids():
        d[file_id] = {}
        d[file_id]['title'] = file_id
        d[file_id]['content'] = gutenberg.raw(file_id)
        d[file_id]['word_count'] = len(gutenberg.words(file_id))
    return d

def main():
    # print(gutenberg.fileids())
    # for file_id in gutenberg.fileids():
    #     print('####')
    #     print(gutenberg.raw(file_id)[:100])
    d = convert_to_dict()
    # print(d['austen-emma.txt'])

    #### create save directory
    save_dir = 'C:/Users/YourUsername/Documents/Projects/my_python_project/data'
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    #### save d as JSON file in save_dir
    with open(f'{save_dir}/gutenberg_data.json', 'w') as f:
        json.dump(d, f)

    #### load d from JSON file we just created
    with open(f'{save_dir}/gutenberg_data.json', 'r') as f:
        d = json.load(f)

if __name__ == '__main__':
    main()
Let's go through the modifications:

- We import two new packages, os and json. The package os provides useful functions for creating and managing directories, which we can use to decide where to save our dictionary, while json allows us to store Python objects as JSON objects. Like json, os does not need to be installed with pip.
- We define a save path, save_dir.
- save_dir does not actually exist as a directory yet, so we have to create it. We do this by first checking whether there is a directory at the path of save_dir with the function os.path.exists(), and if there isn't, creating a directory at that path with os.makedirs().
- Note the structure of this check: if condition: followed by indented code on the next line. If condition returns the boolean value False, the next line's indented code will not run. If condition returns the value True, the next line's code will run (see the short sketch after this list).
- We save and load the dictionary with json.dump() and json.load(). We use these functions within the with open() as f blocks so that we do not have to think about closing the file when we are done saving/loading. If saving/loading is done within these blocks, closing files is handled automatically.
- Within the with open() as f blocks, we are temporarily storing the opened file as the variable f, which we can then use as an argument for json.dump() and json.load() to specify that we are saving to or loading from the file pointed to by f.
- When using open(), be very careful to use the correct write and read flags, or you risk wiping your saved data. Whenever you want to read a file, use the read flag 'r', e.g. open(file_path, 'r'). Whenever you want to save a file, use the write flag 'w', e.g. open(file_path, 'w').
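Here is the short if sketch mentioned above:

threshold = 10
if threshold > 5:
    print('this line runs, because the condition returned True')
if threshold > 100:
    print('this line never runs, because the condition returned False')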
Hopefully you now feel a bit more comfortable with the basics of Python. We have not covered the basics exhaustively, but we have covered enough to start thinking about working with text data.
An essential part of working with text data is preprocessing, which refers to all the ways text might need to be prepared before analysing it. Some methods of analysis only require minimal preprocessing -- others will require a lot.
Let's go through some of the basics of preprocessing using the Gutenberg dataset. We'll download and import some additional nltk
resources and then write a preprocessing function that covers some common preprocessing procedures:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import gutenberg, stopwords
from nltk.tokenize import word_tokenize
import json
import os
import re

def convert_to_dict():
    d = {}
    for file_id in gutenberg.fileids():
        d[file_id] = {}
        d[file_id]['title'] = file_id
        d[file_id]['content'] = gutenberg.raw(file_id)
        d[file_id]['word_count'] = len(gutenberg.words(file_id))
    return d

def preprocess_text(text):
    #### Preprocess text by applying several cleaning steps:
    #### 1. Convert to lowercase
    #### 2. Tokenize
    #### 3. Remove punctuation and numbers
    #### 4. Remove stop words

    #### convert to lowercase
    text = text.lower()

    #### tokenize
    #### first, let's see what basic split() does
    basic_tokens = text.split()
    #### now, let's use NLTK's word_tokenize
    tokens = word_tokenize(text)

    #### remove punctuation and numbers
    #### we'll use a regular expression to keep only alphabetic characters
    cleaned = []
    for token in tokens:
        cleaned.append(re.sub(r'[^a-z]', '', token))
    tokens = cleaned

    #### remove empty strings that might result from the previous step
    cleaned = []
    for token in tokens:
        if token:
            cleaned.append(token)
    tokens = cleaned

    #### remove stop words
    stop_words = set(stopwords.words('english'))
    cleaned = []
    for token in tokens:
        if token not in stop_words:
            cleaned.append(token)
    tokens = cleaned

    return tokens

def main():
    # print(gutenberg.fileids())
    # for file_id in gutenberg.fileids():
    #     print('####')
    #     print(gutenberg.raw(file_id)[:100])
    d = convert_to_dict()
    # print(d['austen-emma.txt'])

    #### create save directory
    save_dir = 'C:/Users/YourUsername/Documents/Projects/my_python_project/data'
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    #### save d as JSON file in save_dir
    with open(f'{save_dir}/gutenberg_data.json', 'w') as f:
        json.dump(d, f)

    #### load d from JSON file we just created
    with open(f'{save_dir}/gutenberg_data.json', 'r') as f:
        d = json.load(f)

    #### select a text to preprocess
    text_id = 'shakespeare-macbeth.txt' # Macbeth by Shakespeare
    text_content = d[text_id]['content']

    #### preprocess with our function
    tokens = preprocess_text(text_content)

if __name__ == '__main__':
    main()
Let's walk through the preprocessing steps in our preprocess_text() function:

- Tokenization: first we try Python's built-in split() method, which simply divides text at whitespace. Then we use NLTK's word_tokenize(), which is more sophisticated and properly handles punctuation separation.
- Removing punctuation and numbers: the re.sub() function replaces any character that's not a lowercase letter with an empty string.
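The difference between the two tokenizers is easiest to see on a small example (assuming the punkt resource from above has been downloaded):

from nltk.tokenize import word_tokenize

sentence = 'it is too late, macbeth.'
print(sentence.split())         # ['it', 'is', 'too', 'late,', 'macbeth.']
print(word_tokenize(sentence))  # ['it', 'is', 'too', 'late', ',', 'macbeth', '.']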