AYAN-YUE GUPTA

SESSION 1: Basics of Python and text as data

~~~~Table of Contents~~~~

1. Setting up your work environment
2. Your first script
3. Variables, functions, types
4. Keeping code organised
5. Imports
6. Examining the data
7. Lists and dictionaries
8. Saving and loading with JSON
9. Preprocessing text

Hello everyone. This session covers the basics of Python and text preprocessing. A GitHub repository containing the code used for this session can be found here.

The first thing to do when getting started with coding in Python is to understand how the work environment -- the tools used for coding and running scripts -- is set up. There are lots of options for such tools (e.g. integrated development environments, notebooks), but before exploring those options, one should know the most basic set up possible, for which you need just 2 things:

1. A Python installation
2. A plain text editor

We'll be using this minimal set up for the rest of my sessions.

Now, if you've followed the setup for this session, you should already have Python installed and have selected a text editor. The next thing to do is to set up a virtual environment. Knowing how to manage virtual environments is useful because dealing with multiple Python projects can get messy.

One project might require an older version of a particular package, while another project you are working on might require a newer version of the same package. Instead of upgrading and downgrading versions every time you switch between Python projects, virtual environments let you work with different versions simultaneously.

To set up a virtual environment, follow these steps:

1. In the command line, navigate to (or create and then navigate to) the folder for your Python project.
2. Enter the command python -m venv env1.

You now have a virtual environment called env1 in the folder you selected/created in (1). You should get into the habit of navigating to the directory of your environment and activating it whenever you want to work on a Python project with the following commands:

#### navigate to the directory containing your virtual environment
cd C:\Users\YourUsername\Documents\Projects\your_python_project

#### activate your virtual environment
#### Mac/Linux
source env1/bin/activate
#### Windows
env1\Scripts\activate

To deactivate your virtual environment, simply enter:

deactivate


2. Your first script

You are now in a position to write your first script! Using your text editor, create a file called first_script.py in the same directory containing your virtual environment, and enter the following line into it:

print('hello')

Save your file, then in the command line (ensuring you are in the same directory as first_script.py) enter the command:

python first_script.py

You should get the following output in your command line:

hello

Congratulations on your first script! Note the workflow here: create script with text editor, save as .py file, run the file in command line. This is the fundamental process of using Python.


3. Variables, functions, types

If we want to make scripts that can do a bit more than just print text, we need to make sure our code is organised. Scripts can get pretty complicated very quickly, and without good organisation, scripts that perform more complex tasks can become a nightmare to read through and maintain.

The basic tool for ensuring good organisation is the function. Functions allow us to package code we know we are going to use repeatedly in a neat way, and enable that bit of repeatable code to work with many different inputs. For example, suppose in our script we know we are going to be printing a lot of similar stuff. Instead of rewriting the various combinations of what we want to print, we can define a single function that gives us a basic template:

def greetings(name, country):
    
    print(f'Hello, my name is {name} and I am from {country}.')

Here, we give the name and country of our greeting as arguments, i.e. as inputs to the function. Now, whenever we want our script to introduce a different person, instead of repeatedly typing out variations of the greeting, we can simply call the function with different arguments:

greetings('John', 'Great Britain')
greetings('Monica', 'Italy')

Note the formatting used to define the function. After the first line defining the function's name and arguments, the body of the function is indented 4 spaces to the right. Remember this indentation format -- by convention, whenever one indents in Python, one indents with 4 spaces.

You'll notice that if we do not put the arguments of greetings() in quotation marks, the code will not work. This is because Python is structured around the idea of types of values. In Python, values can be:

- strings (text, written in quotation marks)
- integers (whole numbers)
- floats (decimal numbers)
- booleans (True or False)
- collections of other values, such as lists and dictionaries

When we put text in quotation marks, we're creating a string value. Without the quotation marks, Python would think we're trying to refer to a variable called John, which doesn't exist in our code.
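You can check the type of any value with Python's built-in type() function:

```python
print(type('John'))      # <class 'str'> -- a string
print(type(25))          # <class 'int'> -- an integer
print(type(1.85))        # <class 'float'> -- a decimal number
print(type(True))        # <class 'bool'> -- a boolean
print(type(['a', 'b']))  # <class 'list'> -- a list
```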

Variables are names that refer to stored values. Think of them as labeled containers that hold data. For example:

name = "John"
age = 25
height = 1.85
countries_visited = ["France", "Japan", "Brazil"]

Here, name, age, height, and countries_visited are variables that store different types of values. We can use these variables in our functions:

person_name = "Monica"
person_country = "Italy"
greetings(person_name, person_country)

This will produce the same output as greetings('Monica', 'Italy') because the variables contain those string values.

Understanding value types and variables is crucial because Python treats different types differently. For example, the + operator adds numbers but concatenates strings:

result1 = 5 + 10       # result1 equals 15
result2 = "5" + "10"   # result2 equals "510"
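You cannot mix the two directly: "5" + 10 raises a TypeError. Convert between types explicitly with str() and int():

```python
age = 25
# "I am " + age would raise a TypeError, so convert the number to a string first:
message = "I am " + str(age)
print(message)        # I am 25

# going the other way, int() converts a numeric string into an integer:
total = 5 + int("10")
print(total)          # 15
```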

4. Keeping code organised

To ensure organisation, it's a good idea to make sure your scripts always have this structure:

#### write imports at the beginning, e.g.:
import os

#### write functions and classes here, e.g.:
def function(arg):
    
    return arg * 2
  

def main():

    ####call functions and classes here, e.g.:
    function(2)

if __name__ == '__main__':
    main()

Do not worry about what imports are for now -- we'll cover those shortly in section 5. The key parts to understand in this structure are:

- imports go at the very top of the script
- functions are defined after the imports, at the top level of the script
- functions are called inside main()
- the if __name__ == '__main__': block runs main() when the script is executed directly from the command line (rather than imported)

Let's modify our first script to follow this pattern:

def greetings(name, country):
    print(f'Hello, my name is {name} and I am from {country}.')

def main():
    greetings('John', 'Great Britain')
    greetings('Monica', 'Italy')

if __name__ == '__main__':
    main()

When you run this script, it will produce exactly the same output as before, but is now organised in a way that makes it much easier to maintain and update, since now you know where functions are defined, where functions are called and where imports are performed. This structure will become especially valuable as your scripts grow larger and more complex.


5. Imports

In Python scripts, we can import other Python scripts -- known as packages -- written by other people. This makes life much easier: it means we do not have to, for example, write a neural network from scratch every time we want to do some machine learning. We can just import an appropriate network that someone else has written.

Before we can import a package, we need to install it into our virtual environment. There are multiple 'package managers' that enable you to install packages. In my sessions, we will only be using the default package manager, pip.

Let us use pip to install nltk, the Natural Language Toolkit. This package will help us learn the basics of using text as data:

pip install nltk

Now, create a new Python script in your Python project directory entitled preprocessing.py using your text editor. In the way just discussed, add in the main() function and if __name__ == '__main__': block to keep things organised. We can use pass as a placeholder before adding function calls to main(). At the very top of the script, let us import nltk:

import nltk

def main():
    
    pass ####placeholder -- we can remove this when we start adding stuff to main()


if __name__ == '__main__':
    main()

We will be using the Gutenberg dataset from nltk to learn text preprocessing. With nltk, we sometimes need to download additional resources to import nltk packages that don't come with the initial nltk installation. We can use the nltk.download() function to download these additional resources. We will use this function to download the resources needed to import the Gutenberg dataset:

import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

def main():
    
    pass ####placeholder -- we can remove this when we start adding stuff to main()


if __name__ == '__main__':
    main()

When we use from nltk.corpus import gutenberg, we're using a more specific import syntax that lets us import just one particular component from a package. This differs from import nltk, which imports the entire package. The advantage of from x import y is that we can use the imported component directly, without having to prefix it with the package name. So, we can now write gutenberg instead of nltk.corpus.gutenberg every time, which keeps our code concise.
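The same pattern works with any package. As an illustration with the standard library's os module, both forms below call the same function:

```python
import os
# with a plain import, we prefix the function with the full module path
print(os.path.basename('/home/user/first_script.py'))  # first_script.py

from os.path import basename
# with from x import y, we can call the function directly
print(basename('/home/user/first_script.py'))          # first_script.py
```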


6. Examining the data

So we've managed to import some classic literary texts from the Gutenberg dataset to practise preprocessing. But we have no concrete understanding of the data. What does it look like? The Gutenberg dataset is stored as a class. We will not go into the details of classes now -- if we have time, we will go through them in one of the other 2 sessions. For now, it is enough to say that a class is a way of bundling together a collection of related functions. These functions are known as the methods of a class.

So, Gutenberg texts are stored in the class gutenberg, and we can access this data through the methods of gutenberg. One of these methods is .fileids(), which returns a list of all available texts in the Gutenberg corpus. Let's print the output of this method to see what's available:

import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

def main():
    
    print(gutenberg.fileids())

if __name__ == '__main__':
    main()

As you can see, we get a list of filenames representing classic texts like 'austen-emma.txt', 'shakespeare-macbeth.txt', and 'melville-moby_dick.txt'. Let's examine these texts in more detail by writing a for loop and using the .raw() method:

import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

def main():
    
#    print(gutenberg.fileids())
    for file_id in gutenberg.fileids():
        print('####')
        print(gutenberg.raw(file_id)[:100])
    
if __name__ == '__main__':
    main()

With this for loop, we iterate through each filename in gutenberg.fileids(), and print the result of passing each filename to the gutenberg.raw() method. We print only the first 100 characters of each result, using slicing (which we will cover in the next section), to keep things legible. We also print '####' at each iteration so we can see more clearly where one iteration ends and another begins.

For loops are essential in Python: they allow you to perform operations on every element of a data structure automatically, as opposed to, for example, manually typing out a print statement for every filename in gutenberg.fileids().

This process of using print statements and loops to inspect the contents of a dataset is an important step: it gives you a more intuitive understanding of the overall structure and contents of the data.

You'll notice that I've added # at the beginning of one of the lines in the main function in the code above. The # symbol is how we create comments in Python. Any text that follows a # on a line is ignored by Python when running the script. Comments are useful for temporarily disabling code without deleting it, and for leaving notes for yourself or other programmers who might read your code later. I'll be commenting out code we have already gone through that is no longer needed.


7. Lists and dictionaries

Through the previous section, we have gained some familiarity with lists. We can tell a value is a list by looking at its brackets -- if it is enclosed with square brackets [], it is a list. We can use slicing and indexing to look at parts of lists. For example, we can modify our code printing the filenames of the Gutenberg dataset so that gutenberg.fileids() only prints the first 3 filenames:

import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

def main():
    
    print(gutenberg.fileids()[:3])
    
if __name__ == '__main__':
    main()

Let's go through some examples to understand the logic of indexing and slicing:
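Indexing starts at 0, negative indices count from the end, and a slice [start:stop] includes start but excludes stop. Using a few of the Gutenberg filenames as a sample list:

```python
files = ['austen-emma.txt', 'austen-persuasion.txt', 'blake-poems.txt', 'carroll-alice.txt']

print(files[0])     # 'austen-emma.txt' -- the first element (index 0)
print(files[-1])    # 'carroll-alice.txt' -- the last element
print(files[:3])    # the first three elements (indices 0, 1, 2)
print(files[1:3])   # ['austen-persuasion.txt', 'blake-poems.txt']
```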

This slicing syntax works on other sequence types too, like strings:

sentence = "Python is amazing"
print(sentence[0])     # 'P'
print(sentence[:6])    # 'Python'
print(sentence[10:])   # 'amazing'

The other basic data structure you ought to be familiar with is the dictionary, which is enclosed by curly brackets {} and is organised in terms of key-value pairs. Let's write a function converting Gutenberg data into a dictionary to understand what this means:

import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg


def convert_to_dict():

    d = {}
    for file_id in gutenberg.fileids():
        d[file_id] = {}
        d[file_id]['title'] = file_id
        d[file_id]['content'] = gutenberg.raw(file_id)
        d[file_id]['word_count'] = len(gutenberg.words(file_id))
    
    return d


def main():

#    print(gutenberg.fileids())
#    for file_id in gutenberg.fileids():
#        print('####')
#        print(gutenberg.raw(file_id)[:100])

    d = convert_to_dict()
    print(d['austen-emma.txt'])


if __name__ == '__main__':
    main()

Let's go over what this function is doing. First, we create an empty dictionary d = {} that we'll populate with our Gutenberg data.

We then iterate through each file ID in the Gutenberg corpus. For each text, we create a nested dictionary structure with several pieces of information:

d[file_id] = {} # initialise empty dict for each file
d[file_id]['title'] = file_id  # file title
d[file_id]['content'] = gutenberg.raw(file_id)  # store the full text
d[file_id]['word_count'] = len(gutenberg.words(file_id))   # count words

If you print d['austen-emma.txt'] in the main function, you'll be able to see the structured information we have stored in d for 'Emma'.

The string 'austen-emma.txt' that we enter into the square brackets of d['austen-emma.txt'] is a key of this dictionary, and what d['austen-emma.txt'] returns is the corresponding value. In this case the value is another, second dictionary (hence its structure is 'nested'), the values of which we can access using the second dictionary's keys:

print(d['austen-emma.txt']['title'])
print(d['austen-emma.txt']['word_count'])

A couple of things to remember about dictionaries:

- keys must be unique -- assigning to an existing key overwrites its value
- keys must be immutable types, e.g. strings or numbers, while values can be of any type
- accessing a key that does not exist raises a KeyError
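As a small illustration of how dictionary access and assignment behave (the values here are just for demonstration):

```python
d = {'title': 'austen-emma.txt', 'word_count': 192427}

print(d['title'])              # austen-emma.txt
print(d.get('author'))         # None -- .get() returns None for a missing key instead of raising KeyError
d['author'] = 'Jane Austen'    # assigning to a new key adds the pair
d['author'] = 'Austen, Jane'   # assigning to an existing key overwrites the value
print(d['author'])             # Austen, Jane
```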


8. Saving and loading with JSON

Dictionaries are a very common way of storing text data, especially if you are working with APIs and web scraping, so it is important to get familiar with them. An important part of using them is saving them to and loading them from your computer's storage. The standard way of doing this is using the JSON file format, which is designed for storing dictionaries and lists. The advantage of JSON is that it is used across many programming languages, so you don't have to worry about compatibility when sharing data in JSON format.

The package necessary for saving and loading with JSON comes preinstalled with Python, so we can just import it without having to install it first with pip. Let us modify our preprocessing script to save and load the dictionary of Gutenberg texts we just created:

import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
import json
import os


def convert_to_dict():

    d = {}
    for file_id in gutenberg.fileids():
        d[file_id] = {}
        d[file_id]['title'] = file_id
        d[file_id]['content'] = gutenberg.raw(file_id)
        d[file_id]['word_count'] = len(gutenberg.words(file_id))
    
    return d


def main():

#    print(gutenberg.fileids())
#    for file_id in gutenberg.fileids():
#        print('####')
#        print(gutenberg.raw(file_id)[:100])
    
    d = convert_to_dict()
#    print(d['austen-emma.txt'])

    #### create save directory
    save_dir = 'C:/Users/YourUsername/Documents/Projects/my_python_project/data'
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    #### save d as JSON file in save_dir
    with open(f'{save_dir}/gutenberg_data.json', 'w') as f:
        json.dump(d, f)
    #### load d from JSON file we just created
    with open(f'{save_dir}/gutenberg_data.json', 'r') as f:
        d = json.load(f)


if __name__ == '__main__':
    main()

Let's go through the modifications:

- we import json and os at the top of the script
- we use os.path.exists() and os.makedirs() to create the save directory if it does not already exist
- we open a file for writing with with open(..., 'w') and save d into it with json.dump()
- we open the same file for reading with with open(..., 'r') and load the dictionary back with json.load()
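At the heart of the save/load step is a round trip between a dictionary and JSON text. Here is a minimal sketch of that round trip using json.dumps() and json.loads(), the string-based counterparts of json.dump() and json.load() (the word count is made up for the example):

```python
import json

d = {'title': 'austen-emma.txt', 'word_count': 3}

s = json.dumps(d)         # serialise the dictionary to a JSON string
print(s)                  # {"title": "austen-emma.txt", "word_count": 3}

restored = json.loads(s)  # parse the JSON string back into a dictionary
print(restored == d)      # True
```

json.dump() and json.load() do the same thing, but write to and read from a file object rather than a string, as in the script above.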

9. Preprocessing text

Hopefully you now feel a bit more comfortable with the basics of Python. We have not covered the basics exhaustively, but we have covered enough to start thinking about working with text data.

An essential part of working with text data is preprocessing, which refers to all the ways text might need to be prepared before analysing it. Some methods of analysis only require minimal preprocessing -- others will require a lot.

Let's go through some of the basics of preprocessing using the Gutenberg dataset. We'll download and import some additional nltk resources and then write a preprocessing function that covers some common preprocessing procedures:

import nltk
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import gutenberg, stopwords
from nltk.tokenize import word_tokenize
import json
import os
import re


def convert_to_dict():
    
    d = {}
    for file_id in gutenberg.fileids():
        d[file_id] = {}
        d[file_id]['title'] = file_id
        d[file_id]['content'] = gutenberg.raw(file_id)
        d[file_id]['word_count'] = len(gutenberg.words(file_id))
    
    return d


def preprocess_text(text):
    
    #### Preprocess text by applying several cleaning steps:
    #### 1. Convert to lowercase
    #### 2. Tokenize
    #### 3. Remove punctuation and numbers
    #### 4. Remove stop words
   
    #### convert to lowercase
    text = text.lower()
    
    #### tokenize
    #### first, let's see what basic split() does
    basic_tokens = text.split()
    
    #### now, let's use NLTK's word_tokenize
    tokens = word_tokenize(text)
    
    #### remove punctuation and numbers
    #### we'll use a regular expression to keep only alphabetic characters
    cleaned = []
    for token in tokens:
        cleaned.append(re.sub(r'[^a-z]', '', token))
    tokens = cleaned

    #### remove empty strings that might result from the previous step
    cleaned = []
    for token in tokens:
        if token:
            cleaned.append(token)
    tokens = cleaned
    
    #### remove stop words
    stop_words = set(stopwords.words('english'))
    cleaned = []
    for token in tokens:
        if token not in stop_words:
            cleaned.append(token)
    tokens = cleaned


    return tokens


def main():

#    print(gutenberg.fileids())
#    for file_id in gutenberg.fileids():
#        print('####')
#        print(gutenberg.raw(file_id)[:100])

    d = convert_to_dict()
#    print(d['austen-emma.txt'])

    #### create save directory
    save_dir = 'C:/Users/YourUsername/Documents/Projects/my_python_project/data'
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    #### save d as JSON file in save_dir
    with open(f'{save_dir}/gutenberg_data.json', 'w') as f:
        json.dump(d, f)
    #### load d from JSON file we just created
    with open(f'{save_dir}/gutenberg_data.json', 'r') as f:
        d = json.load(f)
    
    #### select a text to preprocess
    text_id = 'shakespeare-macbeth.txt'  # Macbeth by Shakespeare
    text_content = d[text_id]['content']
    
    #### preprocess with our function
    tokens = preprocess_text(text_content)


if __name__ == '__main__':
    main()

Let's walk through the preprocessing steps in our preprocess_text() function:

- we convert the text to lowercase, so that 'The' and 'the' are treated as the same word
- we tokenize the text, i.e. split it into individual words; NLTK's word_tokenize() handles punctuation more carefully than the basic split() method
- we use the regular expression [^a-z] with re.sub() to strip out every character that is not a lowercase letter, removing punctuation and numbers
- we remove the empty strings left behind when a token consisted only of punctuation or numbers
- we remove stop words -- very common words like 'the' and 'of' that often carry little information for analysis
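To see these steps in miniature, here is a stripped-down sketch of the same pipeline that uses split() instead of word_tokenize() and a tiny illustrative stop word list, so it runs without nltk:

```python
import re

# a minimal stand-in for NLTK's English stop word list (illustrative subset)
stop_words = {'the', 'is', 'a', 'of', 'and', 'in'}

text = "The Tragedie of Macbeth is a play written in 1606."
text = text.lower()                                   # 1. lowercase
tokens = text.split()                                 # 2. naive whitespace tokenization
tokens = [re.sub(r'[^a-z]', '', t) for t in tokens]   # 3. keep only alphabetic characters
tokens = [t for t in tokens if t]                     # 4. drop empty strings (e.g. what '1606.' became)
tokens = [t for t in tokens if t not in stop_words]   # 5. drop stop words
print(tokens)  # ['tragedie', 'macbeth', 'play', 'written']
```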
