Computational social science draws on the abundance of data generated by the internet. To use that data, we first need to collect it. There are two main ways to collect data from the internet -- APIs and web scraping.
APIs, or application programming interfaces, are a means for organisations to let users collect their data in a controlled manner. For example, YouTube has an API that allows users to collect its data in a way YouTube can control. APIs are used programmatically -- users write scripts in some programming language to interact with them.
Web scraping, on the other hand, is simply the act of writing scripts to download data from websites. It does not work through an interface managed by whoever owns the relevant website. Users just write scripts and run them.
Because web scraping does not go through an interface managed by website owners, it is typically a lot more difficult. APIs return data in a highly structured manner -- typically as key-value records that map neatly onto Python dictionaries -- and usually come with documentation so that users understand this structure. With web scraping, data has no structure beyond the HTML and/or JavaScript that constitutes a website's source code, and there is no documentation. Users have to figure out where in the source code the data they want is located, and then write a script that can navigate the source code to find it.
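To make the contrast concrete, here is a purely illustrative comparison (both snippets are invented, not taken from a real API or website): an API hands you a structured record you can index directly, while scraping leaves you to dig the same information out of raw source code:
#### what an API typically gives you: a structured, self-describing record
api_response = {'video_id': 'abc123', 'title': 'Example video', 'comment_count': 42}
print(api_response['comment_count'])
#### what scraping gives you: the same information buried in HTML,
#### which you have to locate and extract yourself
html_source = '<div class="stats"><span>42 comments</span></div>'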
Let's start learning about APIs with YouTube's Data API, which allows users to collect data about videos and channels for free. We will write a script to collect some video comments.
APIs generally require some kind of authorisation -- users need to be granted permission to use an API. Permission is granted in the form of an API key, which is basically a password. For YouTube, follow the instructions here to get the API key needed to use their API.
Now that you have your key, let's create a script that uses it to make authorised requests to YouTube's API. First, ensure that you have done the following installations:
pip install --upgrade google-api-python-client
pip install --upgrade google-auth-oauthlib google-auth-httplib2
The first thing we will do in our script is import the packages needed to use YouTube's API with Python, and then use those imported packages to create a client that can make authorised requests from the API. We'll also import json and os so we can manage directories and create JSON files for storing data:
import json
import os
import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors

def main():
    #### create api client
    with open('path/to/your/api/key.txt', 'r') as f:
        api_key = f.read().strip()    # strip any surrounding whitespace/newlines from the key
    client = googleapiclient.discovery.build('youtube', 'v3', developerKey=api_key)

if __name__ == '__main__':
    main()
It's a bad idea to have your API key directly coded as a variable in your scripts, since anyone you share a script with could then use your key. API access often has limits, or is not free, so you don't want others using up your quota or wasting your money. You'll notice that in the script above, I open a separate .txt file that contains the API key rather than coding it in directly as a variable.
For this to work, copy and paste your API key into a .txt file and store it in your Python project directory. Make sure what you paste contains only the key -- don't include whitespace. Now, whenever you run this script, it will open this .txt file and get the API key from there. This way, others cannot use your key just by reading your script, and you only have to keep track of the .txt file itself.
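If you prefer not to keep the key in a file at all, an alternative is to store it in an environment variable and read it with os. This is just a sketch -- the variable name YOUTUBE_API_KEY is an arbitrary example, not something the API requires:
import os

#### read the key from an environment variable set in your shell,
#### e.g. export YOUTUBE_API_KEY='your-key-here'
api_key = os.environ.get('YOUTUBE_API_KEY')
if api_key is None:
    raise RuntimeError('YOUTUBE_API_KEY is not set')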
After accessing your API key, the script passes it to the googleapiclient.discovery.build() method as the developerKey keyword argument, which then returns an object that we store as client. This client contains all the methods we need to make authorised requests for data from the API.
Let's use our client to search for videos related to 'covid':
import json
import os
import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors

def video_request(term, client, max_results=10, lang='en'):
    #### construct request to retrieve videos that are returned by our search term
    #### https://developers.google.com/youtube/v3/docs/search/list
    request = client.search().list(
        part='snippet',
        maxResults=max_results,
        q=term,
        relevanceLanguage=lang
    )
    #### make request
    return request.execute()

def main():
    #### create api client
    with open('path/to/your/api/key.txt', 'r') as f:
        api_key = f.read().strip()
    client = googleapiclient.discovery.build('youtube', 'v3', developerKey=api_key)
    #### search for youtube videos
    d = video_request('covid', client)

if __name__ == '__main__':
    main()
Here, we define a function video_request that makes use of client's .search().list() method to construct a request for videos returned by a term search. The search term is passed in as video_request's term argument, which is then passed into .search().list()'s q keyword argument. We also add keyword arguments to specify how many videos we want and what video language we want.
A big part of using an API involves looking at documentation to figure out how the API's classes and methods work. Take a look at the documentation for the .search().list() method to understand it.
Documentation in general tends to have a 'Parameters' section and a 'Returns' section. The 'Parameters' section tells you what a function, class or method takes as arguments, while the 'Returns' section details what is outputted by a function, class or method.
After constructing the request, we need to send it to the API. This is done with the request.execute() method, which we call in the function's return line so that the function outputs the result of executing the request.
When calling the function, we store this output as the variable d. If you print d, you'll see that the output is stored as a dictionary. Examine the printed content to become familiar with d's structure.
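For example, inside main(), right after the call to video_request, you might inspect the response like this (a minimal sketch; the exact fields depend on your search results):
#### pretty-print the response to inspect its structure
print(json.dumps(d, indent=2))
#### the response is a dictionary; the search results live under the 'items' key
print(d.keys())
print(len(d['items']))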
Our goal is to get the comments from the videos returned by our search term 'covid'. In the output of our video_request function, data about each video is stored under the 'items' key. We need to go through each entry under this key to retrieve each video's unique video ID. We can then use a video's ID to get the comments people have made on it using the .commentThreads().list() method. Take a look at the documentation for this method to understand how to use it. You'll notice it has a videoId argument where we can input a video's ID.
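As a quick illustration of the structure we are about to loop over, the video IDs can be pulled out of d like this (a sketch assuming d is the search response from above):
#### collect the IDs of the videos returned by the search,
#### skipping results that are channels rather than videos
video_ids = [item['id']['videoId'] for item in d['items']
             if item['id']['kind'] == 'youtube#video']
print(video_ids)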
Let's write a function get_vid_comments() that loops through the video data returned by our video_request function, retrieves each video's ID, and then uses each ID with the .commentThreads().list() method to get the comments on each video:
import json
import os
import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors

def video_request(term, client, max_results=10, lang='en'):
    #### construct request to retrieve videos that are returned by our search term
    #### https://developers.google.com/youtube/v3/docs/search/list
    request = client.search().list(
        part='snippet',
        maxResults=max_results,
        q=term,
        relevanceLanguage=lang
    )
    #### make request
    return request.execute()

def get_vid_comments(d, client, max_results=10):
    results = {}
    for vid in d['items']:
        #### filter out channels and get video id
        if vid['id']['kind'] == 'youtube#channel': continue
        vid_id = vid['id']['videoId']
        #### construct request to retrieve comments from a video
        #### https://developers.google.com/youtube/v3/docs/commentThreads/list
        request = client.commentThreads().list(
            part='id,snippet,replies',
            videoId=vid_id,
            maxResults=max_results
        )
        #### make request
        comments_d = request.execute()
        #### skip videos with no comments
        if len(comments_d['items']) == 0: continue
        #### construct results dictionary
        for item in comments_d['items']:
            comment_info = item['snippet']['topLevelComment']['snippet']
            content = comment_info['textDisplay']
            user = comment_info['authorDisplayName']
            date = comment_info['publishedAt']
            results[item['id']] = {
                'video_id': vid_id,
                'content': content,
                'user': user,
                'date': date
            }
    return results

def main():
    #### create api client
    with open('path/to/your/api/key.txt', 'r') as f:
        api_key = f.read().strip()
    client = googleapiclient.discovery.build('youtube', 'v3', developerKey=api_key)
    #### search for youtube videos
    d = video_request('covid', client)
    #### get youtube comments
    results = get_vid_comments(d, client)
    #### save results
    data_dir = f'{os.getcwd()}/data'
    if not os.path.exists(data_dir): os.makedirs(data_dir)
    with open(f'{data_dir}/youtube_data.json', 'w') as f:
        json.dump(results, f)

if __name__ == '__main__':
    main()
This function loops through the video data we retrieved earlier. At each iteration of this loop, we retrieve the ID of a single video with vid['id']['videoId']. We then construct our request and execute it, storing the output in the variable comments_d. This output contains comment data. We then perform a second loop through comments_d to grab the data we want for each comment and store it in the dictionary results. We then return this dictionary and save it as a JSON file.
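As a quick sanity check, you could load the saved file back in and look at a couple of comments (a minimal sketch, assuming the script above has already been run and saved data/youtube_data.json in your working directory):
import json
import os

#### load the saved comment data and print a couple of entries
with open(f'{os.getcwd()}/data/youtube_data.json', 'r') as f:
    results = json.load(f)
for comment_id, info in list(results.items())[:2]:
    print(comment_id, info['user'], info['date'])
    print(info['content'])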
Let's turn to web scraping. Make a new script:
from bs4 import BeautifulSoup
import requests
import os
import json

def main():
    #### scrape
    url = 'https://books.toscrape.com'
    #### save scraped data
    #### (product_info will be returned by the scrape() function we write next)
    data_dir = f'{os.getcwd()}/data'
    if not os.path.exists(data_dir): os.makedirs(data_dir)
    with open(f'{data_dir}/scraped_data.json', 'w') as f:
        json.dump(product_info, f)

if __name__ == '__main__':
    main()
The key thing to note here is that the website we are going to scrape data from is books.toscrape.com. This website is designed for people to practise scraping on, and will let us learn how to write a script that can follow links, parse HTML source code, and find the data we want.
However, outside of this practice context, you should note that scraping will be a lot more difficult. Some websites specify in their terms and conditions that scraping is not permitted -- you need to watch out for this to avoid ethical problems. Almost all websites will enforce rate limits if you scrape data too fast, because their servers may not be able to handle lots of people sending thousands of requests every minute.
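One common courtesy, not shown in the scripts below, is to check a site's robots.txt and to pause between requests. Here is a purely illustrative sketch of both (books.toscrape.com is used only as an example URL):
import time
import requests
from urllib import robotparser

#### check whether the site's robots.txt allows fetching a given page
rp = robotparser.RobotFileParser()
rp.set_url('https://books.toscrape.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://books.toscrape.com/index.html'))

#### pause between requests so we don't overwhelm the server
for page in ['https://books.toscrape.com']:
    response = requests.get(page)
    time.sleep(1)    # wait one second before the next request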
Let's start writing a function that takes url as input and gives us a dictionary containing all the data we want. If you visit the website, you'll see that it imitates an online bookstore. There are links for each book available. If you click on a book's link, you will get more information about the book -- product description, prices, etc. So, let's write a function that will first visit this mock bookstore, then visit each book's link, and then store information about each book in a dictionary:
from bs4 import BeautifulSoup
import requests
import os
import json

def scrape(url):
    #### send request to website and store response as a variable
    response = requests.get(url)
    #### check if request was successful
    if response.status_code != 200:
        print(response.status_code)
        quit()
    #### parse html content
    soup = BeautifulSoup(response.text, 'html.parser')
    #### get book page links
    books = soup.find_all('article', class_='product_pod')
    quit()

def main():
    #### scrape
    url = 'https://books.toscrape.com'
    product_info = scrape(url)
    #### save scraped data
    data_dir = f'{os.getcwd()}/data'
    if not os.path.exists(data_dir): os.makedirs(data_dir)
    with open(f'{data_dir}/scraped_data.json', 'w') as f:
        json.dump(product_info, f)

if __name__ == '__main__':
    main()
We've started this function by first making a request to the bookstore. If the request is successful, we get the store's HTML source code, which we can use to get the data we want. We use BeautifulSoup to parse the HTML, which allows us to use BeautifulSoup's methods to easily retrieve sections of the source code.
Much of web scraping involves looking through HTML source code, which we can do in our browser: simply right-click on the website and press 'View Page Source'. Looking at the bookstore's source code, we can see that books' links are contained within article tags with the class "product_pod". We can use the find_all() method to retrieve all these tags and, therefore, retrieve all books' links.
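To see what find_all() does in isolation, here is a tiny self-contained example on a made-up HTML snippet (invented for illustration -- it is not the bookstore's actual source code):
from bs4 import BeautifulSoup

#### a made-up snippet mimicking the structure we are looking for
html = '''
<article class="product_pod"><a href="book-1/index.html">Book 1</a></article>
<article class="product_pod"><a href="book-2/index.html">Book 2</a></article>
'''
soup = BeautifulSoup(html, 'html.parser')
#### retrieve every article tag with class "product_pod", then pull out its link
for article in soup.find_all('article', class_='product_pod'):
    print(article.find('a')['href'])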
However, we still need to retrieve the actual book links from these article tags we've just retrieved. We use the same approach to do this: examine the source code to find the tags containing the information we want, and use the find_all() method to retrieve them:
from bs4 import BeautifulSoup
import requests
import os
import json

def scrape(url):
    #### send request to website and store response as a variable
    response = requests.get(url)
    #### check if request was successful
    if response.status_code != 200:
        print(response.status_code)
        quit()
    #### parse html content
    soup = BeautifulSoup(response.text, 'html.parser')
    #### get book page links
    books = soup.find_all('article', class_='product_pod')
    links = []
    for book in books:
        tags = book.find_all('a', href=True)
        for t in tags:
            # print(t)
            link = url + '/' + t['href']
            # print(link)
            links.append(link)
    links = list(set(links))
    quit()

def main():
    #### scrape
    url = 'https://books.toscrape.com'
    product_info = scrape(url)
    #### save scraped data
    data_dir = f'{os.getcwd()}/data'
    if not os.path.exists(data_dir): os.makedirs(data_dir)
    with open(f'{data_dir}/scraped_data.json', 'w') as f:
        json.dump(product_info, f)

if __name__ == '__main__':
    main()
So, what we've done here is create a loop that iterates through each of the article tags. In each iteration, we find all a tags within an article tag that contain a book link, and do a second iteration through those. We construct a full link from each book link, and then collect all books' links into the list links. Now, let's complete the function by adding code that visits each collected book link and stores information about each book in a dictionary:
from bs4 import BeautifulSoup
from tqdm import tqdm
import requests
import os
import json

def scrape(url):
    #### send request to website and store response as a variable
    response = requests.get(url)
    #### check if request was successful
    if response.status_code != 200:
        print(response.status_code)
        quit()
    #### parse html content
    soup = BeautifulSoup(response.text, 'html.parser')
    #### get book page links
    books = soup.find_all('article', class_='product_pod')
    links = []
    for book in books:
        tags = book.find_all('a', href=True)
        for t in tags:
            # print(t)
            link = url + '/' + t['href']
            # print(link)
            links.append(link)
    links = list(set(links))
    #### visit each book page and get product description
    product_info = {}
    for l in tqdm(links):
        response = requests.get(l)
        if response.status_code != 200:
            print(response.status_code)
            quit()
        d = {}
        #### parse content
        soup = BeautifulSoup(response.text, 'html.parser')
        #### get book title
        d['title'] = soup.find('title').string
        #### get product description
        #### get all p tags
        tags = soup.find_all('p')
        for t in tags:
            if not t.attrs:
                d['description'] = t.string
        #### get price and universal product code
        table = soup.find('table', class_="table table-striped")
        tags = table.find_all('tr')
        for t in tags:
            #### get price
            if t.find('th').string == 'Price (excl. tax)':
                d['price'] = t.find('td').string
            #### use universal product code as unique key for
            #### each book
            if t.find('th').string == 'UPC':
                product_info[t.find('td').string] = d
    return product_info

def main():
    #### scrape
    url = 'https://books.toscrape.com'
    product_info = scrape(url)
    #### save scraped data
    data_dir = f'{os.getcwd()}/data'
    if not os.path.exists(data_dir): os.makedirs(data_dir)
    with open(f'{data_dir}/scraped_data.json', 'w') as f:
        json.dump(product_info, f)

if __name__ == '__main__':
    main()
You'll notice that the second loop, which visits each book's link, follows the same steps as the first loop that collected those links: send a request, parse the response with BeautifulSoup, and retrieve the relevant tags with .find_all() and .find(). After finding each item of information we want, the loop stores it in the dictionary product_info.
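To check what you've collected, you could load the saved file and inspect an entry (a minimal sketch, assuming the scraper above has been run and that the title and price fields were found for that book):
import json
import os

#### load the scraped book data and print one entry
with open(f'{os.getcwd()}/data/scraped_data.json', 'r') as f:
    product_info = json.load(f)
upc, info = next(iter(product_info.items()))
print(upc, info['title'], info['price'])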