Scraping the Web for IELTS Questions
Web scraping is the process of collecting information from web pages in an automated way. It's a useful skill if you're looking for, say, IELTS questions for your students to practice with. I used Python and Beautiful Soup to download some practice questions. In this blog post, I document my process and note some useful snippets that I always have to look up. There are five parts to this web scraping process: find the data, decide what to keep, explore the data's structure, download it, and then extract it.
Part 1: Find the data
This was the easiest part of the process. I searched online for "IELTS Writing Part 1 Question Examples" and chose one of the first results. The first site I looked at had some questions on it, but the number of questions was limited, and looking at the structure (Part 3 below), I decided it would be too much work for too few questions to bother scraping it.
The next page I looked at had many more questions. After a cursory look, I decided this is what I'd go with.
Part 2: Decide what data to keep
Each question is on its own page. I noticed the page had a heading describing the type of chart in the writing question. I thought that might be useful for future planning, so there's the first thing to keep. Next, there's the question itself, which is just a blob of text followed by an image of a table or diagram. So, I'd have to get that text and download all the images, as well. Lastly, there was a model answer at the bottom of each page. Since I should be able to write my own model answers fairly easily if needed, I decided not to try and extract that. Just the question type and the question itself (image included) would be enough.
Thinking about storage, I decided it would be reasonable to store each heading and question in a CSV file. I would store the question as minified HTML so it all fits on one CSV line.
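For reference, here's a minimal sketch of the kind of CSV writing I had in mind. The file name and the minify_html helper are placeholders of my own, and Python's csv module takes care of the quoting:

import csv

def minify_html(html):
    # Placeholder helper: collapse whitespace so the question HTML sits on one CSV line
    return " ".join(html.split())

with open("questions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Type", "Question"])  # Same columns as the finished CSV shown later
    writer.writerow(["Bar Chart", minify_html("<p>You should spend about 20 minutes...</p>")])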
Part 3: Explore the structure
Each question page is linked to on a list page. The list of questions is paginated, with 20 links per page. I clicked on the "next page" button and looked at the URL bar in the browser. For each page, there's a ?start=... followed by a multiple of 20, all the way up to the last page at ?start=320. I decided I would first scrape these pages for their links and save those links to a file. I show how I build these URLs in code in the next part.
Back on the question pages, hit Ctrl+Shift+C and click on a part of the page that has the info you want. Look at the HTML code and try to identify some structural characteristics that will help you extract that data later on. It's best if there are tags with IDs, like <h3 id="my-interesting-section-heading">. IDs are easy to locate later on. Next best are classes, but there can be many HTML tags that have the same classes. Worst is if tags neither stand out from their surroundings (h3, etc.) nor are unique. The worst case is fairly common, and it was the case for me this time.
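To make those three cases concrete, here's roughly how each one looks in Beautiful Soup. The HTML fragment is made up for illustration, reusing the ID and class names mentioned in this post:

from bs4 import BeautifulSoup

html = """<h3 id="my-interesting-section-heading">Intro</h3>
<td class="list-title"><a href="/question-1">Question 1</a></td>
<h3>Just another heading</h3>"""
soup = BeautifulSoup(html, "html.parser")

# Best case: a unique ID
heading = soup.find(id="my-interesting-section-heading")

# Next best: a class (may match many tags)
titles = soup.find_all("td", class_="list-title")

# Worst case: nothing but the tag name to go on
all_h3s = soup.find_all("h3")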
As I later discovered, even worse was the fact that the HTML of the question pages was invalid. My browser was doing a lot of cleaning up in the background before it showed me the source. Beautiful Soup, however, wasn't doing that same cleaning up. I would have to do some of that cleaning up myself.
Part 4: Download the data
Having scraped other web sites in the past, I have learned from experience that it's sometimes best to work in stages. First, download and save the pages to your computer. That way, if something goes wrong in your code as you process the page, you don't have to download the page again. My strategy here is to first download the list pages to extract the links to all the question pages. Then, I download each question page and save it to disk.
Below is the preliminary setup code for downloading the list pages.
# get_question_urls.py
from urllib.parse import urljoin # Helps make URLs from variables
from bs4 import BeautifulSoup # Tool for extracting data from web pages
import requests # Tool for downloading web pages
import time # We will wait between getting each page
import random # ... a random number of seconds
# The common part of every list page's URL
base_url = "https://some-IELTS-site.com/questions"
question_urls_file = "urls.txt"
# Generate the URLs for all the list pages
starts = range(20, 321, 20) #from 20 to 320 (inclusive), in jumps of 20
urls = [base_url] # Start with just the base URL in the list
# Then add the rest of the URLs
urls.extend([urljoin(base_url, f"?start={start}") for start in starts])
There we go. We've got the list of URLs all prepared, and we are ready to start scraping. For the scraping, we'll need to use requests, which we imported just now. It's not hard to use, but I always forget how to do it.
Getting a web page follows this pattern:
import requests

response = requests.get(url)
if response.status_code == 200:
    do_something()
else:
    maybe_handle_other_status_codes()
    or_just_fail_with_a_message()
It would be best to use try/except to catch exceptions and prevent your program from exiting without cleaning up. Cleaning up in our case would mean writing out the URLs or web page data that we've already downloaded. See the requests documentation for the exceptions that you can handle. The main situations would be a 404, a timeout, or a connection issue if your internet sucks. The try/except allows a for loop to continue if there are any problems.
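If you wanted to handle those cases individually, it might look something like the sketch below. These are real requests exception classes; in my actual loop further down I just used a blanket except:

try:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Raises HTTPError for 404s and other bad statuses
except requests.exceptions.Timeout:
    print(f"Timed out fetching {url}")
except requests.exceptions.ConnectionError:
    print(f"Connection problem fetching {url}")
except requests.exceptions.HTTPError as e:
    print(f"Bad status for {url}: {e}")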
Anyway, here's the process.
question_urls = []

for url in urls:
    try:
        # Be a good netizen. Don't just bombard somebody's server with
        # tons of requests all at once. Patience, Daniel-San!
        wait_time = random.uniform(5, 30)
        print(f"Waiting {wait_time} seconds...")
        time.sleep(wait_time)

        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            td_tags = soup.find_all("td", class_="list-title")
            for td in td_tags:
                href = td.find("a").attrs["href"]
                question_urls.append(href)
        else:
            print(f"Error fetching {url}: Server returned {response.status_code}")
    except Exception as e:
        print(f"An error occurred: {e}")
Beautiful Soup comes in handy here. soup = BeautifulSoup(response.content, "html.parser") and then soup.find_all() gets all the tags that I'm looking for. I noticed that all the links were in a table, and each td tag in the table had a class of list-title. That made it easy to pull those out. Then, I just looked at each of those td tags and pulled the links out of them. Tada! Now, all that's left is to write the links to a file, which is straightforward.
with open(question_urls_file, "w") as f:
    for url in question_urls:
        f.write(url + "\n")
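I won't walk through the whole script for downloading the question pages themselves, since it follows the same pattern. A rough sketch, continuing in the same script (the question_pages folder name and the file naming scheme are my own choices):

import os  # In addition to the imports at the top of get_question_urls.py

pages_dir = "question_pages"
os.makedirs(pages_dir, exist_ok=True)

with open(question_urls_file) as f:
    saved_urls = [line.strip() for line in f if line.strip()]

for i, href in enumerate(saved_urls):
    try:
        time.sleep(random.uniform(5, 30))  # Same politeness delay as before
        # The scraped hrefs may be relative, so resolve them against the base URL
        response = requests.get(urljoin(base_url, href))
        if response.status_code == 200:
            # Save the raw page so we never have to download it twice
            page_path = os.path.join(pages_dir, f"question_{i}.html")
            with open(page_path, "w", encoding="utf-8") as page_file:
                page_file.write(response.text)
        else:
            print(f"Error fetching {href}: Server returned {response.status_code}")
    except Exception as e:
        print(f"An error occurred: {e}")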
Part 5: Extract the data
For me, this was the difficult part. As I mentioned earlier, the HTML of each page turned out to be invalid. Specifically, the paragraph tags had heading tags and a bunch of other block content that shouldn't have been in there.
Reading the docs for Beautiful Soup after the fact, I realized that some of my malformed HTML problems might have been solved by using a different parser. By saying BeautifulSoup(myhtml, "html.parser"), I was using Python's built-in HTML parser, which the docs say is the least awesome of the available parsers. They say lxml is the best. Live and learn, I guess. I'll try lxml next time.
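I haven't tried it myself yet, but going by the docs, switching parsers looks like a one-line change (assuming lxml has been installed with pip install lxml):

# Untested by me: the docs say lxml copes better with broken markup
soup = BeautifulSoup(myhtml, "lxml")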
Part 5.1: Clean up the data
The HTML for the question pages was messy. I did a few things to clean it up. First, I deleted all the script tags. They took up a lot of space and made everything harder to read. Then, I deleted or unwrap()ed a few tags (kept just their content).
for script in soup.find_all("script"):
    script.decompose()

for span in soup.find_all("span"):
    span.unwrap()

for strong in soup.find_all("strong"):
    strong.unwrap()

for p in soup.find_all("p"):
    move_nested_tag_up(p)
    clean_para(p)
The last "for" uses two functions I wrote to clean up the HTML. I had nested p tags, and I wanted to move them out of each other.
from bs4 import Tag  # Needed for the isinstance checks below


def move_nested_tag_up(tag):
    # I added this after doing testing because I was too lazy to actually delete
    # all the print statements I added.
    def print(*args, **kwargs):
        return

    # Items in a tag might not be a tag, so make sure we only look at tags
    if not isinstance(tag, Tag) or not tag.name:
        raise ValueError("Must be a Tag with a name, not NavigableString, etc.")

    parent = tag.parent
    if parent is None:
        return  # No parent to move up from

    # If a tag is the same as its parent
    if tag.name == parent.name:
        tags_to_move = []
        reached_start = False
        # Find every tag from that tag to the end
        for child in parent.children:
            if isinstance(child, Tag):
                print("reached_start", reached_start)
                print("Considering", child)
                if tag == child:
                    reached_start = True
                if reached_start:
                    print("adding", child)
                    tags_to_move.append(child)
        print(tags_to_move)
        # Now add the tags that were found to after the parent,
        # i.e. mom/dad become sister/brother
        for tag in reversed(tags_to_move):
            print("Inserting", tag)
            parent.insert_after(tag)
The second function cleans up paragraphs by finding tags that shouldn't be in them and doing the same as the above function. I did this by looking at the HTML spec for the list of acceptable tags in a paragraph. Those tags are called "phrasing content" tags.
def clean_para(para):
    if not isinstance(para, Tag) or not para.name:
        raise ValueError("Error: tag must be Tag, not NavigableString, etc")

    phrasing_content_tags = {
        "a", "abbr", "area", "audio",
        "b", "bdi", "bdo", "br",
        "button", "canvas", "cite", "code",
        "data", "datalist", "del", "dfn",
        "em", "embed", "i", "iframe",
        "img", "input", "ins", "kbd",
        "label", "link", "map", "mark",
        "math", "meta", "meter", "noscript",
        "object", "output", "picture", "progress",
        "q", "ruby", "s", "samp",
        "script", "select", "slot", "small",
        "span", "strong", "sub", "sup",
        "svg", "template", "textarea", "time",
        "u", "var", "video", "wbr"
    }

    tags_to_move_up = []
    for child in para.children:
        if isinstance(child, Tag):
            if child.name not in phrasing_content_tags:
                child.name = "p"
                tags_to_move_up.append(child)

    for tag in tags_to_move_up:
        para.insert_after(tag)
These functions are crude; even for some simple contexts, they wouldn't preserve the order of the content. But I tested them on a few test cases I extracted from my set of web pages, and everything seemed to work all right.
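For illustration, here's the sort of quick check I mean. The HTML fragment below is a made-up stand-in for my real test cases, not one of the actual pages:

from bs4 import BeautifulSoup

test_html = "<p>Intro text<h3>A heading stuck inside a paragraph</h3><p>A nested paragraph</p></p>"
soup = BeautifulSoup(test_html, "html.parser")

for p in soup.find_all("p"):
    move_nested_tag_up(p)
    clean_para(p)

print(soup.prettify())  # Eyeball the result: the heading and nested p are moved out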
I stuck all the code examples in the previous section into a loop that cleaned up each file, overwriting it in place (using the technique shown in this Stack Overflow answer).
with open(file_path, "r+") as file:
    soup = BeautifulSoup(file.read(), "html.parser")
    # Do stuff: clean_para, move_nested_tag_up, etc.
    output_html = soup.prettify()
    file.seek(0)
    file.write(output_html)
    file.truncate()
Part 5.2: Actually extract the data
To get the question type from the heading at the top of each question, I looked for the first h3 tag. Headings look like this:
IELTS Academic Writing Task 1/ Graph Writing - Diagram/ Process Diagram:
Since the types were conveniently separated by slashes, I could have just taken everything from the second part to the end. But I wasn't sure that pattern would hold over all of the hundreds of questions, so I used the following method instead.
- Split the heading into a list by parts separated by slashes.
- Get the list index of the first item containing the string "Graph" using enumerate(). Enumerate gives each item in a list a number. I learned about it at Real Python.
- Starting from that index till the end, put the parts of the heading back together, keeping the slashes as separators. While we're at it, remove the extraneous colon at the end.
heading_parts = [p.strip() for p in heading.text.split("/")]
question_type_index = [
    index for index, val in enumerate(heading_parts) if "Graph" in val
][0]
question_type = "/".join(heading_parts[question_type_index:]).replace(":", "")
With the question type in hand, up next was the question itself. Using Beautiful Soup's next_siblings, I started at the heading and collected all the tags between it and the image, which I knew would be at the end of the question. Lastly, I updated the src of the images in each question to reflect their new home in a sub-folder by the question CSV. And voilà! My data was ready!
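Roughly, that step looked like the sketch below. It's simplified: heading is the h3 found earlier, the images/ folder name is my own choice, and the image files themselves get downloaded separately with requests using the same pattern as before.

import os  # For os.path.basename when rewriting image paths

# Collect everything between the heading and (including) the tag that holds the image
question_tags = []
for sibling in heading.next_siblings:
    if not isinstance(sibling, Tag):
        continue
    question_tags.append(sibling)
    if sibling.name == "img" or sibling.find("img") is not None:
        break  # The image marks the end of the question

# Point each img src at the images/ sub-folder next to the CSV
for tag in question_tags:
    imgs = [tag] if tag.name == "img" else tag.find_all("img")
    for img in imgs:
        img["src"] = "images/" + os.path.basename(img["src"])

# Minify the HTML so the whole question fits on one CSV line
question_html = " ".join("".join(str(t) for t in question_tags).split())

Here's a sample of the finished CSV: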
Type,Question
Bar Chart,"<p>You should spend about 20 minutes...<p>The graph below shows the ...
Bar Chart,"<p>You should spend about 20 minutes...<p>The chart below shows the ...
Bar Chart,"<p>» You should spend about 20 minutes...<p>The average prices per k...
Bar Chart + Pie Chart,"<p>» You should spend about 20 minutes...<p>The bar char...
Bar Chart + Table + Line Graph,"<p>» You should spend about 20 minutes...<p>The...
Useful snippets
I already mentioned the request/response pattern in Part 4. Here are a few others that I found useful.
Wait a random time before doing something:
import time
import random

for url in urls:
    wait_time = random.uniform(5, 10)
    time.sleep(wait_time)
Overwrite a file in place, by using open(myfile, "r+"), followed by file.seek(0), file.write(stuff), and file.truncate().

Loop through the next or previous siblings of a tag in Beautiful Soup:
for tag in my_div.next_siblings:
    if tag.parent:
        tag.parent.insert_after(tag)  # Move the tags up next to their parent
As a result of my work here, I've got a library of several hundred IELTS Type 1 Tasks available in a convenient, searchable format for my IELTS classes next semester. But boy, is web scraping a lot of work. If I wasn't looking at getting so much data, I would probably just copy them manually or spend time browsing for suitable questions, instead. But in this case, the time investment was worth it. I collected a lot of useful data, and I improved my programming skills while I was at it. That's a win-win!
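As a final illustration of what "searchable" buys me, here's a quick sketch of pulling all the bar chart questions back out of the CSV when planning a lesson (the file name is the same placeholder as earlier):

import csv

with open("questions.csv", newline="", encoding="utf-8") as f:
    bar_chart_questions = [
        row["Question"] for row in csv.DictReader(f) if "Bar Chart" in row["Type"]
    ]

print(f"Found {len(bar_chart_questions)} bar chart questions")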