Scraping the Web for IELTS Questions

Aug 18, 2024, 11:25 PM @ 📑 The Office

Web scraping is the process of collecting information from web pages in an automated way. It’s a useful skill if you’re looking for, say, IELTS questions for your students to practice with. I used Python and Beautiful Soup to download some practice questions. In this blog post, I document my process and note some useful snippets that I always have to look up. There are five parts to this web scraping process: find the data, decide what to keep, explore the data’s structure, download it, and then extract it.

Part 1: Find the data

This was the easiest part of the process. I searched online for “IELTS Writing Part 1 Question Examples” and chose one of the first results. The first site I looked at had some questions on it, but the number of questions was limited, and looking at the structure (Part 3 below), I decided it would be too much work for too few questions to bother scraping it.

The next page I looked at had many more questions. After a cursory look, I decided this is what I’d go with.

Part 2: Decide what data to keep

Each question is on its own page. I noticed the page had a heading describing the type of chart in the writing question. I thought that might be useful for future planning, so there’s the first thing to keep. Next, there’s the question itself, which is just a blob of text followed by an image of a table or diagram. So, I’d have to get that text and download all the images, as well. Lastly, there was a model answer at the bottom of each page. Since I should be able to write my own model answers fairly easily if needed, I decided not to try and extract that. Just the question type and the question itself (image included) would be enough.

Thinking about storage, I decided it would be reasonable to store each heading and question in a CSV file. I would store the question as minified HTML so it all fits on one CSV line.
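
As a rough sketch of what I had in mind (the file and helper names here are placeholders), each row pairs the question type with the question's one-line HTML:

# save_question.py -- a minimal sketch of the CSV format, not the final code
import csv

def save_question(csv_path, question_type, question_html):
    # Collapse whitespace so the HTML sits on a single line; the csv module
    # handles quoting, so commas and quotes inside the HTML won't break the row
    minified = " ".join(question_html.split())
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([question_type, minified])

save_question("questions.csv", "Bar Chart",
              "<p>You should spend about 20 minutes on this task.</p>")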

Part 3: Explore the structure

Each question page is linked to from a list page. The list of questions is paginated, with 20 links per page. I clicked the “next page” button and looked at the URL bar in the browser. Each page’s URL ends with ?start= followed by a multiple of 20, all the way up to the last page at ?start=320. I decided I would first scrape these list pages for their links and save those links to a file. I show how I build these URLs in code in the next part.

Back on the question pages, hit Ctrl+Shift+C and click on a part of the page that has the info you want. Look at the HTML code and try to identify some structural characteristics that will help you extract that data later on. It’s best if there are tags with IDs, like <h3 id="my-interesting-section-heading">. IDs are easy to locate later on. Next best are classes, but many HTML tags can share the same classes. Worst is when tags neither stand out from their surroundings (a bare h3, for example) nor are unique. That worst case is fairly common, and it was what I ran into this time.
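
To make that concrete, here’s roughly how those three cases look in Beautiful Soup (the file name, ID, and class here are just for illustration):

from bs4 import BeautifulSoup

# "question.html" is a stand-in for one of the saved question pages
with open("question.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Best case: an ID, which should be unique on the page
heading = soup.find(id="my-interesting-section-heading")

# Next best: a class, which may match many tags
titles = soup.find_all("td", class_="list-title")

# Worst case: nothing distinctive, so you fall back on position,
# e.g. "the first h3 on the page"
first_h3 = soup.find("h3")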

As I later discovered, even worse was the fact that the HTML of the question pages was invalid. My browser was doing a lot of cleaning up in the background before it showed me the source. Beautiful Soup, however, wasn’t doing that same cleaning up. I would have to do some of that cleaning up myself.

Part 4: Download the data

Having scraped other websites in the past, I’ve learned that it’s sometimes best to work in stages. First, download and save the pages to your computer. That way, if something goes wrong in your code as you process a page, you don’t have to download it again. My strategy here is to first download the list pages to extract the links to all the question pages. Then, I download each question page and save it to disk.

Below is the preliminary code to get ready to download the list pages.

# get_question_urls.py
from urllib.parse import urljoin # Helps make URLs from variables
from bs4 import BeautifulSoup    # Tool for extracting data from web pages
import requests                  # Tool for downloading web pages
import time                      # We will wait between getting each page
import random                    # ... a random number of seconds

# The common part of every list page's URL
base_url = "https://some-IELTS-site.com/questions"
question_urls_file = "urls.txt"

# Generate the URLs for all the list pages
starts = range(20, 321, 20)  # from 20 to 320 (inclusive), in jumps of 20
urls = [base_url] # Start with just the base URL in the list
# Then add the rest of the URLs
urls.extend([urljoin(base_url, f"?start={start}") for start in starts])

There we go. We’ve got the list of URLs all prepared, and we are ready to start scraping. For the scraping, we’ll need to use requests, which we imported just now. It’s not hard to use, but I always forget how to do it.

✏

Getting a web page follows this pattern:

import requests

response = requests.get(url)
if response.status_code == 200:
    do_something()
else:
    maybe_handle_other_status_codes()
    or_just_fail_with_a_message()

It would be best to use try/except to catch exceptions and prevent your program from exiting without cleaning up. Cleaning up, in our case, would mean writing out whatever URLs or page data we’ve already downloaded. See the requests documentation for the exceptions that you can handle. The main situations would be a 404, a timeout, or a connection issue if your internet sucks.

The try/except also lets a for loop continue to the next URL if any single request runs into problems.
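
For example, something along these lines would tell those cases apart (a generic sketch, not the exact code I used; url stands in for whatever page you’re fetching):

import requests

url = "https://some-IELTS-site.com/questions"  # placeholder

try:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Raises HTTPError for 404s and other bad statuses
except requests.exceptions.Timeout:
    print(f"Timed out while fetching {url}")
except requests.exceptions.ConnectionError:
    print(f"Connection problem while fetching {url}")
except requests.exceptions.HTTPError as e:
    print(f"Server returned an error for {url}: {e}")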

Anyway, here’s the process.

question_urls = []

for url in urls:
    try:
        # Be a good netizen. Don't just bombard somebody's server with
        # tons of requests all at once. Patience, Daniel-San!
        wait_time = random.uniform(5, 30)
        print(f"Waiting {wait_time} seconds...")
        time.sleep(wait_time)

        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            td_tags = soup.find_all("td", class_="list-title")
            for td in td_tags:
                href = td.find("a").attrs["href"]
                question_urls.append(href)
        else:
            print(f"Error fetching {url}: Server returned {response.status_code}")
    except Exception as e:
        print(f"An error occurred: {e}")

Beautiful Soup comes in handy here. soup = BeautifulSoup(response.content, "html.parser") and then soup.find_all() gets all the tags that I’m looking for. I noticed that all the links were in a table, and each td tag in the table had a class of list-title. That made it easy to pull those out. Then, I just looked at each of those td tags and pulled the links out of them. Tada! Now, all that’s left is to write the links to a file, which is straightforward.

with open(question_urls_file, "w") as f:
    for url in question_urls:
        f.write(url + "\n")
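
That covers the list pages. The second stage, downloading each question page and saving it to disk, follows the same pattern. Here’s a sketch (the pages/ folder and file names are just my placeholders):

# get_question_pages.py -- sketch of the second stage: download each question
# page once and save it to disk so later processing never re-downloads it
import os
import time
import random
import requests

question_urls_file = "urls.txt"
os.makedirs("pages", exist_ok=True)

with open(question_urls_file) as f:
    question_urls = [line.strip() for line in f if line.strip()]

for i, url in enumerate(question_urls):
    try:
        time.sleep(random.uniform(5, 30))  # Again: be a good netizen
        response = requests.get(url)
        if response.status_code == 200:
            with open(f"pages/question_{i:03d}.html", "w", encoding="utf-8") as page:
                page.write(response.text)
        else:
            print(f"Error fetching {url}: Server returned {response.status_code}")
    except Exception as e:
        print(f"An error occurred while fetching {url}: {e}")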

Part 5: Extract the data

For me, this was the difficult part. As I mentioned earlier, the HTML of each page turned out to be invalid. Specifically, the paragraph tags had heading tags and a bunch of other block content that shouldn’t have been in there.

🤔

Reading the docs for Beautiful Soup after the fact, I realized that some of my malformed HTML problems might have been solved by using a different parser. By saying BeautifulSoup(myhtml, "html.parser"), I was using Python’s built-in HTML parser, which the docs say is the least awesome of the available parsers. They say lxml is the best. Live and learn, I guess. I’ll try lxml next time.
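
For the record, the change is just the second argument to BeautifulSoup (with myhtml standing in for the page’s HTML string), plus a pip install lxml:

from bs4 import BeautifulSoup

# What I used: Python's built-in parser, no extra dependency
soup = BeautifulSoup(myhtml, "html.parser")

# What I'll try next time: lxml (pip install lxml), which is faster and,
# as I understand it, repairs more of the broken markup while parsing
soup = BeautifulSoup(myhtml, "lxml")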

Part 5.1: Clean up the data

The HTML for the question pages was messy, so I did a few things to clean it up. First, I deleted all the script tags. They took up a lot of space and made everything harder to read. Then, I unwrap()ed a few tags (keeping just their content) and cleaned up the paragraphs.

for script in soup.find_all("script"):
    script.decompose()
for span in soup.find_all("span"):
    span.unwrap()
for strong in soup.find_all("strong"):
    strong.unwrap()
for p in soup.find_all("p"):
    move_nested_tag_up(p)
    clean_para(p)

The last “for” uses two functions I wrote to clean up the HTML. I had nested p tags, and I wanted to move them out of each other.

from bs4 import Tag  # needed for the isinstance checks in these functions

def move_nested_tag_up(tag):
    # I added this after doing testing because I was too lazy to actually delete
    # all the print statements I added.
    def print(*args, **kwargs):
        return
    # Items in a tag might not be a tag, so make sure we only look at tags
    if not isinstance(tag, Tag) or not tag.name:
        raise ValueError("Must be a Tag with a name, not NavigableString, etc.")
    parent = tag.parent

    if parent is None:
        return  # No parent to move up from

    # If a tag is the same as its parent
    if tag.name == parent.name:
        tags_to_move = []
        reached_start = False
        # Find every tag from that tag to the end
        for child in parent.children:
            if isinstance(child, Tag):
                print("reached_start", reached_start)
                print("Considering", child)
                if tag == child:
                    reached_start = True
                if reached_start:
                    print("adding", child)
                    tags_to_move.append(child)
        print(tags_to_move)
        # Now add the tags that were found to after the parent,
        # i.e. mom/dad become sister/brother
        for tag in reversed(tags_to_move):
            print("Inserting", tag)
            parent.insert_after(tag)

The second function cleans up paragraphs by finding tags that shouldn’t be in them and doing the same as the above function. I did this by looking at the HTML spec for the list of acceptable tags in a paragraph. Those tags are called “phrasing content” tags.

def clean_para(para):
    if not isinstance(para, Tag) or not para.name:
        raise ValueError("Error: tag must be Tag, not NavigableString, etc")

    phrasing_content_tags = {
        "a", "abbr", "area", "audio",
        "b", "bdi", "bdo", "br",
        "button", "canvas", "cite", "code",
        "data", "datalist", "del", "dfn",
        "em", "embed", "i", "iframe",
        "img", "input", "ins", "kbd",
        "label", "link", "map", "mark",
        "math", "meta", "meter", "noscript",
        "object", "output", "picture", "progress",
        "q", "ruby", "s", "samp",
        "script", "select", "slot", "small",
        "span", "strong", "sub", "sup",
        "svg", "template", "textarea", "time",
        "u", "var", "video", "wbr"
    }
    tags_to_move_up = []
    for child in para.children:
        if isinstance(child, Tag):
            if child.name not in phrasing_content_tags:
                child.name = "p"
                tags_to_move_up.append(child)

    for tag in tags_to_move_up:
        para.insert_after(tag)

These functions are crude—even for some simple contexts, they wouldn’t preserve the order of the content. But I tested them on a few test cases I extracted from my set of web pages, and everything seemed to work all right.

I stuck all the code examples in the previous section into a loop that cleaned up each file, overwriting it in place (using the technique shown in this Stack Overflow answer).

with open(file_path, "r+") as file:
    # Read the file, do stuff, ...
    # ... clean_para, move_nested_tag_up, etc.
    output_html = soup.prettify()
    file.seek(0)
    file.write(output_html)
    file.truncate()
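
Putting it together, the whole cleanup loop looked roughly like this (a sketch; it assumes the pages were saved under pages/ and that move_nested_tag_up and clean_para from above are in scope):

import glob
from bs4 import BeautifulSoup

for file_path in glob.glob("pages/*.html"):
    with open(file_path, "r+", encoding="utf-8") as file:
        soup = BeautifulSoup(file.read(), "html.parser")

        # The cleanup steps from earlier
        for script in soup.find_all("script"):
            script.decompose()
        for span in soup.find_all("span"):
            span.unwrap()
        for strong in soup.find_all("strong"):
            strong.unwrap()
        for p in soup.find_all("p"):
            move_nested_tag_up(p)
            clean_para(p)

        # Overwrite the file in place
        file.seek(0)
        file.write(soup.prettify())
        file.truncate()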

Part 5.2: Actually extract the data

To get the question type from the heading at the top of each question, I looked for the first h3 tag. Headings look like this:

IELTS Academic Writing Task 1/ Graph Writing - Diagram/ Process Diagram:

Since the types were conveniently separated by slashes, I could just take everything from the second part to the end. But I wasn’t sure that format would hold across all of the hundreds of questions, so I used the following method instead.

  1. Split the heading into a list by parts separated by slashes.
  2. Get the list index of the first item containing the string “Graph” using enumerate(). Enumerate gives each item in a list a number. I learned about it at Real Python.
  3. Starting from that index to the end, put the parts of the heading back together, keeping the slashes as separators. While we’re at it, remove the extraneous colon at the end.

heading_parts = [p.strip() for p in heading.text.split("/")]
question_type_index = [
    index for index, val in enumerate(heading_parts) if "Graph" in val
][0]
question_type = "/".join(heading_parts[question_type_index:]).replace(":","")

With the question type in hand, up next was the question itself. Using Beautiful Soup’s next_siblings, I started at the heading and collected all the tags between it and the image, which I knew would be at the end of the question. Lastly, I updated the src of the images in each question to reflect their new home in a sub-folder next to the question CSV (there’s a sketch of this step after the sample output below). And voilà! My data was ready!

Type,Question
Bar Chart,"<p>You should spend about 20 minutes...<p>The graph below shows the ...
Bar Chart,"<p>You should spend about 20 minutes...<p>The chart below shows the ...
Bar Chart,"<p>» You should spend about 20 minutes...<p>The average prices per k...
Bar Chart + Pie Chart,"<p>» You should spend about 20 minutes...<p>The bar char...
Bar Chart + Table + Line Graph,"<p>» You should spend about 20 minutes...<p>The...
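
For reference, here’s a rough sketch of that sibling-collection step (the stopping condition and the images folder name are assumptions on my part; the real pages may need more care):

from bs4 import Tag

def collect_question(heading, image_dir="images"):
    # Walk the heading's following siblings, keeping everything up to and
    # including the tag that contains the question's image
    parts = []
    for sibling in heading.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # Skip stray strings between tags
        parts.append(sibling)
        img = sibling if sibling.name == "img" else sibling.find("img")
        if img:
            # Point the src at the sub-folder next to the CSV
            img["src"] = f"{image_dir}/{img['src'].split('/')[-1]}"
            break  # The image marks the end of the question
    # Minify so the whole question fits on one CSV line
    return " ".join(" ".join(str(part).split()) for part in parts)

The return value, together with the question type, is what ends up in each CSV row.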

Useful snippets

I already mentioned the request/response pattern in Part 4. Here are a few other ones that I found useful.

  • Wait a random time before doing something:

    import time
    import random
    
    for url in urls:
        wait_time = random.uniform(5,10)
        time.sleep(wait_time)
    
  • Overwrite a file in place, by using open(myfile, "r+"), followed by file.seek(0), file.write(stuff), and file.truncate().

  • Loop through the next or previous siblings of a tag in Beautiful Soup:

    for tag in my_div.next_siblings:
        if tag.parent:
            tag.parent.insert_after(tag) # Move the tags up next to their parent
    

As a result of my work here, I’ve got a library of several hundred IELTS Writing Task 1 questions available in a convenient, searchable format for my IELTS classes next semester. But boy, is web scraping a lot of work. If I hadn’t needed so much data, I would probably have just copied the questions manually or spent time browsing for suitable ones instead. But in this case, the time investment was worth it. I collected a lot of useful data, and I improved my programming skills while I was at it. That's a win-win!


Profile

Written by Randy Josleyn—Language learner, language teacher, music lover. Living in Beijing, Boise, and elsewhere