BeautifulSoup#

Data collection can be achieved by writing ad-hoc scripts that employ BeautifulSoup, a Python library that works on top of different parsers and related libraries for “pulling data out of HTML and XML files” [Richardson, 2023].
The scripts included here (and in the book) are simplified versions of the ones included in [Bondi and Di Cristofaro, 2023].
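As a minimal, self-contained sketch of what “pulling data out of HTML” looks like in practice (the HTML fragment below is invented for demonstration and is not part of the book's scripts), BeautifulSoup can parse a string of markup and extract the text of an element or the value of one of its attributes:

# A minimal sketch: parse a short, invented HTML fragment and extract a tag's text and an attribute value
from bs4 import BeautifulSoup

html = '<html><body><p class="title">An example thesis</p><a href="/files/thesis.pdf">Download</a></body></html>'
soup = BeautifulSoup(html, "lxml")
# Print the text contained in the <p> element tag
print(soup.find("p").get_text())
# Print the value of the 'href' attribute of the <a> element tag
print(soup.find("a")["href"])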

Extracting the data#


Download HTML pages (CATLISM, 164-166)#

Script [s5.02a] #
# Import modules for: reading/writing CSV files; using regular expressions; working with SSL certificates;
# listing files matching a wildcard pattern; pausing the script; using BeautifulSoup
import csv
import re
import ssl
from glob import glob
from time import sleep
from bs4 import BeautifulSoup

# The variable and the import below allow the crawler to browse https pages when invalid certificates are used in the
# website to be scraped. Adapted from:
# https://stackoverflow.com/questions/50236117/scraping-ssl-certificate-verify-failed-error-for-http-en-wikipedia-org
ssl._create_default_https_context = ssl._create_unverified_context
from selenium import webdriver

# Define the options to pass to the Firefox webdriver (i.e. the Selenium mechanism that will control Firefox through the
# instructions imparted by this script)
options = webdriver.FirefoxOptions()
# Run Firefox in 'headless' mode, i.e. without opening a visible browser window
options.add_argument("--headless")
# Create the Firefox webdriver
driver = webdriver.Firefox(options=options)

# Find all the filenames ending with the string 'links.csv', preceded by any character(s)
files = glob("*links.csv")

# Define a set of browser-like HTTP headers (note: this dictionary is not used by the Selenium-driven Firefox in this
# script; it may be passed to 'requests' when using the variant shown in script [s5.02b])
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
}

# For each file found, do:
for file in files:
    # Open and read the file as CSV
    reader_csv = csv.reader(open(file, "r"))
    # Skip the first line of the CSV containing the header
    next(reader_csv, None)
    # Create a list containing all the rows of the CSV
    rows = [r for r in reader_csv]
    # For each row (i.e. each link) in the list do:
    for row in rows:
        # Search for the string corresponding to the URN in the link
        urn_search = re.search(r".*?available/(.*?)/", row[0])
        # Extract the URN and store it in the variable 'urn'
        urn = urn_search.group(1)
        # Use the webdriver to read the page indicated by the link
        driver.get(row[0])
        # Read the source code of the page using BeautifulSoup, and store it in the variable 'soup'
        soup = BeautifulSoup(driver.page_source, "lxml")
        # Open the output file inside the subfolder 'downloaded', using the URN as its filename, followed by '.html'
        with open("downloaded/" + urn + ".html", "a") as file_output:
            # Write the source code of the page into the output file
            file_output.write(str(soup))
        # Wait 4 seconds before restarting the loop
        sleep(4)

Use requests in script [s5.02a] (CATLISM, 166)#

Script [s5.02b] #
# Note: these lines replace the Selenium-based download (driver.get() and driver.page_source) inside the loop of
# script [s5.02a], and require 'import requests' to be added to its imports
# Use 'requests' to download the page from the URL appearing in column 1 of the row
r = requests.get(row[0])
# Read the contents of the HTML page and store them inside of the variable 'soup'
soup = BeautifulSoup(r.text, "lxml")
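As a self-contained illustration of the same requests-based step (not part of the book's scripts; the URL and output filename below are placeholders, and the 'downloaded' subfolder is assumed to exist):

# Self-contained sketch of the requests-based download step; the URL and the output filename are placeholders
import requests
from bs4 import BeautifulSoup

url = "https://morethesis.unimore.it/theses/available/etd-NNNNNNNN-NNNNNN/"  # placeholder URL
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")
# Save the downloaded page source inside the (pre-existing) 'downloaded' subfolder
with open("downloaded/etd-NNNNNNNN-NNNNNN.html", "w", encoding="utf-8") as file_output:
    file_output.write(str(soup))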

Extract metadata from the downloaded HTML pages (CATLISM, 166; 168-171)#

Script [s5.03] #
# Import modules for: checking whether a file exists; listing files matching a wildcard pattern; reading/writing CSV files;
# using BeautifulSoup
import os
import glob
import csv
from bs4 import BeautifulSoup

# Set the filename of the CSV file where metadata will be/is stored
metadata_file = "metadata_all.csv"

# Check if the file already exists; if it does, then:
if os.path.isfile(metadata_file):
    # Open the file in 'appending' mode ('a') - so that every time new content is written to it, it is added to the end of
    # the file - and initiate a 'writer' to write the contents in CSV format, using the tab character as delimiter
    metadata_writer = csv.writer(
        open(metadata_file, "a", encoding="utf-8"), delimiter="\t"
    )
# If the file does not exist:
else:
    # Create the file in 'appending' mode ('a') - so that every time new content is written to it, it is added to the end of
    # the file - and initiate a 'writer' to write the contents in CSV format, using the tab character as delimiter
    metadata_writer = csv.writer(
        open(metadata_file, "a", encoding="utf-8"), delimiter="\t"
    )
    # Write as first row the names of the columns
    metadata_writer.writerow(
        [
            "doc_id",
            "tipo_tesi",
            "autore",
            "urn",
            "titolo_it",
            "titolo_en",
            "struttura",
            "corso_di_studi",
            "keywords",
            "data",
            "disponibilità",
            "abstract",
        ]
    )

# Create a sorted list of all the filenames with '.html' extension contained in the subfolder 'downloaded'
# (note: with this pattern 'recursive=True' has no effect, as no '**' wildcard is used)
files = sorted(glob.glob("./downloaded/*.html", recursive=True))

# For each filename found do:
for file in files:
    # Open the file
    f = open(file, encoding="utf8")
    # Remove the '.html' extension from the filename
    filename = file.replace(".html", "")
    # Read the contents of the file with BeautifulSoup and store them inside of the variable 'soup'
    soup = BeautifulSoup(f, "lxml")
    # Find the <table> element tag and assign its contents to the variable 'table'
    table = soup.find("table")
    # Inside <table>, find the <tbody> element tag and store its contents inside the variable 'tbody'
    tbody = table.find("tbody")
    # Find metadata elements by searching for the <th> element tag containing the relevant label (indicated by 'text="LABEL"'),
    # and extract the text from the next adjacent <td> element tag (where the metadata value is stored)
    tipotesi = tbody.find("th", text="Tipo di tesi").find_next("td").text.strip()
    autore = tbody.find("th", text="Autore").find_next("td").text.strip()
    urn = tbody.find("th", text="URN").find_next("td").text.strip()
    titolo_it = tbody.find("th", text="Titolo").find_next("td").text.strip()
    titolo_en = tbody.find("th", text="Titolo in inglese").find_next("td").text.strip()
    corso_di_studi = (
        tbody.find("th", text="Corso di studi").find_next("td").text.strip()
    )
    keywords = tbody.find("th", text="Parole chiave").find_next("td").text.strip()
    data = tbody.find("th", text="Data inizio appello").find_next("td").text.strip()
    disponibilita = tbody.find("th", text="Disponibilità").find_next("td").text.strip()
    abstract = tbody.find_all("td", {"colspan": "2"})[1].text

    # The following verification is required as some catalogue cards contain a field named 'Settore scientifico disciplinare'
    # (Disciplinary scientific area), while others have 'Struttura' (Facility) instead. Either way, the resulting value is stored
    # inside of a metadata attribute labelled 'struttura'.
    # Check if a <th> tag with value 'Settore scientifico disciplinare' exists; if so:
    if tbody.find("th", text="Settore scientifico disciplinare") is not None:
        # Extract the value and save it to the variable 'struttura'
        struttura = (
            tbody.find("th", text="Settore scientifico disciplinare")
            .find_next("td")
            .text.strip()
        )
    # If it does not exist, extract the value from the <th> element tag with value 'Struttura'
    else:
        struttura = tbody.find("th", text="Struttura").find_next("td").text.strip()

    # Save all the extracted metadata elements to a list called 'metadata_line', in the order they are to be written in the output CSV
    metadata_line = [
        filename,
        tipotesi,
        autore,
        urn,
        titolo_it,
        titolo_en,
        struttura,
        corso_di_studi,
        keywords,
        data,
        disponibilita,
        abstract,
    ]
    # Write the values stored in 'metadata_line' as one row in the CSV file
    metadata_writer.writerow(metadata_line)

Download PDF files linked in HTML pages (CATLISM, 174-175)#

Script [s5.04] #
# Import modules for: using regular expressions; downloading data from URLs; listing files matching a wildcard pattern;
# pausing the script for a number of seconds before proceeding; using BeautifulSoup
import re
import urllib.request
from glob import glob
from time import sleep
from bs4 import BeautifulSoup

# Create a list of all the filenames with '.html' extension contained in the subfolder 'downloaded'
files = glob("./downloaded/*.html")

# For each filename found do:
for file in files:
    # Open the file
    f = open(file, "r", encoding="utf-8")
    # Read the contents of the file with BeautifulSoup and store them inside of the variable 'soup'
    soup = BeautifulSoup(f, "lxml")
    # Check if at least one <a> element tag with the string 'pdf' in its link is present in the HTML code
    # (i.e. if the page contains at least one link to a PDF file); if so do:
    if soup.find_all("a", {"href": re.compile("pdf")}):
        # For each link found, initiate a counter to preserve the order in which the files appear in the
        # catalogue card, starting from 0 (the file appearing at the top), and do:
        for counter, link in enumerate(soup.find_all("a", {"href": re.compile("pdf")})):
            # Construct the download link by prepending the website address to the relative link found in the 'href' attribute
            file_link = "https://morethesis.unimore.it" + (link["href"])
            # Extract the URN from 'file_link', and assign it to the variable 'urn_code'
            urn_code = re.search("(etd.*?)/", file_link).group(1)
            # Extract the original filename from 'link'
            filename = link.get_text()
            # Download the PDF file(s) to the sub-folder 'pdfs' (it must be created manually if it does not already exist),
            # assigning each file a name according to the structure URN_PROGRESSIVE-NUMBER_ORIGINAL-FILENAME.pdf
            urllib.request.urlretrieve(
                file_link, "pdfs/" + urn_code + "_" + str(counter) + "_" + filename
            )
            # Wait 4 seconds before downloading any other file
            sleep(4)
    # If no <a> element tag with the string 'pdf' is found, move to the next catalogue card
    else:
        continue

Extract the contents of PDF files as plain text (CATLISM, 176)#

Script [s5.05] #
# Import modules for: listing files matching a wildcard pattern; using 'textract' functionalities
from glob import glob
import textract

# List all filenames with the .pdf extension
files = glob("*.pdf")

# For each filename in the list, do:
for file in files:
    # Remove the '.pdf' extension and save the resulting filename to the variable 'filename'
    filename = file.replace(".pdf", "")
    # Open and process the file through 'textract', using UTF-8 as output encoding
    doc = textract.process(file, output_encoding="utf-8")
    # Create and open the output file, and write the extracted contents as raw bytes ('wb')
    with open(filename + ".txt", "wb") as file_output:
        file_output.write(doc)

Create an XML corpus combining the metadata from HTML pages and the contents of PDF files (CATLISM, 177-180)#

Script [s5.06] #
# Import modules for: listing files matching a wildcard pattern; using regular expressions; using dataframes;
# reading/writing XML files
import glob
import re
import pandas as pd
from lxml import etree

# Create a function to remove illegal XML characters. These are control characters identified by code points included in the
# ranges defined in the 'return' output. In XML 1.0 the only control characters allowed are tab, line feed, and carriage return (often
# interpreted as whitespaces or line-breaks), represented by Unicode code points U+0009, U+000A, U+000D (written in hexadecimal
# format in the function, e.g. 0x9 for U+0009). Function adapted from:
# https://github.com/faizan170/resume-job-match-nlp/blob/573484a9b180950ddd373615e2f09ae163d7b0ae/main.py
def remove_control_characters(c):
    # Read the Unicode code point of the character, and store it into the 'codepoint' variable
    codepoint = ord(c)
    # Return True if the character is an allowed XML one, and False otherwise
    return (
        0x20 <= codepoint <= 0xD7FF or
        codepoint in (0x9, 0xA, 0xD) or
        0xE000 <= codepoint <= 0xFFFD or
        0x10000 <= codepoint <= 0x10FFFF
    )

# Create an empty list to store the found filenames
list_of_filenames = []
# Compile a regular expression matching the filenames that contain the string '_0' - indicating that they are the first
# file for each single URN
urnRegex = re.compile(
    "etd-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]_0*.txt"
)
# List all filenames matching the URN pattern plus the '_0' indicating that they are the first file
# for each single URN
files = sorted(
    glob.glob(
        "etd-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]_0*.txt"
    )
)
# Add the found filenames to the list 'list_of_filenames'
list_of_filenames.append(files)


# Create a metadata database (mdb) using the metadata CSV file; set the urn as index, and remove duplicates.
# This is needed since the same thesis can appear more than once if it is catalogued under different categories on MoreThesis.
mdb = pd.read_csv("metadata_all.csv", sep="\t", encoding="utf-8")
mdb = mdb.set_index("urn")
mdb = mdb.groupby(mdb.index).first()

# For each filename in the found ones:
for file in files:
    # Extract the URN from the filename
    urn = re.search("(etd-[0-9]{8}-[0-9]{6})_[0-9]{1,2}.*", file).group(1)
    # Create the output filename by appending '.xml' to the URN
    output_file = urn + ".xml"
    # Create the root tag element <doc> to include all the generated XML contents
    doc = etree.Element("doc")
    # Assign a number of attributes to <doc>, extracting their values from 'mdb' using the URN as key to find them - except
    # for the URN itself
    doc.attrib["urn"] = urn
    doc.attrib["type"] = mdb.loc[urn, "tipo_tesi"]
    # Remove the comma between the author's surname and name
    doc.attrib["author"] = re.sub(",", "", mdb.loc[urn, "autore"])
    doc.attrib["title"] = mdb.loc[urn, "titolo_it"]
    # As not all the theses have an English title, check whether one is present, and assign the value 'na' when it is not
    titolo_en = mdb.loc[urn, "titolo_en"]
    doc.attrib["title_en"] = titolo_en if pd.notna(titolo_en) else "na"
    doc.attrib["department"] = mdb.loc[urn, "struttura"]
    doc.attrib["degree"] = mdb.loc[urn, "corso_di_studi"]
    # Extract the date (in the format YYYY-MM-DD) and capture each part into a group
    date = re.search("([0-9]{4})-([0-9]{2})-([0-9]{2})", mdb.loc[urn, "data"])
    doc.attrib["date_y"] = date.group(1)
    doc.attrib["date_m"] = date.group(2)
    doc.attrib["date_d"] = date.group(3)

    # Create an empty list to contain the cleaned contents of the thesis
    all_thesis_texts = []

    # For each .txt file containing the processed URN, do:
    for f in sorted(glob.glob(urn + "*.txt")):
        # Open the file and read its contents
        one_file = open(f, "r", encoding="utf8").read()
        # Using the function 'remove_control_characters', clean the contents from characters that are illegal in XML, and store the
        # resulting cleaned text in the variable 'cleaned_text'
        cleaned_text = ''.join(c for c in one_file if remove_control_characters(c))
        # Add 'cleaned_text' to the list 'all_thesis_texts'
        all_thesis_texts.append(cleaned_text)

    # Assign all the texts in 'all_thesis_texts' as text of the <doc> element tag
    doc.text = " ".join(all_thesis_texts)
    # Build the XML structure with all the elements collected so far
    tree = etree.ElementTree(doc)
    # Write the resulting XML structure to the output file, using utf-8 encoding, adding the XML declaration
    # at the start of the file and graphically formatting the layout ('pretty_print')
    tree.write(output_file, pretty_print=True, xml_declaration=True, encoding="utf-8")

Basic structure of the metadata table included in MoreThesis pages (CATLISM, 171-173)#

Example [e5.08]#
<table border="3" cellpadding="5" cellspacing="5" class="metadata_table">
    <tbody>
        <tr>
            <th width="30%">Tipo di tesi</th>
            <td width="70%">TYPE OF THESIS</td>
        </tr>
        <tr>
            <th>Autore</th>
            <td>SURNAME, NAME</td>
        </tr>
        [...]
        <tr>
            <th>Commissione</th>
            <td>
                <table>
                    <tbody>
                        <tr>
                            <th align="left">Nome Commissario</th>
                            <th align="left">Qualifica</th>
                        </tr>
                        <tr>
                            <td align="left">SURNAME NAME</td>
                            <td align="left">Primo relatore</td>
                        </tr>
                        <tr>
                            <td align="left">SURNAME NAME</td>
                            <td align="left">Coordinatore Dott Ric</td>
                        </tr>
                    </tbody>
                </table>
            </td>
        </tr>
        <tr>
            <th>Parole chiave</th>
            <td>
                <ul>
                    <li>KEYWORD1
                    </li>
                    <li>KEYWORD2
                    </li>
                    <li>KEYWORD3
                    </li>
                    <li>KEYWORD4
                    </li>
                    <li>KEYWORD5
                    </li>
                </ul>
            </td>
        </tr>
        [...]
        <th>File</th>
        <td>
            <table border="2" cellpadding="3" cellspacing="3">
                <tbody>
                    [...]
                    <tr align="center">
                        <td> </td>
                        <td align="left"><a
                                href="/theses/available/etd-NNNNNNNN-NNNNNN/unrestricted/FILENAME.pdf"><b>FILENAME.pdf</b></a>
                        </td>
                        <td>5.93 Mb</td>
                        <td bgcolor="#cccccc">00:27:28</td>
                        <td>00:14:07</td>
                        <td bgcolor="#cccccc">00:12:21</td>
                        <td>00:06:10</td>
                        <td bgcolor="#cccccc">00:00:31</td>
                    </tr>
                    [...]
                </tbody>
            </table>
        </td>
    </tbody>
</table>
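To illustrate how the lookups in script [s5.03] operate on a table of this kind, the short sketch below (not part of the book's scripts) applies the same find("th", text=...) / find_next("td") pattern to a stripped-down, invented version of the metadata table:

# Sketch: apply the label-based extraction pattern of [s5.03] to a simplified, invented metadata table
from bs4 import BeautifulSoup

html = """
<table class="metadata_table">
    <tbody>
        <tr><th>Tipo di tesi</th><td>TYPE OF THESIS</td></tr>
        <tr><th>Autore</th><td>SURNAME, NAME</td></tr>
        <tr><th>Parole chiave</th><td><ul><li>KEYWORD1</li><li>KEYWORD2</li></ul></td></tr>
    </tbody>
</table>
"""
tbody = BeautifulSoup(html, "lxml").find("table").find("tbody")
# Extract the value stored in the <td> element tag adjacent to each label
tipotesi = tbody.find("th", text="Tipo di tesi").find_next("td").text.strip()
autore = tbody.find("th", text="Autore").find_next("td").text.strip()
# Collect the keywords from the <li> element tags of the 'Parole chiave' cell
keywords = [li.text.strip() for li in tbody.find("th", text="Parole chiave").find_next("td").find_all("li")]
print(tipotesi, autore, keywords)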