The communicative modus operandi of online child sexual groomers

The case study utilised in this section is based on research conducted at the Swansea University Department of Applied Linguistics under the supervision of Prof. Nuria Lorenzo-Dus as part of the project ‘Online Grooming Discourse’, funded by EPSRC–CHERISH-DE and NSPCC (Lead Investigator: Prof. Nuria Lorenzo-Dus). In 2021 the project evolved into Project Dragon-S (Developing Resistance Against Grooming Online – Spot and Shield; www.swansea.ac.uk/project-dragon-s/). Outputs from the research are published in [] and [], which served as the basis for the contents described here (CATLISM, 350).

Collecting the data from Perverted Justice (CATLISM, 353-357)

Script [s6.05] (adapted from [])

# Import modules for regular expressions and for working with local files; 'List' for type hints
# (only required for Python < 3.9); pandas to save the collected data; and selected components of Selenium
import re
import os
from typing import List
import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import NoSuchElementException

# Define the scraper as a class of objects
class ChatLogScraper(object):
    # Define the first function that sets the initial parameters: the starting URL, the regular expression to match the chatlog contents, as well as further options for Selenium
    def __init__(self):
        """Init function, defines some class variables."""
        self.home_url = "http://www.perverted-justice.com/?con=full"

        # Setup and compile the regular expression for later
        master_matcher = r"([\s\w\d]+)[:-]?\s(?:\(.*\s(\d+:\d+:\d+\s[AP]M)\))?:?((.*)(\s\d+:\d+\s[AP]M)|(.*))"
        self.chat_instance = re.compile(
            master_matcher, re.IGNORECASE
        )  # ignore case may not be necessary

        # Instantiate the Chrome driver in headless mode, using the 'chromedriver' executable stored in the same folder as this script
        here = os.path.dirname(os.path.realpath(__file__))
        executable = os.path.join(here, "chromedriver")
        # Set the headless option to run Chrome without a graphical interface
        options = Options()
        options.add_argument("--headless")
        # Selenium 4 syntax: the path to the driver is passed through a Service object
        self.driver = Chrome(service=Service(executable_path=executable), options=options)

    # Define the 'start' function that collects the links to all the chat log pages listed on the home page and returns them as a list of strings
    def start(self) -> List[str]:
        """Main function to be run: go to the home page and find the list of cases.

        :return: list of links to scrape
        """
        print("loading main page")
        self.driver.get(self.home_url)

        main_pane = self.driver.find_element(By.ID, "mainbox")
        all_cases = main_pane.find_elements(
            By.TAG_NAME, "li"
        )  # every case is under an LI tag
        # We'll load the href links into a list to get later
        links = []
        for case in all_cases:
            a_tags = case.find_elements(By.TAG_NAME, "a")
            # The first <a> tag is the link that we need
            links.append(a_tags[0].get_attribute("href"))
        return links

    # Define the 'scrape_page' function which, starting from the previously collected URLs, parses the content of each chatlog page and extracts the username, content (statement) and timestamp of each message
    def scrape_page(self, page_url: str) -> List[dict]:
        """Go to the page url, use the regular expression to extract the chat data, and store
        each message in a list of dictionaries to be returned once the page is complete.

        :param page_url: (str) the page to scrape
        :return: list of dictionaries, one for each chat message found on this page
        """
        self.driver.get(page_url)
        try:
            page_text = self.driver.find_element(By.CLASS_NAME, "chatLog").text
        except NoSuchElementException:
            print("could not get convo for", page_url)
            return []  # Some pages don't contain chats
        conversations = []

        # Next, we'll run the regex on the chat log and extract the info into a list of dictionaries
        matches = re.findall(self.chat_instance, page_text)
        for match in matches:
            # Clean up false positives
            if (
                "com Conversation" not in match[0]
                and "Text Messaging" not in match[0]
                and "Yahoo Instant" not in match[0]
            ):
                username = match[0]
                if match[4]:
                    statement = match[3]
                    time = match[4]
                else:
                    statement = match[5]
                    time = match[1]
                conversations.append(
                    {"username": username, "statement": statement, "time": time}
                )
        return conversations


# The functions above are executed and the collected data is saved to a CSV file and a JSON file
if __name__ == "__main__":
    chatlogscrapper = ChatLogScraper()
    conversations = []
    links = chatlogscrapper.start()

    try:
        for link in links:
            print("getting", link)
            conversations += chatlogscrapper.scrape_page(link)
    finally:
        # Whatever happens, close the browser and save the data collected so far
        chatlogscrapper.driver.quit()
        conversations = pd.DataFrame(conversations)
        conversations.to_csv("output.csv", index=False)
        conversations.to_json("output.json")
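
To see how the regular expression compiled in `__init__` splits a line into username, statement, and timestamp, the following minimal sketch applies the same pattern and the same branching logic used in `scrape_page` to two made-up lines. The sample lines and the helper name `parse_line` are assumptions introduced for illustration only; real chat logs may be laid out differently.

```python
import re

# Same pattern as 'master_matcher' in script s6.05
MASTER_MATCHER = re.compile(
    r"([\s\w\d]+)[:-]?\s(?:\(.*\s(\d+:\d+:\d+\s[AP]M)\))?:?((.*)(\s\d+:\d+\s[AP]M)|(.*))",
    re.IGNORECASE,
)


def parse_line(line):
    """Hypothetical helper: return the first message found in 'line' as a dictionary."""
    match = MASTER_MATCHER.findall(line)[0]
    # Timestamp at the end of the message: capture groups 4 and 5 (tuple indices 3 and 4) are filled
    if match[4]:
        statement, time = match[3], match[4]
    # Timestamp in parentheses after the username, or no timestamp at all
    else:
        statement, time = match[5], match[1]
    # Whitespace is stripped here only to make the result easier to read
    return {
        "username": match[0].strip(),
        "statement": statement.strip(),
        "time": time.strip(),
    }


# Invented sample lines in two plausible layouts
print(parse_line("johndoe (04/11/10 8:54:50 PM): hey there"))
print(parse_line("janedoe: hi 8:55 PM"))
```

Because `[\s\w\d]+` also matches line breaks, a username extracted from a multi-line page can carry leading whitespace, which is one reason for stripping usernames before further processing.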

Creating the final corpus (CATLISM, 360-366)

Script [s6.06] applies the steps described in sections ‘Emoticons’, ‘Duration and Turns’, and ‘Metadata’ to the collected CSV files and outputs an XML file for each of them using the structure exemplified in [e6.08] (CATLISM, 360).

Script [s6.06]

# Import (in order) the modules to: read/write CSV files; find files using wildcard patterns;
# work with regular expressions; access predefined sets of characters; generate random numbers; make random choices;
# create dictionaries ('defaultdict' automatically creates a default value for missing keys,
# in contrast to Python's default dictionary); read/write XML files
import csv
import glob
import re
import string
from random import randint
import random
from collections import defaultdict
from lxml import etree

# List all CSV files in the current folder
csvfiles = glob.glob("*.csv")

# Create two dictionaries: one ('user_types_dict') to store usernames mapped against their 'role' (g = groomer, d = decoy);
# the other ('timings_dict') to store usernames mapped against the total amount of time they interacted with one or more decoys
user_types_dict = defaultdict(list)
timings_dict = defaultdict(list)

# Open the metadata file (named .cs to avoid being read as a CSV chat log file) and read it as a csv file
metadata_file = csv.reader(
    open("metadata_file.cs", "r", encoding="utf-8"), delimiter="\t"
)
# For each row, do:
for row in metadata_file:
    # Read the username and its role and add the information to the dictionary 'user_types_dict'
    user_types_dict[row[0].lower()].append(row[1])
    # Read the total amount of time a groomer interacted with a decoy, and assign the value to the dictionary 'timings_dict'
    timings_dict[row[0].lower()].append(row[2])


# Create a function to add the type (g or d) to the username passed to the function during the data processing
def get_user_type(text):
    # Read the username, convert it to lowercase, and store it in the variable 'user'
    user = text.lower()
    # If 'user' is found in 'user_types_dict', extract its type label and store it in the variable 'usertype'
    if user in user_types_dict:
        usertype = str(user_types_dict[user][0])
    # Else if not found, assign the value 'na' to the variable 'usertype'
    else:
        usertype = "na"
    # Output the value of 'usertype'
    return usertype


# Create a function to add the total time of interaction to the corpus during the data processing, using the same rationale and
# operations employed in 'get_user_type'
def get_user_timing(text):
    username = text.lower()
    if username in timings_dict:
        timing = str(timings_dict[username])
    else:
        timing = "na"
    return timing


# Build the emoticons conversion steps; adapted from the emoticons.py function by Brendan O'Connor
# https://github.com/aritter/twitter_nlp/blob/65f3d77134c40d920db8d431c5c6faef1c051c94/python/emoticons.py
# Define the helper function that compiles the regular expressions used during the data processing to identify emoticons
regex_compile = lambda pat: re.compile(pat, re.UNICODE)
# Define the characters for eyes, nose, mouth to be used in the regular expressions
NormalEyes = r"[:=]"
Wink = r"[;]"
NoseArea = r"(|o|O|-)"
HappyMouths = r"[D\)\]]"
SadMouths = r"[\(\[]"
KissMouths = r"[\*]"
Tongue = r"[pP]"

# Construct the possible combinations into regular expressions
happysmiley_regex = (
    "("
    + NormalEyes
    + "|"
    + Wink
    + ")"
    + NoseArea
    + "("
    + HappyMouths
    + "|"
    + Tongue
    + ")"
)
sadsmiley_regex = "(" + NormalEyes + "|" + Wink + ")" + NoseArea + "(" + SadMouths + ")"
kisssmiley_regex = (
    "(" + NormalEyes + "|" + Wink + ")" + NoseArea + "(" + KissMouths + ")"
)
# Compile the regular expressions using the previously defined 'regex_compile'
happysmiley_compile = regex_compile(happysmiley_regex)
sadsmiley_compile = regex_compile(sadsmiley_regex)
kisssmiley_compile = regex_compile(kisssmiley_regex)

# Define the root <corpus> XML element tag of the output file
corpus = etree.Element("corpus")

# For each CSV chat log file, do:
for csvfile in csvfiles:
    # Create the <text> root element tag as child of the <corpus> root element
    text_tag = etree.SubElement(corpus, "text")

    # Create a function to generate a random string of N characters chosen among uppercase letters, lowercase letters,
    # and digits; the string is later combined with 'random_number' (defined further below) to build the ID
    def id_generator(N):
        return "".join(
            random.choices(
                string.ascii_uppercase + string.ascii_lowercase + string.digits, k=N
            )
        )

    # Generate a random number to be used for the creation of the unique <text> ID
    random_number = str(randint(0, 100000000))
    # Generate a random ID and assign it as value of the <text> attribute 'id'
    text_tag.attrib["id"] = str(id_generator(10) + random_number)

    # Create an empty list to store the usernames found in the chat log
    usernames_list = []

    # Open the chat log file and read it as a csv file
    input_csv = csv.reader(open(csvfile, "r", newline="", encoding="utf-8"))
    # Store the filename without extension inside the variable 'filename_without_csv'
    filename_without_csv = csvfile.replace(".csv", "")
    # Skip the first line of the CSV chat log file containing the columns header
    next(input_csv, None)
    # Iterate over each row and store them inside of the variable 'rows'
    rows = [r for r in input_csv]
    # For each row, count its position (starting from 1; this is equal to the turn number in the chat) and store it in
    # the variable 'line_number', then do:
    for line_number, row in enumerate(rows, start=1):
        # Create the <u> element tag for the chat turn (i.e. the chat message)
        turn_tag = etree.SubElement(text_tag, "u")
        # Assign the row position in the csv file as value of the <u> attribute 'turn'
        turn_tag.attrib["turn"] = str(line_number)
        # Read the username from the first column of the chat log file, clean it from any potential leading or trailing whitespace,
        # and assign it as value of the <u> attribute 'username'
        turn_tag.attrib["username"] = str(row[0]).strip()
        # Write the username (without any potential whitespace) to the list of usernames for this chat log
        usernames_list.append(str(row[0]).strip())
        # Read the timestamp from the third column of the chat log file and assign it as value of the <u> attribute 'time'
        turn_tag.attrib["time"] = str(row[2])
        # Read the date on which the message was sent from the fourth column of the chat log file, and assign it as value of
        # the <u> attribute 'date'
        turn_tag.attrib["date"] = str(row[3])
        # Using the 'get_user_type' function with the username as input, extract the type of user and assign it as value of the
        # <u> attribute 'usertype'
        turn_tag.attrib["usertype"] = get_user_type(str(row[0]).strip())
        # Read the chat message from the second column of the chat log file and store it inside a variable
        message = row[1]
        # Test for the presence of emoticons in the message using the three previously compiled regular expressions, and substitute each one found with the respective substitution-label (§_HAPPY-SMILEY_§, §_SAD-SMILEY_§, or §_KISS-SMILEY_§); the three substitutions are applied in sequence so that messages containing more than one type of emoticon are fully labelled
        message = happysmiley_compile.sub(" §_HAPPY-SMILEY_§ ", message)
        message = sadsmiley_compile.sub(" §_SAD-SMILEY_§ ", message)
        message = kisssmiley_compile.sub(" §_KISS-SMILEY_§ ", message)
        # Assign the formatted message as text of the <u> element tag
        turn_tag.text = message

    # Read all the unique values in the list of usernames, and for each one do:
    for username in set(usernames_list):
        # Get the user type
        user_type = get_user_type(username)
        # If the user type is equal to 'g' (i.e. groomer), get the total amount of time they spent chatting and assign it to
        # the <text> attribute 'timing', and add the groomer's username as value of the <text> attribute 'user'
        if user_type == "g":
            text_tag.attrib["timing"] = re.sub(
                r"(\[|\]|')", "", get_user_timing(username)
            )
            text_tag.attrib["user"] = username

# Create the XML structure by adding all the extracted elements to the main 'corpus' tag
tree = etree.ElementTree(corpus)
# The resulting XML structure is written using utf-8 encoding, adding the XML declaration at the beginning and graphically
# formatting the layout ('pretty_print'); since this step runs after the loop has ended, a single file containing all the
# <text> elements is written, named after the last CSV chat log file processed
tree.write(
    filename_without_csv + ".xml",
    pretty_print=True,
    xml_declaration=True,
    encoding="utf-8",
)
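
The emoticon step can be tried in isolation with the minimal sketch below, which rebuilds the three regular expressions from the same building blocks and applies them to an invented message. The helper name `label_emoticons` and the sample message are assumptions for illustration; the labels follow the format shown in example [e6.08]. Applying the three substitutions one after the other is a design choice of this sketch, so that a message containing several kinds of emoticon has all of them labelled.

```python
import re

# Same building blocks as in script s6.06
NormalEyes = r"[:=]"
Wink = r"[;]"
NoseArea = r"(|o|O|-)"
HappyMouths = r"[D\)\]]"
SadMouths = r"[\(\[]"
KissMouths = r"[\*]"
Tongue = r"[pP]"

# Eyes are shared by the three emoticon types
eyes = "(" + NormalEyes + "|" + Wink + ")"
happy = re.compile(eyes + NoseArea + "(" + HappyMouths + "|" + Tongue + ")")
sad = re.compile(eyes + NoseArea + "(" + SadMouths + ")")
kiss = re.compile(eyes + NoseArea + "(" + KissMouths + ")")


def label_emoticons(message):
    """Hypothetical helper: replace every emoticon with its substitution-label."""
    message = happy.sub(" §_HAPPY-SMILEY_§ ", message)
    message = sad.sub(" §_SAD-SMILEY_§ ", message)
    message = kiss.sub(" §_KISS-SMILEY_§ ", message)
    return message


# Invented message containing one emoticon of each type
print(label_emoticons("hi :) ok :( bye ;-*"))
```

Since each label is padded with spaces, the output may contain double spaces; these can be collapsed afterwards if the corpus tool is sensitive to them.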

Sample from the final corpus (CATLISM, 352-353)

Example [e6.08]

<?xml version='1.0' encoding='UTF-8'?>
<corpus>
    <text id="dz1cVUMyIB99080150" timing="1333" user="luv2licku68">
        <u turn="1" username="luv2licku68" time="8:54:50 PM" date="04112010" usertype="g"> hey there, how are you doing?</u>
        <u turn="2" username="katierella1013" time="8:54:01 PM" date="04112010" usertype="d"> hi a/s/l?  §_HAPPY-SMILEY_§ </u>
        ...
        <u turn="19" username="katierella1013" time="8:59:13 PM" date="04112010" usertype="d">
            well i
            <normalised orig="dunno" auto="true">don't know</normalised>
            , like the stuff some people say and there's like so much going on, so many conversations
            <normalised orig="n" auto="true">and</normalised>
            stuff
        </u>
    </text>
    ...
</corpus>
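
As a rough illustration of how a corpus with this structure can be queried, the sketch below parses an invented miniature corpus, counts the turns per user type, and rebuilds the text of each turn with the normalised forms in place. It uses the standard library's `xml.etree.ElementTree` rather than `lxml` so that it runs without extra dependencies; the usernames, IDs, and messages are made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented miniature corpus following the structure of example e6.08
SAMPLE = """<corpus>
    <text id="abc123" timing="1333" user="groomer01">
        <u turn="1" username="groomer01" time="8:54:50 PM" date="04112010" usertype="g">hey there</u>
        <u turn="2" username="decoy01" time="8:55:10 PM" date="04112010" usertype="d">well i <normalised orig="dunno" auto="true">don't know</normalised> yet</u>
    </text>
</corpus>"""

corpus = ET.fromstring(SAMPLE)

# Count the turns produced by each type of user (g = groomer, d = decoy)
turns_per_usertype = Counter(u.get("usertype") for u in corpus.iter("u"))

# Rebuild the text of each turn; 'itertext' also returns the text of the
# <normalised> child elements, so the normalised forms replace the originals
messages = [" ".join("".join(u.itertext()).split()) for u in corpus.iter("u")]

print(turns_per_usertype)
print(messages)
```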

Example of the interactive plot created for the visual exploration of collocations (CATLISM, 368)

Figure 6.3 Example of the interactive plot created for the visual exploration of collocations

Consult the original interactive plot
