Analysing crypto-drug market fora

The case study used in this section is based on research conducted at the Swansea University Department of Applied Linguistics under the supervision of Prof. Nuria Lorenzo-Dus, in collaboration with the Global Drug Policy Observatory (GDPO; www.swansea.ac.uk/gdpo/), as part of the project ‘Trust in Crypto-Drug Markets’ funded by EPSRC–CHERISH-DE (Principal Investigator: Prof. Nuria Lorenzo-Dus). Outputs from the research are published in [], [], and [], which served as the basis for the contents described here (CATLISM, 314).

Sample selection of HTML files from Silk Road 1 and 2 (CATLISM, 317)

A sample selection of HTML files from Silk Road 1 and 2, sourced from the Darknet Market Archives [], may be downloaded by clicking the button below. The file shown in the book is named index.php?topic=101030.0.

Download sample of Silk Road 1 and 2 HTML files

Overall structure of a post in Silk Road 1 (CATLISM, 320-321)

Example [e6.01]
<div id="forumposts">
  <div class="windowbg">
    <span class="topslice"><span></span></span>
    <div class="post_wrapper">
      <div class="poster">
        <h4>
          <a href="[link to the user's profile]" title="View the profile of [USERNAME]">[USERNAME]</a>
        </h4>
        <ul class="reset smalltext" id="msg_[post ID]_extra_info">
          <li class="postgroup">[group to which the user belongs]</li>
          <li class="stars">
            [this section includes links to the icons used to visualise the user's achievements]
          </li>
          <li class="postcount">Posts: [number of posts written by the user]</li>
          <li class="karma">Karma: +[number]/-[number]</li>
          <li class="profile">
            <ul>
              <li>
                <a href="[link to the user's profile]"><img src="[link to the user's profile image]" alt="View Profile"
                    title="View Profile" /></a>
              </li>
            </ul>
          </li>
        </ul>
      </div>
      <div class="postarea">
        <div class="flow_hidden">
          <div class="keyinfo">
            <div class="messageicon">
              <img src="[link to the icon used for the message]" alt="" />
            </div>
            <h5 id="subject_[post ID]">
              <a href="[link to the post]" rel="nofollow">[title of the post]</a>
            </h5>
            <div class="smalltext">
              &#171; <strong> on:</strong> [date on which the message was posted, in the format January 06, 2013, 03:29 am] &#187;
            </div>
            <div id="msg_[post ID]_quick_mod"></div>
          </div>
        </div>
        <div class="post">
          <div class="inner" id="msg_[post ID]">
            [text of the post]
          </div>
        </div>
      </div>
      <div class="moderatorbar">
        <div class="smalltext modified" id="modified_[post ID]"></div>
        <div class="smalltext reportlinks">
          <img src="http://dkn255hz262ypmii.onion/Themes/default/images/ip.gif" alt="" />
          Logged
        </div>
      </div>
    </div>
    <span class="botslice"><span></span></span>
  </div>
</div>
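
The placeholders in the structure above follow fixed textual patterns, for instance `Posts: [number]` and `Karma: +[number]/-[number]`. As a minimal sketch (not part of the book's scripts), the function below turns the text of those two list items into integers using only regular expressions. The function name `parse_user_stats` and the sample strings are hypothetical.

```python
import re


def parse_user_stats(postcount_text, karma_text):
    """Convert the text of the 'postcount' and 'karma' list items to integers.

    Hypothetical helper: it assumes the patterns 'Posts: N' and 'Karma: +N/-N'
    shown in the structure above, and returns None for any value not found.
    """
    posts = re.search(r"Posts:\s*([0-9]+)", postcount_text)
    karma = re.search(r"Karma:\s*\+([0-9]+)/-([0-9]+)", karma_text)
    return {
        "posts": int(posts.group(1)) if posts else None,
        "karma_plus": int(karma.group(1)) if karma else None,
        "karma_minus": int(karma.group(2)) if karma else None,
    }


# Made-up values, used only to illustrate the expected input format
stats = parse_user_stats("Posts: 42", "Karma: +12/-3")
print(stats)
```

Returning `None` rather than raising an error keeps the sketch usable on posts where one of the two items is absent.
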

Extracting the data from HTML pages to XML format (CATLISM, 323-328)

Script [s6.01]
# Import modules for: finding files using wildcard patterns; regular expressions; generating random numbers;
# using BeautifulSoup; reading timestamps as date objects; working with XML files
import glob
import re
from random import randint
from bs4 import BeautifulSoup
from dateutil import parser
from lxml import etree

# Create a dictionary with the replacement labels for the subfora names
replacements = {
    "bounties": "B01",
    "bug reports": "B02",
    "cryptocurrency": "B03",
    "customer support": "B04",
    "drug safety": "B05",
    "feature requests": "B06",
    "legal": "B07",
    "newbie discussion": "B08",
    "off topic": "B09",
    "philosophy, economics and justice": "B10",
    "press corner": "B11",
    "product offers": "B12",
    "product requests": "B13",
    "rumor mill": "B14",
    "security": "B15",
    "shipping": "B16",
    "silk road discussion": "B17",
    "the ross ulbricht case  &amp;  theories": "B18",
}

# Create the function to convert the subforum names into their equivalent (arbitrarily chosen) labels
def assign_subforum_label(text):
    # Remove any leading or trailing whitespace from the subforum name, and convert all letters to lower case
    text = text.strip().lower()
    # Transform the name into its equivalent label, storing it into the variable 'label'
    label = replacements[text]
    # Return the variable 'label' as output
    return label


# Find all the files in the indicated subfolder, sorting them alphabetically
files = sorted(glob.glob("./SR2_files/*.*"))

# For each file in the list of files, do:
for file in files:
    # Create the main 'doc' element tag, which serves in the XML output as root element; here one separate 'doc' is created for
    # each original input file
    doc = etree.Element("doc")
    # Extract the name of the file excluding the leading "index.php?topic=" string, and store it inside the variable 'filename'
    filename = re.search(r"index.*=(.*)", file).group(1).strip()
    # Assign it as value of the <doc> attribute 'filename'
    doc.attrib["filename"] = filename
    # Extract the numerical sequence appearing in the filename, to be later used as part of the XML 'id' attribute
    filename_number = re.search(r"index.*=([0-9]{1,10})", file).group(1).strip()
    # Append a random number to the extracted 'filename_number' to generate a pseudo-random value for the <doc> element
    # tag 'id' attribute
    random_id = str(filename_number) + "_" + str(randint(0, 100000))
    # Assign the generated 'random_id' as value of the attribute 'id'
    doc.attrib["id"] = random_id
    # Open the input file, making sure it is closed again once it has been read
    with open(file, encoding="utf-8") as f:
        # Read the file with BeautifulSoup
        soup = BeautifulSoup(f, "lxml")

    # Extract the title of the thread
    thread_title = soup.find("title").get_text()
    # Get the <div> element containing the forum breadcrumbs, i.e. the hierarchical menu showing the path of the forum
    # thread being extracted, in bullet point format
    navigate_section = soup.find("div", {"class": "navigate_section"})
    # Write to 'subforum_name' the next-to-last element from the bullet point list, which indicates the forum section containing
    # the thread being extracted; and replace the right double angle quotes (Unicode code point 187) from the name of the
    # subforum with nothing
    subforum_name = (
        navigate_section.find_all("li")[-2].get_text().replace(chr(187), "").strip()
    )
    # Assign to 'subforum_label' the custom label for the forum section, derived from 'subforum_name' using the
    # 'assign_subforum_label' function
    subforum_label = assign_subforum_label(subforum_name)

    # Get the <div> element containing all the posts
    posts_section = soup.find("div", {"id": "forumposts"})
    # For each single post do:
    for single_post in posts_section.find_all("div", {"class": "post_wrapper"}):
        # Create a <text> element tag to enclose the post
        text_tag = etree.SubElement(doc, "text")
        # Assign the title of thread to the <text> attribute 'title'
        text_tag.attrib["title"] = thread_title
        # Assign the name of the subforum to the <text> attribute 'subforum'
        text_tag.attrib["subforum"] = subforum_name
        # Assign the subforum label to the <text> attribute 'subforum_label'
        text_tag.attrib["subforum_label"] = subforum_label
        # Extract the username of the author of the post
        username = single_post.find("div", {"class": "poster"}).h4.get_text().strip()
        # Assign it as value of the attribute 'author' in the text metadata fields
        text_tag.attrib["author"] = username

        # Get the <div> element containing the details of the post
        post_details = single_post.find("div", {"class": "keyinfo"})
        # Extract the progressive number of the post, where number 1 is the first reply in the thread, etc.
        get_post_number = re.search(
            r".*#([0-9]{1,5}).*",
            post_details.find("div", {"class": "smalltext"}).get_text(),
        )
        # Assign the number of the post from 'get_post_number' to the variable 'post_number', or set it to 0 if the post
        # is the first of the thread - since no number is included in the details of the first post
        post_number = get_post_number.group(1) if get_post_number is not None else 0
        # Assign it as value of the attribute 'post_number' in the <text> metadata fields
        text_tag.attrib["post_number"] = str(post_number)

        # Get the string of text containing the date on which the message was posted
        get_post_date = re.search(
            r".*on:(.*)", post_details.find("div", {"class": "smalltext"}).get_text()
        ).group(1)
        # Convert the string date into a datetime object, after removing the right double angle quotes
        post_date = parser.parse(get_post_date.replace(chr(187), ""))
        # Extract the time of posting (hours and minutes) using the custom format HHMM - built through the 'strftime' method - and
        # save it to the variable 'post_date_time'
        post_date_time = post_date.strftime("%H%M")
        # Extract the day, month, year from the datetime object (using the standard attributes '.day', '.month', '.year',
        # which return integers without zero-padding), then assign all the date elements to different metadata
        # attributes
        text_tag.attrib["date_d"] = str(post_date.day)
        text_tag.attrib["date_m"] = str(post_date.month)
        text_tag.attrib["date_y"] = str(post_date.year)
        text_tag.attrib["date_time"] = str(post_date_time)

        # Get the <div> containing the content of the post, i.e. the message
        post_section = single_post.find("div", {"class": "post"})
        # Check if the message contains "a quote", i.e. if the message the current one replies to is included inside the post; if
        # so, exclude it from the extraction, extract the message and assign it to the variable 'post_content'
        try:
            post_section.find("div", {"class": "quoteheader"}).extract()
            post_section.find("blockquote").extract()
            post_content = post_section.get_text()
        # If no quotation is present, extract the message and store it inside the variable 'post_content'
        except AttributeError:
            post_content = post_section.get_text()
        # Write the extracted message as text of the 'text' element tag
        text_tag.text = post_content

    # Build the XML structure with all the elements collected so far
    tree = etree.ElementTree(doc)
    # Write the resulting XML structure to a file named after the input filename, using utf-8 encoding, adding the XML declaration
    # at the start of the file and graphically formatting the layout ('pretty_print')
    tree.write(
        filename + ".xml", pretty_print=True, xml_declaration=True, encoding="utf-8"
    )
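
The two regular expressions that read the post details can be tried out in isolation. The sketch below applies them to made-up strings shaped like the 'smalltext' line of a post. The `Reply #3` wording is an assumption about how the forum numbers replies, and the standard-library `datetime.strptime` stands in for `dateutil.parser` so that the example has no external dependencies; the function name `parse_post_details` is hypothetical.

```python
import re
from datetime import datetime


def parse_post_details(details):
    """Return (post_number, datetime) from the text of a post's details line."""
    # Same idea as in the script: the digits following '#', if any
    number_match = re.search(r"#([0-9]{1,5})", details)
    post_number = int(number_match.group(1)) if number_match else 0
    # Everything after 'on:', minus the closing right double angle quote
    date_string = re.search(r"on:(.*)", details).group(1)
    date_string = date_string.replace(chr(187), "").strip()
    # Assumed date layout, e.g. 'January 06, 2013, 03:29 am'
    post_date = datetime.strptime(date_string, "%B %d, %Y, %I:%M %p")
    return post_number, post_date


# Made-up examples: a first post (no number) and a reply
first = parse_post_details("\u00ab on: January 06, 2013, 03:29 am \u00bb")
reply = parse_post_details("\u00ab Reply #3 on: January 07, 2013, 11:05 pm \u00bb")

print(first[0], first[1].strftime("%H%M"))
print(reply[0], reply[1].strftime("%H%M"))
```

Unlike `dateutil.parser`, `strptime` accepts only the one layout given in the format string, so a fixed format is a reasonable choice only if every post in the collection uses it.
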

Meta-representation of contents extracted in XML format through [s6.01] (Silk Road corpus)

Example [e6.02]
<?xml version='1.0' encoding='UTF-8'?>
<doc filename="FILENAME" id="UNIQUE_ID_FROM_FILENAME">
    <text title="TITLE" subforum_label="[SUBFORUM_LABEL]" subforum="FULL_NAME_OF_THE_SUBFORUM_SECTION" author="[USERNAME]" post_number="NUMBER" date_d="NUMBER" date_m="NUMBER" date_y="NUMBER" date_time="NUMBER">
        MESSAGE, INCLUDING A
        <url orig="ORIGINAL_HYPERLINK">
            URL
        </url>
        IN ITS CONTENTS
    </text>
</doc>
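
Once the posts are stored in this format, the metadata attributes can be read back with any XML library. As an illustration only, the sketch below parses a small made-up document with the standard-library `xml.etree.ElementTree` and counts the posts per author. All attribute values are invented, and the `<url>` element is left out for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample following the meta-representation above
sample = """<doc filename="101030.0" id="101030_4711">
    <text title="Example thread" subforum_label="B15" subforum="Security" author="user_a" post_number="0" date_d="6" date_m="1" date_y="2013" date_time="0329">First message</text>
    <text title="Example thread" subforum_label="B15" subforum="Security" author="user_b" post_number="1" date_d="6" date_m="1" date_y="2013" date_time="0410">A reply</text>
    <text title="Example thread" subforum_label="B15" subforum="Security" author="user_a" post_number="2" date_d="7" date_m="1" date_y="2013" date_time="2305">Another reply</text>
</doc>"""

doc = ET.fromstring(sample)
# Count how many <text> elements (posts) each author contributed
posts_per_author = Counter(text.get("author") for text in doc.iter("text"))
# Collect the posts written in January 2013 (attributes are stored as strings)
january_posts = [
    text.text
    for text in doc.iter("text")
    if text.get("date_y") == "2013" and text.get("date_m") == "1"
]

print(posts_per_author)
print(len(january_posts))
```
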

Meta-representation of the XML structure of the reference corpus (DPM corpus) (CATLISM, 328)

Example [e6.03]
<?xml version='1.0' encoding='UTF-8'?>
<text id="[UNIQUE_NUMERICAL_ID]" cqpyear="YYYY" cqpdescription="TITLE_OF_THE_REPORT">
    TEXTUAL CONTENT OF THE DOCUMENT
</text>
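
A document in this format can be generated with a few lines of Python. The following sketch is an illustration under stated assumptions, not the procedure used to build the DPM corpus: the function name `build_dpm_text` and all values are hypothetical, and only the attribute names are taken from the meta-representation above.

```python
import xml.etree.ElementTree as ET


def build_dpm_text(doc_id, year, description, content):
    """Build a <text> element with the attributes shown in the example above."""
    text = ET.Element("text")
    text.set("id", str(doc_id))
    text.set("cqpyear", str(year))
    text.set("cqpdescription", description)
    text.text = content
    return text


# Invented values, for illustration only
element = build_dpm_text(1, 2015, "Example report title", "Textual content of the document")
xml_string = ET.tostring(element, encoding="unicode")
print(xml_string)
```
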

Example of page from Silk Road 1 forum

Figure 6.1 Example of a page collected from the Silk Road 1 forum
