Analysing crypto-drug market fora
The case study used in this section is based on research conducted at the Department of Applied Linguistics, Swansea University, under the supervision of Prof. Nuria Lorenzo-Dus and in collaboration with the Global Drug Policy Observatory (GDPO, www.swansea.ac.uk/gdpo/), as part of the project ‘Trust in Crypto-Drug Markets’ funded by EPSRC–CHERISH-DE (Principal Investigator: Prof. Nuria Lorenzo-Dus). Outputs from the research are published in [Horton-Eddison and Di Cristofaro, 2017], [Di Cristofaro and Horton-Eddison, 2017], and [Lorenzo-Dus and Di Cristofaro, 2018], which served as the basis for the contents described in this section.
CATLISM, 314
CATLISM, 317
Sample selection of HTML files from Silk Road 1 and 2
A sample selection of HTML files from Silk Road 1 and 2, sourced from the Darknet Market Archives [Branwen et al., 2015], is available for download. The file shown in the book is named index.php?topic=101030.0.
CATLISM, 320-321
Overall structure of a post in Silk Road 1
<div id="forumposts">
  <div class="windowbg">
    <span class="topslice"><span></span></span>
    <div class="post_wrapper">
      <div class="poster">
        <h4>
          <a href="[link to the user's profile]" title="View the profile of [USERNAME]">[USERNAME]</a>
        </h4>
        <ul class="reset smalltext" id="msg_[post ID]_extra_info">
          <li class="postgroup">[group to which the user belongs]</li>
          <li class="stars">
            [this section includes links to the icons used to visualise the user's achievements]
          </li>
          <li class="postcount">Posts: [number of posts written by the user]</li>
          <li class="karma">Karma: +[number]/-[number]</li>
          <li class="profile">
            <ul>
              <li>
                <a href="[link to the user's profile]"><img src="[link to the user's profile image]" alt="View Profile"
                    title="View Profile" /></a>
              </li>
            </ul>
          </li>
        </ul>
      </div>
      <div class="postarea">
        <div class="flow_hidden">
          <div class="keyinfo">
            <div class="messageicon">
              <img src="[link to the icon used for the message]" alt="" />
            </div>
            <h5 id="subject_[post ID]">
              <a href="[link to the post]" rel="nofollow">[title of the post]</a>
            </h5>
            <div class="smalltext">
              « <strong> on:</strong> [date on which the message was posted, in the format January 06, 2013, 03:29 am »];
            </div>
            <div id="msg_[post ID]_quick_mod"></div>
          </div>
        </div>
        <div class="post">
          <div class="inner" id="msg_[post ID]">
            [text of the post]
          </div>
        </div>
      </div>
      <div class="moderatorbar">
        <div class="smalltext modified" id="modified_[post ID]"></div>
        <div class="smalltext reportlinks">
          <img src="http://dkn255hz262ypmii.onion/Themes/default/images/ip.gif" alt="" />
          Logged
        </div>
      </div>
    </div>
    <span class="botslice"><span></span></span>
  </div>
</div>
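The `postcount` and `karma` items in the structure above hold plain strings rather than separate numeric fields. The following is a minimal sketch of how such strings could be turned into numbers once their text has been read from the page. It is not part of the book's script: the function name `parse_poster_info` and the sample values are hypothetical, and it assumes the strings follow exactly the `Posts: [number]` and `Karma: +[number]/-[number]` patterns shown in the listing.

```python
import re

# Patterns mirroring the placeholders in the structure above
# (assumption: plain digits, no thousands separators)
POSTCOUNT_RE = re.compile(r"Posts:\s*(\d+)")
KARMA_RE = re.compile(r"Karma:\s*\+(\d+)/-(\d+)")


def parse_poster_info(postcount_text, karma_text):
    """Return post count and karma values as integers, or None when a string does not match."""
    posts = POSTCOUNT_RE.search(postcount_text)
    karma = KARMA_RE.search(karma_text)
    return {
        "posts": int(posts.group(1)) if posts else None,
        "karma_plus": int(karma.group(1)) if karma else None,
        "karma_minus": int(karma.group(2)) if karma else None,
    }


# Hypothetical values, for illustration only
info = parse_poster_info("Posts: 42", "Karma: +10/-3")
print(info)
```

Returning `None` for non-matching strings keeps a malformed or missing field from stopping the processing of the rest of the page.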
CATLISM, 323-328
Extracting the data from HTML pages to XML format
# Import modules for: finding files through wildcard patterns; regular expressions; generating random numbers;
# using BeautifulSoup; reading timestamps as date objects; working with XML files
import glob
import re
from random import randint

from bs4 import BeautifulSoup
from dateutil import parser
from lxml import etree

# Create a dictionary with the replacement labels for the subfora names
replacements = {
    "bounties": "B01",
    "bug reports": "B02",
    "cryptocurrency": "B03",
    "customer support": "B04",
    "drug safety": "B05",
    "feature requests": "B06",
    "legal": "B07",
    "newbie discussion": "B08",
    "off topic": "B09",
    "philosophy, economics and justice": "B10",
    "press corner": "B11",
    "product offers": "B12",
    "product requests": "B13",
    "rumor mill": "B14",
    "security": "B15",
    "shipping": "B16",
    "silk road discussion": "B17",
    "the ross ulbricht case & theories": "B18",
}


# Create the function to convert the subforum names into their equivalent (arbitrarily chosen) labels
def assign_subforum_label(text):
    # Remove any leading or trailing whitespace from the subforum name, and convert all letters to lower case
    text = text.strip().lower()
    # Transform the name into its equivalent label, storing it into the variable 'label'
    label = replacements[text]
    # Return the variable 'label' as output
    return label


# Find all the files in the indicated subfolder, sorting them alphabetically
files = sorted(glob.glob("./SR2_files/*.*"))

# For each file in the list of files, do:
for file in files:
    # Create the main 'doc' element tag, which serves in the XML output as root element; here one separate 'doc' is created for
    # each original input file
    doc = etree.Element("doc")
    # Extract the name of the file excluding everything up to and including the last "=" (e.g. the leading
    # "index.php?topic=" string), and store it inside the variable 'filename'
    filename = re.search(r"index.*=(.*)", file).group(1).strip()
    # Assign it as value of the <doc> attribute 'filename'
    doc.attrib["filename"] = filename
    # Extract the numerical sequence appearing in the filename, to be later used as part of the XML 'id' attribute
    filename_number = re.search(r"index.*=([0-9]{1,10})", file).group(1).strip()
    # Append a random number to the extracted 'filename_number' to generate a pseudo-random value for the <doc> element
    # tag 'id' attribute
    random_id = str(filename_number) + "_" + str(randint(0, 100000))
    # Assign the generated 'random_id' as value of the attribute 'id'
    doc.attrib["id"] = random_id
    # Open the input file, making sure it is closed once it has been read
    with open(file, encoding="utf-8") as f:
        # Read the file with BeautifulSoup
        soup = BeautifulSoup(f, "lxml")

    # Extract the title of the thread
    thread_title = soup.find("title").get_text()
    # Get the <div> element containing the forum breadcrumbs, i.e. the hierarchical menu showing the path of the forum
    # thread being extracted, in bullet point format
    navigate_section = soup.find("div", {"class": "navigate_section"})
    # Write to 'subforum_name' the next-to-last element from the bullet point list, which indicates the forum section containing
    # the thread being extracted; and replace the right double angle quotes (the character '»', Unicode code point 187) from
    # the name of the subforum with nothing
    subforum_name = (
        navigate_section.find_all("li")[-2].get_text().replace(chr(187), "").strip()
    )
    # Assign to 'subforum_label' the custom label for the forum section, derived from 'subforum_name' using the
    # 'assign_subforum_label' function
    subforum_label = assign_subforum_label(subforum_name)

    # Get the <div> element containing all the posts
    posts_section = soup.find("div", {"id": "forumposts"})
    # For each single post do:
    for single_post in posts_section.find_all("div", {"class": "post_wrapper"}):
        # Create a <text> element tag to enclose the post
        text_tag = etree.SubElement(doc, "text")
        # Assign the title of thread to the <text> attribute 'title'
        text_tag.attrib["title"] = thread_title
        # Assign the name of the subforum to the <text> attribute 'subforum'
        text_tag.attrib["subforum"] = subforum_name
        # Assign the subforum label to the <text> attribute 'subforum_label'
        text_tag.attrib["subforum_label"] = subforum_label
        # Extract the username of the author of the post
        username = single_post.find("div", {"class": "poster"}).h4.get_text().strip()
        # Assign it as value of the attribute 'author' in the text metadata fields
        text_tag.attrib["author"] = username

        # Get the <div> element containing the details of the post
        post_details = single_post.find("div", {"class": "keyinfo"})
        # Extract the progressive number of the post, where number 1 is the first reply in the thread, etc.
        get_post_number = re.search(
            r".*#([0-9]{1,5}).*",
            post_details.find("div", {"class": "smalltext"}).get_text(),
        )
        # Assign the number of the post from 'get_post_number' to the variable 'post_number', or set it to 0 if the post
        # is the first of the thread - since no number is included in the details of the first post
        post_number = get_post_number.group(1) if get_post_number is not None else 0
        # Assign it as value of the attribute 'post_number' in the <text> metadata fields
        text_tag.attrib["post_number"] = str(post_number)

        # Get the string of text containing the date on which the message was posted
        get_post_date = re.search(
            r".*on:(.*)", post_details.find("div", {"class": "smalltext"}).get_text()
        ).group(1)
        # Convert the string date into a datetime object, after removing the right double angle quotes
        post_date = parser.parse(get_post_date.replace(chr(187), ""))
        # Extract the time of posting (hours and minutes) using the custom format HHMM - built through the 'strftime' method - and
        # save it to the variable 'post_date_time'
        post_date_time = post_date.strftime("%H%M")
        # Extract the day, month, year from the datetime object (using the standard notation '.day', '.month', '.year', which
        # return integers without zero-padding), convert each one to a string, and assign all the date elements to
        # different metadata attributes
        text_tag.attrib["date_d"] = str(post_date.day)
        text_tag.attrib["date_m"] = str(post_date.month)
        text_tag.attrib["date_y"] = str(post_date.year)
        text_tag.attrib["date_time"] = str(post_date_time)

        # Get the <div> containing the content of the post, i.e. the message
        post_section = single_post.find("div", {"class": "post"})
        # Check if the message contains "a quote", i.e. if the message the current one replies to is included inside the post; if
        # so, exclude it from the extraction, extract the message and assign it to the variable 'post_content'
        try:
            post_section.find("div", {"class": "quoteheader"}).extract()
            post_section.find("blockquote").extract()
            post_content = post_section.get_text()
        # If no quotation is present, extract the message and store it inside the variable 'post_content'
        except AttributeError:
            post_content = post_section.get_text()
        # Write the extracted message as text of the 'text' element tag
        text_tag.text = post_content

    # Build the XML structure with all the elements collected so far
    tree = etree.ElementTree(doc)
    # Write the resulting XML structure to a file named after the input filename, using utf-8 encoding, adding the XML declaration
    # at the start of the file and graphically formatting the layout ('pretty_print')
    tree.write(
        filename + ".xml", pretty_print=True, xml_declaration=True, encoding="utf-8"
    )
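The string-handling steps of the script (splitting the filename, reading the post number, and parsing the date) can be tried in isolation, without any HTML file. The sketch below is an illustration under stated assumptions, not the book's code: the helper names `split_filename` and `parse_post_details` are hypothetical, the sample strings are invented to match the format described in the comments, and the standard-library `datetime.strptime` is swapped in for `dateutil`'s parser, which assumes English month names and the exact format `January 06, 2013, 03:29 am`.

```python
import re
from datetime import datetime


def split_filename(path):
    """Return (filename, topic_number) using the same regular expressions as the script."""
    filename = re.search(r"index.*=(.*)", path).group(1).strip()
    topic_number = re.search(r"index.*=([0-9]{1,10})", path).group(1)
    return filename, topic_number


def parse_post_details(details):
    """Return (post_number, datetime) from the text of the 'smalltext' <div>."""
    # The first post of a thread carries no '#N' marker, so it is numbered 0
    number_match = re.search(r"#([0-9]{1,5})", details)
    post_number = int(number_match.group(1)) if number_match else 0
    # Keep what follows 'on:' and drop the closing right double angle quote
    raw_date = re.search(r"on:(.*)", details).group(1).replace(chr(187), "").strip()
    # Assumption: fixed format and English locale (dateutil is more lenient)
    post_date = datetime.strptime(raw_date, "%B %d, %Y, %I:%M %p")
    return post_number, post_date


# Hypothetical inputs, for illustration only
name, number = split_filename("./SR2_files/index.php?topic=101030.0")
reply_number, reply_date = parse_post_details("« Reply #3 on: January 06, 2013, 03:29 am »")
first_number, _ = parse_post_details("« on: January 06, 2013, 03:29 am »")
print(name, number, reply_number, reply_date.strftime("%H%M"), first_number)
```

Because `index.*=` is greedy, the captured group starts after the last `=` in the path, so a path containing several `=` characters would yield only the final segment.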
Meta-representation of the contents extracted in XML format through [s6.01] (Silk Road corpus)
<?xml version='1.0' encoding='UTF-8'?>
<doc filename="FILENAME" id="UNIQUE_ID_FROM_FILENAME">
  <text title="TITLE" subforum_label="[SUBFORUM_LABEL]" subforum="FULL_NAME_OF_THE_SUBFORUM_SECTION" author="[USERNAME]" post_number="NUMBER" date_d="NUMBER" date_m="NUMBER" date_y="NUMBER" date_time="NUMBER">
    MESSAGE, INCLUDING A
    <url orig="ORIGINAL_HYPERLINK">
      URL
    </url>
    IN ITS CONTENTS
  </text>
</doc>
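A file with this structure can be read back to check the extracted metadata. The following is a minimal sketch using the standard-library `xml.etree.ElementTree` module; the sample document and all its values (thread title, usernames, URL) are invented for illustration and do not come from the corpus.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample following the meta-representation above
sample = """<doc filename="101030.0" id="101030_12345">
  <text title="Example thread" subforum_label="B15" subforum="security" author="user_a" post_number="0" date_d="6" date_m="1" date_y="2013" date_time="0329">
    First message with a <url orig="http://example.org">link</url> inside
  </text>
  <text title="Example thread" subforum_label="B15" subforum="security" author="user_b" post_number="1" date_d="6" date_m="1" date_y="2013" date_time="0410">
    Second message
  </text>
</doc>"""

root = ET.fromstring(sample)

# Count how many posts each author wrote in this thread
posts_per_author = Counter(text.get("author") for text in root.iter("text"))

# Collect each message, including the text nested inside <url> elements,
# and collapse runs of whitespace left over from the pretty-printed layout
messages = [" ".join("".join(text.itertext()).split()) for text in root.iter("text")]

print(posts_per_author)
print(messages)
```

`itertext()` is used instead of the `.text` attribute because `.text` stops at the first child element and would drop everything from the `<url>` tag onwards.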
CATLISM, 328
Meta-representation of the XML structure of the reference corpus (DPM corpus)
<?xml version='1.0' encoding='UTF-8'?>
<text id="[UNIQUE_NUMERICAL_ID]" cqpyear="YYYY" cqpdescription="TITLE_OF_THE_REPORT">
  TEXTUAL CONTENT OF THE DOCUMENT
</text>
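As an illustration of this structure, the sketch below builds one such `<text>` element with the standard-library `xml.etree.ElementTree` module. It is not the procedure used to compile the reference corpus: the function name `build_reference_text` and every value passed to it are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def build_reference_text(doc_id, year, title, content):
    """Create a <text> element carrying the three attributes shown above."""
    text = ET.Element(
        "text",
        {"id": str(doc_id), "cqpyear": str(year), "cqpdescription": title},
    )
    text.text = content
    return text


# Hypothetical values, for illustration only
element = build_reference_text(1, 2015, "Example report title", "Textual content of the document")
# Serialise with an XML declaration, as in the meta-representation
xml_bytes = ET.tostring(element, encoding="UTF-8", xml_declaration=True)
print(xml_bytes.decode("utf-8"))
```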