# Reference list for online contents

The table below lists the commands (IDs starting with c), examples (e), scripts (s), tables, and figures included in the book and contained in the online compendium. The online version of each element can be accessed by clicking on the relevant ID field.
Elements are listed in the order in which they appear in the book. Clicking on a column header changes the sorting order, while the Search field looks for a string across all contents (i.e. the search is applied to all columns at once). A generic sketch of the conda and git syntax behind entries [c4.01]-[c4.13] is provided after the table.

| ID | Description | Page |
| --- | --- | --- |
| e4.01 | Default appearance of the terminal when a conda environment is active | 101 |
| c4.01 | Command to create a new virtual environment in conda | 101 |
| c4.02 | Command to activate a virtual environment in conda | 101 |
| c4.03 | Command to deactivate the virtual environment in conda | 101 |
| e4.02 | Default appearance of the terminal when the conda environment myenv is active | 101 |
| c4.04 | Command to install pip in conda | 101 |
| c4.06 | Initialise git in a local folder | 124 |
| c4.07 | Clone a remote repository | 124 |
| c4.08 | Add all changes (even from previously untracked files) to the local git database (i.e. stage the changes) | 125 |
| c4.09 | Record (commit) all changes, along with a textual description of what has been changed | 125 |
| c4.10 | Send (push) all changes to the remote repository | 125 |
| c4.11 | Obtain (fetch) all changes from the remote repository | 125 |
| c4.12 | Include/apply (merge) all fetched changes into the local repository | 125 |
| c4.13 | Obtain and include/apply (pull) all changes from the remote repository into the local repository | 125 |
| Figure 5.1 | #LancsBox main interface | 149 |
| Figure 5.2 | #LancsBox data collection interface | 149 |
| c5.01 | Install archivebox | 152 |
| c5.02 | Set up archivebox | 153 |
| c5.03 | Start archivebox | 153 |
| c5.04 | Install trafilatura | 156 |
| c5.05 | Start the CLI version of trafilatura | 156 |
| c5.06 | Start the GUI version of trafilatura | 156 |
| c5.07 | Install the gooey package | 157 |
| c5.08 | Use trafilatura to download the .txt version of the URLs contained in a list | 157 |
| c5.09 | Use trafilatura to download the .xml version of the URLs contained in a list | 157 |
| e5.04 | Example of the XML structure created by trafilatura | 158-159 |
| c5.10 | Use trafilatura to extract XML files from local HTML files, including formatting, links, and images | 159 |
| s5.01 | Extract links from HTML pages using BeautifulSoup | 162-163 |
| s5.02a | Download and scrape HTML pages from the links extracted with [s5.01], using Selenium and BeautifulSoup | 164-166 |
| s5.02b | Download and scrape HTML pages from the links extracted with [s5.01], using requests and BeautifulSoup | 166 |
| s5.03 | Extract metadata from the downloaded HTML pages using BeautifulSoup | 166-171 |
| e5.08 | Basic structure of the metadata table included in MoreThesis pages | 171-173 |
| s5.04 | Download the PDF files linked in HTML pages | 174-175 |
| s5.05 | Extract the contents of PDF files as plain text using textract | 176 |
| s5.06 | Create an XML corpus combining the metadata from HTML pages and the contents of PDF files, using lxml | 177-180 |
| c5.11 | Install snscrape | 183 |
| c5.12 | Basic snscrape syntax | 183 |
| c5.13 | Access the “help” section for a specific snscrape scraper | 183 |
| c5.14 | Basic syntax to scrape tweets to .jsonl using an snscrape advanced search query | 191 |
| c5.15 | Example of how to scrape tweets to .jsonl using an snscrape advanced search query | 191 |
| c5.16 | Example of an snscrape advanced search query using operators | 192 |
| c5.17 | Example of an snscrape advanced search query using operators | 192 |
| c5.18 | Example of an snscrape advanced search query using operators | 192 |
| c5.19 | Example of an snscrape advanced search query using operators | 192 |
| c5.20 | Example of an snscrape advanced search query using operators | 192 |
| c5.21 | Install pandas | 193 |
| c5.22 | Use script [s5.07] to scrape tweets using a list of queries | 193 |
| s5.07 | Scrape tweets with snscrape using a list of queries | 193-196 |
| e5.09 | Example of a filename saved by script [s5.07] | 196 |
| Table 5.12 | Metadata data points collected by snscrape | 197-203 |
| e5.10 | Example of data extracted with [s5.08] | 204 |
| s5.08 | Convert tweets extracted with snscrape to XML format | 204-206 |
| c5.23 | Install instaloader | 206 |
| c5.24 | Basic instaloader syntax | 206 |
| c5.25 | Example of an instaloader scraping command that downloads comments and geolocations | 208 |
| Table 5.17 | Metadata data points collected by instaloader for posts | 209-218 |
| Table 5.18 | Metadata data points collected by instaloader for comments | 218 |
| e5.11 | Example of data extracted with [s5.09] | 220 |
| s5.09 | Convert Instagram posts and comments extracted with instaloader to XML format | 220-226 |
| c5.26 | Install facebook-scraper | 228 |
| c5.27 | Basic facebook-scraper syntax | 228 |
| Table 5.21 | Metadata data points collected by facebook-scraper for posts | 229-233 |
| Table 5.22 | Metadata data points collected by facebook-scraper for profiles | 233-235 |
| Table 5.23 | Metadata data points collected by facebook-scraper for groups | 236 |
| e5.12 | Example of data extracted with [s5.10] | 236-237 |
| s5.10 | Convert Facebook posts and comments extracted with facebook-scraper to XML format | 237-242 |
| s5.11 | Get profile details from Facebook using facebook-scraper | 242-245 |
| s5.12 | Implement the collection of profile details ([s5.11]) into [s5.10] | 245-246 |
| c5.28 | Install youtube-dl | 247 |
| c5.29 | Basic youtube-dl syntax | 247 |
| e5.14 | Example of the TTML format | 252-253 |
| e5.15 | Example of the SRV format without auto-captioning | 253 |
| e5.16 | Example of the SRV format with auto-captioning | 253-254 |
| c5.30 | Install youtube-comment-downloader | 254 |
| c5.31 | Basic youtube-comment-downloader syntax | 254 |
| Table 5.28 | Metadata data points collected by youtube-dl for videos | 255-262 |
| Table 5.29 | Metadata data points collected by youtube-comment-downloader for comments | 262 |
| c5.32 | Extract video details, metadata, and subtitles from YouTube without downloading the multimedia files | 263 |
| e5.18 | Example of data extracted with [s5.13] | 264 |
| e5.19 | Example of data extracted with [s5.14] | 264 |
| s5.13 | Extract the collected YouTube data (everything except comments) to XML format | 264-269 |
| s5.14 | Extract the collected YouTube comments to XML format | 269-272 |
| s5.15 | Sample usage of the dateutil.parser module to parse a date in string format | 274 |
| Figure 5.8 | Example of recognised spelling variants in VARD | 276 |
| Figure 5.9 | Example of unrecognised spelling variants in VARD | 277 |
| e5.21 | Example of normalised data in XML format generated with VARD | 278 |
| c5.33 | Install textract | 279 |
| c5.34 | Basic textract syntax | 279 |
| s5.16 | Identify a set of predefined languages in .txt files and write a summary report in spreadsheet format | 281-283 |
| e5.23 | Example of hashtags transformed through [s5.17] | 286 |
| s5.17 | Segment hashtags and transform them into XML tags in an XML corpus file | 287-288 |
| e5.24 | Regular expression to capture usernames/username handles | 289 |
| e5.25 | Regular expression to capture simple URLs | 289 |
| e5.26 | Regular expression to capture complex URLs | 289 |
| e5.27 | Regular expression to capture cashtags | 289 |
| c5.35 | Install stanza | 291 |
| s5.18 | Install stanza language models | 291 |
| e5.28 | Example of data in XML format extracted with [s5.20] | 292 |
| s5.19 | Annotate .txt files and output the results in XML format | 292-294 |
| s5.20 | Annotate .txt files and output the results in verticalised XML format | 294-295 |
| e5.29 | Example of the verticalised XML format | 296 |
| Figure 5.10 | OpenRefine main page | 298 |
| Figure 5.11 | Preview for CSV import in OpenRefine | 299 |
| Figure 5.12 | Preview for JSON import in OpenRefine | 299 |
| Figure 5.13 | Preview for XML import (step 1) in OpenRefine | 300 |
| Figure 5.14 | Preview for XML import (step 2) in OpenRefine | 300 |
| Figure 5.15 | Using ‘facets’ (filters) in OpenRefine | 301 |
| Figure 6.1 | Example of a page collected from the Silk Road 1 forum | 317 |
| e6.01 | Example (modified) of the post structure in Silk Road 1 HTML pages | 320-321 |
| e6.02 | XML meta-structure of the data extracted through [s6.01] (Silk Road 1 corpus) | 321-322 |
| s6.01 | Convert Silk Road 1 HTML pages to XML format using BeautifulSoup | 323-328 |
| e6.03 | XML meta-structure of the documents included in the DPM corpus | 328 |
| c6.01 | Scrape tweets created after a specific date with twint | 337 |
| c6.02 | Scrape tweets created after a specific date with snscrape (replicating the results produced with [c6.01]) | 338 |
| s6.02 | Convert tweets extracted with twint from CSV to XML format | 338-340 |
| e6.04 | Example of data extracted with [s6.02] | 340 |
| e6.05 | Example of the syntax used by WordPress to show all the posts available in a website | 342 |
| s6.03 | Collect (crawl) all post links from a WordPress website | 342-344 |
| e6.06 | Example of a message containing an emoji | 344 |
| e6.07 | Examples of the emoji transliterations applied to [e6.06] through [s6.04] | 344-345 |
| s6.04 | Function to transliterate emojis using the emoji module | 345-346 |
| e6.08 | Example of data extracted with [s6.06] (PJ corpus) | 352-353 |
| s6.05 | Collect all chatlogs from perverted-justice.com | 353-357 |
| s6.06 | Convert PJ chatlogs into XML format | 360-366 |
| Figure 6.3 | Example of the interactive plot created for the visual exploration of collocations | 368 |