Analysing the language of far-right groups on Twitter and Facebook#


The case study utilised in this section is based on research conducted at the Swansea University Department of Applied Linguistics under the supervision of Prof. Nuria Lorenzo-Dus, as part of the project ‘Developing Interdisciplinary and Industry Collaboration to Tackle Far-Right Extremist Use of Social Media for Propaganda and Recruitment’ (Principal Investigator Dr. Lella Nouri). Outputs from the research are published in [] and [], which served as the basis for the contents described (CATLISM, 330).


As with the tools and techniques described in Twitter, the tool and techniques exemplified below no longer work: as of late June 2023 a number of changes have made anonymous access to tweets impossible, rendering twint (a project abandoned in the spring of 2022) unusable. They are included to support the contents of the book.

Collecting Twitter data for the corpus#


Collecting Twitter data for the corpus using twint (CATLISM, 337)#

Command [c6.01] #
twint -u Twitter --since 2015-06-16 -o twint_output.csv --csv

Collecting Twitter data for the corpus using snscrape (CATLISM, 338)#

Command [c6.02] #
snscrape -v --progress --jsonl twitter-search 'since:2015-06-16 from:Twitter' > snscrape_output.json
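The `--jsonl` option produces one JSON object per line, so the resulting file can be read line by line with Python's standard library. The following is only a minimal sketch: the helper names `read_jsonl` and `summarise_tweet` are hypothetical, the key names (`id`, `date`, `user`, `username`, `content`, `rawContent`) are assumptions about the structure of snscrape's output, and the sample data is invented for illustration.

```python
import io
import json


def read_jsonl(lines):
    """Yield one dictionary for each non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarise_tweet(record):
    """Pick out a few values; the key names are assumptions about snscrape's output."""
    user = record.get("user") or {}
    return {
        "id": str(record.get("id", "")),
        "date": record.get("date", ""),
        "username": user.get("username", ""),
        # Assumption: the text is stored under 'rawContent' or, failing that, 'content'
        "content": record.get("rawContent") or record.get("content", ""),
    }


# Invented two-line sample standing in for the contents of snscrape_output.json
sample = io.StringIO(
    '{"id": 1, "date": "2022-02-11T12:55:37+00:00", '
    '"user": {"username": "Twitter"}, "content": "first"}\n'
    '{"id": 2, "date": "2022-02-08T16:40:20+00:00", '
    '"user": {"username": "Twitter"}, "rawContent": "second"}\n'
)
tweets = [summarise_tweet(record) for record in read_jsonl(sample)]
```

With a real file, `sample` would be replaced by a file object opened with `open("snscrape_output.json", encoding="utf-8")`.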

Extract the data from twint output to XML format (CATLISM, 338-340)#

Script [s6.02] #
# Import the required modules to read csv files and to write XML files
import csv
from lxml import etree

# Create the root element of the XML structure - <corpus> -, which will contain all the extracted tweets
# as elements defined by the <text> element tag (one tweet = one text)
corpus = etree.Element("corpus")

# Open the csv file containing the data collected by twint; the 'with' block closes the file once it has been read
with open("twint_output.csv", "r", newline="", encoding="utf-8") as csv_data:
    # Read the content of the file, where each value is separated from the others by a tab (\t) character
    csv_data_reader = csv.reader(csv_data, delimiter="\t")
    # Skip the first row, containing the header
    next(csv_data_reader, None)
    # Create a list of all the rows and store it in the variable 'rows'
    rows = list(csv_data_reader)
# For each row in the list of rows, do:
for row in rows:
    # Extract each relevant value and store it inside a single variable, by indicating the
    # column number where the value is stored in the original data. Python counts from 0, so e.g. column
    # number 7 is read as number 6
    tweet_id = row[0]
    tweet_date = row[2]
    tweet_username = row[6]
    tweet_user_realname = row[7]
    tweet_content = row[10]
    # The following variables should contain the value 0 if no urls are included or the tweet is not a retweet,
    # and the value 1 when links are present or the tweet is a retweet. For urls, the script checks whether the length
    # of the original value is shorter than 3 characters (i.e. it only contains the empty square brackets), in which
    # case it assigns the value 0 as no value is present in the data. For 'tweet_isretweet' it checks if the word 'False'
    # appears in the data and assigns 0 if it does or 1 otherwise.
    tweet_urls_present = 0 if len(row[13]) < 3 else 1
    tweet_isretweet = 0 if row[21] == "False" else 1
    # The extracted values are assigned as values to the attributes of <text>.
    # The actual content of the tweet is then enclosed inside of <text> using the notation '.text'
    etree.SubElement(
        corpus,
        "text",
        id=str(tweet_id),
        csv_date_created=str(tweet_date),
        csv_username=str(tweet_username),
        csv_user_realname=str(tweet_user_realname),
        csv_urls_present=str(tweet_urls_present),
        csv_isretweet=str(tweet_isretweet),
    ).text = str(tweet_content)
# The XML structure is created by adding all the extracted elements to the main <corpus> element tag
tree = etree.ElementTree(corpus)
# The resulting XML structure is written to the twint_out.xml file using utf-8 encoding, adding the XML declaration
# at the beginning and graphically formatting the layout ('pretty_print')
tree.write("twint_out.xml", pretty_print=True, xml_declaration=True, encoding="utf-8")

Example of data extracted with [s6.02] (CATLISM, 340)#

Example [e6.04]#
<?xml version='1.0' encoding='UTF-8'?>
<corpus>
  <text id="1492120137205526528" csv_date_created="2022-02-11T12:55:37+00:00" csv_username="Twitter" csv_user_realname="Twitter" csv_urls_present="0" csv_isretweet="0">oh good you're up :grinning-face-with-big-eyes:, here are a million Tweets to look at</text>
  <text id="1491089523291394052" csv_date_created="2022-02-08T16:40:20+00:00" csv_username="Twitter" csv_user_realname="Twitter" csv_urls_present="0" csv_isretweet="0">@Seipati-Sanity guess what we voted</text>
</corpus>
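As a quick check on a file structured in this way, the `<text>` elements and their attributes can be read back and filtered. The sketch below is only illustrative: it uses Python's built-in `xml.etree.ElementTree` rather than lxml, and the two-tweet `sample_xml` string is an invented miniature that reuses the attribute names written by [s6.02].

```python
import xml.etree.ElementTree as ET

# Invented miniature of twint_out.xml, reusing the attribute names written by [s6.02]
sample_xml = """<corpus>
  <text id="1" csv_date_created="2022-02-11T12:55:37+00:00" csv_username="Twitter" csv_user_realname="Twitter" csv_urls_present="0" csv_isretweet="0">first tweet</text>
  <text id="2" csv_date_created="2022-02-08T16:40:20+00:00" csv_username="Twitter" csv_user_realname="Twitter" csv_urls_present="1" csv_isretweet="1">second tweet</text>
</corpus>"""

# Parse the string and obtain the root <corpus> element
root = ET.fromstring(sample_xml)
# Collect every <text> element, i.e. every tweet
all_texts = list(root.iter("text"))
# Keep only the tweets that are not retweets (attribute values are always strings)
originals = [t for t in all_texts if t.get("csv_isretweet") == "0"]
# Extract the textual content of the remaining tweets
contents = [t.text for t in originals]
```

To read the file written by the script instead of a string, `ET.fromstring(sample_xml)` can be replaced with `ET.parse("twint_out.xml").getroot()`.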

Collecting Wordpress blog data for the corpus#


Syntax used for the Wordpress blog pages (CATLISM, 342)#

Example [e6.05]#
[N]/

Transliterating emojis in the corpus (CATLISM)#

Function to transliterate emojis (using two different output formats)#

The function defined in [s6.04] can be imported into any script and called using the syntax given in its docstring.

Script [s6.04] #
# Import the required module to transliterate emojis
import emoji

# Define the function called 'demojize'
def demojize(text, output):
    """Converts emoji(s) found in a string of text into their transliterated CLDR version; input is:

    text: the string of text with one or more emojis
    output: the format of the output.

    If 'output' is set to 'default', the result for 🙃 is {upside-down_face}
    If 'output' is set to 'custom', the result is {upside^down^face}

    Usage follows the syntax
    demojize(INPUT, FORMAT)
    """

    # If 'output' is set to 'default', apply the standard transliteration using curly brackets as delimiters
    if output == "default":
        return emoji.demojize(text, delimiters=("{", "}"))
    # Else if set to 'custom' do:
    elif output == "custom":
        # Create a list of characters and store inside of it the text to be processed
        out_text = list(text)
        # Use the function 'emoji_count' to count the total number of emojis identified in the original string
        emoji_count = emoji.emoji_count(text)
        # For each identified emoji do:
        for i in range(emoji_count):
            # Take the first emoji still present in the text, from the list of emojis created through the function
            # 'emoji_list'; the list of characters is joined back into a string, since the function expects a string.
            # The function creates, for each emoji, three data-points: 'emoji' containing the actual emoji;
            # 'match_start' indicates the positional value of the first character of the emoji; 'match_end' the
            # positional value immediately after the last character of the emoji.
            first_emoji = emoji.emoji_list("".join(out_text))[0]
            # Store the three aforementioned data-points in three separate variables
            found_emoji = first_emoji["emoji"]
            emoji_start = first_emoji["match_start"]
            emoji_end = first_emoji["match_end"]
            # Apply the standard demojize function to the identified emoji, and replace the underscore _ with the character ^
            demojized = str(
                " " + emoji.demojize(found_emoji, delimiters=("{", "}")) + " "
            ).replace("_", "^")
            # Replace the hyphen with the character ^
            demojized = demojized.replace("-", "^")
            # Replace the emoji with its transliterated version in the original text
            out_text[emoji_start:emoji_end] = demojized
        # Return the full text with transliterated emojis
        return "".join(out_text)

Example of message before the transliteration#

Example [e6.06]#
oh good you’re up 😄, here are a million Tweets to look at

Example [e6.06] after emoji transliteration through [s6.04]#

Example [e6.07]#
# using option 'default'
<text>oh good you're up {grinning_face_with_smiling_eyes}, here are a million Tweets to look at</text>

# using option 'custom'
<text>oh good you're up {grinning^face^with^smiling^eyes}, here are a million Tweets to look at</text>
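The two formats in [e6.07] differ only in the characters used inside the curly brackets: the underscores and hyphens of the 'default' format become `^` in the 'custom' one. As a minimal sketch of that relationship, and not the method used in [s6.04], text already transliterated in the 'default' format could be converted with a regular expression. The helper name `to_custom_format` is hypothetical, and the sketch assumes that transliterated names contain no whitespace and that curly brackets appear in the text only around them.

```python
import re


def to_custom_format(text):
    """Replace '_' and '-' with '^' inside {name} tokens, leaving all other text untouched."""

    def _swap(match):
        # Rewrite the separators inside one matched {name} token
        return re.sub(r"[_-]", "^", match.group(0))

    # A token is a pair of curly brackets enclosing characters other than brackets or whitespace
    return re.sub(r"\{[^{}\s]+\}", _swap, text)


default_text = "oh good you're up {grinning_face_with_smiling_eyes}, here are a million Tweets to look at"
custom_text = to_custom_format(default_text)
```

Restricting the substitution to the bracketed tokens leaves underscores and hyphens elsewhere in a message, for instance in usernames, unchanged.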