Analysing the language of far-right groups on Twitter and Facebook#
The case study [utilised] in this section is based on the research conducted at the Swansea University Department of Applied Linguistics under the supervision of Prof. Nuria Lorenzo-Dus, as part of the project ‘Developing Interdisciplinary and Industry Collaboration to Tackle Far-Right Extremist Use of Social Media for Propaganda and Recruitment’ (Principal Investigator: Dr. Lella Nouri). Outputs from the research are published in [Nouri and Lorenzo-Dus, 2019] and [Lorenzo-Dus and Nouri, 2021], which served as the basis for the contents [described].
CATLISM, 330
Warning
Similar to the tools and techniques described in the Twitter section, as of late June 2023 a number of changes have made anonymous access to tweets impossible, consequently rendering the use of the tool twint (a project abandoned in the spring of 2022) and the techniques exemplified below useless. They are included here to support the contents of the book.
Collecting Twitter data for the corpus#
Collecting Twitter data for the corpus using twint#
CATLISM, 337
twint -u Twitter --since 2015-06-16 -o twint_output.csv --csv
Collecting Twitter data for the corpus using snscrape#
CATLISM, 338
snscrape -v --progress --jsonl twitter-search 'since:2015-06-16 from:Twitter' > snscrape_output.json
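For orientation only, the short sketch below (not part of the original scripts) shows one way the resulting snscrape_output.json file could be converted into the same <corpus>/<text> XML structure built by [s6.02]. The JSON field names used here ('content'/'rawContent', 'date', 'id', 'user') are assumptions that vary across snscrape versions and should be checked against the actual output:

# A sketch (not from CATLISM): convert the snscrape JSONL output into the same <corpus>/<text>
# XML structure built by script [s6.02]; attribute names mirror those used in [s6.02].
# The JSON field names assumed here differ between snscrape versions.
import json
from lxml import etree

corpus = etree.Element("corpus")

with open("snscrape_output.json", "r", encoding="utf-8") as jsonl_file:
    # Each line of the file is one tweet serialised as a JSON object
    for line in jsonl_file:
        tweet = json.loads(line)
        etree.SubElement(
            corpus,
            "text",
            id=str(tweet["id"]),
            csv_date_created=str(tweet["date"]),
            csv_username=str(tweet["user"]["username"]),
            csv_user_realname=str(tweet["user"]["displayname"]),
        ).text = str(tweet.get("rawContent") or tweet.get("content", ""))

tree = etree.ElementTree(corpus)
tree.write("snscrape_out.xml", pretty_print=True, xml_declaration=True, encoding="utf-8")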
Extract the data from twint output to XML format#
CATLISM, 338-340
# Import the required modules to read/write CSV files and to create XML files
import csv
from lxml import etree

# Create the root element of the XML structure - <corpus> -, which will contain all the extracted tweets
# as elements defined by the <text> element tag (one tweet = one text)
corpus = etree.Element("corpus")

# Open the csv file containing the data collected by twint
csv_data = open("twint_output.csv", "r", newline="", encoding="utf-8")
# Read the content of the file, where each value is separated from the others by a tab (\t) character
csv_data_reader = csv.reader(csv_data, delimiter="\t")
# Skip the first row, containing the header
next(csv_data_reader, None)
# Create a list of all the rows and store it in the variable 'rows'
rows = [r for r in csv_data_reader]
# For each row in the list of rows, do:
for row in rows:
    # Extract each relevant value and store it inside a single variable, by indicating the
    # column number where the value is stored in the original data. Python counts from 0, so e.g. column
    # number 7 is read as number 6
    tweet_id = row[0]
    tweet_date = row[2]
    tweet_username = row[6]
    tweet_user_realname = row[7]
    tweet_content = row[10]
    # The following variables should contain the value 0 if no urls are included or the tweet is not a retweet,
    # and the value 1 when links are present or the tweet is a retweet. For urls, the script checks whether the length
    # of the original value is shorter than 3 characters (i.e. it only contains the empty square brackets), in which
    # case it assigns the value 0 as no value is present in the data. For 'tweet_isretweet' it checks if the word 'False'
    # appears in the data and assigns 0 if it does or 1 otherwise.
    tweet_urls_present = 0 if len(row[13]) < 3 else 1
    tweet_isretweet = 0 if row[21] == "False" else 1
    # The extracted values are assigned as values to the attributes of <text>.
    # The actual content of the tweet is then enclosed inside of <text> using the notation '.text'
    etree.SubElement(
        corpus,
        "text",
        id=str(tweet_id),
        csv_date_created=str(tweet_date),
        csv_username=str(tweet_username),
        csv_user_realname=str(tweet_user_realname),
        csv_urls_present=str(tweet_urls_present),
        csv_isretweet=str(tweet_isretweet),
    ).text = str(tweet_content)
# The XML structure is created by adding all the extracted elements to the main <corpus> element tag
tree = etree.ElementTree(corpus)
# The resulting XML structure is written to the twint_out.xml file using utf-8 encoding, adding the XML declaration
# at the beginning and graphically formatting the layout ('pretty_print')
tree.write("twint_out.xml", pretty_print=True, xml_declaration=True, encoding="utf-8")
Example of data extracted with [s6.02]#
CATLISM, 340
<?xml version='1.0' encoding='UTF-8'?>
<corpus>
  <text id="1492120137205526528" csv_date_created="2022-02-11T12:55:37+00:00" csv_username="Twitter" csv_user_realname="Twitter" csv_urls_present="0" csv_isretweet="0">oh good you're up 😃, here are a million Tweets to look at</text>
  <text id="1491089523291394052" csv_date_created="2022-02-08T16:40:20+00:00" csv_username="Twitter" csv_user_realname="Twitter" csv_urls_present="0" csv_isretweet="0">@Seipati-Sanity guess what we voted</text>
</corpus>
Collecting WordPress blog data for the corpus#
Syntax used for the WordPress blog pages#
CATLISM, 342
https://example.com/page/[N]/
Collecting links to all the posts in a blog#
CATLISM, 342-344
Script [s6.03] [uses] the module requests and BeautifulSoup […] to crawl the blog pages containing links to the posts and to scrape each post’s URL, respectively. The procedure relies on the default structure of WordPress blogs, whereby users can browse all the available posts in pages where only a preview is shown; these pages are accessible through links formatted as in example [e6.05], where [N] is an incremental number starting from 2 indicating the second – up to the nth – page of the blog (the first page is instead missing the /page/[N]/ string); and the string page indicates the URL path under which the contents are archived. Depending on how the creator of a blog sets up the website, strings such as blog, news, and articles may appear instead of page; in such cases it is sufficient to replace the latter with the relevant label in script [s6.03]. A further step is then required to adapt it to any other blog built using WordPress by changing the starting web address of the blog in lines 30 and 34.
CATLISM, 341
 1  # Import modules for: using regular expressions; pausing the script ('sleep'); collecting data from the web;
 2  # and using BeautifulSoup
 3  import re
 4  from time import sleep
 5  import requests
 6  from bs4 import BeautifulSoup
 7
 8  # Compile the regular expression to capture heading tags (<h1>, <h2>, etc...)
 9  heading_tags = re.compile("^h[1-6]$")
10  # Define the headers to be used for crawling the data, so that the script will be "seen" by the server
11  # as originating from a Chrome browser running on macOS
12  headers = {
13      "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
14  }
15  # Create an empty list to store the collected URLs
16  links = []
17  # Initialise a counter to generate the incremental page numbers
18  start_page = 1
19  # Set the number of WordPress pages to be collected
20  max_page_number = 2
21
22  while True:
23      # Check if the number of the page to be collected is greater than the total number of pages to be collected; if so, write the results to the output file and stop the collection
24      if start_page > max_page_number:
25          with open("links_list.txt", "w", encoding="utf-8") as output_file:
26              output_file.write("\n".join(links))
27          break
28      # Check if the counter 'start_page' is set to the first page; if so the URL to crawl is the main page
29      elif start_page == 1:
30          url = "https://example.com/page/"
31      # If the counter is set to 2 or more, then the URL follows a different format
32      else:
33          # Construct the URL by including (through the use of format and the {} notation) the number of the counter
34          url = "https://example.com/page/{}/".format(start_page)
35      # Get the content of the URL, using the headers defined in lines 12-14
36      r = requests.get(url, headers=headers)
37      # Read the collected HTML content in BeautifulSoup using the 'lxml' parser
38      soup = BeautifulSoup(r.content, "lxml")
39      # Find the section that contains the list of articles; oftentimes it is included in the <main> element tag, but this may
40      # change depending on the WordPress theme adopted and on the organisation of the contents on the website
41      main_section = soup.find("main")
42      # Find all instances of the <article> tag, identifying the elements that contain the articles information; similar to the
43      # <main> element tag, this may vary. A post/content may be labelled as e.g. 'article' or 'news', and may be identified by an
44      # element tag such as <article> or <div class="news">. The script should therefore be adapted depending on the structure of the
45      # website being scraped; if e.g. a <div class="news"> is employed, the syntax ("div", {"class": "news"}) should be used
46      # instead of ("article")
47      articles = main_section.find_all("article")
48      # For each article found
49      for article in articles:
50          # Find the first heading tag
51          article_title = article.find(heading_tags)
52          # For each heading tag, find all the hyperlink tags (<a>)
53          for url in article_title.find_all("a"):
54              print(url)
55              # Extract the 'href' attribute containing the URL to the article, and add it to the 'links' list
56              links.append(url["href"])
57          # Wait two seconds before collecting the next article, to avoid making too many requests to the server
58          sleep(2)
59      # Increase the counter by 1
60      start_page += 1
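As a minimal illustration of the adaptation described above (not part of [s6.03]), the snippet below prints the page URLs that lines 29-34 would need to produce for a hypothetical blog, reachable at a placeholder address, whose post previews are archived under news rather than page:

# A sketch (not from CATLISM): previewing the URLs to be crawled when adapting lines 30 and 34
# of [s6.03]. Both the domain and the 'news' archive label are hypothetical placeholders.
base_address = "https://myblog.example.org"
archive_label = "news"  # label replacing 'page' when the blog uses a different archive path
max_page_number = 3

for start_page in range(1, max_page_number + 1):
    if start_page == 1:
        # First page of the blog, without the trailing page number
        url = "{}/{}/".format(base_address, archive_label)
    else:
        # Second and subsequent pages, following the [e6.05] format
        url = "{}/{}/{}/".format(base_address, archive_label, start_page)
    print(url)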
Transliterating emojis in the corpus#
CATLISM
Function to transliterate emojis (using two different output formats)#
The function defined in [s6.04] can be imported into any script and used with the following syntax:
demojize(INPUT, OUTPUT_FORMAT)
# Import the required module to transliterate emojis
import emoji

# Define the function called 'demojize'
def demojize(text, output):
    """Converts emoji(s) found in a string of text into their transliterated CLDR version; input is:

    text: the string of text with one or more emojis
    output: the format of the output.

    If 'output' is set to 'default', the result for 🙃 is {upside-down_face}
    If 'output' is set to 'custom', the result is {upside^down^face}

    Usage follows the syntax
    demojize(INPUT, FORMAT)
    """

    # If 'output' is set to 'default', apply the standard transliteration using curly brackets as delimiters
    if output == "default":
        return emoji.demojize(text, delimiters=("{", "}"))
    # Else if set to 'custom' do:
    elif output == "custom":
        # Create a list of the characters of the text to be processed, so that emojis can be replaced in place
        out_text = list(text)
        # Use the function 'emoji_count' to count the total number of identified emojis
        emoji_count = emoji.emoji_count(text)
        # For each identified emoji do:
        for i in range(emoji_count):
            # Take the first emoji in the list of emojis found, created through the function 'emoji_list'
            # (which requires a string, hence the join). The function creates, for each emoji, three data points:
            # 'emoji', containing the actual emoji; 'match_start', the positional value of the first character of
            # the emoji; and 'match_end', the positional value right after the last character of the emoji.
            first_emoji = emoji.emoji_list("".join(out_text))[0]
            # Store the three aforementioned data points in three separate variables
            found_emoji = first_emoji["emoji"]
            emoji_start = first_emoji["match_start"]
            emoji_end = first_emoji["match_end"]
            # Apply the standard demojize function to the identified emoji, and replace the underscore _ with the character ^
            demojized = str(
                " " + emoji.demojize(found_emoji, delimiters=("{", "}")) + " "
            ).replace("_", "^")
            # Replace the hyphen with the character ^
            demojized = demojized.replace("-", "^")
            # Replace the emoji with its transliterated version in the original text
            out_text[emoji_start:emoji_end] = demojized
        # Return the full text with transliterated emojis
        return "".join(out_text)
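A short usage sketch (not part of [s6.04]) is shown below; it assumes the function has been saved in a hypothetical file named demojize_function.py:

# A sketch (not from CATLISM): importing and applying the function defined in [s6.04],
# here assumed to be stored in a hypothetical file named demojize_function.py
from demojize_function import demojize

message = "oh good you're up 😄, here are a million Tweets to look at"

# Standard transliteration, e.g. {grinning_face_with_smiling_eyes}
print(demojize(message, "default"))
# Custom transliteration, e.g. {grinning^face^with^smiling^eyes}
print(demojize(message, "custom"))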
Example of message before the transliteration#
oh good you’re up 😄, here are a million Tweets to look at
Example [e6.06] after emoji transliteration through [s6.04]#
# using option 'default'
<text>oh good you're up {grinning_face_with_smiling_eyes}, here are a million Tweets to look at</text>
# using option 'custom'
<text>oh good you're up {grinning^face^with^smiling^eyes}, here are a million Tweets to look at</text>