Analysing the language of far-right groups on Twitter and Facebook#
The case study [utilised] in this section is based on the research conducted at the Swansea University Department of Applied Linguistics under the supervision of Prof. Nuria Lorenzo-Dus, as part of the project ‘Developing Interdisciplinary and Industry Collaboration to Tackle Far-Right Extremist Use of Social Media for Propaganda and Recruitment’ (Principal Investigator: Dr. Lella Nouri). Outputs from the research are published in [Nouri and Lorenzo-Dus, 2019] and [Lorenzo-Dus and Nouri, 2021], which served as the basis for the contents [described].
CATLISM, 330
Warning
Similar to the tools and techniques described in the Twitter section, as of late June 2023 a number of changes have made anonymous access to tweets impossible, consequently rendering the use of the tool twint (a project abandoned in the spring of 2022) and the techniques exemplified below useless. They are included here to support the contents of the book.
Collecting Twitter data for the corpus#
Collecting Twitter data for the corpus using twint#
CATLISM, 337
twint -u Twitter --since 2015-06-16 -o twint_output.csv --csv
Collecting Twitter data for the corpus using snscrape#
CATLISM, 338
snscrape -v --progress --jsonl twitter-search 'since:2015-06-16 from:Twitter' > snscrape_output.json
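For orientation only, the short sketch below (not part of the original scripts) shows one way the resulting snscrape_output.json file could be converted into the same <corpus>/<text> XML structure built by [s6.02]. The JSON field names used here ('content'/'rawContent', 'date', 'id', 'user') are assumptions that vary across snscrape versions and should be checked against the actual output:

# A sketch (not from CATLISM): convert the snscrape JSONL output into the same <corpus>/<text>
# XML structure built by script [s6.02]; attribute names mirror those used in [s6.02].
# The JSON field names assumed here differ between snscrape versions.
import json
from lxml import etree

corpus = etree.Element("corpus")

with open("snscrape_output.json", "r", encoding="utf-8") as jsonl_file:
    # Each line of the file is one tweet serialised as a JSON object
    for line in jsonl_file:
        tweet = json.loads(line)
        etree.SubElement(
            corpus,
            "text",
            id=str(tweet["id"]),
            csv_date_created=str(tweet["date"]),
            csv_username=str(tweet["user"]["username"]),
            csv_user_realname=str(tweet["user"]["displayname"]),
        ).text = str(tweet.get("rawContent") or tweet.get("content", ""))

tree = etree.ElementTree(corpus)
tree.write("snscrape_out.xml", pretty_print=True, xml_declaration=True, encoding="utf-8")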
Extract the data from twint output to XML format#
CATLISM, 338-340
# Import the required modules to read/write CSV files and to create XML files
import csv
from lxml import etree

# Create the root element of the XML structure - <corpus> -, which will contain all the extracted tweets
# as elements defined by the <text> element tag (one tweet = one text)
corpus = etree.Element("corpus")

# Open the csv file containing the data collected by twint
csv_data = open("twint_output.csv", "r", newline="", encoding="utf-8")
# Read the content of the file, where each value is separated from the others by a tab (\t) character
csv_data_reader = csv.reader(csv_data, delimiter="\t")
# Skip the first row, containing the header
next(csv_data_reader, None)
# Create a list of all the rows and store it in the variable 'rows'
rows = [r for r in csv_data_reader]
# For each row in the list of rows, do:
for row in rows:
    # Extract each relevant value and store it inside a single variable, by indicating the
    # column number where the value is stored in the original data. Python counts from 0, so e.g. column
    # number 7 is read as number 6
    tweet_id = row[0]
    tweet_date = row[2]
    tweet_username = row[6]
    tweet_user_realname = row[7]
    tweet_content = row[10]
    # The following variables should contain the value 0 if no urls are included or the tweet is not a retweet,
    # and the value 1 when links are present or the tweet is a retweet. For urls, the script checks whether the length
    # of the original value is shorter than 3 characters (i.e. it only contains the empty square brackets), in which
    # case it assigns the value 0 as no value is present in the data. For 'tweet_isretweet' it checks if the word 'False'
    # appears in the data and assigns 0 if it does or 1 otherwise.
    tweet_urls_present = 0 if len(row[13]) < 3 else 1
    tweet_isretweet = 0 if row[21] == "False" else 1
    # The extracted values are assigned as values to the attributes of <text>.
    # The actual content of the tweet is then enclosed inside of <text> using the notation '.text'
    etree.SubElement(
        corpus,
        "text",
        id=str(tweet_id),
        csv_date_created=str(tweet_date),
        csv_username=str(tweet_username),
        csv_user_realname=str(tweet_user_realname),
        csv_urls_present=str(tweet_urls_present),
        csv_isretweet=str(tweet_isretweet),
    ).text = str(tweet_content)
# The XML structure is created by adding all the extracted elements to the main <corpus> element tag
tree = etree.ElementTree(corpus)
# The resulting XML structure is written to the twint_out.xml file using utf-8 encoding, adding the XML declaration
# at the beginning and graphically formatting the layout ('pretty_print')
tree.write("twint_out.xml", pretty_print=True, xml_declaration=True, encoding="utf-8")
Example of data extracted with [s6.02]#
CATLISM, 340
<?xml version='1.0' encoding='UTF-8'?>
<corpus>
  <text id="1492120137205526528" csv_date_created="2022-02-11T12:55:37+00:00" csv_username="Twitter" csv_user_realname="Twitter" csv_urls_present="0" csv_isretweet="0">oh good you're up 😃, here are a million Tweets to look at</text>
  <text id="1491089523291394052" csv_date_created="2022-02-08T16:40:20+00:00" csv_username="Twitter" csv_user_realname="Twitter" csv_urls_present="0" csv_isretweet="0">@Seipati-Sanity guess what we voted</text>
</corpus>
Collecting WordPress blog data for the corpus#
Syntax used for the WordPress blog pages#
CATLISM, 342
https://example.com/page/[N]/
Collecting links to all the posts in a blog#
CATLISM, 342-344
Script [s6.03] [uses] the module requests and BeautifulSoup […] to crawl the blog pages containing links to the posts and to scrape each post’s URL, respectively. The procedure relies on the default structure of WordPress blogs, whereby users can browse all the available posts in pages where only a preview is shown; these pages are accessible through links formatted as in example [e6.05], where [N] is an incremental number starting from 2 indicating the second – up to the nth – page of the blog (the first page is instead missing the /page/[N]/ string); and the string page indicates the URL path under which the contents are archived. Depending on how the creator of a blog sets up the website, strings such as blog, news, and articles may appear instead of page; in such cases it is sufficient to replace the latter with the relevant label in script [s6.03]. A further step is then required to adapt it to any other blog built using WordPress by changing the starting web address of the blog in lines 30 and 34.
CATLISM, 341
 1  # Import modules for: using regular expressions; pausing the script ('sleep'); collecting data from the web;
 2  # and using BeautifulSoup
 3  import re
 4  from time import sleep
 5  import requests
 6  from bs4 import BeautifulSoup
 7
 8  # Compile the regular expression to capture heading tags (<h1>, <h2>, etc...)
 9  heading_tags = re.compile("^h[1-6]$")
10  # Define the headers to be used for crawling the data, so that the script will be "seen" by the server
11  # as originating from a Chrome browser running on macOS
12  headers = {
13      "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
14  }
15  # Create an empty list to store the collected URLs
16  links = []
17  # Initialise a counter to generate the incremental page numbers
18  start_page = 1
19  # Set the number of WordPress pages to be collected
20  max_page_number = 2
21
22  while True:
23      # Check if the number of the page to be collected is greater than the total number of pages to be collected; if so, write the results to the output file and stop the collection
24      if start_page > max_page_number:
25          with open("links_list.txt", "w", encoding="utf-8") as output_file:
26              output_file.write("\n".join(links))
27          break
28      # Check if the counter 'start_page' is set to the first page; if so the URL to crawl is the main page
29      elif start_page == 1:
30          url = "https://example.com/page/"
31      # If the counter is set to 2 or more, then the URL follows a different format
32      else:
33          # Construct the URL by including (through the use of format and the {} notation) the number of the counter
34          url = "https://example.com/page/{}/".format(start_page)
35      # Get the content of the URL, using the headers defined in lines 12-14
36      r = requests.get(url, headers=headers)
37      # Read the collected HTML content in BeautifulSoup using the 'lxml' parser
38      soup = BeautifulSoup(r.content, "lxml")
39      # Find the section that contains the list of articles; oftentimes it is included in the <main> element tag, but this may
40      # change depending on the WordPress theme adopted and on the organisation of the contents on the website
41      main_section = soup.find("main")
42      # Find all instances of the <article> tag, identifying the elements that contain the articles information; similar to the
43      # <main> element tag, this may vary. A post/content may be labelled as e.g. 'article' or 'news', and may be identified by an
44      # element tag such as <article> or <div class="news">. The script should therefore be adapted depending on the structure of the
45      # website being scraped; if e.g. a <div class="news"> is employed, the syntax ("div", {"class": "news"}) should be used
46      # instead of ("article")
47      articles = main_section.find_all("article")
48      # For each article found
49      for article in articles:
50          # Find the first heading tag
51          article_title = article.find(heading_tags)
52          # For each heading tag, find all the hyperlink tags (<a>)
53          for url in article_title.find_all("a"):
54              print(url)
55              # Extract the 'href' attribute containing the URL to the article, and add it to the 'links' list
56              links.append(url["href"])
57          # Wait two seconds before collecting the next article, to avoid making too many requests to the server
58          sleep(2)
59      # Increase the counter by 1
60      start_page += 1
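As a minimal illustration of the adaptation described above (not part of [s6.03]), the snippet below prints the page URLs that lines 29-34 would need to produce for a hypothetical blog, reachable at a placeholder address, whose post previews are archived under news rather than page:

# A sketch (not from CATLISM): previewing the URLs to be crawled when adapting lines 30 and 34
# of [s6.03]. Both the domain and the 'news' archive label are hypothetical placeholders.
base_address = "https://myblog.example.org"
archive_label = "news"  # label replacing 'page' when the blog uses a different archive path
max_page_number = 3

for start_page in range(1, max_page_number + 1):
    if start_page == 1:
        # First page of the blog, without the trailing page number
        url = "{}/{}/".format(base_address, archive_label)
    else:
        # Second and subsequent pages, following the [e6.05] format
        url = "{}/{}/{}/".format(base_address, archive_label, start_page)
    print(url)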
Transliterating emojis in the corpus#
CATLISM
Function to transliterate emojis (using two different output formats)#
The function defined in [s6.04] can be imported into any script and used with the following syntax:
demojize(INPUT, OUTPUT_FORMAT)
# Import the required module to transliterate emojis
import emoji

# Define the function called 'demojize'
def demojize(text, output):
    """Converts emoji(s) found in a string of text into their transliterated CLDR version; input is:

    text: the string of text with one or more emojis
    output: the format of the output.

    If 'output' is set to 'default', the result for 🙃 is {upside-down_face}
    If 'output' is set to 'custom', the result is {upside^down^face}

    Usage follows the syntax
    demojize(INPUT, FORMAT)
    """

    # If 'output' is set to 'default', apply the standard transliteration using curly brackets as delimiters
    if output == "default":
        return emoji.demojize(text, delimiters=("{", "}"))
    # Else if set to 'custom' do:
    elif output == "custom":
        # Create a list of the characters of the text to be processed, so that emojis can be replaced in place
        out_text = list(text)
        # Use the function 'emoji_count' to count the total number of identified emojis
        emoji_count = emoji.emoji_count(text)
        # For each identified emoji do:
        for i in range(emoji_count):
            # Take the first emoji in the list of emojis found, created through the function 'emoji_list'
            # (which requires a string, hence the join). The function creates, for each emoji, three data points:
            # 'emoji', containing the actual emoji; 'match_start', the positional value of the first character of
            # the emoji; and 'match_end', the positional value right after the last character of the emoji.
            first_emoji = emoji.emoji_list("".join(out_text))[0]
            # Store the three aforementioned data points in three separate variables
            found_emoji = first_emoji["emoji"]
            emoji_start = first_emoji["match_start"]
            emoji_end = first_emoji["match_end"]
            # Apply the standard demojize function to the identified emoji, and replace the underscore _ with the character ^
            demojized = str(
                " " + emoji.demojize(found_emoji, delimiters=("{", "}")) + " "
            ).replace("_", "^")
            # Replace the hyphen with the character ^
            demojized = demojized.replace("-", "^")
            # Replace the emoji with its transliterated version in the original text
            out_text[emoji_start:emoji_end] = demojized
        # Return the full text with transliterated emojis
        return "".join(out_text)
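A short usage sketch (not part of [s6.04]) is shown below; it assumes the function has been saved in a hypothetical file named demojize_function.py:

# A sketch (not from CATLISM): importing and applying the function defined in [s6.04],
# here assumed to be stored in a hypothetical file named demojize_function.py
from demojize_function import demojize

message = "oh good you're up 😄, here are a million Tweets to look at"

# Standard transliteration, e.g. {grinning_face_with_smiling_eyes}
print(demojize(message, "default"))
# Custom transliteration, e.g. {grinning^face^with^smiling^eyes}
print(demojize(message, "custom"))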
Example of message before the transliteration#
oh good you’re up 😄, here are a million Tweets to look at
Example [e6.06] after emoji transliteration through [s6.04]#
# using option 'default'
<text>oh good you're up {grinning_face_with_smiling_eyes}, here are a million Tweets to look at</text>
# using option 'custom'
<text>oh good you're up {grinning^face^with^smiling^eyes}, here are a million Tweets to look at</text>