Twitter#

Warning

As of late June 2023 a number of changes have made anonymous access to tweets impossible, consequently rendering tools such as snscrape and other "API-less" ones unusable. Twitter will apparently reinstate anonymous access in the future (see this discussion from the snscrape issue pages), but as of 10th September 2023 a number of commands and scripts included in this page - marked with ⚠ - are not working. They are included to support the contents of the book, and new (working) ones will be included if anything changes.

Data can be collected from Twitter using snscrape.
Options and arguments for the tool can be found in the official documentation.


Installing the tool [CATLISM, 183]#

Command [c5.11]#
pip install snscrape
2"CATLISM, 183; 191-192

Using the tool [CATLISM, 183; 191-192]#

Command [c5.12]#
snscrape [GLOBAL-OPTIONS] SCRAPER-NAME [SCRAPER-OPTIONS] [SCRAPER-ARGUMENTS ... ]
Command [c5.13]#
snscrape SCRAPER-NAME --help
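
As a concrete instance of the structure in [c5.12] - a sketch assuming the twitter-user scraper, whose options and arguments can be checked through [c5.13] - the global options used throughout this page combine with a scraper name and its argument as follows:

snscrape -v --progress --jsonl twitter-user USERNAME > OUTPUT.jsonl

Here -v, --progress, and --jsonl are global options, twitter-user is the scraper name, and USERNAME is a placeholder for the scraper argument (an actual Twitter username).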
Command [c5.14] ⚠#
snscrape -v --progress --jsonl twitter-search '[QUERY]' > OUTPUT.jsonl
Command [c5.15] ⚠#
snscrape -v --progress --jsonl twitter-search '("technology" OR "physics") ("space" OR "rockets")' > OUTPUT.jsonl
Command [c5.16]#
("technology" OR "physics") ("space" OR "rockets")
Command [c5.17]#
(techology OR physics) (space OR rokets)
Command [c5.18]#
("technology" OR "physics") -("games" OR "console")
Command [c5.19]#
(๐Ÿด OR ๐ŸŒˆ) since_time:1654034400 until_time:1656626399
Command [c5.20]#
source:twitter_web_app since_time:1654038000 until_time:1659308400 @POTUS -from:POTUS -filter:media
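
The same queries can also be employed from Python through snscrape's module interface - the approach adopted by script [s5.07] below. A minimal sketch, reusing a shortened version of the query from [c5.20] and printing the first five results:

import snscrape.modules.twitter as sntwitter

# The query uses the same operators available on the command line
query = "@POTUS -from:POTUS -filter:media"
# Iterate over the results returned by the search scraper, stopping after five tweets
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 5:
        break
    print(tweet.date, tweet.user.username, tweet.content)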

Collecting tweets from a list of search terms [CATLISM, 193-196]#

Install the required modules#

Command [c5.21]#
pip install pandas

Script to read the list of search terms and download tweets#

Script [s5.07] ⚠#
# Adapted from the following sources
# https://github.com/MartinBeckUT/TwitterScraper/blob/127b15b3878ab0c1b74438011056d89152701db1/snscrape/python-wrapper/snscrape-python-wrapper.py
# https://github.com/satomlins/snscrape-by-location/blob/1f605fb6e1caff3577198792a7717ffbf3c3f454/snscrape_by_location_tutorial.ipynb
# Import the modules for: using snscrape; working with dataframes; employing command-line arguments; using regular expressions

import snscrape.modules.twitter as sntwitter
import pandas as pd
import argparse
import re

# Add the ability to specify arguments for the script
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# Construct the first argument as the name of the file containing the search terms
parser.add_argument("searchlist")
# Construct the optional second argument (following the label --max) as the maximum number of tweets to be collected
# for each search term
parser.add_argument(
    "--max",
    dest="maxResults",
    type=lambda x: int(x)
    if int(x) >= 0
    else parser.error("--max-results N must be zero or positive"),
    metavar="N",
    help="Only return the first N results",
)
# Read the provided arguments
args = parser.parse_args()
# Read the first argument as the name of the file containing the search terms
search_list = args.searchlist
# Read the second (optional) argument as the maximum number of tweets to be collected for each search term;
# it is None when --max is not supplied
maxResults = args.maxResults

# Open the search terms list
with open(search_list, "r", encoding="utf-8") as f:
    # For every search term, do:
    for word in f:
        # Eliminate the 'newline' character (\n)
        word = word.rstrip("\n")
        # Skip empty lines
        if not word:
            continue
        # Clean the search term of characters that are invalid in filenames using a regular expression;
        # this is used for the output file only. From https://stackoverflow.com/a/71199182
        clean_word = re.sub(r"[/\\?%*:|\"<>\x7F\x00-\x1F]", "-", word)
        # Create an empty list that will contain the collected tweets
        tweets_list = []
        # Collect tweets for the search term, taking into account the maximum number of tweets (if supplied)
        for i, tweet in enumerate(sntwitter.TwitterSearchScraper(word).get_items()):
            # Stop the collection of tweets once the maximum number supplied is reached
            if maxResults is not None and i >= maxResults:
                break
            # Add all the data-points collected for the tweet to the previously created list
            tweets_list.append(
                [
                    tweet.date,
                    tweet.id,
                    tweet.content,
                    tweet.url,
                    tweet.user.username,
                    tweet.user.followersCount,
                    tweet.replyCount,
                    tweet.retweetCount,
                    tweet.likeCount,
                    tweet.quoteCount,
                    tweet.lang,
                    tweet.outlinks,
                    tweet.media,
                    tweet.retweetedTweet,
                    tweet.quotedTweet,
                    tweet.inReplyToTweetId,
                    tweet.inReplyToUser,
                    tweet.mentionedUsers,
                    tweet.coordinates,
                    tweet.place,
                    tweet.hashtags,
                    tweet.cashtags,
                ]
            )
        # Create a dataframe from the tweets list above, naming the columns with the provided human-readable labels
        tweets_df = pd.DataFrame(
            tweets_list,
            columns=[
                "Datetime",
                "Tweet Id",
                "Text",
                "URL",
                "Username",
                "N. followers",
                "N. replies",
                "N. retweets",
                "N. likes",
                "N. quote",
                "Language",
                "Outlink",
                "Media",
                "Retweeted tweet",
                "Quoted tweet",
                "In reply to tweet ID",
                "In reply to user",
                "Mentioned users",
                "Geo coordinates",
                "Place",
                "Hashtags",
                "Cashtags",
            ],
        )
        # Export the dataframe to a tab-delimited CSV whose filename is the search term cleaned of
        # characters that cannot be included in filenames
        tweets_df.to_csv(clean_word + ".csv", sep="\t", index=False, encoding="utf-8")

Using script [s5.07]#

From the terminal, once inside the folder where script snscrape_from_list.py has been saved and where the SEARCH_LIST.txt file resides, run command [c5.22] after changing --max N to the desired value (e.g. --max 10); the --max option can also be omitted to collect all available tweets for each search term.

Command [c5.22] ⚠#
python snscrape_from_list.py SEARCH_LIST.txt --max N
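
The SEARCH_LIST.txt file is expected to contain one search term - or full query, using the operators seen in commands [c5.16]-[c5.20] - per line; a hypothetical example:

("technology" OR "physics") ("space" OR "rockets")
@POTUS -from:POTUS -filter:media

Each line produces one tab-delimited CSV file, named after the query it contains (cleaned of characters that are invalid in filenames).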

Example of filename saved by script [s5.07]#

Example [e5.09]#
(๐Ÿด OR ๐ŸŒˆ) since_time-1654034400 until_time-1656626399.csv

Extracting the data [CATLISM, 204-206]#

Script [s5.08] ⚠#
# Import the required modules to read/write jsonl and xml files
import jsonlines
from lxml import etree

# Create the root element of the XML structure - named 'corpus' -, which will contain all the extracted tweets
# as elements defined by the 'text' tag (one tweet = one text)
corpus = etree.Element("corpus")

# Open the jsonl file
tweets = jsonlines.open("snscrape_output.jsonl")
# Read it line by line - i.e. one tweet at a time - and for each line do:
for obj in tweets:
    # Read the selected data points and store each one in a separate variable
    tweet_id = obj["id"]
    tweet_date = obj["date"]
    tweet_username = obj["user"]["username"]
    tweet_user_realname = obj["user"]["displayname"]
    tweet_content = obj["content"]
    # The following variables contain the value 0 if no urls are included or the tweet is not a retweet,
    # and the value 1 when links are present or the tweet is a retweet
    tweet_urls_present = 0 if obj["outlinks"] is None else 1
    tweet_isretweet = 0 if obj["retweetedTweet"] is None else 1
    # The extracted values are assigned to the attributes of a tag labelled 'text' - contained inside of the main
    # <corpus> element tag - separating one tweet from another. The actual content of the tweet is
    # then enclosed inside of the <text> element tag using the notation '.text'
    etree.SubElement(
        corpus,
        "text",
        id=str(tweet_id),
        date=str(tweet_date),
        username=str(tweet_username),
        user_realname=str(tweet_user_realname),
        urls_present=str(tweet_urls_present),
        isretweet=str(tweet_isretweet),
    ).text = str(tweet_content)
# The XML structure is created by adding all the extracted elements to the main 'corpus' tag
tree = etree.ElementTree(corpus)
# The resulting XML structure is written to the output.xml file using utf-8 encoding, adding the XML declaration
# at the beginning and graphically formatting the layout ('pretty_print')
tree.write("output.xml", pretty_print=True, xml_declaration=True, encoding="utf-8")

How to use script [s5.08]#

Copy/download the file s5.08_extract_snscrape_jsonl.py into the folder containing the output file snscrape_output.jsonl downloaded through snscrape (e.g. through [c5.14]); then navigate to that folder in the terminal, e.g.

cd Downloads/twitter_data/

Finally, run the script from the terminal:

python s5.08_extract_snscrape_jsonl.py

Example of data extracted with [s5.08] [CATLISM, 204]#

Example [e5.10]#
<?xml version='1.0' encoding='UTF-8'?>
<corpus>
  <text id="ID_NUMBER" date="YYYY-MM-DDThh:mm:ssTZD" username="USERNAME" user_realname="USER_REAL_NAME" urls_present="0_OR_1" isretweet="0_OR_1">TWEET TEXT</text>
</corpus>
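
To verify the resulting file, a minimal sketch - assuming the output.xml produced by [s5.08] resides in the current folder - that reloads the corpus with lxml and prints a summary of its contents:

from lxml import etree

# Parse the XML corpus and collect all <text> elements (one per tweet)
tree = etree.parse("output.xml")
texts = tree.findall(".//text")
print(len(texts), "tweets in the corpus")
# Print id, username, and the first 40 characters of the first three tweets
for t in texts[:3]:
    print(t.get("id"), t.get("username"), (t.text or "")[:40])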