Twitter
Warning

As of late June 2023, a number of changes have made anonymous access to tweets impossible, consequently rendering tools such as snscrape and other 'API-less' ones useless. Twitter will apparently reinstate anonymous access in the future (see this discussion from the snscrape issue pages), but as of 10th September 2023 a number of commands and scripts included in this page (marked as such) are not working. They are included to support the contents of the book, and new (working) ones will be added if anything changes.
Data can be collected from Twitter using snscrape. Options and arguments for the tool can be found in the official documentation.
Installing the tool (CATLISM, 183)
pip install snscrape
[c5.11]
"CATLISM, 183; 191-192
Using the tool2"CATLISM, 183; 191-192
#
snscrape [GLOBAL-OPTIONS] SCRAPER-NAME [SCRAPER-OPTIONS] [SCRAPER-ARGUMENTS ... ]
snscrape SCRAPER-NAME --help
snscrape -v --progress --jsonl twitter-search '[QUERY]' > OUTPUT.jsonl
snscrape -v --progress --jsonl twitter-search '("technology" OR "physics") ("space" OR "rockets")' > OUTPUT.jsonl
("technology" OR "physics") ("space" OR "rockets")
(techology OR physics) (space OR rokets)
("technology" OR "physics") -("games" OR "console")
(๐ด OR ๐) since_time:1654034400 until_time:1656626399
source:twitter_web_app since_time:1654038000 until_time:1659308400 @POTUS -from:POTUS -filter:media
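The since_time: and until_time: operators take Unix epoch timestamps. As a minimal sketch (assuming the first timeframe above is meant to span June 2022 in a UTC+2 timezone), the values can be computed in Python:

import datetime as dt

# Timezone assumed for the timeframe (UTC+2)
tz = dt.timezone(dt.timedelta(hours=2))
# 1 June 2022, 00:00:00
since = int(dt.datetime(2022, 6, 1, 0, 0, 0, tzinfo=tz).timestamp())
# 30 June 2022, 23:59:59
until = int(dt.datetime(2022, 6, 30, 23, 59, 59, tzinfo=tz).timestamp())
print(f"since_time:{since} until_time:{until}")
# Prints: since_time:1654034400 until_time:1656626399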
Collecting tweets from a list of search terms (CATLISM, 193-196)
Install the required modules
pip install pandas
[c5.21]
Script to read the list of search terms and download tweets
# Adapted from the following sources
# https://github.com/MartinBeckUT/TwitterScraper/blob/127b15b3878ab0c1b74438011056d89152701db1/snscrape/python-wrapper/snscrape-python-wrapper.py
# https://github.com/satomlins/snscrape-by-location/blob/1f605fb6e1caff3577198792a7717ffbf3c3f454/snscrape_by_location_tutorial.ipynb
# Import the modules for: using snscrape; working with dataframes; employing command-line arguments; using regular expressions

import snscrape.modules.twitter as sntwitter
import pandas as pd
import argparse
import re

# Add the ability to specify arguments for the script
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# Construct the first argument as the name of the file containing the search terms
parser.add_argument("searchlist")
# Construct the optional second argument (following the label --max) as the maximum number of tweets
# to be collected for each search term
parser.add_argument(
    "--max",
    dest="maxResults",
    type=lambda x: int(x)
    if int(x) >= 0
    else parser.error("--max N must be zero or positive"),
    metavar="N",
    help="Only return the first N results",
)
# Read the provided arguments
args = parser.parse_args()
# Read the first argument as the name of the file containing the search terms
search_list = args.searchlist
# Read the second (optional) argument as the maximum number of tweets to be collected for each search term
maxResults = args.maxResults

# Open the search terms list
with open(search_list, "r", encoding="utf-8") as f:
    # For every search term do:
    for word in f:
        # Eliminate the 'newline' character (\n)
        word = word.rstrip("\n")
        # Clean the search term of characters that are invalid in filenames using a regular expression;
        # this is used for the output filename only. From https://stackoverflow.com/a/71199182
        clean_word = re.sub(r"[/\\?%*:|\"<>\x7F\x00-\x1F]", "-", word)
        # Create an empty list that will contain the collected tweets
        tweets_list = []
        # Collect tweets for the search term, taking into account the maximum number of tweets (if supplied)
        for i, tweet in enumerate(sntwitter.TwitterSearchScraper(word).get_items()):
            # Stop the collection of tweets once the maximum number supplied is reached
            # (if no --max was supplied, collect every available tweet)
            if maxResults is not None and i >= maxResults:
                break
            # Add all the data points collected for the tweet to the previously created list
            tweets_list.append(
                [
                    tweet.date,
                    tweet.id,
                    tweet.content,
                    tweet.url,
                    tweet.user.username,
                    tweet.user.followersCount,
                    tweet.replyCount,
                    tweet.retweetCount,
                    tweet.likeCount,
                    tweet.quoteCount,
                    tweet.lang,
                    tweet.outlinks,
                    tweet.media,
                    tweet.retweetedTweet,
                    tweet.quotedTweet,
                    tweet.inReplyToTweetId,
                    tweet.inReplyToUser,
                    tweet.mentionedUsers,
                    tweet.coordinates,
                    tweet.place,
                    tweet.hashtags,
                    tweet.cashtags,
                ]
            )
        # Create a dataframe from the tweets list above, naming the columns with the provided human-readable labels
        tweets_df = pd.DataFrame(
            tweets_list,
            columns=[
                "Datetime",
                "Tweet Id",
                "Text",
                "URL",
                "Username",
                "N. followers",
                "N. replies",
                "N. retweets",
                "N. likes",
                "N. quote",
                "Language",
                "Outlink",
                "Media",
                "Retweeted tweet",
                "Quoted tweet",
                "In reply to tweet ID",
                "In reply to user",
                "Mentioned users",
                "Geo coordinates",
                "Place",
                "Hashtags",
                "Cashtags",
            ],
        )
        # Export the dataframe to a tab-delimited CSV whose filename is the search term cleaned of
        # characters that cannot be included in filenames
        tweets_df.to_csv(clean_word + ".csv", sep="\t", index=False, encoding="utf-8")
Using script [s5.07]
From the terminal, once inside the folder where the script snscrape_from_list.py has been saved and where the SEARCH_LIST.txt file resides, run command [c5.22] after changing --max N to the desired value (e.g. --max 10):
python snscrape_from_list.py SEARCH_LIST.txt --max N
Example of filename saved by script [s5.07]

(EMOJI_1 OR EMOJI_2) since_time-1654034400 until_time-1656626399.csv

Note that the colons in since_time: and until_time: have been replaced by hyphens, since colons are not valid in filenames (EMOJI_1 and EMOJI_2 again stand in for two emoji characters).
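Each search term in SEARCH_LIST.txt produces one such tab-delimited file. As a minimal sketch, one of the resulting files can be loaded back into pandas for inspection (the filename technology.csv is hypothetical):

import pandas as pd

# Load one of the tab-delimited files produced by script [s5.07] (hypothetical filename)
df = pd.read_csv("technology.csv", sep="\t", encoding="utf-8")
# Preview a few of the columns created by the script
print(df[["Datetime", "Username", "Text"]].head())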
Extracting the data (CATLISM, 204-206)
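Script [s5.08] below relies on the jsonlines and lxml modules; if they are not already installed, they can be added with:

pip install jsonlines lxml

Script to extract the data from the snscrape jsonl output: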
# Import the required modules to read/write jsonl and xml files
import jsonlines
from lxml import etree

# Create the root element of the XML structure - named 'corpus' - which will contain all the extracted tweets
# as elements defined by the 'text' tag (one tweet = one text)
corpus = etree.Element("corpus")

# Open the jsonl file
tweets = jsonlines.open("snscrape_output.jsonl")
# Read it line by line - i.e. one tweet at a time - and for each line do:
for obj in tweets:
    # Read the selected data points and store each one in a separate variable
    tweet_id = obj["id"]
    tweet_date = obj["date"]
    tweet_username = obj["user"]["username"]
    tweet_user_realname = obj["user"]["displayname"]
    tweet_content = obj["content"]
    # The following variables contain the value 0 if no urls are included or the tweet is not a retweet,
    # and the value 1 when links are present or the tweet is a retweet
    tweet_urls_present = 0 if obj["outlinks"] is None else 1
    tweet_isretweet = 0 if obj["retweetedTweet"] is None else 1
    # The extracted values are assigned to the attributes of a tag labelled 'text' - contained inside the main
    # <corpus> element tag - separating one tweet from another. The actual content of the tweet is
    # then enclosed inside the <text> element tag using the notation '.text'
    etree.SubElement(
        corpus,
        "text",
        id=str(tweet_id),
        date=str(tweet_date),
        username=str(tweet_username),
        user_realname=str(tweet_user_realname),
        urls_present=str(tweet_urls_present),
        isretweet=str(tweet_isretweet),
    ).text = str(tweet_content)
# The XML structure is created by adding all the extracted elements to the main 'corpus' tag
tree = etree.ElementTree(corpus)
# The resulting XML structure is written to the output.xml file using utf-8 encoding, adding the XML declaration
# at the beginning and graphically formatting the layout ('pretty_print')
tree.write("output.xml", pretty_print=True, xml_declaration=True, encoding="utf-8")
How to use script [s5.08]
Copy/download the file s5.08_extract_snscrape_jsonl.py into the folder where the output file snscrape_output.jsonl downloaded through snscrape (e.g. through command [c5.14]) resides; then navigate to the folder through the terminal, e.g.
cd Downloads/twitter_data/
Finally, run the script from the terminal:
python s5.08_extract_snscrape_jsonl.py
Example of data extracted with [s5.08] (CATLISM, 204)
<?xml version='1.0' encoding='UTF-8'?>
<corpus>
  <text id="ID_NUMBER" date="YYYY-MM-DDThh:mm:ssTZD" username="USERNAME" user_realname="USER_REAL_NAME" urls_present="0_OR_1" isretweet="0_OR_1">TWEET TEXT</text>
</corpus>
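As a minimal sketch, the resulting file can be checked by counting the extracted <text> elements with lxml (output.xml is the file written by script [s5.08]):

from lxml import etree

# Parse the XML corpus produced by script [s5.08]
tree = etree.parse("output.xml")
# Collect all <text> elements (one per tweet)
texts = tree.findall(".//text")
print(len(texts), "tweets in the corpus")
# Show the username and content of the first tweet, if any
if texts:
    print(texts[0].get("username"), texts[0].text)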