Twitter
Warning

As of late June 2023, a number of changes have made anonymous access to tweets impossible, consequently rendering tools such as snscrape and other 'API-less' ones useless. Twitter will apparently reinstate anonymous access in the future (see this discussion from the snscrape issue pages), but as of 10th September 2023 a number of commands and scripts included in this page (marked as such) are not working. They are included to support the contents of the book, and new (working) ones will be added if anything changes.
Data can be collected from Twitter using snscrape. Options and arguments for the tool can be found in the official documentation.
Installing the tool (CATLISM, 183)
pip install snscrape
[c5.11]
"CATLISM, 183; 191-192
Using the tool2"CATLISM, 183; 191-192
#
snscrape [GLOBAL-OPTIONS] SCRAPER-NAME [SCRAPER-OPTIONS] [SCRAPER-ARGUMENTS ... ]
snscrape SCRAPER-NAME --help
snscrape -v --progress --jsonl twitter-search '[QUERY]' > OUTPUT.jsonl
snscrape -v --progress --jsonl twitter-search '("technology" OR "physics") ("space" OR "rockets")' > OUTPUT.jsonl
("technology" OR "physics") ("space" OR "rockets")
(techology OR physics) (space OR rokets)
("technology" OR "physics") -("games" OR "console")
(๐ด OR ๐) since_time:1654034400 until_time:1656626399
source:twitter_web_app since_time:1654038000 until_time:1659308400 @POTUS -from:POTUS -filter:media
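The since_time: and until_time: operators take Unix epoch timestamps. As a minimal sketch (assuming the first timeframe above is meant to span June 2022 in a UTC+2 timezone), the values can be computed in Python:

import datetime as dt

# Timezone assumed for the timeframe (UTC+2)
tz = dt.timezone(dt.timedelta(hours=2))
# 1 June 2022, 00:00:00
since = int(dt.datetime(2022, 6, 1, 0, 0, 0, tzinfo=tz).timestamp())
# 30 June 2022, 23:59:59
until = int(dt.datetime(2022, 6, 30, 23, 59, 59, tzinfo=tz).timestamp())
print(f"since_time:{since} until_time:{until}")
# Prints: since_time:1654034400 until_time:1656626399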
Collecting tweets from a list of search terms (CATLISM, 193-196)
Install the required modules
pip install pandas
[c5.21]
Script to read the list of search terms and download tweets
# Adapted from the following sources
# https://github.com/MartinBeckUT/TwitterScraper/blob/127b15b3878ab0c1b74438011056d89152701db1/snscrape/python-wrapper/snscrape-python-wrapper.py
# https://github.com/satomlins/snscrape-by-location/blob/1f605fb6e1caff3577198792a7717ffbf3c3f454/snscrape_by_location_tutorial.ipynb
# Import the modules for: using snscrape; working with dataframes; employing command-line arguments; using regular expressions

import snscrape.modules.twitter as sntwitter
import pandas as pd
import argparse
import re

# Add the ability to specify arguments for the script
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# Construct the first argument as the name of the file containing the search terms
parser.add_argument("searchlist")
# Construct the optional second argument (following the label --max) as the maximum number of tweets
# to be collected for each search term
parser.add_argument(
    "--max",
    dest="maxResults",
    type=lambda x: int(x)
    if int(x) >= 0
    else parser.error("--max N must be zero or positive"),
    metavar="N",
    help="Only return the first N results",
)
# Read the provided arguments
args = parser.parse_args()
# Read the first argument as the name of the file containing the search terms
search_list = args.searchlist
# Read the second (optional) argument as the maximum number of tweets to be collected for each search term
maxResults = args.maxResults

# Open the search terms list
with open(search_list, "r", encoding="utf-8") as f:
    # For every search term do:
    for word in f:
        # Eliminate the 'newline' character (\n)
        word = word.rstrip("\n")
        # Clean the search term of characters that are invalid in filenames using a regular expression;
        # this is used for the output filename only. From https://stackoverflow.com/a/71199182
        clean_word = re.sub(r"[/\\?%*:|\"<>\x7F\x00-\x1F]", "-", word)
        # Create an empty list that will contain the collected tweets
        tweets_list = []
        # Collect tweets for the search term, taking into account the maximum number of tweets (if supplied)
        for i, tweet in enumerate(sntwitter.TwitterSearchScraper(word).get_items()):
            # Stop the collection of tweets once the maximum number supplied is reached
            # (if no --max was supplied, collect every available tweet)
            if maxResults is not None and i >= maxResults:
                break
            # Add all the data points collected for the tweet to the previously created list
            tweets_list.append(
                [
                    tweet.date,
                    tweet.id,
                    tweet.content,
                    tweet.url,
                    tweet.user.username,
                    tweet.user.followersCount,
                    tweet.replyCount,
                    tweet.retweetCount,
                    tweet.likeCount,
                    tweet.quoteCount,
                    tweet.lang,
                    tweet.outlinks,
                    tweet.media,
                    tweet.retweetedTweet,
                    tweet.quotedTweet,
                    tweet.inReplyToTweetId,
                    tweet.inReplyToUser,
                    tweet.mentionedUsers,
                    tweet.coordinates,
                    tweet.place,
                    tweet.hashtags,
                    tweet.cashtags,
                ]
            )
        # Create a dataframe from the tweets list above, naming the columns with the provided human-readable labels
        tweets_df = pd.DataFrame(
            tweets_list,
            columns=[
                "Datetime",
                "Tweet Id",
                "Text",
                "URL",
                "Username",
                "N. followers",
                "N. replies",
                "N. retweets",
                "N. likes",
                "N. quote",
                "Language",
                "Outlink",
                "Media",
                "Retweeted tweet",
                "Quoted tweet",
                "In reply to tweet ID",
                "In reply to user",
                "Mentioned users",
                "Geo coordinates",
                "Place",
                "Hashtags",
                "Cashtags",
            ],
        )
        # Export the dataframe to a tab-delimited CSV whose filename is the search term cleaned of
        # characters that cannot be included in filenames
        tweets_df.to_csv(clean_word + ".csv", sep="\t", index=False, encoding="utf-8")
Using script [s5.07]
From the terminal, once inside the folder where the script snscrape_from_list.py has been saved and where the SEARCH_LIST.txt file resides, run command [c5.22] after changing --max N to the desired value (e.g. --max 10):
python snscrape_from_list.py SEARCH_LIST.txt --max N
Example of filename saved by script [s5.07]

(EMOJI_1 OR EMOJI_2) since_time-1654034400 until_time-1656626399.csv

Note that the colons in since_time: and until_time: have been replaced by hyphens, since colons are not valid in filenames (EMOJI_1 and EMOJI_2 again stand in for two emoji characters).
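Each search term in SEARCH_LIST.txt produces one such tab-delimited file. As a minimal sketch, one of the resulting files can be loaded back into pandas for inspection (the filename technology.csv is hypothetical):

import pandas as pd

# Load one of the tab-delimited files produced by script [s5.07] (hypothetical filename)
df = pd.read_csv("technology.csv", sep="\t", encoding="utf-8")
# Preview a few of the columns created by the script
print(df[["Datetime", "Username", "Text"]].head())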
Extracting the data (CATLISM, 204-206)
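Script [s5.08] below relies on the jsonlines and lxml modules; if they are not already installed, they can be added with:

pip install jsonlines lxml

Script to extract the data from the snscrape jsonl output: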
# Import the required modules to read/write jsonl and xml files
import jsonlines
from lxml import etree

# Create the root element of the XML structure - named 'corpus' - which will contain all the extracted tweets
# as elements defined by the 'text' tag (one tweet = one text)
corpus = etree.Element("corpus")

# Open the jsonl file
tweets = jsonlines.open("snscrape_output.jsonl")
# Read it line by line - i.e. one tweet at a time - and for each line do:
for obj in tweets:
    # Read the selected data points and store each one in a separate variable
    tweet_id = obj["id"]
    tweet_date = obj["date"]
    tweet_username = obj["user"]["username"]
    tweet_user_realname = obj["user"]["displayname"]
    tweet_content = obj["content"]
    # The following variables contain the value 0 if no urls are included or the tweet is not a retweet,
    # and the value 1 when links are present or the tweet is a retweet
    tweet_urls_present = 0 if obj["outlinks"] is None else 1
    tweet_isretweet = 0 if obj["retweetedTweet"] is None else 1
    # The extracted values are assigned to the attributes of a tag labelled 'text' - contained inside the main
    # <corpus> element tag - separating one tweet from another. The actual content of the tweet is
    # then enclosed inside the <text> element tag using the notation '.text'
    etree.SubElement(
        corpus,
        "text",
        id=str(tweet_id),
        date=str(tweet_date),
        username=str(tweet_username),
        user_realname=str(tweet_user_realname),
        urls_present=str(tweet_urls_present),
        isretweet=str(tweet_isretweet),
    ).text = str(tweet_content)
# The XML structure is created by adding all the extracted elements to the main 'corpus' tag
tree = etree.ElementTree(corpus)
# The resulting XML structure is written to the output.xml file using utf-8 encoding, adding the XML declaration
# at the beginning and graphically formatting the layout ('pretty_print')
tree.write("output.xml", pretty_print=True, xml_declaration=True, encoding="utf-8")
How to use script [s5.08]
Copy/download the file s5.08_extract_snscrape_jsonl.py into the folder where the output file snscrape_output.jsonl downloaded through snscrape (e.g. through command [c5.14]) resides; then navigate to the folder through the terminal, e.g.
cd Downloads/twitter_data/
Finally, run the script from the terminal:
python s5.08_extract_snscrape_jsonl.py
Example of data extracted with [s5.08] (CATLISM, 204)
<?xml version='1.0' encoding='UTF-8'?>
<corpus>
  <text id="ID_NUMBER" date="YYYY-MM-DDThh:mm:ssTZD" username="USERNAME" user_realname="USER_REAL_NAME" urls_present="0_OR_1" isretweet="0_OR_1">TWEET TEXT</text>
</corpus>
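As a minimal sketch, the resulting file can be checked by counting the extracted <text> elements with lxml (output.xml is the file written by script [s5.08]):

from lxml import etree

# Parse the XML corpus produced by script [s5.08]
tree = etree.parse("output.xml")
# Collect all <text> elements (one per tweet)
texts = tree.findall(".//text")
print(len(texts), "tweets in the corpus")
# Show the username and content of the first tweet, if any
if texts:
    print(texts[0].get("username"), texts[0].text)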