Instagram#
Data from Instagram can be collected using instaloader.
Options and arguments for the tool can be found in the official documentation.
Installing the tool (CATLISM, 206)#
pip install instaloader
[c5.23]
Using the tool (CATLISM, 206; 208)#
instaloader --login [account_collecting] [target] [options]
instaloader --login [ACCOUNT_COLLECTING] --comments --geotags profile mitpress
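The same collection can also be scripted through instaloader's Python module instead of the CLI. The snippet below is a minimal sketch, not taken from CATLISM: the option names mirror the --comments and --geotags flags used above, ACCOUNT_COLLECTING and mitpress are the same placeholders, and the exact parameters should be checked against the official instaloader documentation.
import instaloader

# Enable the equivalents of the --comments and --geotags CLI options (assumed parameter names)
L = instaloader.Instaloader(download_comments=True, download_geotags=True)
# Log in interactively: the password for the collecting account is requested in the terminal
L.interactive_login("ACCOUNT_COLLECTING")
# Download the posts (and their metadata) of the target profile
L.download_profile("mitpress", profile_pic=False)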
Extracting the data#
Extracting and merging data from posts and comments (CATLISM, 220-226)#
# Import modules for: listing local files using wildcard patterns; reading JSON files; working with .xz compressed files;
# working on local folders and files; using regular expressions; working with XML files
import glob
import json
import lzma
import os
import re
from lxml import etree

# Create and compile a regular expression to capture the timestamp included in the filenames downloaded by instaloader
dates_filter = re.compile(
    "([0-9]{4}-[0-9]{2}-[0-9]{2}_[0-9]{2}-[0-9]{2}-[0-9]{2}_UTC).*", re.UNICODE
)

# Create an empty list to store all the timestamps retrieved from filenames
dates = []
# List all the files in the current folder (the one where the script resides)
files = glob.glob("*.*")

# For every single file found:
for single_file in files:
    # Use the 'dates_filter' regex to find the date in the filename, and store it in the variable 'found_date'
    found_date = re.search(dates_filter, single_file)
    # If a date is found and is not already included in the list 'dates', add it; otherwise, proceed to the next file
    if found_date is not None and found_date[1] not in dates:
        dates.append(found_date[1])

# For every date in the list of dates, do:
for date in dates:
    # Create the root element tag <text> to include all the contents relative to the date (i.e. the post and its relative comments)
    text_tag = etree.Element("text")

    # Build the filename of the compressed JSON containing the post contents and metadata, and store it in a variable
    archive_filename = date + ".json.xz"
    # Check if the file exists on disk; if not, skip this date and start again from the beginning
    if not os.path.isfile(archive_filename):
        print("File " + archive_filename + " not found, skipping...")
        continue

    # Create the <item> element tag to store the contents of the post
    item_tag = etree.SubElement(text_tag, "item")
    # Open the compressed JSON file and do:
    with lzma.open(archive_filename) as f:
        # Read its contents and store them into a variable
        contents = f.read()
        # Decode the contents to UTF-8
        contents = contents.decode("utf-8")
        # Load the decoded contents as a JSON file
        data = json.loads(contents)

    # Assign the main JSON data-point to the variable 'node', to avoid repeating a longer string throughout the code
    node = data["node"]
    # Extract a number of values from JSON data-points, and assign each one of them to a separate attribute of <item>
    item_tag.attrib["id"] = str(node["shortcode"])
    item_tag.attrib["type"] = "post"
    item_tag.attrib["created"] = str(node["taken_at_timestamp"])
    item_tag.attrib["username"] = str(node["owner"]["username"])
    item_tag.attrib["comments"] = str(node["edge_media_to_comment"]["count"])
    # The following data-points are checked: if they do not exist, values of 'none' and 'na' are assigned
    # to the two attributes respectively
    item_tag.attrib["location"] = str(
        node["location"]["slug"] if node["location"] is not None else "none"
    )
    item_tag.attrib["likes"] = str(
        data["node"]["edge_media_preview_like"]["count"]
        if "edge_media_preview_like" in data["node"]
        else "na"
    )
    # Try to extract the textual content of the post: if it exists, extract it; if not, assign an empty string to the variable
    # that stores the caption
    try:
        text_post_caption = str(
            node["edge_media_to_caption"]["edges"][0]["node"]["text"]
        )
    except IndexError:
        text_post_caption = ""
    # Enclose the textual content of the post inside <item>
    item_tag.text = text_post_caption

    # Check if data-point 'edge_sidecar_to_children' exists (i.e. if the post contains multiple multimedia files)
    if "edge_sidecar_to_children" in node:
        # For each object (i.e. multimedia file) found, start a counter to assign a progressive number (starting from 1)
        # to each one of them, and then do:
        for media_num, media in enumerate(
            node["edge_sidecar_to_children"]["edges"], start=1
        ):
            # Extract a number of values from JSON data-points, and assign each one of them to a separate variable
            media_shortcode = str(media["node"].get("shortcode", "na"))
            # Check if the data-point 'is_video' is true: if so assign the value 'video' to 'media_type';
            # otherwise assign it the value 'image'
            media_type = "video" if media["node"]["is_video"] else "image"
            # Check if the data-point 'is_video' is true: if so, build the name of the media file using the extension '.mp4';
            # if not, use the extension '.jpg'
            media_name = (
                str(date + "_" + str(media_num) + ".mp4")
                if media["node"]["is_video"]
                else str(date + "_" + str(media_num) + ".jpg")
            )
            # Check if the following data-points exist: if they do, extract their values and assign them to two separate variables;
            # if not, assign the value 'na' to the variable
            media_accessibility_caption = (
                str(media["node"]["accessibility_caption"])
                if "accessibility_caption" in media["node"]
                else "na"
            )
            media_views = (
                str(media["node"]["video_view_count"])
                if media["node"]["is_video"]
                else "na"
            )
            # Create a <media> element tag inside of <item>, and assign it all the previously extracted elements as
            # values to its attributes
            etree.SubElement(
                item_tag,
                "media",
                mediafile=media_name,
                mediatype=media_type,
                mediadescr=media_accessibility_caption,
                media_shortcode=media_shortcode,
                media_views=media_views,
            )
    # Otherwise, if data-point 'edge_sidecar_to_children' does not exist (i.e. if the post contains one single multimedia file)
    else:
        # Extract a number of values from JSON data-points, and assign each one of them to a separate variable - using
        # the same criteria adopted for the ones extracted from 'edge_sidecar_to_children'
        media_shortcode = str(node["shortcode"])
        media_type = "video" if node["is_video"] else "image"
        media_name = str(date + ".mp4") if node["is_video"] else str(date + ".jpg")
        media_accessibility_caption = (
            str(node["accessibility_caption"])
            if "accessibility_caption" in node
            else "na"
        )
        media_views = str(node["video_view_count"]) if node["is_video"] else "na"

        etree.SubElement(
            item_tag,
            "media",
            mediafile=media_name,
            mediatype=media_type,
            mediadescr=media_accessibility_caption,
            media_shortcode=media_shortcode,
            media_views=media_views,
        )

    # Build the filename for the comments file
    comments_filename = str(date + "_comments.json")
    # Check if the comments file exists, and if so do:
    if os.path.isfile(comments_filename):
        # Open the comments file and do:
        with open(comments_filename, encoding="utf-8") as f:
            # Read its contents as JSON and store them into a variable
            comments = json.loads(f.read())
        # For each comment in the contents do:
        for comment in comments:
            # Create an <item> element tag
            item_tag = etree.SubElement(text_tag, "item")
            # Extract a number of values from JSON data-points, and assign each one of them to a separate attribute of <item>
            item_tag.attrib["id"] = str(comment["id"])
            item_tag.attrib["type"] = "comment"
            item_tag.attrib["created"] = str(comment["created_at"])
            item_tag.attrib["username"] = comment["owner"]["username"]
            # The location is not present in the data-points for a comment; however, to keep the structure consistent with
            # the <item> element tag used for a post (for which a location may be present), the attribute is added with value 'na'
            item_tag.attrib["location"] = "na"
            item_tag.attrib["likes"] = str(comment["likes_count"])
            item_tag.attrib["comments"] = str(
                len(comment["answers"]) if comment["answers"] is not None else "na"
            )
            item_tag.text = comment["text"]

    # Wrap the extracted data, formatted in XML, into the final XML structure
    tree = etree.ElementTree(text_tag)
    # Write the resulting XML structure to the output file, using utf-8 encoding, adding the XML declaration
    # at the start of the file and graphically formatting the layout ('pretty_print')
    tree.write(date + ".xml", pretty_print=True, xml_declaration=True, encoding="utf-8")
How to use script [s5.09]#
Copy/download the file s5.09_extract_instaloader_json.py into the folder where the data downloaded through instaloader (e.g. through [c5.25]) resides; then browse to that folder through the terminal, e.g.
cd Downloads/instagram_data/
Finally, run the script from the terminal:
python s5.09_extract_instaloader_json.py
Example of data extracted with [s5.09] (CATLISM, 220)#
<?xml version='1.0' encoding='UTF-8'?>
<text>
  <item id="UNIQUE_ID" type="post" created="UNIX_TIMESTAMP" username="USERNAME" location="LOCATION" likes="NUMBER" comments="NUMBER">
    POST TEXTUAL CONTENT
    <media mediafile="MEDIA_FILENAME.mp4" mediatype="VIDEO_OR_IMAGE" mediadescr="MEDIA_ACCESSIBILITY_CAPTION" media_shortcode="SHORTCODE" media_views="NUMBER" />
  </item>
  <item id="UNIQUE_ID" type="comment" created="UNIX_TIMESTAMP" username="USERNAME" location="LOCATION" likes="NUMBER" comments="NUMBER">COMMENT TEXTUAL CONTENT</item>
</text>
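The XML files produced by [s5.09] can then be read back for analysis. The snippet below is a minimal sketch, not part of CATLISM; the filename 2021-01-01_12-00-00_UTC.xml is a hypothetical example of one of the output files.
from lxml import etree

# Hypothetical example filename; replace it with one of the XML files produced by s5.09
tree = etree.parse("2021-01-01_12-00-00_UTC.xml")

# Select the <item> elements according to their 'type' attribute
posts = tree.xpath("//item[@type='post']")
comments = tree.xpath("//item[@type='comment']")
print("Posts:", len(posts), "- Comments:", len(comments))

# Print username, timestamp and textual content of each comment
for item in comments:
    print(item.get("username"), item.get("created"), (item.text or "").strip())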
Fix login error using a 'session file' for the --login option#
As per the official documentation, when using the --login option in interactive mode (i.e. entering username and password in the CLI), instaloader may fail with a login error. To solve this error it is oftentimes sufficient to use a 'session file', i.e. the Instagram cookies generated by a web browser when logging in to the website from a PC.
An existing 'session file' can be imported automatically into instaloader through the script 615_import_firefox_session.py, which first requires the user to log in to Instagram through Firefox (other browsers are not supported!). The following steps (adapted from the official documentation) describe the procedure, also exemplified in the asciinema video below.
1. Download the script 615_import_firefox_session.py - you may save it in any folder
2. Log in to Instagram using Firefox
3. In the CLI, browse to the folder where you downloaded the script 615_import_firefox_session.py
4. Execute the script in the CLI, using command [c0.02]
5. Using instaloader with the --login option will then automatically make use of the imported 'session file'
python 615_import_firefox_session.py
[c0.02]
Prior to the execution of command [c0.02], the script 615_import_firefox_session.py was downloaded to the folder instaloader_script after logging in to Instagram using Firefox. The username whose details are loaded through the cookie is displayed in the output; in the video it has been replaced with [REDACTED_USERNAME].
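Once the session file has been imported through the procedure above, it can also be loaded and verified from Python. The snippet below is a minimal sketch, not part of the original procedure; ACCOUNT_COLLECTING is a placeholder for the username of the account used to log in through Firefox.
import instaloader

# Create an Instaloader instance and load the previously imported 'session file'
L = instaloader.Instaloader()
L.load_session_from_file("ACCOUNT_COLLECTING")

# test_login() returns the username of the logged-in account, or None if the session is not valid
print("Logged in as:", L.test_login())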