Instagram

Data can be collected from Instagram using instaloader. The options and arguments for the tool can be found in the official documentation.


Installing the tool (CATLISM, 206)

Command [c5.23]
pip install instaloader

Using the tool (CATLISM, 206; 208)

Command [c5.24]
instaloader --login [ACCOUNT_COLLECTING] [TARGET] [OPTIONS]
Command [c5.25]
instaloader --login [ACCOUNT_COLLECTING] --comments --geotags mitpress
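
When run, instaloader saves each downloaded post as a set of files named after the post's UTC timestamp, a naming convention that script [s5.09] below relies on. The filenames in the following listing are illustrative examples (assumptions, not actual output):

2022-03-01_17-05-43_UTC.json.xz        (post contents and metadata, compressed JSON)
2022-03-01_17-05-43_UTC.jpg            (media file)
2022-03-01_17-05-43_UTC_comments.json  (comments, when --comments is used)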

Extracting the data


Extracting and merging data from posts and comments (CATLISM, 220-226)

Script [s5.09]
# Import modules for: listing files using wildcard patterns; reading JSON files; working with .xz compressed files;
# working on local folders and files; using regular expressions; working with XML files
import glob
import json
import lzma
import os
import re
from lxml import etree

# Create and compile a regular expression to capture the timestamp included in the filenames downloaded by instaloader
dates_filter = re.compile(
    "([0-9]{4}-[0-9]{2}-[0-9]{2}_[0-9]{2}-[0-9]{2}-[0-9]{2}_UTC).*", re.UNICODE
)

# Create an empty list to store all the timestamps retrieved from filenames
dates = []
# List all the files in the current folder (the one where the script resides)
files = glob.glob("*.*")

# For every single file found:
for single_file in files:
    # Use the 'dates_filter' regex to find the date in the filename, and store it in the variable 'found_date'
    found_date = re.search(dates_filter, single_file)
    # If the date is found and is not already included in the list 'dates', add it; otherwise, proceed to the next file
    if found_date is not None and found_date[1] not in dates:
        dates.append(found_date[1])

# For every date in the list of dates, do:
for date in dates:
    # Create the root element tag <text> to include all the contents relative to the date (i.e. the post and its relative comments)
    text_tag = etree.Element("text")

    # Build the filename of the compressed JSON containing the post contents and metadata, and store it in a variable
    archive_filename = date + ".json.xz"
    # Check if the file exists on disk; if not, skip this date and move on to the next one
    if not os.path.isfile(archive_filename):
        print("File " + archive_filename + " not found, skipping...")
        continue

    # Create the <item> element tag to store the contents of the post
    item_tag = etree.SubElement(text_tag, "item")
    # Open the compressed JSON file and do:
    with lzma.open(archive_filename) as f:
        # Read its contents and store them into a variable
        contents = f.read()
        # Decode the contents to UTF-8
        contents = contents.decode("utf-8")
    # Load the decoded contents as JSON
    data = json.loads(contents)

    # Assign the main JSON data-point to the variable 'node', to avoid repeating a longer string throughout the code
    node = data["node"]
    # Extract a number of values from JSON data-points, and assign each one of them to a separate attribute of <item>
    item_tag.attrib["id"] = str(node["shortcode"])
    item_tag.attrib["type"] = "post"
    item_tag.attrib["created"] = str(node["taken_at_timestamp"])
    item_tag.attrib["username"] = str(node["owner"]["username"])
    item_tag.attrib["comments"] = str(node["edge_media_to_comment"]["count"])
    # The following data-points are checked: if they do not exist, the values 'none' and 'na' are assigned
    # to the two attributes respectively
    item_tag.attrib["location"] = str(
        node["location"]["slug"] if node["location"] is not None else "none"
    )
    item_tag.attrib["likes"] = str(
        data["node"]["edge_media_preview_like"]["count"]
        if "edge_media_preview_like" in data["node"]
        else "na"
    )
    # Try to extract the textual content of the post: if it exists, extract it; if not (i.e. an IndexError is raised
    # because the post has no caption), assign an empty string to the variable that stores the caption
    try:
        text_post_caption = str(
            node["edge_media_to_caption"]["edges"][0]["node"]["text"]
        )
    except IndexError:
        text_post_caption = ""
    # Enclose the textual content of the post inside <item>
    item_tag.text = text_post_caption

    # Check if data-point 'edge_sidecar_to_children' exists (i.e. if the post contains multiple multimedia files)
    if "edge_sidecar_to_children" in node:
        # For each object (i.e. multimedia file) found, start a counter to assign a progressive number (starting from 1)
        # to each one of them, and then do:
        for media_num, media in enumerate(
            node["edge_sidecar_to_children"]["edges"], start=1
        ):
            # Extract a number of values from JSON data-points, and assign each one of them to a separate variable
            media_shortcode = str(media["node"].get("shortcode", "na"))
            # Check the value of the data-point 'is_video': if true, assign the value 'video' to 'media_type';
            # otherwise assign it the value 'image'
            media_type = "video" if media["node"]["is_video"] else "image"
            # Check the value of the data-point 'is_video': if true, build the name of the media file using the
            # extension '.mp4'; if not, use the extension '.jpg'
            media_name = (
                str(date + "_" + str(media_num) + ".mp4")
                if media["node"]["is_video"]
                else str(date + "_" + str(media_num) + ".jpg")
            )
            # Check if the following data-points exist: if they do, extract their values and assign them to two
            # separate variables; if not, assign the value 'na' to the variable
            media_accessibility_caption = (
                str(media["node"]["accessibility_caption"])
                if "accessibility_caption" in media["node"]
                else "na"
            )
            media_views = (
                str(media["node"]["video_view_count"])
                if media["node"]["is_video"]
                else "na"
            )
            # Create a <media> element tag inside of <item>, and assign it all the previously extracted elements as
            # values to its attributes
            etree.SubElement(
                item_tag,
                "media",
                mediafile=media_name,
                mediatype=media_type,
                mediadescr=media_accessibility_caption,
                media_shortcode=media_shortcode,
                media_views=media_views,
            )
    # Otherwise, if data-point 'edge_sidecar_to_children' does not exist (i.e. if the post contains one single multimedia file)
    else:
        # Extract a number of values from JSON data-points, and assign each one of them to a separate variable - using
        # the same criteria adopted for the ones extracted from 'edge_sidecar_to_children'
        media_shortcode = str(node["shortcode"])
        media_type = "video" if node["is_video"] else "image"
        media_name = str(date + ".mp4") if node["is_video"] else str(date + ".jpg")
        media_accessibility_caption = (
            str(node["accessibility_caption"])
            if "accessibility_caption" in node
            else "na"
        )
        media_views = str(node["video_view_count"]) if node["is_video"] else "na"

        etree.SubElement(
            item_tag,
            "media",
            mediafile=media_name,
            mediatype=media_type,
            mediadescr=media_accessibility_caption,
            media_shortcode=media_shortcode,
            media_views=media_views,
        )

    # Build the filename for the comments file
    comments_filename = str(date + "_comments.json")
    # Check if the comments file exists, and if so do:
    if os.path.isfile(comments_filename):
        # Open the comments file and do:
        with open(comments_filename, encoding="utf-8") as f:
            # Read its contents as JSON and store them into a variable
            comments = json.loads(f.read())
        # For each comment in the contents do:
        for comment in comments:
            # Create an <item> element tag
            item_tag = etree.SubElement(text_tag, "item")
            # Extract a number of values from JSON data-points, and assign each one of them to a separate attribute of <item>
            item_tag.attrib["id"] = str(comment["id"])
            item_tag.attrib["type"] = "comment"
            item_tag.attrib["created"] = str(comment["created_at"])
            item_tag.attrib["username"] = comment["owner"]["username"]
            # The location is not present in the data-points for a comment; however, to have a structure that is consistent with
            # the <item> element tag when used for a post (for which a location may be present) the attribute is added with value 'na'
            item_tag.attrib["location"] = "na"
            item_tag.attrib["likes"] = str(comment["likes_count"])
            item_tag.attrib["comments"] = str(
                len(comment["answers"]) if comment["answers"] is not None else "na"
            )
            item_tag.text = comment["text"]

    # Wrap the extracted data in an XML tree structure
    tree = etree.ElementTree(text_tag)
    # Write the resulting XML structure to the output file, using utf-8 encoding, adding the XML declaration
    # at the start of the file and graphically formatting the layout ('pretty_print')
    tree.write(date + ".xml", pretty_print=True, xml_declaration=True, encoding="utf-8")

How to use script [s5.09]

Copy/download the file s5.09_extract_instaloader_json.py into the folder containing the data downloaded through instaloader (e.g. through [c5.25]); then navigate to that folder in the terminal, e.g.

cd Downloads/instagram_data/

Finally, run the script from the terminal:

python s5.09_extract_instaloader_json.py
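
If everything works correctly, the script saves one .xml file per post in the same folder, each named after the post's timestamp (see example [e5.11] below).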

Example of data extracted with [s5.09] (CATLISM, 220)

Example [e5.11]
<?xml version='1.0' encoding='UTF-8'?>
<text>
  <item id="UNIQUE_ID" type="post" created="UNIX_TIMESTAMP" username="USERNAME" location="LOCATION" likes="NUMBER" comments="NUMBER">
    POST TEXTUAL CONTENT
    <media mediafile="MEDIA_FILENAME.mp4" mediatype="VIDEO_OR_IMAGE" mediadescr="MEDIA_ACCESSIBILITY_CAPTION" media_shortcode="SHORTCODE" media_views="NUMBER" />
  </item>
  <item id="UNIQUE_ID" type="comment" created="UNIX_TIMESTAMP" username="USERNAME" location="LOCATION" likes="NUMBER" comments="NUMBER">COMMENT TEXTUAL CONTENT</item>
</text>
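
Once generated, the XML files can be processed with any XML-aware tool. The following is a minimal sketch (not part of [s5.09]) showing how the output may be loaded with lxml to list each post and its number of comments; it assumes the .xml files reside in the current folder:

# Minimal sketch: list posts and their comment counts from the XML files
# produced by script [s5.09] (assumed to be in the current folder)
import glob
from lxml import etree

for xml_file in glob.glob("*.xml"):
    tree = etree.parse(xml_file)
    # Iterate over all <item> elements and keep only those marked as posts
    for item in tree.iter("item"):
        if item.get("type") == "post":
            print(xml_file, item.get("id"), "comments:", item.get("comments"))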

Fix login error using a ‘session file’ for the --login option

As per the official documentation, when using the --login option in interactive mode (i.e. entering username and password in the CLI), instaloader may fail with a login error. To solve this error it is oftentimes sufficient to use a ‘session file’, i.e. the Instagram cookies generated by a web browser when logging into the website from a PC.
An existing ‘session file’ can be imported into instaloader automatically through script 615_import_firefox_session.py, which first requires the user to log in to Instagram through Firefox (other browsers are not supported!). The following steps (adapted from the official documentation) describe the procedure, also exemplified in the asciinema video below.

  1. Download the script 615_import_firefox_session.py - you may save it in any folder

  2. Log in to Instagram using Firefox

  3. In the CLI, browse to the folder where you downloaded script 615_import_firefox_session.py

  4. Execute the script in the CLI using command [c0.02]

  5. Using instaloader with the --login option will then automatically make use of the imported ‘session file’

Command [c0.02]
python 615_import_firefox_session.py

Prior to the execution of command [c0.02], script 615_import_firefox_session.py was downloaded to the folder instaloader_script after logging into Instagram using Firefox.
The username whose details are loaded through the cookie is displayed in the output; in the video this has been replaced with [REDACTED_USERNAME].
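
The imported ‘session file’ can also be reused programmatically through instaloader's Python module. The following is a minimal sketch (an assumption, not part of the official procedure) that loads the session and replicates command [c5.25]; YOUR_USERNAME is a placeholder for the account logged in through Firefox:

# Minimal sketch: reuse the imported 'session file' from Python
import instaloader

# Mirror the options of command [c5.25] (--comments --geotags)
L = instaloader.Instaloader(download_comments=True, download_geotags=True)
# Load the session file created by 615_import_firefox_session.py;
# YOUR_USERNAME is a placeholder for the Firefox-logged-in account
L.load_session_from_file("YOUR_USERNAME")
# Download the 'mitpress' profile, as in command [c5.25]
L.download_profile("mitpress")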