Facebook#
Data can be collected from Facebook using facebook-scraper.
Options and arguments for the tool can be found in the official documentation.
CATLISM, 227
Installing the tool (CATLISM, 227)
#
pip install facebook-scraper
[c5.26]
CATLISM, 227
Using the tool (CATLISM, 227)
#
facebook-scraper --filename OUTPUT.json --format json --comments --pages N PROFILE_NAME
Extracting the data#
CATLISM, 237-242
Extract data from posts3CATLISM, 237-242
#
# Import modules for: regular expressions; reading JSON files; locating files
# through wildcard patterns; building XML trees; parsing timestamps into date objects
import re
import json
import glob
from lxml import etree
from dateutil import parser

# Regular expression matching runs of two or more commas between JSON objects,
# compiled once and written as a raw string so the backslashes reach the regex
# engine unchanged (a plain string here triggers invalid-escape warnings).
# It works around a bug in facebook-scraper (v0.2.59) whereby more than one comma
# may be inserted between JSON objects, rendering the contents unparsable by Python
EXTRA_COMMAS = re.compile(r"\},{2,}\{")

# Process every file with a .json extension in the current folder
for single_file in glob.glob("*.json"):
    # Strip the extension from the filename; the result names the output XML file
    filename = single_file.replace(".json", "")

    # Read the whole file; the 'with' block guarantees the file is closed afterwards
    with open(single_file, encoding="utf-8") as f:
        contents = f.read()
    # Collapse runs of commas between objects into a single one (see EXTRA_COMMAS above)
    contents = EXTRA_COMMAS.sub("},{", contents)

    # Load the repaired contents as JSON
    data = json.loads(contents)
    # Create the <text> root element tag where the post contents and details are stored
    text_tag = etree.Element("text")

    # For each object in the JSON, corresponding to one post, do:
    for post in data:
        # Create the <post> element tag to enclose one single post
        post_tag = etree.SubElement(text_tag, "post")
        # Assign a set of attributes to <post> using the extracted metadata
        # data-points, and store the textual content of the post as the tag's text
        post_tag.attrib["id"] = post["post_id"]
        post_tag.attrib["author"] = post["username"]
        post_tag.attrib["author_id"] = str(post["user_id"])
        post_tag.attrib["comments"] = str(post["comments"])
        post_tag.attrib["shares"] = str(post["shares"])
        post_tag.text = post["text"]

        # Parse the timestamp into a datetime object, then store its day, month and
        # year numbers in the attributes 'date_d', 'date_m', and 'date_y' respectively
        post_date = parser.parse(post["time"])
        post_tag.attrib["date_d"] = str(post_date.day)
        post_tag.attrib["date_m"] = str(post_date.month)
        post_tag.attrib["date_y"] = str(post_date.year)

        # If the number of 'likes' or 'reactions' is missing (None), store 0 instead
        post_tag.attrib["likes"] = str(post["likes"] if post["likes"] is not None else 0)
        post_tag.attrib["reactions_count"] = str(
            post["reaction_count"] if post["reaction_count"] is not None else 0
        )

        # Check if details concerning reactions are present, i.e. if the dictionary
        # for reactions exists and it contains a non-zero 'sad' entry
        reactions = post.get("reactions")
        if isinstance(reactions, dict) and reactions.get("sad"):
            # If present, store the total number of 'sad' reactions
            post_tag.attrib["reaction_sad"] = str(reactions.get("sad"))
        else:
            # If not, store the value 0
            post_tag.attrib["reaction_sad"] = "0"

        # Check whether the array of comments is present: if the data was collected
        # without '--comments' the array is always empty; if '--comments' was used,
        # it may still be empty when no comments were made to the post.
        # If missing, proceed with the next post
        if post["comments_full"] is None:
            continue

        # For each found comment, do:
        for comment in post["comments_full"]:
            # Create the <comment> element tag to enclose the contents of the comment
            comment_tag = etree.SubElement(post_tag, "comment")
            # Assign a set of attributes to <comment>, including 'type' with value 'c'
            # indicating this is a comment and not a reply to a comment (which would
            # be identified by the value 'r'); the textual content of the comment
            # becomes the tag's text
            comment_tag.attrib["type"] = "c"
            comment_tag.attrib["comment_to"] = post["post_id"]
            comment_tag.attrib["id"] = comment["comment_id"]
            comment_tag.attrib["author"] = comment["commenter_name"]
            comment_tag.attrib["author_id"] = comment["commenter_id"]

            # Parse the comment timestamp into a datetime object if present; day,
            # month and year go into 'date_d', 'date_m' and 'date_y', or 'na' when
            # no timestamp is available
            comment_date = (
                parser.parse(comment["comment_time"])
                if comment["comment_time"] is not None
                else None
            )

            comment_tag.attrib["date_d"] = str(
                comment_date.day if comment_date is not None else "na"
            )
            comment_tag.attrib["date_m"] = str(
                comment_date.month if comment_date is not None else "na"
            )
            comment_tag.attrib["date_y"] = str(
                comment_date.year if comment_date is not None else "na"
            )
            comment_tag.text = comment["comment_text"]

            # Check if the array of replies exists; if it does not, proceed with the next item
            if not comment["replies"]:
                continue

            # For each reply found, do:
            for reply in comment["replies"]:
                # Create a <comment> element tag to enclose the contents of the reply
                reply_tag = etree.SubElement(post_tag, "comment")

                # Assign a set of attributes to <comment>, including 'type' with value
                # 'r' indicating this is a reply to a comment and not a direct comment
                # to the post (identified by the value 'c'); the textual content of the
                # reply becomes the tag's text
                reply_tag.attrib["type"] = "r"
                reply_tag.attrib["comment_to"] = comment["comment_id"]
                reply_tag.attrib["id"] = reply["comment_id"]
                reply_tag.attrib["author"] = reply["commenter_name"]
                reply_tag.attrib["author_id"] = reply["commenter_id"]
                reply_date = (
                    parser.parse(reply["comment_time"])
                    if reply["comment_time"] is not None
                    else None
                )
                reply_tag.attrib["date_d"] = str(
                    reply_date.day if reply_date is not None else "na"
                )
                reply_tag.attrib["date_m"] = str(
                    reply_date.month if reply_date is not None else "na"
                )
                reply_tag.attrib["date_y"] = str(
                    reply_date.year if reply_date is not None else "na"
                )
                reply_tag.text = reply["comment_text"]

    # Build the XML structure with all the elements collected so far
    tree = etree.ElementTree(text_tag)
    # Write the resulting XML structure to a file named after the input file, using
    # utf-8 encoding, adding the XML declaration at the start of the file and
    # graphically formatting the layout ('pretty_print')
    tree.write(
        filename + ".xml", pretty_print=True, xml_declaration=True, encoding="utf-8"
    )
How to use script [s5.10]
#
Copy/download the file s5.10_extract_facebookscraper-posts_json.py
inside the folder where the data downloaded through facebook-scraper
(e.g. through c5.27
) resides; then navigate into the folder using the terminal, e.g.
cd Downloads/facebook_data/
At last, run the script from the terminal:
python s5.10_extract_facebookscraper-posts_json.py
CATLISM, 242-245
Extract data from profiles4CATLISM, 242-245
#
1# Import facebook_scraper module get_profile to get the profile details
2from facebook_scraper import get_profile
3
4# Define the function 'get_profile_details'
def get_profile_details(profile_id, profiles_db):
    """Extract details from a Facebook profile.

    Inputs are the profile ID and the pandas dataframe in which previously
    downloaded details are cached (so each profile is downloaded only once).
    Returns a tuple with the following elements:

    friend_count = profile_details[0]
    follower_count = profile_details[1]
    following_count = profile_details[2]
    basic_info = profile_details[3]
    about = profile_details[4]

    Assumes the existence of:
    - a file named 'cookies.json' in the current path, containing the Facebook cookies
    - a dataframe with 'profile_id' as index, e.g.:
      profiles_db = pd.DataFrame(
          columns=[
              "profile_id",
              "friend_count",
              "follower_count",
              "following_count",
              "basic_info",
              "about",
          ],
      ).set_index("profile_id")
    """

    # If details for the profile ID have already been downloaded:
    if profile_id in profiles_db.index:
        # Get the details from the already-downloaded data stored in the 'profiles_db' dataframe
        friend_count = profiles_db.loc[profile_id, "friend_count"]
        follower_count = profiles_db.loc[profile_id, "follower_count"]
        following_count = profiles_db.loc[profile_id, "following_count"]
        basic_info = profiles_db.loc[profile_id, "basic_info"]
        about = profiles_db.loc[profile_id, "about"]
        # Assign all the extracted details to a tuple labelled 'collected_details'
        collected_details = (
            friend_count,
            follower_count,
            following_count,
            basic_info,
            about,
        )
        # Output the tuple with the details
        return collected_details

    # If the details for the selected profile ID are not present in the 'profiles_db' dataframe:
    else:
        # Download the details using the 'get_profile' function from facebook-scraper; this
        # requires the Facebook cookies to be stored in the 'cookies.json' file (the same
        # filename stated in the docstring above) in the folder where the script is run,
        # passed to the function through the 'cookies=' argument. Details on how to export
        # cookies from the web browser are available in the tool's official documentation
        profile_details = get_profile(profile_id, cookies="cookies.json")
        # From the data downloaded by facebook-scraper extract only a number of details,
        # saving each one of them to a separate variable
        friend_count = str(profile_details["Friend_count"])
        follower_count = str(profile_details["Follower_count"])
        following_count = str(profile_details["Following_count"])
        basic_info = profile_details["Basic info"]
        # Extract the contents of the data-point 'About' if it exists; if it does not,
        # assign the string 'None' instead
        about = profile_details.get("About", "None")
        # Assign all the extracted details to a tuple labelled 'collected_details'
        collected_details = (
            friend_count,
            follower_count,
            following_count,
            basic_info,
            about,
        )
        # Add the downloaded details to the 'profiles_db' dataframe so that subsequent
        # calls for the same profile reuse them instead of downloading again
        profiles_db.loc[profile_id] = collected_details
        # Output the tuple with the details
        return collected_details
CATLISM, 245-246
Implement the collection of profile details ([s5.11]
) into [s5.10]
5CATLISM, 245-246
#
# Assuming that script [s5.11] has been saved locally to a file named 'get_profile_details.py',
# it can be implemented into script [s5.10] by adding the following lines where indicated in the comments

# Add the following imports at the bottom of the 'import' section, to use pandas dataframes
# and the get_profile_details function
import pandas as pd
from get_profile_details import get_profile_details

# Add the following lines after the import section to create a pandas dataframe for
# storing the downloaded details, indexed by profile ID (these lines belong at the
# top level of the script, i.e. without any extra indentation)
profiles_db = pd.DataFrame(
    columns=[
        "profile_id",
        "friend_count",
        "follower_count",
        "following_count",
        "basic_info",
        "about",
    ],
).set_index("profile_id")

# Whenever details from a profile need to be extracted, the function 'get_profile_details'
# can be applied to the extracted profile ID; the 'profiles_db' dataframe must be passed
# as the second argument, since the function requires it to cache already-downloaded
# profiles. For example, to download details regarding the author of a comment:
commenter_details = get_profile_details(comment["commenter_id"], profiles_db)

# Each detail can then be extracted to an XML attribute using e.g.:
comment_tag.attrib["author_friends"] = commenter_details[0]
comment_tag.attrib["author_current_place"] = commenter_details[3]
CATLISM, 236-237
Example of data extracted with [s5.10]
6CATLISM, 236-237
#
1<?xml version='1.0' encoding='UTF-8'?>
2<text>
3 <post id="UNIQUE_POST_ID" author="USER_FULL_NAME" author_id="UNIQUE_AUTHOR_ID" comments="NUMBER" shares="NUMBER" date_d="NUMBER" date_m="NUMBER" date_y="NUMBER" likes="NUMBER" reactions_count="NUMBER" reaction_sad="NUMBER">
4 POST TEXTUAL CONTENTS
5 <comment type="c" comment_to="UNIQUE_POST_ID" id="UNIQUE_COMMENT_ID" author="USER_FULL_NAME" author_id="UNIQUE_AUTHOR_ID" date_d="NUMBER" date_m="NUMBER" date_y="NUMBER">COMMENT TEXTUAL CONTENTS</comment>
6 <comment type="r" comment_to="UNIQUE_COMMENT_ID" id="UNIQUE_COMMENT_ID" author="USER_FULL_NAME" author_id="UNIQUE_AUTHOR_ID" date_d="NUMBER" date_m="NUMBER" date_y="NUMBER">REPLY TEXTUAL CONTENTS</comment>
7 </post>
8</text>