On scripts and tools

The following excerpt from Corpus Approaches to Language in Social Media provides some context for the scripts and tools proposed in the volume and shared through this online compendium.

1CATLISM, 102

In addition to CLI [Command-Line Interface] tools, practical applications of the data-processing techniques are exemplified by means of code written in Python. The choice may look strongly unbalanced towards those readers who already have some degree of acquaintance with Python, while those without any previous experience with Python or any other programming language may feel excluded and hindered. This choice is however motivated by the opposite aim: by shifting the focus from the purely functional aspects of code to its underlying rationale and subjective (human) decisions, the ‘reading of the code’ should - with a look at Rancière’s “ignorant schoolmaster” ([Rancière, 1991]) - allow readers to approach the intricacies of programming from a different perspective. One that appreciates “the way code signifies” ([Marino, 2020]:5) and opens up paths of interpretations and inferencing, while leaving aside the “chauvinism that creeps into discussions of programming” that produces “a hierarchy based on an arbitrary judgment of what is “real” or “good” or “right” code” (cf. [Marino, 2020]:16–17). For this reason not much relevance nor presence is given to sharing an a-priori knowledge of Python – its basic syntax, loops, ifs, etc…; rather, explanations and descriptions of code snippets and scripts are progressively laid out and blended inside the discursive narration of the operations conducted through the code, and inside the code itself in the form of comments. The use of comments is common practice among developers and scholars to make the code more accessible, and certainly not a novel approach to computational approaches for the social sciences. Here these ‘in-line narrations’ are tailored to the purpose of the volume, aimed at enabling social scientists to grasp the relation between the code and the considerations pertaining [to] the analysis of language, regardless of their coding skills.
[…] The choice adopted with regards to programming languages follows the simple principle that social scientists should not take the place of programmers or acquire advanced skills in coding, but must be able to engage with code and understand the “medial changes” it opens up (cf. [Berry, 2012]). Without any presumption of providing all the required basics for coding, the ‘narrations’ […] should also represent a gentle introduction to more advance[d] studies of programming languages, such as the ones included in the materials suggested in the ‘Advanced reading’, a non-comprehensive list of resources – based on the author’s experience – for readers wishing to learn more about Python and its usage. At last, readers may at times wonder why what appears to be a more complex approach conducted through the use of code has been adopted instead of e.g. point-and-click tools: this choice is deliberate, and tries to propose the benefits of an ‘artisanal’ approach to digital data along the lines of situated software - i.e. “software designed in and for a particular social situation or context” and that “doesn’t need to be personalized [but that] is personal from its inception” ([Shirky, 2004]). Considering how each piece of online data (e.g. a web page) is likely to have some unique characteristics (e.g. in its structure), an ‘artisanal’ approach presents in fact two initial advantages: first it allows researchers to build custom (semi-)automated workflows for the collection of the data that correctly fulfils the research needs and required procedures; second it enables researchers to automatically process data that could otherwise not be collected through existing ready-made procedures or tools. Customisation is not however free from downsides, the major being a degree of experience required with programming and markup languages, as well as with digital data itself.
The third advantage therefore is that having to deal with said languages requires an ever-increasing understanding of the characteristics and the ‘nature’ of digital data, which in turn informs the proficiency in the practices of processing and analysing it. To further prompt such benefits the scripts presented throughout the volume present variations, whereby the same operation is conducted using different approaches (syntax, modules, rationales) across different scripts. A choice that, although not in line with Python’s principles (the so-called Zen of Python; cf. [Peters, 2004]), will hopefully encourage readers to engage with the code, its meanings, and its possibilities.
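
To give a concrete sense of what such 'variations' and 'in-line narrations' may look like, the following is a minimal sketch that is not taken from the volume. It counts the tokens of a few short posts in two different ways, with comments narrating each step. The sample posts and the function names `count_tokens_with_loop` and `count_tokens_with_counter` are hypothetical and invented purely for illustration; splitting on whitespace is a simplifying assumption, not a recommended tokenisation strategy.

```python
from collections import Counter

# Hypothetical sample data: three short posts invented for this illustration
posts = [
    "Corpora are fun",
    "Social media corpora are large",
    "Fun with corpora",
]


def count_tokens_with_loop(texts):
    """Count lowercase tokens using only a dictionary and explicit loops."""
    counts = {}
    for text in texts:
        # Lowercase the post and split it on whitespace to obtain tokens
        for token in text.lower().split():
            # Start from 0 if the token has not been seen yet, then add 1
            counts[token] = counts.get(token, 0) + 1
    return counts


def count_tokens_with_counter(texts):
    """Count the same tokens using collections.Counter from the standard library."""
    # The generator expression yields every token of every post in turn
    return Counter(token for text in texts for token in text.lower().split())


loop_counts = count_tokens_with_loop(posts)
counter_counts = count_tokens_with_counter(posts)

# Check that the two approaches agree on every frequency
print(loop_counts == dict(counter_counts))
# Show the single most frequent token and its count
print(counter_counts.most_common(1))
```

The first version spells out every decision (how a token is defined, what happens when it is first encountered), while the second delegates the bookkeeping to a standard-library class. Reading the two side by side is one possible way of engaging with the rationale behind the code rather than with its syntax alone.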