Other elements
Various elements (e.g. URLs, email addresses) can be transformed from their original format into XML elements by using specific regular expressions in conjunction with script [s5.17]. As noted in the volume (CATLISM, 287), more efficient and safer options exist (i.e. the use of the lxml module to modify an existing XML file, which avoids deleting elements in a way that may result in a malformed structure). The advantage of this strategy is that it can be applied to any type of file (.txt, .csv, .xml, .json, etc.) and adapted to transform [any element] into any required syntax.
Each regular expression is accompanied by a direct link to its interactive version on RegExr ([Skinner, 2022]), the tool suggested in the book for inspecting and creating regular expressions.
Regular expression to capture usernames, e.g. @matteodic (CATLISM, 289)
(?<=^|\s)(@[\w.]+)(?<!\.)
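As an illustration of how the username expression might be used to produce XML elements, the following is a minimal sketch and not the book's script [s5.17]. It assumes Python's built-in re module, which requires look-behinds to have a fixed width and therefore does not compile (?<=^|\s) as written; the sketch substitutes (?<!\S), which should accept the same positions (start of the text or after whitespace). The function name tag_usernames and the tag name mention are hypothetical.

```python
import re

# Assumption: (?<!\S) replaces the original (?<=^|\s), because the built-in
# re module rejects look-behinds whose alternatives differ in width.
USERNAME = re.compile(r"(?<!\S)(@[\w.]+)(?<!\.)")


def tag_usernames(text, tag="mention"):
    """Wrap every captured username in a (hypothetical) XML element."""
    # \1 is the captured username; the final (?<!\.) keeps a trailing
    # full stop outside the match.
    return USERNAME.sub(rf"<{tag}>\1</{tag}>", text)


if __name__ == "__main__":
    print(tag_usernames("Thanks @matteodic. See you, @jane.doe!"))
```

Because the expression requires the @ to be at the start of the text or preceded by whitespace, an email address such as user@example.com is left untouched.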
Regular expression to capture simple URLs, e.g. http://example.com and https://example.com (CATLISM, 289)
http[s]?:\/\/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
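The simple URL expression can be applied in a similar way. The sketch below is an assumption-laden example rather than the book's own code: the function name tag_urls and the tag name url are hypothetical, and the matched text is passed through xml.sax.saxutils.escape so that characters such as & do not break the resulting XML.

```python
import re
from xml.sax.saxutils import escape

# The expression is copied verbatim from the listing above.
SIMPLE_URL = re.compile(
    r"http[s]?:\/\/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)


def tag_urls(text, tag="url"):
    """Wrap every matched URL in a (hypothetical) XML element."""

    def wrap(match):
        # Escape &, < and > inside the URL before placing it in the element.
        return f"<{tag}>{escape(match.group(0))}</{tag}>"

    return SIMPLE_URL.sub(wrap, text)


if __name__ == "__main__":
    print(tag_urls("See http://example.com/a?x=1&y=2 for details"))
```

One point to keep in mind when adapting it: the class [$-_@.&+] is a character range running from $ to _, so it also matches characters such as < and >. Applied to text that already contains markup, a match may therefore extend into an adjacent tag.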
Regular expression to capture complex URLs, e.g. simple URLs plus email addresses, mailto: links, and URLs with optional parameters (adapted from https://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without; CATLISM, 289)
((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[.\!\/\\w]*))?)
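Since the complex expression contains several nested capturing groups, it may help to inspect what it actually matches before using it in a substitution. The sketch below is one possible way of doing so in Python; the function name find_links is hypothetical, and the expression is copied verbatim and only split across three string literals for readability.

```python
import re

# Adjacent raw strings are concatenated into the single expression
# shown in the listing above.
COMPLEX_URL = re.compile(
    r"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+"
    r"|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)"
    r"((?:\/[\+~%\/.\w_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[.\!\/\\w]*))?)"
)


def find_links(text):
    """Return the full text of every match found in `text`."""
    # finditer with group(0) yields the whole match; findall would return
    # one tuple per match because the pattern has capturing groups.
    return [match.group(0) for match in COMPLEX_URL.finditer(text)]


if __name__ == "__main__":
    sample = "Write to mailto:info@example.com or visit www.example.com/path?x=1"
    for link in find_links(sample):
        print(link)
```

The same compiled pattern could then be passed to a substitution with a replacement function, in the manner of the previous sketches, once the matches have been checked against the data at hand.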