Hashtags (word segmentation)#

Hashtags made of English words can be segmented with the Python module wordsegment.
Segmenting text in languages other than English requires either a different tool or training wordsegment on non-English data. Suggested tools for non-English texts are listed at the end of this page.
Word segmentation can also be used for text normalisation, since the tool works on any piece of text in which two or more words are joined together.
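
To make the dependence on language data concrete, here is a minimal toy sketch of frequency-based segmentation. It is not wordsegment's implementation (its scoring is more elaborate); the names `TOY_COUNTS`, `word_score` and `toy_segment`, and the counts themselves, are hypothetical. The point is only that the segmenter picks the split whose words are most probable according to a frequency table, so a table built from another language's corpus would be needed for that language.

```python
import math
from functools import lru_cache

# Hypothetical toy frequency table; a real one would be counted from a large
# corpus in the target language.
TOY_COUNTS = {"this": 50, "is": 80, "a": 100, "test": 20, "the": 120, "best": 15}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = max(len(word) for word in TOY_COUNTS)


def word_score(word):
    """Return the log-probability of a word; unknown words are penalised by length."""
    if word in TOY_COUNTS:
        return math.log(TOY_COUNTS[word] / TOTAL)
    return math.log(1 / (TOTAL * 10 ** len(word)))


def toy_segment(text):
    """Split text into the most probable word sequence under the toy table."""
    text = text.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Best (score, words) pair for the part of the text from 'start' onwards
        if start == len(text):
            return 0.0, ()
        candidates = []
        for end in range(start + 1, min(len(text), start + MAX_WORD_LEN) + 1):
            rest_score, rest_words = best(end)
            word = text[start:end]
            candidates.append((word_score(word) + rest_score, (word,) + rest_words))
        return max(candidates)

    return list(best(0)[1])


print(toy_segment("thisisatest"))
```

With this table the call splits the string into "this", "is", "a" and "test"; a string made only of words missing from the table is returned unsplit.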

Install the required module#

pip install wordsegment

Segment hashtags and transform them into XML tags in an XML corpus file#

Script [s5.17] #
# Import the module for regular expressions, and the one for splitting strings of
# multiple words joined together into single words
import re
import wordsegment

# Load the English word data used by the module
wordsegment.load()
# Compile the regular expression that identifies hashtags (either of the two possible symbols, '#' or '＃') and the text string that follows
hashtag_re = re.compile(r"(?:^|\s)([#＃])(\w+)", re.UNICODE)

# Open and read the data file, and store its contents in the variable 'file_contents'
with open("twint_output.xml", "r", encoding="utf-8") as in_file:
    file_contents = in_file.read()
# Search for every hashtag appearing in the contents, and for each one do
for hashtag in hashtag_re.findall(file_contents):
    # Merge the hashtag symbol and the following string into one single string and store it in the variable 'found_hashtag'
    found_hashtag = "".join(hashtag)
    # Save the string (without the hashtag symbol) into the variable 'clean_hashtag'
    clean_hashtag = hashtag[1]
    # Apply the word segmentation to the hashtag string, and save the resulting string to the 'segmented' variable
    segmented = " ".join(wordsegment.segment(clean_hashtag))
    # Construct the final tag element 'exhashtag', using the non-segmented version of the string as the value of the attribute 'original',
    # and enclosing the segmented version within the tag
    tag = f"<exhashtag original='{clean_hashtag}'>{segmented}</exhashtag>"
    # Replace every instance of the original hashtag (the hashtag symbol followed by the text string) with the final tag element
    file_contents = file_contents.replace(found_hashtag, tag)

# Open the output file
with open("cleaned_hashtags.xml", "w", encoding="utf-8") as out_file:
    # Write the modified contents to the output file
    out_file.write(file_contents)
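
Because [s5.17] substitutes hashtags with str.replace, a hashtag that is the beginning of a longer one (for example #nofilter and #nofilterneeded) is also replaced inside the longer one. One possible variant, sketched below, performs the substitution with re.sub and a callback, so that each match is rewritten exactly where it was found. The function `tag_hashtags` and the stand-in `fake_segment` are hypothetical names used only for illustration; in practice `wordsegment.segment` could be passed instead, after calling `wordsegment.load()`. The sketch also assumes that the second hashtag symbol is the full-width '＃'.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hashtag pattern: start of text or whitespace, then '#' or full-width '＃' (assumed)
HASHTAG_RE = re.compile(r"(^|\s)[#＃](\w+)")


def tag_hashtags(text, segment):
    """Replace each hashtag in 'text' with an <exhashtag> element.

    'segment' is any callable that maps a string to a list of words.
    """
    def build_tag(match):
        leading, word = match.group(1), match.group(2)
        segmented = " ".join(segment(word))
        # quoteattr and escape keep the generated element well-formed XML
        return f"{leading}<exhashtag original={quoteattr(word)}>{escape(segmented)}</exhashtag>"

    return HASHTAG_RE.sub(build_tag, text)


def fake_segment(word):
    """Stand-in segmenter for demonstration only (not wordsegment)."""
    lookup = {
        "nofilter": ["no", "filter"],
        "nofilterneeded": ["no", "filter", "needed"],
    }
    return lookup.get(word, [word])


sample = "Sunset #nofilter and #nofilterneeded"
print(tag_hashtags(sample, fake_segment))
```

Passing the segmenter as an argument keeps the sketch independent of any particular library, so one of the non-English tools could be swapped in without changing the substitution logic.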

Example of hashtags transformed through [s5.17]#

Example [e5.23]#
When did I become such a girl....<exhashtag original='overanalyzingeverything'>overanalyzing everything</exhashtag> <exhashtag original='thisisntlikeme'>this isn't like me</exhashtag>
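
Once hashtags are stored as elements, the original and segmented forms can be read back with a standard XML parser. The sketch below uses Python's built-in xml.etree.ElementTree. The enclosing `<tweet>` element and the sample text are hypothetical, and the attribute name `original` is the one written by [s5.17].

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, wrapped in a root element so that it parses as XML
fragment = (
    "<tweet>Example text "
    "<exhashtag original='overanalyzingeverything'>overanalyzing everything</exhashtag>"
    "</tweet>"
)

root = ET.fromstring(fragment)
# Collect (original hashtag, segmented text) pairs from every exhashtag element
pairs = [(element.get("original"), element.text) for element in root.iter("exhashtag")]
print(pairs)
```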

Word segmentation tools for languages other than English#

The following table lists Python tools for word segmentation in languages other than English.
The tools were selected from the GitHub list of repositories tagged as word-segment and have not been tested.

| tool | GitHub repository | supported language(s) |
|---|---|---|
| PyThaiNLP | PyThaiNLP/pythainlp | Thai |
| symspellpy | mammothb/symspellpy | Chinese, English, French, German, Hebrew, Italian, Russian, Spanish |
| nagisa | taishi-i/nagisa | Japanese |
| PyCantonese | jacksonllee/pycantonese | Cantonese |
| hashformers | ruanchaves/hashformers | Any language supported by Hugging Face models (more than 200 as of September 2023) |
| CKIP Transformers | ckiplab/ckip-transformers | Chinese |
| Myan-word-breaker | stevenay/myan-word-breaker | Burmese |