Hashtags (word segmentation)#

Hashtags made up of English words can be segmented with the Python module `wordsegment`.
Languages other than English require either a different tool or a version of `wordsegment` trained on non-English data; suggested tools for non-English texts are listed at the end of this page.
Word segmentation can also be used for text normalisation, since the tool works on any string in which two or more words are run together.
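
To give an intuition of what a word segmenter does, the sketch below scores every possible split of a string against a word-frequency table and keeps the most probable one. It is a toy illustration only, not the implementation used by `wordsegment`: the names `TOY_COUNTS`, `word_score` and `toy_segment` are hypothetical, and the counts are made up for the example. Swapping in a frequency table for another language is, in principle, what training on non-English data amounts to.

```python
from functools import lru_cache

# Hypothetical toy vocabulary with made-up counts (illustration only)
TOY_COUNTS = {"this": 50, "is": 60, "a": 80, "test": 20,
              "over": 15, "analyzing": 5, "everything": 10}
TOTAL = sum(TOY_COUNTS.values())


def word_score(word):
    """Return a probability for a word; unknown words are penalised by length."""
    if word in TOY_COUNTS:
        return TOY_COUNTS[word] / TOTAL
    return 10.0 / (TOTAL * 10 ** len(word))


def toy_segment(text, max_len=12):
    """Return the highest-scoring split of `text` into words."""

    @lru_cache(maxsize=None)
    def best(start):
        # Best (score, words) pair for the substring text[start:]
        if start == len(text):
            return 1.0, ()
        candidates = []
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            rest_score, rest_words = best(end)
            candidates.append((word_score(word) * rest_score, (word,) + rest_words))
        return max(candidates)

    return list(best(0)[1])


print(toy_segment("thisisatest"))
```

With a real frequency table the vocabulary would cover many thousands of words, so the quality of the split depends heavily on the data the table was built from.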

Install the required module#

pip install wordsegment

Segment hashtags and transform them into XML tags in an XML corpus file#

Script [s5.17] #

# Import the module to use regular expressions, and the one for segmenting strings of multiple words joined together
# into single words
import re
import wordsegment

# Load the list of English words supported by the module
wordsegment.load()
# Compile the regular expression to identify hashtags (including the two possible symbols) and the text string that follows
hashtag_re = re.compile(r"(?:^|\s)([＃#])(\w+)", re.UNICODE)

# Open and read the data file, and store its contents in the variable 'file_contents'
with open("twint_output.xml", "r", encoding="utf-8") as in_file:
    file_contents = in_file.read()
# Search for every hashtag appearing in the contents, and for each one do
for hashtag in re.findall(hashtag_re, file_contents):
    # Merge the hashtag symbol and the following string into one single string and store it in the variable 'found_hashtag'
    found_hashtag = "".join(hashtag)
    # Save the string (without the hashtag symbol) into the variable 'clean_hashtag'
    clean_hashtag = hashtag[1]
    # Apply the word segmentation on the hashtag string, and save the resulting string to the 'segmented' variable
    segmented = " ".join(wordsegment.segment(clean_hashtag))
    # Construct the final tag element 'exhashtag', using the non-segmented version of the string as the value of the attribute 'original',
    # and the segmented version as the text enclosed by the tag
    tag = f"<exhashtag original='{clean_hashtag}'>{segmented}</exhashtag>"
    # Replace every instance of the original hashtag (composed by the hashtag symbol followed by the text string) with the final tag element
    file_contents = file_contents.replace(found_hashtag, tag)

# Open the output file
with open("cleaned_hashtags.xml", "w", encoding="utf-8") as out_file:
    # Write the modified contents to the output file
    out_file.write(file_contents)
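
One limitation of replacing hashtags with `str.replace` is that a short hashtag is also replaced inside a longer one that starts with the same characters (for instance `#cat` inside `#catdog`). A possible alternative, sketched below, lets `re.sub` build the tag for each match in place. The function name `tag_hashtags` and its `segment` parameter are assumptions made for this example, not part of the script above; `segment` can be any callable that returns a list of words.

```python
import re

# Same idea as the pattern in the script, but the leading whitespace
# is captured so that it can be written back unchanged
HASHTAG_RE = re.compile(r"(^|\s)[＃#](\w+)")


def tag_hashtags(text, segment):
    """Wrap every hashtag in `text` in an <exhashtag> element.

    `segment` is any callable mapping a string to a list of words.
    """

    def build_tag(match):
        leading, clean_hashtag = match.group(1), match.group(2)
        segmented = " ".join(segment(clean_hashtag))
        return f"{leading}<exhashtag original='{clean_hashtag}'>{segmented}</exhashtag>"

    # Each match is replaced exactly where it was found
    return HASHTAG_RE.sub(build_tag, text)
```

Assuming `wordsegment.load()` has already been called, the function could be used as `file_contents = tag_hashtags(file_contents, wordsegment.segment)`.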

Example of hashtags transformed through `[s5.17]`#

Example [e5.23]#

When did I become such a girl....<exhashtag original='overanalyzingeverything'>overanalyzing everything</exhashtag> <exhashtag original='thisisntlikeme'>this isn't like me</exhashtag>
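
To check the result, the tagged hashtags can be read back from the output as pairs of original and segmented strings. The sketch below assumes the attribute is named `original`, as written by script `[s5.17]`; the names `EXHASHTAG_RE` and `read_hashtags` are hypothetical.

```python
import re

# Matches the element written by the script: attribute 'original',
# with either single or double quotes around its value
EXHASHTAG_RE = re.compile(r"<exhashtag original=(['\"])(\w+)\1>(.*?)</exhashtag>")


def read_hashtags(xml_text):
    """Return (original, segmented) pairs for every <exhashtag> element."""
    return [(m.group(2), m.group(3)) for m in EXHASHTAG_RE.finditer(xml_text)]
```

Calling `read_hashtags(file_contents)` on the modified contents returns one pair per tagged hashtag, which makes it easy to skim the segmentations for errors.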

Word segmentation tools for languages other than English#

The following table collects Python tools for applying word segmentation to languages other than English.
The tools were selected from the GitHub list of tools tagged as word-segment and have not been tested.

tool	URL	supported language(s)
`PyThaiNLP`	PyThaiNLP/pythainlp	Thai
`symspellpy`	mammothb/symspellpy	Chinese, English, French, German, Hebrew, Italian, Russian, Spanish
`nagisa`	taishi-i/nagisa	Japanese
`PyCantonese`	jacksonllee/pycantonese	Cantonese
`hashformers`	ruanchaves/hashformers	Any language supported by Hugging Face models (more than 200 as of September 2023)
`CKIP Transformers`	ckiplab/ckip-transformers	Chinese
`Myan-word-breaker`	stevenay/myan-word-breaker	Burmese
