Hashtags (word segmentation)#
Hashtags made up of English words can be segmented with the Python module wordsegment.
Segmenting text in languages other than English requires either a different tool or training wordsegment on non-English data. Suggested tools for non-English texts are listed at the end of this page.
Word segmentation techniques may also be employed for text normalisation procedures, since the tool works with any piece of text where two or more words are grouped together.
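To make concrete what segmenting a run of joined words means, here is a minimal sketch of a frequency-based segmenter. It is not the wordsegment implementation: the word counts in `TOY_COUNTS`, the unknown-word penalty, and the function names are all assumptions made up for this illustration, whereas wordsegment relies on much larger unigram and bigram counts.

```python
import math

# Toy word counts (hypothetical; a real model is estimated from a large corpus)
TOY_COUNTS = {
    "over": 50, "analyzing": 5, "overanalyzing": 2, "everything": 20,
    "every": 30, "thing": 25, "this": 100, "is": 200, "a": 300, "test": 15,
}
TOTAL = sum(TOY_COUNTS.values())


def word_score(word):
    """Log-probability of a word; unknown words get a length-based penalty."""
    if word in TOY_COUNTS:
        return math.log(TOY_COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def toy_segment(text, max_word_len=20):
    """Return the highest-scoring split of `text` into words (dynamic programming)."""
    text = text.lower()
    # best[i] holds (score, words) for the best segmentation of text[:i]
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            prev_score, prev_words = best[start]
            word = text[start:end]
            candidates.append((prev_score + word_score(word), prev_words + [word]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]


print(toy_segment("thisisatest"))  # ['this', 'is', 'a', 'test']
```

The same call works on any string of joined words, not only hashtags, which is why the technique is also useful for text normalisation.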
Install the required module#
pip install wordsegment
Segment hashtags and transform them into XML tags in an XML corpus file#
# Import the module to use regular expressions, and the one for segmenting strings of multiple words joined together
# into single words
import re
import wordsegment

# Load the list of English words supported by the module
wordsegment.load()
# Compile the regular expression to identify hashtags (including the two possible symbols, '#' and the full-width '＃')
# and the text string that follows
hashtag_re = re.compile(r"(^|\s)([#＃])(\w+)")


# Build the replacement for one matched hashtag
def tag_hashtag(match):
    # Save the string (without the hashtag symbol) into the variable 'clean_hashtag'
    clean_hashtag = match.group(3)
    # Apply the word segmentation on the hashtag string, and save the resulting string to the 'segmented' variable
    segmented = " ".join(wordsegment.segment(clean_hashtag))
    # Construct the final tag element 'exhashtag' using the non-segmented version of the string as value of the
    # attribute 'original', enclose the segmented version in the tag, and keep the whitespace matched before the hashtag
    return f"{match.group(1)}<exhashtag original='{clean_hashtag}'>{segmented}</exhashtag>"


# Open and read the data file, and store its contents in the variable 'file_contents'
with open("twint_output.xml", "r", encoding="utf-8") as in_file:
    file_contents = in_file.read()
# Replace every hashtag (the hashtag symbol followed by the text string) with the final tag element.
# Substituting match by match keeps a short hashtag from being rewritten inside a longer one (e.g. #cat in #category)
file_contents = hashtag_re.sub(tag_hashtag, file_contents)

# Open the output file
with open("cleaned_hashtags.xml", "w", encoding="utf-8") as out_file:
    # Write the modified contents to the output file
    out_file.write(file_contents)
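To try the substitution step on a short string, without the corpus file or the wordsegment model, a minimal sketch follows. The names `tag_hashtags` and `demo_segment` are hypothetical helpers introduced for this illustration, and `demo_segment` is only a stand-in for a real segmenter. The sketch substitutes match by match with `re.sub`, so that `#cat` is not rewritten inside `#category`, and it escapes values with `xml.sax.saxutils`, which writes the attribute in double quotes.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# '#' or full-width '＃' at the start of the text or after whitespace,
# followed by one or more word characters
HASHTAG_RE = re.compile(r"(^|\s)[#＃](\w+)")


def tag_hashtags(text, segment):
    """Wrap every hashtag in `text` in an <exhashtag> element.

    `segment` is any callable that maps a string to a list of words.
    """
    def build_tag(match):
        leading_space, body = match.group(1), match.group(2)
        words = " ".join(segment(body))
        # quoteattr() adds the surrounding quotes and escapes the value
        return (f"{leading_space}<exhashtag original={quoteattr(body)}>"
                f"{escape(words)}</exhashtag>")

    return HASHTAG_RE.sub(build_tag, text)


def demo_segment(s):
    """Stand-in segmenter for this demonstration only (hypothetical)."""
    return {"thisisatest": ["this", "is", "a", "test"]}.get(s, [s])


print(tag_hashtags("ok #thisisatest #cat #category", demo_segment))
```

With the real module, `wordsegment.segment` can be passed as the `segment` argument once `wordsegment.load()` has been called.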
Example of hashtags transformed through [s5.17]#
When did I become such a girl....<exhashtag orig="overanalyzingeverything">overanalyzing everything</exhashtag> <exhashtag orig="thisisntlikeme">this isn't like me</exhashtag>
Word segmentation tools for languages other than English#
The following table collects Python tools for applying word segmentation to languages other than English.
Tools are selected from the GitHub list of tools tagged as word-segment and have not been tested!
| tool | URL | supported language(s) |
|---|---|---|
| | | Thai |
| | | Chinese, English, French, German, Hebrew, Italian, Russian, Spanish |
| | | Japanese |
| | | Cantonese |
| | | Any language supported by Hugging Face models (virtually >200 as of September 2023) |
| | | Chinese |
| | | Burmese |