PyMUSAS

PyMUSAS, the Python Multilingual Ucrel Semantic Analysis System, is a semantic tagging framework that contains three kinds of semantic tagger: rule based, neural network, and a hybrid of the two. All but the neural network tagger can identify and tag Multi Word Expressions (MWE). The taggers can support any semantic tagset; however, the tagset we have concentrated on, and released pre-configured spaCy components for, is the Ucrel Semantic Analysis System (USAS).

Below we describe the different semantic taggers we support and the pre-configured models we have released for each semantic tagger, as well as how to read and navigate the documentation website.

Semantic Taggers

As mentioned, we have 3 different taggers: rule based, neural network (neural), and hybrid. A guide on how to choose the right tagger for you can be found in the tagger comparison section below, which also compares the taggers' performance across various languages.

tip

The USAS special tags:

  • Z99 - tagger does not know how to tag that word. Only the Rule Based taggers can generate this tag.
  • PUNCT - tagger believes the word to be punctuation and therefore does not have any further semantic meaning.
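
One practical use of these two special tags is to estimate how much of a text a tagger actually covered. The sketch below is illustrative only and does not use the PyMUSAS API: it assumes you have already extracted the most likely tag for each token into a plain list of strings, and the function name `lexical_coverage` and the example tags are hypothetical.

```python
from typing import Sequence

UNKNOWN_TAG = "Z99"   # special tag: the tagger could not tag the word
PUNCT_TAG = "PUNCT"   # special tag: the word is punctuation


def lexical_coverage(first_tags: Sequence[str]) -> float:
    """Return the share of non-punctuation tokens that are not tagged Z99."""
    # Punctuation carries no further semantic meaning, so leave it out.
    words = [tag for tag in first_tags if tag != PUNCT_TAG]
    if not words:
        return 0.0
    known = sum(1 for tag in words if tag != UNKNOWN_TAG)
    return known / len(words)


# Hypothetical most likely tag per token for a five-token sentence.
example_tags = ["Z1", "Z99", "A1.1.1", "PUNCT", "M1"]
coverage = lexical_coverage(example_tags)
print(f"{coverage:.2f}")
```

A low coverage value for a rule based tagger suggests that many words are missing from its lexicon.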

Rule Based

The rule based tagger supports both single tokens and MWEs. It is a re-implementation of the USAS rule based tagger that was developed first in C and then in Java by Paul Rayson and Scott Piao, and it is heavily based on the rules from Extracting Multiword Expressions with A Semantic Tagger by Scott Piao et al. 2003. For more information on exactly how the tagger works, please read the API documentation, specifically the RuleBasedTagger class and the Contextual Ranker class.

PyMUSAS currently supports 11 different languages for the rule based tagger, with pre-configured spaCy components that can be downloaded. Each language has its own guide on how to tag text using PyMUSAS with the Rule Based Tagger. Below we show the languages supported, whether the model for that language supports MWE identification and tagging (all languages support single token level tagging by default), and the disk space of the model in megabytes (MB):

| Language (BCP 47 language code) | MWE Support | Disk Space (MB) |
|---|---|---|
| Mandarin Chinese (cmn) | ✔️ | 1.28 |
| Danish (da) | ✔️ | 0.85 |
| Dutch, Flemish (nl) | | 0.15 |
| English (en) | ✔️ | 0.86 |
| Finnish (fi) | | 0.64 |
| French (fr) | | 0.08 |
| Indonesian (id) | | 0.24 |
| Italian (it) | ✔️ | 0.50 |
| Portuguese (pt) | ✔️ | 0.27 |
| Spanish, Castilian (es) | ✔️ | 0.26 |
| Welsh (cy) | ✔️ | 1.10 |

Neural

The neural tagger, as the name suggests, is neural network based: we have trained a model to predict semantic tags, specifically USAS tags, for every single token it is given. More specifically, we have fine-tuned various BERT-like models. All of the models we have trained use the same English training data, which consists of 1,083 English Wikipedia articles containing ~5.3 million token labels. The labels were generated automatically using the C version of the English rule based USAS semantic tagger.

Currently we have 4 trained neural taggers, 2 for English and 2 that are highly multilingual, along with a guide on how to tag text using PyMUSAS with these neural taggers. The table below shows the size of these models in both millions (M) of parameters and disk space in megabytes (MB), as well as the name of each tuned model on HuggingFace, with a link to the model's card, which details how the model was trained and how it performs.

| Language | HuggingFace ID with model card link | Parameter Size (M) | Disk Space (MB) |
|---|---|---|---|
| English | ucrelnlp/PyMUSAS-Neural-English-Small-BEM | 17 | 60 |
| English | ucrelnlp/PyMUSAS-Neural-English-Base-BEM | 68 | 242 |
| Multilingual | ucrelnlp/PyMUSAS-Neural-Multilingual-Small-BEM | 140 | 501 |
| Multilingual | ucrelnlp/PyMUSAS-Neural-Multilingual-Base-BEM | 307 | 1,060 |

Hybrid

The hybrid tagger is a combination of a rule based and a neural tagger. In essence, it runs the configured rule based tagger on a given text and, if the text contains one or more unknown words that the rule based tagger cannot tag, it uses the neural tagger to assign a tag to those words. More details can be found in the API documentation of the HybridTagger class. The hybrid tagger does not come with any pre-configured spaCy components; every hybrid tagger must be configured for your own use case, but we have a detailed how-to guide on this at how to tag text with hybrid tagger.
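
The fallback behaviour described above can be sketched in a few lines. This is a minimal illustration of the idea and not the HybridTagger implementation: the `Tagger` signature, the `hybrid_tag` function, the toy taggers, and the example tags are all hypothetical, and each tagger is assumed to return one non-empty ranked list of tags per token.

```python
from typing import Callable, List

UNKNOWN_TAG = "Z99"  # the rule based tagger's "unknown word" tag

# Hypothetical tagger signature: tokens in, one ranked tag list per token out.
Tagger = Callable[[List[str]], List[List[str]]]


def hybrid_tag(tokens: List[str], rule_based: Tagger, neural: Tagger) -> List[List[str]]:
    """Tag with the rule based tagger, using the neural tagger for unknown words."""
    rule_tags = rule_based(tokens)
    # Only run the neural tagger when at least one word was left untagged.
    if not any(tags[:1] == [UNKNOWN_TAG] for tags in rule_tags):
        return rule_tags
    neural_tags = neural(tokens)
    return [
        n_tags if r_tags[:1] == [UNKNOWN_TAG] else r_tags
        for r_tags, n_tags in zip(rule_tags, neural_tags)
    ]


def toy_rule_based(tokens: List[str]) -> List[List[str]]:
    # Toy lexicon lookup standing in for a real rule based tagger.
    lexicon = {"river": ["W3", "M4"], ".": ["PUNCT"]}
    return [lexicon.get(token.lower(), [UNKNOWN_TAG]) for token in tokens]


def toy_neural(tokens: List[str]) -> List[List[str]]:
    # Stand-in for a neural model: always predicts the same tag.
    return [["Z2"] for _ in tokens]


hybrid_result = hybrid_tag(["Thames", "river", "."], toy_rule_based, toy_neural)
print(hybrid_result)
```

Only "Thames" is missing from the toy lexicon, so it is the only token whose tags come from the toy neural tagger; the other tokens keep their rule based tags.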

Tagger Comparison

We have 3 different types of taggers: rule based, neural, and hybrid. Below we state the advantages and disadvantages of the rule based tagger, followed by evaluation results for all 3 tagger types on 4 different languages.

Advantages
  • Very fast.
  • Requires little amount of disk space and memory (RAM).
  • An explainable/interpretable tagger.
  • Can generate USAS tags that contain affixed symbols like +, -, %, @, etc and multi membership tags that are denoted through the use of a slash /, e.g. F2/O2.
  • Can identify and tag Multi Word Expressions.
Drawbacks
  • The number of words it can generate a semantic tag for depends on the size and content of the lexicon. In essence, this tagger is unlikely to make a prediction for all words.
  • Unlike the neural tagger, it cannot generate a controllable number of semantic tags per word; the number of semantic tags generated is determined by the lexicon within the tagger.
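
To make the tag format mentioned in the advantages concrete, the sketch below splits a multi membership tag on the slash and separates each member into a base tag and its affixed symbols. It is illustrative only and not part of the PyMUSAS API: the `ParsedTag` type and `parse_usas_tag` function are hypothetical, and the regular expression assumes a base tag is a capital letter followed by dot-separated numbers (as in F2 or A1.1.1).

```python
import re
from typing import List, NamedTuple


class ParsedTag(NamedTuple):
    base: str     # e.g. "A5.1"
    affixes: str  # e.g. "+++", or "" when there are none


# Assumed base tag shape: a capital letter, then dot-separated numbers.
_TAG_PATTERN = re.compile(r"^([A-Z]\d+(?:\.\d+)*)(.*)$")


def parse_usas_tag(tag: str) -> List[ParsedTag]:
    """Split a tag into its members and separate each base tag from its affixes."""
    parsed = []
    for member in tag.split("/"):
        match = _TAG_PATTERN.match(member)
        if match is None:
            # Not in the assumed shape (e.g. PUNCT): keep the member as-is.
            parsed.append(ParsedTag(member, ""))
        else:
            parsed.append(ParsedTag(match.group(1), match.group(2)))
    return parsed


print(parse_usas_tag("F2/O2"))
print(parse_usas_tag("A5.1+++"))
```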

We also have performance metrics for the 3 types of taggers on 4 different languages. These metrics reinforce the benefits and disadvantages of each tagger. The evaluation metric is Top-N accuracy, whereby higher is better (100 is best):

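| Models | English | Chinese | Finnish | Welsh |
|---|---|---|---|---|
| Rule Based | 72.4 | 32.6 | 58.4 | 70.6 |
| Neural-E-17M | 66.4 | - | - | - |
| Neural-E-68M | 70.1 | - | - | - |
| Neural-M-140M | 66.0 | 42.2 | 15.8 | 21.7 |
| Neural-M-307M | 70.2 | 47.9 | 25.9 | 42.0 |
| Hybrid-E-17M | 72.5 | - | - | - |
| Hybrid-E-68M | 72.5 | - | - | - |
| Hybrid-M-140M | 72.5 | 39.8 | 59.1 | 71.3 |
| Hybrid-M-307M | 72.5 | 39.8 | 60.3 | 72.4 |
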
Full Model Names
  • Neural-E-17M - Neural English 17 Million parameter tagger.
  • Neural-E-68M - Neural English 68 Million parameter tagger.
  • Neural-M-140M - Neural Multilingual 140 Million parameter tagger.
  • Neural-M-307M - Neural Multilingual 307 Million parameter tagger.

This naming convention is the same for the Hybrid models. For the Hybrid models we used the rule based tagger for the given dataset language and, where possible, the rule based tagger with Multi Word Expression support.

Top-N evaluation explanation

Top-1 and top-5 evaluation are forms of top-N accuracy evaluation: a token counts as correct if one of the tagger's top N predictions is the correct tag, and as incorrect otherwise. This is computed over every token that a human has annotated with a semantic tag in the given datasets.
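
As an illustration of the metric, the sketch below computes top-N accuracy as a percentage. It is not the evaluation code used to produce the results above: the function name `top_n_accuracy` and the example data are hypothetical, and it assumes each annotated token has exactly one gold tag and that each tagger output is a list of tags ranked from most to least likely.

```python
from typing import Sequence


def top_n_accuracy(predictions: Sequence[Sequence[str]], gold: Sequence[str], n: int) -> float:
    """Percentage of tokens whose gold tag is among the tagger's top n predictions."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        1 for ranked, gold_tag in zip(predictions, gold) if gold_tag in ranked[:n]
    )
    return 100.0 * correct / len(gold)


# Hypothetical ranked predictions for four human-annotated tokens.
predictions = [["A1", "B2"], ["Z99"], ["M1", "M2", "W3"], ["F1"]]
gold = ["A1", "Z2", "M2", "F1"]

top_1 = top_n_accuracy(predictions, gold, n=1)
top_5 = top_n_accuracy(predictions, gold, n=5)
print(top_1, top_5)
```

The third token is wrong under top-1 but correct under top-5, because its gold tag is the tagger's second prediction. This is why a tagger that can return several ranked tags per word can score higher as N grows.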

Reading the documentation

The documentation website is split between the Usage and API pages:

  • Usage - The usage pages contain how-to guides and explanations.
  • API - The docstrings of the PyMUSAS library. These are the best pages to look at if you want to know exactly what a class / function / attribute does in more technical detail. They do contain examples, but these are minimal working examples rather than real world examples.

Future Plans

Our road map contains the most up to date future plans for PyMUSAS.