PyMUSAS

PyMUSAS, the Python Multilingual Ucrel Semantic Analysis System, is a rule-based token and Multi Word Expression (MWE) semantic tagger. The tagger can support any semantic tagset; however, the tagset we have concentrated on, and released pre-configured spaCy components for, is the Ucrel Semantic Analysis System (USAS).

PyMUSAS currently supports 10 languages with pre-configured spaCy components that can be downloaded; each language has its own guide on how to tag text using PyMUSAS. The table below lists the supported languages, whether the model for each language supports MWE identification and tagging (all languages support token-level tagging by default), and the size of the model:

Language (BCP 47 language code) | MWE Support | Size
Mandarin Chinese (cmn)          | ✔️          | 1.28 MB
Welsh (cy)                      | ✔️          | 1.09 MB
Spanish, Castilian (es)         | ✔️          | 0.20 MB
Finnish (fi)                    |             | 0.63 MB
French (fr)                     |             | 0.08 MB
Indonesian (id)                 |             | 0.24 MB
Italian (it)                    | ✔️          | 0.50 MB
Dutch, Flemish (nl)             |             | 0.15 MB
Portuguese (pt)                 | ✔️          | 0.27 MB
English (en)                    | ✔️          | 0.88 MB
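
To make the distinction between token-level tagging and MWE tagging concrete, here is a minimal, self-contained sketch of a rule-based tagger. It is not PyMUSAS's implementation or API: the lexicons, the `tag_tokens` function, and the longest-match strategy are all hypothetical and chosen only for illustration. The tag labels are borrowed loosely from USAS and should not be read as real lexicon entries.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicons (illustrative entries, not real PyMUSAS data).
TOKEN_LEXICON: Dict[str, List[str]] = {
    "the": ["Z5"],
    "river": ["W3"],
}
MWE_LEXICON: Dict[Tuple[str, ...], List[str]] = {
    ("new", "york"): ["Z2"],
}
# Assumed fallback tag for tokens that match no rule.
UNMATCHED: List[str] = ["Z99"]


def tag_tokens(tokens: List[str]) -> List[Tuple[str, List[str]]]:
    """Tag each token, preferring the longest MWE match at each position."""
    lowered = [token.lower() for token in tokens]
    max_mwe_len = max((len(key) for key in MWE_LEXICON), default=1)
    tagged: List[Tuple[str, List[str]]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try MWE candidates from longest to shortest (length >= 2).
        for n in range(min(max_mwe_len, len(tokens) - i), 1, -1):
            key = tuple(lowered[i:i + n])
            if key in MWE_LEXICON:
                # Every token in the MWE receives the MWE's tags.
                for token in tokens[i:i + n]:
                    tagged.append((token, MWE_LEXICON[key]))
                i += n
                matched = True
                break
        if not matched:
            # Fall back to the single-token lexicon, then to UNMATCHED.
            tags = TOKEN_LEXICON.get(lowered[i], UNMATCHED)
            tagged.append((tokens[i], tags))
            i += 1
    return tagged


result = tag_tokens(["The", "river", "in", "New", "York"])
for token, tags in result:
    print(f"{token}\t{tags}")
```

In this sketch, a language without MWE support corresponds to an empty `MWE_LEXICON`: every token is then tagged individually from the single-token lexicon.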

Reading the documentation

The documentation website is split into Usage and API pages:

  • Usage - The Usage pages contain how-to guides and explanations.
  • API - The API pages are the docstrings of the PyMUSAS library, and are the best place to look if you want to know exactly what a class, function, or attribute does in more technical detail. They do contain examples, but these are minimal working examples rather than real-world examples.

Future Plans

Our road map contains the most up-to-date plans for PyMUSAS.