PyMUSAS
PyMUSAS, the Python Multilingual Ucrel Semantic Analysis System, is a rule-based token and Multi Word Expression (MWE) semantic tagger. The tagger can support any semantic tagset; however, the tagset we have concentrated on, and released pre-configured spaCy components for, is the Ucrel Semantic Analysis System (USAS).
PyMUSAS currently supports 10 different languages with pre-configured spaCy components that can be downloaded. Each language has its own guide on how to tag text using PyMUSAS. The table below lists the supported languages, whether the model for each language supports MWE identification and tagging (all languages support token-level tagging by default), and the size of the model:
Language (BCP 47 language code) | MWE Support | Size |
---|---|---|
Mandarin Chinese (cmn) | ✔️ | 1.28MB |
Welsh (cy) | ✔️ | 1.09MB |
Spanish, Castilian (es) | ✔️ | 0.20MB |
Finnish (fi) | ❌ | 0.63MB |
French (fr) | ❌ | 0.08MB |
Indonesian (id) | ❌ | 0.24MB |
Italian (it) | ✔️ | 0.50MB |
Dutch, Flemish (nl) | ❌ | 0.15MB |
Portuguese (pt) | ✔️ | 0.27MB |
English (en) | ✔️ | 0.88MB |
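
To make the difference between token-level tagging and MWE tagging in the table more concrete, here is a toy sketch in plain Python. It does not use PyMUSAS or spaCy and is not how the library is implemented: the lexicons, the tag strings, the `UNMATCHED_TAG` value, and the `tag_tokens` function are all hypothetical and exist only for illustration. It assumes a simple rule where the longest MWE match wins and anything else falls back to a single-token lookup.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicons. The tag strings are placeholders chosen for
# illustration, not entries taken from a real PyMUSAS lexicon.
TOKEN_LEXICON: Dict[str, str] = {
    "new": "T3-",
    "york": "Z2",
    "is": "A3+",
    "big": "N3.2+",
}
MWE_LEXICON: Dict[Tuple[str, ...], str] = {
    ("new", "york"): "Z2",
}
UNMATCHED_TAG = "Z99"  # assumed fallback for tokens missing from both lexicons


def tag_tokens(tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag each token, preferring the longest MWE match at each position."""
    lowered = [token.lower() for token in tokens]
    longest_mwe = max((len(key) for key in MWE_LEXICON), default=1)
    tagged: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, down to spans of two tokens.
        for n in range(min(longest_mwe, len(tokens) - i), 1, -1):
            key = tuple(lowered[i:i + n])
            if key in MWE_LEXICON:
                # Every token inside the MWE receives the MWE's tag.
                tagged.extend((tok, MWE_LEXICON[key]) for tok in tokens[i:i + n])
                i += n
                matched = True
                break
        if not matched:
            # Token-level fallback: look the single token up on its own.
            tagged.append((tokens[i], TOKEN_LEXICON.get(lowered[i], UNMATCHED_TAG)))
            i += 1
    return tagged


if __name__ == "__main__":
    for token, tag in tag_tokens(["New", "York", "is", "big"]):
        print(f"{token}\t{tag}")
```

In this sketch "New York" is tagged as one unit by the MWE rule, whereas a model without MWE support would tag "New" and "York" independently using only the single-token lexicon.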
Reading the documentation
The documentation website is split between the Usage and API pages:
- Usage - The usage pages contain how-to guides and explanations.
- API - The API pages are the docstrings of the PyMUSAS library. They are the best pages to look at if you want to know exactly what a class / function / attribute does in more technical detail. They do contain examples, but these are minimum working examples rather than real-world examples.
Future Plans
Our road map contains the most up-to-date plans for PyMUSAS.