pos_mapper
pymusas.pos_mapper
Attributes
- UPOS_TO_USAS_CORE : Dict[str, List[str]]
  A mapping from the Universal Part Of Speech (UPOS) tagset to the USAS core tagset. The UPOS tagset used here is the same as that used by the Universal Dependencies Treebank project. It differs slightly from the original tagset presented in the paper by Petrov et al. 2012; for the original tagset, see the following GitHub repository.
- USAS_CORE_TO_UPOS : Dict[str, List[str]]
  The reverse of UPOS_TO_USAS_CORE.
- PENN_CHINESE_TREEBANK_TO_USAS_CORE : Dict[str, List[str]]
  A mapping from the Penn Chinese Treebank tagset to the USAS core tagset. The Penn Chinese Treebank tagset used here differs slightly from the original, as it contains three extra tags, X, URL, and INF, that appear to be unique to the spaCy Chinese models. For more information on how this mapping was created, see the following GitHub issue.
- USAS_CORE_TO_PENN_CHINESE_TREEBANK : Dict[str, List[str]]
  The reverse of PENN_CHINESE_TREEBANK_TO_USAS_CORE.
- BASIC_CORCENCC_TO_USAS_CORE : Dict[str, List[str]]
  A mapping from the basic CorCenCC tagset to the USAS core tagset. This mapping comes from table A.1 in the paper Leveraging Pre-Trained Embeddings for Welsh Taggers and from table 6 in the paper Towards A Welsh Semantic Annotation System.
- USAS_CORE_TO_BASIC_CORCENCC : Dict[str, List[str]]
  The reverse of BASIC_CORCENCC_TO_USAS_CORE.
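To make the relationship between a mapping and its reverse concrete, here is a minimal sketch that inverts a small excerpt of the UPOS mapping. The names `UPOS_EXCERPT` and `invert_mapping` are hypothetical and are not part of `pymusas`; the excerpt holds only entries shown on this page, and the reverse dictionaries in the library may well be defined by hand with a deliberate tag order, so treat this as an illustration rather than the library's method.

```python
from typing import Dict, List

# Excerpt only: four entries copied from UPOS_TO_USAS_CORE as shown on this page.
UPOS_EXCERPT: Dict[str, List[str]] = {
    'ADJ': ['adj'],
    'ADP': ['prep'],
    'ADV': ['adv'],
    'AUX': ['verb'],
}


def invert_mapping(mapping: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Hypothetical helper: map each target tag back to its source tags."""
    inverted: Dict[str, List[str]] = {}
    for source_tag, target_tags in mapping.items():
        for target_tag in target_tags:
            # Source tags are appended in the order they are encountered,
            # which is not necessarily the order the library uses.
            inverted.setdefault(target_tag, []).append(source_tag)
    return inverted


usas_excerpt = invert_mapping(UPOS_EXCERPT)
```

Because several source tags can share one target tag, the values of the reverse mapping are lists, which matches the `Dict[str, List[str]]` type of every attribute above.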
UPOS_TO_USAS_CORE
UPOS_TO_USAS_CORE: Dict[str, List[str]] = {
'ADJ': ['adj'],
'ADP': ['prep'],
'ADV': ['adv'],
'AUX': ['verb'],
'CCONJ': ['c ...
USAS_CORE_TO_UPOS
USAS_CORE_TO_UPOS: Dict[str, List[str]] = {
'adj': ['ADJ'],
'prep': ['ADP'],
'adv': ['ADV'],
'verb': ['VERB', 'AUX'],
'con ...
PENN_CHINESE_TREEBANK_TO_USAS_CORE
PENN_CHINESE_TREEBANK_TO_USAS_CORE: Dict[str, List[str]] = {
'AS': ['part'],
'DEC': ['part'],
'DEG': ['part'],
'DER': ['part'],
'DEV': ['pa ...
USAS_CORE_TO_PENN_CHINESE_TREEBANK
USAS_CORE_TO_PENN_CHINESE_TREEBANK: Dict[str, List[str]] = {
'part': ['AS', 'DEC', 'DEG', 'DER', 'DEV', 'ETC', 'LC', 'MSP', 'SP'],
'fw': ['BA', 'FW', ' ...
BASIC_CORCENCC_TO_USAS_CORE
BASIC_CORCENCC_TO_USAS_CORE: Dict[str, List[str]] = {
"E": ["noun"],
"YFB": ["art"],
"Ar": ["prep"],
"Cys": ["conj"],
"Rhi": ["num"] ...
USAS_CORE_TO_BASIC_CORCENCC
USAS_CORE_TO_BASIC_CORCENCC: Dict[str, List[str]] = {
"noun": ["E"],
"pnoun": ["E"],
"art": ["YFB"],
"det": ["YFB"],
"prep": ["Ar"], ...
upos_to_usas_core
def upos_to_usas_core(upos_tag: str) -> List[str]
Given a Universal Part Of Speech (UPOS) tag, returns a List of the
equivalent USAS core POS tags. If the List contains more than one tag,
the first tag in the List is the closest equivalent.
If the List is empty, an invalid UPOS tag was given.
The mappings between UPOS and USAS core can be seen in UPOS_TO_USAS_CORE.
Parameters
- upos_tag : str
  UPOS tag, expected to be all upper case.
Returns
List[str]
Examples
from pymusas.pos_mapper import upos_to_usas_core
assert upos_to_usas_core('CCONJ') == ['conj']
# Most equivalent tag for 'X' is 'fw'
assert upos_to_usas_core('X') == ['fw', 'xx']
assert upos_to_usas_core('Unknown') == []
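
The behaviour described above (a list of equivalents, the first being the closest, and an empty list for an unknown tag) can be approximated with a plain dictionary lookup. The sketch below does not use `pymusas` and is not its implementation: `UPOS_EXCERPT`, `lookup_usas_core`, and `closest_usas_core` are hypothetical names, and the excerpt holds only the two entries that the examples above confirm.

```python
from typing import Dict, List, Optional

# Excerpt only: the two entries confirmed by the examples above.
UPOS_EXCERPT: Dict[str, List[str]] = {
    'CCONJ': ['conj'],
    'X': ['fw', 'xx'],
}


def lookup_usas_core(upos_tag: str) -> List[str]:
    """Hypothetical stand-in: return the equivalent tags, or [] if unknown."""
    # Return a copy so that callers cannot mutate the shared mapping.
    return list(UPOS_EXCERPT.get(upos_tag, []))


def closest_usas_core(upos_tag: str) -> Optional[str]:
    """Hypothetical helper: return the first (closest) tag, or None if unknown."""
    tags = lookup_usas_core(upos_tag)
    return tags[0] if tags else None


closest_for_x = closest_usas_core('X')
closest_for_unknown = closest_usas_core('Unknown')
```

Checking for an empty list before indexing, as `closest_usas_core` does, avoids an `IndexError` when a tag is not in the mapping. The lookup in this sketch is case-sensitive, so a lower-case tag such as `'x'` yields no match here.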