pos_mapper
pymusas.pos_mapper
Attributes
UPOS_TO_USAS_CORE :
Dict[str, List[str]]
A mapping from the Universal Part Of Speech (UPOS) tagset to the USAS core tagset. The UPOS tagset used here is the same as that used by the Universal Dependencies Treebank project. It differs slightly from the original tagset presented in the paper by Petrov et al. 2012; for the original tagset, see the following GitHub repository.
USAS_CORE_TO_UPOS :
Dict[str, List[str]]
The reverse of UPOS_TO_USAS_CORE.
PENN_CHINESE_TREEBANK_TO_USAS_CORE :
Dict[str, List[str]]
A mapping from the Penn Chinese Treebank tagset to the USAS core tagset. The Penn Chinese Treebank tagset used here differs slightly from the original, as it contains three extra tags, X, URL, and INF, that appear to be unique to the spaCy Chinese models. For more information on how this mapping was created, see the following GitHub issue.
USAS_CORE_TO_PENN_CHINESE_TREEBANK :
Dict[str, List[str]]
The reverse of PENN_CHINESE_TREEBANK_TO_USAS_CORE.
BASIC_CORCENCC_TO_USAS_CORE :
Dict[str, List[str]]
A mapping from the basic CorCenCC tagset to the USAS core tagset. This mapping comes from table A.1 in the paper Leveraging Pre-Trained Embeddings for Welsh Taggers and from table 6 in the paper Towards A Welsh Semantic Annotation System.
USAS_CORE_TO_BASIC_CORCENCC :
Dict[str, List[str]]
The reverse of BASIC_CORCENCC_TO_USAS_CORE.
UPOS_TO_USAS_CORE
UPOS_TO_USAS_CORE: Dict[str, List[str]] = {
'ADJ': ['adj'],
'ADP': ['prep'],
'ADV': ['adv'],
'AUX': ['verb'],
'CCONJ': ['c ...
USAS_CORE_TO_UPOS
USAS_CORE_TO_UPOS: Dict[str, List[str]] = {
'adj': ['ADJ'],
'prep': ['ADP'],
'adv': ['ADV'],
'verb': ['VERB', 'AUX'],
'con ...
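As a rough illustration of what "the reverse of" means for these dictionaries, the sketch below inverts a forward mapping so that each USAS core tag lists the source tags that map to it. The helper name `invert_mapping` and the `forward_excerpt` data are hypothetical and not part of pymusas; the excerpt holds only the first four UPOS entries shown above. Note that the published USAS_CORE_TO_UPOS lists 'VERB' before 'AUX', so its ordering may have been chosen deliberately; a plain inversion like this one would not necessarily reproduce it.

```python
from typing import Dict, List


def invert_mapping(forward: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Hypothetical helper: collect, for every target tag, the source tags
    that map to it, in the order the source tags appear in `forward`."""
    reverse: Dict[str, List[str]] = {}
    for source_tag, target_tags in forward.items():
        for target_tag in target_tags:
            reverse.setdefault(target_tag, []).append(source_tag)
    return reverse


# Excerpt only: the first four entries visible in UPOS_TO_USAS_CORE above.
forward_excerpt: Dict[str, List[str]] = {
    'ADJ': ['adj'],
    'ADP': ['prep'],
    'ADV': ['adv'],
    'AUX': ['verb'],
}

reverse_excerpt = invert_mapping(forward_excerpt)
```

Because the excerpt stops before the 'VERB' entry, `reverse_excerpt['verb']` contains only 'AUX' here, whereas the full reverse mapping also contains 'VERB'.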
PENN_CHINESE_TREEBANK_TO_USAS_CORE
PENN_CHINESE_TREEBANK_TO_USAS_CORE: Dict[str, List[str]] = {
'AS': ['part'],
'DEC': ['part'],
'DEG': ['part'],
'DER': ['part'],
'DEV': ['pa ...
USAS_CORE_TO_PENN_CHINESE_TREEBANK
USAS_CORE_TO_PENN_CHINESE_TREEBANK: Dict[str, List[str]] = {
'part': ['AS', 'DEC', 'DEG', 'DER', 'DEV', 'ETC', 'LC', 'MSP', 'SP'],
'fw': ['BA', 'FW', ' ...
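One way to sanity-check a forward and reverse mapping pair is to confirm that every (source tag, core tag) pair in the forward mapping can be found again in the reverse mapping. The sketch below does this on the Penn Chinese Treebank entries that are fully visible above; the helper `missing_reverse_pairs` and the two excerpt dictionaries are hypothetical and are not provided by pymusas.

```python
from typing import Dict, List, Tuple


def missing_reverse_pairs(
    forward: Dict[str, List[str]],
    reverse: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Hypothetical helper: return the (source_tag, core_tag) pairs found in
    `forward` that are absent from `reverse`."""
    missing: List[Tuple[str, str]] = []
    for source_tag, core_tags in forward.items():
        for core_tag in core_tags:
            if source_tag not in reverse.get(core_tag, []):
                missing.append((source_tag, core_tag))
    return missing


# Excerpts only: entries that appear in full in the listings above.
forward_excerpt = {
    'AS': ['part'],
    'DEC': ['part'],
    'DEG': ['part'],
    'DER': ['part'],
}
reverse_excerpt = {
    'part': ['AS', 'DEC', 'DEG', 'DER', 'DEV', 'ETC', 'LC', 'MSP', 'SP'],
}

# An empty list means every forward pair is present in the reverse mapping.
problems = missing_reverse_pairs(forward_excerpt, reverse_excerpt)
```

The check is deliberately one-directional: it does not require the reverse mapping to contain only pairs derived from the forward mapping.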
BASIC_CORCENCC_TO_USAS_CORE
BASIC_CORCENCC_TO_USAS_CORE: Dict[str, List[str]] = {
"E": ["noun"],
"YFB": ["art"],
"Ar": ["prep"],
"Cys": ["conj"],
"Rhi": ["num"] ...
USAS_CORE_TO_BASIC_CORCENCC
USAS_CORE_TO_BASIC_CORCENCC: Dict[str, List[str]] = {
"noun": ["E"],
"pnoun": ["E"],
"art": ["YFB"],
"det": ["YFB"],
"prep": ["Ar"], ...
upos_to_usas_core
def upos_to_usas_core(upos_tag: str) -> List[str]
Given a Universal Part Of Speech (UPOS) tag, it returns a List of equivalent USAS core POS tags. If the length of the List is greater than 1, the first tag in the List is the most equivalent tag. If the List is empty, an invalid UPOS tag was given.
The mappings between UPOS and USAS core can be seen in UPOS_TO_USAS_CORE.
Parameters
- upos_tag :
str
UPOS tag, expected to be all upper case.
Returns
List[str]
Examples
from pymusas.pos_mapper import upos_to_usas_core
assert upos_to_usas_core('CCONJ') == ['conj']
# Most equivalent tag for 'X' is 'fw'
assert upos_to_usas_core('X') == ['fw', 'xx']
assert upos_to_usas_core('Unknown') == []
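
The documented behaviour (an empty List for an unknown tag, and the first element being the most equivalent tag) can be mimicked with a plain dictionary lookup. The following is a minimal sketch under that assumption, not the library's actual implementation; `TOY_UPOS_TO_USAS_CORE`, `lookup_core_tags`, and `most_equivalent_core_tag` are hypothetical names, and the toy mapping holds only the two entries exercised by the examples above.

```python
from typing import Dict, List, Optional

# Toy data: only the entries confirmed by the documented examples.
TOY_UPOS_TO_USAS_CORE: Dict[str, List[str]] = {
    'CCONJ': ['conj'],
    'X': ['fw', 'xx'],
}


def lookup_core_tags(tag: str, mapping: Dict[str, List[str]]) -> List[str]:
    """Hypothetical helper: return a copy of the mapped tags, or an empty
    List when the tag is not a key of `mapping`."""
    return list(mapping.get(tag, []))


def most_equivalent_core_tag(
    tag: str, mapping: Dict[str, List[str]]
) -> Optional[str]:
    """Hypothetical helper: return the first (most equivalent) mapped tag,
    or None when the tag is unknown."""
    core_tags = lookup_core_tags(tag, mapping)
    return core_tags[0] if core_tags else None


best_for_x = most_equivalent_core_tag('X', TOY_UPOS_TO_USAS_CORE)
best_for_unknown = most_equivalent_core_tag('Unknown', TOY_UPOS_TO_USAS_CORE)
```

Returning a copy of the List in `lookup_core_tags` is a design choice of this sketch, so that callers cannot mutate the shared mapping by accident.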