Tag Text

In this guide, we are going to show you how to tag text using the PyMUSAS RuleBasedTagger so that you can extract token-level USAS semantic tags from the tagged text. The guide is broken down by language; for each language we are going to:

  1. Download the relevant pre-configured PyMUSAS RuleBasedTagger spaCy component for the language.
  2. Download and use a Natural Language Processing (NLP) pipeline that will tokenize, lemmatize, and Part Of Speech (POS) tag the text. In most cases, this will be a spaCy pipeline. Note that the PyMUSAS RuleBasedTagger requires, at minimum, only tokenized data, but having the lemma and POS tag will improve the accuracy of the tagging.
  3. Run the PyMUSAS RuleBasedTagger.
  4. Extract token-level linguistic information from the tagged text, which will include USAS semantic tags.
  5. For the Chinese, Italian, Portuguese, Spanish, Welsh, and English taggers, which support Multi Word Expression (MWE) identification and tagging, extract this MWE information from the tagged text as well.

Chinese

First download both the Chinese PyMUSAS RuleBasedTagger spaCy component and the small Chinese spaCy model:

pip install https://github.com/UCREL/pymusas-models/releases/download/cmn_dual_upos2usas_contextual-0.3.3/cmn_dual_upos2usas_contextual-0.3.3-py3-none-any.whl
python -m spacy download zh_core_web_sm

Then create the tagger in a Python script:

note

Currently, there is no lemmatization component in the spaCy pipeline for Chinese.

import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('zh_core_web_sm', exclude=['parser', 'ner'])
# Load the Chinese PyMUSAS rule-based tagger in a separate spaCy pipeline
chinese_tagger_pipeline = spacy.load('cmn_dual_upos2usas_contextual')
# Adds the Chinese PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=chinese_tagger_pipeline)
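
If you would like to confirm that the component was added, spaCy's pipe_names attribute lists the pipeline components in order. This quick check is our addition rather than part of the original setup:

print(nlp.pipe_names)
# The PyMUSAS tagger should appear last, e.g. something similar to:
# ['tok2vec', 'tagger', 'attribute_ruler', 'pymusas_rule_based_tagger']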

The tagger is now set up for tagging text through the spaCy pipeline like so (this example follows on from the last). The example text is taken from the Chinese Wikipedia page on the topic of The Nile River:

text = "尼罗河 是一条流經非洲東部與北部的河流,與中非地區的剛果河、非洲南部的赞比西河以及西非地区的尼日尔河並列非洲最大的四個河流系統。"

output_doc = nlp(text)

print(f'Text\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.pos_}\t{token._.pymusas_tags}')
Output:
Text    POS     USAS Tags
尼罗河 PROPN ['Z2']
是 VERB ['A3', 'Z5']
一 NUM ['N1']
条 NUM ['G2.1/P1', 'S7.4-', 'A1.7+', 'S8-']
流經 ADJ ['Z99']
非洲 PROPN ['Z2']
東部 NOUN ['Z99']
與北部 PROPN ['Z99']
的 PART ['Z5']
河流 NOUN ['W3/M4', 'N5+']
, PUNCT ['PUNCT']
與 VERB ['Z99']
中非 PROPN ['Z99']
地區 NOUN ['Z99']
的 PART ['Z5']
剛果河 PROPN ['Z99']
、 PUNCT ['PUNCT']
非洲 PROPN ['Z2']
南部 NOUN ['M6']
的 PART ['Z5']
赞比西河 NOUN ['Z99']
以及 CCONJ ['N5++', 'N5.2+', 'A13.3', 'Z5']
西非 PROPN ['Z99']
地区 NOUN ['A1.1.1', 'B3/X1', 'G1.1c', 'W3', 'F4/M7', 'K2', 'M7', 'A4.1', 'N3.6', 'B1', 'T1.1', 'O4.4', 'N5.1-', 'S5+c', 'B3', 'Y1', 'C1/H1@']
的 PART ['Z5']
尼日尔河 NOUN ['Z99']
並列 VERB ['Z99']
非洲 PROPN ['Z2']
最 ADV ['A11.1+', 'N5+++', 'N3.2+++', 'A11.1+++', 'N5.1+', 'O2/M4', 'O3']
大 VERB ['A11.1+', 'N5+++', 'N3.2+++', 'A11.1+++', 'N5.1+', 'O2/M4', 'O3']
的 PART ['Z5']
四 NUM ['N1']
個 NUM ['N1']
河流 NOUN ['W3/M4', 'N5+']
系統 NOUN ['Z99']
。 PUNCT ['PUNCT']

For Chinese, the tagger also identifies and tags Multi-Word Expressions (MWEs). To find these MWEs, you can run the following:

print(f'Text\tPOS\tMWE start and end index\tUSAS Tags')
for token in output_doc:
    start, end = token._.pymusas_mwe_indexes[0]
    if (end - start) > 1:
        print(f'{token.text}\t{token.pos_}\t{(start, end)}\t{token._.pymusas_tags}')

This will output the following:

Text    POS    MWE start and end index    USAS Tags
最 ADV (28, 30) ['A11.1+', 'N5+++', 'N3.2+++', 'A11.1+++', 'N5.1+', 'O2/M4', 'O3']
大 VERB (28, 30) ['A11.1+', 'N5+++', 'N3.2+++', 'A11.1+++', 'N5.1+', 'O2/M4', 'O3']
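
The start and end values are token offsets into the Doc, so you can slice output_doc to recover each MWE as a single span. Below is a small follow-on sketch of this (our addition; the set is only there to avoid printing a span once per member token):

seen = set()
for token in output_doc:
    start, end = token._.pymusas_mwe_indexes[0]
    if (end - start) > 1 and (start, end) not in seen:
        seen.add((start, end))
        # Slicing a Doc with token offsets returns a Span covering the whole MWE
        print(output_doc[start:end].text, token._.pymusas_tags)

For the example above this should print the single MWE 最大 along with its USAS tags.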

Dutch

First download both the Dutch PyMUSAS RuleBasedTagger spaCy component and the small Dutch spaCy model:

pip install https://github.com/UCREL/pymusas-models/releases/download/nl_single_upos2usas_contextual-0.3.3/nl_single_upos2usas_contextual-0.3.3-py3-none-any.whl
python -m spacy download nl_core_news_sm

Then create the tagger in a Python script:

import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('nl_core_news_sm', exclude=['parser', 'ner', 'tagger'])
# Load the Dutch PyMUSAS rule-based tagger in a separate spaCy pipeline
dutch_tagger_pipeline = spacy.load('nl_single_upos2usas_contextual')
# Adds the Dutch PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=dutch_tagger_pipeline)

The tagger is now set up for tagging text through the spaCy pipeline like so (this example follows on from the last). The example text is taken from the Dutch Wikipedia page on the topic of The Nile River:

text = "De Nijl is met een lengte van 5499 tot 6695 km de langste of de op een na langste rivier van de wereld."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')
Output:
Text    Lemma   POS     USAS Tags
De de DET ['Z5']
Nijl nijl PROPN ['Z99']
is is AUX ['Z99']
met met ADP ['Z5']
een een DET ['Z5']
lengte lengte NOUN ['N3.7', 'T1.3', 'M4']
van van ADP ['Z5']
5499 5499 NUM ['N1']
tot tot ADP ['Z99']
6695 6695 NUM ['N1']
km km SYM ['Z99']
de de DET ['Z5']
langste lang ADJ ['N3.7+', 'T1.3+', 'N3.3+', 'N3.2+', 'X7+']
of of CCONJ ['Z5']
de de DET ['Z5']
op op ADP ['A5.1+', 'G2.2+', 'A1.1.1', 'M6', 'Z5']
een e NUM ['N1', 'T3', 'T1.2', 'Z8']
na na ADP ['N4', 'Z5']
langste lang ADJ ['N3.7+', 'T1.3+', 'N3.3+', 'N3.2+', 'X7+']
rivier rivier NOUN ['W3/M4', 'N5+']
van van ADP ['Z5']
de de DET ['Z5']
wereld wereld NOUN ['W1', 'S5+c', 'A4.1', 'N5+']
. . PUNCT ['PUNCT']
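
If you would like to keep the token-level results rather than just print them, the standard library's csv module can write them to a TSV file. Below is a minimal sketch (our addition; the output filename is illustrative):

import csv

with open('tagged_output.tsv', 'w', encoding='utf-8', newline='') as tsv_file:
    writer = csv.writer(tsv_file, delimiter='\t')
    writer.writerow(['Text', 'Lemma', 'POS', 'USAS Tags'])
    for token in output_doc:
        # pymusas_tags is a list of candidate tags, so join them to keep one row per token
        writer.writerow([token.text, token.lemma_, token.pos_,
                         ' '.join(token._.pymusas_tags)])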

French

First download both the French PyMUSAS RuleBasedTagger spaCy component and the small French spaCy model:

pip install https://github.com/UCREL/pymusas-models/releases/download/fr_single_upos2usas_contextual-0.3.3/fr_single_upos2usas_contextual-0.3.3-py3-none-any.whl
python -m spacy download fr_core_news_sm

Then create the tagger in a Python script:

import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('fr_core_news_sm', exclude=['parser', 'ner'])
# Load the French PyMUSAS rule-based tagger in a separate spaCy pipeline
french_tagger_pipeline = spacy.load('fr_single_upos2usas_contextual')
# Adds the French PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=french_tagger_pipeline)

The tagger is now set up for tagging text through the spaCy pipeline like so (this example follows on from the last). The example text is taken from the French Wikipedia page on the topic of The Nile River:

text = "Le Nil est un fleuve d'Afrique. Avec une longueur d'environ 6 700 km, c'est avec le fleuve Amazone, le plus long fleuve du monde."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')
Output:
Text      Lemma     POS       USAS Tags
Le le DET ['Z5']
Nil Nil PROPN ['Z99']
est être AUX ['M6']
un un DET ['Z5']
fleuve fleuve NOUN ['W3/M4', 'N5+']
d' de ADP ['Z5']
Afrique Afrique PROPN ['Z99']
. . PUNCT ['PUNCT']
Avec avec ADP ['Z5']
une un DET ['Z5']
longueur longueur NOUN ['N3.7', 'T1.3', 'M4']
d' de ADP ['Z5']
environ environ ADV ['Z5']
6 6 DET ['Z99']
700 700 NUM ['N1']
km kilomètre NOUN ['N3.3', 'N3.7']
, , PUNCT ['PUNCT']
c' ce PRON ['Z8']
est être VERB ['M6']
avec avec ADP ['Z5']
le le DET ['Z5']
fleuve fleuve NOUN ['W3/M4', 'N5+']
Amazone amazone NOUN ['Z99']
, , PUNCT ['PUNCT']
le le DET ['Z5']
plus plus ADV ['Z5']
long long ADJ ['Z99']
fleuve fleuve NOUN ['W3/M4', 'N5+']
du de ADP ['Z5']
monde monde NOUN ['Z99']
. . PUNCT ['PUNCT']

Italian

First download both the Italian PyMUSAS RuleBasedTagger spaCy component and the small Italian spaCy model:

pip install https://github.com/UCREL/pymusas-models/releases/download/it_dual_upos2usas_contextual-0.3.3/it_dual_upos2usas_contextual-0.3.3-py3-none-any.whl
python -m spacy download it_core_news_sm

Then create the tagger in a Python script:

import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('it_core_news_sm', exclude=['parser', 'ner', 'tagger'])
# Load the Italian PyMUSAS rule-based tagger in a separate spaCy pipeline
italian_tagger_pipeline = spacy.load('it_dual_upos2usas_contextual')
# Adds the Italian PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=italian_tagger_pipeline)

The tagger is now set up for tagging text through the spaCy pipeline like so (this example follows on from the last). The example text is taken from the Italian Wikipedia page on the topic of The Nile River:

text = "Il Nilo è un fiume africano lungo 6.852 km che attraversa otto stati dell'Africa. Tradizionalmente considerato il fiume più lungo del mondo, contende il primato della lunghezza al Rio delle Amazzoni."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')
Output:
Text              Lemma             POS     USAS Tags
Il il DET ['Z5']
Nilo nilo PROPN ['Z99']
è essere AUX ['A5.1', 'S7.1++', 'X3.2', 'Q2.2', 'A8', 'N3.1%']
un uno DET ['Z5']
fiume fiume NOUN ['W3']
africano africano ADJ ['Z2']
lungo lungo ADP ['Z5']
6.852 6.852 NUM ['N1']
km km NOUN ['N3.3']
che che PRON ['Z8']
attraversa attraversare VERB ['M1', 'M6', 'S8-', 'A1.8+', 'A6.3+', 'F4/L2', 'O4.4', 'Q1.2', 'E3-', 'S1.1.1', 'S9@']
otto otto NUM ['N1']
stati stato NOUN ['G2.1/H1', 'B2', 'A3']
dell' dell' ADP ['Z99']
Africa Africa PROPN ['Z2']
. . PUNCT ['PUNCT']
Tradizionalmente tradizionalmente ADV ['Z99']
considerato considerare VERB ['A5.1', 'N2', 'A11.1+', 'Q2.2', 'S1.1.1', 'Q1.3', 'S9%', 'X2.1', 'X2.4', 'X6']
il il DET ['Z5']
fiume fiume NOUN ['W3']
più molto ADV ['N3.3+', 'A13.3']
lungo lungo ADJ ['N3.3+', 'A13.3']
del del ADP ['Z5']
mondo mondo NOUN ['W1']
, , PUNCT ['PUNCT']
contende contendere VERB ['S7.3']
il il DET ['Z5']
primato primato NOUN ['A5.1+++', 'A11.1+']
della della ADP ['Z99']
lunghezza lunghezza NOUN ['N3.7', 'T1.3', 'M4']
al al ADP ['Z5']
Rio Rio PROPN ['Z2']
delle della ADP ['Z5']
Amazzoni amazzoni PROPN ['Z99']
. . PUNCT ['PUNCT']

For Italian, the tagger also identifies and tags Multi-Word Expressions (MWEs). To find these MWEs, you can run the following:

print(f'Text\tPOS\tMWE start and end index\tUSAS Tags')

for token in output_doc:
    start, end = token._.pymusas_mwe_indexes[0]
    if (end - start) > 1:
        print(f'{token.text}\t{token.pos_}\t{(start, end)}\t{token._.pymusas_tags}')

This will output the following:

Text    POS     MWE start and end index    USAS Tags
più ADV (20, 22) ['N3.3+', 'A13.3']
lungo ADJ (20, 22) ['N3.3+', 'A13.3']

Portuguese

First download both the Portuguese PyMUSAS RuleBasedTagger spaCy component and the small Portuguese spaCy model:

pip install https://github.com/UCREL/pymusas-models/releases/download/pt_dual_upos2usas_contextual-0.3.3/pt_dual_upos2usas_contextual-0.3.3-py3-none-any.whl
python -m spacy download pt_core_news_sm

Then create the tagger in a Python script:

import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('pt_core_news_sm', exclude=['parser', 'ner'])
# Load the Portuguese PyMUSAS rule-based tagger in a separate spaCy pipeline
portuguese_tagger_pipeline = spacy.load('pt_dual_upos2usas_contextual')
# Adds the Portuguese PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=portuguese_tagger_pipeline)

The tagger is now set up for tagging text through the spaCy pipeline like so (this example follows on from the last). The example text is taken from the Portuguese Wikipedia page on the topic of The Nile River:

text = "Todos estes estudos levam a que o comprimento de ambos os rios permaneça em aberto, continuando por isso o debate e como tal, continuando-se a considerar o Nilo como o rio mais longo."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')
Output:
Text            Lemma           POS     USAS Tags
Todos Todos DET ['Z8/N5.1+c']
estes este DET ['Z5', 'Z8']
estudos estudo NOUN ['P1', 'X2.4', 'H2', 'Q1.2', 'C1']
levam levar VERB ['A9+', 'T1.3', 'C1', 'A1.1.1', 'M2', 'S7.1-', 'A2.1+', 'X2.4', 'S6+', 'S7.4+', 'N3', 'A2.1+', 'P1', 'M1', 'X2.5+', 'F1@', 'F2@', 'Q1.2@', 'B3@']
a o SCONJ ['M6', 'Z5']
que que SCONJ ['A13.3', 'A6.1+', 'Z5', 'Z8']
o o DET ['Z5']
comprimento comprimento NOUN ['N3.7', 'T1.3', 'M4']
de de ADP ['Z5']
ambos ambos DET ['N5']
os o DET ['Z5']
rios rio NOUN ['W3/M4', 'N5+']
permaneça permanecer VERB ['T2++', 'M8', 'N5.2+']
em em SCONJ ['A5.1+', 'G2.2+', 'A1.1.1', 'M6', 'O4.2+', 'Z5']
aberto aberto VERB ['A10+', 'T2+']
, , PUNCT ['PUNCT']
continuando continuar VERB ['Z99']
por por ADP ['N4', 'Z5', 'T1.2']
isso isso PRON ['N4', 'Z5', 'T1.2']
o o DET ['Z5']
debate debater NOUN ['Q2.1', 'Q2.1/A6.1-', 'Q2.1/E3-', 'Q2.2']
e e CCONJ ['Z5']
como comer ADP ['Z5']
tal tal PRON ['Z5']
, , PUNCT ['PUNCT']
continuando-se continuando-se VERB ['Z99']
a o SCONJ ['M6', 'Z5']
considerar considerar VERB ['Z99']
o o DET ['Z5']
Nilo Nilo PROPN ['Z2']
como comer ADP ['Z5']
o o DET ['Z5']
rio rir NOUN ['W3/M4', 'N5+']
mais mais ADV ['T1.3++', 'N3.7++', 'N3.3++', 'N3.2++']
longo longo ADJ ['T1.3++', 'N3.7++', 'N3.3++', 'N3.2++']
. . PUNCT ['PUNCT']

For Portuguese, the tagger also identifies and tags Multi-Word Expressions (MWEs). To find these MWEs, you can run the following:

print(f'Text\tPOS\tMWE start and end index\tUSAS Tags')

for token in output_doc:
    start, end = token._.pymusas_mwe_indexes[0]
    if (end - start) > 1:
        print(f'{token.text}\t{token.pos_}\t{(start, end)}\t{token._.pymusas_tags}')

This will output the following:

Text    POS     MWE start and end index    USAS Tags
por ADP (17, 19) ['N4', 'Z5', 'T1.2']
isso PRON (17, 19) ['N4', 'Z5', 'T1.2']
mais ADV (33, 35) ['T1.3++', 'N3.7++', 'N3.3++', 'N3.2++']
longo ADJ (33, 35) ['T1.3++', 'N3.7++', 'N3.3++', 'N3.2++']

Spanish

First download both the Spanish PyMUSAS RuleBasedTagger spaCy component and the small Spanish spaCy model:

pip install https://github.com/UCREL/pymusas-models/releases/download/es_dual_upos2usas_contextual-0.3.3/es_dual_upos2usas_contextual-0.3.3-py3-none-any.whl
python -m spacy download es_core_news_sm

Then create the tagger in a Python script:

import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('es_core_news_sm', exclude=['parser', 'ner'])
# Load the Spanish PyMUSAS rule-based tagger in a separate spaCy pipeline
spanish_tagger_pipeline = spacy.load('es_dual_upos2usas_contextual')
# Adds the Spanish PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=spanish_tagger_pipeline)

The tagger is now set up for tagging text through the spaCy pipeline like so (this example follows on from the last). The example text is taken from the Spanish Wikipedia page on the topic of Países Bajos:

text = "Los Países Bajos son un país soberano ubicado al noreste de la Europa continental y el país constituyente más grande de los cuatro que, junto con las islas de Aruba, Curazao y San Martín, forman el Reino de los Países Bajos."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')
Output:
Text            Lemma           POS     USAS Tags
Los el DET ['Z5']
Países Países PROPN ['Z2']
Bajos Bajos PROPN ['Z2']
son ser AUX ['A3+', 'L1', 'Z5']
un uno DET ['Z5', 'N1']
país país NOUN ['G1.1c', 'W3', 'M7']
soberano soberano ADJ ['Z99']
ubicado ubicado ADJ ['Z99']
al al ADP ['Z5']
noreste noreste NOUN ['Z99']
de de ADP ['Z5']
la el DET ['Z5']
Europa Europa PROPN ['Z2', 'S7', 'M7']
continental continental ADJ ['Z99']
y y CCONJ ['Z5', 'A1.8+']
el el DET ['Z5']
país país NOUN ['G1.1c', 'W3', 'M7']
constituyente constituyente ADJ ['Z99']
más más ADV ['A13.3', 'N6++', 'Z5']
grande grande ADJ ['N3.1+/A6.1+/A13.2+', 'A5']
de de ADP ['Z5']
los el DET ['Z5']
cuatro cuatro NUM ['N1']
que que PRON ['Z5', 'Z8']
, , PUNCT ['PUNCT']
junto junto ADJ ['A2.2', 'S5+', 'A1.8+']
con con ADP ['Z5', 'A4.1']
las el DET ['Z5']
islas isla NOUN ['W3/M7']
de de ADP ['Z5']
Aruba Aruba PROPN ['Z99']
, , PUNCT ['PUNCT']
Curazao Curazao PROPN ['Z99']
y y CCONJ ['Z5', 'A1.8+']
San San PROPN ['S9', 'S2', 'A4.1']
Martín Martín PROPN ['Z1', 'S2']
, , PUNCT ['PUNCT']
forman formar VERB ['T2+', 'A2.1+', 'A1.8+', 'A3+', 'A1.1.1']
el el DET ['Z5']
Reino Reino PROPN ['M7']
de de ADP ['Z5']
los el DET ['Z5']
Países Países PROPN ['Z2']
Bajos Bajos PROPN ['Z2']
. . PUNCT ['PUNCT']

For Spanish, the tagger also identifies and tags Multi-Word Expressions (MWEs). To find these MWEs, you can run the following:

print(f'Text\tPOS\tMWE start and end index\tUSAS Tags')

for token in output_doc:
    start, end = token._.pymusas_mwe_indexes[0]
    if (end - start) > 1:
        print(f'{token.text}\t{token.pos_}\t{(start, end)}\t{token._.pymusas_tags}')

This will output the following:

Text    POS     MWE start and end index    USAS Tags
Países PROPN (1, 3) ['Z2']
Bajos PROPN (1, 3) ['Z2']
Países PROPN (42, 44) ['Z2']
Bajos PROPN (42, 44) ['Z2']

Finnish

First download both the Finnish PyMUSAS RuleBasedTagger spaCy component and the small Finnish spaCy model:

pip install https://github.com/UCREL/pymusas-models/releases/download/fi_single_upos2usas_contextual-0.3.3/fi_single_upos2usas_contextual-0.3.3-py3-none-any.whl
python -m spacy download fi_core_news_sm

Then create the tagger in a Python script:

import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load("fi_core_news_sm", exclude=['tagger', 'parser', 'attribute_ruler', 'ner'])
# Load the Finnish PyMUSAS rule-based tagger in a separate spaCy pipeline
finnish_tagger_pipeline = spacy.load('fi_single_upos2usas_contextual')
# Adds the Finnish PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=finnish_tagger_pipeline)

The tagger is now set up for tagging text through the spaCy pipeline like so (this example follows on from the last). The example text is taken from the Finnish Wikipedia page on the topic of a bank as a financial institution:

text = "Pankki on instituutio, joka tarjoaa finanssipalveluita, erityisesti maksuliikenteen hoitoa ja luotonantoa."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')
print(f'{"Text":<20}{"Lemma":<20}{"POS":<8}USAS Tags')
for token in output_doc:
print(f'{token.text:<20}{token.lemma_:<20}{token.pos_:<8}{token._.pymusas_tags}')
Output:
Text                Lemma               POS     USAS Tags
Pankki pankki NOUN ['I1/H1', 'K5.2/I1.1']
on olla AUX ['A3+', 'A1.1.1', 'M6', 'Z5']
instituutio instituutio NOUN ['S5+']
, , PUNCT ['PUNCT']
joka joka PRON ['Z8', 'N5.1+']
tarjoaa tarjota VERB ['A9-', 'Q2.2', 'F1', 'S6+', 'A7+', 'I2.2']
finanssipalveluita finanssipalvelus NOUN ['Z99']
, , PUNCT ['PUNCT']
erityisesti erityisesti ADV ['A14']
maksuliikenteen maksuliikentete NOUN ['Z99']
hoitoa hoito NOUN ['B3', 'S4']
ja ja CCONJ ['Z5']
luotonantoa luotonanto NOUN ['Z99']
. . PUNCT ['PUNCT']
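
If you have many texts to tag, spaCy's nlp.pipe processes them as a stream, which is more efficient than calling nlp on each string in turn. Below is a brief sketch (our addition; the example texts are illustrative):

texts = [
    "Pankki on instituutio, joka tarjoaa finanssipalveluita.",
    "Pankki tarjoaa myös luotonantoa.",
]
for doc in nlp.pipe(texts):
    for token in doc:
        print(token.text, token._.pymusas_tags)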

Welsh

In this example, we will not be using spaCy for tokenization, lemmatization, and POS tagging; instead we will use the CyTag toolkit, which has been wrapped in a Docker container. Therefore, you will first need to install Docker.

We assume that you would like to tag the following text, which is stored in a file named welsh_text_example.txt. The example text is taken from the Welsh Wikipedia page on the topic of a bank as a financial institution, with an additional sentence at the end to demonstrate the tagger's Multi-Word Expression (MWE) identification and tagging.

welsh_text_example.txt
Sefydliad cyllidol yw bancwr neu fanc sy'n actio fel asiant talu ar gyfer cwsmeriaid, ac yn rhoi benthyg ac yn benthyg arian. Yn rhai gwledydd, megis yr Almaen a Siapan, mae banciau'n brif berchenogion corfforaethau diwydiannol, tra mewn gwledydd eraill, megis yr Unol Daleithiau, mae banciau'n cael eu gwahardd rhag bod yn berchen ar gwmniau sydd ddim yn rhai cyllidol. Adran Iechyd Cymru.

First, we will need to run the CyTag toolkit. More specifically, we will run version 1 of the toolkit, as the pre-configured Welsh PyMUSAS RuleBasedTagger includes a mapping from the POS tags produced in version 1 (the basic CorCenCC POS tagset) to the POS tags that the USAS lexicon uses (the USAS core POS tagset).

cat welsh_text_example.txt | docker run -i --rm ghcr.io/ucrel/cytag:1.0.4 > welsh_text_example.tsv

We now have a tsv version of the file that has been tokenized, lemmatized, and POS tagged. The welsh_text_example.tsv file should contain the following (we have added column headers here to explain what each column represents; these headers should not be in your file. Also note that the "Mutation" column is optional):

welsh_text_example.tsv
Line Number    Token    Sentence Index,Token Index    Lemma    Basic POS    Enriched POS    Mutation
1 Sefydliad 1,1 sefydliad E Egu
2 cyllidol 1,2 cyllidol Ans Anscadu
3 yw 1,3 bod B Bpres3u
4 bancwr 1,4 bancwr E Egu
5 neu 1,5 neu Cys Cyscyd
6 fanc 1,6 banc E Egu +sm
7 sy 1,7 bod B Bpres3perth
8 'n 1,8 yn U Uberf
9 actio 1,9 actio B Be
10 fel 1,10 fel Cys Cyscyd
11 asiant 1,11 asiant | asio E | B Egu | Bpres3ll
12 talu 1,12 talu B Be
13 ar 1,13 ar Ar Arsym
14 gyfer 1,14 cyfer E Egu +sm
15 cwsmeriaid 1,15 cwsmer E Egll
16 , 1,16 , Atd Atdcan
17 ac 1,17 a Cys Cyscyd
18 yn 1,18 yn U Uberf
19 rhoi 1,19 rhoi B Be
20 benthyg 1,20 benthyg E Egu
21 ac 1,21 a Cys Cyscyd
22 yn 1,22 yn U Uberf
23 benthyg 1,23 benthyg B Be
24 arian 1,24 arian E Egu
25 . 1,25 . Atd Atdt
26 Yn 2,1 yn Ar Arsym
27 rhai 2,2 rhai unk unk
28 gwledydd 2,3 gwlad E Ebll
29 , 2,4 , Atd Atdcan
30 megis 2,5 megis Cys Cyscyd
31 yr 2,6 y YFB YFB
32 Almaen 2,7 Almaen E Epb
33 a 2,8 a Cys Cyscyd
34 Siapan 2,9 Siapan E Epb
35 , 2,10 , Atd Atdcan
36 mae 2,11 bod B Bpres3u
37 banciau 2,12 banc E Egll
38 'n 2,13 yn U Utra
39 brif 2,14 brif unk unk
40 berchenogion 2,15 berchenogion unk unk
41 corfforaethau 2,16 corfforaeth E Ebll
42 diwydiannol 2,17 diwydiannol Ans Anscadu
43 , 2,18 , Atd Atdcan
44 tra 2,19 tra Cys Cyscyd
45 mewn 2,20 mewn Ar Arsym
46 gwledydd 2,21 gwlad E Ebll
47 eraill 2,22 arall Ans Anscadu
48 , 2,23 , Atd Atdcan
49 megis 2,24 megis Cys Cyscyd
50 yr 2,25 y YFB YFB
51 Unol 2,26 unol Ans Anscadu
52 Daleithiau 2,27 Daleithiau E Ep
53 , 2,28 , Atd Atdcan
54 mae 2,29 bod B Bpres3u
55 banciau 2,30 banc E Egll
56 'n 2,31 yn U Uberf
57 cael 2,32 cael B Be
58 eu 2,33 eu Rha Rhadib3ll
59 gwahardd 2,34 gwahardd B Be
60 rhag 2,35 rhag Ar Arsym
61 bod 2,36 bod B Be
62 yn 2,37 yn U Utra
63 berchen 2,38 perchen E Egu +sm
64 ar 2,39 ar Ar Arsym
65 gwmniau 2,40 gwmniau unk unk
66 sydd 2,41 bod B Bpres3perth
67 ddim 2,42 dim E Egu +sm
68 yn 2,43 yn U Utra
69 rhai 2,44 rhai unk unk
70 cyllidol 2,45 cyllidol Ans Anscadu
71 . 2,46 . Atd Atdt
72 Adran 3,1 adran E Ebu
73 Iechyd 3,2 iechyd E Egu
74 Cymru 3,3 Cymru E Epb
75 . 3,4 . Atd Atdt

Now that we have the token, lemma, and POS tag information, we can run the Welsh PyMUSAS RuleBasedTagger. First, download it:

pip install https://github.com/UCREL/pymusas-models/releases/download/cy_dual_basiccorcencc2usas_contextual-0.3.3/cy_dual_basiccorcencc2usas_contextual-0.3.3-py3-none-any.whl

Now we can run the tagger over the tsv data using the following Python script:

from pathlib import Path
from typing import List

import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Load the Welsh PyMUSAS rule-based tagger
nlp = spacy.load("cy_dual_basiccorcencc2usas_contextual")

tokens: List[str] = []
spaces: List[bool] = []
basic_pos_tags: List[str] = []
lemmas: List[str] = []

welsh_tagged_file = Path(Path.cwd(), 'welsh_text_example.tsv').resolve()

with welsh_tagged_file.open('r', encoding='utf-8') as welsh_tagged_data:
    for line in welsh_tagged_data:
        line = line.strip()
        if line:
            line_tags = line.split('\t')
            # Column order: line number, token, sentence/token index, lemma,
            # basic POS, enriched POS, (optional) mutation
            tokens.append(line_tags[1])
            lemmas.append(line_tags[3])
            basic_pos_tags.append(line_tags[4])
            spaces.append(True)

# As the tagger is a spaCy component that expects tokens, POS tags, and lemmas,
# we need to create a spaCy Doc object that will contain this information
doc = Doc(Vocab(), words=tokens, spaces=spaces, tags=basic_pos_tags, lemmas=lemmas)
output_doc = nlp(doc)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.tag_}\t{token._.pymusas_tags}')
Output:
Text            Lemma           POS     USAS Tags
Sefydliad sefydliad E ['S5+c', 'S7.1+', 'H1c', 'S1.1.1', 'T2+']
cyllidol cyllidol Ans ['I1']
yw bod B ['A3+', 'Z5']
bancwr bancwr E ['Z99']
neu neu Cys ['Z5']
fanc banc E ['I1.1', 'X2.6+', 'M1']
sy bod B ['A3+', 'Z5']
'n yn U ['Z5']
actio actio B ['A1.1.1', 'T1.1.2', 'A8', 'K4']
fel fel Cys ['Z5']
asiant asiant | asio E | B ['I2.1/S2mf', 'G3/S2mf', 'K4/S2mf']
talu talu B ['I1.2', 'A9-', 'I1.1/I3.1']
ar ar Ar ['Z5']
gyfer cyfer E ['M6', 'Q2.2', 'Q2.2', 'S7.1+', 'X4.2', 'K4']
cwsmeriaid cwsmer E ['I2.2/S2mf']
, , Atd ['PUNCT']
ac a Cys ['Z5']
yn yn U ['Z5']
rhoi rhoi B ['A9-', 'A1.1.1']
benthyg benthyg E ['A9-']
ac a Cys ['Z5']
yn yn U ['Z5']
benthyg benthyg B ['A9-']
arian arian E ['I1']
. . Atd ['PUNCT']
Yn yn Ar ['Z5']
rhai rhai unk ['A13.5']
gwledydd gwlad E ['M7']
, , Atd ['PUNCT']
megis megis Cys ['Z5']
yr y YFB ['Z5']
Almaen Almaen E ['Z2']
a a Cys ['Z5']
Siapan Siapan E ['Z2']
, , Atd ['PUNCT']
mae bod B ['A3+', 'Z5']
banciau banc E ['I1.1', 'X2.6+', 'M1']
'n yn U ['Z5']
brif brif unk ['Z99']
berchenogion berchenogion unk ['Z99']
corfforaethau corfforaeth E ['I2.1/S5c', 'G1.1c']
diwydiannol diwydiannol Ans ['I4']
, , Atd ['PUNCT']
tra tra Cys ['Z5']
mewn mewn Ar ['Z5']
gwledydd gwlad E ['M7']
eraill arall Ans ['A6.1-/Z8']
, , Atd ['PUNCT']
megis megis Cys ['Z5']
yr y YFB ['Z5']
Unol unol Ans ['S5+', 'A1.1.1']
Daleithiau Daleithiau E ['Z99']
, , Atd ['PUNCT']
mae bod B ['A3+', 'Z5']
banciau banc E ['I1.1', 'X2.6+', 'M1']
'n yn U ['Z5']
cael cael B ['A9+', 'Z5', 'X9.2+', 'A2.1+', 'A2.2', 'M1', 'M2', 'X2.5+', 'E4.1-']
eu eu Rha ['Z8']
gwahardd gwahardd B ['S7.4-']
rhag rhag Ar ['Z5']
bod bod B ['A3+', 'Z5']
yn yn U ['Z5']
berchen perchen E ['A9+/S2mf']
ar ar Ar ['Z5']
gwmniau gwmniau unk ['Z99']
sydd bod B ['A3+', 'Z5']
ddim dim E ['Z6/Z8']
yn yn U ['Z5']
rhai rhai unk ['A13.5']
cyllidol cyllidol Ans ['I1']
. . Atd ['PUNCT']
Adran adran E ['G1.1']
Iechyd iechyd E ['G1.1']
Cymru Cymru E ['Z2', 'Z1mf']
. . Atd ['PUNCT']

For Welsh, the tagger also identifies and tags Multi-Word Expressions (MWEs). To find these MWEs, you can run the following:

print(f'Text\tPOS\tMWE start and end index\tUSAS Tags')
for token in output_doc:
    start, end = token._.pymusas_mwe_indexes[0]
    if (end - start) > 1:
        print(f'{token.text}\t{token.tag_}\t{(start, end)}\t{token._.pymusas_tags}')

This will output the following:

Text    POS     MWE start and end index    USAS Tags
Adran E (71, 73) ['G1.1']
Iechyd E (71, 73) ['G1.1']

Indonesian

In this example, we will not be using spaCy for tokenization, lemmatization, and POS tagging; instead we will use the Indonesian TreeTagger, which has been wrapped in a Docker container. Therefore, you will first need to install Docker. After installing Docker you will need to build the Indonesian TreeTagger Docker container locally; by doing so you agree to the TreeTagger license (the license prohibits re-distributing the TreeTagger code, so please do not upload your built Docker container to a registry such as Docker Hub). Build it like so (the container is roughly 139 MB):

docker build -t indonesian-treetagger:1.0.0 https://github.com/UCREL/Indonesian-TreeTagger-Docker-Build.git#main

We assume that you would like to tag the following text, which is stored in a file named indonesian_text_example.txt. The example text is taken from the Indonesian Wikipedia page on the topic of a bank as a financial institution.

indonesian_text_example.txt
Bank adalah sebuah lembaga keuangan intermediasi yang umumnya didirikan dengan kewenangan untuk menerima simpanan uang, meminjamkan uang, dan menerbitkan surat sanggup bayar.

First, we will need to run the Indonesian TreeTagger:

cat indonesian_text_example.txt | docker run -i --rm indonesian-treetagger:1.0.0 > indonesian_text_example.tsv

We now have a tsv version of the file that has been tokenized, lemmatized, and POS tagged. The indonesian_text_example.tsv file should contain the following (we have added column headers here to explain what each column represents; these headers should not be in your file):

note

The POS tagset for Indonesian is not the USAS core or UPOS tagset, but rather the UI tagset.

indonesian_text_example.tsv
Token   POS Lemma
Bank NNP bank
adalah VB adalah
sebuah NND sebuah
lembaga keuangan NN lembaga
intermediasi NN intermediasi
yang SC yang
umumnya NN umumnya
didirikan VB diri
dengan IN dengan
kewenangan NN wenang
untuk SC untuk
menerima VB terima
simpanan NN simpan
uang NN uang
, Z ,
meminjamkan VB pinjam
uang NN uang
, Z ,
dan CC dan
menerbitkan VB terbit
surat NN surat
sanggup VB sanggup
bayar VB bayar
. Z .

Now that we have the token, lemma, and POS tag information, we can run the Indonesian PyMUSAS RuleBasedTagger. First, download it:

pip install https://github.com/UCREL/pymusas-models/releases/download/id_single_none_contextual-0.3.3/id_single_none_contextual-0.3.3-py3-none-any.whl

Now we can run the tagger over the tsv data using the following Python script:

from pathlib import Path
from typing import List

import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Load the Indonesian PyMUSAS rule based tagger
nlp = spacy.load("id_single_none_contextual")

tokens: List[str] = []
spaces: List[bool] = []
pos_tags: List[str] = []
lemmas: List[str] = []

indonesian_tagged_file = Path(Path.cwd(), 'indonesian_text_example.tsv').resolve()

with indonesian_tagged_file.open('r', encoding='utf-8') as indonesian_tagged_data:
    for line in indonesian_tagged_data:
        line = line.strip()
        if line:
            line_tags = line.split('\t')
            # Column order: token, POS, lemma
            tokens.append(line_tags[0])
            lemmas.append(line_tags[2])
            pos_tags.append(line_tags[1])
            spaces.append(True)

# As the tagger is a spaCy component that expects tokens, POS tags, and lemmas,
# we need to create a spaCy Doc object that will contain this information
doc = Doc(Vocab(), words=tokens, spaces=spaces, tags=pos_tags, lemmas=lemmas)
output_doc = nlp(doc)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.tag_}\t{token._.pymusas_tags}')
Output:
Text                Lemma               POS     USAS Tags
Bank bank NNP ['Z99']
adalah adalah VB ['Z99']
sebuah sebuah NND ['Z99']
lembaga keuangan lembaga NN ['Z99']
intermediasi intermediasi NN ['Z99']
yang yang SC ['Z5']
umumnya umumnya NN ['Z99']
didirikan diri VB ['Z99']
dengan dengan IN ['Z5']
kewenangan wenang NN ['Z99']
untuk untuk SC ['Z5']
menerima terima VB ['Z99']
simpanan simpan NN ['Z99']
uang uang NN ['Z99']
, , Z ['PUNCT']
meminjamkan pinjam VB ['Z99']
uang uang NN ['Z99']
, , Z ['PUNCT']
dan dan CC ['Z5']
menerbitkan terbit VB ['Z99']
surat surat NN ['Z99']
sanggup sanggup VB ['Z99']
bayar bayar VB ['Z99']
. . Z ['PUNCT']

English

First download both the English PyMUSAS RuleBasedTagger spaCy component and the small English spaCy model:

pip install https://github.com/UCREL/pymusas-models/releases/download/en_dual_none_contextual-0.3.3/en_dual_none_contextual-0.3.3-py3-none-any.whl
python -m spacy download en_core_web_sm

Then create the tagger in a Python script:

import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner'])
# Load the English PyMUSAS rule-based tagger in a separate spaCy pipeline
english_tagger_pipeline = spacy.load('en_dual_none_contextual')
# Adds the English PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=english_tagger_pipeline)

The tagger is now set up for tagging text through the spaCy pipeline like so (this example follows on from the last). The example text is taken from the English Wikipedia page on the topic of The Nile River; we have capitalised the "n" in "Northeastern":

text = "The Nile is a major north-flowing river in Northeastern Africa."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')
Output:
Text            Lemma           POS     USAS Tags
The the DET ['Z5']
Nile Nile PROPN ['Z2']
is be AUX ['A3+', 'Z5']
a a DET ['Z5']
major major ADJ ['A11.1+', 'N3.2+']
north north NOUN ['M6']
- - PUNCT ['PUNCT']
flowing flow VERB ['M4', 'M1']
river river NOUN ['W3/M4', 'N5+']
in in ADP ['Z5']
Northeastern Northeastern PROPN ['Z1mf', 'Z3c']
Africa Africa PROPN ['Z1mf', 'Z3c']
. . PUNCT ['PUNCT']

For English, the tagger also identifies and tags Multi-Word Expressions (MWEs). To find these MWEs, you can run the following:

print(f'Text\tPOS\tMWE start and end index\tUSAS Tags')

for token in output_doc:
    start, end = token._.pymusas_mwe_indexes[0]
    if (end - start) > 1:
        print(f'{token.text}\t{token.pos_}\t{(start, end)}\t{token._.pymusas_tags}')

This will output the following:

Text            POS             MWE start and end index     USAS Tags
Northeastern PROPN (10, 12) ['Z1mf', 'Z3c']
Africa PROPN (10, 12) ['Z1mf', 'Z3c']
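
Once a document is tagged, you will often want aggregate counts rather than a token-by-token listing. As a closing sketch (our addition), collections.Counter over the first tag of each token, which PyMUSAS treats as the most likely, gives a quick semantic profile of the text:

from collections import Counter

# Each token carries a ranked list of candidate tags; take the first as the best guess
tag_counts = Counter(token._.pymusas_tags[0] for token in output_doc)
for usas_tag, count in tag_counts.most_common():
    print(f'{usas_tag}\t{count}')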