pymusas.taggers.rule_based
RuleBasedTagger
class RuleBasedTagger:
| ...
| def __init__(
| self,
| rules: List[Rule],
| ranker: LexiconEntryRanker,
| default_punctuation_tags: Optional[Set[str]] = None,
| default_number_tags: Optional[Set[str]] = None
| ) -> None
When called, through __call__, with a sequence of tokens and their
associated linguistic data (lemma, Part Of Speech (POS) tag), the tagger
applies one or more pymusas.taggers.rules.rule.Rules
to create a list of candidate tags for each token in the sequence.
Each candidate, represented as a
pymusas.rankers.ranking_meta_data.RankingMetaData object, is then ranked
using a pymusas.rankers.lexicon_entry.LexiconEntryRanker. The best
candidate and its associated tag(s) for each token are then returned, along
with a List of token indexes indicating whether the token is part of a Multi
Word Expression (MWE).
If a token cannot be tagged, the following fallback process applies:
- If the token's POS tag is in default_punctuation_tags then it will be assigned the tag PUNCT.
- If the token's POS tag is in default_number_tags then it will be assigned the tag N1.
- Otherwise it will be assigned the default tag Z99.
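The fallback process above can be sketched as a small standalone function. This is a hypothetical illustration of the described behaviour, not the library's actual implementation; the function name fallback_tag is an assumption.

```python
from typing import Optional, Set


def fallback_tag(pos_tag: str,
                 punctuation_tags: Optional[Set[str]] = None,
                 number_tags: Optional[Set[str]] = None) -> str:
    # Mirror the documented defaults: {'punc'} and {'num'} when None is given.
    punctuation_tags = punctuation_tags if punctuation_tags is not None else {'punc'}
    number_tags = number_tags if number_tags is not None else {'num'}
    if pos_tag in punctuation_tags:
        return 'PUNCT'
    if pos_tag in number_tags:
        return 'N1'
    # Default tag for tokens that could not be tagged.
    return 'Z99'
```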
Parameters
- rules :
List[pymusas.taggers.rules.rule.Rule]
A list of rules to apply to the sequence of tokens in __call__. The output from each rule is concatenated and given to the ranker.
- ranker :
pymusas.rankers.lexicon_entry.LexiconEntryRanker
A ranker to rank the output from all of the rules.
- default_punctuation_tags :
Set[str], optional (default = None)
The POS tags that represent punctuation. If None then the Set set(['punc']) will be used.
- default_number_tags :
Set[str], optional (default = None)
The POS tags that represent numbers. If None then the Set set(['num']) will be used.
Instance Attributes
- rules :
List[pymusas.taggers.rules.rule.Rule]
The given rules.
- ranker :
pymusas.rankers.lexicon_entry.LexiconEntryRanker
The given ranker.
- default_punctuation_tags :
Set[str]
The given default_punctuation_tags.
- default_number_tags :
Set[str]
The given default_number_tags.
Examples
from pymusas.lexicon_collection import LexiconCollection
from pymusas.taggers.rule_based import RuleBasedTagger
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.pos_mapper import BASIC_CORCENCC_TO_USAS_CORE
welsh_lexicon_url = 'https://raw.githubusercontent.com/apmoore1/Multilingual-USAS/master/Welsh/semantic_lexicon_cy.tsv'
lexicon_lookup = LexiconCollection.from_tsv(welsh_lexicon_url, include_pos=True)
lemma_lexicon_lookup = LexiconCollection.from_tsv(welsh_lexicon_url, include_pos=False)
single_word_rule = SingleWordRule(lexicon_lookup, lemma_lexicon_lookup,
BASIC_CORCENCC_TO_USAS_CORE)
ranker = ContextualRuleBasedRanker(1, 0)
tagger = RuleBasedTagger([single_word_rule], ranker)
__call__
class RuleBasedTagger:
| ...
| def __call__(
| self,
| tokens: List[str],
| lemmas: List[str],
| pos_tags: List[str]
| ) -> List[Tuple[List[str],
| List[Tuple[int, int]]
| ]]
Given a List of tokens, their associated lemmas, and
Part Of Speech (POS) tags, it returns for each token:
- A List of tags. The first tag in the List is the most likely tag.
- A List of Tuples, where each Tuple indicates the start and end token index of the associated Multi Word Expression (MWE). If the List contains more than one Tuple then the MWE is discontinuous. For single word expressions the List will only contain one Tuple, which will be (token_start_index, token_start_index + 1).

All the generated tags and MWEs are based on the rules and ranker given to this tagger.

NOTE: this tagger has been designed to be flexible with the amount of
resources available; if you do not have POS or lemma information, assign
them a List of empty strings.
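To make the return structure concrete, here is a hand-constructed example for a three-token sentence showing how the tags and MWE spans can be interpreted. The tag values and spans are hypothetical, not produced by a real lexicon; only the shapes match the documented return type.

```python
# Hypothetical output of shape List[Tuple[List[str], List[Tuple[int, int]]]].
# Span end indexes are exclusive, so (0, 2) covers tokens 0 and 1.
output = [
    (['Z4', 'Z99'], [(0, 2)]),  # token 0: most likely tag 'Z4', in an MWE
    (['Z4'], [(0, 2)]),         # token 1: same MWE span
    (['Z99'], [(2, 3)]),        # token 2: single word expression
]

for index, (tags, mwe_spans) in enumerate(output):
    most_likely_tag = tags[0]  # the first tag is the most likely
    in_mwe = any(end - start > 1 for start, end in mwe_spans)
    print(index, most_likely_tag, in_mwe)
```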
Parameters
- tokens :
List[str]
A List of the full text forms of the tokens to be tagged.
- lemmas :
List[str]
A List of the lemma/base forms of the tokens to be tagged.
- pos_tags :
List[str]
A List of the POS tags of the tokens to be tagged.
Returns
List[Tuple[List[str], List[Tuple[int, int]]]]
Raises
ValueError
If tokens, lemmas, and pos_tags are not all of the same length.
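The length check can be sketched as follows. This is an illustrative helper, not the library's code; the function name validate_lengths and the error message are assumptions.

```python
from typing import List


def validate_lengths(tokens: List[str], lemmas: List[str],
                     pos_tags: List[str]) -> None:
    # Raise if the three sequences are not all the same length.
    if not (len(tokens) == len(lemmas) == len(pos_tags)):
        raise ValueError('tokens, lemmas, and pos_tags must be the same '
                         f'length, got {len(tokens)}, {len(lemmas)}, '
                         f'and {len(pos_tags)}')
```

Note that passing a List of empty strings for lemmas or pos_tags (as suggested in the NOTE above) still satisfies this check, as long as the lengths match.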