rule_based
pymusas.taggers.rule_based
RuleBasedTagger
class RuleBasedTagger:
| ...
| def __init__(
| self,
| rules: List[Rule],
| ranker: LexiconEntryRanker,
| default_punctuation_tags: Optional[Set[str]] = None,
| default_number_tags: Optional[Set[str]] = None
| ) -> None
When called through __call__ with a sequence of tokens and their associated
linguistic data (lemma and Part Of Speech (POS) tag), the tagger applies one
or more pymusas.taggers.rules.rule.Rule rules to create a list of candidate
tags for each token in the sequence. Each candidate for a token, represented
as a pymusas.rankers.ranking_meta_data.RankingMetaData object, is then ranked
using a pymusas.rankers.lexicon_entry.LexiconEntryRanker ranker. The best
candidate and its associated tag(s) for each token are returned along with a
List of token indexes indicating whether the token is part of a Multi Word
Expression (MWE).
If a token cannot be tagged, the following fallback is applied in order:
- If the token's POS tag is in default_punctuation_tags, it is assigned the tag PUNCT.
- If the token's POS tag is in default_number_tags, it is assigned the tag N1.
- Otherwise, it is assigned the default tag Z99.
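The fallback order above can be sketched in plain Python. This is only an illustration of the documented behaviour, not the library's implementation: the function name fallback_tags is hypothetical, and the default sets {'punc'} and {'num'} are taken from the parameter descriptions on this page.

```python
from typing import List, Optional, Set


def fallback_tags(pos_tag: str,
                  default_punctuation_tags: Optional[Set[str]] = None,
                  default_number_tags: Optional[Set[str]] = None) -> List[str]:
    """Hypothetical helper that mirrors the documented fallback order."""
    # Defaults as documented: set(['punc']) and set(['num']) when None is given.
    punctuation_tags = ({'punc'} if default_punctuation_tags is None
                        else default_punctuation_tags)
    number_tags = ({'num'} if default_number_tags is None
                   else default_number_tags)

    # 1. Punctuation POS tags are checked first.
    if pos_tag in punctuation_tags:
        return ['PUNCT']
    # 2. Number POS tags are checked second.
    if pos_tag in number_tags:
        return ['N1']
    # 3. Everything else receives the default tag.
    return ['Z99']
```

A token with no POS information (an empty string) falls through both checks and receives Z99.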
Parameters
- rules : List[pymusas.taggers.rules.rule.Rule]
  A list of rules to apply to the sequence of tokens in __call__. The output from each rule is concatenated and given to the ranker.
- ranker : pymusas.rankers.lexicon_entry.LexiconEntryRanker
  A ranker to rank the output from all of the rules.
- default_punctuation_tags : Set[str], optional (default = None)
  The POS tags that represent punctuation. If None, the Set set(['punc']) is used.
- default_number_tags : Set[str], optional (default = None)
  The POS tags that represent numbers. If None, the Set set(['num']) is used.
Instance Attributes
- rules : List[pymusas.taggers.rules.rule.Rule]
  The given rules.
- ranker : pymusas.rankers.lexicon_entry.LexiconEntryRanker
  The given ranker.
- default_punctuation_tags : Set[str]
  The given default_punctuation_tags.
- default_number_tags : Set[str]
  The given default_number_tags.
Examples
from pymusas.lexicon_collection import LexiconCollection
from pymusas.taggers.rule_based import RuleBasedTagger
from pymusas.taggers.rules.single_word import SingleWordRule
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.pos_mapper import BASIC_CORCENCC_TO_USAS_CORE
welsh_lexicon_url = 'https://raw.githubusercontent.com/apmoore1/Multilingual-USAS/master/Welsh/semantic_lexicon_cy.tsv'
lexicon_lookup = LexiconCollection.from_tsv(welsh_lexicon_url, include_pos=True)
lemma_lexicon_lookup = LexiconCollection.from_tsv(welsh_lexicon_url, include_pos=False)
single_word_rule = SingleWordRule(lexicon_lookup, lemma_lexicon_lookup,
                                  BASIC_CORCENCC_TO_USAS_CORE)
ranker = ContextualRuleBasedRanker(1, 0)
tagger = RuleBasedTagger([single_word_rule], ranker)
__call__
class RuleBasedTagger:
| ...
| def __call__(
| self,
| tokens: List[str],
| lemmas: List[str],
| pos_tags: List[str]
| ) -> List[Tuple[List[str],
| List[Tuple[int, int]]
| ]]
Given a List of tokens, their associated lemmas, and Part Of Speech (POS)
tags, it returns for each token:
- A List of tags. The first tag in the List is the most likely tag.
- A List of Tuples, where each Tuple gives the start and end token index of the associated Multi Word Expression (MWE). If the List contains more than one Tuple, the MWE is discontinuous. For single word expressions the List contains only one Tuple, which is (token_start_index, token_start_index + 1).
All the generated tags and MWEs are based on the rules and ranker given to this model.
NOTE: this tagger has been designed to be flexible with the amount of
resources available. If you do not have POS or lemma information, assign
them a List of empty strings, one per token.
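As a rough sketch of the note above, the wrapper below fills in missing lemma or POS information with empty strings before calling the tagger. The function name tag_with_available_data is hypothetical and is not part of pymusas; it assumes only that the tagger is callable with (tokens, lemmas, pos_tags) as documented here.

```python
from typing import Callable, List, Optional, Tuple

# Return type of RuleBasedTagger.__call__ as documented on this page.
TaggerOutput = List[Tuple[List[str], List[Tuple[int, int]]]]
Tagger = Callable[[List[str], List[str], List[str]], TaggerOutput]


def tag_with_available_data(tagger: Tagger,
                            tokens: List[str],
                            lemmas: Optional[List[str]] = None,
                            pos_tags: Optional[List[str]] = None
                            ) -> TaggerOutput:
    """Hypothetical wrapper: substitute empty strings for missing data."""
    # One empty string per token keeps all three lists the same length,
    # which the tagger requires.
    if lemmas is None:
        lemmas = [''] * len(tokens)
    if pos_tags is None:
        pos_tags = [''] * len(tokens)
    return tagger(tokens, lemmas, pos_tags)
```

With a tagger built as in the class example, tag_with_available_data(tagger, tokens) would tag from the token text alone.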
Parameters
- tokens : List[str]
  A List of the full text form of the tokens to be tagged.
- lemmas : List[str]
  The List of lemma/base forms of the tokens to be tagged.
- pos_tags : List[str]
  The List of POS tags of the tokens to be tagged.
Returns
List[Tuple[List[str], List[Tuple[int, int]]]]
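To make the return structure concrete, the sketch below walks over a value of this type and reports, for each token, its most likely tag and whether it belongs to an MWE. The helper name describe_output is hypothetical, and the sample output is hand-written for illustration; it is not real output from the tagger or from any lexicon.

```python
from typing import List, Tuple

# Return type of RuleBasedTagger.__call__ as documented on this page.
TaggerOutput = List[Tuple[List[str], List[Tuple[int, int]]]]


def describe_output(tokens: List[str],
                    output: TaggerOutput) -> List[Tuple[str, str, bool]]:
    """Hypothetical helper: (token, most likely tag, is part of an MWE)."""
    rows = []
    for token, (tags, mwe_indexes) in zip(tokens, output):
        # The first tag in the list is the most likely tag.
        best_tag = tags[0]
        # A single word expression has exactly one (start, start + 1) Tuple.
        is_single_word = (len(mwe_indexes) == 1
                          and mwe_indexes[0][1] - mwe_indexes[0][0] == 1)
        rows.append((token, best_tag, not is_single_word))
    return rows


# Hand-written illustrative values, NOT real tagger output.
example_tokens = ['New', 'York', 'is', 'big']
example_output = [
    (['Z2'], [(0, 2)]),     # part of a two-token MWE spanning tokens 0 and 1
    (['Z2'], [(0, 2)]),
    (['A3+'], [(2, 3)]),    # single word expression
    (['N3.2+'], [(3, 4)]),  # single word expression
]
example_rows = describe_output(example_tokens, example_output)
```

A discontinuous MWE would appear as more than one Tuple in the second element, which this helper also reports as part of an MWE.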
Raises
ValueError
If tokens, lemmas, and pos_tags are not all of the same length.
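If you prefer to catch a length mismatch before calling the tagger, a check along the following lines would do it. This is a minimal sketch under the assumption that only the three list lengths matter; check_same_length is a hypothetical name, and the error message is illustrative rather than the one pymusas raises.

```python
from typing import List


def check_same_length(tokens: List[str],
                      lemmas: List[str],
                      pos_tags: List[str]) -> None:
    """Hypothetical pre-check mirroring the documented ValueError condition."""
    lengths = {
        'tokens': len(tokens),
        'lemmas': len(lemmas),
        'pos_tags': len(pos_tags),
    }
    # All three lengths are equal only if the set of lengths has one member.
    if len(set(lengths.values())) != 1:
        raise ValueError('tokens, lemmas, and pos_tags must all be the same '
                         f'length, but got: {lengths}')
```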