lexicon_entry
pymusas.rankers.lexicon_entry
LexiconEntryRanker
class LexiconEntryRanker(Serialise)
An abstract class that defines the basic methods, __call__,
to_bytes, and from_bytes, that are required for all
LexiconEntryRankers.
Each lexicon entry match is represented by a
pymusas.rankers.ranking_meta_data.RankingMetaData object.
Lower ranked lexicon entry matches should be given priority when making tagging decisions. A rank of 0 is better than a rank of 1.
A LexiconEntryRanker, when called via __call__, returns a tuple of two Lists
whereby each entry in each list corresponds to a token:
- A List[int] containing the ranks of the lexicon entry matches for that token. Note that the List can be empty if a token has no lexicon entry matches.
- An Optional[RankingMetaData] that is the global lowest ranked entry match for that token. If the value is None then no global lowest ranked entry can be found for that token. If the RankingMetaData represents more than one token, like a Multi Word Expression (MWE) match, then those associated tokens will have the same RankingMetaData object as the global lowest ranked entry match.
The tagger will have to decide how to handle global lowest ranked
matches with a value of None; a suggested approach would be to assign an
unmatched/unknown semantic tag to those tokens.
The reason for adding the second list is that the global lowest
ranked match is not the same as the local/token lowest ranked match, due
to the potential of overlapping matches. For example, North East London brewery
can have matches of North East, North, and East London brewery. The local
lowest rank for North would be North East, but as we have a lower ranked
match that uses East, namely East London brewery, the global lowest rank
for North would be North.
__call__
class LexiconEntryRanker(Serialise):
| ...
| @abstractmethod
| def __call__(
| self,
| token_ranking_data: List[List[RankingMetaData]]
| ) -> Tuple[List[List[int]], List[Optional[RankingMetaData]]]
For each token it returns a List of rankings for each lexicon entry
match and the optional pymusas.rankers.ranking_meta_data.RankingMetaData
object of the global lowest ranked match for each token.
Parameters¶
- token_ranking_data :
List[List[RankingMetaData]]
For each token a List of pymusas.rankers.ranking_meta_data.RankingMetaData representing the lexicon entry matches.
Returns¶
Tuple[List[List[int]], List[Optional[RankingMetaData]]]
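To make the contract concrete, below is a minimal sketch of a custom ranker. FirstMatchRanker is hypothetical, not part of pymusas, and it assumes that to_bytes and from_bytes are the only other abstract methods that need implementing; it also skips the overlapping-match bookkeeping a real ranker, such as ContextualRuleBasedRanker below, must do.
from typing import List, Optional, Tuple
from pymusas.rankers.lexicon_entry import LexiconEntryRanker
from pymusas.rankers.ranking_meta_data import RankingMetaData

class FirstMatchRanker(LexiconEntryRanker):
    # Toy ranker: a match's rank is its order of appearance for that token.
    def __call__(
        self,
        token_ranking_data: List[List[RankingMetaData]]
    ) -> Tuple[List[List[int]], List[Optional[RankingMetaData]]]:
        # Rank 0 is the best, so earlier matches win.
        token_rankings = [list(range(len(matches)))
                          for matches in token_ranking_data]
        # Naively take each token's first match as its global lowest; a real
        # ranker must also resolve overlapping matches across tokens.
        global_lowest = [matches[0] if matches else None
                         for matches in token_ranking_data]
        return token_rankings, global_lowest

    def to_bytes(self) -> bytes:
        return b''  # No state to serialise in this toy example.

    @staticmethod
    def from_bytes(bytes_data: bytes) -> 'FirstMatchRanker':
        return FirstMatchRanker()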
ContextualRuleBasedRanker
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| def __init__(
| self,
| maximum_n_gram_length: int,
| maximum_number_wildcards: int
| ) -> None
The contextual rule based ranker creates ranks based on the rules stated below.
Each lexicon entry match is represented by a
pymusas.rankers.ranking_meta_data.RankingMetaData object.
Lower ranked lexicon entry matches should be given priority when making
tagging decisions. See the LexiconEntryRanker class docstring for
more details on the returned value of the __call__ method.
Ranking Rules:
The ranking of lexicon entries is based on the following rules, which follow the six heuristics stated at the top of column 2 on page 4 of Piao et al. 2003:
First we create an initial ranking based on lexicon entry type:
- Multi Word Expression (MWE) entries are ranked lower than single word entries, and Non-Special entries are ranked lower than wildcard entries.
Then within these rankings we further rank based on:
- Longer entries, based on n-gram length, are ranked lower.
- Entries with fewer wildcards are ranked lower.
Then we apply the following contextual ranking rules:
- Whether the POS information was excluded in the match; if so, these are ranked higher. This is only True when the match ignores the POS information for single word lexicon entries, and is always False for a MWE lexicon entry match.
- Whether the lexicon entry was matched on: Token < Lemma < Lower cased token < Lower cased lemma. Token is ranked lowest and lower cased lemma highest.
- The lexicon entry that first appears in the text is ranked lowest; this is required for matches that do not apply to the same sequence of tokens.
In the case whereby the global lowest ranked lexicon entry match is joint ranked with another entry, it is random which lexicon entry match is chosen.
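To make the first two rules concrete, here is a hedged sketch using the classes documented on this page; the two token text ski boot, the Z1 semantic tag, and the match values are illustrative assumptions, not library output:
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.rankers.ranking_meta_data import RankingMetaData
from pymusas.lexicon_collection import LexiconType
from pymusas.rankers.lexical_match import LexicalMatch
# A 2-gram MWE match over both tokens and a single word match over the first.
mwe_match = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 2, 0, False,
                            LexicalMatch.TOKEN, 0, 2,
                            'ski_noun boot_noun', ('Z1',))
single_match = RankingMetaData(LexiconType.SINGLE_NON_SPECIAL, 1, 0, False,
                               LexicalMatch.TOKEN, 0, 1, 'ski_noun', ('Z1',))
token_ranking_data = [[mwe_match, single_match], [mwe_match]]
ranker = ContextualRuleBasedRanker(2, 0)
token_rankings, global_lowest = ranker(token_ranking_data)
# The MWE match is ranked lower (better) than the single word match, so it
# should become the global lowest ranked match for both tokens.
assert token_rankings[0][0] < token_rankings[0][1]
assert global_lowest == [mwe_match, mwe_match]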
Parameters¶
- maximum_n_gram_length :
int
The largest n-gram rule match that will be encountered, e.g. a match of ski_noun boot_noun will have an n-gram length of 2. - maximum_number_wildcards :
int
The number of wildcards in the rule that contains the most wildcards, e.g. the rule ski_* *_noun would contain 2 wildcards. This can be 0 if you have no wildcard rules.
Instance Attributes¶
- n_gram_number_indexes :
int
The number of indexes that each n-gram length value should have when converting the n-gram length to a string using pymusas.rankers.lexicon_entry.ContextualRuleBasedRanker.int_2_str. - wildcards_number_indexes :
int
The number of indexes that each wildcard count value should have when converting the wildcard count value to a string using pymusas.rankers.lexicon_entry.ContextualRuleBasedRanker.int_2_str. - n_gram_ranking_dictionary :
Dict[int, int]
Maps each n-gram length to its rank value; the larger the n-gram length, the lower its rank.
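As a quick illustrative check (the exact mapping is an assumption based on the inverse relationship described above):
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
ranker = ContextualRuleBasedRanker(3, 0)
# Larger n-gram lengths map to lower (better) ranks.
assert ranker.n_gram_ranking_dictionary == {3: 1, 2: 2, 1: 3}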
to_bytes
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| def to_bytes(self) -> bytes
Serialises the ContextualRuleBasedRanker to a bytestring.
Returns¶
bytes
from_bytes
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| @staticmethod
| def from_bytes(bytes_data: bytes) -> "ContextualRuleBasedRanker"
Loads ContextualRuleBasedRanker from the given bytestring and
returns it.
Parameters¶
- bytes_data :
bytes
The bytestring to load.
Returns¶
ContextualRuleBasedRanker
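For example, a serialisation round trip returns an equal ranker:
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
ranker = ContextualRuleBasedRanker(3, 0)
loaded_ranker = ContextualRuleBasedRanker.from_bytes(ranker.to_bytes())
assert ranker == loaded_ranker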
get_construction_arguments
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| @staticmethod
| def get_construction_arguments(
| rules: List['Rule']
| ) -> Tuple[int, int]
Given a List of rules it will return the maximum_n_gram_length and
maximum_number_wildcards from the lexicon collections that those
pymusas.taggers.rules.rule.Rule(s) are based on. The output from
this function can then be used as the arguments to the constructor of
ContextualRuleBasedRanker.
Parameters¶
- rules :
List[Rule]
A List of rules. This List is typically required when creating a pymusas.taggers.rule_based.RuleBasedTagger tagger.
Returns¶
Tuple[int, int]
Examples¶
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.mwe import MWERule
from pymusas.lexicon_collection import MWELexiconCollection
# Load the Portuguese MWE lexicon and build a MWE rule from it.
pt_mwe_lexicon_url = "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/master/Portuguese/mwe-pt.tsv"
mwe_dict = MWELexiconCollection.from_tsv(pt_mwe_lexicon_url)
mwe_rule = MWERule(mwe_dict)
# Derive the ranker's constructor arguments from the rules.
ranker_construction_arguments = ContextualRuleBasedRanker.get_construction_arguments([mwe_rule])
ranker = ContextualRuleBasedRanker(*ranker_construction_arguments)
int_2_str
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| @staticmethod
| def int_2_str(int_value: int, number_indexes: int) -> str
Converts the integer, int_value, to a string with number_indexes,
e.g. 10 and 05 both have number_indexes of 2 and 001, 020,
and 211 have number_indexes of 3.
Parameters¶
- int_value :
int
The integer to convert to a string with the given number_indexes. - number_indexes :
int
The number of indexes the int_value should have in the returned string.
Returns¶
str
Raises¶
- ValueError
If the number of indexes of int_value, when converted to a string, is greater than the given number_indexes.
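A short sketch of the expected behaviour, assuming the zero padding implied by the 05 and 001 examples above:
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
assert ContextualRuleBasedRanker.int_2_str(5, 2) == '05'
assert ContextualRuleBasedRanker.int_2_str(10, 2) == '10'
# 100 needs 3 indexes, so requesting 2 should raise a ValueError.
try:
    ContextualRuleBasedRanker.int_2_str(100, 2)
except ValueError:
    pass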
get_global_lowest_ranks
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| @staticmethod
| def get_global_lowest_ranks(
| token_ranking_data: List[List[RankingMetaData]],
| token_rankings: List[List[int]],
| ranking_data_to_exclude: Optional[Set[RankingMetaData]] = None
| ) -> List[Optional[RankingMetaData]]
Returns the global lowest ranked entry match for each token. If the value
is None then no global lowest ranked entry can be found for that token.
If the RankingMetaData represents more than one token, like a Multi
Word Expression (MWE) match, then those associated tokens will have the
same RankingMetaData object as the global lowest ranked entry match.
Time complexity: given N, the number of tokens, M, the number of unique ranking data, and P, the total number of ranking data (non-unique), the time complexity is:
O(N + P) + O(M log M) + O(M)
Parameters¶
- token_ranking_data :
List[List[RankingMetaData]]
For each token a List of pymusas.rankers.ranking_meta_data.RankingMetaData representing the lexicon entry matches. - token_rankings :
List[List[int]]
For each token, the ranks of the lexicon entry matches. Note that the List can be empty if a token has no lexicon entry matches. - ranking_data_to_exclude :
Set[RankingMetaData], optional (default = None)
Any pymusas.rankers.ranking_meta_data.RankingMetaData to exclude from the ranking selection; this can be useful when wanting to get the next best global rank for each token.
Raises¶
AssertionError
If the length of token_ranking_data is not equal to the length of token_rankings, for both the outer and inner Lists.
Examples¶
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.rankers.ranking_meta_data import RankingMetaData
from pymusas.lexicon_collection import LexiconType
from pymusas.rankers.lexical_match import LexicalMatch
# Two overlapping MWE matches over the text `North East London brewery`:
# north_east spans token indexes 0-2 and east_london_brewery spans 1-4
# (end indexes are exclusive).
north_east = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 2, 0,
                             False, LexicalMatch.TOKEN, 0, 2,
                             'North_noun East_noun', ('Z1',))
east_london_brewery = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 3, 0,
                                      False, LexicalMatch.TOKEN, 1, 4,
                                      'East_noun London_noun brewery_noun', ('Z1',))
token_ranking_data = [
[
north_east
],
[
north_east,
east_london_brewery
],
[
east_london_brewery
],
[
east_london_brewery
]
]
token_rankings = [[120110], [120110, 110111], [110111], [110111]]
expected_lowest_ranked_matches = [None, east_london_brewery,
east_london_brewery, east_london_brewery]
assert (ContextualRuleBasedRanker.get_global_lowest_ranks(token_ranking_data, token_rankings, None)
== expected_lowest_ranked_matches)
Following on from the previous example, we now want to find the next best
global match for each token, so we exclude the current best global match
for each token, which is the east_london_brewery match:
expected_lowest_ranked_matches = [north_east, north_east, None, None]
ranking_data_to_exclude = {east_london_brewery}
assert (ContextualRuleBasedRanker.get_global_lowest_ranks(token_ranking_data, token_rankings,
ranking_data_to_exclude)
== expected_lowest_ranked_matches)
__call__
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| def __call__(
| self,
| token_ranking_data: List[List[RankingMetaData]]
| ) -> Tuple[List[List[int]], List[Optional[RankingMetaData]]]
For each token it returns a List of rankings for each lexicon entry
match and the optional pymusas.rankers.ranking_meta_data.RankingMetaData
object of the global lowest ranked match for each token.
See the ranking rules in the class docstring for details on how each lexicon entry match is ranked.
Time complexity: given N, the number of tokens, M, the number of unique ranking data, and P, the total number of ranking data (non-unique), the time complexity is:
O(3(N + P)) + O(M log M) + O(M)
Parameters¶
- token_ranking_data :
List[List[RankingMetaData]]
For each token a List of pymusas.rankers.ranking_meta_data.RankingMetaData representing the lexicon entry matches.
Returns¶
Tuple[List[List[int]], List[Optional[RankingMetaData]]]
Examples¶
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.rankers.ranking_meta_data import RankingMetaData
from pymusas.lexicon_collection import LexiconType
from pymusas.rankers.lexical_match import LexicalMatch
north_east = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 2, 0,
False, LexicalMatch.TOKEN, 0, 2,
'North_noun East_noun', ('Z1',))
east_london_brewery = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 3, 0,
False, LexicalMatch.TOKEN, 1, 4,
'East_noun London_noun brewery_noun', ('Z1',))
token_ranking_data = [
[
north_east
],
[
north_east,
east_london_brewery
],
[
east_london_brewery
],
[
east_london_brewery
]
]
expected_ranks = [[120110], [120110, 110111], [110111], [110111]]
expected_lowest_ranked_matches = [None, east_london_brewery,
east_london_brewery, east_london_brewery]
ranker = ContextualRuleBasedRanker(3, 0)
assert ((expected_ranks, expected_lowest_ranked_matches)
== ranker(token_ranking_data))
__eq__
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| def __eq__(self, other: object) -> bool
Given another object to compare to, it will return True if the other
object is of the same class and was initialised with the same
maximum_n_gram_length and maximum_number_wildcards values.
Parameters¶
- other :
object
The object to compare to.
Returns¶
bool
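For example:
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
assert ContextualRuleBasedRanker(3, 0) == ContextualRuleBasedRanker(3, 0)
assert ContextualRuleBasedRanker(3, 0) != ContextualRuleBasedRanker(2, 1)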