lexicon_entry
pymusas.rankers.lexicon_entry
LexiconEntryRanker
class LexiconEntryRanker(Serialise)
An abstract class that defines the basic methods, `__call__`, `to_bytes`, and `from_bytes`, that are required for all `LexiconEntryRanker`s.
Each lexicon entry match is represented by a `pymusas.rankers.ranking_meta_data.RankingMetaData` object.
Lower ranked lexicon entry matches should be given priority when making tagging decisions. A rank of 0 is better than a rank of 1.
A `LexiconEntryRanker`, when called via `__call__`, returns a tuple of two `List`s whereby each entry in the lists corresponds to a token:

- The first `List` contains the ranks of the lexicon entry matches as a `List[int]`. Note that the `List` can be empty if a token has no lexicon entry matches.
- The second `List` contains an `Optional[RankingMetaData]` that is the global lowest ranked entry match for that token. If the value is `None` then no global lowest ranked entry can be found for that token. If the `RankingMetaData` represents more than one token, like a Multi Word Expression (MWE) match, then those associated tokens will have the same `RankingMetaData` object as the global lowest ranked entry match.
The tagger will have to decide how to handle global lowest ranked matches with a value of `None`; a suggested approach would be to assign an unmatched/unknown semantic tag to those tokens.
The reason for adding the second list is that the global lowest ranked match is not the same as the local/token lowest ranked match; this is due to the potential of overlapping matches. For example, `North East London brewery` can have matches of `North East`, `North`, and `East London brewery`. In this case the local lowest ranked match for `North` would be `North East`, but as we have a lower ranked match, `East London brewery`, that uses `East`, the global lowest ranked match for `North` would be `North`.
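To make this contract concrete, below is a minimal sketch of a custom `LexiconEntryRanker` subclass. It is illustrative only and not part of PyMUSAS: it ranks matches per token by list order, it ignores the MWE consistency behaviour described above, and it assumes the method signatures shown in this document.

from typing import List, Optional, Tuple

from pymusas.rankers.lexicon_entry import LexiconEntryRanker
from pymusas.rankers.ranking_meta_data import RankingMetaData

class FirstMatchRanker(LexiconEntryRanker):
    '''Toy ranker: the first match listed for a token is its best match.'''

    def __call__(
        self,
        token_ranking_data: List[List[RankingMetaData]]
    ) -> Tuple[List[List[int]], List[Optional[RankingMetaData]]]:
        # Rank matches by their position in each token's list, 0 being best.
        token_rankings = [list(range(len(matches)))
                          for matches in token_ranking_data]
        # Treat the first match, if any, as the global lowest ranked match.
        # A real ranker would also keep MWE matches consistent across all
        # of their associated tokens.
        global_lowest = [matches[0] if matches else None
                         for matches in token_ranking_data]
        return token_rankings, global_lowest

    def to_bytes(self) -> bytes:
        # This toy ranker has no state to serialise.
        return b''

    @staticmethod
    def from_bytes(bytes_data: bytes) -> 'FirstMatchRanker':
        return FirstMatchRanker()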
__call__
class LexiconEntryRanker(Serialise):
| ...
| @abstractmethod
| def __call__(
| self,
| token_ranking_data: List[List[RankingMetaData]]
| ) -> Tuple[List[List[int]], List[Optional[RankingMetaData]]]
For each token it returns a `List` of rankings for each lexicon entry match and the optional `pymusas.rankers.ranking_meta_data.RankingMetaData` object of the global lowest ranked match for each token.

Parameters

- token_ranking_data : `List[List[RankingMetaData]]`
  For each token, a `List` of `pymusas.rankers.ranking_meta_data.RankingMetaData` representing the lexicon entry matches.
Returns

Tuple[List[List[int]], List[Optional[RankingMetaData]]]

ContextualRuleBasedRanker
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| def __init__(
| self,
| maximum_n_gram_length: int,
| maximum_number_wildcards: int
| ) -> None
The contextual rule-based ranker creates ranks based on the rules stated below.

Each lexicon entry match is represented by a `pymusas.rankers.ranking_meta_data.RankingMetaData` object.

Lower ranked lexicon entry matches should be given priority when making tagging decisions. See the `LexiconEntryRanker` class docstring for more details on the returned value of the `__call__` method.
Ranking Rules:

The ranking of lexicon entries is based on the following rules, which in turn are based on the 6 heuristics stated at the top of column 2 on page 4 of Piao et al. 2003:

First we create an initial ranking based on lexicon entry type:

- Multi Word Expression (MWE) entries are ranked lower than single word entries, and non-special entries are ranked lower than wildcard entries.

Then within these rankings we further rank based on:

- Longer entries, based on n-gram length, are ranked lower.
- Entries with fewer wildcards are ranked lower.

Then we apply the following contextual ranking rules:

- Whether the POS information was excluded in the match: if so, these are ranked higher. This is only `True` when the match ignores the POS information for single word lexicon entries; it is always `False` for a MWE lexicon entry match.
- Whether the lexicon entry was matched on Token < Lemma < Lower cased token < Lower cased lemma. Token is the lowest ranked and lower cased lemma is the highest.
- The lexicon entry that first appears in the text is ranked lowest; this is required for matches that do not apply to the same sequence of tokens.

In the case where the global lowest ranked lexicon entry match is joint ranked with another entry, it is random which lexicon entry match is chosen.
Parameters

- maximum_n_gram_length : `int`
  The largest n-gram rule match that will be encountered, e.g. a match of `ski_noun boot_noun` will have an n-gram length of 2.
- maximum_number_wildcards : `int`
  The number of wildcards in the rule that contains the most wildcards, e.g. the rule `ski_* *_noun` would contain 2 wildcards. This can be 0 if you have no wildcard rules.
Instance Attributes

- n_gram_number_indexes : `int`
  The number of indexes that each n-gram length value should have when converting the n-gram length to a string using `pymusas.rankers.lexicon_entry.ContextualRuleBasedRanker.int_2_str`.
- wildcards_number_indexes : `int`
  The number of indexes that each wildcard count value should have when converting the wildcard count value to a string using `pymusas.rankers.lexicon_entry.ContextualRuleBasedRanker.int_2_str`.
- n_gram_ranking_dictionary : `Dict[int, int]`
  Maps the n-gram length to its rank value; the n-gram length is inverse to its rank, as the larger the n-gram length the lower its rank.
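As an illustration of these attributes, the exact values shown in the comments below are an assumption based on the descriptions above rather than output taken from the PyMUSAS documentation:

from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker

ranker = ContextualRuleBasedRanker(maximum_n_gram_length=3,
                                   maximum_number_wildcards=0)
# The larger the n-gram length the lower (better) its rank, so a mapping
# along the lines of {3: 1, 2: 2, 1: 3} would be expected here.
print(ranker.n_gram_ranking_dictionary)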
to_bytes
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| def to_bytes() -> bytes
Serialises the `ContextualRuleBasedRanker` to a bytestring.

Returns

bytes
from_bytes
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| @staticmethod
| def from_bytes(bytes_data: bytes) -> "ContextualRuleBasedRanker"
Loads `ContextualRuleBasedRanker` from the given bytestring and returns it.

Parameters

- bytes_data : `bytes`
  The bytestring to load.

Returns

ContextualRuleBasedRanker
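A brief round-trip sketch of the two serialisation methods, using the same construction arguments as the `__call__` examples later on this page:

from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker

ranker = ContextualRuleBasedRanker(3, 0)
# Serialise to a bytestring and load it back again.
loaded_ranker = ContextualRuleBasedRanker.from_bytes(ranker.to_bytes())
# Equality holds because both rankers were initialised with the same
# construction arguments, see the __eq__ method at the end of this page.
assert ranker == loaded_ranker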
get_construction_arguments
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| @staticmethod
| def get_construction_arguments(
| rules: List['Rule']
| ) -> Tuple[int, int]
Given a `List` of rules it will return the `maximum_n_gram_length` and `maximum_number_wildcards` from the lexicon collections that those `pymusas.taggers.rules.rule.Rule`(s) are based on. The output from this function can then be used as the arguments to the constructor of `ContextualRuleBasedRanker`.

Parameters

- rules : `List[Rule]`
  A `List` of rules. This `List` is typically required when creating a `pymusas.taggers.rule_based.RuleBasedTagger` tagger.

Returns

Tuple[int, int]
Examples
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.taggers.rules.mwe import MWERule
from pymusas.lexicon_collection import MWELexiconCollection
pt_mwe_lexicon_url = "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/master/Portuguese/mwe-pt.tsv"
mwe_dict = MWELexiconCollection.from_tsv(pt_mwe_lexicon_url)
mwe_rule = MWERule(mwe_dict)
ranker_construction_arguments = ContextualRuleBasedRanker.get_construction_arguments([mwe_rule])
ranker = ContextualRuleBasedRanker(*ranker_construction_arguments)
int_2_str
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| @staticmethod
| def int_2_str(int_value: int, number_indexes: int) -> str
Converts the integer, `int_value`, to a string with `number_indexes`, e.g. `10` and `05` both have `number_indexes` of 2, and `001`, `020`, and `211` have `number_indexes` of 3.
Parameters

- int_value : `int`
  The integer to convert to a string with the given `number_indexes`.
- number_indexes : `int`
  The number of indexes the `int_value` should have in the returned string.

Returns

str

Raises

- ValueError
  If the `number_indexes` of the `int_value`, when converted to a string, is greater than the given `number_indexes`.
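Based on the description above, the behaviour would be expected to look like the following (an illustrative sketch, not taken from the PyMUSAS test suite):

from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker

# Values are left padded with zeros up to `number_indexes` characters.
assert ContextualRuleBasedRanker.int_2_str(5, 2) == '05'
assert ContextualRuleBasedRanker.int_2_str(10, 2) == '10'
assert ContextualRuleBasedRanker.int_2_str(20, 3) == '020'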
get_global_lowest_ranks
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| @staticmethod
| def get_global_lowest_ranks(
| token_ranking_data: List[List[RankingMetaData]],
| token_rankings: List[List[int]],
| ranking_data_to_exclude: Optional[Set[RankingMetaData]] = None
| ) -> List[Optional[RankingMetaData]]
Returns the global lowest ranked entry match for each token. If the value is `None` then no global lowest ranked entry can be found for that token. If the `RankingMetaData` represents more than one token, like a Multi Word Expression (MWE) match, then those associated tokens will have the same `RankingMetaData` object as the global lowest ranked entry match.

Time Complexity: given N is the number of tokens, M is the number of unique ranking data, and P is the number of ranking data (non-unique), the time complexity is:

O(N + P) + O(M log M) + O(M)
Parameters

- token_ranking_data : `List[List[RankingMetaData]]`
  For each token, a `List` of `pymusas.rankers.ranking_meta_data.RankingMetaData` representing the lexicon entry matches.
- token_rankings : `List[List[int]]`
  For each token, contains the ranks of the lexicon entry matches. Note that the `List` can be empty if a token has no lexicon entry matches.
- ranking_data_to_exclude : `Set[RankingMetaData]`, optional (default = `None`)
  Any `pymusas.rankers.ranking_meta_data.RankingMetaData` to exclude from the ranking selection; this can be useful when wanting to get the next best global rank for each token.

Raises

- AssertionError
  If the length of `token_ranking_data` is not equal to the length of `token_rankings`, for both the outer and inner `List`s.
Examples
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.rankers.ranking_meta_data import RankingMetaData
from pymusas.lexicon_collection import LexiconType
from pymusas.rankers.lexical_match import LexicalMatch
north_east = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 2, 0,
False, LexicalMatch.TOKEN, 0, 2,
'North_noun East_noun', ('Z1',))
east_london_brewery = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 3, 0,
False, LexicalMatch.TOKEN, 1, 4,
'East_noun London_noun brewery_noun', ('Z1',))
token_ranking_data = [
[
north_east
],
[
north_east,
east_london_brewery
],
[
east_london_brewery
],
[
east_london_brewery
]
]
token_rankings = [[120110], [120110, 110111], [110111], [110111]]
expected_lowest_ranked_matches = [None, east_london_brewery,
east_london_brewery, east_london_brewery]
assert (ContextualRuleBasedRanker.get_global_lowest_ranks(token_ranking_data, token_rankings, None)
== expected_lowest_ranked_matches)
Following on from the previous example, we now want to find the next best global match for each token, so we exclude the current best global match for each token, which is the `east_london_brewery` match:
expected_lowest_ranked_matches = [north_east, north_east, None, None]
ranking_data_to_exclude = {east_london_brewery}
assert (ContextualRuleBasedRanker.get_global_lowest_ranks(token_ranking_data, token_rankings,
ranking_data_to_exclude)
== expected_lowest_ranked_matches)
__call__
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| def __call__(
| self,
| token_ranking_data: List[List[RankingMetaData]]
| ) -> Tuple[List[List[int]], List[Optional[RankingMetaData]]]
For each token it returns a `List` of rankings for each lexicon entry match and the optional `pymusas.rankers.ranking_meta_data.RankingMetaData` object of the global lowest ranked match for each token.

See the ranking rules in the class docstring for details on how each lexicon entry match is ranked.

Time Complexity: given N is the number of tokens, M is the number of unique ranking data, and P is the number of ranking data (non-unique), the time complexity is:

O(3(N + P)) + O(M log M) + O(M)
Parameters

- token_ranking_data : `List[List[RankingMetaData]]`
  For each token, a `List` of `pymusas.rankers.ranking_meta_data.RankingMetaData` representing the lexicon entry matches.
Returns

Tuple[List[List[int]], List[Optional[RankingMetaData]]]

Examples
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
from pymusas.rankers.ranking_meta_data import RankingMetaData
from pymusas.lexicon_collection import LexiconType
from pymusas.rankers.lexical_match import LexicalMatch
north_east = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 2, 0,
False, LexicalMatch.TOKEN, 0, 2,
'North_noun East_noun', ('Z1',))
east_london_brewery = RankingMetaData(LexiconType.MWE_NON_SPECIAL, 3, 0,
False, LexicalMatch.TOKEN, 1, 4,
'East_noun London_noun brewery_noun', ('Z1',))
token_ranking_data = [
[
north_east
],
[
north_east,
east_london_brewery
],
[
east_london_brewery
],
[
east_london_brewery
]
]
expected_ranks = [[120110], [120110, 110111], [110111], [110111]]
expected_lowest_ranked_matches = [None, east_london_brewery,
east_london_brewery, east_london_brewery]
ranker = ContextualRuleBasedRanker(3, 0)
assert ((expected_ranks, expected_lowest_ranked_matches)
== ranker(token_ranking_data))
__eq__
class ContextualRuleBasedRanker(LexiconEntryRanker):
| ...
| def __eq__(other: object) -> bool
Given another object to compare to, it will return `True` if the other object is the same class and was initialised with the same `maximum_n_gram_length` and `maximum_number_wildcards` values.
Parameters

- other : `object`
  The object to compare to.

Returns

bool
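A short sketch of this equality behaviour, illustrative and based on the description above:

from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker

# Same construction arguments, so the rankers compare equal.
assert ContextualRuleBasedRanker(3, 0) == ContextualRuleBasedRanker(3, 0)
# Different maximum_n_gram_length, so they do not.
assert ContextualRuleBasedRanker(3, 0) != ContextualRuleBasedRanker(2, 0)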