util
pymusas.taggers.rules.util
n_gram_indexes
def n_gram_indexes(
sequence: Sequence[Any],
min_n: int,
max_n: int
) -> Iterator[Tuple[int, int]]
Returns n-grams as indexes of the sequence, in the range from max_n to min_n, in order of largest n-grams first. Each index pair is a (start, end) tuple with an exclusive end, so sequence[start:end] is the n-gram. If you only want one n-gram size then set min_n equal to max_n, for example to get bi-gram indexes set both min_n and max_n to 2.
Parameters
- sequence : Sequence[Any]
  The sequence to generate n-gram indexes from.
- min_n : int
  Minimum size n-gram. Has to be greater than 0.
- max_n : int
  Maximum size n-gram. This has to be equal to or greater than min_n. If this is greater than the length of the sequence then it is set to the length of the sequence.
Returns
- Iterator[Tuple[int, int]]
Raises
- ValueError : If min_n is less than 1 or max_n is less than min_n.
Examples
from pymusas.taggers.rules.util import n_gram_indexes
tokens = ['hello', 'how', 'are', 'you', ',']
token_n_gram_indexes = n_gram_indexes(tokens, 2, 3)
expected_n_grams_indexes = [(0, 3), (1, 4), (2, 5), (0, 2), (1, 3), (2, 4), (3, 5)]
assert expected_n_grams_indexes == list(token_n_gram_indexes)
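The behaviour described above (largest n first, max_n capped at the sequence length, ValueError on invalid sizes) can be sketched in plain Python. This is a minimal illustration written from the documentation, not the library's actual implementation; the name sketch_n_gram_indexes and the error messages are hypothetical.

```python
from typing import Any, Iterator, Sequence, Tuple


def sketch_n_gram_indexes(
    sequence: Sequence[Any], min_n: int, max_n: int
) -> Iterator[Tuple[int, int]]:
    """Hypothetical re-implementation, for illustration only."""
    if min_n < 1:
        raise ValueError("min_n has to be greater than 0")
    if max_n < min_n:
        raise ValueError("max_n has to be equal to or greater than min_n")
    # Cap max_n at the sequence length, as the parameter notes describe.
    max_n = min(max_n, len(sequence))
    # Largest n-grams first; for each n, slide a window left to right.
    for n in range(max_n, min_n - 1, -1):
        for start in range(len(sequence) - n + 1):
            # The end index is exclusive, so sequence[start:start + n] has n items.
            yield (start, start + n)


tokens = ['hello', 'how', 'are', 'you', ',']
print(list(sketch_n_gram_indexes(tokens, 2, 3)))
# [(0, 3), (1, 4), (2, 5), (0, 2), (1, 3), (2, 4), (3, 5)]
```

Because the sketch is a generator, its ValueError surfaces when the iterator is first consumed; whether the library raises at call time or on first iteration is not stated in this page.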
n_grams
def n_grams(
sequence: Sequence[Any],
min_n: int,
max_n: int
) -> Iterator[Sequence[Any]]
Returns n-grams of the sequence, in the range from max_n to min_n, in order of largest n-grams first. If you only want one n-gram size then set min_n equal to max_n, for example to get bi-grams set both min_n and max_n to 2.
Parameters
- sequence : Sequence[Any]
  The sequence to generate n-grams from.
- min_n : int
  Minimum size n-gram. Has to be greater than 0.
- max_n : int
  Maximum size n-gram. This has to be equal to or greater than min_n. If this is greater than the length of the sequence then it is set to the length of the sequence.
Returns
- Iterator[Sequence[Any]]
Raises
- ValueError : If min_n is less than 1 or max_n is less than min_n.
Examples
from pymusas.taggers.rules.util import n_grams
tokens = ['hello', 'how', 'are', 'you', ',']
token_n_grams = n_grams(tokens, 2, 3)
expected_n_grams = [['hello', 'how', 'are'], ['how', 'are', 'you'], ['are', 'you', ','],
['hello', 'how'], ['how', 'are'], ['are', 'you'], ['you', ',']]
assert expected_n_grams == list(token_n_grams)
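Two cases from the description are not covered by the example above: requesting a single n-gram size, and passing a max_n larger than the sequence. The sketch below illustrates both with a stand-in function written from the documentation; sketch_n_grams is a hypothetical name and this is not the library's own code.

```python
from typing import Any, Iterator, Sequence


def sketch_n_grams(
    sequence: Sequence[Any], min_n: int, max_n: int
) -> Iterator[Sequence[Any]]:
    """Hypothetical stand-in for n_grams, for illustration only."""
    if min_n < 1:
        raise ValueError("min_n has to be greater than 0")
    if max_n < min_n:
        raise ValueError("max_n has to be equal to or greater than min_n")
    # Cap max_n at the sequence length, as the parameter notes describe.
    max_n = min(max_n, len(sequence))
    # Largest n-grams first; each n-gram is a slice of width n.
    for n in range(max_n, min_n - 1, -1):
        for start in range(len(sequence) - n + 1):
            yield sequence[start:start + n]


tokens = ['hello', 'how', 'are', 'you', ',']

# Bi-grams only: min_n and max_n are both 2.
print(list(sketch_n_grams(tokens, 2, 2)))
# [['hello', 'how'], ['how', 'are'], ['are', 'you'], ['you', ',']]

# max_n (5) exceeds the sequence length (2), so it is treated as 2.
print(list(sketch_n_grams(['a', 'b'], 1, 5)))
# [['a', 'b'], ['a'], ['b']]
```

Slicing keeps the input's type in this sketch, so a list input yields lists and a tuple input yields tuples.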