search¶
Services for searching and matching of text.
indexing¶
Interface for different indexing engines for the Translate Toolkit.
CommonIndexer¶
Base class for interfaces to indexing engines for Pootle.
class translate.search.indexing.CommonIndexer.CommonDatabase(basedir, analyzer=None, create_allowed=True)¶
Base class for indexing support.
Any real implementation must override most methods of this class.
ANALYZER_DEFAULT = 6¶
The default analyzer to be used if nothing is configured.
ANALYZER_EXACT = 0¶
Exact matching: the query string must equal the whole term string.
ANALYZER_PARTIAL = 2¶
Partial matching: a document matches even if the query string only matches the beginning of the term value.
ANALYZER_TOKENIZE = 4¶
Tokenize terms and queries automatically.
INDEX_DIRECTORY_NAME = None¶
Override this with a string to be used as the name of the indexing directory/file in the filesystem.
QUERY_TYPE = None¶
Override this with the query class of the implementation.
begin_transaction()¶
Begin a transaction.
You can group multiple modifications of a database into a transaction. This prevents time-consuming database flushing and ensures that a set of changes is committed either completely or not at all. No changes are written to disk until commit_transaction(); cancel_transaction() can be used to revert an ongoing transaction.
Database types that do not support transactions may silently ignore this.
cancel_transaction()¶
Cancel an ongoing transaction.
See begin_transaction() for details.
commit_transaction()¶
Submit the currently ongoing transaction and write changes to disk.
See begin_transaction() for details.
delete_doc(ident)¶
Delete the documents returned by a query.
- Parameters
ident (int | list of tuples | dict | list of dicts | query (e.g. xapian.Query) | list of queries) – [list of] document IDs | dict describing a query | query
delete_document_by_id(docid)¶
Delete a specified document.
- Parameters
docid (int) – the document ID to be deleted
field_analyzers = {}¶
Mapping of field names and analyzers; see set_field_analyzers().
flush(optimize=False)¶
Flush the content of the database to force changes to be written to disk.
Some databases also support index optimization.
- Parameters
optimize (bool) – should the index be optimized if possible?
get_field_analyzers(fieldnames=None)¶
Return the analyzer that was mapped to a specific field.
See set_field_analyzers() for details.
- Parameters
fieldnames (str | list of str | None) – the field (or fields) whose analyzer is requested; leave empty (or None) to request all fields
- Returns
The analyzer setting of the field - see CommonDatabase.ANALYZER_??? or a dict of field names and analyzers
- Return type
int | dict
get_query_result(query)¶
Return an object containing the results of a query.
- Parameters
query (a query object of the real implementation) – a pre-compiled query
- Returns
an object that allows access to the results
- Return type
subclass of CommonEnquire
index_document(data)¶
Add the given data to the database.
- Parameters
data (dict | list of str) – the data to be indexed. A dictionary is treated as fieldname:value combinations. If the fieldname is None, the value is interpreted as a plain term or a list of plain terms. Lists of terms are indexed separately. Lists of strings are treated as plain terms.
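Example (a sketch, assuming db is a CommonDatabase subclass instance; the field names are illustrative):
    # A dict is indexed as fieldname:value pairs.
    db.index_document({"source": "Open file", "target": "Ouvrir le fichier"})
    # A list of strings is indexed as plain terms without a field name.
    db.index_document(["standalone", "plain", "terms"])
    db.flush(optimize=True)    # force the new documents to be written to disk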
make_query(args, require_all=True, analyzer=None)¶
Create simple queries (strings or field searches) or combine multiple queries (AND/OR).
To specify rules for field searches, you may want to take a look at set_field_analyzers(). The parameter 'match_text_partial' can override the previously defined default setting.
- Parameters
args (list of queries | single query | str | dict) – queries, a search string, or a description of a field query. Examples: [xapian.Query("foo"), xapian.Query("bar")], xapian.Query("foo"), "bar", {"foo": "bar", "foobar": "foo"}
require_all (bool) – boolean operator (True -> AND (default) / False -> OR)
analyzer (int) – (only applicable for 'dict' or 'str') define query options (partial matching, exact matching, tokenizing, …) as bitwise combinations of CommonIndexer.ANALYZER_???. This can override previously defined field analyzer settings. If analyzer is None (default), the configured analyzer for the field is used.
- Returns
the combined query
- Return type
query type of the specific implementation
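Example (a sketch of the accepted argument forms, assuming db is a CommonDatabase subclass instance and "source"/"target" are indexed fields):
    # Plain search string, handled with the configured default analyzer.
    q1 = db.make_query("open file")
    # Field query; all fields must match because require_all defaults to True.
    q2 = db.make_query({"source": "open", "target": "ouvrir"})
    # The same field query with OR semantics and explicit partial matching.
    q3 = db.make_query({"source": "open", "target": "ouvrir"}, require_all=False,
                       analyzer=db.ANALYZER_PARTIAL | db.ANALYZER_TOKENIZE)
    # Combine previously built queries into one.
    combined = db.make_query([q1, q2], require_all=True)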
search(query, fieldnames)¶
Return a list of the contents of specified fields for all matches of a query.
- Parameters
query (a query object of the real implementation) – the query to be issued
fieldnames (string | list of strings) – the name(s) of a field of the document content
- Returns
a list of dicts containing the specified field(s)
- Return type
list of dicts
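Example (putting make_query() and search() together; "source" and "target" are assumed field names):
    query = db.make_query({"source": "file"}, analyzer=db.ANALYZER_PARTIAL)
    for hit in db.search(query, ["source", "target"]):
        # Each hit is a dict containing the requested fields of one matching document.
        print(hit)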
set_field_analyzers(field_analyzers)¶
Set the analyzers for different fields of the database documents.
All bitwise combinations of CommonIndexer.ANALYZER_??? are possible.
- Parameters
field_analyzers (dict containing field names and analyzers) – mapping of field names and analyzers
- Raises
TypeError – invalid values in field_analyzers
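Example (configuring per-field analyzers with the ANALYZER_* constants; the field names are assumptions):
    db.set_field_analyzers({
        "source": db.ANALYZER_TOKENIZE | db.ANALYZER_PARTIAL,
        "docid": db.ANALYZER_EXACT,
    })
    print(db.get_field_analyzers("source"))    # bitmask for a single field
    print(db.get_field_analyzers())            # dict of all configured fields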
class translate.search.indexing.CommonIndexer.CommonEnquire(enquire)¶
An enquire object contains the information about the result of a request.
get_matches(start, number)¶
Return a specified number of qualified matches of a previous query.
- Parameters
start (int) – index of the first match to return (starting from zero)
number (int) – the number of matching entries to return
- Returns
a set of matching entries and some statistics
- Return type
tuple of (returned number, available number, matches), where each entry in "matches" is a dictionary with the keys ["rank", "percent", "document", "docid"]
get_matches_count()¶
Return the estimated number of matches.
Use translate.search.indexing.CommonIndexer.search() to retrieve the exact number of matches.
- Returns
The estimated number of matches
- Return type
int
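Example (paging through results via get_query_result(); the returned object is whatever CommonEnquire subclass the backend provides):
    result = db.get_query_result(db.make_query("file"))
    print(result.get_matches_count())    # estimated number of matches
    returned, available, matches = result.get_matches(0, 10)
    for entry in matches:
        print(entry["rank"], entry["percent"], entry["docid"])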
translate.search.indexing.CommonIndexer.is_available()¶
Check if this indexing engine interface is usable.
This function must exist in every module that contains indexing engine interfaces.
- Returns
is this interface usable?
- Return type
bool
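Example (probing for a usable backend; note that importing a backend module may itself fail if the engine's Python bindings are missing, so the import is guarded as well):
    try:
        from translate.search.indexing import XapianIndexer
        usable = XapianIndexer.is_available()
    except ImportError:
        usable = False
    if usable:
        db = XapianIndexer.XapianDatabase("./index")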
PyLuceneIndexer¶
Interface for the PyLucene (v2.x) indexing engine.
Take a look at PyLuceneIndexer1.py for the PyLucene v1.x interface.
class translate.search.indexing.PyLuceneIndexer.PyLuceneDatabase(basedir, analyzer=None, create_allowed=True)¶
Manage and use a PyLucene indexing database.
begin_transaction()¶
PyLucene does not support transactions.
This function just opens the database for write access. Call cancel_transaction() or commit_transaction() to close write access and remove the exclusive lock from the database directory.
cancel_transaction()¶
PyLucene does not support transactions.
This function just closes the database write access and removes the exclusive lock.
See begin_transaction() for details.
commit_transaction()¶
PyLucene does not support transactions.
This function just closes the database write access and removes the exclusive lock.
See begin_transaction() for details.
delete_doc(ident)¶
Delete the documents returned by a query.
- Parameters
ident (int | list of tuples | dict | list of dicts | query (e.g. xapian.Query) | list of queries) – [list of] document IDs | dict describing a query | query
delete_document_by_id(docid)¶
Delete a specified document.
- Parameters
docid (int) – the document ID to be deleted
flush(optimize=False)¶
Flush the content of the database to force changes to be written to disk.
Some databases also support index optimization.
- Parameters
optimize (bool) – should the index be optimized if possible?
get_field_analyzers(fieldnames=None)¶
Return the analyzer that was mapped to a specific field.
See set_field_analyzers() for details.
- Parameters
fieldnames (str | list of str | None) – the field (or fields) whose analyzer is requested; leave empty (or None) to request all fields
- Returns
The analyzer setting of the field - see CommonDatabase.ANALYZER_??? or a dict of field names and analyzers
- Return type
int | dict
get_query_result(query)¶
Return an object containing the results of a query.
- Parameters
query (a query object of the real implementation) – a pre-compiled query
- Returns
an object that allows access to the results
- Return type
subclass of CommonEnquire
index_document(data)¶
Add the given data to the database.
- Parameters
data (dict | list of str) – the data to be indexed. A dictionary is treated as fieldname:value combinations. If the fieldname is None, the value is interpreted as a plain term or a list of plain terms. Lists of terms are indexed separately. Lists of strings are treated as plain terms.
make_query(*args, **kwargs)¶
Create simple queries (strings or field searches) or combine multiple queries (AND/OR).
To specify rules for field searches, you may want to take a look at set_field_analyzers(). The parameter 'match_text_partial' can override the previously defined default setting.
- Parameters
args (list of queries | single query | str | dict) – queries, a search string, or a description of a field query. Examples: [xapian.Query("foo"), xapian.Query("bar")], xapian.Query("foo"), "bar", {"foo": "bar", "foobar": "foo"}
require_all (bool) – boolean operator (True -> AND (default) / False -> OR)
analyzer (int) – (only applicable for 'dict' or 'str') define query options (partial matching, exact matching, tokenizing, …) as bitwise combinations of CommonIndexer.ANALYZER_???. This can override previously defined field analyzer settings. If analyzer is None (default), the configured analyzer for the field is used.
- Returns
the combined query
- Return type
query type of the specific implementation
search(query, fieldnames)¶
Return a list of the contents of specified fields for all matches of a query.
- Parameters
query (a query object of the real implementation) – the query to be issued
fieldnames (string | list of strings) – the name(s) of a field of the document content
- Returns
a list of dicts containing the specified field(s)
- Return type
list of dicts
set_field_analyzers(field_analyzers)¶
Set the analyzers for different fields of the database documents.
All bitwise combinations of CommonIndexer.ANALYZER_??? are possible.
- Parameters
field_analyzers (dict containing field names and analyzers) – mapping of field names and analyzers
- Raises
TypeError – invalid values in field_analyzers
class translate.search.indexing.PyLuceneIndexer.PyLuceneHits(enquire)¶
An enquire object contains the information about the result of a request.
get_matches(start, number)¶
Return a specified number of qualified matches of a previous query.
- Parameters
start (int) – index of the first match to return (starting from zero)
number (int) – the number of matching entries to return
- Returns
a set of matching entries and some statistics
- Return type
tuple of (returned number, available number, matches), where each entry in "matches" is a dictionary with the keys ["rank", "percent", "document", "docid"]
get_matches_count()¶
Return the estimated number of matches.
Use translate.search.indexing.CommonIndexer.search() to retrieve the exact number of matches.
- Returns
The estimated number of matches
- Return type
int
XapianIndexer¶
Interface to the Xapian indexing engine for the Translate Toolkit.
Xapian v1.0 or higher is supported.
If you are interested in writing an interface for Xapian 0.x, then you should check out the following:
svn export -r 7235 https://translate.svn.sourceforge.net/svnroot/translate/src/branches/translate-search-indexer-generic-merging/translate/search/indexer/
It is not completely working, but it should give you a good start.
class translate.search.indexing.XapianIndexer.XapianDatabase(basedir, analyzer=None, create_allowed=True)¶
Interface to the Xapian indexer.
begin_transaction()¶
Begin a transaction.
Xapian supports transactions to group multiple database modifications. This avoids intermediate flushing and therefore increases performance.
cancel_transaction()¶
Cancel an ongoing transaction.
No changes since the last execution of begin_transaction() are written.
commit_transaction()¶
Submit the changes of an ongoing transaction.
All changes since the last execution of begin_transaction() are written.
delete_doc(ident)¶
Delete the documents returned by a query.
- Parameters
ident (int | list of tuples | dict | list of dicts | query (e.g. xapian.Query) | list of queries) – [list of] document IDs | dict describing a query | query
delete_document_by_id(docid)¶
Delete a specified document.
- Parameters
docid (int) – the document ID to be deleted
flush(optimize=False)¶
Force the current changes to be written to disk immediately.
- Parameters
optimize (bool) – ignored for xapian
get_field_analyzers(fieldnames=None)¶
Return the analyzer that was mapped to a specific field.
See set_field_analyzers() for details.
- Parameters
fieldnames (str | list of str | None) – the field (or fields) whose analyzer is requested; leave empty (or None) to request all fields
- Returns
The analyzer setting of the field - see CommonDatabase.ANALYZER_??? or a dict of field names and analyzers
- Return type
int | dict
get_query_result(query)¶
Return an object containing the results of a query.
- Parameters
query (xapian.Query) – a pre-compiled xapian query
- Returns
an object that allows access to the results
- Return type
XapianIndexer.CommonEnquire
index_document(data)¶
Add the given data to the database.
- Parameters
data (dict | list of str) – the data to be indexed. A dictionary is treated as fieldname:value combinations. If the fieldname is None, the value is interpreted as a plain term or a list of plain terms. Lists of terms are indexed separately. Lists of strings are treated as plain terms.
make_query(*args, **kwargs)¶
Create simple queries (strings or field searches) or combine multiple queries (AND/OR).
To specify rules for field searches, you may want to take a look at set_field_analyzers(). The parameter 'match_text_partial' can override the previously defined default setting.
- Parameters
args (list of queries | single query | str | dict) – queries, a search string, or a description of a field query. Examples: [xapian.Query("foo"), xapian.Query("bar")], xapian.Query("foo"), "bar", {"foo": "bar", "foobar": "foo"}
require_all (bool) – boolean operator (True -> AND (default) / False -> OR)
analyzer (int) – (only applicable for 'dict' or 'str') define query options (partial matching, exact matching, tokenizing, …) as bitwise combinations of CommonIndexer.ANALYZER_???. This can override previously defined field analyzer settings. If analyzer is None (default), the configured analyzer for the field is used.
- Returns
the combined query
- Return type
query type of the specific implementation
search(query, fieldnames)¶
Return a list of the contents of specified fields for all matches of a query.
- Parameters
query (xapian.Query) – the query to be issued
fieldnames (string | list of strings) – the name(s) of a field of the document content
- Returns
a list of dicts containing the specified field(s)
- Return type
list of dicts
set_field_analyzers(field_analyzers)¶
Set the analyzers for different fields of the database documents.
All bitwise combinations of CommonIndexer.ANALYZER_??? are possible.
- Parameters
field_analyzers (dict containing field names and analyzers) – mapping of field names and analyzers
- Raises
TypeError – invalid values in field_analyzers
class translate.search.indexing.XapianIndexer.XapianEnquire(enquire)¶
Interface to the Xapian object for storing sets of matches.
get_matches(start, number)¶
Return a specified number of qualified matches of a previous query.
- Parameters
start (int) – index of the first match to return (starting from zero)
number (int) – the number of matching entries to return
- Returns
a set of matching entries and some statistics
- Return type
tuple of (returned number, available number, matches), where each entry in "matches" is a dictionary with the keys ["rank", "percent", "document", "docid"]
get_matches_count()¶
Return the estimated number of matches.
Use translate.search.indexing.CommonIndexer.search() to retrieve the exact number of matches.
- Returns
The estimated number of matches
- Return type
int
lshtein¶
A class to calculate a similarity based on the Levenshtein distance.
See http://en.wikipedia.org/wiki/Levenshtein_distance.
If available, the python-Levenshtein module will be used, which provides better performance as it is implemented natively.
translate.search.lshtein.distance(a, b, stopvalue=0)¶
Same as python_distance in functionality. This uses the fast C version if we detected it earlier.
Note that this does not support arbitrary sequence types, but only string types.
translate.search.lshtein.native_distance(a, b, stopvalue=0)¶
Same as python_distance in functionality. This uses the fast C version if we detected it earlier.
Note that this does not support arbitrary sequence types, but only string types.
translate.search.lshtein.python_distance(a, b, stopvalue=-1)¶
Calculates the distance for use in similarity calculation. Python version.
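Example (a quick illustration; the strings are arbitrary):
    from translate.search import lshtein
    # distance() transparently uses the C implementation when python-Levenshtein
    # is installed, and the pure-Python version otherwise.
    print(lshtein.distance("saving file", "saving files"))
    print(lshtein.python_distance("saving file", "opening file"))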
match¶
Class to perform translation memory matching from a store of translation units.
class translate.search.match.matcher(store, max_candidates=10, min_similarity=75, max_length=70, comparer=None, usefuzzy=False)¶
A class that will do matching and store configuration for the matching process.
buildunits(candidates)¶
Builds a list of units conforming to base API, with the score in the comment.
extendtm(units, store=None, sort=True)¶
Extends the memory with extra unit(s).
- Parameters
units – The units to add to the TM.
store – Optional store from where some metadata can be retrieved and associated with each unit.
sort – Optional parameter that can be set to False to suppress sorting of the candidates list. This should probably only be used in matcher.inittm().
getstartlength(min_similarity, text)¶
Calculates the minimum length we are interested in. The extra fat is because we don't use plain character distance only.
getstoplength(min_similarity, text)¶
Calculates a length beyond which we are not interested. The extra fat is because we don't use plain character distance only.
inittm(stores, reverse=False)¶
Initialises the memory for later use. We use simple base units for speedup.
matches(text)¶
Returns a list of possible matches for given source text.
- Parameters
text (String) – The text that will be searched for in the translation memory
- Return type
list
- Returns
a list of units with the source and target strings from the translation memory. If self.addpercentage is True (default), the match quality is given as a percentage in the notes.
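Example (an end-to-end sketch; the store is built in memory with translate.storage.pypo purely for illustration, and the strings are invented):
    from translate.storage import pypo
    from translate.search import match
    store = pypo.pofile()
    unit = store.addsourceunit("Open the file")
    unit.target = "Ouvrir le fichier"
    tm = match.matcher(store, max_candidates=5, min_similarity=70)
    for candidate in tm.matches("Open a file"):
        # Candidates carry the TM source and target; the match quality is in the notes.
        print(candidate.source, "->", candidate.target)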
setparameters(max_candidates=10, min_similarity=75, max_length=70)¶
Sets the parameters without reinitialising the TM. If a parameter is not specified, it is set to the default, not ignored.
usable(unit)¶
Returns whether this translation unit is usable for TM.
translate.search.match.sourcelen(unit)¶
Returns the length of the source string.
class translate.search.match.terminologymatcher(store, max_candidates=10, min_similarity=75, max_length=500, comparer=None)¶
A matcher with settings specifically for terminology matching.
buildunits(candidates)¶
Builds a list of units conforming to base API, with the score in the comment.
extendtm(units, store=None, sort=True)¶
Extends the memory with extra unit(s).
- Parameters
units – The units to add to the TM.
store – Optional store from where some metadata can be retrieved and associated with each unit.
sort – Optional parameter that can be set to False to suppress sorting of the candidates list. This should probably only be used in matcher.inittm().
getstartlength(min_similarity, text)¶
Calculates the minimum length we are interested in. The extra fat is because we don't use plain character distance only.
getstoplength(min_similarity, text)¶
Calculates a length beyond which we are not interested. The extra fat is because we don't use plain character distance only.
inittm(store)¶
Normal initialisation, but convert all source strings to lower case.
matches(text)¶
Normal matching after converting text to lower case. Then replace with the original unit to retain comments, etc.
setparameters(max_candidates=10, min_similarity=75, max_length=70)¶
Sets the parameters without reinitialising the TM. If a parameter is not specified, it is set to the default, not ignored.
usable(unit)¶
Returns whether this translation unit is usable for terminology.
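Example (a comparable sketch for glossary matching; the entries are invented):
    from translate.storage import pypo
    from translate.search import match
    glossary = pypo.pofile()
    term = glossary.addsourceunit("file")
    term.target = "fichier"
    terms = match.terminologymatcher(glossary)
    # Matching is case-insensitive and returns the original glossary units.
    for hit in terms.matches("Could not open the File"):
        print(hit.source, "->", hit.target)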
translate.search.match.unit2dict(unit)¶
Converts a pounit to a simple dict structure for use over the web.
segment¶
Module to deal with different types and uses of segmentation
translate.search.segment.character_iter(text)¶
Returns an iterator over the characters in text.
translate.search.segment.characters(text)¶
Returns a list of characters in text.
translate.search.segment.sentence_iter(text)¶
Returns an iterator over the sentences in text.
translate.search.segment.sentences(text)¶
Returns a list of sentences in text.
translate.search.segment.word_iter(text)¶
Returns an iterator over the words in text.
translate.search.segment.words(text)¶
Returns a list of words in text.
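Example (the segmenters are plain functions; a quick illustration):
    from translate.search import segment
    text = "The file was saved. Close the editor now."
    print(segment.words(text))        # list of words
    print(segment.sentences(text))    # list of sentences
    print(segment.characters(text))   # list of characters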
terminology¶
A class that does terminology matching