# HG changeset patch
# User Paul Boddie
# Date 1262907899 -3600
# Node ID 9d836f8a4075e76cf39e4995fdaabe66918cee2b
# Parent 89465c390a4687b2fb173e5b0361c9e5652db8df
Removed iterators and openers with the intention of having synchronised reading
(such as that done by phrase queries) done by reading batches of positions,
explicitly seeking for each batch, by employing a wrapper around readers for
each term.

diff -r 89465c390a46 -r 9d836f8a4075 iixr/files.py
--- a/iixr/files.py Sat Oct 03 03:03:32 2009 +0200
+++ b/iixr/files.py Fri Jan 08 00:44:59 2010 +0100
@@ -148,17 +148,4 @@
 
         return unicode(s, "utf-8")
 
-class FileOpener:
-
-    "Opening files using their filenames."
-
-    def __init__(self, filename):
-        self.filename = filename
-
-    def open(self, mode):
-        return open(self.filename, mode)
-
-    def close(self):
-        pass
-
 # vim: tabstop=4 expandtab shiftwidth=4
diff -r 89465c390a46 -r 9d836f8a4075 iixr/filesystem.py
--- a/iixr/filesystem.py Sat Oct 03 03:03:32 2009 +0200
+++ b/iixr/filesystem.py Fri Jan 08 00:44:59 2010 +0100
@@ -85,12 +85,15 @@
     tdif = open(join(pathname, "terms_index-%s" % partition), "rb")
     index_reader = TermIndexReader(tdif)
 
-    positions_opener = PositionOpener(join(pathname, "positions-%s" % partition))
-    positions_index_opener = PositionIndexOpener(join(pathname, "positions_index-%s" % partition))
+    pf = open(join(pathname, "positions-%s" % partition), "rb")
+    position_reader = PositionReader(pf)
 
-    positions_dict_reader = PositionDictionaryReader(positions_opener, positions_index_opener)
+    pif = open(join(pathname, "positions_index-%s" % partition), "rb")
+    position_index_reader = PositionIndexReader(pif)
 
-    return TermDictionaryReader(info_reader, index_reader, positions_dict_reader)
+    position_dict_reader = PositionDictionaryReader(position_reader, position_index_reader)
+
+    return TermDictionaryReader(info_reader, index_reader, position_dict_reader)
 
 def get_field_reader(pathname, partition):
 
diff -r 89465c390a46 -r 9d836f8a4075 iixr/positions.py
--- a/iixr/positions.py Sat Oct 03 03:03:32 2009 +0200
+++ b/iixr/positions.py Fri Jan 08 00:44:59 2010 +0100
@@ -61,21 +61,6 @@
 
         self.last_docnum = docnum
 
-class PositionOpener(FileOpener):
-
-    "Reading position information from files."
-
-    def read_term_positions(self, offset, count):
-
-        """
-        Read all positions from 'offset', seeking to that position in the file
-        before reading. The number of documents available for reading is limited
-        to 'count'.
-        """
-
-        f = self.open("rb")
-        return PositionIterator(f, offset, count)
-
 class PositionIndexWriter(FileWriter):
 
     "Writing position index information to files."
@@ -107,21 +92,6 @@
         self.last_pos_offset = pos_offset
         self.last_docnum = docnum
 
-class PositionIndexOpener(FileOpener):
-
-    "Reading position index information from files."
-
-    def read_term_positions(self, offset, doc_frequency):
-
-        """
-        Read all positions from 'offset', seeking to that position in the file
-        before reading. The number of documents available for reading is limited
-        to 'doc_frequency'.
-        """
-
-        f = self.open("rb")
-        return PositionIndexIterator(f, offset, doc_frequency)
-
 # Iterators for position-related files.
 
 class IteratorBase:
@@ -142,18 +112,29 @@
     def __iter__(self):
         return self
 
-class PositionIterator(FileReader, IteratorBase):
+class PositionReader(FileReader, IteratorBase):
 
     "Iterating over document positions."
 
-    def __init__(self, f, offset, count):
+    def __init__(self, f):
         FileReader.__init__(self, f)
-        IteratorBase.__init__(self, count)
-        self.f.seek(offset)
+        IteratorBase.__init__(self, 0) # no iteration initially permitted
+        self.reset()
 
     def reset(self):
         self.last_docnum = 0
 
+    def seek(self, offset, count):
+
+        """
+        Seek to 'offset' in the file, limiting the number of documents available
+        for reading to 'count'.
+        """
+
+        self.f.seek(offset)
+        self.replenish(count)
+        self.reset()
+
     def read_positions(self):
 
         "Read positions, returning a document number and a list of positions."
@@ -190,20 +171,31 @@
         else:
             raise StopIteration
 
-class PositionIndexIterator(FileReader, IteratorBase):
+class PositionIndexReader(FileReader, IteratorBase):
 
     "Iterating over document positions."
 
-    def __init__(self, f, offset, count):
+    def __init__(self, f):
         FileReader.__init__(self, f)
-        IteratorBase.__init__(self, count)
-        self.f.seek(offset)
+        IteratorBase.__init__(self, 0) # no iteration initially permitted
+        self.reset()
 
     def reset(self):
         self.last_docnum = 0
         self.last_pos_offset = 0
         self.section_count = 0
 
+    def seek(self, offset, doc_frequency):
+
+        """
+        Seek to 'offset' in the file, limiting the number of documents available
+        for reading to 'doc_frequency'.
+        """
+
+        self.f.seek(offset)
+        self.replenish(doc_frequency)
+        self.reset()
+
     def read_positions(self):
 
         """
@@ -319,65 +311,37 @@
 
 class PositionDictionaryReader:
 
-    "Reading position dictionaries."
-
-    def __init__(self, position_opener, position_index_opener):
-        self.position_opener = position_opener
-        self.position_index_opener = position_index_opener
-
-    def read_term_positions(self, offset, doc_frequency):
-
-        """
-        Return an iterator for dictionary entries starting at 'offset' with the
-        given 'doc_frequency'.
-        """
-
-        return PositionDictionaryIterator(self.position_opener,
-            self.position_index_opener, offset, doc_frequency)
-
-    def close(self):
-        pass
-
-class PositionDictionaryIterator:
-
     "Iteration over position dictionary entries."
 
-    def __init__(self, position_opener, position_index_opener, offset, doc_frequency):
-        self.position_opener = position_opener
-        self.position_index_opener = position_index_opener
-        self.doc_frequency = doc_frequency
+    def __init__(self, position_reader, position_index_reader):
+        self.position_reader = position_reader
+        self.position_index_reader = position_index_reader
+        self.reset()
 
-        self.index_iterator = None
-        self.iterator = None
-
-        # Initialise the iterators.
-
-        self.reset(offset, doc_frequency)
-
-    def reset(self, offset, doc_frequency):
+    def reset(self):
 
         # Remember the last values.
 
         self.found_docnum, self.found_positions = None, None
 
-        # Attempt to reuse the index iterator.
-
-        if self.index_iterator is not None:
-            ii = self.index_iterator
-            ii.replenish(doc_frequency)
-            ii.f.seek(offset)
-            ii.reset()
-
-        # Or make a new index iterator.
-
-        else:
-            self.index_iterator = self.position_index_opener.read_term_positions(offset, doc_frequency)
-
         # Maintain state for the next index entry, if read.
 
         self.next_docnum, self.next_pos_offset, self.next_section_count = None, None, None
 
-        # Initialise the current index entry and current position file iterator.
+    def seek(self, offset, doc_frequency):
+
+        """
+        Seek to 'offset' in the index file, limiting the number of documents
+        available for reading to 'doc_frequency'.
+        """
+
+        self.reset()
+
+        # Seek to the appropriate index entry.
+
+        self.position_index_reader.seek(offset, doc_frequency)
+
+        # Initialise the current index entry and current position file reader.
 
         self._next_section()
         self._init_section()
@@ -385,7 +349,7 @@
     # Sequence methods.
 
     def __len__(self):
-        return self.doc_frequency
+        return len(self.position_index_reader)
 
     def sort(self):
         pass
@@ -416,23 +380,23 @@
             # Either return the next record.
 
             try:
-                return self.iterator.next()
+                return self.position_reader.next()
 
             # Or, where a section is finished, get the next section and try again.
 
             except StopIteration:
 
-                # Where a section follows, update the index iterator, but keep
-                # reading using the same file iterator (since the data should
-                # just follow on from the last section).
+                # Where a section follows, update the index reader, but keep
+                # reading using the same file reader (since the data should just
+                # follow on from the last section).
 
                 self._next_section()
-                self.iterator.replenish(self.section_count)
+                self.position_reader.replenish(self.section_count)
 
-                # Reset the state of the iterator to make sure that document
+                # Reset the state of the reader to make sure that document
                 # numbers are correct.
 
-                self.iterator.reset()
+                self.position_reader.reset()
 
     def from_document(self, docnum):
 
@@ -451,7 +415,7 @@
 
         try:
             if self.next_docnum is None:
-                self.next_docnum, self.next_pos_offset, self.next_section_count = self.index_iterator.next()
+                self.next_docnum, self.next_pos_offset, self.next_section_count = self.position_index_reader.next()
 
             # Read until the next entry is after the desired document number,
             # or until the end of the results.
@@ -459,7 +423,7 @@
             while self.next_docnum <= docnum:
                 self._next_read_section()
                 if self.docnum < docnum:
-                    self.next_docnum, self.next_pos_offset, self.next_section_count = self.index_iterator.next()
+                    self.next_docnum, self.next_pos_offset, self.next_section_count = self.position_index_reader.next()
                 else:
                     break
 
@@ -472,7 +436,7 @@
 
         try:
             while 1:
-                found_docnum, found_positions = self.iterator.next()
+                found_docnum, found_positions = self.position_reader.next()
 
                 # Return the desired document positions or None (retaining the
                 # positions for the document immediately after).
@@ -493,7 +457,7 @@
         "Attempt to get the next section in the index."
 
         if self.next_docnum is None:
-            self.docnum, self.pos_offset, self.section_count = self.index_iterator.next()
+            self.docnum, self.pos_offset, self.section_count = self.position_index_reader.next()
         else:
             self._next_read_section()
 
@@ -509,43 +473,14 @@
 
     def _init_section(self):
 
-        "Initialise the iterator for the section in the position file."
-
-        # Attempt to reuse any correctly positioned iterator.
+        "Initialise the reader for the section in the position file."
 
-        if self.iterator is not None:
-            i = self.iterator
-            i.replenish(self.section_count)
-            i.f.seek(self.pos_offset)
-            i.reset()
+        # Seek to the position entry.
 
-        # Otherwise, obtain a new iterator.
-
-        else:
-            self.iterator = self.position_opener.read_term_positions(self.pos_offset, self.section_count)
+        self.position_reader.seek(self.pos_offset, self.section_count)
 
     def close(self):
-        if self.iterator is not None:
-            self.iterator.close()
-            self.iterator = None
-        if self.index_iterator is not None:
-            self.index_iterator.close()
-            self.index_iterator = None
-
-class ResetPositionDictionaryIterator:
-
-    """
-    A helper class which permits the reuse of iterators without modifying their
-    state.
-    """
-
-    def __init__(self, iterator, offset, doc_frequency):
-        self.iterator = iterator
-        self.offset = offset
-        self.doc_frequency = doc_frequency
-
-    def __iter__(self):
-        self.iterator.reset(self.offset, self.doc_frequency)
-        return iter(self.iterator)
+        self.position_reader.close()
+        self.position_index_reader.close()
 
 # vim: tabstop=4 expandtab shiftwidth=4
diff -r 89465c390a46 -r 9d836f8a4075 iixr/terms.py
--- a/iixr/terms.py Sat Oct 03 03:03:32 2009 +0200
+++ b/iixr/terms.py Fri Jan 08 00:44:59 2010 +0100
@@ -208,7 +208,6 @@
         self.info_reader = info_reader
         self.index_reader = index_reader
         self.position_dict_reader = position_dict_reader
-        self.position_dict_iterator = None # for sequential/iterator access
 
         self.terms = []
         try:
@@ -302,7 +301,8 @@
         documents equal to the given 'doc_frequency'.
         """
 
-        return self.position_dict_reader.read_term_positions(offset, doc_frequency)
+        self.position_dict_reader.seek(offset, doc_frequency)
+        return self.position_dict_reader
 
     # Iterator convenience methods.
 
@@ -330,12 +330,8 @@
 
         term, offset, frequency, doc_frequency = self.info_reader.read_term()
 
-        # For sequential access, attempt to reuse any iterator.
-
-        if self.position_dict_iterator is None:
-            self.position_dict_iterator = self._get_positions(offset, doc_frequency)
-
-        return term, frequency, doc_frequency, ResetPositionDictionaryIterator(self.position_dict_iterator, offset, doc_frequency)
+        self.position_dict_reader.seek(offset, doc_frequency)
+        return term, frequency, doc_frequency, self.position_dict_reader
 
     # Query methods.
 
@@ -412,8 +408,5 @@
         self.info_reader.close()
         self.index_reader.close()
         self.position_dict_reader.close()
-        if self.position_dict_iterator is not None:
-            self.position_dict_iterator.close()
-            self.position_dict_iterator = None
 
 # vim: tabstop=4 expandtab shiftwidth=4
diff -r 89465c390a46 -r 9d836f8a4075 test.py
--- a/test.py Sat Oct 03 03:03:32 2009 +0200
+++ b/test.py Fri Jan 08 00:44:59 2010 +0100
@@ -68,7 +68,7 @@
 w.close()
 
 f = open("testP", "rb")
-r = PositionIterator(f, 0, None)
+r = PositionReader(f)
 for doc_positions in all_doc_positions:
     for docnum, positions in doc_positions:
         d, p = r.read_positions()
@@ -105,12 +105,12 @@
     offsets.append((offset, doc_frequency))
 w.close()
 
-r = PositionIndexOpener("testPI")
+r = PositionIndexReader(open("testPI", "rb"))
 offsets.reverse()
 indexed_positions.reverse()
 for (offset, doc_frequency), term_positions in zip(offsets, indexed_positions):
-    found_positions = r.read_term_positions(offset, doc_frequency)
-    for (docnum, pos_offset, count), (dn, po, c) in zip(term_positions, found_positions):
+    r.seek(offset, doc_frequency)
+    for (docnum, pos_offset, count), (dn, po, c) in zip(term_positions, r):
         print docnum == dn, docnum, dn
         print pos_offset == po, pos_offset, po
         print count == c, count, c
@@ -129,13 +129,14 @@
     offsets.append((offset, doc_frequency))
 wd.close()
 
-r = PositionOpener("testP")
-r2 = PositionIndexOpener("testPI")
+r = PositionReader(open("testP", "rb"))
+r2 = PositionIndexReader(open("testPI", "rb"))
 rd = PositionDictionaryReader(r, r2)
 offsets.reverse()
 all_doc_positions.reverse()
 for (offset, doc_frequency), doc_positions in zip(offsets, all_doc_positions):
-    dp = list(rd.read_term_positions(offset, doc_frequency))
+    rd.seek(offset, doc_frequency)
+    dp = list(rd)
     print doc_positions == dp, doc_positions, dp
 rd.close()
 
@@ -298,8 +299,8 @@
 r = TermReader(f)
 f2 = open("testI", "rb")
 r2 = TermIndexReader(f2)
-r3 = PositionOpener("testP")
-r4 = PositionIndexOpener("testPI")
+r3 = PositionReader(open("testP", "rb"))
+r4 = PositionIndexReader(open("testPI", "rb"))
 rp = PositionDictionaryReader(r3, r4)
 rd = TermDictionaryReader(r, r2, rp)
 terms_reversed = terms[:]
@@ -360,8 +361,8 @@
 r = TermReader(f)
 f2 = open("testI", "rb")
 r2 = TermIndexReader(f2)
-r3 = PositionOpener("testP")
-r4 = PositionIndexOpener("testPI")
+r3 = PositionReader(open("testP", "rb"))
+r4 = PositionIndexReader(open("testPI", "rb"))
 rp = PositionDictionaryReader(r3, r4)
 rd = TermDictionaryReader(r, r2, rp)
 terms_reversed = terms_with_positions[:]
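
Note on the reader pattern used in this changeset: the patch replaces per-term iterator objects with a single reader per file that is repositioned through seek(offset, count). The following is a minimal standalone sketch of that idea, not the iixr code itself. The class name BatchReader, the one-byte-per-record format and the demo() helper are assumptions made for illustration, and the sketch is written for Python 3 whereas the patched code targets Python 2.

```python
import io

class BatchReader:
    """Hypothetical reader that reuses one open file for many batches."""

    def __init__(self, f):
        self.f = f
        self.remaining = 0  # no iteration permitted until seek() is called

    def replenish(self, count):
        # Allow 'count' further records to be read.
        self.remaining = count

    def seek(self, offset, count):
        # Reposition the shared file, then limit reading to 'count' records.
        self.f.seek(offset)
        self.replenish(count)

    def __iter__(self):
        return self

    def __next__(self):
        if self.remaining <= 0:
            raise StopIteration
        data = self.f.read(1)  # assumed format: one byte per record
        if not data:
            raise StopIteration
        self.remaining -= 1
        return data[0]

def demo():
    # Two batches stored back to back: three records, then two records.
    f = io.BytesIO(bytes([10, 11, 12, 20, 21]))
    reader = BatchReader(f)
    before_seek = list(reader)   # nothing is readable before the first seek
    reader.seek(3, 2)            # jump to the second batch
    second = list(reader)
    reader.seek(0, 3)            # go back to the first batch with the same reader
    first = list(reader)
    return before_seek, second, first
```

Because the reader object is shared, each seek() invalidates whatever batch was being read before, so a caller that needs two batches at once would have to copy the first into a list before seeking again.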