ASM Planet Group

Let us introduce ASM Planet Group, a group of thematic search sites based on the original engine of IOIX Ukraine. It is the first service oriented to thematic context. We plan to open similar thematic services in several key areas relatively soon, and we invite you to do the same. ASM indexes a fixed list of sites selected by theme and continuously corrects the indexed material, which keeps the index up to date.

The main differences between this system and many other search systems are an indexing system with an infinite refresh cycle and a search algorithm aimed at relevancy quality, judged by the semantic and contextual content of resources. The system pays little attention to standard relevancy parameters such as citation rate, link relations between resources, or the frequency of the searched words.

The main task of the search algorithm is to detect the resources that most precisely reflect the subject of the user's query and best correspond to the query phrase, taking into account the distances between the searched words, the number of word chains and the intervals between them. The search engine treats a query not as a request for resources that merely contain the searched words, but as a criterion for detecting resources related to the searched terms and subjects, including their synonyms and homonyms. An extended morphology engine, built on the original dictionary and normalization subsystems, works with word paradigms and can detect the morphological structure of a word using context data as well as the word itself. The morphological kernel operates with more than 240 thousand word paradigms of the Russian language and uses an adaptive frequency algorithm to fill the dictionary with paradigms in real time.
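As an illustration of how the distance between searched words can feed into relevancy, here is a minimal sketch in Python. It is not the engine's actual algorithm, which is not published here; the function names min_window and proximity_score, and the scoring formula itself, are assumptions made for this example. The score is higher when all query words occur close together in the resource text.

```python
from collections import defaultdict

def min_window(tokens, query_terms):
    """Length in tokens of the shortest span containing every query term, or None."""
    needed = set(query_terms)
    if not needed:
        return None
    counts = defaultdict(int)  # occurrences of each needed term inside the window
    covered = 0                # how many distinct needed terms the window holds
    best = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] += 1
            if counts[tok] == 1:
                covered += 1
        # Shrink the window from the left while it still covers every term.
        while covered == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            left_tok = tokens[left]
            if left_tok in needed:
                counts[left_tok] -= 1
                if counts[left_tok] == 0:
                    covered -= 1
            left += 1
    return best

def proximity_score(tokens, query_terms):
    """Hypothetical score: 1.0 when the terms are adjacent, 0.0 when one is missing."""
    window = min_window(tokens, query_terms)
    if window is None:
        return 0.0
    return len(set(query_terms)) / window

tokens = "fast search engine for thematic search".split()
print(proximity_score(tokens, ["thematic", "search"]))
print(proximity_score(tokens, ["fast", "thematic"]))
```

A real ranking function would combine such a proximity term with morphology and the other factors described above; the sketch isolates the distance component only.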

Morphological analysis algorithms based on the original grammar kernel can normalize words, i.e. find their basic form. For an unknown word a hypothetical dictionary paradigm is constructed, so that the word can be analyzed and its forms generated in the same way as for known words.
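The idea of normalization with a fallback for unknown words can be sketched as follows. This is a toy illustration in Python using a few English word forms, not the Russian-language kernel described above; the table KNOWN_FORMS, the list SUFFIX_RULES and the function normalize are all hypothetical names introduced for the example.

```python
# Toy dictionary of known word forms and their basic forms (assumed data).
KNOWN_FORMS = {"sites": "site", "indexes": "index", "searched": "search"}

# Suffix rules used to guess a basic form for words missing from the dictionary.
# Order matters: longer, more specific suffixes are tried first.
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("ing", ""), ("ed", ""), ("s", "")]

def normalize(word):
    """Return (basic_form, is_known). Unknown words get a hypothetical basic form."""
    w = word.lower()
    if w in KNOWN_FORMS:
        return KNOWN_FORMS[w], True
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least three letters to avoid over-stripping.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)] + replacement, False
    return w, False

print(normalize("Sites"))
print(normalize("crawlers"))
```

A full paradigm model would also generate the other forms of the guessed word; the sketch shows only the lookup and the guess.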

At the present stage the search engine supports many forms of queries, including logical expressions and expressions that specify the distances between searched words in resources. In addition, a set of filters controls which resources are included in the search results. These filters cover such attributes as language, text data source, document type and format, date of resource indexing, morphological precision, similar resources and resources from one site.

This release of the search system implements the following concepts: infinite batch indexing; a morphological linguistic kernel working with word paradigms; an index system that gives optimized access to resource data according to the type of source the text fragment is taken from, as well as the rank and contextual semantic characteristics of a word; a universal data repository subsystem; and an extended query interface.

Particular mention should be made of the division of resource data into so-called significant and negligible textual components. Everyone has faced the problem of "white noise" when searching for common words on both global and thematic search systems. Such a search returns a huge number of results that contain the searched words, often placed very close to each other. The main problem for users is that those words may come from parts of a site such as menus, repeated banner fragments and other insignificant inclusions. Users therefore have to view many pages of results to find a proper match, discarding irrelevant resources along the way. The implemented subsystem divides resource text into these components and indexes them individually, based on analysis of the source of the textual information, the document markup and syntax-oriented factors of sentence construction. The option of searching these components separately makes it possible to narrow the semantic focus of a search significantly.
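The division into significant and negligible components can be illustrated with a deliberately simplified sketch in Python. It assumes two heuristics that stand in for the markup and syntax analysis mentioned above: a text block repeated on most pages of a site is treated as negligible (menus, banners), and so is a block that does not look like a sentence. The function split_components and its parameters repeat_ratio and min_words are hypothetical.

```python
from collections import Counter

def split_components(pages, repeat_ratio=0.5, min_words=5):
    """pages: one list of text blocks per page of a site.

    Returns a (significant, negligible) pair of block lists for each page.
    """
    page_count = len(pages)
    seen_on = Counter()  # number of pages on which each block appears
    for blocks in pages:
        for block in set(blocks):
            seen_on[block] += 1

    result = []
    for blocks in pages:
        significant, negligible = [], []
        for block in blocks:
            repeated = page_count > 1 and seen_on[block] / page_count > repeat_ratio
            sentence_like = (len(block.split()) >= min_words
                             and block.rstrip().endswith((".", "!", "?")))
            if repeated or not sentence_like:
                negligible.append(block)
            else:
                significant.append(block)
        result.append((significant, negligible))
    return result

pages = [
    ["Home | News | Contact", "The engine indexes a fixed list of thematic sites."],
    ["Home | News | Contact", "Filters narrow results by language and document format."],
]
for significant, negligible in split_components(pages):
    print(significant, negligible)
```

Indexing the two lists separately is what lets a query be restricted to the significant text only.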

Basic implementation of the system. The basic implementation has a set of functions that may be of interest to IT managers and OEM partners. The system is a flexible platform for "black box" type solutions. All functions are built into the user - infrastructure - data source chain, which makes it possible to support growing needs in the search and processing of information.

In addition to all of the above, the system's user interface offers such options as an extended set of highlighted fragments containing the searched words, taken from the resource repository; a full-text view of a resource in "read as a book" mode; and lists of the search history and of resources selected from previous searches.

The system can index documents in HTML, XML, RSS, PDF, DOC, PPT, XLS and TXT formats, and can extract the textual component from arbitrary binary content.

The set of algorithms that detect similar resources and copies of resources, and that optimize the indexing process, is being constantly improved.
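The system's own detection algorithms are not described here. As a point of reference, one common technique for finding copies and near-copies is to compare sets of word shingles with the Jaccard coefficient; the Python sketch below shows that technique only, and the names shingles, jaccard and is_near_duplicate, as well as the 0.8 threshold, are assumptions.

```python
def shingles(text, k=3):
    """Set of all runs of k consecutive words in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Share of common elements between two sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(text_a, text_b, threshold=0.8, k=3):
    """True when the two texts share at least `threshold` of their shingles."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold

original = "the engine indexes a fixed list of thematic sites"
copy = "The engine indexes a fixed list of thematic sites"
other = "filters narrow results by language and document format"
print(is_near_duplicate(original, copy))
print(is_near_duplicate(original, other))
```

On large collections the pairwise comparison is usually replaced by hashing schemes that approximate the same measure.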

Interface and scalability. The search engine can be adapted to index any data types, to organize fuzzy search and to support more types of information sources. The engine can easily be used both for its stated purpose, i.e. indexing web resources, and for generating library catalogs and reference lists, news search systems and thematic information portals. In each case it retains its major qualities: a high refresh rate, guaranteed relevancy of results and good search speed on collections ranging from 100,000 to tens of millions of resources.

The primary concept of horizontal scalability makes it possible to extend the system by increasing the number of both repositories and search engines. The system is built as a fully distributed one and consists of a set of strictly specialized applications.
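One consequence of this design is that a query can be answered by several search engines at once, with their partial results merged into a single ranked list. The Python sketch below shows only that merge step under simple assumptions: each engine returns (score, resource_id) pairs with comparable scores, and the function name merge_results is hypothetical.

```python
import heapq

def merge_results(shard_results, limit=10):
    """Merge per-engine result lists into the top `limit` hits by score.

    shard_results: iterable of lists of (score, resource_id) pairs,
    one list per search engine instance.
    """
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(limit, all_hits, key=lambda hit: hit[0])

engine_a = [(0.9, "a"), (0.2, "b")]
engine_b = [(0.7, "c")]
print(merge_results([engine_a, engine_b], limit=2))
```

Adding a repository or a search engine then means adding one more list to the merge, without changing the query interface.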

The system's interface makes it possible to group topical repositories so that a single search runs across all of them. Moreover, the system of design templates for displaying search results can be freely extended and customized, both for every thematic group of sites and for each site registered in the system.

The development team continues to enhance all components of the system. In the near future we are going to implement a subsystem for automatic categorization of resources. Building on the division into significant and negligible parts, this subsystem will provide an interface that groups resources by contextual subject. Dynamic recalculation of this attribute will help users divide search results into well-ordered groups, and can also be used to reveal the range of contextual topics covered by a single site or by a group of sites.

About the development team. Our team is a small group of creative specialists in fast data access systems and structures, search algorithms, and system programming of distributed technologies and network solutions based on TCP/IP.