Query Formulation and Expansion


Context

Query formulation converts the query that the user enters into a query configured specifically for the search engine in use. It is a fundamental component of any search engine, because queries entered by users are rarely in the optimal form for retrieving the best results. As part of the search engine for this system, each query is therefore reformulated so that it captures the user’s intent more accurately and is better suited to the search engine.


Query expansion is a technique for improving information retrieval, and it can be applied to different languages and objects. It is useful for this problem because the user is often uncertain of their exact needs. If a query would normally return no results, query expansion tries to return results that are similar to the initial query. Expansion enhances the exploratory aspect of the search by increasing recall (the proportion of relevant results that are returned). Although this decreases the precision of the search (the proportion of returned results that are relevant), that is not a major issue here because the search is exploratory.
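The trade-off between recall and precision can be illustrated with a small worked example. This is only a sketch: the document IDs and the helper name precision_recall are invented for illustration and do not come from the system's data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, invented purely for illustration.
relevant = {1, 2, 3, 4}
original_results = {1, 2}              # results for the narrow, unexpanded query
expanded_results = {1, 2, 3, 5, 6, 7}  # results after expansion

print(precision_recall(original_results, relevant))  # (1.0, 0.5)
print(precision_recall(expanded_results, relevant))  # (0.5, 0.75)
```

In this made-up case the expanded query finds one more relevant document, so recall rises, but it also returns three irrelevant ones, so precision falls.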


Background information relevant to query expansion and formulation can be found in the literature review, as well as in the background section of the final report.


Goals

The main goal of the system was to formulate the query so that it retrieves the best results possible. To do this, the formulation stage required several techniques: tokenization, spell checking, stop-word removal, singularizing, and stemming. All of these techniques help to retrieve better results from the database. We also wanted to detect any city name that the user entered, remove it from the query, and send it to the database as a possible city that the user associates with the remaining terms. After formulation, we wanted to expand the query with additional descriptive words so that queries retrieve a wider variety of results.
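The formulation steps listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual code: the word lists, the function names, the position of city extraction in the step order, and the 0.85 similarity cutoff are all hypothetical. Spell checking here uses the standard library's difflib, and the singularizer and stemmer are deliberately crude stand-ins for real ones (the stemmer is not the Porter algorithm).

```python
import difflib
import re

# Hypothetical word lists; a real system would load far larger ones.
VOCABULARY = ["beach", "hiking", "mountain", "museum", "restaurant", "wine"]
STOP_WORDS = {"a", "an", "and", "in", "near", "of", "the", "to", "with"}
CITIES = {"durban", "johannesburg", "pretoria"}


def tokenize(query):
    """Lowercase the query and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", query.lower())


def extract_cities(tokens):
    """Split tokens into (non-city tokens, single-word city names)."""
    cities = [t for t in tokens if t in CITIES]
    rest = [t for t in tokens if t not in CITIES]
    return rest, cities


def spell_correct(token):
    """Replace an unknown token with its closest vocabulary word, if any."""
    if token in STOP_WORDS or token in VOCABULARY:
        return token
    matches = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.85)
    return matches[0] if matches else token


def singularize(token):
    """Rough English singularizer covering a few common plural endings."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith(("ches", "shes", "sses", "xes")):
        return token[:-2]
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def stem(token):
    """Crude suffix stripping; not the Porter algorithm."""
    for suffix in ("ing", "ed", "ly"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def formulate(query):
    """Run every formulation step and return (terms, cities)."""
    tokens, cities = extract_cities(tokenize(query))
    tokens = [spell_correct(t) for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [stem(singularize(t)) for t in tokens]
    return tokens, cities


print(formulate("Hikng and beaches near Durban"))
# (['hik', 'beach'], ['durban'])
```

The misspelt "Hikng" is corrected to "hiking" before stemming, the stop words are dropped, and "Durban" is returned separately so that it can be sent to the database as a city. Multi-word city names would need matching on token sequences, which this sketch does not attempt.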

The query processing part of the system uses the pipe and filter architecture shown above. The plan was for the query to be sent from the source (the UI) through the first pipe to a filter, which performs one of the formulation techniques and then passes the result through the next pipe to the next filter. This continues until all the formulation and expansion techniques have been applied, after which the formulated query is sent to the database to retrieve results.
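A pipe and filter arrangement of this kind can be sketched as a list of functions applied in order, where each function is a filter and passing its output to the next function plays the role of a pipe. The filter names and the stop-word list below are hypothetical and only illustrate the pattern.

```python
from functools import reduce
from typing import Callable, List

# A filter takes a list of tokens and returns a new list of tokens.
Filter = Callable[[List[str]], List[str]]

STOP_WORDS = {"a", "an", "and", "the"}  # hypothetical, deliberately tiny


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def drop_stop_words(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in STOP_WORDS]


def run_pipeline(tokens: List[str], filters: List[Filter]) -> List[str]:
    """Feed the output of each filter into the next one, in order."""
    return reduce(lambda data, f: f(data), filters, tokens)


print(run_pipeline("The Best Museums".split(), [lowercase, drop_stop_words]))
# ['best', 'museums']
```

Because every filter has the same input and output type, filters can be added, removed, or reordered without changing the others, which suits an iterative development cycle.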

With regard to query expansion, we decided to test two methods to find out which would provide better results: local context analysis and semantic query expansion. Local context analysis uses words from a local database of words to find expansion terms that are similar to the queried terms. Semantic query expansion returns all the synonyms of words that are linked to a certain term.
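The difference between the two methods can be sketched as follows. Both functions are simplified illustrations under stated assumptions: the synonym table and the documents are invented, and the local context function is reduced to a plain co-occurrence count over a small local collection rather than the full local context analysis method.

```python
from collections import Counter

# Hypothetical synonym table standing in for a thesaurus.
SYNONYMS = {"beach": ["shore", "coast", "strand"], "hike": ["walk", "trek"]}

# Hypothetical local collection of documents.
DOCUMENTS = [
    "beach surf lessons",
    "beach surf shop",
    "mountain hike trail",
    "beach sunset",
]


def semantic_expand(terms):
    """Append every listed synonym of every query term."""
    expanded = list(terms)
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


def local_context_expand(terms, documents, top_n=2):
    """Append the top_n words that co-occur most often with the query terms."""
    query = set(terms)
    counts = Counter()
    for doc in documents:
        words = set(doc.lower().split())
        overlap = len(words & query)
        if overlap:
            for word in words - query:
                counts[word] += overlap
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(terms) + [word for word, _ in ranked[:top_n]]


print(semantic_expand(["beach"]))
# ['beach', 'shore', 'coast', 'strand']
print(local_context_expand(["beach"], DOCUMENTS))
# ['beach', 'surf', 'lessons']
```

The semantic version adds every synonym regardless of what the collection contains, whereas the co-occurrence version only adds words that actually appear alongside the query terms in the local data and limits them to the strongest few.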

The final goal was for the system to perform efficiently, because users tend to get frustrated when they have to wait a long time for a search engine to return results.


Results

The resulting system performs all of the formulation techniques mentioned previously. Testing the two expansion techniques showed that local context analysis was more effective, because it returned more accurate results. Because semantic query expansion did not narrow down its results, the terms it sent to the database along with the original query were off topic, and the precision of the results was therefore low. The system was optimised to perform as efficiently as possible; the results of the tests are shown in the final report.

A sequence diagram of the resulting system is shown below. We used an iterative software development cycle: blue boxes, blue messages, and blue text show the second iteration of the system, while red text shows functions from the first iteration that are no longer part of the system. The black and blue functions therefore show how the system currently operates.



Contact me

Luqmaan Salie

Luqmaan.Salie@alumni dot uct dot ac dot za