
Kaukolu Search and Document Summarization

This document describes Kaukolu's advanced search and document summarization components for both users and developers.

User Documentation

In this section we describe how the end user can leverage Kaukolu's paragraph retrieval and document summarization features to carry out sophisticated information management tasks. The following text is split into two parts: the first part describes the search interface, which allows you to retrieve Wiki pages at paragraph-level granularity. We then turn our attention to the document summarization interface, which allows you to aggregate search results into new wiki pages.

Paragraph Retrieval

In order to perform paragraph retrieval on the Wiki page repository, you will use Kaukolu's new search interface. You can reach the search interface by following the "Advanced search" link in Kaukolu's main menu. You should see a page similar to this:

You can see a three-paned view ("Active Constraints", "Facets" and "Restriction Values") and two tabs which allow switching between the search and summarization interfaces. Ignore the second tab for now; we will discuss it later. The three panels are used to construct your search query. Our search interface follows a facet-oriented approach: for some resource, you first select a facet and are then presented with a list of possible restriction values for that facet to choose from. When we talk about resources, we mean either an RDFS resource or a filter (we will talk about filters in a second). A facet is a distinct, descriptive characteristic of this resource which can be used to constrain it. You constrain a resource by applying a restriction value to a facet. For example, you could apply the restriction value "MyWikiPage" to the "Title" facet of a wiki page resource.

Basic Query Construction

In order to pose a query, you start off by creating what we call a filter. A filter is the basic entity that describes by what kind of resource you want to constrain your search. For Kaukolu, we ship two default filters: A page filter and an annotation filter. The filter paradigm is a very flexible one and can easily be extended. Refer to the developer documentation to learn how to create your own filters for Kaukolu search.

To select a filter, click the "Filter" drop-down button at the bottom of the first panel. A menu will pop up showing all available filters that are registered with the system. For now, select the "Page Filter" entry. The interface will reflect this action by adding a "Page Filter" element to the "Active Constraints" panel and by updating the "Facets" panel with a list of available facets for this filter. Select the "Title" facet to indicate that you want to constrain your search to wiki pages having a specific title. The "Restriction Values" panel will update and present a list of available restriction values for the selected facet. Your view should now look something like this:

Note that we group possible restriction values alphanumerically. This not only leads to a more concise list of values, it also leads to better performance, because only the values of the group you are expanding are retrieved from the server. Also be aware that facet resolution for a selected resource happens dynamically. In particular, facets are not hard-coded (except for filters), and even then, facets for which no restriction values exist are removed from the user interface. This reduces the likelihood of running into dead ends by selecting facets that don't help to yield the desired results. In other words, it is entirely possible that, e.g. for the annotation filter, no facets are displayed at all, simply because there aren't any annotations (and therefore no restriction values) to choose from.
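
The grouping described above can be pictured roughly as follows. This is only a minimal sketch and not Kaukolu's actual code: the class ValueGrouping and its method groupByInitial are hypothetical names, and the sketch assumes that values are bucketed by their (upper-cased) first character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Kaukolu: buckets restriction value labels
// by their first character so that a client can request one bucket at a time.
final class ValueGrouping {

    // Keys are sorted; labels keep their input order within each bucket.
    static Map<String, List<String>> groupByInitial(List<String> labels) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String label : labels) {
            if (label == null || label.isEmpty()) {
                continue; // nothing to group by
            }
            String key = label.substring(0, 1).toUpperCase(Locale.ROOT);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(label);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("About", "OneMinuteWiki", "archive", "2007Notes");
        // Only the group keys would be shown first; a group's values follow on expansion.
        System.out.println(groupByInitial(titles).keySet());
    }
}
```

In such a scheme, the user interface would initially need only the group keys and could ask the server for the values of a single group once the user expands it.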

Now, select the desired page title (in this case, "OneMinuteWiki") and hit "Apply". You will notice that the formerly empty "Search Results" box now displays a list of paragraphs that match the query.

You see that search results are grouped by Wiki pages that qualified for the query. However, what really happens here is that single paragraphs qualified for the search result which are then grouped according to the Wiki page they belong to. Clicking on a result box opens the corresponding page in a new window and scrolls to the paragraph. In this example, all paragraphs that belong to a Wiki page called "OneMinuteWiki" have been retrieved. So, although it may occur to you that the retrieval is on a per-page basis (similar to using JSPWiki's built-in full text search), it is truly a paragraph-oriented retrieval. We will see another example in a minute, where not all paragraphs of a Wiki page qualify for the search query. You may have also noticed the checkboxes in the upper-right corner of a result paragraph. These are used for document summarization, which we will discuss later, so ignore them for now.

Of course you can apply as many facets/restriction values as you like in order to further constrain your query. We will demonstrate a more complicated example in the following section. Just remember that each new facet/restriction value pair poses an additional constraint on the already active query. You can therefore think of a filter as a series of AND-evaluated constraints, meaning that partial results are combined using set intersection (we also allow other set operators; more on that in a second). Please note that we currently have no way of preventing the user from posing queries that lead to an empty result set (other faceted interfaces usually prevent this sort of dead-end query). For example, choosing two different restriction values for the "Title" facet will always lead to an empty result set, as a wiki page never has more than one title.
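
To make the AND semantics concrete, here is a minimal sketch of combining partial results by set intersection. The class AndConstraints and its methods are hypothetical and exist only for illustration; how Kaukolu represents partial results internally is not shown here.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: combines the partial results of a filter's constraints.
final class AndConstraints {

    // Returns the elements contained in every partial result (set intersection).
    // With no constraints at all, this sketch simply returns an empty set.
    static Set<String> intersectAll(List<Set<String>> partialResults) {
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i < partialResults.size(); i++) {
            if (i == 0) {
                result.addAll(partialResults.get(i));
            } else {
                result.retainAll(partialResults.get(i));
            }
        }
        return result;
    }

    static Set<String> setOf(String... values) {
        return new LinkedHashSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        // Two different titles: no page can satisfy both, so nothing remains.
        System.out.println(intersectAll(Arrays.asList(setOf("About"), setOf("OneMinuteWiki"))));
        // A title constraint plus a second constraint matching two pages.
        System.out.println(intersectAll(Arrays.asList(setOf("About"), setOf("About", "Main"))));
    }
}
```

The first call illustrates the dead-end case mentioned above: two different titles can never intersect, so the result set is empty.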

Advanced Query Construction

In the last few paragraphs, we went through the basics of query construction. We will now focus on some more sophisticated features of our search interface, such as operators, wildcards and full-text search.

  • Operators
    Unlike many other faceted interfaces, our search interface supports the use of set operators. Currently, the only supported operator is the OR-operator, which performs a set union. In order to create a constraint using the OR-operator, click the "Operator" drop-down menu at the bottom of the "Active Constraints" panel. A menu will pop up showing all available operators. Select the OR operator. A new operator instance will be inserted under the currently active resource (e.g. a filter) and will automatically be selected. An operator always inherits the properties of its parent resource, so for an OR-operator inserted under a "Page Filter" node, all facets available for a page filter will be shown. You can now aggregate as many constraints under the operator as you wish, by following the facet/restriction value selection procedure outlined in the previous paragraphs.
  • Wildcards
    Often you do not want to constrain a resource directly by some discrete restriction value, but rather by some characteristic of a restriction value itself. An example would be to search not for annotations having a certain type, but for annotations that have any type which in turn has a property with the given restriction value. These (arbitrarily) complex constraints are also called path expressions and are realized in our search interface using wildcards. Not all facets support wildcards as restriction values: wildcards can only be inserted for restriction values of type rdfs:Resource or rdfs:Class. To insert a wildcard, first select a facet with one of these types in its range; then, instead of selecting a distinct restriction value, click the "Wildcard" button at the bottom of the "Restriction Values" panel. A wildcard node will then be inserted under the active resource, and the facet panel will update to show a list of facets that apply to the type of the selected wildcard. You can now constrain this resource using a discrete value, or continue applying wildcards until you have reached the desired constraint "depth".
  • Full-text search
    While not really being an advanced feature, we'd like to mention here that it is also possible to perform a full-text search on a selected resource, if such a facet is available for that resource. For example, the "Page Filter" supports a full-text search facet in order to perform page content search. Full-text search facets only have one "restriction value", namely the search expression that will be used for evaluation. When selecting this kind of facet, the restriction value panel will update and show a single "dummy value", which you have to click. A dialog will open, asking you to enter a search expression (you can enter a Lucene search query here). Clicking OK will bring you back to the search interface, and the single restriction value will have its label changed to whatever expression you provided. You can then apply this "value" like any other restriction value. (Note that unlike all other constraints, a full-text search never operates on paragraph level. So, although search results will still be displayed paragraph-wise, even when the search terms only occur in a single paragraph, all paragraphs from the containing page will be returned. This is because we are using the native JSPWiki search backend, which only operates on complete pages.)
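
The way an OR-operator nests under a filter can be sketched as a small tree of nodes. All types below (QueryNode, FixedConstraint, OrNode, FilterNode, QueryTreeDemo) are hypothetical and only loosely modelled on the tree shown in the "Active Constraints" panel; the sketch assumes that each leaf already knows its matching paragraph IDs.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical query tree node: every node evaluates to a set of paragraph IDs.
interface QueryNode {
    Set<String> evaluate();
}

// Leaf: a facet/restriction value pair whose matches are already known.
final class FixedConstraint implements QueryNode {
    private final Set<String> matches;

    FixedConstraint(String... matches) {
        this.matches = new LinkedHashSet<>(Arrays.asList(matches));
    }

    public Set<String> evaluate() {
        return new LinkedHashSet<>(matches);
    }
}

// OR-operator: set union of all child results.
final class OrNode implements QueryNode {
    private final List<QueryNode> children;

    OrNode(QueryNode... children) {
        this.children = Arrays.asList(children);
    }

    public Set<String> evaluate() {
        Set<String> union = new LinkedHashSet<>();
        for (QueryNode child : children) {
            union.addAll(child.evaluate());
        }
        return union;
    }
}

// Filter: set intersection of all child results (the implicit AND).
final class FilterNode implements QueryNode {
    private final List<QueryNode> children;

    FilterNode(QueryNode... children) {
        this.children = Arrays.asList(children);
    }

    public Set<String> evaluate() {
        Set<String> result = null;
        for (QueryNode child : children) {
            Set<String> partial = child.evaluate();
            if (result == null) {
                result = new LinkedHashSet<>(partial);
            } else {
                result.retainAll(partial);
            }
        }
        return result == null ? new LinkedHashSet<String>() : result;
    }
}

final class QueryTreeDemo {
    // (Title = About OR Title = OneMinuteWiki) AND one further constraint.
    static Set<String> example() {
        QueryNode titles = new OrNode(
                new FixedConstraint("About#p1", "About#p2"),
                new FixedConstraint("OneMinuteWiki#p1"));
        QueryNode other = new FixedConstraint("About#p2", "OneMinuteWiki#p1", "Other#p4");
        return new FilterNode(titles, other).evaluate();
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

Here the OR node first unions the paragraphs of both titles, and the enclosing filter then intersects that union with the remaining constraint.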

Finally, consider the following, more complex scenario, which makes use of wildcard expressions: We want to find all paragraphs of the Wiki page "About" which are annotated using "Rating" annotations with a rating value of "10" and which were created on Nov. 15th 2007 by user "Anonymous":

This concludes our introduction to the search interface. We will now turn to the document summarization component and how you can use it to dynamically create wiki pages from your search results.

Document Summarization

This section explains how you can use Kaukolu's new document summarization interface to dynamically create Wiki pages from a set of paragraphs in your search results.

We assume that you have already performed a query which yielded at least one paragraph in its result set, otherwise you won't be able to construct a new document. First, select the second of the two tabs of the search interface, the one that is labeled "Create New Document". The view will change from the three-paned layout to an empty scaffolding area with a series of buttons at the bottom. Ignore these for now.

From your search results, select those paragraphs that are of interest to you (those paragraphs you want to appear in your new document) by clicking the checkbox in the upper right corner of a paragraph. You may select as many paragraphs as you wish. Please note that for efficiency reasons, we only show the first five matching paragraphs of a single page directly. If more than five paragraphs qualified, you will have to click the link labeled "Show X more paragraphs...", where X is the number of further paragraphs that qualified and have been held back. Moreover, if many paragraphs from many different pages qualified, you can navigate between single page results using the "Next", "Previous" and "Top" links in the upper right corner of a search result box. The following image depicts the construction panel after two paragraphs have been selected from the search results.

When you are done selecting the desired paragraphs, you will notice that these paragraphs have appeared in the document construction pane at the top of the page. The paragraphs appear in the order in which you selected them from the search results. This ordering also determines the paragraph order when constructing a new page from your selection. If you want to change this order, e.g. move the third paragraph down to the bottom (maybe because you think it is not as important as the others), first select the paragraph by clicking the checkbox in its upper right corner. The paragraph will appear highlighted, indicating that it is ready for an action to be performed upon it. You can then use the "Move Up" and "Move Down" buttons to move the paragraph around. Note that this is also possible with more than one paragraph selected at once.

Sometimes a paragraph you selected from the search results may seem out of context (unless you also selected its direct neighbors): sentences preceding or following it that are vital for understanding the paragraph did not qualify for the query and therefore cannot be selected. For this situation, we support expanding and shrinking single paragraphs so that they also include whatever appears before or after them on the original wiki page. To expand a paragraph, select it and click "Expand". To shrink it, click "Shrink". Note that paragraphs cannot be expanded or shrunk beyond certain boundaries: the upper boundary for expansion is the page content itself; the lower boundary for shrinking is the paragraph as it was when selected from the search results. The following two pictures show a paragraph before and after a double expansion.

Before expansion:

After two expansions:
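
The expansion and shrinking boundaries described above can also be expressed as a small model. This is a sketch under assumptions, not the actual implementation: the class ExpandableParagraph is hypothetical, and it assumes that every click on "Expand" adds one neighbouring paragraph on either side where one exists.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of one paragraph in the construction panel.
// Assumption: each "Expand" adds one neighbouring paragraph on either side.
final class ExpandableParagraph {
    private final List<String> page; // all paragraphs of the source page, in order
    private final int origin;        // index of the paragraph picked from the results
    private int radius = 0;          // number of expansions currently applied

    ExpandableParagraph(List<String> page, int origin) {
        if (origin < 0 || origin >= page.size()) {
            throw new IllegalArgumentException("origin outside page");
        }
        this.page = page;
        this.origin = origin;
    }

    int start() {
        return Math.max(0, origin - radius);
    }

    int end() {
        return Math.min(page.size() - 1, origin + radius);
    }

    // Upper boundary: stop once the whole page is included.
    boolean expand() {
        if (start() == 0 && end() == page.size() - 1) {
            return false;
        }
        radius++;
        return true;
    }

    // Lower boundary: never shrink below the originally selected paragraph.
    boolean shrink() {
        if (radius == 0) {
            return false;
        }
        radius--;
        return true;
    }

    List<String> content() {
        return page.subList(start(), end() + 1);
    }

    public static void main(String[] args) {
        ExpandableParagraph p = new ExpandableParagraph(Arrays.asList("a", "b", "c", "d"), 1);
        p.expand();
        p.expand();
        System.out.println(p.content()); // the whole page after two expansions
    }
}
```

A call to expand() or shrink() that returns false corresponds to the situation in which the respective button has no further effect.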

The construction panel also has buttons to select all paragraphs, clear all selections or remove the selected paragraphs from the construction panel. The latter will not remove anything from the search results, so you can re-add formerly removed paragraphs at any time.

If you think the paragraphs have the correct size/content and order, you can enter a name for the new page and finally click "Create" in order to instruct the server to create the new wiki page for you. You are then asked whether or not you want to visit the page or keep working in the construction area. Note that leaving the website will clear all previous queries, search results and paragraph selections, as this information is not (yet) stored in the session.

For each paragraph that is included in the newly constructed wiki page, a provenance annotation is created which allows you to keep track of where each paragraph originally came from (Web URL or internal Wiki URI). These are just normal Kaukolu annotations; they appear on the wiki page like any other annotation. You can click them in order to highlight the paragraph they are attached to.

This concludes our user guide. If you are interested in the technical aspects and want to extend Kaukolu's search functionality, we suggest you also read the next section, which covers developing filters among other things.


Developer Documentation

This section will walk you through the architectural aspects of the search and summarization components and give you a jump start for developing custom search filters.

Technology

Before diving into the source code, I'd like to give a quick overview of the technology we use. Because all this is part of Kaukolu, which itself is based on the JSPWiki software, all server-side classes are implemented in Java. For the web interface we use a JSP as the container and client-side JavaScript for GUI logic and server communication. As to the JavaScript GUI, we make heavy use of the Dojo Toolkit in its 0.4.3 incarnation. Data transport between client and server is realized using asynchronous JavaScript and XML (AJAX). In layer terms, here is what we use on each layer:

Web layer (client): Plain JavaScript plus Dojo 0.4.3 for user interface functionality and server communication

Web layer (server): Java servlets, communicating with the client (Web browser) and the application layer objects

Application layer (server): POJOs implementing search and summarization logic

Data layer (server): Sesame 2.0 and custom POJOs (we have a DAO for storing/retrieving annotations)

Web Layer

The user interface is a central part of both the search and summarization components. As said above, we're making heavy use of the Dojo Toolkit to realize the GUI logic and server connectivity. All JavaScript objects we use are defined in a custom namespace in order to avoid potential name clashes. This means, we only define one single object that is a direct member of the host environment's global object and which is used to further split up our "packages". The namespace hierarchy pretty much follows the way we split up our code across the source files. Here are the namespaces and their responsibilities:

  • facetedsearch
    Can be considered the "entry point" script. Here we define some "globals" (i.e., JavaScript objects global to our own namespace, not global to the website), a couple of prototype objects ("classes"), and the initialization procedure that kicks off event handling, retrieves the initial filter list and so forth.
  • facetedsearch.event
    Contains all functionality related to handling events that are fired in the user interface, e.g. button-click handlers.
  • facetedsearch.server
    Contains all functionality involving communication with the server-side objects, such as retrieving facets, restriction values or dispatching search queries for evaluation.
  • facetedsearch.utility
    Contains utility functions for various purposes.
  • facetedsearch.constants (defined in AdvancedSearchHeader JSP)
    Definition of JavaScript "constants" (for the purists, yes, there are no real constants in JavaScript) that are used throughout the scripts. These constants have to be defined in a JSP, because we rely on JSP and JSTL functionality to initialize at least some of them (server URLs, for example). This JSP also defines the global namespace object called 'facetedsearch' and its alias object 'fsearch' (for all those lazy types).

The HTML code for layout and widgets is defined in the AdvancedSearchContent JSP.

On the server side, we have a couple of Java Servlets that communicate with our scripts. These servlets take HTTP requests coming in from the Web browser, usually do some form of (de)serialization (we make heavy use of JSON) and talk to the manager objects in the application layer to resolve the requests.

Application Layer

The application layer classes are all POJOs or JavaBeans and implement the actual logic like query evaluation and document summarization. All Java packages are sub-packages of the de.opendfki.kaukoluwiki meta-package. The search functionality is split up into several packages, from which the following are of particular importance:

  • facetedsearch.managers
    Contains the manager classes which are invoked by the servlets. They trigger the application logic like evaluation.
  • facetedsearch.providers
    Contains all provider classes for each kind of filter, like AnnotationFilterFacetProvider, PageFilterFacetProvider, AnnotationFilterRestrictionValueProvider, PageFilterRestrictionValueProvider and so on. Provider classes serve resources such as facets and restriction values. They interact with the data layer objects to retrieve resources from the store and map them to our search paradigm and are used by the manager classes to answer servlet requests.
  • facetedsearch.evaluators
    Contains the filter evaluator classes. These classes are responsible for evaluating a filter QueryElement to a final filter result set which is used in query resolution.
  • facetedsearch.model
    Contains all JavaBean classes used to carry data required for representing and evaluating search queries. Examples are Query, Filter, Operator, Constraint, and so on. It is interesting to note that these objects correspond 1:1 to objects used in the client-side JavaScript code. This is because we use Apache Commons DynaBeans, the EzMorph bean morphing facilities and JSON-lib to morph JavaBean objects from and to their JSON counterparts. This is very convenient, because even when coding the JavaScript side, you can basically refer to the JavaDoc of the server objects to see how they are structured. Moreover, this eliminates potential errors that could occur when performing this mapping manually, especially when the server-side classes change.

As for document summarization, the implementation is very lightweight and there is only one package with a handful of classes (see trunk/kaukolu/src/de/opendfki/kaukoluwiki/dynamicdocs), most notably perhaps the ParagraphExtractor, which is responsible for splitting up wiki documents into lists of paragraphs that carry unique identifiers.
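
To illustrate the idea of splitting a page into identifiable paragraphs, here is a minimal sketch. It is not the ParagraphExtractor itself: the class SimpleParagraphSplitter is hypothetical, and both the splitting rule (blank lines) and the ID scheme (page name plus a running number) are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a paragraph extractor; the real ParagraphExtractor
// may split and identify paragraphs differently.
final class SimpleParagraphSplitter {

    // Splits on blank lines and keys each paragraph by "<page>#p<n>" (n starts at 1).
    static Map<String, String> split(String pageName, String wikiText) {
        Map<String, String> paragraphs = new LinkedHashMap<>();
        int counter = 0;
        for (String block : wikiText.split("\\n\\s*\\n")) {
            String text = block.trim();
            if (text.isEmpty()) {
                continue; // skip leading or trailing whitespace-only blocks
            }
            counter++;
            paragraphs.put(pageName + "#p" + counter, text);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String page = "First paragraph.\n\nSecond paragraph,\nstill the second.\n\n\nThird.";
        for (Map.Entry<String, String> e : split("About", page).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().replace("\n", " "));
        }
    }
}
```

Stable identifiers of this kind are what make it possible to refer to a single paragraph in search results and later copy it into a new page.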

You most probably noticed a pattern in how we split up code, one that is very common for Web applications: First we have the servlets, that do the Web-ish stuff like handling requests and (de)serializing data from and to the JavaScript Object Notation (JSON). To actually answer a request, a servlet first invokes the respective functionality on some matching manager object. For example, to answer a query, the QueryServlet deserializes a query object encoded in JSON, morphs this JSONObject into a corresponding Query bean and passes this bean to QueryManager for evaluation. QueryManager then interacts with other application and data layer classes to resolve the result set which is then passed back (JSON encoded of course) by the QueryServlet to the Web browser.

For more details on each of the above mentioned classes, please refer to the corresponding JavaDoc.

Filter Development

Filters are what gives the search interface its expressiveness. For this reason, the search infrastructure has been designed to be extensible, so that more filters can be implemented in order to add more possibilities for posing queries using the faceted search interface. This section first describes what steps you (in general) have to perform when developing a new filter. We then walk through an example where we develop a new filter for Kaukolu that allows searching by user context.

Here are the steps you have to perform when implementing a new filter (not necessarily in this order):

  1. Write a FacetProvider implementation

    You will have to implement the FacetProvider interface (or subclass AbstractFacetProvider) so that the system e.g. knows what facets are initially available when the user selects your filter in the user interface. How you actually implement this is entirely up to you. You could hard code a list of facets in your provider, probe a Sesame 2 repository, read from a text file and so on. The only requirement is that for each facet in the list of facets returned by getFacets() at least one valid restriction value must exist. Facets are served by providing the caller with a list of FacetDescriptors. A facet has a domain (the resource for which it is defined), a range (the type its restriction values must have), a unique ID (can again be a custom ID or a URI if the facet corresponds to an RDF property) and a label used to display it.

    You may have noticed that the getFacets() method takes a resource ID as its single argument. This identifier tells you for which resource facets have to be provided. It always corresponds to whatever resource you register your provider for with the FacetManager (see step 4). For filters, this is a filter ID; for other resources it is usually an RDF type URI (there aren't really any other things for which it makes sense to retrieve facets). When implementing a filter facet provider, you can simply ignore this; in other cases, this information is vital to process the request (see next paragraph).

    Please note that you don't have to write a facet provider for every possible resource that could potentially appear in a query (for example, rdfs:Class instances). If the FacetManager cannot find a provider for an incoming resource, it automatically delegates to a DefaultFacetProvider which knows how to resolve facets for RDF types.

  2. Write a RestrictionValueProvider implementation

    Implementations of RestrictionValueProvider must be able to resolve requests for available restriction values for some given facet (described by a FacetDescriptor, see above). How these values are retrieved is, again, entirely up to you. Again, you can alternatively also subclass AbstractRestrictionValueProvider in order to benefit from some sane default behavior. A RestrictionValue consists of the actual value, the RDF type of that value, a label and a flag indicating whether this RV is a "long literal". Long literals can be handled separately by displaying a single search node instead of a list of all possible values.

    You have to provide two methods when serving restriction values: getTopLevelValues() and getChildValues(). This distinction is necessary because restriction values can form a hierarchy (e.g. for rdf:type facets). If you know you need no special behavior for either method, you can simply delegate to DefaultRestrictionValueProvider. As with a facet provider, you have to register your restriction value provider with the RestrictionValueManager for (at least) your new filter, so the system knows how to resolve restriction values for that filter.

  3. Write an AbstractFilterEvaluator implementation

    This is probably the most interesting part when implementing a filter. A filter evaluator must be able to take a Filter bean and resolve it to a set of strings. What these strings describe is of no concern for the evaluation logic; they may be wiki page URIs (as is the case for the page filter evaluator) or annotation URIs (as is the case for the annotation filter evaluator) or anything else that can somehow be resolved to a set of matching Paragraphs of a wiki page. This is important, because the last step in query evaluation is to combine the results of each active filter in the query to a set of paragraph beans qualifying for this query (see Filter.evaluate()).

    AbstractFilterEvaluator is an abstract base class and already defines the algorithm (using the template method pattern) for evaluating a filter; you only have to implement the sub-steps. Actually, on many occasions you don't even have to do that, because there is only one abstract method in AbstractFilterEvaluator that every filter has to implement on its own (resolveParagraphs() -- this method is always filter-specific and converts the filter result set to a set of matching page paragraphs). Wherever you need custom functionality, you must override AbstractFilterEvaluator's methods (evaluateDiscreteConstraint(), evaluateComplexConstraint() and evaluateOrOperator()). For more information about the evaluation algorithm, refer to the JavaDoc of AbstractFilterEvaluator. Once again, how you evaluate a filter in the end is up to you.

  4. Register your implementations with the system

    In order for the system to know about the new filter, you will have to register each implementation you coded in the previous steps. You always register an implementation with a matching manager class, i.e. you have to register your facet provider with the FacetManager, your restriction value provider with the RestrictionValueManager and your evaluator with the FilterManager. You do so by supplying a filter ID which uniquely identifies your filter as a resource that can be resolved. Filter IDs can be arbitrary in form, but we suggest you use something like de.mydomain.myfilters.MyCoolFilter. Just make sure it's unique. Your new filter will appear automatically in the user interface for selection. Every request made for that filter will be associated with it via its filter ID.

    At this point you may have noticed that the ID parameter of the managers' registry methods is named 'resourceId', not 'filterId' (except for the FilterManager, which only allows registering evaluators for filters). This is intentional, because although we only talked about filters in this guide, we actually allow registering providers for any RDF resource. For example, if the DefaultFacetProvider does not match your idea of how facets should be resolved for RDF types, you do not need to alter its code. Instead, you can write an entirely new provider class that does what you think is right and register it for each RDF type resource with the FacetManager. Whenever a request for facets for that particular resource then reaches the server, it will use your facet provider instead of the default provider. The same holds for restriction value providers.
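
The steps above can be sketched in a heavily simplified form. Note that none of the types below are Kaukolu's real classes: SketchFacetProvider, SketchFacetManager, SketchFilterEvaluator and MyCoolFilterSketch are hypothetical stand-ins, all method signatures are assumptions (facets are reduced to plain labels, constraints to a map), and the default behavior of evaluateDiscreteConstraint() is invented for this example. The sketch only illustrates two ideas from the text: a registry that falls back to a default provider, and an evaluator built on the template method pattern with resolveParagraphs() as the single mandatory step.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for a facet provider; facets are plain labels here.
interface SketchFacetProvider {
    List<String> getFacets(String resourceId);
}

// Registry with a fallback, mirroring the FacetManager/DefaultFacetProvider idea.
final class SketchFacetManager {
    private final Map<String, SketchFacetProvider> providers = new HashMap<>();
    private final SketchFacetProvider fallback;

    SketchFacetManager(SketchFacetProvider fallback) {
        this.fallback = fallback;
    }

    void register(String resourceId, SketchFacetProvider provider) {
        providers.put(resourceId, provider);
    }

    List<String> getFacets(String resourceId) {
        SketchFacetProvider provider = providers.get(resourceId);
        return (provider != null ? provider : fallback).getFacets(resourceId);
    }
}

// Template method: evaluate() fixes the algorithm, subclasses fill in the steps.
abstract class SketchFilterEvaluator {

    // constraints maps facet IDs to restriction values; entries are AND-combined.
    final Set<String> evaluate(Map<String, String> constraints) {
        Set<String> resources = null;
        for (Map.Entry<String, String> c : constraints.entrySet()) {
            Set<String> partial = evaluateDiscreteConstraint(c.getKey(), c.getValue());
            if (resources == null) {
                resources = new LinkedHashSet<>(partial);
            } else {
                resources.retainAll(partial);
            }
        }
        if (resources == null) {
            resources = new LinkedHashSet<>();
        }
        return resolveParagraphs(resources);
    }

    // Default step (an assumption of this sketch): the value names the resource.
    protected Set<String> evaluateDiscreteConstraint(String facetId, String value) {
        return Collections.singleton(value);
    }

    // The one step every filter must supply: map resources to paragraph IDs.
    protected abstract Set<String> resolveParagraphs(Set<String> resourceIds);
}

final class MyCoolFilterSketch {
    static final String FILTER_ID = "de.mydomain.myfilters.MyCoolFilter";

    static SketchFacetManager facetManager() {
        SketchFacetManager manager = new SketchFacetManager(id -> Arrays.asList("rdf:type"));
        manager.register(FILTER_ID, id -> Arrays.asList("Title", "Author"));
        return manager;
    }

    static SketchFilterEvaluator evaluator() {
        return new SketchFilterEvaluator() {
            @Override
            protected Set<String> resolveParagraphs(Set<String> resourceIds) {
                Set<String> paragraphs = new LinkedHashSet<>();
                for (String page : resourceIds) {
                    paragraphs.add(page + "#p1"); // pretend every page has one paragraph
                }
                return paragraphs;
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(facetManager().getFacets(FILTER_ID));
        Map<String, String> constraints = new LinkedHashMap<>();
        constraints.put("Title", "About");
        System.out.println(evaluator().evaluate(constraints));
    }
}
```

Registering a provider under a resource ID and falling back to a default keeps the lookup logic in one place, which is the property that allows a default provider to be replaced without altering its code.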

Design Thoughts

  • Operators should be applicable to filters as well, not only constraints
  • Right now, there is no mechanism that checks pro-actively whether certain combinations of constraints would lead to an empty result set (other faceted interfaces do that and remove these facets from the interface). This is particularly hard to achieve for us, because of the flexibility in the nature of the constraints we allow (operators, full-text search nodes etc.). Would be nice to have, but probably hard (if at all possible) to implement.
  • Full-text search nodes do not follow the paragraph-oriented retrieval approach of other constraints. This may seem irritating for the user. This is because JSPWiki search uses Lucene to index pages and searches on pages, not paragraphs.
  • Usability enhancement: We should allow range queries for numerical values (especially dates)
  • Usability enhancement: Tree nodes should be draggable (I really tried hard on this one, but Dojo drag and drop is terrible).
  • Usability enhancement: When applying a constraint that yields an empty search result, you currently have to remove the node, click the filter again to show the facets, select the same facet again and then the new restriction value. This seems like an awful lot of steps just to apply some minor change. We should think about a usage paradigm that allows editing currently active constraints (could be tough with the tree-based approach we use right now).
Last modified on 02/18/08 08:53:51