In this chapter ...
Subject gateways typically give access to Internet resources by providing both searching and browsing facilities. The browsing functions of gateways, in particular, are usually dependent upon the adoption and use of some directory-like structure - often based on subject classification schemes or thesauri.
Browsing a directory-type structure is user friendly. The structure makes it relatively easy to navigate and is an important help if users are not looking for a specific item. Users typically are able to choose categories from a subject hierarchy and to use these to make their way through the service, moving down the individual branches of a subject tree. On the other hand, if users want to make index-type searches, they must invent or look up suitable search terms. A browsing structure also gives a helpful overview of the scope of a service and how a particular service is organised.
Subject classification is a method of describing resources by their subject and a means of organising knowledge in libraries and other information environments. Universal classification schemes designed for use by libraries began to be developed in North America during the nineteenth century. For example, the most famous (and most widely used) scheme is the Dewey Decimal Classification (DDC) system that was first produced for a small college library in 1876.
Classification schemes differ from other subject indexing systems (subject headings, thesauri, etc.) by trying to create collections of related resources in a hierarchical structure. The use of notations/codes facilitates the creation of hierarchical subject trees. For example, using UDC we can create the following hierarchy (adapted from McIlwaine, 1995, p. 17):
5 Natural sciences
  504 Environmental science
    504.05 Adverse effects of human activity on the environment
      504.054 Effect of harmful materials. Pollution
        504.054(44) The effect of pollution on the environment in France
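The way the notation carries the hierarchy can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that a broader class is always a leading substring of a narrower one, which holds for the example above but not for every UDC construction, and the function names `broader_classes` and `breadcrumb` are hypothetical.

```python
# Captions taken from the UDC example above.
CAPTIONS = {
    "5": "Natural sciences",
    "504": "Environmental science",
    "504.05": "Adverse effects of human activity on the environment",
    "504.054": "Effect of harmful materials. Pollution",
    "504.054(44)": "The effect of pollution on the environment in France",
}


def broader_classes(notation, known=CAPTIONS):
    """Return the known classes whose notation is a proper prefix of
    `notation`, ordered from broadest to narrowest.
    Assumption: a prefix relation between notations implies 'broader than'."""
    ancestors = [n for n in known if n != notation and notation.startswith(n)]
    return sorted(ancestors, key=len)


def breadcrumb(notation, known=CAPTIONS):
    """Build a simple path of captions leading down to the given class."""
    path = broader_classes(notation, known) + [notation]
    return " > ".join(known[n] for n in path)
```

Calling `breadcrumb("504.054")` walks from "Natural sciences" down to the pollution class, which is the kind of path a browsing page could show above its list of resources.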
Libraries have long experience of applying classification schemes to resources - chiefly books. The idea of classification is to make it easier for users to find and retrieve resources. By building a hierarchical structure, a classification scheme enables users to look for related items that have not previously been identified as relevant. This facilitates browsing, whether within a physical library or online.
The use of classification schemes offers one solution to providing improved access to Web resources. The Web is full of Web sites that have been created to act as guides to other Web sites selected according to some pre-specified criteria, e.g. they are judged to be good quality resources or relevant to a particular subject-area. Some of these sites typically consist of an alphabetical list of subjects, with selected Web resources listed below each one. The lists can be very long and difficult to survey.
In this context, it can be understood why classification schemes have begun to be used to give added-value subject access to Web sites. A site that organises knowledge with a classification scheme demonstrates several distinct advantages over sites that do not [Koch and Day, 1997]:
Classification schemes, however, have some disadvantages.
One advantage of classifying Internet resources is that you can assign more than one classification number to a resource, since they do not need to be put in numerical order on a shelf - they can be (virtually) kept in two places at once. An Internet service can easily offer several different (classification) "views" of the same resources.
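A minimal sketch of this "two places at once" idea follows. The record layout, the titles and the codes are invented for illustration and are not taken from any real gateway; the point is only that a resource with several codes is filed under each of them.

```python
from collections import defaultdict

# Hypothetical records: each resource may carry several classification codes.
RESOURCES = [
    {"title": "Pollution in French rivers", "codes": ["504.054", "556.5"]},
    {"title": "Introduction to hydrology", "codes": ["556"]},
]


def group_by_code(resources):
    """Map each classification code to the titles filed under it.
    A resource with two codes appears under both of them."""
    view = defaultdict(list)
    for resource in resources:
        for code in resource["codes"]:
            view[code].append(resource["title"])
    return dict(view)
```

A second "view" of the same resources could be produced in the same way from a second field holding codes from another scheme.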
Classification schemes can be broadly divided into:
All of these classification types are used to some extent on the Internet [Koch and Day, 1997]. Universal schemes like DDC and UDC are used by many Internet services and are readily available in machine-readable form. Subject services, however, appear more likely to use a subject-specific scheme.
When beginning the process of developing a new gateway, it may be felt advantageous to invent a new classification scheme for it. Inventing a new classification scheme has some advantages but may also create new problems that a gateway developer might not be aware of from the start.
The main advantage of creating a completely new classification scheme is that a gateway is able to create a customised scheme, adapted to its specific content and user groups, that should be able to meet all of its specific requirements. This should allow for easier and more consistent browsing of a gateway. For example, there should be no unnecessary parts of the structure that would end up being unused.
Another advantage is that home-made schemes should remain flexible and easy to change and therefore should be able to absorb new areas of interest relatively easily. On the other hand, this can itself become a problem if the initial scheme design is flawed. If a gateway wants to fit a new term or hierarchy into its own scheme that wasn't considered when it was created, it may be difficult to fit it into the scheme.
The process of creating a new classification scheme also has its disadvantages. It is time consuming - and therefore expensive - and requires extensive specialist subject knowledge. Even when the time and specialist knowledge are available, it is relatively easy to overlook something in a home-made scheme. Subject classification is a very subjective activity and this can lead directly to a lack of consistency. Custom-made schemes are also less familiar to users than existing universal or subject-specific classification schemes are likely to be.
Using an existing classification scheme is a way of not having to deal with some of the above issues. The scheme has already been made and it doesn't cost any additional time or money to develop it.
Probably the most important disadvantage is the more or less complete lack of interoperability with other services and databases when it comes to subject description for browsing and searching (cf. 16).
The best reason for inventing a classification system for a new service is that no suitable or adaptable system is available at all, or that only many different small ones exist, none of which provides the necessary coverage (cf. 12).
The established library classification schemes have developed over a long period of time, sometimes as long as 100 years. This means that their conception of the world can be outdated and this might be reflected in the structure. For example, all universal schemes have had to take account of the rapid growth in electronics and computing in the second half of the twentieth century. Updating classification schemes takes a long time, and new concepts are sometimes placed under illogical headings. Because of their size, the classification schemes are not updated very often and, when they are, they tend to be updated one subject at a time. Because of all this, traditional schemes can be rather complex to use.
There is, however, no requirement for subject services to use all layers of the classification hierarchy. Some current schemes organise material based on the first three levels only of a decimal scheme like DDC. The good thing, however, about the large established library classification schemes is that they are universal schemes. They are built to classify an entire world with all its content, and a user can, therefore, find most things using them.
The schemes developed for Internet usage are of course relatively young, often developed over the last few years. That means they are still updated frequently and are still trying to cover new things that are relevant. These schemes mirror the modern and changeable world. Sometimes they concentrate on a few areas of interest, ignoring the rest; sometimes they try to cover the whole world just like the universal library classification schemes. Many home-grown schemes, however, display severe weaknesses which hamper correct and efficient usage, for example: failures in logic and hierarchy; incorrect exhaustion of the classes and the application of multiple hierarchies; errors in terminology and in internal links and relationships between classes etc.
* EXAMPLE *
BUBL LINK is a comprehensive service covering academic resources in all subject areas. It uses the Dewey Decimal Classification (DDC) to classify individual documents.
Yahoo! is a commercial search service covering most popular subjects. It uses its own universal classification scheme with 14 main categories.
Universal classification schemes and the subject-specific ones are aimed at different services. Therefore a new gateway would need to choose a scheme relevant for the target group for the service being created. That means that if the service includes all subjects and is aimed at a wide audience of Internet users, a universal classification scheme would be a good choice. If the service is a subject-specific one aimed at researchers within, say, the engineering area, it would be better to use a subject-specific classification scheme, if one exists and it is suitable. An alternative might be to use a suitable part of a universal scheme (cf. 12).
SOSIG (The Social Science Information Gateway) uses parts of UDC to generate browsing categories (which are then displayed in alphabetical order rather than in order of class number).
EELS is structured according to the subject classification scheme produced by Engineering Information Inc.
This issue also depends on your subject and target group as well as on the purpose of the service being classified. Who are the intended users? Is this likely to change in the future? If a gateway only aims at a single user group within a country or specific language area and does not see any other potential users for the service, it could probably successfully use a national or language-based classification scheme. You would also possibly gain the recognition of a nationally-based scheme, if you use one that is common in libraries. If, on the other hand, a gateway aims at a user group that is international (or is intended to become international in the future), it would be better to use an international multilingual scheme, if available. If a gateway is thinking of cross-browsing or cross-searching with other gateways, it needs to consider the possibility of mapping to other schemes at this stage (cf. 16).
Link larder, a Swedish catalogue of quality assessed Internet resources (especially aimed at children) within all fields, uses the Swedish classification scheme SAB. The scheme is widely used in public and school libraries.
GERHARD, the German academic Web index, classifies all documents using the UDC classification scheme from ETH Zuerich in three languages.
Clustering is a method with a similar goal to classification: to provide a group of closely related documents. Clustering, though, is an automatic process which groups similar documents according to specific criteria expressed in an algorithm. The groups are normally not (hierarchically) related to each other and are of very different size. The subject covered by a cluster is very hard to describe. Every time that new documents are added to the collection the clusters have to be calculated again and the outcome can be different. Documents can frequently move to other clusters. Clustering methods (derived, "a posteriori" classification) should rather be compared with methods of automatic classification that use established ("a priori") classification systems to assign classifications to documents. These characteristics show that clustering, a method frequently used in information retrieval systems, is more suited to post-processing search results. It is not suitable for presenting a stable structure for browsing large gateways in which documents need to be grouped into clearly defined and related subject sections.
Information gateways will not depend only upon the use of classification schemes to record subject information. For example, they may also use either uncontrolled keywords or terms taken from thesauri, subject headings, authority files and other vocabularies. These subject terms complement the assigned classification codes by being used to improve the precision of search results from gateways and to guide users into using appropriate terminologies to aid resource discovery. Classification schemes are used to help group related documents within a well defined subject area; keywords can be used to give a detailed description of the concepts covered by an individual document. A combination of keywords should be used so that a document can be described as specifically as possible and to aid the retrieval of relevant documents. However, it is impossible to record relationships between uncontrolled keywords and therefore they are useless for structured browsing. Thesaurus terms can sometimes come with explicit and complete (hierarchic) structures that may be suitable to replace a classification system, and any conclusions about classification in this chapter could also be applied to this type of thesaurus. Note that information gateways should include both subject keywords and subject classification in their records so that the gateway can explore different but complementary subject indexing approaches and can support both browsing and searching.
The scope of the service: its subject, language and geographic coverage and its target user population are normally the main criteria for the choice of classification scheme.
In some situations the solution is quite obvious.
a) Where a gateway gives access to resources from all areas of knowledge, published throughout the world and in many languages and intended to be offered to an international multi-disciplinary community of users, an existing universal scheme should be selected, at least as a basic solution. DDC and UDC have a good multilingual capability due to the fact that the codes they produce are entirely numerical and their schedules have been widely translated (into up to 30 different languages).
b) If the collection, however, focuses on a rather limited subject area or discipline and there is a suitable international subject-specific scheme available, this should be used.
c) If your service is a national service targeted primarily to one country, you may want to consider a national general scheme.
Problems will occur for services covering subjects where several different schemes exist (e.g. the earth sciences) or services that cover more than one single subject area (e.g. the social sciences). In these cases mapping and linking between schemes, the use of concordances for conversion or extensions of a scheme may help (cf. 14-16). There will also be problems when there is no comprehensive scheme available for a service covering a particular geographic area or subject scope. Then, a classification structure has to be created for the specific service or, preferably, as a co-operative effort with recorded relations and mappings to existing schemes (cf. 6).
A list of web accessible classification systems and thesauri is maintained at [Koch, ongoing].
Your decision might be influenced by how familiar the staff are with the schemes under consideration and by what kind of maintenance the owner of the classification system provides. This could affect how fast the gateway can grow in the beginning.
a) How do the considered systems compare in quality and controlled revision? (cf. 3, 6, 7, 8, 9, 12)
b) Is the scheme you want to use available in machine-readable form?
c) Is the scheme you want to use freely available for use on the Internet or do you need to acquire a license?
Are there any mappings available between the candidate schemes and other established subject-specific or universal schemes which can secure interoperability to other services, now or in the future? (cf. 16) ** CROSS REFERENCE to the Interoperability chapter **
How do the different alternative schemes and methods compare when it comes to total costs? The costs are for information specialists and technicians as well as for servers and programs being used. The initialization of a service will cost a lot since all issues within this handbook need to be investigated. When the service is up and running the costs will be lower.
Summary of questions you may need to ask
It has already been noted that adapting existing classification schemes is an important part of providing a user-friendly browsing structure for a gateway. Classification provides an excellent means of placing information objects within a detailed conceptual framework and, in some cases, a means of organising physical resources like books on shelves. For classification schemes to be effective as browsing aids in subject gateways they need to be reduced in complexity and sometimes reordered. When adapting, a detailed table of the changes should be kept so that the same local adaptations can easily be repeated whenever the main system changes. When the hierarchy is rearranged, a mapping to the equivalent placings in the original system should be kept.
A very unequal distribution of resources throughout a classification system can be quite disturbing for the browsing process. Omitting empty classes might be necessary in order to create a user-friendly browsing structure. If there are only a few empty classes or branches, the best advice is to mark the classes as empty in your browsing structure and navigation area (as done in EELS). The system will still appear as a coherent and logical whole. If there are many empty classes, the display might hide these. Our advice is, however, to classify the individual resources in as detailed a way as is possible in the chosen system, but to display them for the time being in a broader category. This allows for a fully expanded display as soon as there are enough resources for a meaningful finer substructure, without requiring any reclassification effort. In any case, all resources should be displayed in order to keep consistency between browsing and searching the service.
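The advice to classify in detail but display in a broader category can be sketched as follows. This is only one possible reading, written under stated assumptions: the notation is purely decimal (as in DDC), the broader class is found by dropping the last digit, and the threshold `min_size` as well as the function names are hypothetical.

```python
from collections import Counter


def parent(code):
    """Broader class of a decimal-style code, found by dropping the last
    character (assumption: purely decimal notation; a trailing '.' is removed)."""
    shorter = code[:-1].rstrip(".")
    return shorter or None


def display_classes(assigned_codes, min_size=3):
    """Decide under which class resources are displayed.

    The detailed codes stay in the records. For display only, a class holding
    fewer than `min_size` resources is merged into its broader class,
    working from the most specific codes upwards.
    """
    counts = Counter(assigned_codes)
    max_len = max((len(c) for c in counts), default=0)
    for length in range(max_len, 1, -1):
        for code in [c for c in counts if len(c) == length]:
            broader = parent(code)
            if counts[code] < min_size and broader is not None:
                moved = counts.pop(code)
                counts[broader] += moved
    return dict(counts)
```

Because only the display is affected, lowering `min_size` or adding resources later expands the finer structure again without any reclassification.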
It may be necessary to rearrange the hierarchy to make the browsing structure easier to use. Sometimes the hierarchies need a more logical arrangement so that users can find their way through them. Sometimes an important 'branch' deep down in the tree structure needs to be lifted closer to the top of the hierarchy so that it can be found more easily. In the end, if there is a potential conflict between the purpose of the gateway and the purpose of the classification scheme, it is the classification scheme that needs to be rearranged.
Renaming captions is another way of adapting a classification scheme. For example, a classification scheme may use complicated technical terms that would be difficult for the target audience to understand, e.g. in a gateway designed for school children. In these cases renaming adds value and user-friendliness to the service (cf. DDC for children and DDC for end-users). The renaming should be done in a similar way throughout the service in order to keep the service consistent and the language level the same.
There are times when an existing classification scheme is not detailed enough in particular areas or omits subject categories closely related to the gateway's coverage. If these are important areas for the gateway then the classification scheme needs to be extended.
There are several alternative approaches for the extension:
a) Add a topical substructure to certain classes, without changing the existing classes. Besides the gateway's own creations, 'bits-and-pieces' from established subject specific systems could be used.
b) Add facets to the classification which allow subdivision of classes, e.g. a geographical or historical facet, a facet for document types or languages etc. The facets should preferably be taken from established systems.
c) "Glue" (parts of) an established system as a new branch onto your scheme to extend its topical coverage.
The possibility to automatically convert from existing classifications of documents (e.g. OPAC records, database records, documents in Internet services, etc.) into another scheme used in a subject gateway could become a potentially valuable support for the classification task. This method is occasionally used in co-operative cataloguing projects and union catalogues, sometimes even in individual OPACs as soon as cataloguing records using a different classification scheme are imported or exchanged.
If there are no "official" conversion tables available, the classification task could still be improved by extracting frequent and statistically significant linkages between different classification schemes, or between indexing terms and classification for the same document, from existing databases and using them as a conversion algorithm.
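The idea of deriving a conversion table from existing records can be sketched in Python as below. This is a rough sketch, not a tested method: simple frequency and share thresholds stand in for a proper test of statistical significance, and the function name, parameters and threshold values are all hypothetical.

```python
from collections import Counter, defaultdict


def derive_conversion_table(records, min_count=2, min_share=0.6):
    """Derive a source-to-target code table from doubly classified records.

    `records` is an iterable of (source_code, target_code) pairs, one per
    document. A link is kept only if it is frequent (`min_count`) and
    dominant (`min_share` of all documents carrying that source code).
    Both thresholds are arbitrary illustrative values.
    """
    pair_counts = defaultdict(Counter)
    for source, target in records:
        pair_counts[source][target] += 1

    table = {}
    for source, targets in pair_counts.items():
        target, count = targets.most_common(1)[0]
        total = sum(targets.values())
        if count >= min_count and count / total >= min_share:
            table[source] = target
    return table
```

Source codes that fail the thresholds are simply left out of the table, so documents carrying them would still need to be classified by hand.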
Mapping between different classification systems will become an increasingly important activity for subject services. Data exchange, cross-browsing between distributed services, multilinguality and improvements of classification structures are dependent on it. Producing such a mapping is often difficult and time consuming because of theoretical, conceptual, cultural and practical differences between the systems. Mappings have to apply many different types of equivalences; one-to-one relationships are certainly not sufficient. Nevertheless, mapping operations will be needed for the following efforts, amongst others:
a) Conversion between different systems to include records into a local structure or exchange of data (cf. 15). An example is the mapping between DDC and UDC within the subject domains of economics and business for the SOSIG and Biz/ed projects [Hiom, 1998].
b) Support of the translation of categories and terms into other languages. Mapping is needed to help to represent the different coverage of terms in different languages and to make up for the occasional lack of equivalent terms. A combination of translation and mapping might be the best way to accomplish multilingual vocabulary access and support. The EU Language Engineering projects Aquarelle and Term-IT have been working in this area.
c) Extension of the classification structure by "gluing" different systems into each other (cf. 14). This will be tested by the DESIRE II project together with OCLC. A couple of years ago a study was published exploring a mapping between DDC and the Mathematical Subject Classification MSC.
d) Provision of cross-browsing between different services (which keep their classification systems unchanged) (cf. 17).
e) Securing of a wide and future proof interoperability with different and maybe as yet unknown services.
The mapping can be carried out between pairs of two or between several systems or as a mapping to a universal system like DDC as a "switching system" or "interlingua". The latter alternative is needed when trying to secure wide interoperability or when there is a rather small overlap between the used classifications.
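Mapping through a switching system amounts to composing two mapping tables. The sketch below assumes each table maps a code to a set of codes, since mappings are rarely one-to-one; the function name and all codes in the example tables are invented placeholders, not real class numbers.

```python
def map_via_switching(code, to_switch, from_switch):
    """Map a code from scheme A to scheme B through a switching scheme.

    `to_switch` maps A codes to sets of switching-scheme codes;
    `from_switch` maps switching-scheme codes to sets of B codes.
    Returns the set of all B codes reachable from `code`.
    """
    targets = set()
    for switch_code in to_switch.get(code, set()):
        targets |= from_switch.get(switch_code, set())
    return targets


# Invented example tables (placeholders only).
A_TO_SWITCH = {"A12": {"S1", "S2"}}
SWITCH_TO_B = {"S1": {"B7"}, "S2": {"B7", "B9"}}
```

With these tables, `map_via_switching("A12", A_TO_SWITCH, SWITCH_TO_B)` returns both `B7` and `B9`. With n schemes, this approach needs n mappings to the switching system rather than a separate mapping for every pair.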
Mapping between classification schemes is a field where neither theory nor practice is yet mature. We recommend that before any large scale implementation is carried out, both advice and assistance should be sought from experts in this area.
Some subject areas are currently covered by more than one gateway. For example, engineering is covered by EELS, EEVL and AVEL. This can be confusing to the user who may not know which one to use. An answer to this potential confusion is not easy to find. It is possible that one gateway may be more suitable for one subtype of resources than another but this can be difficult for users to know without extensive knowledge of each and every relevant gateway. The same problems arise for people interested in inter-disciplinary resource discovery.
One possible solution is to enable the cross-searching and/or cross-browsing of gateways. Cross-searching is relatively easy to provide in a networked environment, especially where the same search and retrieve protocols are in use. The description approach, i.e. the attributes used for bibliographic metadata, has to be similar, though, and fielded search additionally requires semantic equivalence between the content of the fields in all services. Cross-searching has been tested by the ROADS project and can already be implemented in gateways based on the ROADS software [Kirriemuir et al, 1998].
Cross-browsing two or more gateways would be a useful way of combining logically separate or distributed services. In practice, this can be quite difficult to achieve. If the gateways use identical schemes, the classification codes should be the same, so a combined service could be generated which means that a user would be able to browse everything within the same virtual space. If identical schemes are not used, this would become extremely difficult if not impossible. Therefore, subject gateways that want to facilitate cross-browsing with each other should - where possible - use the same classification scheme. Even so, there are likely to be problems. Classification is often a subjective activity and this would affect how combined subject gateways could be sensibly browsed.
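For the simple case in which the gateways use an identical scheme, a combined browsing space can be sketched as a merge on the classification code. The data layout and the function name are hypothetical, and the sketch ignores the harder questions of differing editions and inconsistent classification practice.

```python
from collections import defaultdict


def combined_browse(gateways):
    """Merge the records of several gateways that use the same scheme.

    `gateways` maps a gateway name to a list of (code, title) pairs.
    Returns {code: [(gateway_name, title), ...]} so that every entry
    still shows which service it came from.
    """
    merged = defaultdict(list)
    for name in sorted(gateways):
        for code, title in gateways[name]:
            merged[code].append((name, title))
    return dict(merged)
```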
There are some other issues concerning cross-browsing and searching:
These things need to be solved before a gateway is able to offer cross-browsing with other services. Solutions, however, will all be based upon close co-operation between the different subject gateways. ** CROSS REFERENCE: Co-operation ** Cross-browsing through visible links between the browse sections of two or more services back and forth, without hiding the independence of the services, can be accomplished by mapping methods as described in 16. DESIRE II is currently testing different methods.
There are several ways in which to encode the classification. If the resources are being put into a database, a specific database field can be created for this purpose. This would need to give both the classification code assigned and the scheme (and edition) from which it is derived. For example, ROADS templates give the following elements:
Subject-Descriptor-v1: 551.46
Subject-Descriptor-Scheme-v1: DDC21
Other metadata schemes permit this in other ways. The Dublin Core SUBJECT element allows the use of either controlled terms (from thesauri, classification schemes or other controlled vocabularies) or keywords. However, classifications assigned to the SUBJECT must be specified with a SCHEME qualifier so that an application encountering a code like "341.63" would be able to parse it. This is an example of HTML 4 encoding inside the head of the document:
<META NAME="DC.Subject" SCHEME="udc" CONTENT="341.63">
If the SCHEME qualifier tells the application that this is UDC notation it may be possible for it to be used in different ways, e.g. to check correctness, to support searching and improve result display.
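How an application might read such elements can be sketched with Python's standard `html.parser` module. The class and function names are hypothetical, and the sketch only collects the values; what to do with a recognised scheme (validation, searching, display) is left open.

```python
from html.parser import HTMLParser


class SubjectMetaParser(HTMLParser):
    """Collect (scheme, content) pairs from DC.Subject META elements."""

    def __init__(self):
        super().__init__()
        self.subjects = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser delivers tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "dc.subject":
            # A missing SCHEME means an unqualified term or keyword.
            scheme = (attributes.get("scheme") or "").lower() or None
            self.subjects.append((scheme, attributes.get("content")))


def extract_subjects(html_text):
    """Return all DC.Subject values found in an HTML document."""
    parser = SubjectMetaParser()
    parser.feed(html_text)
    return parser.subjects
```

Applied to the META element shown above, `extract_subjects` yields the pair `("udc", "341.63")`, so the application knows to treat the value as UDC notation rather than as a free keyword.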
If a gateway creates metadata for all resources selected and classifies them using a classification scheme it should be able to generate a browsing structure from this information. How to do this technically depends upon the software used for managing the subject gateway and displaying its resources. In some services (e.g. EELS) Perl scripts look up the classification fields of all records and create stable HTML pages displayed and browsable according to the structure of the chosen system. The update frequency can be adapted to the frequency of changes. In other cases (e.g. EEVL) Java programmes create browsing pages on the fly by querying the same classification field in a database of all resources, kept at the subject service.
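The batch approach, in which pages are generated from the classification field of every record, can be sketched in Python as follows. This is not the code used by any of the services mentioned; the record keys `title`, `url` and `class` and the function name are hypothetical.

```python
from collections import defaultdict
from html import escape


def build_browse_pages(records, captions):
    """Return {class_code: html_fragment} built from classified records.

    `records` are dicts with the hypothetical keys 'title', 'url' and
    'class'; `captions` maps class codes to display captions.
    """
    by_class = defaultdict(list)
    for record in records:
        by_class[record["class"]].append(record)

    pages = {}
    for code in sorted(by_class):
        items = "\n".join(
            '<li><a href="{}">{}</a></li>'.format(
                escape(r["url"], quote=True), escape(r["title"])
            )
            for r in sorted(by_class[code], key=lambda r: r["title"].lower())
        )
        heading = escape("{} {}".format(code, captions.get(code, "")).strip())
        pages[code] = "<h1>{}</h1>\n<ul>\n{}\n</ul>".format(heading, items)
    return pages
```

Writing each fragment to a file at a chosen interval gives the stable-page model; calling the same function per request against a database query would correspond to the on-the-fly model.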
You can use classification as a way of making searches more powerful and to limit the number of irrelevant hits for the user.
Searching using classification data can be offered in different ways in the user interface of your service. Sections of the classification scheme can be offered as a filter (or option) in the search, limiting the results of the query to a certain topical part of the database. The best solution for that is probably to offer a list of all alternative sections/classifications for selection, allowing the user to choose either one or several sections. An expert alternative would be to offer the classification field for direct searching with a truncation option, if the notations are made available. On the browsing pages a search option could be offered limiting the search to the currently viewed class and all subclasses below (as in EELS and Yahoo!).
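Limiting a search to a class and everything below it can be sketched as a right-truncated match on the classification field. The sketch assumes a notation in which subclasses extend the code of their parent class, searches titles only, and uses a hypothetical record layout and function name.

```python
def search(records, query, class_prefix=None):
    """Return records whose title contains `query`.

    If `class_prefix` is given, only records with at least one
    classification code starting with that prefix are considered,
    i.e. the class itself and all classes below it.
    """
    query = query.lower()
    hits = []
    for record in records:
        if class_prefix and not any(
            code.startswith(class_prefix) for code in record["codes"]
        ):
            continue
        if query in record["title"].lower():
            hits.append(record)
    return hits
```

A search box on a browsing page would pass the code of the currently viewed class as `class_prefix`; the expert option corresponds to letting the user type the prefix directly.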
Subject classification is an activity usually carried out by librarians and other information specialists, such as the producers of bibliographic databases. Trained professionals tend to be used since detailed classification requires specialist subject knowledge. It may also be time consuming. The usefulness of any browsing structure depends on the accuracy of the classification. Because of this it is important to put a lot of effort into this task in order to get a browsing structure of high quality.
Uncontrolled keywords might be taken from the authors of the documents included in the service. Another option is automatic classification (cf. 22).
As traditional classification is a time-consuming and expensive process it is obvious that investigations into the use of automated solutions are worthwhile. At the same time, classification is an activity where a significant level of human expertise, abstract thinking and understanding is needed and this is not easy to substitute with artificial intelligence or expert systems. There are no known examples of traditional library classification being undertaken completely by computer software. However, knowledge structuring in the Internet has to cope with far larger numbers of resources, exponential growth rates and a high risk of changes occurring in documents that already exist.
This is the background for the development of a growing number of research projects and experimental systems that are trying to support knowledge structuring activities on the Internet with automatic methods. Most of these projects use methods of derived indexing, i.e. they extract information from the documents and then use it for the structuring tasks (cf. 10).
Few research projects appear to make use of traditional library classification, that is, universal or subject-specific schemes constructed a priori over many years by co-operative organisations, independently of the contents of the documents which actually exist in particular collections. This method is called assigned indexing: an indexing language is devised and the appropriate concepts or notations are assigned to each document.
Automatic classification will probably not replace intellectual classification as far as quality subject services are concerned. Automated methods will instead support and complement the manual selection and subject indexing efforts. Intellectual classification will always be needed to validate and improve automatic methods. However, automated classification will have a definite role for robot-generated indexes as an important additional feature for gateways. This is being investigated further in the DESIRE II project [Koch and Vizine-Goetz, 1999].
One practical goal in DESIRE II is to explore simple applications of automatic classification methods on a robot-generated subject index in the Web. A lot of different tests will be carried out on the "All" Engineering (AE) robot-generated database of engineering documents from the Internet. The efforts required will be studied and the resulting outcomes evaluated. A pilot service of the "All" Engineering Web index will offer a full classification and browsing structure with the most suitable solution found during the project. In addition, a comprehensive state-of-the-art report on projects, methods, alternatives and problems with automatic classification will also be presented. Some of the results from DESIRE II work in this area [Koch, Ardö and Noodén, 1999] will be included in the next edition of this handbook.
A more detailed analysis of the use of classification schemes in Internet resource description and discovery and a list of services using them can be found in the DESIRE I report produced by Koch and Day [Koch and Day, 1997]. This report describes the use of several classification schemes on the Internet in some detail and provides an introduction to the use of automated classification techniques on the Internet.
Another useful Web page that lists some Internet-based services that use classification schemes for organising resource discovery services is Gerry McKiernan's Beyond Bookmarks page [McKiernan, 1996 and ongoing].
Aquarelle. <URL:http://aqua.inria.fr>
Hiom, D., 1998, Mapping classification schemes. Bristol: SOSIG. <URL:http://www.sosig.ac.uk/desire/class/mapping.html>
Kirriemuir, J., Brickley, D., Welsh, S., Knight, J. and Hamilton, M., 1998, Cross-Searching Subject Gateways - The Query Routing and Forward Knowledge Approach. D-Lib Magazine, January. <URL:http://www.dlib.org/dlib/january98/01kirriemuir.html>
McKiernan, G., 1996 and ongoing, Beyond bookmarks: schemes for organising the Web. Iowa State University. <URL:http://www.iastate.edu/~CYBERSTACKS/CTW.htm>
McIlwaine, I.C., 1995, Guide to the use of UDC: an introductory guide to the use and application of the Universal Decimal Classification, rev. ed. The Hague: International Federation for Information and Documentation (FID).
Koch, T. and Day, M., 1997, The role of classification schemes in Internet resource description and discovery. Deliverable 3.2 for the DESIRE project. <URL:http://www.ub2.lu.se/desire/radar/reports/D3.2.3/class_v10.html>
Koch, T., 1998, Nutzung von Klassifikationssystemen zur verbesserten Beschreibung, Organisation und Suche von Internet Ressourcen. In: Buch und Bibliothek 50:5, pp.326-335. Manuscript: <URL:http://www.ub2.lu.se/tk/publ/bubmanus.html>
Koch, Traugott, Ardö, Anders and Noodén, Lars, 1999, The construction of a robot-generated subject index. (EU Project DESIRE II D3.6a, Working Paper 1) <URL:http://www.lub.lu.se/desire/DESIRE36a-WP1.html>
Koch, Traugott and Vizine-Goetz, Diane, 1999, Automatic Classification and Content Navigation Support for Web Services. DESIRE II co-operates with OCLC. In: Annual Review of OCLC Research 1998 <URL:http://www.oclc.org/oclc/research/publications/review98/koch_vizine-goetz/automatic.htm>
Koch, T., ongoing, Controlled vocabularies, thesauri and classification systems available in the WWW <URL:http://www.ub2.lu.se/metadata/subject-help.html>
Term-IT. <URL:http://www.mda.org.uk/term-it/>
BC: Nederlandse Basisclassificatie
DDC: Dewey Decimal Classification
Ei: Engineering Information
LCC: Library of Congress Classification
LCSH: Library of Congress Subject Headings
NLM: National Library of Medicine
SAB: Sveriges Allmänna Biblioteksförening
UDC: Universal Decimal Classification