Thesaurus: what is it? A dictionary that is more than a dictionary

The term “thesaurus” appears more and more often in projects, books, brochures, and Internet resources. Because it is unfamiliar, it can seem intimidating: it is much easier to say “dictionary” than to use a strange word.

Thesaurus: what is it? How is it different from a regular dictionary? Let's try to study these issues in more detail and in an accessible way.

Interpretation of the term

Initially, the concept of a thesaurus was considered from the point of view of a dictionary, which represented the vocabulary of a language with examples of use in the text.

Ozhegov interprets a thesaurus as a dictionary of a specific language, reflecting vocabulary in full, while Efremova considers this phenomenon from the point of view of a systematized set of data in a certain field of knowledge.

The most specific definition is used in philology, where a thesaurus is understood as a dictionary-type resource in which all word meanings are connected by semantic relationships that reflect the key relationships between concepts in a certain subject area.

As we can see, it is quite difficult to give a definitive answer to the question “Thesaurus: what is it?” To study the term more closely, we will consider its history, its types, and the relationships between lexical units in a dictionary of this type.

History of origin

The English physician Peter Mark Roget is considered the founding father of thesauri: in the thesaurus he published in 1852, he systematized the vocabulary of English by dividing it into groups. Each group was headed by the name of a concept, followed by its synonyms arranged by part of speech, lists of related words, and references to the names of other categories. This classification was very valuable, because the dictionary was considered natural in its organization and described the vocabulary of the language as fully as possible. At the same time, it could be used to find important concepts quickly.

Since the first thesaurus appeared, this type of dictionary has been continually revised; it is used in many fields of knowledge and is popular throughout the world. The question “Thesaurus: what is it?” also remains relevant in many educational institutions.

To this day, thesauri remain the most popular way of describing knowledge in a given field in a form suited to human perception.

Word relationships in a thesaurus

The most common relations in the classical thesaurus are:

Synonymy is a relationship between words of the same part of speech that are similar in lexical meaning. For example: power - fatherland, brigade - detachment, scarlet - red, etc.
Antonymy is a relationship between words of the same part of speech that have opposite lexical meanings. For example: silence - roar, gentle - rude.
Hyperonymy (hyponymy) is the key relationship for describing nouns. A hypernym has a broad lexical meaning and expresses the generic, general name of a class (set) of objects, properties, or characteristics. A hyponym has a narrower meaning; it names an object (attribute, property) as an element of a specific set or class. A simple example makes the relationship clear: the words beast and tiger are connected, with the general name beast being a hypernym of the hyponym tiger.
Meronymy (partonymy) is a relationship between nouns formed according to the “part - whole” principle. As an example, consider the words airplane, landing gear, porthole. Here the general name of the vehicle, airplane, is the holonym (the name of the whole), and its components are meronyms.
Consequence is a relationship between verbs. For example, the words go and come are connected as a process and its consequence (result).
Reason is also defined only for verbs. For example: to be sick - to miss. Here the reason can be traced: something was missed because of health problems.

We will see what a thesaurus is from the following example.

A bed is a device for sleeping.

[hyperonym]: furniture
[meronym]: house
[synonym]: bed, bed.

This is just one classic example of an entry from a Russian-language thesaurus, but all dictionaries of this type are built on the same principle.
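An entry of this shape can be represented as a small data structure. The following is a minimal sketch and not the format of any particular dictionary: the class name `ThesaurusEntry`, its fields and its methods are assumptions made for illustration, and only the hypernym and meronym lines of the entry above are filled in.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One headword with its gloss and labelled links to other words (hypothetical layout)."""
    headword: str
    gloss: str
    # Relation label -> related words, in the order they should be printed.
    relations: dict[str, list[str]] = field(default_factory=dict)

    def related(self, relation: str) -> list[str]:
        """Return the words linked by the given relation, or an empty list."""
        return self.relations.get(relation, [])

    def render(self) -> str:
        """Format the entry in the bracketed style used in the example above."""
        lines = [f"{self.headword}: {self.gloss}"]
        for relation, words in self.relations.items():
            lines.append(f"[{relation}]: {', '.join(words)}")
        return "\n".join(lines)


# The "bed" entry from the text; the labels follow the article's own convention.
bed = ThesaurusEntry(
    headword="bed",
    gloss="a device for sleeping",
    relations={"hyperonym": ["furniture"], "meronym": ["house"]},
)
print(bed.render())
```

Storing the relations under their labels keeps the entry open-ended: a new relation type needs a new key, not a new field.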

Thesaurus functions

A thesaurus dictionary has important social, communication, scientific and other functions.

It serves as:

a source of specialized knowledge in a broad or narrow subject area, and a way of ordering and describing terms;
a search tool in the information flow;
a tool for manual indexing of documents in information retrieval systems;
a tool for automatic indexing of complex texts.

Types of thesauri

The variety of dictionaries requires considering not only the question: “Thesaurus: what is it?”, but also paying attention to the types. This will help us better understand the features of this type of dictionary.

Conclusion

We hope that we have explained in accessible language what a thesaurus is. The examples should make it easy to see how it differs from other dictionaries. We also touched on information retrieval thesauri, which are widely used by information systems to quickly search and systematize millions of items.

, antonyms, paronyms, hyponyms, hypernyms, etc.) between lexical units. Thesauruses are one of the most effective tools for describing individual subject areas.

In the past, the term thesaurus primarily designated dictionaries that represented the vocabulary of a language as completely as possible, with examples of its use in texts.

The term thesaurus is also used in information theory to denote the totality of all information possessed by a subject.

In psychology, an individual's thesaurus is characterized by the perception and understanding of information. Communication theory also considers the general thesaurus of a complex system through which its elements interact.

History

One of the first thesauri is considered to be the “Dictionary of Synonyms” by Philo of Byblos. A closer match to the term is the Amara-kosha, written in Sanskrit in verse form in the 6th century. The first modern English thesaurus was compiled by Peter Mark Roget in 1805. It was published in 1852 and has been reprinted continuously since then.

In the 1970s, thesauri began to be actively used for information retrieval tasks. In such thesauri, words are mapped to descriptors through which semantic connections are established.


A thesaurus is understood as a complex dictionary-type resource in which all the meanings in the dictionary are interconnected by semantic relationships that reflect the basic relationships between concepts in the subject area being described. In the past, the term thesaurus primarily denoted dictionaries that presented the vocabulary of a language as completely as possible, with examples of its use in texts.

The thesaurus includes lexemes belonging to four parts of speech: adjective, noun, verb and adverb. The descriptions for each part of speech are structured differently.

The main relations in the thesaurus are:

synonymy - the connection between words of the same part of speech that differ in sound and spelling but have the same or a very similar lexical meaning, for example: cavalry - horse troops, brave - courageous;
antonymy - the connection between words of the same part of speech that differ in sound and have directly opposite meanings: truth - lie, good - evil;
hyponymy/hyperonymy. A hypernym is a word with a broader meaning, expressing a general, generic concept, the name of a class (set) of objects (properties, attributes). A hyponym is a word with a narrower meaning that names an object (property, attribute) as an element of a class (set). These relations are transitive and asymmetrical. A hyponym inherits all the properties of its hypernym. They are the central relations for describing nouns;
meronymy/partonymy - the “PART-WHOLE” relationship. Within it, the relations “to be an element of” and “to be made of” are distinguished. The relation is defined only for nouns;
consequence (this relationship connects verbs);
reason (also defined for verbs).
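The statement that hyponymy/hyperonymy is transitive and asymmetrical, and that a hyponym inherits the properties of its hypernym, can be made concrete with a small sketch. Everything in it is toy data assumed for illustration: the chain of words, the property sets and the function names do not come from any real thesaurus.

```python
# Hypothetical toy data: each word maps to its direct hypernym (broader term).
HYPERNYM = {"tiger": "beast", "beast": "animal", "animal": "living thing"}

# Hypothetical properties attached directly to each word.
PROPERTIES = {
    "living thing": {"alive"},
    "animal": {"moves"},
    "beast": {"four-legged"},
    "tiger": {"striped"},
}


def hypernyms(word):
    """Return all hypernyms of a word, nearest first, following links transitively."""
    chain = []
    seen = {word}
    while word in HYPERNYM:
        word = HYPERNYM[word]
        if word in seen:  # guard against accidental cycles in the data
            break
        seen.add(word)
        chain.append(word)
    return chain


def is_a(narrow, broad):
    """True if `broad` is a direct or indirect hypernym of `narrow`."""
    return broad in hypernyms(narrow)


def inherited_properties(word):
    """Own properties plus everything inherited from all hypernyms."""
    props = set(PROPERTIES.get(word, ()))
    for broader in hypernyms(word):
        props |= PROPERTIES.get(broader, set())
    return props


print(hypernyms("tiger"))
print(is_a("tiger", "animal"), is_a("animal", "tiger"))
print(sorted(inherited_properties("tiger")))
```

Transitivity shows up in `is_a("tiger", "animal")` holding without a direct link, and asymmetry in the reverse query failing.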

Example thesaurus entry:

Hut - wooden peasant house

[hyperonym]: residential building
[meronym]: rural settlement
[synonym]: house

All relationships create a complex hierarchical network of concepts, and knowing where a concept is located in this network is an important part of knowing about that concept. The properties of relations are different when describing different parts of speech.

In different systems, a thesaurus can perform different functions:

a source of specialized knowledge in a narrow or broad subject area, a way of describing and organizing the terminology of the subject area;
search tool in information retrieval systems;
a tool for manual indexing of documents in information retrieval systems (the so-called controlled vocabulary);
automatic text indexing tool.

Thesauri as conceptual dictionaries originated with Roget (an English physician), who systematized the vocabulary of the English language into groups. Each group is represented by the name of a concept (the “categories”, of which there were at first one thousand; these are ordinary words arranged in alphabetical order, for example AFFIRMATION ... AGENCY ...), followed by its synonyms by part of speech (nouns, verbs, adjectives, adverbs), antonyms, and then lists of related words (there are many of them, and some are references to the names of other categories, in whose dictionary entries the list of “distant relatives” may continue, for example, from AGENCY ... see BUSINESS).

Roget's thesaurus was published in 1852, and it is still being reprinted in different forms and for different users. The thesaurus is constantly updated with new vocabulary and connections, but the name of the creator of the first version stays attached to every edition. The value of this thesaurus lies in its naturalness, in the fact that it describes the entire vocabulary of the language and not just terminology, and in the fact that it can be used in information retrieval systems as a means of increasing the semantic power of the system.

Thesauruses remain to this day the most accepted form of describing knowledge of a subject area, suitable for human perception. Examples of modern foreign thesauri are WordNet and EuroWordNet.

The English-language thesaurus WordNet appeared in 1990 and came to be actively used in various areas of automatic text processing. WordNet covers about 100 thousand different units (almost half of them are phrases), organized into 70,000 concepts.

The EuroWordNet multilingual thesaurus is currently being developed. Initially, for four languages (Dutch, Italian, Spanish and American English), a network of word meanings is being built, connected by semantic relationships and making it possible to find words of different languages that are similar in meaning. Unlike Roget's thesaurus and the WordNet network, which were created to describe the lexical and conceptual system of the English language, EuroWordNet is being created primarily to solve practical problems of automatic processing of large amounts of text. The most important tasks that this thesaurus is expected to help solve are the following:

providing multilingual information retrieval;
increasing the completeness of information retrieval;
formulating a request in natural language;
semantic indexing of documents, etc.

In addition to these relations, thematic relations are also introduced that connect concepts of one subject area. It is also proposed to introduce special notes on the relationships between concepts, denoting the disjunction or conjunction of relations. If a certain concept in the network has several relations of the same name, then they can be disjunctive, i.e., one of these relations is actually realized, or conjunctive, i.e., all these relations are valid for the concept.

Domestic institutes have created more than a hundred industry-specific thesauri that conform to a state standard for dictionaries of this type. They are called information retrieval thesauri (IRTs). Of all the possible semantic relationships between concepts, three are fixed in them: synonymy, the generic relationship (which usually includes the “PART-WHOLE” relationship) and “all others”, also called associative.

Standard IRTs are intended mainly for manual indexing of documents, as well as for formulating and varying queries during searches. There are also non-standard thesauri that aim at a selective systematization of terminology in a specific field of knowledge; this is especially relevant for new subject areas. There is a growing tendency to enrich thesauri with definitions of terms, which is important for resolving the ambiguity of terms, especially in related disciplines and when moving beyond the boundaries of narrow subject areas.

N. V. Lukashevich

B. V. Dobrov

Research Computing Center, Lomonosov Moscow State University;

ANO Center for Information Research

Keywords: thesaurus, information retrieval, automatic text processing,

The vast majority of technologies that work with large collections of texts are based on statistical and probabilistic methods. This is because lexical resources that could be used to process text collections with linguistic methods must contain tens of thousands of dictionary entries and have a number of important properties that must be specifically monitored as the resource is developed. In this report, we examine the basic principles of developing lexical resources for automatic processing of large text collections using the example of RuTez, a Russian-language thesaurus for computer text processing that has been under development since 1997 and is currently a hierarchical network of more than 42 thousand concepts. We describe the current state of the thesaurus by comparing its lexical composition with the text corpus of the University Information System RUSSIA (www.cir.ru), which contains 400 thousand documents. Examples of thesaurus use in various automatic text processing applications are discussed.

Introduction

Millions of documents are now available in electronic form, and thousands of information systems and electronic libraries have been created. Yet information systems that use lexical and terminological resources for search account for only a fraction of a percent of them. This is due to the serious difficulties of creating such linguistic resources for automatic processing of modern collections of electronic documents.

First, these collections are usually very large; the resource must include descriptions of thousands of words and terms. Secondly, collections are a set of documents of different structures with various syntactic structures, which makes it difficult to automatically process text sentences. In addition, important information is often distributed between different sentences of the text.

All this acutely raises the question of what a linguistic resource should be, which, on the one hand, would be useful for automatic processing and searching in electronic collections, on the other hand, could be created in a foreseeable time and maintained with relatively little effort.

In this article we consider the basic principles of developing lexical resources for automatic processing of large text collections. We examine these principles using the example of RuTez, the Russian-language thesaurus for computer text processing that the ANO Center for Information Research has been developing since 1997. RuTez is currently a hierarchical network of more than 42 thousand concepts, which includes more than 95 thousand Russian words, expressions, and terms. We describe the current state of the thesaurus by comparing its lexical composition with the vocabulary of the text corpus of the University Information System RUSSIA, maintained by the Research Computing Center of Lomonosov Moscow State University and the ANO Center for Information Research. UIS RUSSIA (www.cir.ru) contains 400 thousand documents on socio-political topics (about 3 GB of texts, 200 million words). The article also discusses examples of using the thesaurus in various automatic text processing applications.

Principles for developing a linguistic resource for information retrieval tasks

To ensure effective automatic processing of electronic documents (automatic indexing, categorization, comparison of documents), it is necessary to build a basis for comparing them: a list of what is mentioned in each document. For such an index to be more effective than a word-by-word index, it must overcome the lexical diversity of the text (synonyms, polysemy, parts of speech, stylistic variation) and reduce it to an invariant, a concept, which becomes the basis for comparing different texts. Thus concepts should form the basis of a linguistic resource, while linguistic expressions (words, terms) are only textual entry points that activate the corresponding concept.
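The idea of reducing different wordings to a common concept before comparing texts can be sketched as follows. This is an illustration under stated assumptions, not the indexing procedure of any actual system: the mini-lexicon `EXPRESSION_TO_CONCEPT` is invented, plural forms are listed explicitly instead of being handled by morphological analysis, and matching is a simple longest-expression-first scan.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: text expression -> concept name.
EXPRESSION_TO_CONCEPT = {
    "forest": "FOREST",
    "woodland": "FOREST",
    "wooded area": "FOREST",
    "tree": "TREE",
    "trees": "TREE",
    "birch": "BIRCH",
    "birches": "BIRCH",
}
# Longest expression in the lexicon, measured in words.
MAX_WORDS = max(len(expr.split()) for expr in EXPRESSION_TO_CONCEPT)


def concept_index(text):
    """Count the concepts mentioned in a text, whatever wording expresses them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest expression first so "wooded area" is matched as a unit.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            concept = EXPRESSION_TO_CONCEPT.get(phrase)
            if concept:
                counts[concept] += 1
                i += n
                break
        else:
            i += 1  # no expression starts at this token
    return counts


first = concept_index("The woodland is full of birches.")
second = concept_index("Birch trees grow in a wooded area.")
# Concepts shared by the two texts, although they share no content words.
print(sorted(set(first) & set(second)))
```

A word-by-word index would find nothing in common between the two sentences, while the concept index finds two shared concepts.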

In order to be able to compare different but similar concepts, relationships must be established between them. Traditionally, linguistic resources for automatic processing of natural-language texts have used certain sets of semantic relations, such as part, source, reason and so on. However, when working with large and heterogeneous text collections, we must understand that with the current state of text processing technology, a computer system cannot reliably detect these relationships in a text in order to perform the procedures that we associate with them. Therefore, the relations between concepts must first of all describe invariant properties that do not depend, or depend only weakly, on the topic of the specific text in which the concept is mentioned.

The main function of such a relation is to answer the following question:

if it is known that a text is devoted to discussing C1, and C2 is connected to C1 by relation R, can we say that the topic of the text is also related to C2? (*)

When creating a linguistic resource for automatic processing, it is important to determine which properties of the concepts C1 and C2 allow us to establish relations between them that satisfy (*).

So, for example, whatever texts are written about birch trees, we can always say that these texts are about trees. But although the relationship of a tree as part of a forest is popular and frequently discussed, very few texts about trees are texts about forests. Note that the problem is not related to the name of the relationship: a clearing is also part of a forest, and texts about clearings are texts about forests.

The invariance of relations with respect to the range of possible text topics in a subject area is largely determined by properties deeper than those reflected in the names of relations, namely their quantifier and existential properties. The quantifier properties of a relation describe whether all examples of a concept enter into the relation and whether the relation persists throughout the entire life cycle of an example. The problem with using the relation tree - forest arises precisely because not every specific tree is located in a forest, whereas a clearing cannot exist outside a forest.

An example of the existential properties of a relation is whether the existence of the concept C1 entails the existence of the concept C2 (for example, the existence of the concept GARAGE requires the existence of the concept AUTOMOBILE), or whether the existence of examples of C1 depends on the existence of examples of C2 (thus a specific FLOOD is inseparable from a specific example of RIVER). Discussion in a text of the dependent concept C1, especially when the dependence holds at the level of examples, suggests that the text is also related to the main concept C2.

Let us consider the relationship between the concepts FOREST and TREE in detail. Strictly speaking, a part of the concept FOREST is TREE IN THE FOREST, while there are also FREE-STANDING TREE, TREE IN THE GARDEN, etc. In any case, the subordination of the concept TREE to the concept FOREST must be broken.

On the other hand, FOREST is a kind of COLLECTION OF TREES and does not exist without trees (the same is true of GARDEN). Thus, a relation must be established from the concept FOREST to the concept TREE.

Starting with an analysis of the needs of specific application problems, we came to the conclusion that it is important to describe the deep properties of relations that were previously very little reflected in linguistic resources, but which are of paramount importance for the tasks of automatic processing of large text collections, and, possibly, for many other tasks.
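The birch, clearing, tree and forest examples can be turned into a small sketch of how rule (*) might be applied. The table below is hypothetical: the relation labels, the boolean flag recording whether a link satisfies (*), and the decision to chain links of different types are simplifying assumptions made for illustration, not a description of how any thesaurus stores its relations.

```python
# Hypothetical relation table: (concept, related concept, label, satisfies rule (*)).
RELATIONS = [
    ("BIRCH", "TREE", "ABOVE-BELOW", True),       # every birch is a tree
    ("CLEARING", "FOREST", "PART-WHOLE", True),   # a clearing cannot exist outside a forest
    ("TREE", "FOREST", "PART-WHOLE", False),      # not every tree stands in a forest
    ("FOREST", "TREE", "DEPENDENCE", True),       # a forest does not exist without trees
]


def topics(concept):
    """Concepts that a text about `concept` can also be said to relate to."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for source, target, _label, reliable in RELATIONS:
            # Follow only links that hold for every example of the source concept.
            if source == current and reliable and target not in found:
                found.add(target)
                frontier.append(target)
    return found


print(sorted(topics("BIRCH")))     # a text about birches is a text about trees
print(sorted(topics("TREE")))      # but a text about trees is not a text about forests
print(sorted(topics("CLEARING")))  # a text about clearings is a text about forests
```

The unreliable TREE to FOREST link is kept in the table on purpose: it records a true fact about some trees, but it is never used to widen the topic of a text.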

We currently model the quantifier and existential properties of concepts with a set of traditional thesaurus relations, ABOVE-BELOW (66% of all relations), PART-WHOLE (30% of relations) and ASSOCIATION (4%), in combination with a set of additional modifiers (20% of relations are marked with them). Note that the PART-WHOLE and ASSOCIATION relations are interpreted in accordance with rule (*). In total, about 160 thousand direct connections between concepts are described, which, taking the transitivity of relations into account, gives more than 1,350 thousand distinct connections; that is, on average, each concept is connected with about 30 others.

RuTez Thesaurus: general structure

The RuTez thesaurus is a hierarchical network of concepts corresponding to the meanings of individual words, text expressions or synonymous series. Thus, the main elements of a thesaurus are concepts, linguistic expressions, relationships between linguistic expressions and concepts, and relationships between concepts.

The thesaurus combines in a single system both linguistic knowledge (descriptions of lexemes, idioms and their connections, traditionally regarded as lexical, semantic knowledge) and knowledge about terms and relationships within subject areas, traditionally the domain of terminologists and described in information retrieval thesauri. The subject areas described in the thesaurus include economics, legislation, finance and international relations, which are so important for everyday human life that they are substantially represented in traditional explanatory dictionaries. In these areas, lexical and terminological knowledge are closely interconnected and interact strongly.

Linguistic expressions are individual lexemes (nouns, adjectives and verbs) and nominal and verbal groups; the thesaurus does not currently include adverbs or function words as linguistic expressions. Multiword groups may include terms, idioms, and lexical functions (influence).

For each linguistic expression the following is described:

Its polysemy, that is, its connection with one or more concepts, which means that the expression can serve as a textual expression of each of these concepts. Assigning a linguistic expression to several concepts is thus an implicit indication of its polysemy;

Its morphological composition (part of speech, number, case);

Its spelling features (for example, capitalization), etc.

Each thesaurus concept has a unique name, a list of linguistic expressions with which this concept can be expressed in the text, and a list of relationships with other concepts.

Usually one of a concept's unambiguous text expressions is chosen as its unique name. The name can also be formed from a pair of ambiguous text expressions that are synonyms, written separated by commas, which together identify the concept unambiguously (for example, the concept THICK). An ambiguous text expression used as a concept name can also be supplied with a mark or a shortened fragment of its definition, for example the concept CROWD (GROUP OF PEOPLE).
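The structure described above (concepts, linguistic expressions, and the links between them) can be sketched as a small data model. This is only an illustration under our own assumptions, not the actual RuTez storage format: the class and field names are hypothetical, and the sample data reuses the word photo, whose three meanings are discussed in the section on word meanings.

```python
from dataclasses import dataclass, field


@dataclass
class LinguisticExpression:
    text: str
    part_of_speech: str
    capitalized: bool = False
    concepts: list = field(default_factory=list)  # names of concepts it can express

    @property
    def is_ambiguous(self):
        # Belonging to several concepts is the implicit sign of polysemy.
        return len(self.concepts) > 1


@dataclass
class Concept:
    name: str  # unique within the thesaurus
    expressions: list = field(default_factory=list)  # texts of linguistic expressions
    relations: list = field(default_factory=list)  # (relation name, target concept name)


class Thesaurus:
    def __init__(self):
        self.concepts = {}
        self.expressions = {}

    def add_concept(self, name):
        if name in self.concepts:
            raise ValueError(f"concept name must be unique: {name}")
        self.concepts[name] = Concept(name)
        return self.concepts[name]

    def link(self, text, part_of_speech, concept_name):
        """Record that `text` can express the concept `concept_name`."""
        concept = self.concepts[concept_name]
        expression = self.expressions.setdefault(
            text, LinguisticExpression(text, part_of_speech)
        )
        if concept_name not in expression.concepts:
            expression.concepts.append(concept_name)
            concept.expressions.append(text)
        return expression

    def relate(self, source, relation, target):
        self.concepts[source].relations.append((relation, target))


thesaurus = Thesaurus()
for name in ("PHOTOGRAPHY", "PHOTOGRAPHIC IMAGE", "PHOTO STUDIO"):
    thesaurus.add_concept(name)
    thesaurus.link("photo", "noun", name)
thesaurus.link("photograph", "noun", "PHOTOGRAPHIC IMAGE")
thesaurus.relate("PHOTOGRAPHY", "PART", "PHOTOGRAPHIC IMAGE")
thesaurus.relate("PHOTOGRAPHY", "PART", "PHOTO STUDIO")
```

Here `thesaurus.expressions["photo"].is_ambiguous` is true because the expression is attached to three concepts, which corresponds to the (M) mark used in the dictionary entries.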

Example dictionary entry

We chose as an example the dictionary entry for the concept FOREST, corresponding to one of the meanings of the word forest. This dictionary entry is interesting because it includes different types of knowledge, traditionally classified as lexical (semantic) knowledge and encyclopedic knowledge (knowledge about the subject area, terminology).

Synonyms for the concept FOREST (13 in total):

forest(M), forest zone, forest environment, forest, forest quarter, forest landscape, forest area, woodland, wooded area, forest area, little forest, array of forests.

BELOW concepts with synonyms:

JUNGLE (jungle);
FOREST PARK (city garden, green area, green area, forest park, forest management, forest park belt, park(M), park area);
FORESTRY;
LEAVED FOREST (soft-leaved forest, hard-leaved forest);
GROVE (oak grove);
CONIFEROUS FOREST (coniferous forest, dark coniferous forest).

PART concepts with synonyms:

WINDBREAK (windfall, windfall);
CUTTING (cutting area);
FOREST CULTURE (forest species, forestry culture);
FOREST LAND (forest lands; lands covered with forest; forest lands, forest territory; forested land, forested area);
FOREST PLANTATIONS (forest plantations, forest plantations, afforestation);
EDGE OF THE FOREST (edge, edge);
UNDERFLOWER (undergrowth);
PROSEKA;
DRY WOOD (deadwood).

Here the mark (M) indicates that the text entry is ambiguous.

The concept FOREST also has other relations, the so-called dependency relations (in the current version they are called ASC 2, asymmetric association): FOREST FIRE (forest fire, forest fire); FOREST USE (forest use, use of forest fund areas); FORESTRY; FOREST SCIENCE (forest science). As already noted in paragraph 2, the concept FOREST depends on the concept TREE, which is denoted in the thesaurus by the relation ASC 1.

In total, the concept FOREST is directly connected with 28 other concepts and, taking the transitivity of relations into account, with 235 concepts (more than 650 text entries in total).

Assessment of the current state of the Russian language thesaurus RuTez

5.1. Lexical composition

Currently, the thesaurus network includes more than 95 thousand linguistic expressions, of which 61 thousand are single-word.

This volume of work forced us to decide what words and linguistic expressions needed to be included in the Thesaurus descriptions. The natural desire was to see how the most frequent words in the Russian language were represented in the thesaurus. For this purpose, the text collection of the University Information System RUSSIA (400 thousand documents) was used. The collection contains official documents from various bodies of the Russian Federation (55 thousand documents since 1992), as well as press materials since 1999 (newspapers Izvestia, Nezavisimaya Gazeta, Komsomolskaya Pravda, Argumenty i Fakty, Expert magazine and others), materials from scientific journals (“Bulletin of Moscow University”, “Sociological Journal”). A comparison was made between the list of lemmas included in the Thesaurus and the list of the most frequent 100,000 lemmas in the text collection (frequency more than 25).

Lexeme-by-lexeme marking of the list showed that, of these hundred thousand lemmas, 35 thousand are already described in RuTez and only about 7 thousand further lexemes deserve inclusion in the Thesaurus; the rest are lemmatized variants of various proper names. Replenishment has therefore ceased to be a priority and is carried out gradually, starting with the most frequent words. Once this list is largely exhausted, another comparison with the text collection of the information system is planned, in which new lexemes with a frequency above 25 will be selected. After that, the frequency threshold is to be lowered. The large number of text examples in the collection makes it possible to respond quickly to “lexical innovations” (for example, installation, blockbuster, beau monde, thriller) and to place them at the appropriate points in the hierarchical system of the Thesaurus.
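The comparison between the frequency list and the thesaurus can be sketched as follows. This is a minimal illustration under our own assumptions: the documents are taken to be already lemmatized, the function name `missing_frequent_lemmas` and the toy data are hypothetical, and the manual step of separating proper names from lexemes worth including is not modelled.

```python
from collections import Counter


def missing_frequent_lemmas(lemmatized_docs, thesaurus_lemmas, min_freq=25, top_n=100_000):
    """Return (lemma, frequency) pairs that are frequent but absent from the thesaurus."""
    counts = Counter(lemma for doc in lemmatized_docs for lemma in doc)
    frequent = [(lemma, n) for lemma, n in counts.most_common(top_n) if n > min_freq]
    return [(lemma, n) for lemma, n in frequent if lemma not in thesaurus_lemmas]


# Toy collection; a threshold of 1 stands in for the threshold of 25 used in the text.
docs = [
    ["forest", "thriller", "forest"],
    ["thriller", "blockbuster", "forest"],
]
candidates = missing_frequent_lemmas(docs, {"forest"}, min_freq=1)
print(candidates)
```

The result is ordered by descending frequency, which matches the practice of replenishing the thesaurus starting with the most frequent words.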

Constant work with a current text collection provides unique opportunities to check the significance and quality of the lexical descriptions offered in dictionaries. For example, the word Mother See turned out to have an unusually high frequency (more than 400 occurrences). Checking the collection showed that the word is indeed often used as a synonym for Moscow, whereas explanatory dictionaries often mark it as obsolete. Another frequently used word (more than 300 occurrences) that dictionaries mark as obsolete is blissful.

5.2 Description of word meanings

Comparison with the text collection shows that many of the frequent words in the collection are well represented in the Thesaurus in at least one of their (usually basic) meanings. Determining to what extent the range of meanings of polysemous Russian words is represented in the Thesaurus is currently our primary task.

It is well known that different dictionary sources often give different sets of meanings for polysemous words and distinguish different shades of meaning, and that the same type of polysemy can be described differently for different words even within a single dictionary. Describing the meanings of lexemes consistently and representatively is therefore an important task for the creators of any lexical resource.

However, if the resource is intended for automatic processing, a balanced description of meanings becomes much more important. An excessive proliferation of meanings can leave the computer system unable to select the right one, which in turn significantly reduces the performance of the automatic text processing system. Thus, one of the disadvantages of WordNet as a resource for automatic text processing is the excessive number of meanings described for some words (in WordNet 1.6: 53 meanings for run, 47 for play, and so on). These meanings are difficult to distinguish even for human annotators during semantic annotation of texts, so a computer system clearly cannot cope with choosing the appropriate one either. Different authors therefore propose different ways of merging meanings to improve processing quality.

At the same time, an opposite factor is at work: if meanings really differ in their sets of dictionary connections (in our case, thesaurus connections), they cannot be merged into one unit (one concept), since this would also degrade the quality of automatic processing.

Consider, for example, the words school and church, each of which can be regarded both as an organization and as a building.

Each school as an organization has a building (most often one). All parts of a school building (classrooms, blackboards) are related to the school as an organization. There are no specific types of school buildings. It is therefore inappropriate to separate the school as a building into a concept of its own. However, such a collective concept SCHOOL, covering both the organization and the building, must have a specially marked relation to the concept BUILDING. Such relations are described in the Thesaurus with a mark on the relation, the modifier “A” (“aspect”; during automatic analysis, “confirmation” by other concepts is required for this relation to be taken into account).

SCHOOL
ABOVE EDUCATIONAL INSTITUTION
ABOVE (A) PUBLIC BUILDING
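How an aspect-marked relation might be handled during automatic analysis can be sketched as follows. The text only says that such a relation needs "confirmation" by other concepts; the concrete criterion used here (another concept found in the same text lies under the target of the relation) is our assumption, and the function names and the concept LIBRARY BUILDING are hypothetical.

```python
def plain_ancestors(concept, relations):
    """All higher concepts reachable without crossing an aspect-marked link."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for target, modifier in relations.get(current, ()):
            if modifier != "A" and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def usable_parents(concept, text_concepts, relations):
    """Higher concepts of `concept` that may be used when analysing this text."""
    others = set(text_concepts) - {concept}
    parents = []
    for target, modifier in relations.get(concept, ()):
        if modifier == "A":
            # Assumed criterion: some other concept in the text confirms the target.
            confirmed = any(
                other == target or target in plain_ancestors(other, relations)
                for other in others
            )
            if not confirmed:
                continue
        parents.append(target)
    return parents


relations = {
    "SCHOOL": [("EDUCATIONAL INSTITUTION", ""), ("PUBLIC BUILDING", "A")],
    "LIBRARY BUILDING": [("PUBLIC BUILDING", "")],  # hypothetical concept
    "PUBLIC BUILDING": [("BUILDING", "")],
}

with_building = usable_parents("SCHOOL", {"SCHOOL", "LIBRARY BUILDING"}, relations)
without_building = usable_parents("SCHOOL", {"SCHOOL", "TEACHER"}, relations)
```

In the first call the building aspect of SCHOOL is confirmed by another building concept in the text, so both higher concepts are returned; in the second call only the organizational reading remains.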

The corresponding meanings of the word church are not as close. The church as an organization can have a large number of church buildings in different places, and it also owns many other buildings. A church building is closely tied to a religion and a confession, but it can pass from one church organization to another. The church as an organization and the church as a building have different subtypes. That is why CHURCH (ORGANIZATION) and CHURCH (BUILDING) are represented in RuTez as different concepts.

The significant divergence in thesaurus connections correlates in an interesting way with the ability of denotations corresponding to meanings to exist separately from each other. Thus, a church-building does not cease to exist and even be called a church even when its use changes, unlike a school-building.

The representation of meanings in the Thesaurus is checked continuously, starting with the most frequent lemmas. For each frequent lexeme we check how its meanings are described in explanatory dictionaries, which meanings occur in the collection, and how they are represented in the Thesaurus. As a result, a list of 10,000 lexemes has been compiled whose ambiguity still requires either additional analysis or additional description. The list was obtained from the 30 thousand most frequent lemmas.

It should be noted that in the Thesaurus the problem of polysemy is partly alleviated because thesaurus relations can be described between the different meanings of a word, so that the highest concept in the hierarchy can be selected by default: the text was certainly about it. For example, the word photo has three meanings: photography as a field of activity, a photographic image, and a photo studio:

PHOTOGRAPHY (photographing, photo business, ..., photo)
PART PHOTOGRAPHIC IMAGE (photo, photograph, photo)
PART PHOTO STUDIO (photo).

Thus, if it is not possible to determine in which meaning the word photo was used, the default assumption is that photography in general was meant (the process, the result, or the location), which is sufficient for many automatic text processing applications.
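The default choice described above can be sketched as a search for a candidate concept that lies above all the other candidates. This is a minimal illustration, not the RuTez implementation: the function names are hypothetical, and the `higher` mapping is assumed to list, for each concept, the concepts directly above it (via ABOVE or WHOLE links).

```python
def higher_closure(concept, higher):
    """All concepts above `concept`, following links transitively."""
    seen = set()
    stack = list(higher.get(concept, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(higher.get(current, ()))
    return seen


def default_concept(candidates, higher):
    """Return the candidate that is above all other candidates, or None."""
    for candidate in candidates:
        if all(
            candidate == other or candidate in higher_closure(other, higher)
            for other in candidates
        ):
            return candidate
    return None


higher = {
    "PHOTOGRAPHIC IMAGE": {"PHOTOGRAPHY"},
    "PHOTO STUDIO": {"PHOTOGRAPHY"},
}

photo_default = default_concept(
    ["PHOTOGRAPHIC IMAGE", "PHOTO STUDIO", "PHOTOGRAPHY"], higher
)
church_default = default_concept(
    ["CHURCH (ORGANIZATION)", "CHURCH (BUILDING)"], higher
)
```

For photo the default is PHOTOGRAPHY. For church no default exists in this toy data, because the two concepts are not linked to each other here, so the ambiguity would have to be resolved from context.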

Application of the RuTez thesaurus for automatic text processing

Since 1995, the socio-political terminology of RuTez (the socio-political thesaurus) has been actively and successfully used in various automatic text processing applications, such as automatic conceptual indexing, automatic rubrication with several rubricators, and automatic annotation of texts, including English-language ones. The socio-political thesaurus (27 thousand concepts, 62 thousand text entries) is the basic search tool of the UIS RUSSIA search system (www.cir.ru).

All vocabulary of the RuTez thesaurus is used in the procedures for automatically categorizing texts using complex hierarchical rubricators. In the existing technology, each category is described as a Boolean expression of terms, after which the original formula is expanded along the thesaurus hierarchy. The resulting Boolean expression may already include hundreds and thousands of conjuncts and disjuncts.

Let us give, as an example, a fragment of a description using thesaurus concepts (and linguistic expressions after expanding the formula) of the “Image of a Woman” rubric of the SOFIST 2 rubricator, used by VTsIOM to classify public opinion poll questionnaires:

(WOMAN[N]
|| GIRL[N]
|| RELATIVE[L] (grandmother, granddaughter, cousin, daughter, sister-in-law, mother, stepmother, daughter-in-law, stepdaughter, ...))
(CHARACTER TRAIT[L] (thrifty, heartless, forgetful, frivolous, mocking, intolerant, sociable, ...)
|| IMAGE[E] (presentation, appearance, appearance, appearance, appearance, image, look)
|| PLEASANT[L] (..., interesting, beautiful, cute, attractive, cute, attractive, ...)
|| UNPLEASANT[L] (unsympathetic, rude, nasty, ...)
|| APPRECIATE[L] (to revere, adore, adore, worship, adore, ...)
|| PREFER[N]

The symbol “E” denotes full expansion along the thesaurus hierarchy, the symbol “L” denotes expansion along species relations only (“BELOW”), and the symbol “N” means that the term is not expanded.
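The expansion of a rubric formula can be sketched as follows. The sketch rests on several assumptions of ours: a formula is modelled as a list of disjunctive groups that must all be satisfied (the fragment above does not show the operator joining its two groups), "E" is taken to follow BELOW links plus all other downward links, matching is done on concepts rather than on expanded linguistic expressions, and the concept names in the toy data are illustrative.

```python
def expand(concept, mode, below, related):
    """Concepts covered by one term of a rubric formula."""
    covered = {concept}
    if mode == "N":  # do not expand
        return covered
    stack = [concept]
    while stack:
        current = stack.pop()
        neighbours = set(below.get(current, ()))
        if mode == "E":  # full expansion: also follow the other downward links
            neighbours |= set(related.get(current, ()))
        for neighbour in neighbours - covered:
            covered.add(neighbour)
            stack.append(neighbour)
    return covered


def matches(formula, text_concepts, below, related):
    """True if every group of the formula has at least one term found in the text."""
    found = set(text_concepts)
    return all(
        any(expand(concept, mode, below, related) & found for concept, mode in group)
        for group in formula
    )


below = {"RELATIVE": ["MOTHER", "SISTER"], "PLEASANT": ["BEAUTIFUL"]}
related = {"IMAGE": ["APPEARANCE"]}
formula = [
    [("WOMAN", "N"), ("GIRL", "N"), ("RELATIVE", "L")],
    [("IMAGE", "E"), ("PLEASANT", "L")],
]

hit = matches(formula, {"MOTHER", "BEAUTIFUL"}, below, related)
miss = matches(formula, {"MOTHER"}, below, related)
```

Each expanded term becomes a disjunction over all covered concepts, which is why a formula of a dozen terms can grow to hundreds or thousands of conjuncts and disjuncts on a real thesaurus.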

Research is being carried out to develop a combined technology for automatic text categorization, combining thesaurus knowledge and machine learning procedures.

The use of the thesaurus for expanding queries formulated in natural language is being explored (at present, only the socio-political part of the thesaurus is used to expand terminological queries in the information retrieval system of the UIS RUSSIA), as is the search for answers to questions in large text collections.

7. Conclusion

The paper presents the basic principles of developing linguistic resources for automatic processing of large text collections. The created linguistic resource - Thesaurus of the Russian language RuTez - is intended for use in such automatic text processing applications as conceptual indexing of documents, automatic rubrication according to complex hierarchical rubricators, automatic expansion of natural language queries.

This work is partially supported by the Russian Humanitarian Foundation grant No. 00-04-00272a.


Thesaurus of Russian language for natural language processing of large text collections

Natalia V. Loukachevitch, Boris V. Dobrov

Keywords: thesaurus, natural language processing, information retrieval

In our presentation we consider the main principles of developing lexical resources for automatic processing of large text collections and describe the structure of the Thesaurus of Russian Language, which has been developed since 1997 specifically as a tool for automatic text processing. The Thesaurus is now a hierarchical net of 42 thousand concepts. We describe the current stage of Thesaurus development in comparison with the 100,000 most frequent lemmas of the text collection of the University Information System RUSSIA (www.cir.ru), which includes 400 thousand documents. We also consider the use of the Thesaurus in different applications of automatic text processing.

Conceptual system of a subject area

The basis of any subject area is its system of concepts. Definition of a concept: a concept is a thought that reflects objects and phenomena of reality in a generalized form by fixing their properties and relationships; these properties and relationships appear in the concept as general and specific features correlated with classes of objects and phenomena (Linguistic Dictionary).

Concepts and terms

To express the concepts of a subject area in texts, words or phrases called terms are used. The set of terms of a subject area forms its terminological system. The relationship of a specific term to the other terms of the system is specified by means of a definition.

Definitions of the term

A word (or combination of words) that is an exact designation of a specific concept in a special field of science, technology, art, social life, etc. || A special word or expression used to designate something in a particular environment or profession (Big Explanatory Dictionary of the Russian Language).

Terms as exact names of concepts

Usually each concept in a field corresponds to at least one unambiguously understood term whose meaning is that concept. These are terms in the sense of the traditional theory of terminology. Properties of terms as exact names of concepts:
- the term must relate directly to the concept and express it clearly;
- the meaning of the term must be precise and must not overlap with the meanings of other terms;
- the meaning of the term should not depend on the context.
Terms that accurately name concepts are the subject of research in the theory of terminology, carried out by terminologists.

Text terms

In real texts of a subject area, many different language expressions can be used to refer to a concept in addition to the basic terms; we call them text terms:
- syntactic and word-formation variants: recipient of budget funds, budget recipient;
- lexical variants: direct write-off, undisputed write-off;
- polysemous expressions that refer to different concepts of the field depending on the context; for example, the word currency can mean national currency or foreign currency in different contexts.

Descriptors with marks

- A mark is part of the descriptor name: cranes (lifting equipment) vs. cranes (birds); shells (structures); comparison of different thesauri
- Phrases are preferred: phonograph records vs. records (phonograph)
- Marks and the plural: wood (material), woods (forested areas)

Including descriptors based on multi-word expressions

- Splitting the term increases ambiguity: plant food
- The meaning of the expression depends on the word order: information science vs. scientific information
- One of the component words is outside the scope of the thesaurus or is too general: first aid
- The relations of the descriptor do not follow from its structure: artificial kidneys, refugee status, traffic lights

Associative relations

- Field of activity and actor: mathematics, mathematician
- Discipline and object of study: neurology, nervous system
- Action and agent or tool: hunting, hunter
- Action and result of the action: weaving, fabric
- Action and goal: bookbinding, book
- Cause and effect: death, funeral
- Quantity and unit of measurement: current strength, ampere
- Action and counteragent: allergen, antiallergic drug
- etc.

Information retrieval thesauri: stages of development

- First stage: indexers describe the main topic of a text using arbitrary words and phrases
- The terms obtained from many texts are brought together
- Among terms that are similar in meaning, the most representative one is selected
- Some of the remaining terms become conditional synonyms, the rest are deleted
- Specific terms are usually not included

Information retrieval thesauri: the art of development

- Descriptors are the terms needed to express the main topic of a document
- Only the most necessary synonyms are included (for example, those starting with a different letter), so as not to complicate the indexer's work
- Related terms should be reduced to a single term to avoid subjectivity in indexing
- The number of hierarchy levels and the inclusion of specific terms are limited

Information retrieval thesauri: the art of development, part 2

- In complex cases, descriptors are supplied with marks and comments (LIV: bombardment, bombing)
- Polysemous terms: one meaning is kept in the thesaurus (capital), the others do not fit in the thesaurus; marks are needed
- A traditional information retrieval thesaurus is an artificial language built on the basis of real terms

Traditional information retrieval thesauri: application in automatic processing

- Lack of knowledge about the real language of the subject area
- Legislative Indexing Vocabulary: TROOPS in the text, MILITARY FORCES in the thesaurus; CAPITAL in the text can have either meaning of capital, while the thesaurus contains only one
- Proposed: supplement each descriptor with lists of words and terms
- But: words may be polysemous or relate to different descriptors
- Ambiguity resolution is required

Traditional information retrieval thesauri: automatic query expansion

- Problem with associations
- Proposed: introduce weights; introduce names of relations (object, property, etc.)
- CONCLUSION: linguistic resources need to be built specifically for automatic processing of text collections

EUROVOC thesaurus

- Multilingual thesaurus of the European Community
- Thesaurus in 9 languages
- Russian version of EUROVOC: plus 5 thousand concepts reflecting Russian specifics
- Multilingual thesaurus: descriptors have names in the different languages; ascriptors exist for some languages

Automatic indexing according to the EUROVOC thesaurus, based on rules (Hlava, Heinebach, 1996)

Example rule:
IF (near "Technology" AND with "Development")
USE Community program
USE development aid
ENDIF

- 40 thousand rules
- Testing: the 20 most frequent descriptors in the text, generated automatically, gave 42% recall compared with manual rubrication

Automatic indexing based on correspondence weights between words and descriptors (Steinberger et al., 2000)

- Stage 1: establishing the correspondence between text words and assigned descriptors on the basis of statistical measures (chi-square or log-likelihood). For the descriptor FISHERY MANAGEMENT, the words are (in descending order of weight): fishery, fish, stock, fishing, conservation, management, vessel, etc.
- Stage 2: indexing proper, by summing the logarithms of the weights or as a scalar product of vectors
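Stage 2 can be illustrated with a short sketch that ranks descriptors by the sum of the logarithms of the weights of the words found in a text. The weights below are invented for illustration (Stage 1 is not reproduced), the descriptor STOCK EXCHANGE is hypothetical, and the function name `score_descriptors` is ours, not part of the cited work.

```python
import math


def score_descriptors(text_words, weights):
    """Rank descriptors by the sum of log-weights of their associated words in the text."""
    scores = {}
    for descriptor, word_weights in weights.items():
        total = sum(
            math.log(word_weights[word]) for word in text_words if word in word_weights
        )
        if total > 0:
            scores[descriptor] = total
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented word-descriptor weights, listed in descending order as in the text.
weights = {
    "FISHERY MANAGEMENT": {"fishery": 50.0, "fish": 30.0, "stock": 12.0},
    "STOCK EXCHANGE": {"stock": 20.0, "share": 40.0},  # hypothetical descriptor
}

ranking = score_descriptors(["fish", "stock", "vessel"], weights)
```

The ambiguous word stock contributes to both descriptors, but the additional evidence from fish places FISHERY MANAGEMENT first in the ranking.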

Combining free-text queries with queries based on an information retrieval thesaurus

- A manually indexed collection is used to establish correlations
- The user poses a query in natural language
- The query is expanded with the thesaurus descriptors most strongly correlated with it (Petras 2004; Petras 2005)
- For example, for the query Insolvent Companies the descriptors liquidity, indebtedness, enterprise, firm can be obtained, and the query can be expanded with them
- Accuracy in the experiment increased by 13%