ABBYY pushes the boundaries of computer linguistics

Director of ABBYY linguistics Vladimir Selegey: "Computer linguistics, overall, is an engineering science that arises from attempts to either anticipate or follow the needs of people who work with language on their computers." Source: ABBYY

Director of ABBYY linguistics Vladimir Selegey talks about the possibilities of modern computer linguistics.

What is the difference between the search engine function and computer linguistics? Can natural language and machine language be merged? What counts as a language? Vladimir Selegey, head of the faculty of computer linguistics at the Russian State University for the Humanities and Moscow Institute of Physics and Technology, and director of the ABBYY linguistics research company, answered these and other questions.

How does your way of doing things differ from other search engines on the market?

Vladimir Selegey:

The linguistics technology behind modern search engines is pretty basic. Let me explain what I mean here. Search engines are currently dominated by statistical methodology: they deliver quick results but without any linguistic analysis; they compare humongous quantities of text data on the system and the history of queries that have been fed in.

It’s a way of applying a modern mathematical approach to machine-learning, but the search criteria bear no relation to any analysis of meaning.

We’ve got a different way of working. We want to compare the semantic proximity of the question to the text it finds, based on semantic analysis. Sure, it’s riskier and it costs more, so fewer people go down this “knowledge-hungry” route. Google and Microsoft Research have been doing related research on this stuff. They can afford the luxury of addressing these types of problems.

So your ABBYY Compreno system: is it really a machine that can “understand for itself”? What does “understand” really imply here, in computer linguistics terms?

V.S.: We’re not merely comparing surface strings or word sequences, but the deeper concepts behind them, which can be associated with “meanings.”

Was such a pragmatic view of computer linguistics applied to research activities within ABBYY?

V.S.: Actually, fundamentality was our major concern when we went ahead with the methodology we’re using. From the outset we took the pretty costly and resource-intensive decision to build a universal linguistic model that compelled us to follow a specific sequence. When you work with language, you can’t leave out any of the stages: you need a complete morphology, syntax, semantics, grammatical semantics, and so on.

Creating the linguistic model took us a long time, and you can add the description of any language to it. What we created was a model that demonstrated its own functionality, and we tested it in five languages: Russian, English, German, French and Chinese.

Where do you go from here? What’s your department working on currently?

V.S.: We’ve reached the point where we’ve got solid guarantees that no new language could throw us any kind of curve ball. Our linguistic analysis technologies can be put to use on almost any practical task. Our initial projects were on computerized translation.

But nowadays the enormous market for information search has opened up, and this demands new linguistic technology for tasks such as classifying documents, extracting facts and relations, and comparing documents to identify differences. One of our main research fields currently is a shift from linguistic descriptions to a system of formal descriptions of specific subject areas.

For example, this might be a universal ontology of space and time, or a systemic description of some particular environment that’s referenced in the text.

Is this really linguistic work?

V.S.: In the strictest sense, it is. In the description of a location there’s no language that couldn’t be included. Let’s say, for example, you’re setting up a model of a computer game in which you need to model some objects. There’s also a temporal kind of ontology, systems of cause-and-effect relationships, and other descriptive systems that are reflected in language in a non-trivial way, but which exist on their own terms.

Wouldn’t you say that the future lies in a fusion of computer language with natural language? How far can the difference between natural and non-natural languages be justified, given the enormous role that machine systems play in the virtualization of reality, now and in the future?

V.S.: Natural language has many functions. The direct “coding” of what can be represented in machine language is just one of them, but other functions are implemented in the process of communication. The “incompleteness” of natural language, which precludes it being used as a formal language — ambiguities, redundancies, ellipses — is simultaneously the source of its infinite possibility for communication.

The formal language of universal semantic entities has value, because it permits us to work with text not only on a computer, but within special subjects such as physics or math. Using this go-between language we can project text onto, let’s say, the language of logical predicates, or a descriptive system of some types of laws of physics. But we’re not merging anything here; they’re just different languages for different ends.

Computer linguistics is something with a practical use, but the concept of what a language is based upon is becoming increasingly blurred, wouldn’t you say? Isn’t it possible that linguistics has lost its own language?

V.S.: Computer linguistics, overall, is an engineering science that arises from attempts to either anticipate or follow the needs of people who work with language on their computers. That’s all it is. We don’t yet know how mankind’s language ability works, that is, the mechanism that lets us talk. And that’s the issue that the global study of linguistics is dealing with. We’re trying to create a model of how language is able to relay information.

So if we took that to the limit, we could make virtual movies of the texts of classic novels, for example?

V.S.: Sure, ideally. But the world model through which you’d have to play the literary text would be unfeasibly complex, compared with football, with its basic action. I love the idea of programmed visualizations of classic books! But it would be majorly complicated: you’d need to include not just the widest-imaginable knowledge of the world, but deeply-sourced models of the understanding of human psychology too.

But I guess the day will come when someone makes an attempt at it, for sure!

The interview was first published in Russian.

All rights reserved by Rossiyskaya Gazeta.