The program can analyze chat conversations based on the frequent use of sexually explicit language. Photo: ShutterStock/Legion-Media
In 2011, experts at the Higher School of Economics (HSE) in Moscow, in cooperation with European scientists, developed a computer program to analyze large volumes of unstructured text. While it can be used to solve various problems, the most interesting application is the detection of pedophiles in Internet chat rooms.
The software has been successfully used by the Amsterdam police, but its precise name remains a carefully guarded secret, as do most details of its application.
The program can analyze the content of chat conversations based on the frequent use of sexually explicit language. Police officers get a visual representation of the relationships between chat participants, the vocabulary used, and possible sexual acts. A forensic expert then makes the final conclusion.
The program has at least six major components. The Russian part analyzes bulk text using Formal Concept Analysis and arranges the data in a so-called concept lattice diagram, a convenient visual chart. This component is part of an automated research system, the Formal Concept Analysis Research Toolbox (FCART).
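To give a rough idea of what Formal Concept Analysis does, here is a minimal Python sketch. It is not the FCART implementation, whose details are not public. The chat names and the indicator labels are hypothetical, and the brute-force enumeration is only practical for toy data. Each "formal concept" pairs a set of chats with the full set of labels those chats share; ordered by size, these pairs form the concept lattice.

```python
from itertools import combinations

# Hypothetical toy data: each chat session is tagged with indicator labels.
context = {
    "chat1": {"flirting", "photo_request"},
    "chat2": {"flirting", "photo_request", "personal_meeting"},
    "chat3": {"photo_request"},
    "chat4": {"flirting"},
}

def common_attributes(objects, context):
    """Return the labels shared by every chat in `objects`."""
    shared = set().union(*context.values())  # start from all labels
    for obj in objects:
        shared &= context[obj]
    return shared

def objects_having(attrs, context):
    """Return the chats that carry every label in `attrs`."""
    return {obj for obj, labels in context.items() if attrs <= labels}

def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every subset of chats."""
    names = sorted(context)
    concepts = set()
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset, context)
            extent = objects_having(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts

# Print concepts from the most specific (few chats) to the most general.
for extent, intent in sorted(formal_concepts(context),
                             key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), "share", sorted(intent))
```

Drawing each concept as a node and linking it to the next larger ones gives the kind of lattice diagram the article describes.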
"The database organizes concepts from a computer's point of view: what are things such as 'pedophile,' 'crime,' 'flirting,' or 'personal meeting,'" explained Alexei Neznanov, a senior researcher at the HSE's Laboratory of Intelligent Systems and Structural Analysis. "This is how we helped pass information to the computer from forensic experts, who can now determine the nature of texts by looking at diagrams. Previously, they read and analyzed chat texts almost entirely themselves."
The program is able to trace a suspect even when he or she uses different nicknames. It tracks similarities in word usage across different chat sessions and identifies the sequence in which fragments of text were created and how they are related in time. This feature was developed by Belgian and Dutch scientists.
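The article does not say how the similarity in word usage is measured. One common technique, shown here purely as an illustrative assumption, is to compare word-frequency profiles with cosine similarity. The nicknames and messages below are invented, and the function names are hypothetical.

```python
import math
from collections import Counter

def word_profile(messages):
    """Count how often each lowercase word appears across the messages."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count profiles (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)

# Invented sample sessions, one list of messages per nickname.
sessions = {
    "nick_a": ["hey wots up m8", "wots ur fav band"],
    "nick_b": ["wots up m8", "c u l8r m8"],
    "nick_c": ["Good evening everyone", "How is the weather"],
}
profiles = {nick: word_profile(msgs) for nick, msgs in sessions.items()}

print(cosine_similarity(profiles["nick_a"], profiles["nick_b"]))
print(cosine_similarity(profiles["nick_a"], profiles["nick_c"]))
```

A higher score for one pair of nicknames than for another would only be a hint that the same person may be writing under both; a real system would need far more text and additional signals.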
When creating the program and its database, developers took into account the particularities of Internet chat culture. "We compiled slang names for body parts, the use of numbers for words, such as '2' for 'to' or '4' for 'for,' as well as standard chat abbreviations such as LOL and commonly misspelled words," Dr. Neznanov told RBTH.
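A lookup table of this kind can be sketched in a few lines of Python. The table below is a tiny hypothetical sample, not the researchers' actual dictionary, and the blanket replacement of "2" and "4" is a simplification: a real system would have to tell a number used as a word from a genuine number.

```python
import re

# Hypothetical sample entries: number-for-word substitutions, a chat
# abbreviation, and a common misspelling.
SUBSTITUTIONS = {
    "2": "to",
    "4": "for",
    "lol": "laughing out loud",
    "becuase": "because",
}

def normalize(message):
    """Lowercase the message, split it into tokens, and expand known chat forms."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    return " ".join(SUBSTITUTIONS.get(token, token) for token in tokens)

print(normalize("LOL thanks 4 the help, want 2 chat?"))
# prints: laughing out loud thanks for the help want to chat
```

Normalizing the text first means later analysis steps can work with one standard spelling of each word instead of every chat variant.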
In addition, it was necessary to implement measures to protect legitimate professional chats. "A classic example is chats between photographers who discuss photographing a group of children," said Dr. Neznanov.
"Specifically for such cases, we had to clarify the concept of 'request for photo and video materials,' taking into account that most photographers are not pedophiles."
The program was tested on bulk texts provided by a U.S. organization that fights pedophilia, as well as on a database of actual crimes committed. It can be used not only for chat sessions but also for other Internet texts, including those from social networks.
The program can scan both open and closed chats of underage members, with their parents' permission. Closed chats are examined by undercover police officers, who save the chat sessions in police databases.
Presently, the program can analyze text in English, Dutch, and German. Researchers cannot yet work with other languages, including Russian and French, because the necessary computational linguistics tools for them have not been developed.
All rights reserved by Rossiyskaya Gazeta.