For citation:

Grigorieva E. G., Klyachin V. A. The Study of the Statistical Characteristics of the Text Based on the Graph Model of the Linguistic Corpus. Izvestiya of Saratov University. Mathematics. Mechanics. Informatics, 2020, vol. 20, iss. 1, pp. 116-126. DOI: 10.18500/1816-9791-2020-20-1-116-126, EDN: KNPYYV

This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0).

Published online:

02.03.2020

Language:

Russian

Heading:

Computer Sciences

Article type:

Article

UDC:

519.688+004.942

DOI:

10.18500/1816-9791-2020-20-1-116-126

EDN:

KNPYYV

The Study of the Statistical Characteristics of the Text Based on the Graph Model of the Linguistic Corpus

Authors:

Grigorieva Elena Gennad'evna, Volgograd State University

Klyachin Vladimir Aleksandrovich, Volgograd State University

Abstract:

The article studies statistical characteristics of a text that are computed from a graph model of the text taken from a linguistic corpus. The introduction discusses the relevance of statistical text analysis and some of the problems it helps to solve. The proposed graph model places the words of the text at the vertices, and an edge records that two words occur together in some part of the text, for example in the same sentence. For the vertices and edges of the graph, the article introduces the notion of a weight taking values in some additive semigroup. Formulas for computing the graph and its weights under text concatenation are proved. Based on the proposed model, the calculations are implemented in the Python programming language. For the experimental study, 24 quantities are singled out, expressed in terms of the weights of the vertices and edges and other characteristics of the graph, for example the degrees of its vertices. The purpose of the numerical experiments is to find characteristics of the text that make it possible to determine whether a text was written by a human or randomly generated. The article proposes one possible algorithm that generates random text using another, human-written text as a template; the order in which parts of speech alternate in the auxiliary text is preserved in the random text. It turns out that the required conditions are satisfied by the median value of the ratio of the text graph edge weight to the number of sentences in the text.
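The graph model described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract only says that weights belong to some additive semigroup, so the sketch assumes the simplest case, non-negative integers under addition, with a vertex weight equal to the number of occurrences of a word and an edge weight equal to the number of sentences in which both words occur. The function names (`split_sentences`, `text_graph`, `concat_graphs`, `median_edge_weight_ratio`) and the naive sentence splitter are hypothetical, and the last function is only one possible reading of the median statistic mentioned above.

```python
import re
from collections import Counter
from itertools import combinations
from statistics import median


def split_sentences(text):
    """Naive splitter (an assumption): sentences end at '.', '!' or '?'."""
    sentences = []
    for part in re.split(r"[.!?]+", text):
        words = re.findall(r"\w+", part.lower())
        if words:
            sentences.append(words)
    return sentences


def text_graph(text):
    """Return (vertex weights, edge weights, number of sentences).

    Assumed weights: a vertex counts occurrences of a word, an edge counts
    the sentences in which both of its words appear.
    """
    vertex_w, edge_w = Counter(), Counter()
    sentences = split_sentences(text)
    for words in sentences:
        vertex_w.update(words)
        # Sorting gives each unordered pair a single canonical key.
        for u, v in combinations(sorted(set(words)), 2):
            edge_w[(u, v)] += 1
    return vertex_w, edge_w, len(sentences)


def concat_graphs(g1, g2):
    """Graph of a concatenated text: weights add in the semigroup."""
    v1, e1, n1 = g1
    v2, e2, n2 = g2
    return v1 + v2, e1 + e2, n1 + n2


def median_edge_weight_ratio(graph):
    """Median over edges of (edge weight / number of sentences)."""
    _, edge_w, n = graph
    return median(w / n for w in edge_w.values())


a = "The cat sat. The dog sat."
b = "The cat ran."
g_a, g_b = text_graph(a), text_graph(b)
# Building the graph of the joined text agrees with adding the weights.
print(text_graph(a + " " + b) == concat_graphs(g_a, g_b))
print(median_edge_weight_ratio(g_a))
```

Under these assumptions, the graph of two concatenated texts is obtained by adding the weights of the two graphs, provided the first text ends at a sentence boundary so that no sentence spans both texts.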

Key words:

text

graph

linguistic corpus

automatic text processing

Received:

28.02.2019

Accepted:

19.05.2019

Published:

02.03.2020

Journal issue:

Izv. Saratov Univ. (N. S.), Ser. Math. Mech. Inform., 2020, vol. 20, iss. 1
