Linguistic Indicators in the Identification of Fake News

The identification of fake news is approached from the perspective of corpus linguistics and discourse studies. Texts of both real and fake news were analysed in search of dependencies that would make it easier to estimate the probability of a given news item being real or fake, taking into account the discursive characteristics of the particular texts.


Introduction
Spreading false or manipulated information is not a new phenomenon. It has a long history of its own; however, it has become truly common in the world of the half-truth. Internet media, the change in the way the media previously defined as traditional function, and the way in which politics and business are conducted have made fake news part of our everyday lives. It is used both as disinformation and, in political or media conflicts, as a label that renders inconvenient information untrue. Therefore, the following questions are worth posing: is there any way to identify such false news?
If so, what mechanisms should one focus on? Are there any universal mechanisms?
The article focuses mainly on the identification of fake news, drawing attention to the elements associated with the construction of such information, the language used (on the semantic, syntactic, pragmatic and discursive levels), as well as the issues of trust in the information source and the manner in which the information is distributed.
The research was divided into two fundamental parts. In the first, the corpora of true and false news underwent a qualitative-quantitative comparative analysis. This analysis indicated the linguistic dependencies, differences and similarities associated with the sentiment of the text, the sentence construction and the utterance construction, including the text complexity level. In the second part, a model was created that increases the probability of correctly recognizing information as true or false on the basis of linguistic indicators and the credibility of the source.

The state of research in linguistics and IT
The research on false information flooding the contemporary media is conducted on many levels and from the viewpoints of multiple branches of science. In this article, the focus is placed mostly on the linguistic and IT aspects (with elements of media studies).
The amount of direct linguistic research on fake news is low (Newman et al. 2003). However, if we approach the issue in a broader sense, it transpires that false information appearing in the media is connected with the subject of lying in language (Antas 2008). This article utilizes the most important findings from this area. One should also keep in mind that not all of the research results obtained on the basis of English material can be directly translated into results for the Polish language.
The research papers on lying in language focus on the construction of a sentence (a lie is characterized by more complex syntactic structures with a lower amount of conveyed information) (Antas 2008). The studies on deception in language show that false accounts of a given event feature significantly fewer "I" statements and fewer references to the authors' own experiences. However, these studies concerned individuals' statements on their own actions, whereas the gathered corpus consists of coverage of events that the author was not part of, and the news form excludes the use of first person singular statements. Moreover, the research shows that false texts were dominated by expressions of a negative sentiment (Newman et al. 2003; Hancock, Toma, Ellison 2007; Enos et al. 2015). Lying in language can also be approached from the point of view of the theory of speech acts (Austin 1961). John L. Austin distinguishes the truthfulness of a reference from its effectiveness (denotation from connotation) and points to the problem of pretending to be a speech act and its illocution. Media researchers and cognitivists also draw attention to illocution as an important issue. Leonard Shedletsky emphasizes the functioning of disinformation (the "bullshit" phenomenon) by writing that in order to explain this phenomenon it is necessary to focus on the intent or character of the speaker, on the audience (their values and beliefs) and on the text itself (Shedletsky 2018). When studying fake news, computer scientists focus mostly on: the way the false information is distributed, the [...] 2017/2018 academic year in compliance with the rules of creating information published in the Internet.

Research proceedings
The analysis begins with the quantitative research of the real and fake texts. The sentiment of the article is defined (whether there are more positive or negative statements, and which values dominate in the given article), a frequency list of the words in the given articles is created (on the basis of which the fundamental interpretative framework is established), and the number of words in the sentences, the noun-to-verb ratio and the text complexity level (using the FOG index) are determined.
The results of the quantitative analyses become the basis of the qualitative analysis. What will be verified is the coherence of the interpretative framework and the convergence of the text and the title (Chopra, Jain, Sholar 2017). Moreover, the discursive analysis will also be conducted.
Next, the analysis will be complemented with conclusions on the sources of information and the authors of the given texts (Twitter entries), which were taken from the already conducted Twitter fake news research.

Methodology
The entirety of the qualitative analysis will be fitted into the cognitive paradigm. The interpretative framework theories evolving from the semantic framework will be used. The discursive analysis will also refer to the cognitive assumptions. The quantitative research will be conducted with the aid of the software created as part of the Clarin (clarin-pl.eu) project, the author's own software as well as the Jasnopis software.

Keywords
Keywords have been determined for each article; they made it possible to establish the article's subject and will form the basis for re-creating the interpretative framework. The keywords have been divided into two groups: verbs, which form the framework centre, and nouns, adjectives and other parts of speech, which may play the role of arguments in the sentences created around the verbs. The noun-to-verb ratio as well as the mean number of words in a sentence have also been determined for each of the articles. The results of the analysis are available in Tables 1 and 2.
When comparing the analyses of the true and fake news corpora, one may notice a significant difference in the noun-to-verb ratio. In real news the ratio is significantly higher: an average of 4.27 in the true news corpus compared to 2.73 in the fake news corpus. The situation is similar for the number of words in a sentence: the true news corpus features an average of 20.6 words, whereas the fake news corpus features 14.3. The domination of nouns over verbs is visible in both corpora (both ratios are well above 1), and this may be a result of the popularity of the passive voice in the informative style.
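The two measures compared above can be sketched in a few lines of Python. This is only an illustration, not the software used in the study: the sentence splitter is a naive punctuation split, and the part-of-speech tags `NOUN` and `VERB` are hypothetical labels assumed to come from an external tagger.

```python
import re


def mean_sentence_length(text: str) -> float:
    """Average number of words per sentence (naive split on . ! ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    word_counts = [len(re.findall(r"\w+", s)) for s in sentences]
    return sum(word_counts) / len(sentences)


def noun_to_verb_ratio(tagged_tokens: list) -> float:
    """Ratio of nouns to verbs in a list of (word, tag) pairs.

    The tags 'NOUN' and 'VERB' are assumed labels; a real tagger
    may use a different tagset.
    """
    nouns = sum(1 for _, tag in tagged_tokens if tag == "NOUN")
    verbs = sum(1 for _, tag in tagged_tokens if tag == "VERB")
    return nouns / verbs if verbs else float("inf")
```

For instance, a text with three words in one sentence and two in another has a mean sentence length of 2.5, and a token list with two nouns and one verb has a ratio of 2.0.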

The analysis of the sentiment and text complexity level
The corpora of true and fake news were analysed using the sentiment analyser created as part of the Clarin project (https://ws.clarin-pl.eu/sentyment.shtml). The analysis makes it possible to determine how many words in the text have a positive sentiment and how many are described as negative. Moreover, it permits the identification of the emotions dominating in the text (positive and negative), where one word may be a conglomerate of various emotions. Therefore, the number of words with, e.g., a positive sentiment will not be equal to the number of positive emotions in the given text (the number of emotions is usually higher). The number of words in the text was also determined in order to create an index of the saturation of the text with emotions relative to its length.
Downloaded from the journal Mediatizations Studies, http://mediatization.umcs.pl, date: 19/11/2019 03:11:07, UMCS.
The texts were also analysed in terms of their difficulty (complexity) level, referring to, i.a., the Gunning readability index: FOG = 0.4 × [(LW/LZ) + 100 × (LWT/LW)], where LW is the number of words in a text, LWT is the number of words having 4 or more syllables [8] and LZ is the number of sentences. The Jasnopis application (http://jasnopis.pl/aplikacja) was utilized in the analysis. It allows, i.a., the determination of the text difficulty level, scoring it on a scale of 1-7, where 1 is a very simple text (comprehensible for grade 1-3 pupils) and 7 is a very complex text, understandable for specialists in the given area and doctorate holders [9].
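The FOG formula can be illustrated with a short Python sketch. It is not the Jasnopis implementation, and it does not reproduce the 1-7 Jasnopis scale. The syllable counter is an assumed heuristic for Polish: one syllable per vowel letter, skipping an "i" that is directly followed by another vowel, where it only softens the preceding consonant. Loanwords and diphthongs will be miscounted.

```python
import re

POLISH_VOWELS = "aąeęioóuy"


def count_syllables(word: str) -> int:
    """Rough syllable count for a Polish word (heuristic, see lead-in)."""
    w = word.lower()
    count = 0
    for idx, ch in enumerate(w):
        if ch not in POLISH_VOWELS:
            continue
        # An 'i' directly before another vowel marks softening, not a syllable.
        if ch == "i" and idx + 1 < len(w) and w[idx + 1] in POLISH_VOWELS:
            continue
        count += 1
    return count


def fog_index(text: str, hard_word_syllables: int = 4) -> float:
    """FOG = 0.4 * [(LW / LZ) + 100 * (LWT / LW)]."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]  # LZ
    words = re.findall(r"[^\W\d_]+", text)                           # LW
    if not sentences or not words:
        return 0.0
    hard = sum(1 for w in words if count_syllables(w) >= hard_word_syllables)  # LWT
    return 0.4 * (len(words) / len(sentences) + 100 * hard / len(words))
```

The threshold of 4 syllables follows the assumption made for Polish in the text; passing `hard_word_syllables=3` gives the English variant.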
The results of the analysis are available in Tables 3 and 4 (source: author's own study). The analysed text corpora have different (although not diametrically different) difficulty levels: an average of 5.17 for the true and 4 for the fake news corpus. The fake news texts are generally understandable for individuals with secondary education or those having a large amount of life experience, whereas the true news texts are, in general, more difficult and comprehensible for educated individuals. The difference in text difficulty is roughly one point on the seven-point scale.
[8] For the English language it was assumed that difficult words consist of 3 or more syllables; due to the characteristics of the Polish language, and after Bartosz Broda and others, it was assumed that it will be 4 syllables for the Polish language (Broda et al. 2010).
[9] Cf. the scale and the scoring manner: http://jasnopis.pl/aplikacja#.
However, what is worth noting is the large variation in text difficulty depending on the subject and the intended target audience. In the true news corpus, the information on domestic politics was the simplest (usually 3 or 4 points), whereas the most difficult texts were associated with scientific (ecology, medicine) and legal matters. It is not possible to distinguish such dependencies in the fake news corpus. Significantly, the fake news corpus did not feature any texts with the highest difficulty levels, i.e. 6 or 7 points on the difficulty scale.
In the studies on lying in language, negative sentiment notions are considered indicators of falsity. In the true news corpus, the negative notions dominated over the positive ones in two texts (one of the articles focused on the assassination of a former Russian agent and his daughter in Great Britain, the other on the war in Syria). There were also two articles with an identical number of positive and negative sentiment notions (one focusing on the anniversary of the Soviet aggression on Poland and the second one on the effects of turtle extinction). On the other hand, in the fake news corpus there were seven texts where the negative notions dominated and one where the numbers of positive and negative notions were equal.

Qualitative analysis
The conducted discursive analysis shows that true news often features an individual's full name, preceded by the name of their position (sometimes only the name of the institution is available), and one of the reporting verbs such as: said, announced, highlighted, underlined, added, called, stated, wrote, thanked, advised, informed, confirmed, etc. Some of the conclusions from the analysis prove difficult to use in the model directly. For example, sentence length is not a decisive criterion for a single text, even though the corpus analysis indicates different averages for real and fake news. A similar situation occurs with the average text complexity level.
The conclusions on whether the text is true or false should also take the declared genre into account, as news, articles, etc. differ. Therefore, some of the variables should be adjusted to the genre of the utterance. The conclusions drawn in this article apply to news only.
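A rough way to look for such attribution formulae automatically is sketched below in Python. It is an assumed heuristic, not the procedure used in the study: a sentence counts as attributed when it contains one of the reporting verbs listed above and at least two consecutive capitalised words, which are treated as a full name. Position titles and institution names are not checked, and the verb list is the English one from the text, so Polish material would need its own list.

```python
import re

REPORTING_VERBS = {
    "said", "announced", "highlighted", "underlined", "added", "called",
    "stated", "wrote", "thanked", "advised", "informed", "confirmed",
}


def _is_capitalised(word: str) -> bool:
    """True for words such as 'Kowalski' (first letter upper, rest lower)."""
    return len(word) > 1 and word[0].isupper() and word[1:].islower()


def has_attributed_source(text: str) -> bool:
    """Check whether any sentence pairs a name-like phrase with a reporting verb."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[^\W\d_]+", sentence)
        has_verb = any(w.lower() in REPORTING_VERBS for w in words)
        has_name = any(
            _is_capitalised(a) and _is_capitalised(b)
            for a, b in zip(words, words[1:])
        )
        if has_verb and has_name:
            return True
    return False
```

Because sentence-initial words are capitalised too, the check will produce some false positives; it is meant as a first filter before manual inspection.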
When re-creating the interpretative framework, it is worthwhile to commence with a simple comparison of the article title and the keywords. If the determined keywords are not clearly connected with the title or are connected with it in a vague manner, one may suspect that the given information is fake. Should one encounter such an article, it is worth comparing the sentiment of the title with the sentiment of the whole article. If the title is unambiguously positive or negative and the text content presents an opposite sentiment, the probability of the text being false is higher; cf., e.g., the "Painful loss in the life of Sławomir" text, where 2 of the 7 words in the title resonate with a negative and none with a positive sentiment, and where the entire text seems definitely positive (13 positive to 4 negative notions) [10]. The article keywords feature the word "sell", which is not relevant to the framework mentioned in the title, namely the framework of loss, usually of a close person or a valuable item.
[10] It is similar in the case of the following texts: "Sunday riots followed by a crisis" (lack of compliance between the title and the keywords, negative sentiment of the title, a 5:5 sentiment of the article), "Crowd of disappointed fans! Maciej Musiał reveals the shocking truth!" (low compliance of the keywords and the title, dominant presence of negative notions in the title, domination of positive notions in the text 10:4), "The drama of Polish families" (low compliance of the keywords and the title, dominant presence of negative notions in the title, domination of positive notions in the text 3:1).
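The two checks described above, keyword-title compliance and title-text sentiment agreement, can be sketched in Python as follows. The function names and the exact-match comparison are assumptions made for illustration; for Polish, an inflected language, the keywords and title words would first need to be lemmatised. The sentiment counts are assumed to come from an external analyser as (positive, negative) pairs.

```python
import re


def keyword_title_overlap(title: str, keywords: list) -> float:
    """Share of keywords that appear verbatim in the title (0.0 to 1.0)."""
    if not keywords:
        return 0.0
    title_words = set(re.findall(r"[^\W\d_]+", title.lower()))
    hits = sum(1 for k in keywords if k.lower() in title_words)
    return hits / len(keywords)


def polarity(positive: int, negative: int) -> int:
    """1 if positive notions dominate, -1 if negative ones do, 0 if balanced."""
    return (positive > negative) - (positive < negative)


def sentiment_conflict(title_counts: tuple, text_counts: tuple) -> bool:
    """True when title and text have clear but opposite polarities."""
    title_pol = polarity(*title_counts)
    text_pol = polarity(*text_counts)
    return title_pol != 0 and text_pol != 0 and title_pol != text_pol
```

With the counts quoted for "Painful loss in the life of Sławomir" (title 0:2, text 13:4), `sentiment_conflict((0, 2), (13, 4))` flags a conflict. A balanced text such as 5:5 is not flagged by this strict rule, so borderline cases still require the qualitative reading described in the article.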

The probability model of the recognition of the information as truthful
The conducted analyses prove that only the compilation of many factors associated with the text analysis and (as per the literature research) the determination of the source credibility may increase the probability with which one can decide whether the information in focus is true or fake. The model aiding in such distinctions has been presented below.
In order to increase the probability of correctly recognizing information as true or false, one should:
1. Analyse the text complexity level: if the article score is 6 or 7 points, the probability of the article's truthfulness increases.
2. Verify the compliance of the keywords with the title (initially, a simple verification of whether the keywords appear in the title; next, whether the most important verbs permit, as per the Valence dictionary, constructions compliant with the interpretative framework appearing in the title): low or no compliance increases the probability that the given article is false, high compliance increases the probability that it is true.
3. Verify the sentiment of the title and compare it with the sentiment of the whole article: if the sentiments are not in line, the probability that the news is fake increases.
4. Verify whether the text contains a source, usually an individual's full name along with their position and one of the verbs introducing an utterance. The presence of such formulae increases the probability that the news in focus is real.
5. Create a list of credible sources. If it contains Internet websites, newspapers, magazines, TV and radio stations, etc., then the credibility is defined on the basis of the quality of the conveyed information. In social media, it is defined on the basis of what the given user shared previously, whether they quote (forward, re-tweet) credible sources, how long they have been active on the given medium, whether they possess a confirmed account, how many individuals follow them, how many friends they have, and other factors dependent on the specific medium.
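The five steps can be combined into a simple checklist, sketched below in Python. The article does not give weights or a formula, so the sketch deliberately returns two lists of indicators instead of a probability. The class and field names, the 0.5 compliance threshold and the treatment of a non-credible source as an indicator of falsity are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArticleFeatures:
    complexity_level: int                      # score on the 1-7 difficulty scale
    keyword_title_compliance: float            # 0.0-1.0, hypothetical measure
    sentiments_agree: bool                     # title vs. whole-article sentiment
    has_attributed_source: bool                # full name + position + reporting verb
    source_is_credible: Optional[bool] = None  # None when credibility is unknown


def collect_indicators(features: ArticleFeatures, high_compliance: float = 0.5):
    """Return (indicators of truth, indicators of falsity) as two lists."""
    true_signals, fake_signals = [], []
    if features.complexity_level >= 6:                         # step 1
        true_signals.append("high text complexity")
    if features.keyword_title_compliance >= high_compliance:   # step 2
        true_signals.append("keywords comply with title")
    else:
        fake_signals.append("keywords do not comply with title")
    if not features.sentiments_agree:                          # step 3
        fake_signals.append("title and text sentiment differ")
    if features.has_attributed_source:                         # step 4
        true_signals.append("attributed source present")
    if features.source_is_credible is True:                    # step 5
        true_signals.append("credible source")
    elif features.source_is_credible is False:
        fake_signals.append("non-credible source")             # assumed direction
    return true_signals, fake_signals
```

Keeping the indicators as labelled lists leaves the final judgement to the reader and makes it easy to replace the tally with calibrated weights once larger corpora are available.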

Conclusions
The analysis of the gathered news corpora shows that some of their features may prove useful when creating a fake news identification model. The results presented in the article are based on small corpora. This research is exploratory; it will be broadened and the results will be verified on larger corpora that are more diverse in terms of genre and publication location (featuring news and articles as well as posts and entries on social media).
The analysis showed that the creation of a fake news identification model may be possible; however, it will not be a tool that determines beyond all doubt, and in all cases, whether the given information is true or not. This is due to the fact that there are various actions, including the actions of particular countries' institutions, that misinform, distort and create false information that is confirmed by the representatives of these institutions. In such cases, the journalists conveying the given information are almost certain that it is true, as in the case of the staged assassination of Arkady Babchenko. In such cases, the model will not be of much assistance.