Title: An application of machine learning to explore relationships between factors of organisational silence and culture, with specific focus on predicting silence behaviours
Abstract: Research indicates that there are many individual reasons why people do not speak up when confronted with situations that may concern them within their working environment. One of the areas that requires more focused research is the role culture plays in why a person may remain silent when such situations arise. The purpose of this study is to use data science techniques to explore the patterns in a data set that would lead a person to engage in organisational silence. The main research question the thesis asks is: is machine learning a tool that social scientists can use, with respect to organisational silence and culture, to augment the statistical analysis approaches commonly used in this domain?
This study forms part of a larger study being run by the third supervisor of this thesis. A questionnaire was developed by organisational psychologists within this group to collect data covering six traits of silence as well as cultural and individual attributes that could be used to determine if someone would engage in silence or not. This thesis explores three of those cultures to find main effects and interactions between variables that could influence silence behaviours.
Data analysis was carried out on data collected in three European countries, Italy, Germany and Poland (n=774). The data analysis comprised (1) exploring the characteristics of the data and determining the validity and reliability of the questionnaire; (2) identifying a suitable classification algorithm which displayed good predictive accuracy and modelled the data well, based on eight already confirmed hypotheses from the organisational silence literature; and (3) investigating newly discovered patterns and interactions within the data, not previously documented in the silence literature, on how culture plays a role in predicting silence.
It was found that all the silence constructs showed good validity with the exception of Opportunistic Silence and Disengaged Silence. Validation of the cultural dimensions was found to be poor for all constructs when aggregated to individual level with the exception of Humane Orientation Organisational Practices, Power Distance Organisational Practices, Humane Orientation Societal Practices and Power Distance Societal Practices. In addition, not all constructs were invariant across countries. For example, a number of constructs showed invariance across the Poland and Germany samples, but failed for the Italian sample.
Ten models were trained to identify predictors of a binary variable indicating engagement in organisational silence. Two of the most accurate models were chosen for further analysis of the main effects and interactions within the dataset, namely Random Forest (AUC = 0.655) and Conditional Inference Forests (AUC = 0.647). The models confirmed 9 out of 16 of the known relationships, and identified three additional potential interactions within the data that were not previously documented in the silence literature on how culture plays a role in predicting silence. For example, Climate for Authenticity was discovered to moderate the effect of both Power Distance Societal Practices and Diffident Silence in reducing the probability of someone engaging in silence.
This is the first time this instrument was validated via statistical techniques for suitability for use across cultures. Modelling the silence data using classification algorithms with Partial Dependence Plots is a novel and previously unexplored method of exploring organisational silence. In addition, the results identified new information on how culture plays a role in silence behaviours. The results also highlighted that models such as ensembles, which identify non-linear relationships without making assumptions about the data, and visualisations depicting the interactions identified by such models, can offer new insights over and above the current toolbox of analysis techniques prevalent in social science research.
Supervisors: Dr. Geraldine Gray; Dr. Colm McGuinness
Title: A Wikipedia powered state-based approach to automatic search query enhancement
Abstract: The act of searching for documents on the Internet is a personal experience. Attempting to aid a user during their search can be difficult, as the results aimed at one user may
not fit another user’s search intent. During search, a number of factors can influence how the search performs. These factors include the search query entered by the user, which can be too long, forcing the Information Retrieval (IR) engine to guess which elements of the query are central to the user’s intent, or too short, leaving the system too little information from which to infer that intent. In addition, the precision of the terms entered by the user can be questionable, often leading to irrelevant or conceptually distant terms being added to the query. In an effort to aid the user during search, a process titled Automatic Search Query Enhancement (ASQE) is often used. This process is designed to modify the user’s query in an effort to return more relevant documents. The key to this process is the automatic element: unlike other Search Query Enhancement (SQE) techniques, it does not require any user intervention. ASQE can perform a number of different modifications on the user’s query, such as the addition, removal or replacement of search terms. Each alteration is performed on the query before it is submitted to an information retrieval engine, allowing ASQE to be implemented on any IR engine. To achieve this, many ASQE techniques utilise an external data source for a priori knowledge. Recent developments in ASQE have utilised the free encyclopedia Wikipedia as such an external source. Wikipedia offers a wealth of knowledge in the form of highly structured documents, search facilities, page redirection, term disambiguation, link analysis and API functionality, each of which can be harnessed during ASQE. In this research, five existing ASQE algorithms which utilise Wikipedia as the sole data source were tested and analysed. To further improve on these existing algorithms, this research describes the development and testing of a novel ASQE algorithm, the Wikipedia N Sub-state Algorithm (WNSSA), which utilises Wikipedia as the sole data source for a priori knowledge.
This algorithm is built upon the concept of iterative states and sub-states, harnessing the power of Wikipedia’s data set and link information to identify and utilise recurring terms to aid term selection and weighting during ASQE. The algorithm is designed to prevent query drift by making call-backs to the user’s original search intent, persisting the original query between internal states alongside additional selected candidate enhancement terms. The developed algorithm has been shown to improve both short and long queries by providing a better understanding of the query and the available a priori knowledge. The proposed algorithm was compared against the five existing ASQE algorithms that utilise Wikipedia as the sole data source, with 18,000 individual relevance judgements made, showing an average Mean Average Precision (MAP) improvement of 0.263. A comprehensive analysis of the key parameters of the WNSSA was also carried out, outlining the optimal values to achieve higher precision during ASQE and providing an insight into the impact they have on overall performance.
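Mean Average Precision, the metric used above to compare the algorithms, can be computed from ranked relevance judgements as follows. A minimal sketch; the sample judgements are invented for illustration and binary (1 = relevant).

```python
def average_precision(ranked_judgements):
    """AP for one query: precision at each relevant rank, averaged.

    ranked_judgements is a list of 0/1 relevance labels in rank order.
    (This variant averages over relevant documents retrieved; some
    definitions instead divide by the total number of relevant documents.)
    """
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_judgements, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_query_judgements):
    """MAP: the mean of AP over all queries."""
    return sum(map(average_precision, per_query_judgements)) / len(per_query_judgements)


# Two invented queries: AP = 5/6 and 1/2, so MAP = 2/3.
queries = [[1, 0, 1, 0], [0, 1]]
```

An absolute MAP improvement such as the 0.263 reported above is simply the difference between this quantity computed for the proposed and the baseline algorithms over the same set of queries.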
Supervisor: Dr. Markus Hofmann
Title: A Linguistic Approach to Detecting Public Textual Cyberbullying
Abstract: Cyberbullying is one of the most prevalent risks encountered by young people on the internet (O’Neill and Dinh, 2015), and our research focuses on developing rules for detecting public textual cyberbullying – cyberbullying that occurs as a result of text posted in the public domain – and on advancing current research in the field by addressing not only explicit, but also implicit forms. We first employ a qualitative linguistic analysis of public textual cyberbullying to identify the linguistic parameters associated with public textual cyberbullying and its specific forms. We then propose a linguistically-motivated definition of public textual cyberbullying that provides three necessary and sufficient elements for qualifying public textual cyberbullying: the personal marker/pointer, the dysphemistic element and the link between them. We also propose a decomposition system based on the inherent characteristics of explicitness and implicitness, in order to break down the problem of textual cyberbullying and account for the various levels of complexity associated with specific detection rules. Based on our data, explicit forms are mostly realised by means of profane/obscene, insulting/offensive and violent language, while implicit forms are mostly realised by means of negation, animal metaphors and similes. The resulting detection system comprises the following distinct components: the lexical resources module, the pre-processing module, the discourse-dependent module, the explicit textual cyberbullying detection module and the implicit textual cyberbullying detection module. We then formulate the overall detection mechanism, which takes advantage of pre-existing natural language processing techniques but is also informed by our definition of public textual cyberbullying. Subsequently, we describe each module and the role it plays in the detection process.
We also characterise the most common forms of public textual cyberbullying that we identified in our dataset and, for each of these forms, we describe a corresponding set of detection rules. We test the effectiveness of each set of rules against human performance in terms of several metrics, while the overall system is also tested against a baseline approach that employs a Naïve Bayes classification algorithm using local and sentiment features. The results of our experiments show that the specific sets of rules that we developed for each public textual cyberbullying form approximate human performance on both the development and test sets. Additionally, the results of the final experiment indicate that our detection system greatly outperforms the baseline across all measures. We also analyse discourse-dependent forms of public textual cyberbullying and how previous discourse can contribute to inferring the three fundamental cyberbullying elements. Based on this analysis, we then put forward the cyberbullying constructions associated with discourse-dependent forms, as well as resolution rules that allow us to infer the cyberbullying elements. Finally, we discuss the contributions, merits and limitations of the present research, and provide suggestions for future work.
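A Naïve Bayes baseline of the kind referred to above can be sketched with a standard bag-of-words pipeline. This is a generic illustration with invented toy messages, not the thesis dataset or its local/sentiment features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy training messages; real work would use an annotated corpus.
texts = [
    "you are so stupid",
    "shut up you idiot",
    "nobody likes you",
    "great game today",
    "see you at practice",
    "thanks for the help",
]
labels = ["bully", "bully", "bully", "neutral", "neutral", "neutral"]

# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(texts, labels)

prediction = baseline.predict(["you are an idiot"])[0]
```

Such a classifier keys off surface word frequencies, which is precisely why rule-based approaches that model the personal marker, the dysphemistic element and the link between them can outperform it on implicit forms.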
Supervisors: Dr Anthony Keane, Dr Brian Nolan, and Prof Brian O’Neill
Title: Investigating the Efficacy of Algorithmic Student Modelling in Predicting Students at Risk of Failing in the Early Stages of Tertiary Education: Case study of experience based on first year students at an Institute of Technology in Ireland.
Abstract: The application of data analytics to educational settings is an emerging and growing research area. Much of the published work to date is based on ever-increasing volumes of log data that are systematically gathered in virtual learning environments as part of module delivery. This thesis took a unique approach to modelling academic performance: it is the first study to model indicators of students at risk of failing in the first year of tertiary education based on data gathered prior to the commencement of first year, facilitating early engagement with at-risk students.
The study was conducted over three years, from 2010 through 2012, and was based on a sample student population (n=1,207) aged between 18 and 60 from a range of academic disciplines. Data were extracted from both student enrolment data maintained by college administration and an online, self-reporting, learner-profiling tool developed specifically for this study. The profiling tool was administered during induction sessions for students enrolling into the first year of study. Twenty-four factors relating to prior academic performance, personality, motivation, self-regulation, learning approaches, learner modality, age and gender were considered.
Eight classification algorithms were evaluated. Cross-validation model accuracies based on all participants were compared with models trained on the 2010 and 2011 student cohorts and tested on the 2012 student cohort. The best cross-validation accuracies were achieved by a Support Vector Machine (82%) and a Neural Network (75%). The k-Nearest Neighbour model, which has received little attention in educational data mining studies, achieved the highest accuracy when applied to the 2012 student cohort (72%), performance similar to its cross-validation accuracy (72%). Model accuracies for other algorithms applied to the 2012 student cohort also compared favourably; for example, Ensembles (71%), Support Vector Machine (70%) and a Decision Tree (70%).
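The evaluation design described above, comparing cross-validated accuracy against accuracy on a held-out later cohort, can be sketched as follows. Synthetic data only; the five features and cohort years are hypothetical stand-ins for the study's twenty-four profiling factors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 600
# Hypothetical stand-ins for profiling factors (prior grades, motivation, ...).
X = rng.normal(size=(n, 5))
# Synthetic pass/fail outcome driven mainly by the first two factors.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=n) > 0).astype(int)
cohort = rng.choice([2010, 2011, 2012], size=n)  # hypothetical enrolment year

knn = KNeighborsClassifier(n_neighbors=15)

# Estimate 1: 10-fold cross-validation over all participants.
cv_accuracy = cross_val_score(knn, X, y, cv=10).mean()

# Estimate 2: train on the 2010/2011 cohorts, test on the 2012 cohort.
train, test = cohort < 2012, cohort == 2012
holdout_accuracy = knn.fit(X[train], y[train]).score(X[test], y[test])
```

Agreement between the two estimates, as reported for the k-Nearest Neighbour model above, suggests the model generalises to a new intake rather than merely fitting the cohorts it was trained on.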
Supervisors: Dr. Colm McGuinness; Dr. Philip Owende.