Current PhD students:

Document Length-Normalization and Re-Weighting in Representation-Based Methods for Neural Information Retrieval

The main purpose of ad-hoc retrieval in Information Retrieval (IR) is to fulfil users’ information needs. In traditional retrieval models, this is usually achieved using ranking methods such as exact keyword matching. However, this frequently does not fit the users’ requirements, as several factors can influence the search. One such factor is the precision of the query terms describing the information need, or the length of the query. Basic problems, such as misspelled or short queries, can be solved relatively easily; however, retrieval becomes challenging when queries that differ in wording but not in meaning are used. This often results in relevant documents being ranked lower due to keyword mismatch and limited semantic knowledge. Matching relevance and matching semantic properties are two sides of the same coin. Relevance is about finding exact matches to the query, whereas semantics is about the similarity between words or sentences. Advanced language models are considered the best choice for such constraints, which involve matching both semantic similarity and relevance. The current generation of language models excels at learning these semantic relationships, from words to sentences, from large amounts of unlabelled text data. Given these capabilities, neural network-based approaches such as representation-based ranking are increasingly applied to various tasks in IR. This method largely addresses the semantic mismatch problem during retrieval. However, it is sensitive to basic IR characteristics such as the length, word occurrence, and noise in candidate documents. Together, these issues challenge the language modelling-based paradigm in IR.

To address these challenges, this research makes several contributions to representation-based ad-hoc retrieval, significantly improving existing methods while proposing new approaches. Firstly, in an embedding setting, it proposes to normalize document length using traditional length-normalization methods to improve retrieval effectiveness, because long documents are likely to skew the notion of similarity when word vectors are averaged, resulting in lower rankings. Length normalization prevents the unfair ranking of relevant documents of different lengths and shows significant improvement over the defined baselines. Secondly, it proposes a novel approach to re-weighting word vectors based on contextual information extracted from language models. This outperforms existing re-weighting methods such as Inverse Document Frequency and Smoothed Inverse Frequency, with an average Mean Average Precision (MAP) increase of 6.67%. In other scenarios, combined with traditional rankers, it also outperforms learning-to-rank baselines with an average increase of 2.93% in Normalized Discounted Cumulative Gain (NDCG). This research further shows that the proposed re-weighting method can also benefit other language modelling tasks, such as Semantic Textual Similarity (STS).
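A minimal sketch of the general idea of weighted vector averaging with length normalization follows. It is not the thesis implementation: the IDF-style weights, the pivoted normalization factor, and the parameter values are illustrative assumptions.

```python
# Sketch only: document embedding as a weighted average of word vectors,
# with a pivoted length-normalization factor to damp very long documents.
# The weighting scheme and pivot/slope values are illustrative assumptions.
import numpy as np

def doc_embedding(tokens, vectors, idf, pivot=100.0, slope=0.5):
    """Weighted average of word vectors, discounted for long documents."""
    vecs, weights = [], []
    for t in tokens:
        if t in vectors:
            vecs.append(vectors[t])
            weights.append(idf.get(t, 1.0))  # weight rarer terms more heavily
    if not vecs:
        return None
    emb = np.average(np.vstack(vecs), axis=0, weights=weights)
    # pivoted length normalization: documents longer than the pivot are discounted
    norm = (1.0 - slope) + slope * (len(tokens) / pivot)
    return emb / norm

def cosine(a, b):
    """Rank candidate documents by cosine similarity to the query embedding."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```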

Data science methods for drug repurposing and hypothesis expansion using literature-based discovery

In 2011, Emily Whitehead was at the Children’s Hospital of Philadelphia suffering from acute lymphoblastic leukaemia when, by chance, a member of the medical team recognized that an elevated protein blocking defensive cells is also involved in rheumatoid arthritis (RA), and that there is an RA drug that stops production of that protein. Emily went on to fully recover, and her case became a prominent example of a chance finding creating a positive outcome through serendipity. Ultimately, matching a stratified disease profile with a stratified patient profile, and then aligning a targeted treatment strategy, is the goal of precision medicine. For the pharmaceutical industry, the opportunity lies in finding ways to manage costs and increase productivity in an environment where drug development is lengthy (15-20 years) and requires significant investment ($500m-$2+bn). Repurposing existing drugs for new diseases can reduce costs by a factor of 7.5, and data is a key enabler.

It is these types of situations that lead to the hypothesis that computational drug repositioning can have an influential impact. The hypothesis is that the types of searches previously described can be accelerated and improved by using a computational approach. For example, text mining could be used to mine the world’s research and clinical literature for relevant connections between drugs and diseases, and thus empower doctors and scientists to make faster, more informed decisions. The most important problem to solve is one of scale; it is not possible for a single person, or even a group of people, to conduct a thorough review of thousands of relevant documents. The challenge is to process the different document types and the distributed nature of the literature in order to combine all the available evidence for biomedical associations.

Computational approaches to drug repurposing have developed as we progress through the era of Big Data. As scientific publication rates increase, knowledge acquisition and the research development process have become more complex and time-consuming. Literature-Based Discovery (LBD), which supports automated knowledge discovery, helps facilitate this process by eliciting novel knowledge from analysis of the existing scientific literature. The traditional hypothesis generation models of Swanson and Vos involved connecting disjoint concepts from the literature to create new hypotheses, though these were produced by manual means. Hypothesis generation involves processing large datasets, in this case from the literature, in order to find novel connections between biomedical entities. Where a novel linkage is found, it forms the basis of a new hypothesis that can be tested in a research or clinical setting, or reviewed for relevance by subject matter experts. These novel hypotheses form the basis of a data-driven approach to drug repurposing and thus to precision medicine.

This LBD PhD research aims to use a hybrid of data science and natural language processing methods to implement the ANC discovery model. The objective of the ANC model is to unify the classic hypothesis generation models of Swanson and Vos, with a modern application of machine learning methods. The research aims to develop an approach for combining natural language processing-based machine learning, multiple biological entity types, and custom evaluation metrics into a methodology to predict and rank biomedical discoveries.

The dataset is a large literature corpus from the biomedical literature database PubMed. The data are pre-processed sentences containing various biomedical entity co-occurrences. The project has defined stages relating to data input, processing, output, and evaluation. The input stage applies classification models to biomedical entity pairs to express the strength of the relation. Once sentences are scored, the processing stage aggregates sentence-level data to the relation level. To produce the output, that is, a predicted set of A-C relations, weighting schemes that express pathways through the model will be developed and tested against a set of evaluation metrics, which make up the evaluation stage.
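The sketch below illustrates the aggregation and output stages in the simplest possible form: sentence-level relation scores are averaged to relation level, and candidate A-C links are ranked through shared intermediate (B) terms. This is not the ANC model itself; the function names and the multiplicative path weighting are illustrative assumptions.

```python
# Sketch of relation-level aggregation and A-C candidate ranking (assumed names).
from collections import defaultdict

def aggregate_relations(sentence_scores):
    """sentence_scores: iterable of ((entity_a, entity_b), score), one per sentence."""
    totals, counts = defaultdict(float), defaultdict(int)
    for pair, score in sentence_scores:
        totals[pair] += score
        counts[pair] += 1
    # mean sentence score per entity pair, i.e. relation-level strength
    return {pair: totals[pair] / counts[pair] for pair in totals}

def rank_ac_candidates(relations, a_term):
    """Score candidate A-C relations through shared intermediate B terms."""
    b_terms = {b: s for (a, b), s in relations.items() if a == a_term}
    candidates = defaultdict(float)
    for (b, c), s in relations.items():
        if b in b_terms and c != a_term:
            candidates[c] += b_terms[b] * s  # simple multiplicative path weighting
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
```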

A Deeper Analysis of Text Misclassifications

This study presents a Python library developed to give insight into the occurrence of misclassified instances in text classification tasks. The principle is that this insight can, in turn, be used to reduce the number of misclassifications. The library, called py text misclass, produces a meaningful, comprehensive analysis of binary text misclassifications. It is structured as one principal module that generates the comprehensive analysis and a supplementary, optional module that pre-processes and classifies the raw text, transforming it into the formats required for the analysis phase. In this study, we use a sample binary text data set to demonstrate the library in use and to illustrate the analysis produced with the py text misclass library.
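For readers unfamiliar with the task, the sketch below shows the kind of misclassification analysis the library targets, written against scikit-learn rather than the py text misclass API (which is not reproduced here). The toy data, pipeline and variable names are assumptions.

```python
# Sketch: surface misclassified instances from a binary text classifier for inspection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["great product", "terrible service", "loved it", "awful experience"]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels
)
vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X_train), y_train)
preds = clf.predict(vec.transform(X_test))

# collect misclassified instances for closer, qualitative analysis
misclassified = [(t, y, p) for t, y, p in zip(X_test, y_test, preds) if y != p]
for text, actual, predicted in misclassified:
    print(f"actual={actual} predicted={predicted}: {text}")
```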

Title: An Analysis of Grammatical Classes from the Signs of Ireland Corpus Using Association Rules Learning

Completed PhDs:

Title: An application of machine learning to explore relationships between factors of organisational silence and culture, with specific focus on predicting silence behaviours

Abstract: Research indicates that there are many individual reasons why people do not speak up when confronted with situations that may concern them within their working environment. One of the areas that requires more focused research is the role culture plays in why a person may remain silent when such situations arise. The purpose of this study is to use data science techniques to explore the patterns in a data set that would lead a person to engage in organisational silence. The main research question the thesis asks is: Is Machine Learning a tool that Social Scientists can use, with respect to Organisational Silence and Culture, to augment commonly used statistical analysis approaches in this domain?

This study forms part of a larger study being run by the third supervisor of this thesis. A questionnaire was developed by organisational psychologists within this group to collect data covering six traits of silence as well as cultural and individual attributes that could be used to determine if someone would engage in silence or not. This thesis explores three of those cultures to find main effects and interactions between variables that could influence silence behaviours.

Data analysis was carried out on data collected in three European countries: Italy, Germany and Poland (n=774). The analysis comprised (1) exploring the characteristics of the data and determining the validity and reliability of the questionnaire; (2) identifying a suitable classification algorithm which displayed good predictive accuracy and modelled the data well, based on eight already confirmed hypotheses from the organisational silence literature; and (3) investigating newly discovered patterns and interactions within the data, not previously documented in the silence literature, on how culture plays a role in predicting silence.

It was found that all the silence constructs showed good validity, with the exception of Opportunistic Silence and Disengaged Silence. Validation of the cultural dimensions was found to be poor for all constructs when aggregated to the individual level, with the exception of Humane Orientation Organisational Practices, Power Distance Organisational Practices, Humane Orientation Societal Practices and Power Distance Societal Practices. In addition, not all constructs were invariant across countries. For example, a number of constructs showed invariance across the Poland and Germany samples, but failed for the Italian sample.

Ten models were trained to identify predictors of a binary variable, engagement in organisational silence. Two of the most accurate models were chosen for further analysis of the main effects and interactions within the dataset, namely Random Forest (AUC = 0.655) and Conditional Inference Forests (AUC = 0.647). The models confirmed 9 out of 16 of the known relationships, and identified three additional potential interactions within the data that were not previously documented in the silence literature on how culture plays a role in predicting silence. For example, Climate for Authenticity was found to moderate the effect of both Power Distance Societal Practices and Diffident Silence in reducing the probability of someone engaging in silence.

This is the first time this instrument has been validated via statistical techniques for suitability for use across cultures. Modelling the silence data using classification algorithms with Partial Dependence Plots is a novel and previously unexplored method of exploring organisational silence. In addition, the results identified new information on how culture plays a role in silence behaviours. The results also highlighted that models such as ensembles, which identify non-linear relationships without making assumptions about the data, together with visualisations depicting the interactions identified by such models, can offer new insights over and above the current toolbox of analysis techniques prevalent in social science research.
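A hedged sketch of this modelling approach is shown below: an ensemble classifier combined with partial dependence plots to surface non-linear effects and interactions. The synthetic data and feature names are assumptions, not the study's questionnaire items or results.

```python
# Sketch: random forest + partial dependence plots, including a two-way plot
# to visualise an interaction. Data and feature names are synthetic placeholders.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                                  # survey-scale style predictors
y = (X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# one-way partial dependence for two features, plus their two-way interaction surface
PartialDependenceDisplay.from_estimator(
    forest, X, features=[0, 1, (0, 1)],
    feature_names=["predictor_a", "predictor_b", "predictor_c"],
)
plt.show()
```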

Supervisors: Dr. Geraldine Gray; Dr. Colm McGuinness

Title: A Wikipedia powered state-based approach to automatic search query enhancement

Abstract: The act of searching for documents on the Internet is a personal experience. Attempting to aid a user during their search can be difficult, as the results aimed at one user may not fit another user’s search intent. During search, a number of factors can influence how the search performs. These factors include the search query entered by the user, which can be too long, leading the Information Retrieval (IR) engine to guess which elements of the query are central to the user’s intent, or too short, leading the system to attempt to guess the user’s intent. In addition, the precision of the terms entered by the user can be questionable, often leading to irrelevant or conceptually distant terms being added to the query. In an effort to aid the user during search, a process titled Automatic Search Query Enhancement (ASQE) is often used. This process is designed to modify the user’s query in an effort to return more relevant documents. The key to this process is the automatic element: unlike other Search Query Enhancement (SQE) techniques, it does not require any user intervention. ASQE can perform a number of different modifications on the user’s query, such as the addition, removal or replacement of search terms. Each alteration is performed on the query before it is submitted to an information retrieval engine, allowing ASQE to be implemented on any IR engine. To achieve this, many ASQE techniques utilise an external data source for a priori knowledge. Recent developments in ASQE have utilised the free encyclopedia Wikipedia as an external source. Wikipedia offers a wealth of knowledge in the form of highly structured documents, search facilities, page redirection, term disambiguation, link analysis and API functionality, each of which can be harnessed during ASQE.

In this research, five existing ASQE algorithms which utilise Wikipedia as the sole data source were tested and analysed. To improve on these existing algorithms, this research describes the development and testing of a novel ASQE algorithm, the Wikipedia N Sub-state Algorithm (WNSSA), which utilises Wikipedia as the sole data source for a priori knowledge. This algorithm is built upon the concept of iterative states and sub-states, harnessing the power of Wikipedia’s data set and link information to identify and utilise recurring terms to aid term selection and weighting during ASQE. The algorithm is designed to prevent query drift by making call-backs to the user’s original search intent, persisting the original query between internal states alongside the selected candidate enhancement terms. The developed algorithm has been shown to improve both short and long queries by providing a better understanding of the query and the available a priori knowledge. The proposed algorithm was compared against the five existing ASQE algorithms that utilise Wikipedia as the sole data source, with 18,000 individual relevance judgements made, showing an average Mean Average Precision (MAP) improvement of 0.263. A comprehensive analysis of the key parameters of the WNSSA was also conducted, outlining the optimal values needed to achieve higher precision during ASQE and providing an insight into the impact they have on overall performance.
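The sketch below illustrates Wikipedia-backed query expansion in the spirit of ASQE, not the WNSSA itself: it fetches article titles related to the original query via the public MediaWiki search API and appends the most frequent title terms as candidate enhancement terms while persisting the original query. The term-selection heuristic and thresholds are assumptions.

```python
# Sketch: naive Wikipedia-based query expansion (not the WNSSA).
from collections import Counter
import requests

API = "https://en.wikipedia.org/w/api.php"

def expand_query(query, n_results=10, n_terms=3):
    params = {"action": "query", "list": "search", "srsearch": query,
              "srlimit": n_results, "format": "json"}
    hits = requests.get(API, params=params, timeout=10).json()["query"]["search"]
    seen = set(query.lower().split())
    counts = Counter(
        w for hit in hits for w in hit["title"].lower().split() if w not in seen
    )
    # persist the original query and append the top candidate enhancement terms
    return query + " " + " ".join(t for t, _ in counts.most_common(n_terms))

print(expand_query("jaguar speed"))
```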

Supervisor: Dr. Markus Hafmann

Title: A Linguistic Approach to Detecting Public Textual Cyberbullying

Abstract: Cyberbullying is one of the most prevalent risks encountered by young people on the internet (O’Neill and Dinh, 2015). Our research focuses on developing rules for detecting public textual cyberbullying – cyberbullying that occurs as a result of text posted in the public domain – and advances current research in the field by addressing not only explicit, but also implicit, forms. We first employ a qualitative linguistic analysis of public textual cyberbullying to identify the linguistic parameters associated with it and its specific forms. We then propose a linguistically-motivated definition of public textual cyberbullying that provides three necessary and sufficient elements for qualifying public textual cyberbullying: the personal marker/pointer, the dysphemistic element, and the link between them. We also propose a decomposition system based on the inherent characteristics of explicitness and implicitness, in order to break down the problem of textual cyberbullying and account for the various levels of complexity associated with specific detection rules. Based on our data, explicit forms are mostly realised by means of profane/obscene, insulting/offensive and violent language, while implicit forms are mostly realised by means of negation, animal metaphors and similes.

The resulting detection system comprises the following distinct components: the lexical resources module, the pre-processing module, the discourse-dependent module, the explicit textual cyberbullying detection module and the implicit textual cyberbullying detection module. We then formulate the overall detection mechanism, which takes advantage of pre-existing natural language processing techniques but is also informed by our definition of public textual cyberbullying. Subsequently, we describe each module and the role it plays in the detection process. We also characterise the most common forms of public textual cyberbullying identified in our dataset and, for each of these forms, we describe a corresponding set of detection rules. We test the effectiveness of each set of rules against human performance in terms of several metrics, while the overall system is also tested against a baseline approach that employs a Naïve Bayes classification algorithm using local and sentiment features. The results of our experiments show that the specific sets of rules developed for each public textual cyberbullying form approximate human performance on both the development and test sets. Additionally, the results of the final experiment indicate that our detection system greatly outperforms the baseline across all measures. We also analyse discourse-dependent forms of public textual cyberbullying and how previous discourse can contribute to inferring the three fundamental cyberbullying elements. Based on this analysis, we then put forward the cyberbullying constructions associated with discourse-dependent forms, as well as resolution rules that allow us to infer the cyberbullying elements. Finally, we discuss the contributions, merits and limitations of the present research, and provide suggestions for future work.
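The sketch below gives a much-simplified flavour of the rule-based idea: flag a message as explicit public textual cyberbullying when a personal marker is linked to a dysphemistic element. The lexicons and the window-based "link" heuristic are illustrative assumptions, not the thesis rule set.

```python
# Sketch: toy rule for explicit public textual cyberbullying detection.
import re

PERSONAL_MARKERS = {"you", "your", "you're", "u", "ur"}
DYSPHEMISTIC = {"idiot", "loser", "stupid", "ugly", "pathetic"}  # placeholder lexicon

def is_explicit_cyberbullying(message, window=4):
    tokens = re.findall(r"[a-z']+", message.lower())
    for i, tok in enumerate(tokens):
        if tok in PERSONAL_MARKERS:
            # the "link": a dysphemistic term within a small window of the personal marker
            nearby = tokens[max(0, i - window): i + window + 1]
            if any(t in DYSPHEMISTIC for t in nearby):
                return True
    return False

print(is_explicit_cyberbullying("you are such a loser"))  # True
print(is_explicit_cyberbullying("that exam was stupid"))  # False: no personal marker
```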

Supervisors: Dr Anthony Keane, Dr Brian Nolan, and Prof Brian O’Neill

Title: Investigating the Efficacy of Algorithmic Student Modelling in Predicting Students at Risk of Failing in the Early Stages of Tertiary Education: Case study of experience based on first year students at an Institute of Technology in Ireland.

Abstract: The application of data analytics to educational settings is an emerging and growing research area. Much of the published work to date is based on the ever-increasing volumes of log data systematically gathered in virtual learning environments as part of module delivery. This thesis took a unique approach to modelling academic performance; it is a first study to model indicators of students at risk of failing the first year of tertiary education based on data gathered prior to commencement of first year, facilitating early engagement with at-risk students.

The study was conducted over three years, from 2010 through 2012, and was based on a sample student population (n=1,207) aged between 18 and 60 from a range of academic disciplines. Data were extracted both from student enrolment records maintained by college administration and from an online, self-reporting learner profiling tool developed specifically for this study. The profiling tool was administered during induction sessions for students enrolling in the first year of study. Twenty-four factors relating to prior academic performance, personality, motivation, self-regulation, learning approaches, learner modality, age and gender were considered.

Eight classification algorithms were evaluated. Cross-validation model accuracies based on all participants were compared with models trained on the 2010 and 2011 student cohorts and tested on the 2012 student cohort. The best cross-validation accuracies were achieved by a Support Vector Machine (82%) and a Neural Network (75%). The k-Nearest Neighbour model, which has received little attention in educational data mining studies, achieved the highest accuracy when applied to the 2012 student cohort (72%), similar to its cross-validation accuracy (72%). Model accuracies for other algorithms applied to the 2012 student cohort also compared favourably; for example, Ensembles (71%), the Support Vector Machine (70%) and a Decision Tree (70%).
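A hedged sketch of this kind of model comparison follows, using cross-validation over several of the algorithm families named above. The synthetic data stand in for the 24 profiling factors; the features, labels and resulting scores are illustrative only.

```python
# Sketch: comparing classifiers by cross-validated accuracy on synthetic stand-in data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1207, 24))                                  # 24 synthetic profiling factors
y = (X[:, 0] + X[:, 1] + rng.normal(size=1207) > 0).astype(int)  # synthetic pass/fail label

models = {
    "SVM": SVC(),
    "Neural Network": MLPClassifier(max_iter=1000),
    "k-Nearest Neighbour": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Ensemble (Random Forest)": RandomForestClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.2f}")
```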

Supervisors: Dr. Colm McGuinness; Dr. Philip Owende.