The main purpose of BioTAGME is to provide a tool for the algorithmic analysis of texts and the extraction of latent associations that enrich their comprehension. In particular, BioTAGME focuses on biology, using the PubMed database both as a source of knowledge and as a basis for discovering new biological knowledge. Given an input set of texts annotated with terms characterizing each document, the methodology computes a new set of annotation terms that are as closely related as possible to the input set but are not synonyms of any existing annotation. To this end, the approach defines a correlation measure whose score ensures two things simultaneously: high correlation with the source, and that a random set of terms of the same size is unlikely to reach a correlation greater than or similar to the computed one. BioTAGME consists of four main steps:
- Apply the TagMe algorithm (Ferragina and Scaiella, 2010) to each input text to build the first set of related terms;
- Execute the DT-Hybrid recommendation algorithm (Alaimo et al., 2013) to extend these annotations;
- Compute a correlation score for each annotation through a similarity function;
- Use this score to compute a set of highly correlated terms, together with a probability expressing the quality of that set.
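
The last two steps can be illustrated with a minimal sketch. It is not the BioTAGME implementation: the similarity function (Jaccard overlap of the documents in which two terms occur), the set-level score (mean best similarity to the source terms), the Monte Carlo estimate of the probability, and all names and toy data below are assumptions chosen only to make the idea concrete. The sketch scores a candidate set of terms against the source annotations and then estimates how often a random set of the same size scores at least as high.

```python
import random
from typing import Callable, Dict, List, Sequence, Set, Tuple

# Hypothetical toy index: term -> ids of the documents annotated with it.
TERM_DOCS: Dict[str, Set[int]] = {
    "TP53": {1, 2, 3, 4},
    "apoptosis": {1, 2, 3},
    "MDM2": {2, 3, 4},
    "insulin": {5, 6},
    "glucose": {5, 6, 7},
    "keratin": {8},
    "myosin": {9},
    "actin": {9, 10},
}


def jaccard(a: str, b: str) -> float:
    """Assumed similarity function: overlap of the two terms' document sets."""
    docs_a, docs_b = TERM_DOCS[a], TERM_DOCS[b]
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def set_score(terms: Sequence[str], source: Sequence[str],
              sim: Callable[[str, str], float]) -> float:
    """Mean, over the terms, of the best similarity to any source term."""
    return sum(max(sim(t, s) for s in source) for t in terms) / len(terms)


def empirical_p_value(candidate: Sequence[str], source: Sequence[str],
                      vocabulary: Sequence[str],
                      sim: Callable[[str, str], float],
                      n_samples: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Return the candidate's score and the estimated probability that a
    random term set of the same size scores greater than or equal to it."""
    rng = random.Random(seed)
    observed = set_score(candidate, source, sim)
    # Random sets are drawn from every term that is not a source annotation.
    pool: List[str] = [t for t in vocabulary if t not in source]
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(pool, len(candidate))
        if set_score(sample, source, sim) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_samples + 1)


source_terms = ["TP53"]
vocabulary = list(TERM_DOCS)

good_score, good_p = empirical_p_value(["apoptosis", "MDM2"], source_terms,
                                       vocabulary, jaccard)
bad_score, bad_p = empirical_p_value(["keratin", "myosin"], source_terms,
                                     vocabulary, jaccard)
print(f"related set:   score={good_score:.2f}, p={good_p:.3f}")
print(f"unrelated set: score={bad_score:.2f}, p={bad_p:.3f}")
```

In this toy setting, a set of terms that co-occur with the source annotation receives a high score and a small estimated probability, whereas an unrelated set receives a score of zero and a probability of one, since every random set does at least as well. A lower probability therefore indicates a higher-quality set.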