Pierpaolo Basile
Role
Fixed-term Researcher - Type B
Organization
Università degli Studi di Bari Aldo Moro
Department
Department of Computer Science
Scientific Area
AREA 09 - Industrial and Information Engineering
Academic Discipline
ING-INF/05 - Information Processing Systems
ERC Sector, 1st level
Not available
ERC Sector, 2nd level
Not available
ERC Sector, 3rd level
Not available
This paper proposes two approaches to compositional semantics in distributional semantic spaces. Both approaches conceive the semantics of complex structures, such as phrases or sentences, as being more than the sum of their terms; syntax acts as the glue used to compose words. The former approach encodes information about syntactic dependencies directly into distributional spaces, while the latter exploits compositional operators that reflect the syntactic role of words. We present a preliminary evaluation performed on the GEMS 2011 “Compositional Semantics” dataset, with the aim of understanding the effects of these approaches when applied to simple word pairs of the kind Noun-Noun, Adjective-Noun and Verb-Noun. Experimental results corroborate our conjecture that exploiting syntax can lead to improved distributional models and compositional operators, and suggest new openings for future use in real-world application scenarios.
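As an illustration of the kind of compositional operators discussed here, the following minimal sketch contrasts a purely additive composition with a syntax-weighted one in which the syntactic head contributes more than its dependent. The vectors, weights and function names are illustrative and not taken from the paper.

```python
import numpy as np

def compose_additive(u, v):
    """Baseline composition: the phrase vector is the sum of its parts."""
    return u + v

def compose_syntax_weighted(head, dependent, alpha=0.6, beta=0.4):
    """Weighted sum in which the syntactic head (e.g. the noun in an
    Adjective-Noun pair) contributes more than the dependent.
    The weights are illustrative, not those used in the paper."""
    return alpha * head + beta * dependent

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Toy 5-dimensional vectors standing in for real distributional vectors.
rng = np.random.default_rng(0)
vec = {w: rng.normal(size=5) for w in ["red", "car", "fast"]}

red_car = compose_syntax_weighted(vec["car"], vec["red"])
fast_car = compose_syntax_weighted(vec["car"], vec["fast"])
print("similarity(red car, fast car) =", cosine(red_car, fast_car))
```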
This work presents a virtual player for the quiz game “Who Wants to Be a Millionaire?”. The virtual player requires linguistic and common-sense knowledge and adopts state-of-the-art Natural Language Processing and Question Answering technologies to answer the questions. Wikipedia articles and DBpedia triples are used as knowledge sources, and the answers are ranked according to several lexical, syntactic and semantic criteria. Preliminary experiments carried out on the Italian version of the board game prove that the virtual player is able to challenge human players.
This paper provides an overview of the work done in the Linked Open Data-enabled Recommender Systems challenge, in which we proposed an ensemble of algorithms based on popularity, Vector Space Model, Random Forests, Logistic Regression, and PageRank, running on a diverse set of semantic features. We ranked 1st in the top-N recommendation task, and 3rd in the tasks of rating prediction and diversity.
This paper describes a new Word Sense Disambiguation (WSD) algorithm which extends two well-known variants of the Lesk WSD method. Given a word and its context, the Lesk algorithm exploits the idea of the maximum number of shared words (maximum overlap) between the context of a word and each definition of its senses (gloss) in order to select the proper meaning. The main contribution of our approach lies in the use of a word similarity function defined on a distributional semantic space to compute the gloss-context overlap. As sense inventory we adopt BabelNet, a large multilingual semantic network built by exploiting both WordNet and Wikipedia. Besides linguistic knowledge, BabelNet also represents encyclopedic concepts coming from Wikipedia. The evaluation performed on the SemEval-2013 Multilingual Word Sense Disambiguation task shows that our algorithm outperforms the most frequent sense baseline and the simplified version of the Lesk algorithm. Moreover, when compared with the other participants in the SemEval-2013 task, our approach outperforms the best system for English.
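The core idea of computing the gloss-context overlap through distributional word similarity can be pictured with the following sketch: glosses and contexts are represented as sums of word vectors and compared by cosine. This is a simplification of the approach, not the exact formulation in the paper, and the function names are ours.

```python
import numpy as np

def text_vector(tokens, word_vectors, dim):
    """Represent a bag of words as the sum of the available word vectors."""
    v = np.zeros(dim)
    for t in tokens:
        if t in word_vectors:
            v += word_vectors[t]
    return v

def distributional_lesk(context_tokens, sense_glosses, word_vectors, dim):
    """Pick the sense whose gloss vector is closest (by cosine similarity)
    to the context vector. `sense_glosses` maps a sense id to its tokenized gloss."""
    ctx = text_vector(context_tokens, word_vectors, dim)
    best_sense, best_score = None, -1.0
    for sense, gloss_tokens in sense_glosses.items():
        g = text_vector(gloss_tokens, word_vectors, dim)
        denom = np.linalg.norm(ctx) * np.linalg.norm(g)
        score = float(ctx @ g / denom) if denom > 0 else 0.0
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense, best_score
```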
This paper proposes an approach to the construction of WordSpaces which takes into account temporal information. The proposed method is able to build a geometrical space covering several periods of time, which enables the analysis of how the meaning of a word evolves over time. Exploiting this approach, we build a framework, called Temporal Random Indexing (TRI), that provides all the tools necessary for building WordSpaces and performing such linguistic analysis. We present some example uses of our tool by analysing word meanings in two corpora: a collection of Italian books and a collection of English scientific papers about computational linguistics.
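A minimal sketch of the temporal idea follows: index vectors are generated deterministically per word, so the WordSpaces built for different time periods are comparable and a word's vectors can be compared across periods. Dimensions, window size and the toy corpora are illustrative; this is not the TRI codebase.

```python
import hashlib
from collections import defaultdict
import numpy as np

DIM = 500

def random_index_vector(word, dim=DIM, nonzero=10):
    """Sparse ternary random vector; seeding on the word keeps index vectors
    identical across time slices, so the resulting spaces are comparable."""
    seed = int(hashlib.md5(word.encode("utf-8")).hexdigest()[:8], 16)
    rng = np.random.default_rng(seed)
    v = np.zeros(dim)
    idx = rng.choice(dim, size=nonzero, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=nonzero)
    return v

def build_slice_space(sentences, window=2, dim=DIM):
    """Accumulate context vectors for one time slice (list of token lists)."""
    space = defaultdict(lambda: np.zeros(dim))
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    space[w] += random_index_vector(tokens[j])
    return space

def cosine(u, v):
    n = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / n) if n else 0.0

# Compare the meaning of a word in two toy periods.
space_a = build_slice_space([["amazon", "river", "forest"]])
space_b = build_slice_space([["amazon", "online", "shopping"]])
print(cosine(space_a["amazon"], space_b["amazon"]))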
Textual similarity is a crucial aspect of many extractive text summarization methods. A bag-of-words representation cannot grasp the semantic relationships between concepts when comparing strongly related sentences that have no words in common. To overcome this issue, in this paper we propose a centroid-based method for text summarization that exploits the compositional capabilities of word embeddings. The evaluations on multi-document and multilingual datasets prove the effectiveness of the continuous vector representation of words compared to the bag-of-words model. Despite its simplicity, our method achieves good performance even in comparison to more complex deep learning models. Our method is unsupervised and can be adopted in other summarization tasks.
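The centroid-based scoring can be illustrated with the sketch below: sentences are embedded by combining word vectors, scored by cosine similarity to a centroid vector, and selected greedily while skipping near-duplicates. Here the topic words used to build the centroid are passed in explicitly, which is a simplification; all names and thresholds are illustrative.

```python
import numpy as np

def embed(tokens, word_vectors, dim):
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cos(u, v):
    n = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / n) if n else 0.0

def centroid_summary(sentences, word_vectors, dim, topic_words, k=3, redundancy=0.95):
    """Score each sentence by cosine similarity to the centroid of the topic
    words, then greedily pick the top-k sentences, skipping near-duplicates."""
    centroid = embed(topic_words, word_vectors, dim)
    scored = sorted(((cos(embed(s.split(), word_vectors, dim), centroid), s)
                     for s in sentences), reverse=True)
    summary, chosen_vecs = [], []
    for score, s in scored:
        v = embed(s.split(), word_vectors, dim)
        if all(cos(v, c) < redundancy for c in chosen_vecs):
            summary.append(s)
            chosen_vecs.append(v)
        if len(summary) == k:
            break
    return summary
```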
Distributional semantics approaches have proven their ability to enhance the performance of overlap-based Word Sense Disambiguation algorithms. This paper shows the application of such a technique to the Italian language, by analysing the use of two different Distributional Semantic Models built upon the ItWaC and Wikipedia corpora, in conjunction with two different functions for leveraging the sense distributions. Results of the experimental evaluation show that the proposed method outperforms both the most frequent sense baseline and other state-of-the-art systems.
In this paper we deal with the problem of providing users with cross-language recommendations by comparing two different content-based techniques: the first one relies on a knowledge-based word sense disambiguation algorithm that uses MultiWordNet as sense inventory, while the latter is based on the so-called distributional hypothesis and exploits a dimensionality reduction technique called Random Indexing in order to build language-independent user profiles.
This paper provides an overview of the work done in the ESWC Linked Open Data-enabled Recommender Systems challenge, in which we proposed an ensemble of algorithms based on popularity, Vector Space Model, Random Forests, Logistic Regression, and PageRank, running on a diverse set of semantic features. We ranked 1st in the top-N recommendation task, and 3rd in the tasks of rating prediction and diversity.
The exponential growth of the Web is the most influential factor contributing to the increasing importance of text retrieval and filtering systems. However, since information exists in many languages, users may also consider relevant documents written in languages different from the one the query is formulated in. In this context, an emerging requirement is to sift through the increasing flood of multilingual text: this poses a renewed challenge for designing effective multilingual Information Filtering systems. How can we represent user information needs or user preferences in a language-independent way? In this paper, we compare two content-based techniques able to provide users with cross-language recommendations: the first one relies on a knowledge-based word sense disambiguation technique that uses MultiWordNet as sense inventory, while the latter is based on a dimensionality reduction technique called Random Indexing and exploits the so-called distributional hypothesis in order to build language-independent user profiles. Since the experiments conducted in a movie recommendation scenario show the effectiveness of both approaches, we also highlight the strengths and weaknesses of each approach in order to identify the scenarios in which a specific technique fits best.
The exponential growth of the Web is the most influential factor contributing to the increasing importance of cross-lingual text retrieval and filtering systems. Indeed, relevant information exists in different languages, thus users need to find documents in languages different from the one the query is formulated in. In this context, an emerging requirement is to sift through the increasing flood of multilingual text: this poses a renewed challenge for designing effective multilingual Information Filtering systems. Content-based filtering systems adapt their behavior to individual users by learning their preferences from documents that were already deemed relevant. The learning process aims to construct a profile of the user that can later be exploited in selecting/recommending relevant items. User profiles are generally represented using keywords in a specific language. For example, if a user likes movies whose plots are written in Italian, content-based filtering algorithms will learn a profile for that user which contains Italian words, thus movies whose plots are written in English will not be recommended, although they might definitely be interesting. In this paper, we propose a language-independent content-based recommender system, called MARS (MultilAnguage Recommender System), that builds cross-language user profiles by shifting from the traditional text representation based on keywords to a more advanced language-independent representation based on word meanings. The proposed strategy relies on a knowledge-based word sense disambiguation technique that exploits MultiWordNet as sense inventory. As a consequence, content-based user profiles become language-independent and can be exploited for recommending items represented in a language different from the one used in the content-based user profile. Experiments conducted in a movie recommendation scenario show the effectiveness of the approach.
This paper investigates the role of Distributional Semantic Models (DSMs) in a Question Answering (QA) system. Our purpose is to exploit DSMs for answer re-ranking in QuestionCube, a framework for building QA systems. DSMs model words as points in a geometric space, also known as a semantic space. Words are similar if they are close in that space. Our idea is that DSM approaches can help to compute the relatedness between users’ questions and candidate answers by exploiting paradigmatic relations between words, thus providing better answer re-ranking. Results of the evaluation, carried out on the CLEF 2010 QA dataset, prove the effectiveness of the proposed approach.
This paper investigates the role of Distributional Semantic Models (DSMs) in Question Answering (QA), and specifically in a QA system called QuestionCube. QuestionCube is a framework for QA that combines several techniques to retrieve passages containing the exact answers to natural language questions. It exploits Information Retrieval models to seek candidate answers and Natural Language Processing algorithms for the analysis of questions and candidate answers, both in English and Italian. The data source for the answers is an unstructured text document collection stored in search indices. In this paper we propose to exploit DSMs in the QuestionCube framework. In DSMs, words are represented as mathematical points in a geometric space, also known as a semantic space. Words are similar if they are close in that space. Our idea is that DSM approaches can help to compute the relatedness between users’ questions and candidate answers by exploiting paradigmatic relations between words. Results of an experimental evaluation carried out on the CLEF 2010 QA dataset prove the effectiveness of the proposed approach.
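One simple way to picture this kind of re-ranking is a linear combination of the original IR score with the cosine similarity between question and candidate answer in the semantic space, as in the sketch below. The combination scheme and weight are illustrative assumptions, not the actual QuestionCube implementation.

```python
import numpy as np

def dsm_vector(tokens, space, dim):
    """Sum the distributional vectors of the tokens found in the space."""
    vecs = [space[t] for t in tokens if t in space]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

def cos(u, v):
    n = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / n) if n else 0.0

def rerank(question_tokens, candidates, space, dim, alpha=0.5):
    """candidates: list of (answer_tokens, ir_score). The final score linearly
    combines the original IR score with the semantic similarity between the
    question and the candidate answer; alpha is an illustrative weight."""
    q = dsm_vector(question_tokens, space, dim)
    rescored = [(alpha * ir + (1 - alpha) * cos(q, dsm_vector(a, space, dim)), a)
                for a, ir in candidates]
    return sorted(rescored, key=lambda x: x[0], reverse=True)
```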
In this paper we propose an innovative Information Retrieval system able to manage temporal information. The system allows temporal constraints to be added to a classical keyword-based search. Information about temporal events is automatically extracted from text at indexing time and stored in an ad-hoc data structure exploited by the retrieval module for searching relevant documents. Our system can search textual information that refers to specific periods of time. We perform an exploratory case study by indexing all Italian Wikipedia articles.
A number of works have shown that the aggregation of several Information Retrieval (IR) systems works better than each system working individually. Nevertheless, early investigations in the context of the CLEF Robust-WSD task, in which semantics is involved, showed that aggregation strategies achieve only slight improvements. This paper proposes a re-ranking approach which relies on inter-document similarities. The novelty of our idea is twofold: the output of a semantic-based IR system is exploited to re-weight documents, and a new strategy based on Semantic Vectors is used to compute inter-document similarities.
Traditional Information Retrieval (IR) systems are based on the bag-of-words representation. This approach retrieves relevant documents by lexical matching between query and document terms. Due to synonymy and polysemy, lexical methods produce imprecise or incomplete results. In this paper we present how named entities are integrated in SENSE (SEmantic N-levels Search Engine). SENSE is an IR system that tries to overcome the limitations of the ranked keyword approach by introducing semantic levels which integrate (and not simply replace) the lexical level represented by keywords. Semantic levels provide information about word meanings, as described in a reference dictionary, and named entities. Our aim is to prove that named entities are useful for improving retrieval performance.
This paper proposes an Information Retrieval (IR) system that integrates sense discrimination to overcome the problem of word ambiguity. Word ambiguity is a key problem for systems that have access to textual information. Semantic Vectors are able to divide the usages of a word into different meanings, by discriminating among word meanings on the basis of information available in unannotated corpora. This paper has a twofold goal: the former is to evaluate the effectiveness of an IR system based on Semantic Vectors, the latter is to describe how they have been integrated in a semantic IR framework to build semantic spaces of words and documents. To achieve the first goal, we performed an in vivo evaluation in an IR scenario and compared the method based on sense discrimination to a method based on Word Sense Disambiguation (WSD). Unlike sense discrimination, which aims to discriminate among different meanings not necessarily known a priori, WSD is the task of selecting a sense for a word from a set of predefined possibilities. To accomplish the second goal, we integrated Semantic Vectors in a semantic search engine called SENSE (SEmantic N-levels Search Engine).
The exponential growth of the Web is the most influential factor contributing to the increasing importance of cross-lingual text retrieval and filtering systems. Indeed, relevant information exists in different languages, thus users need to find documents in languages different from the one the query is formulated in. In this context, an emerging requirement is to sift through the increasing flood of multilingual text: this poses a renewed challenge for designing effective multilingual Information Filtering systems. Content-based filtering systems adapt their behavior to individual users by learning their preferences from documents that were already deemed relevant. The learning process aims to construct a profile of the user that can later be exploited in selecting/recommending relevant items. User profiles are generally represented using keywords in a specific language. For example, if a user likes movies whose plots are written in Italian, a content-based filtering algorithm will learn a profile for that user which contains Italian words, thus failing to recommend movies whose plots are written in English, although they might definitely be interesting. Moreover, keywords suffer from typical Information Retrieval-related problems such as polysemy and synonymy. In this paper, we propose a language-independent content-based recommender system, called MARS (MultilAnguage Recommender System), that builds cross-language user profiles by shifting from the traditional text representation based on keywords to a more complex language-independent representation based on word meanings. The proposed strategy relies on a knowledge-based word sense disambiguation technique that exploits MultiWordNet as sense inventory. As a consequence, content-based user profiles become language-independent and can be exploited for recommending items represented in a language different from the one used in the content-based user profile. Experiments conducted in a movie recommendation scenario show the effectiveness of the approach.
The exponential growth of the Web is the most influential factor contributing to the increasing importance of cross-lingual text retrieval and filtering systems. Indeed, relevant information exists in different languages, thus users need to find documents in languages different from the one the query is formulated in. In this context, an emerging requirement is to sift through the increasing flood of multilingual text: this poses a renewed challenge for designing effective multilingual Information Filtering systems. In this paper, we propose a language-independent content-based recommender system, called MARS (MultilAnguage Recommender System), that builds cross-language user profiles by shifting from the traditional text representation based on keywords to a more complex language-independent representation based on word meanings. As a consequence, the recommender system is able to suggest items represented in a language different from the one used in the content-based user profile. Experiments conducted in a movie recommendation scenario show the effectiveness of the approach.
Recommender Systems suggest items that are likely to be the most interesting for users, based on the feedback, i.e. ratings, they provided on items already experienced in the past. Time-aware Recommender Systems (TARS) focus on the temporal context of ratings in order to track the evolution of user preferences and to adapt suggestions accordingly. In fact, some people's interests tend to persist for a long time, while others change more quickly, because they might be related to volatile information needs. In this paper, we focus on the problem of building an effective profile for short-term preferences. A simple approach is to learn the short-term model from the most recent ratings, discarding older data. It is based on the assumption that the more recent the data is, the more it contributes to finding items the user will shortly be interested in. We propose an improvement of this classical model, which tracks the evolution of user interests by exploiting the content of the items in addition to time information on ratings. When a new item-rating pair arrives, the replacement of an older one is performed by taking into account both a decay function for user interests and the content similarity between items, computed by distributional semantics models. Experimental results confirm the effectiveness of the proposed approach.
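One plausible reading of such a replacement policy, sketched below under stated assumptions, assigns each stored rating a retention score combining a recency decay with the content similarity to the incoming item, and evicts the entry with the lowest score. The linear combination, decay function and parameters are illustrative choices, not the model described in the paper.

```python
import math
import numpy as np

def time_decay(age_days, half_life=30.0):
    """Exponential decay: a rating loses half of its weight every half_life days."""
    return math.exp(-math.log(2) * age_days / half_life)

def retention_score(old_item_vec, new_item_vec, age_days, alpha=0.5):
    """Higher scores mean the old rating should be kept. Recency and content
    similarity are the two ingredients mentioned in the abstract; the way
    they are combined here is an illustrative assumption."""
    n = np.linalg.norm(old_item_vec) * np.linalg.norm(new_item_vec)
    sim = float(old_item_vec @ new_item_vec / n) if n else 0.0
    return alpha * time_decay(age_days) + (1 - alpha) * sim

def update_profile(profile, new_entry, max_size=50):
    """profile: list of (item_vec, rating, age_days) tuples. When the profile
    is full, the entry with the lowest retention score is replaced."""
    if len(profile) < max_size:
        profile.append(new_entry)
        return profile
    new_vec = new_entry[0]
    scores = [retention_score(vec, new_vec, age) for vec, _, age in profile]
    profile[int(np.argmin(scores))] = new_entry
    return profile
```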
This paper describes OTTHO (On the Tip of my THOught), a system designed for solving a language game called Guillotine. The rule of the game is simple: the player observes five words, generally unrelated to each other, and in one minute she has to provide a sixth word, semantically connected to the others. The system performs retrieval from several knowledge sources, such as a dictionary, a set of proverbs, and Wikipedia, to realize a knowledge infusion process. The main motivation for designing an artificial player for Guillotine is the challenge of providing the machine with the cultural and linguistic background knowledge which makes it similar to a human being, with the ability to interpret natural language documents and reason on their content. Our feeling is that the approach presented in this work has great potential for other, more practical applications besides solving a language game.
Information about top-ranked documents plays a key role in improving retrieval performance. One of the most common strategies which exploits this kind of information is relevance feedback. Few works have investigated the role of negative feedback on retrieval performance. This is probably due to the difficulty of dealing with the concept of a non-relevant document. This paper proposes a novel approach to document re-ranking, which relies on the concept of negative feedback represented by non-relevant documents. In our model the concept of non-relevance is defined as a quantum operator in both the classical Vector Space Model and a Semantic Document Space. The latter is induced from the original document space using a distributional approach based on Random Indexing. The evaluation carried out on a standard document collection shows the effectiveness of the proposed approach and opens new perspectives to address the problem of quantifying the concept of non-relevance.
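The negation operator can be illustrated with quantum negation as commonly defined for vector spaces: the query is projected onto the subspace orthogonal to a non-relevant document vector. The sketch below applies the projections sequentially, which is a simplification (a full treatment would orthogonalise against the subspace spanned by all non-relevant vectors); it is not the paper's implementation.

```python
import numpy as np

def quantum_not(query_vec, nonrelevant_vec):
    """Project the query onto the subspace orthogonal to the non-relevant
    document vector: q NOT d = q - ((q.d)/(d.d)) d."""
    d = nonrelevant_vec
    dd = float(d @ d)
    if dd == 0.0:
        return query_vec.copy()
    return query_vec - (float(query_vec @ d) / dd) * d

def rerank_with_negation(query_vec, nonrelevant_vecs, doc_vecs):
    """Negate each non-relevant vector in turn, then rank documents by cosine
    similarity to the negated query (returns document indices)."""
    q = query_vec.copy()
    for d in nonrelevant_vecs:
        q = quantum_not(q, d)
    def cos(u, v):
        n = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / n) if n else 0.0
    return sorted(range(len(doc_vecs)), key=lambda i: cos(q, doc_vecs[i]), reverse=True)
```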
In this work, we propose a method for document re-ranking, which exploits negative feedback represented by non-relevant documents. The concept of non-relevance is modelled through the quantum negation operator. The evaluation carried out on a standard collection shows the effectiveness of the proposed method in both the classical Vector Space Model and a Semantic Document Space.
This paper describes OTTHO (On the Tip of my THOught), an artificial player able to solve a very popular language game, called “The Guillotine”, broadcast by the Italian National TV company. The game demands knowledge covering a broad range of topics, such as politics, literature, history, proverbs, and popular culture. The rule of the game is simple: the player observes five words, generally unrelated to each other, and in one minute she has to provide a sixth word, semantically connected to the others. In order to find the solution, a human being has to perform a complex memory retrieval task within the facts retained in her own knowledge, concerning the meanings of thousands of words and their contextual relations. In order to make this task executable by machines, machine reading techniques are exploited for knowledge extraction from the web, while Artificial Intelligence techniques are used to infer new knowledge, in the form of keywords, from the extracted information.
This paper describes the techniques used to build a virtual player for the popular TV game "Who Wants to Be a Millionaire?". The player must answer a series of multiple-choice questions posed in natural language by selecting the correct answer among four different choices. The architecture of the virtual player consists of 1) a Question Answering (QA) module, which leverages Wikipedia and DBpedia data sources to retrieve the most relevant passages of text useful to identify the correct answer to a question, 2) an Answer Scoring (AS) module, which assigns a score to each candidate answer according to different criteria based on the passages of text retrieved by the Question Answering module, and 3) a Decision Making (DM) module, which chooses the strategy for playing the game according to specific rules as well as to the scores assigned to the candidate answers. We have evaluated both the accuracy of the virtual player in correctly answering the questions of the game and its ability to play real games in order to earn money. The experiments have been carried out on questions coming from the official Italian and English board games. The average accuracy of the virtual player for Italian is 79.64%, which is significantly better than the performance of human players, which is equal to 51.33%. The average accuracy of the virtual player for English is 76.41%. The comparison with human players is not carried out for English since successfully playing the game depends heavily on the players' knowledge of popular culture, and in this experiment we only involved a sample of Italian players. As regards the ability to play real games, which involves the definition of a proper strategy for the use of lifelines in order to decide whether to answer a question even in a condition of uncertainty or to retire from the game and take the money earned, the virtual player earns € 114,531 on average for Italian and € 88,878 for English, far exceeding the average amount earned by human players (€ 5,926 for Italian).
This paper proposes an investigation of a re-ranking strategy presented at SIGIR 2010. In that work we describe a re-ranking strategy in which the output of a semantic-based IR system is used to re-weight documents by exploiting inter-document similarities computed in a vector space. The space is built using the Random Indexing technique. The effectiveness of the strategy has been evaluated in the context of the CLEF Ad-Hoc Robust-WSD Task, while in this paper we propose new experiments on the TREC Ad-Hoc Robust Track 2004.
In this paper we exploit Semantic Vectors to develop an IR system. The idea is to use semantic spaces built on terms and documents to overcome the problem of word ambiguity. Word ambiguity is a key issue for those systems which have access to textual information. Semantic Vectors are able to divide the usages of a word into different meanings, discriminating among word meanings based on information found in unannotated corpora. We provide an in vivo evaluation in an Information Retrieval scenario and we compare the proposed method with another one which exploits Word Sense Disambiguation (WSD). Unlike sense discrimination, which is the task of discriminating among different meanings (not necessarily known a priori), WSD is the task of selecting a sense for a word from a set of predefined possibilities. The goal of the evaluation is to establish how Semantic Vectors affect the retrieval performance.
Artificial Intelligence technologies are increasingly used within several software systems ranging from Web services to mobile applications. It is no doubt true that the more AI algorithms and methods are used, the more they tend to depart from a pure "AI" spirit and end up belonging to the sphere of standard software. In a sense, AI seems strongly connected with ideas, methods and tools that are not (yet) used by the general public. On the contrary, a more realistic view would be that of a rich and pervasive set of successful paradigms and approaches. Industry is currently perceiving semantic technologies as a key contribution of AI to innovation. In this paper a survey of current industrial experiences is used to discuss different semantic technologies at work in heterogeneous areas, ranging from Web services to semantic search and recommender systems. The resulting picture confirms the vitality of the area and allows us to sketch a general taxonomy of approaches, which is the main contribution of this paper.
“The Guillotine” is a language game whose goal is to predict the unique word that is linked in some way to five words given as clues, generally unrelated to each other. The ability of the human player to find the solution depends on the richness of her cultural background. We designed an artificial player for that game, based on a large knowledge repository built by exploiting several sources available on the web, such as Wikipedia, that provide the system with the cultural and linguistic background needed to understand clues. The “brain” of the system is a spreading activation algorithm that starts processing clues, finds associations between them and words within the knowledge repository, and computes a list of candidate solutions. In this paper we focus on the problem of finding the most promising candidate solution to be provided as the final answer. We improved the spreading algorithm by means of two strategies for finding associations also between candidate solutions and clues. Those strategies allow bidirectional reasoning and select the candidate solution which is the most connected with the clues. Experiments show that the performance of the system is comparable to that of average human players.
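A generic spreading activation loop, which conveys the flavour of the "brain" described above without reproducing OTTHO's actual knowledge repository or weighting schemes, might look like the following sketch; the graph, weights, decay and thresholds are all illustrative.

```python
from collections import defaultdict

def spreading_activation(graph, clues, pulses=3, decay=0.8, firing_threshold=0.1):
    """graph: dict node -> list of (neighbour, edge_weight).
    Clue nodes start with activation 1.0; at each pulse, nodes above the
    firing threshold push part of their activation to their neighbours.
    Parameters are illustrative, not those of OTTHO."""
    activation = defaultdict(float)
    for c in clues:
        activation[c] = 1.0
    for _ in range(pulses):
        new_activation = defaultdict(float, activation)
        for node, act in list(activation.items()):
            if act < firing_threshold:
                continue
            for neighbour, weight in graph.get(node, []):
                new_activation[neighbour] += act * weight * decay
        activation = new_activation
    # Candidate solutions: the most activated nodes that are not clues.
    return sorted(((a, n) for n, a in activation.items() if n not in clues),
                  reverse=True)

# Tiny toy knowledge graph: "salt" is connected to both clues.
graph = {
    "sea":     [("water", 0.9), ("salt", 0.6)],
    "cooking": [("salt", 0.7), ("water", 0.4)],
}
print(spreading_activation(graph, ["sea", "cooking"])[:3])
```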
Super-sense tagging is the task of annotating each word in a text with a super-sense, i.e. a general concept such as animal, food or person, coming from the general semantic taxonomy defined by the WordNet lexicographer classes. Due to the small set of involved concepts, the task is simpler than Word Sense Disambiguation, which identifies a specific meaning for each word. The small set of concepts allows machine learning algorithms to achieve good performance when coping with the tagging problem. However, machine learning algorithms suffer from data sparseness. This problem becomes more evident when lexical features are involved, because test data can contain words that have low frequency in (or are completely absent from) the training data. To overcome the sparseness problem, this paper proposes a supervised method for super-sense tagging which incorporates information coming from a distributional space of words built on a large corpus. Results obtained on two standard datasets, SemCor and SensEval-3, show the effectiveness of our approach.
This paper describes the participation of the UNIBA team in the Named Entity rEcognition and Linking (NEEL) Challenge. We propose a completely unsupervised algorithm able to recognize and link named entities in English tweets. The approach combines the simple Lesk algorithm with information coming from both a distributional semantic model and usage frequency of Wikipedia concepts. The results show encouraging performance.
We report the results of the UNIBA participation in the first SemEval-2012 Semantic Textual Similarity task. Our systems rely on distributional models of words automatically inferred from a large corpus. We exploit three different semantic word spaces: Random Indexing (RI), Latent Semantic Analysis (LSA) over RI, and vector permutations in RI. Runs based on these spaces consistently outperform the baseline on the proposed datasets.
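The Random Indexing and vector-permutation spaces mentioned above can be sketched as follows: each word gets a sparse ternary index vector, context vectors accumulate the index vectors of neighbouring words, and in the permutation variant each index vector is rotated by the neighbour's relative offset so that word order is encoded. Dimensions and sparsity are illustrative, and this is not the code used for the submitted runs.

```python
import hashlib
import numpy as np

DIM = 300

def index_vector(word, dim=DIM, nonzero=8):
    """Sparse ternary random index vector, deterministic per word."""
    seed = int(hashlib.md5(word.encode("utf-8")).hexdigest()[:8], 16)
    rng = np.random.default_rng(seed)
    v = np.zeros(dim)
    pos = rng.choice(dim, size=nonzero, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=nonzero)
    return v

def context_vector(tokens, target_index, window=2, permute=True, dim=DIM):
    """Random Indexing context vector for the word at target_index.
    With permute=True, each neighbour's index vector is rotated by its
    relative offset, encoding word order (the vector-permutation variant)."""
    v = np.zeros(dim)
    i = target_index
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j == i:
            continue
        iv = index_vector(tokens[j])
        v += np.roll(iv, j - i) if permute else iv
    return v
```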
This paper presents the participation of the semantic N-levels search engine SENSE in the CLEF 2009 Ad Hoc Robust-WSD Task. Our aim is to demonstrate that the combination of the N-levels model and WSD can improve the retrieval performance even when an effective retrieval model is adopted. To reach this aim, we worked on two different strategies. On one hand, a model based on Okapi BM25 was adopted at each level. On the other hand, we integrated a local relevance feedback technique, called Local Context Analysis, in both indexing levels of the system (keyword and word meaning). The hypothesis that Local Context Analysis can be effective even when it works on word meanings coming from a WSD algorithm is supported by the experimental results. In the monolingual task MAP increased by about 2% when exploiting disambiguation, while CMAP increased from 4% to 9% when we used WSD in both the monolingual and bilingual tasks.
This paper describes the UNIBA team participation in the Cross-Level Semantic Similarity task at SemEval 2014. We propose to combine the output of different semantic similarity measures which exploit Word Sense Disambiguation and Distributional Semantic Models, among other lexical features. The integration of similarity measures is performed by means of two supervised methods based on Gaussian Process and Support Vector Machine. Our systems obtained very encouraging results, with the best one ranked 6th out of 38 submitted systems.
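The supervised combination step can be pictured as a small regression problem over similarity features, as in this sketch using scikit-learn's Gaussian Process and SVM regressors; the feature values, feature names and gold scores below are made up purely for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.svm import SVR

# Each text pair is described by a small vector of similarity scores
# (e.g. WSD-based similarity, DSM cosine, a lexical overlap feature);
# all values here are illustrative.
X_train = np.array([[0.80, 0.75, 0.60],
                    [0.20, 0.30, 0.10],
                    [0.55, 0.60, 0.40],
                    [0.90, 0.85, 0.95]])
y_train = np.array([4.5, 1.0, 3.0, 5.0])   # gold similarity scores (0-5 scale)

gp = GaussianProcessRegressor().fit(X_train, y_train)
svr = SVR(kernel="rbf").fit(X_train, y_train)

X_test = np.array([[0.70, 0.65, 0.50]])
print("GP prediction :", gp.predict(X_test)[0])
print("SVR prediction:", svr.predict(X_test)[0])
```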
This paper describes the UNIBA participation in the Semantic Textual Similarity (STS) core task 2013. We exploited three different systems for computing the similarity between two texts. One system is used as a baseline and represents the best model that emerged from our previous participation in STS 2012. This system is based on a distributional model of semantics capable of also taking into account the syntactic structures that glue words together. In addition, we investigated the use of two different learning strategies exploiting both syntactic and semantic features. The former uses a combination strategy in order to combine the best machine learning techniques trained on the 2012 training and test sets. The latter tries to overcome the limitation of working with different datasets with varying characteristics by selecting only the dataset most suitable for the training purpose.