With the introduction of the first smartphones around ten years ago, many people came to grips with a new kind of technology that surprised them; they never imagined their lives would benefit from it as much as they did. The next surprise in this technological journey came five years later, when virtual assistants, a “fairytale” whose origins date back to the 1960s, were introduced on smartphones. It was not just an evolution, it was a breakthrough! Even more so when one considers that these assistants operate from a portable gadget that sits in the palm of our hands. Since then, the presence of virtual assistants has grown exponentially and their usage has expanded significantly, from computers to household equipment. This article focuses on the technology that powers one such assistant, the medically inclined KRISTINA agent, and the Question Answering system that drives it, and elaborates on the workflow pipeline responsible for computing the responses that cater to the user’s informational needs.
The communication agent is able to process the user’s input, interpret the user’s needs and translate them into a form (a query) that the computer understands. Subsequently, depending on the question’s topic, the machine has two options for treating the query. It can forward it to a structured-schema Knowledge Base (think of it as a spreadsheet that is already filled with relevant information) or use an index-based Question Answering system. Usually, the second option is engaged when the user query cannot be fully addressed by the information derived from the knowledge base, which is stored in a structured manner (an ontology), but requires a “free text” response.
Under the hood of a Question Answering system lie sophisticated processes that scan vast amounts of web content and try to identify and retrieve the passages/segments that best meet the user’s informational needs. Its work becomes even more demanding when we consider that the agent must intelligently rank the segments and choose the top-ranked among them, since only one will be returned to the user.
So, in order to return the appropriate response, a Question Answering system relies on three distinct components that i) extract content from trustworthy health-related web resources pertinent to the domain of interest, ii) store this information in a repository, and iii) implement services that can search and retrieve the most relevant segments/passages. In the following subsections, we’ll provide more details on these components and try to figure out what makes them tick:
Content extraction: Crawling of trustworthy domain resources and automatic extraction of their content
Web crawlers, also known as "spiders", "bots" or "wanderers", are software programs that traverse the Web in a methodical, automated manner in order to collect web resources. They view the Web as a directed graph, where the nodes represent unique web pages (based on their URL) and each directed edge represents a unique hyperlink between two pages. The crawler deployed in KRISTINA uses as starting nodes the URLs of the web domains of interest (provided by end users), while the set of neighbouring nodes is restricted to web pages from the same domain, in order to perform exhaustive and deep crawling of the specific web domains.
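To make the graph traversal concrete, here is a minimal sketch of a domain-restricted, breadth-first crawl. It is purely illustrative, not KRISTINA’s actual crawler: the `get_links` callback (which in a real crawler would fetch a page and extract its hyperlinks) and all names are assumptions of this sketch.

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl that stays within the start URL's web domain."""
    domain = urlparse(start_url).netloc
    seen = {start_url}          # URLs already discovered
    queue = deque([start_url])  # frontier of the directed graph
    visited = []                # pages collected, in crawl order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Follow only hyperlinks within the same domain (deep crawl)
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Feeding it a toy link graph shows how off-domain links are discovered but never followed.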
Once the URLs have been crawled, web scraping is performed on the saved content; this is the process of extracting information from websites while filtering out unwanted content (e.g., advertisements). Usually, the choice of scraping technique depends on the type of website (i.e., static or dynamic). The way in which the information is structured or presented is also very important, since scrapers have to be configured based on that structure. Certain rules have also been added to the scraping process to ignore crawled URLs that contain noisy information, which could hurt the search component’s performance and prevent it from returning the correct search results. That’s it!
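As a toy illustration of that filtering step, the following sketch extracts visible text from HTML while skipping an assumed set of noisy tags. A production scraper would be configured per site structure, as noted above; the tag set here is just a guess at where boilerplate tends to live.

```python
from html.parser import HTMLParser

class ArticleScraper(HTMLParser):
    """Collects visible text, skipping tags assumed to hold boilerplate."""
    SKIP = {"script", "style", "nav", "aside"}  # illustrative noisy tags

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside skipped tags
        self.chunks = []  # pieces of visible text

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def scrape(html):
    parser = ArticleScraper()
    parser.feed(html)
    return " ".join(parser.chunks)
```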
Okay, not really. Web scraping is also tied to indexing, a process that transforms unstructured web data, typically in HTML format, into structured data that can be stored and analysed in a central local database. The objective of indexing is to provide an ontology-compliant structure that allows for efficient storage, rapid access and easy search of information. The web content distilled from the crawling and scraping pipeline is stored in a SIMO-based repository [1] that allows representing multimedia and text content in the context of the web and social media.
Information search based on query rewriting and passage retrieval
Great, we now have a repository full of useful, indexed information! But what exactly do we do with it? How do we access this “treasure trove” and, more importantly, how do we get the desired bit of info from it? The solution lies in information search and retrieval algorithms, which are employed to extract only the relevant segments from the web resources, and in query rewriting methods that enhance the search results.
Before applying the passage retrieval methods, the webpage and PDF content is segmented into paragraphs following the HTML structure. The paragraphs are then split into their corresponding sentences by means of the Stanford CoreNLP Toolkit. In order to capture as many segmentation types as possible for each document, besides the index with the entire documents, we also created separate indices for each one of the segmentation types (paragraphs, single sentences, pairs of sentences, triplets of sentences and “all segmentation”).
Pairs and triplets of sentences are formed by merging neighbouring sentences within the same paragraph. “All segmentation” is the name of the index that encloses all segment types. It is regarded as the most important index in the passage retrieval service, since it can serve passage queries without having to define the kind of passage to be returned as a response.
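A small sketch of how these indices could be built from paragraphs (each given as a list of sentences). The index names and the merging of the pooled “all segmentation” index are assumptions of this sketch, following the description above:

```python
def segment(paragraphs):
    """Build the per-type indices from paragraphs (lists of sentences)."""
    idx = {"paragraphs": [], "sentences": [], "pairs": [], "triplets": []}
    for sents in paragraphs:
        idx["paragraphs"].append(" ".join(sents))
        idx["sentences"].extend(sents)
        # Pairs/triplets merge neighbouring sentences within one paragraph only
        idx["pairs"].extend(" ".join(sents[i:i + 2]) for i in range(len(sents) - 1))
        idx["triplets"].extend(" ".join(sents[i:i + 3]) for i in range(len(sents) - 2))
    # "All segmentation" pools every segment type into one index
    idx["all"] = (idx["paragraphs"] + idx["sentences"]
                  + idx["pairs"] + idx["triplets"])
    return idx
```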
Moreover, before each segment/document is stored into the respective index, its content is stemmed (only the root of each word is kept) and the stop-words (i.e., very common words like “the”, “is”, “at”, “and”, “on”) are removed. The same stemming and stop-word removal process is also applied to each user query.
In the retrieval stage, the models used to match the query terms with the indexed segments are based on language models. In particular, the two language models that have been used are: a) a model based on a multinomial distribution, for which the conjugate prior for Bayesian analysis is the Dirichlet distribution [2], and b) a model based on the Jelinek-Mercer smoothing method [2], which involves a linear interpolation of the maximum likelihood model with the collection model. Simple, right?
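For the curious, here is what the two smoothing schemes boil down to, as a minimal sketch over tokenised documents. The parameter values (mu, lam) are common defaults from the smoothing literature, not KRISTINA’s tuned settings:

```python
import math

def dirichlet_score(query, doc, collection, mu=2000):
    """Query log-likelihood under a Dirichlet-smoothed language model."""
    score = 0.0
    for w in query:
        p_c = collection.count(w) / len(collection)  # background model
        score += math.log((doc.count(w) + mu * p_c) / (len(doc) + mu))
    return score

def jm_score(query, doc, collection, lam=0.5):
    """Jelinek-Mercer: linear interpolation of the ML doc model
    with the collection model."""
    score = 0.0
    for w in query:
        p_ml = doc.count(w) / len(doc)
        p_c = collection.count(w) / len(collection)
        score += math.log((1 - lam) * p_ml + lam * p_c)
    return score
```

Either way, a document that actually contains the query term scores higher than one that only inherits probability mass from the collection.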
The retrieval process can be executed using many different variations, depending on how we choose to reconstruct the query that is initially fed into the search engine. In the most simplistic scenario, we do not process the user input at all, but instead send it directly to the search engine and return the top-N documents (or segments) based on their language model score. Four additional retrieval variations have also been investigated (what follows isn’t for the faint of heart!):
/Geek mode on
- Retrieval based on sequential dependence models (SDM). In this setting, the query is split into a set of queries which contain single terms, ordered term pairs and unordered term pairs.
- Document-based retrieval. We exploit the top-N retrieved documents and add them as a filter along with the initial query in the passage retrieval. The retrieved passage results must come from the top-N retrieved documents.
- Document-paragraph-based retrieval. The top-N retrieved documents are added as a filter to retrieve paragraphs. Then, the remaining passage results must be contained in at least one of the retrieved paragraphs.
- Context-based retrieval. It follows the same retrieval steps as the document-paragraph-based method, with one major difference: in the paragraph retrieval phase, the paragraphs are re-ranked based on the language model scores of the previous and the next paragraphs (if any).
/Geek mode off
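To give a flavour of the first variation, the sketch below decomposes a query into the three SDM feature sets. The exact windowing used for unordered pairs varies between SDM implementations; all pairs are taken here purely for illustration:

```python
from itertools import combinations

def sdm_subqueries(query):
    """Split a query into SDM feature sets: single terms,
    ordered (adjacent) term pairs, and unordered term pairs."""
    terms = query.split()
    return {
        "single": terms,
        "ordered": [(terms[i], terms[i + 1]) for i in range(len(terms) - 1)],
        "unordered": list(combinations(terms, 2)),
    }
```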
|Figure: Document-paragraph-based retrieval method: Segments come from top-ranked paragraphs which, in turn, come from top-ranked documents|
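The document-based filtering idea can be sketched as follows, assuming some relevance function `score` (e.g. one of the smoothed language-model scores) and a toy corpus; the function names and the simple term-overlap scorer in the usage example are illustrative, not the deployed service:

```python
def doc_filtered_passages(query, score, docs, top_n=3, top_k=5):
    """Document-based passage retrieval: only passages coming from the
    top-N scoring documents are candidates for the final ranking.
    `docs` maps a document id to its list of passages."""
    top_docs = sorted(
        docs,
        key=lambda d: score(query, " ".join(docs[d])),
        reverse=True,
    )[:top_n]
    # Rank only passages contained in the surviving documents
    candidates = [(p, score(query, p)) for d in top_docs for p in docs[d]]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in candidates[:top_k]]
```

With a naive term-overlap scorer, passages from documents that never mention the query term are filtered out before passage ranking even starts.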
Furthermore, queries can optionally be expanded with additional terms to boost documents containing synonyms or terms closely related to the query. To this end, we trained a model that converts words into vectors using the Word2Vec algorithm [3]. Given a large corpus, the algorithm learns related words, which we can detect by comparing their vectors, since semantically similar words are represented by vectors in close proximity. The corpus we used was the content of the scraped documents. This query expansion process is automatic and adds a number of terms proportional to the number of words in the query after stop-word removal.
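The expansion step itself can be sketched independently of the embedding model. Below, the word vectors are passed in precomputed (in the real pipeline they would come from the Word2Vec model trained on the scraped corpus); the `per_term` knob and all names are assumptions of this sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two (non-zero) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def expand_query(tokens, vectors, per_term=1):
    """Append each query term's nearest embedding-space neighbours."""
    expanded = list(tokens)
    for t in tokens:
        if t not in vectors:
            continue  # out-of-vocabulary terms are left unexpanded
        neighbours = sorted(
            (w for w in vectors if w != t and w not in tokens),
            key=lambda w: cosine(vectors[t], vectors[w]),
            reverse=True,
        )
        expanded.extend(neighbours[:per_term])
    return expanded
```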
[1] Tsikrika, T., Andreadou, K., Moumtzidou, A., Schinas, E., Papadopoulos, S., Vrochidis, S. and Kompatsiaris, I., 2015. A unified model for socially interconnected multimedia-enriched objects. In MultiMedia Modeling, pp. 372-384. Springer International Publishing.
[2] Zhai, C. and Lafferty, J., 2004. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS), 22(2), pp. 179-214.
[3] Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.