Ever wondered why, when you ask Google Assistant or Siri a question, it sometimes answers with "I am not sure about that" or "Here's what I found on the web"?
What if there were a way to answer such questions, or even any question we have in mind?
Here is a proposed approach, "CATAQ: Concise Answer To Any Question", which, once implemented, aims to answer any question you can ask!
This article is a summary of how CATAQ works and is implemented. If you want to read a more extended and scientific version, just ask me on LinkedIn or email me.
CATAQ is an approach to answering questions that present-day intelligent assistants such as Google Assistant and Siri cannot answer. In this article, questions that intelligent assistants fail to answer are called complex questions, and questions they can answer are called simple questions. CATAQ focuses on complex questions while staying consistent on simple ones.
CATAQ solves this problem through dynamic information retrieval: it selects the most appropriate web page link, parses the content of that page irrespective of its structure, pre-processes the content, extracts the pertinent information with semantic search, and summarizes it by prioritizing the question's keywords.
Overview of CATAQ approach
CATAQ comprises a series of steps: selecting a website, extracting text from it, performing a semantic search over that text, and summarizing the semantic search results.
For semantic search, CATAQ uses a BERT-based model: a DistilBERT model trained on the MS MARCO dataset and used as a sentence transformer (SBERT). (What is BERT? - Refer to my blog here. Engage with my post to motivate me to write about SBERT and other variations of BERT.)
For summarization, a simple extractive text summarizer is implemented that prioritizes the words in the question asked.
Information extraction from relevant web page
This approach needs to be as dynamic as possible. To make it more user-oriented, a layer has been added where users can select the website they want to get information from. If the user wants CATAQ to choose the website, it simply selects the first Google search result, which is what Google's `I'm feeling lucky` feature does. The research literature has observed that the `I'm feeling lucky` search gives the most relevant search result for a query.
After the website link is obtained, the text available on that page is extracted. To make sure that any website can be scraped, the content of the web page was extracted by parsing the DOM tree and then removing the tags, which retained only the text of the page.
After retrieving the text, sentences were extracted from it. Sentences whose length was below a certain threshold, say fewer than 5 English words, were omitted. This created a corpus of only the required sentences, without the unnecessary characters and text.
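As a rough illustration of the extraction and filtering described above, here is a minimal sketch using only the Python standard library. It is not the original CATAQ code: the names `TextExtractor`, `html_to_text`, `split_sentences` and `build_corpus` are hypothetical, the sentence splitter is a naive regular expression, and the 5-word threshold is simply the example value mentioned above.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Strip the tags and return one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def split_sentences(text):
    # Naive assumption: a sentence ends at ., ! or ? followed by
    # whitespace, or at a line break between text chunks.
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def build_corpus(html, min_words=5):
    """Keep only sentences that have at least `min_words` words."""
    sentences = split_sentences(html_to_text(html))
    return [s for s in sentences if len(s.split()) >= min_words]
```

Calling `build_corpus(page_html)` on a downloaded page returns the list of sentences that would be handed to the semantic search step. Short fragments such as menu labels or "Read more." fall below the threshold and are dropped.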
Semantic search of question on information extracted
Now we have the reference text, which was processed and extracted from a particular website, and the question being searched, which is the candidate sentence. A semantic search is performed on the reference sentences with respect to the candidate sentence.
The corpus extracted for semantic search (the reference sentences) is processed by generating contextual embeddings of the sentences (to make the computer understand the English words in context). The literature survey found that DistilBERT computes faster while retaining most of the power of BERT (What is BERT? - Refer to my blog here). Furthermore, the MS MARCO dataset used to create the model was found to be the best open-source dataset available for open-domain question answering in intelligent assistants. So, the DistilBERT model trained on MS MARCO for question answering was used for the semantic search, and the sentence transformer (SBERT) built on that model was used to generate the embeddings.
The same model generated the embedding of the question being asked (the candidate sentence). Then the cosine distances were computed, taking the question embedding as the candidate and the corpus embeddings as the reference. The corpus sentences with the lowest cosine distances (that is, the highest cosine similarity to the question) were passed on to the summarization step. This retrieved the most semantically similar sentences from the reference sentences. (The calculations are skipped here.)
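To make the ranking step concrete, here is a small sketch of selecting sentences by cosine distance. It is only an illustration under assumptions: the function names `cosine_distance` and `top_k_sentences` and the cut-off `k` are hypothetical, and the vectors are expected to come from whichever sentence-embedding model is used (in CATAQ, the SBERT model described above). The embedding step itself is not shown.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; a smaller value means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat a zero vector as unrelated
    return 1.0 - dot / (norm_u * norm_v)


def top_k_sentences(question_vec, sentences, sentence_vecs, k=3):
    """Return the k sentences closest to the question, nearest first."""
    ranked = sorted(
        zip(sentences, sentence_vecs),
        key=lambda pair: cosine_distance(question_vec, pair[1]),
    )
    return [sentence for sentence, _ in ranked[:k]]
```

Sorting in ascending order of distance is what puts the most semantically similar sentences first. Sorting cosine similarity in descending order would give the same ranking.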
Text summarization on semantic relevant texts
The text provided by the semantic search was filtered further so that the answer carries most of the information in a few lines. This was done with a few simple heuristic steps, which gave good results. The steps were as follows:
1. A vocabulary was created by extracting the lemmatized words from the text provided by the semantic search and from the question that was asked.
2. A frequency table was maintained to count the occurrences of each word in the vocabulary.
3. A word's entry in the frequency table was incremented by 1 point if the word came from the text provided by the semantic search.
4. If the word came from the question, its entry was incremented by 5 points. This gave more weight to the relevant sentences.
5. If the word was a stop word, its entry was set to 0 points.
6. Each sentence was scored by summing the points of the words it contains.
7. Sentences whose scores were more than 25% above the average sentence score were chosen for the final output.
Executing these steps provided a concise answer to the question while keeping the information intact, because keywords from the question are prioritized when selecting among the semantically similar sentences.
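The steps above can be sketched roughly as follows. This is a simplified illustration, not the original implementation. Lowercased tokens stand in for lemmatized words, `STOP_WORDS` is a tiny hand-picked list, the 25% rule is read as "score greater than 1.25 times the average sentence score", and the names `tokenize` and `summarize` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on",
              "to", "and", "it", "by", "for", "what"}


def tokenize(text):
    # Assumption: lowercased tokens stand in for lemmatized words.
    return re.findall(r"[a-z0-9']+", text.lower())


def summarize(question, sentences, question_weight=5, threshold=1.25):
    """Keep sentences scoring above `threshold` times the average score."""
    points = Counter()
    # Steps 1-3: every occurrence in the retrieved text adds 1 point.
    for sentence in sentences:
        for word in tokenize(sentence):
            points[word] += 1
    # Step 4: every occurrence in the question adds 5 points.
    for word in tokenize(question):
        points[word] += question_weight
    # Step 5: stop words carry no weight.
    for word in STOP_WORDS:
        if word in points:
            points[word] = 0
    # Step 6: a sentence's score is the sum of its words' points.
    scores = [sum(points[w] for w in tokenize(s)) for s in sentences]
    if not scores:
        return []
    # Step 7: keep sentences more than 25% above the average score.
    average = sum(scores) / len(scores)
    return [s for s, score in zip(sentences, scores)
            if score > threshold * average]
```

Because question words earn 5 points and ordinary words only 1, a sentence that repeats the question's keywords easily clears the threshold, while loosely related sentences fall below it.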
The methodology implemented in CATAQ was tested on simple and complex questions to assess the performance of the approach. To evaluate the results, state-of-the-art intelligent assistants, namely Google Assistant by Google and Siri by Apple, were compared with CATAQ on 15th June 2021. The default website selection logic, which selects the first link of the Google search results, was used to evaluate the differences in the behaviour of the answers.
First, CATAQ was compared with the other intelligent assistants on simple questions to check the basic functionality of the approach. Then it was compared with them on complex questions to understand how the approach handles harder cases.
When an assistant failed to answer a question, it either provided related web links or directly showed a prompt that the search had failed. In the end, all the assistants, including the proposed approach, answered all of the simple questions. When complex questions were asked, however, only CATAQ was able to answer them with concise and relevant output, while the others showed related links or reported a failed search.
Result on simple questions
Result on complex questions
The CATAQ approach outperforms Google Assistant and Siri at open-domain question answering on the complex questions asked. Furthermore, it was consistent in answering both simple and complex questions. When asked complex questions, Google Assistant and Siri returned website links to search for an answer, whereas CATAQ produced answers to the complex questions just as it did for the simple ones. This approach combined state-of-the-art models with a basic flow of information to achieve the best output possible, offering a novel approach to answering complex questions.
* This was my capstone project for my bachelor's degree at PES University, Bangalore, completed in June 2021. After going through the implementation again, I do realise there are a few things that could be better (this was a very amateur implementation). Any feedback and suggestions are welcome in my inbox. *