Solving Question-Answering on Tabular Data: A Comparison

Updated on September 13, 2024

By Abhijith Neil Abraham and Shaoni Mukherjee

Solving Question-Answering on Tabular Data: A Comparison

Introduction

This article presents a comprehensive study addressing the challenge of natural language question-answering using tabular data. We will explore several approaches that utilize deep learning, delving into the architectures and datasets employed. Additionally, we will compare their performance using appropriate evaluation metrics.

Answering questions on tabular data is a research problem in NL, and numerous approaches exist to finding a solution. Some involve a heuristic method to break down the natural language input and translate it to be understandable by structured query languages, while others involve training deep learning models.

A good solution would cover the following:

Understand the relationships between the columns and values
Have a good semantic understanding of the natural language query
Preserve efficiency in performance, memory, and cross-domain interpretability This is what we will try to accomplish in this tutorial. Further, we will also be covering the following: Understanding tabular data using heuristic methods
AI-assisted tools vs pre-trained models
TableQA: an AI-assisted tool for question answering on tabular data
Popular datasets for pre-trained models
TAPAS
Choosing the best solution
Conclusion

Prerequisites

Basic Understanding of NLP: Familiarity with natural language processing concepts, including question-answering systems, tokenization, and embeddings.
Knowledge of Machine Learning Models: Understanding of common machine learning models such as decision trees, random forests, and neural networks, especially in the context of structured data.
Familiarity with Deep Learning Frameworks: Experience with deep learning frameworks such as PyTorch or TensorFlow to comprehend how models are implemented and fine-tuned.
Experience with Tabular Data: Proficiency in working with tabular data, including data manipulation and analysis using tools like pandas in Python.
Basic Statistics and Probability: A foundational understanding of statistics and probability for interpreting model performance metrics and evaluation results.

Understanding Tabular Data using Heuristic Methods

Tabular data is data that is structured into rows and columns. Since the data can be big in size, it is not easy to manually extract the answer to a particular user query, given a condition or a set of conditions. Tabular data can also be found in relational databases.

While Structured Query Language (SQL) queries are often used to extract particular information from a database, an individual’s knowledge of SQL queries may be limited. A more universal solution is to query the tabular data using natural language.

LN2SQL is one approach which used heuristic approaches to convert natural language to SQL queries. A set of parsers are used to break down the natural language construct into its respective SQL components. A thesaurus is used to improve keyword filtering to enable easier translation of specific entities like column names, SQL keywords, etc. The SQL output would then be queried to fetch results.

AI-assisted tools vs Pre-trained models

Neural networks provide various methods for analyzing tabular data. For instance, a pre-trained model can identify the relationship between the intricate structure of a natural language query and tabular data. This is achieved by embedding the query and the data, which are then processed through the neural network model.

Other solutions are not pre-trained on datasets containing tabular information but use AI and some heuristics to arrive at the solution. These are known as AI-assisted tools. We will take a look at such a tool below.

TableQA: an AI-assisted tool for question answering on tabular data

As we saw earlier, one method for breaking down natural language into smaller components is converting them into SQL queries. A SQL query is built from several component statements which include conditions, aggregate operations, etc. TableQA converts natural language queries to SQL queries. Let’s see a demo of how to use this tool to query cancer death data obtained from the Abu Dhabi open platform.

First, install the package and requirements. Install the module via pip:

!pip install tableqa

Next, we’ll access the sample data from the repository.

!git clone https://github.com/abhijithneilabraham/tableQA/

%cd tableQA/tableqa/

And now we’ll take our first look at the data.

import pandas as pd

df = pd.read_csv("cleaned_data/Cancer Death - Data.csv")
# A first look at data
df.head()

Cancer death data

The file we are trying to query is a CSV (comma-separated values) file with columns like Year, Nationality, Gender, Cancer Site, Death Count, and age. Let’s try to answer some questions from this dataset.

from tableqa.agent import Agent

agent = Agent(df)

#Ask a question
agent.query_db("how many deaths of age below 40 had stomach cancer?")

>>> [(24,)]

We could also take a look at the associated SQL query related to the same.

agent.get_query("how many deaths of age below 40 had stomach cancer?")

>>> SELECT COUNT(death_count) FROM dataframe WHERE age  < 40 AND cancer_site = "Stomach"

As you can see here, there is a COUNT operation done on the column Death Count with conditions like age<40 and Cancer Site="Stomach".

TableQA builds the SQL query using a combination of a deep learning-enabled entity extractor and aggregate clause classifier. The entity extractor matches the conditions by mapping the columns with their respective values if there are any in the question. It uses a pre-trained question-answering model trained on SQUAD to help easily locate the column’s value from the input question.

The aggregate clause classifier is a two-layer deep learning model that helps in classifying the nature of the input question in any of the following:

SELECT
COUNT
MAXIMUM
MINIMUM
AVERAGE
SUM

Combining entity extraction and clause classification with the help of some heuristics generates an SQL query, which is then used to query the data. One possible advantage of this approach is that no retraining is necessary for adding any new feature—some custom fixes could be done by modifying the code, replacing the AI models inside with more accurate ones, etc.

Popular Datasets for Pre-trained models

The heuristics-based solutions have their limitations. Addressing increased question complexity is challenging with explicit programming.

For example, we need to be able to answer questions that require multiple non-trivial steps, like: “Who finished first in this year’s F1 championship?” or “What is the cheapest mode of transport that departs tomorrow morning?”

In cases where human-like inference is required, a model pre-trained with a dataset containing such questions can be used.

The WikiTableQuestions (WTQ) is a dataset consisting of complex questions and a table from Wikipedia. The dataset contains 2108 tables from various topics (to cover more domain knowledge) and 22033 questions of different complexities.

Microsoft has made available a Sequential Question Answering dataset, SQA. The task of answering sequences of questions is done on HTML tables. SQA was created by decomposing 2,022 questions from WikiTableQuestions (WTQ). Each WTQ question is divided into three, resulting in a dataset of 6,066 sequences containing 17,553 questions. Each question was also associated with answers in the form of cell locations in the tables.

Another popular dataset for this task is WikiSQL, a dataset of 80654 human-annotated pairs of questions and SQL queries distributed across 24241 tables from Wikipedia. Unlike the other datasets above, WikiSQL contains SQL queries mapped from questions with annotated components.

TAPAS: Weakly Supervised Table Parsing via Pre-training

In 2020, Google Research open-sourced a popular deep learning model for querying tabular data without generating logical forms called TAPAS. It extends the Masked Language Modeling (MLM) approach to structured data.

Like BERT, TAPAS uses features which have to encode the tabular input. It then initializes a joint pre-training of text sequences and tables trained end to end and is successfully able to restore masked words and table cells. It can infer from a query by selecting a subset of table cells. If available, aggregate operations are also performed on top of them. Various datasets have been benchmarked using TAPAS, including WTQ, SQA, and WIKISQL.

Let’s look at a sample.

First, install the requirements:

!pip install tapas-table-parsing

Copy the model files from the cloud to our machine:

!gsutil cp gs://tapas_models/2020_04_21/tapas_sqa_base.zip . && unzip tapas_sqa_base.zip

We are going to make a predict() function to input the table as a dataframe:

import tensorflow.compat.v1 as tf
import os 
import shutil
import csv
import pandas as pd
import IPython

tf.get_logger().setLevel('ERROR')
from tapas.utils import tf_example_utils
from tapas.protos import interaction_pb2
from tapas.utils import number_annotation_utils
from tapas.scripts import prediction_utils

os.makedirs('results/sqa/tf_examples', exist_ok=True)
os.makedirs('results/sqa/model', exist_ok=True)
with open('results/sqa/model/checkpoint', 'w') as f:
  f.write('model_checkpoint_path: "model.ckpt-0"')
for suffix in ['.data-00000-of-00001', '.index', '.meta']:
  shutil.copyfile(f'tapas_sqa_base/model.ckpt{suffix}', f'results/sqa/model/model.ckpt-0{suffix}')
  
  
max_seq_length = 512
vocab_file = "tapas_sqa_base/vocab.txt"
config = tf_example_utils.ClassifierConversionConfig(
    vocab_file=vocab_file,
    max_seq_length=max_seq_length,
    max_column_id=max_seq_length,
    max_row_id=max_seq_length,
    strip_column_names=False,
    add_aggregation_candidates=False,
)
converter = tf_example_utils.ToClassifierTensorflowExample(config)

def convert_interactions_to_examples(tables_and_queries):
  """Calls Tapas converter to convert interaction to example."""
  for idx, (table, queries) in enumerate(tables_and_queries):
    interaction = interaction_pb2.Interaction()
    for position, query in enumerate(queries):
      question = interaction.questions.add()
      question.original_text = query
      question.id = f"{idx}-0_{position}"
    for header in table[0]:
      interaction.table.columns.add().text = header
    for line in table[1:]:
      row = interaction.table.rows.add()
      for cell in line:
        row.cells.add().text = cell
    number_annotation_utils.add_numeric_values(interaction)
    for i in range(len(interaction.questions)):
      try:
        yield converter.convert(interaction, i)
      except ValueError as e:
        print(f"Can't convert interaction: {interaction.id} error: {e}")
        
def write_tf_example(filename, examples):
  with tf.io.TFRecordWriter(filename) as writer:
    for example in examples:
      writer.write(example.SerializeToString())

def predict(table_data, queries):
  table=[list(table_data.columns)]+table_data.values.tolist()
  examples = convert_interactions_to_examples([(table, queries)])
  write_tf_example("results/sqa/tf_examples/test.tfrecord", examples)
  write_tf_example("results/sqa/tf_examples/random-split-1-dev.tfrecord", [])
  
  ! python -m tapas.run_task_main \
    --task="SQA" \
    --output_dir="results" \
    --noloop_predict \
    --test_batch_size={len(queries)} \
    --tapas_verbosity="ERROR" \
    --compression_type= \
    --init_checkpoint="tapas_sqa_base/model.ckpt" \
    --bert_config_file="tapas_sqa_base/bert_config.json" \
    --mode="predict" 2> error


  results_path = "results/sqa/model/test_sequence.tsv"
  all_coordinates = []
  df = pd.DataFrame(table[1:], columns=table[0])
  display(IPython.display.HTML(df.to_html(index=False)))
  print()
  with open(results_path) as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t')
    for row in reader:
      coordinates = prediction_utils.parse_coordinates(row["answer_coordinates"])
      all_coordinates.append(coordinates)
      answers = ', '.join([table[row + 1][col] for row, col in coordinates])
      position = int(row['position'])
      print(">", queries[position])
      print(answers)
  return all_coordinates

Now let’s use a sample table and load it as a dataframe. It consists of columns containing actor names and the number of movies they played in.

data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
queries = ["Who played less than 60 movies", "How many movies has George Clooney played in?"]
table = pd.DataFrame.from_dict(data)
result = predict(table, queries)

The model inference from the table can be seen below:

> Who played less than 60 movies
Leonardo Di Caprio
> How many movies has George Clooney played in?
69

There are certain limitations when using TAPAS. It can only handle single tables as context, which must fit into memory. This limitation makes it unsuitable for large datasets or multiple databases. Additionally, TAPAS cannot process questions that require multiple aggregate operations, such as, “How many actors have an average number of movies greater than 10?”

Choosing the best solution

A good solution to this task depends on various factors and user preferences. Non-deep-learning solutions that use heuristics are the fastest, but they come with a cost of poor semantic understanding. A deep learning model like TAPAS aids in comprehending natural language questions and table structures, but it raises concerns regarding memory issues and high computational costs. An AI-assisted solution like TableQA has an advantage in using heuristics since the heuristics can be modified to improve the solution’s performance in most cases without having to retrain the deep learning model. However, as natural language becomes more complex, more errors occur in retrieving information from the table.

Conclusion

In this article, we went through various solutions for question-answering on tabular data. We also saw different solutions which use heuristics and deep learning to semantically parse the information combined from tabular data and the natural language question. Choosing the best solution is done by considering various factors: speed, performance on understanding natural language, and memory.

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.

Learn more about our products

About the author(s)

Abhijith Neil Abraham

Author

Shaoni Mukherjee

Editor

Technical Writer

See author profile

With a strong background in data science and over six years of experience, I am passionate about creating in-depth content on technologies. Currently focused on AI, machine learning, and GPU computing, working on topics ranging from deep learning frameworks to optimizing GPU-based workloads.

Category:

Tags: