Research Interests
My research interests center on democratizing data intelligence: empowering people and organizations to derive insights, learn and share knowledge, and turn data into action.
Across the many forms of data, three common themes recur: understanding, generation, and interaction. Data understanding aims at semantic understanding of diverse data types; data generation aims at automatic content generation driven by users' needs; and data interaction aims to create seamless user experiences for working with data, such as recommendation and information retrieval.
Specifically, I am interested in the following topics:
- Development of Large Language Models
- Semi-structured Data Modeling & Reasoning
- Knowledge Representation Learning
- Causal Inference
A few questions that drive my recent research are:
- How can we get foundation models to efficiently learn domain knowledge?
- How can we build better models through collaboration with humans?
- How can we reduce potential harms (unfairness, privacy leakage, and bias)?
- How can we genuinely advance our understanding of current LLMs' capabilities and limitations, both empirically and theoretically?
News
- 2023.10: One paper accepted by WSDM'24!
- 2023.08: Started my Ph.D. journey at the National University of Singapore (NUS) with a Ph.D. research scholarship!
- 2023.03: Honored to be involved in developing Excel Copilot, one of Microsoft's "moon-shot" projects!
- 2022.10: Joined the DKI Group at MSRA as a research intern!
- 2022.06: Joined the Minds, Machines and Society Lab at Dartmouth College as a research intern!
- 2022.05: One paper accepted by KBS (journal)!
- 2022.03: One paper accepted by IJCNN'22!
- 2022.02: Joined the VIPL Group at ICT as a research intern!
Publications
(selected; * denotes equal contribution)
TAP4LLM: Table Provider on Sampling, Augmenting, and
Packing Semi-structured Data for Large Language
Model Reasoning
Yuan Sui, Jiaru Zou, Mengyu Zhou, Xinyi He, Lun
Du, Shi Han, Dongmei Zhang
Preprint, 2023
Table reasoning has shown remarkable progress on a wide range of table-based tasks. This challenging task requires reasoning over both free-form natural language (NL) questions and semi-structured tabular data. However, previous table reasoning solutions suffer from significant performance degradation on "huge" tables, and most existing methods struggle with complex questions because the essential information is missing or scattered across different places. To alleviate these challenges, we exploit a table provider with versatile sampling, augmentation, and packing methods to achieve effective table reasoning with large language models (LLMs). It 1) decomposes the raw table into sub-tables with specific rows/columns based on rules or semantic similarity; 2) augments the table information by extracting semantic and statistical metadata from the raw table and retrieving relevant knowledge from trustworthy knowledge sources (e.g., Wolfram Alpha, Wikipedia); and 3) packs the table information together with the augmented knowledge into a sequence for LLM reasoning while balancing the token-allocation trade-off. Experimental results illustrate that TAP4LLM not only demonstrates commendable performance across various tabular reasoning tasks but also serves as a systematic framework: it allows different components to be used as plug-ins, enhancing LLMs' understanding of structured data in diverse tabular tasks.
Table meets LLM: Can Large Language Models
Understand Structured Table Data? A Benchmark and
Empirical Study
Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han and
Dongmei Zhang
Conference on Web Search and Data Mining (WSDM'24), Long Paper, 2023, Blog
Large language models (LLMs) are becoming attractive as few-shot reasoners for natural language (NL) tasks. However, there is still much to learn about how well LLMs understand structured data such as tables. While tables can be serialized as input to LLMs, there is a lack of comprehensive studies examining whether LLMs can truly comprehend such data. In this paper, we try to understand this by designing a benchmark to evaluate the structural understanding capabilities (SUC) of LLMs. The benchmark includes seven tasks, each with its own unique challenges, e.g., cell lookup, row retrieval, and size detection. We run a series of evaluations on GPT-3.5 and GPT-4 and find that performance varies depending on a number of input choices, including table input format, content order, role prompting, and partition marks. Drawing on the insights gained from the benchmark evaluations, we then propose self-augmentation for effective structural prompting, e.g., critical value / range identification using LLMs' internal knowledge. When combined with carefully chosen input choices, these structural prompting methods lead to promising improvements in LLM performance on a variety of tabular tasks, e.g., TabFact (up 2.31%), HybridQA (up 2.13%), SQA (up 2.72%), Feverous (up 0.84%), and ToTTo (up 5.68%). We believe that our benchmark and proposed prompting methods can serve as a simple yet generic choice for future research. The code and data are released at https://anonymous.4open.science/r/StructuredLLM-76F3.
Why is Cross-Lingual Fine-Tuning Inferior
to Multi-Lingual Fine-Tuning? An Empirical Study
Weicheng Ma, Junhwi Kim, Yuan Sui, Chunyuan Deng,
Lili Wang and Soroush Vosoughi
Preprint, 2023
Cross-lingual models, which are
fine-tuned only in source languages, are typically
weaker than multi-lingual models, which are fine-tuned
in both source and target languages. However,
cross-lingual models are crucial for low-resource
languages as they do not require task-specific
annotated
training data for these languages. This paper
investigates the causes of this performance gap by
providing an in-depth analysis of cross-lingual and
multi-lingual Transformer models fine-tuned on two
natural language understanding (NLU) tasks. Our
findings
suggest two possible causes: multi-lingual models (1)
have better text-domain consistency with target
languages, and (2) are better able to extract and
encode
certain linguistic features that contribute to the NLU
objectives in the target languages. Based on these
findings, we propose and evaluate two methods for
improving cross-lingual models: (1) target-language
text-domain adaptation using masked language modeling
and (2) feature augmentation guided by model probing.
Our experiments show that applying these methods to
cross-lingual models can lead to gains in performance,
thus closing the performance gap between cross- and
multi-lingual models. These results also provide
general
empirical guidance for efficient data augmentation for
cross-lingual fine-tuning.
Intelligent Predictive Maintenance of
Hydraulic Systems based on Virtual Knowledge Graph
Wei Yan, Yu Shi, Zengyan Ji, Yuan Sui, Zhenzhen
Tian, Wanjing Wang, Qiushi Cao
Engineering Applications of Artificial Intelligence
, 2023 (IF=8)
In the manufacturing industry, hydraulic systems harness liquid fluid power to drive powerful machines. Under the trend of Industry 4.0, the predictive maintenance of hydraulic systems is shifting toward more intelligent and automated approaches that leverage artificial intelligence and data science technologies. However, due to the knowledge-intensive and heterogeneous nature of the manufacturing domain, the data and information required for predictive maintenance are typically collected from ubiquitous sensing networks. This creates a gap between the massive, heterogeneous data/information resources of hydraulic system components and the limited cognitive capacity of system users. Moreover, how to capture and structure useful domain knowledge (in a machine-readable way) for solving domain-specific tasks remains an open challenge for the predictive maintenance of hydraulic systems. To address these challenges, in this paper we propose a virtual knowledge graph-based approach for the digital modeling and intelligent predictive analytics of hydraulic systems. We evaluate the functionality and effectiveness of the proposed approach on a predictive maintenance task in a real-world industrial context. Results show that our approach is feasible and effective for digital modeling, data access, data integration, and predictive analytics.
Causality-aware Enhanced Model for Multi-hop
Question Answering over Knowledge Graphs
Yuan Sui, Shanshan Feng, Huaxiang Zhang, Jian
Cao, Liang Hu, Nengjun Zhu
Knowledge-Based Systems
(KBS), 2022 (IF=8.139)
To improve the performance of knowledge graph-based question answering (KGQA) systems, several approaches construct a semantic parser based on entity linking, relation identification, and logical/numerical structure identification. However, existing methods arrive at answers only by maximizing the data likelihood on sparse or imbalanced explicit relations, ignoring the potentially large number of latent relations. This leaves KGQA vulnerable to a high level of spurious entity relations and to the missing-link challenge. In this paper, we propose a causal filter (CF) model for KGQA (CF-KGQA), which performs causal intervention on the relation representation space to reduce spurious relation representations in a data-driven manner; i.e., the goal of this work is to comprehensively discover disentangled latent factors that alleviate the spurious correlation problem in KGQA. The model comprises a causal pairwise aggregator (AP) and a disentangled latent factor aggregator (AC). The former filters out most spurious entity relations inconsistent with their dense groups' neighborhoods and generates a causal pairwise matrix among all candidate relations. The latter learns the latent relation representation via an encoder-decoder on the causal pairwise matrix, disconnecting the latent factors from the causal confounder beneath the knowledge embedding space via causal intervention. To demonstrate the effectiveness and efficiency of the proposed approach, we test CF-KGQA and other state-of-the-art methods on four public real-world datasets. The experiments indicate that our approach outperforms recent methods and is less sensitive to the spurious correlation problem, demonstrating the robustness of CF-KGQA.
Trigger-GNN: A Trigger-Based Graph
Neural
Network for Nested Named Entity Recognition
Yuan Sui, Fanyang Bu, Yingting Hu, Wei Yan,
Liang
Zhang
International Joint Conference on Neural Networks
(IJCNN'22), Long Paper, 2022 (oral)
Nested named entity recognition (NER) aims to identify entity boundaries and recognize the categories of named entities in complex hierarchical sentences. Prior work has used character-level, word-level, or lexicon-level models, but such research ignores the role of complementary annotations. In this paper, we propose a trigger-based graph neural network (Trigger-GNN) for nested NER. It obtains complementary annotation embeddings through entity trigger encoding and semantic matching, and tackles nested entities using an efficient graph message-passing architecture with an aggregation-update mode. We posit that using entity triggers as external annotations adds complementary supervision signals over whole sentences, helping the model learn and generalize more efficiently and cost-effectively. Experiments show that Trigger-GNN consistently outperforms the baselines on four public NER datasets and can effectively handle nested NER.