Ge Lee

Ge — pronounced like the letter `g`, or `guh` as in 李格.

I'm a PhD candidate in Computer Science @ RMIT University, supervised by Prof. Zhifeng Bao and Dr. Shixun Huang. I also work with Dr. Yanchang Zhao @ CSIRO Data61, and I'm currently a visiting researcher in the Data Science discipline @ The University of Queensland.

My research sits at the intersection of data management and machine learning. I study how to make large, uncurated data collections easier to understand and work with. My recent work focuses on tabular data, discovering overlap between tables for scalable retrieval and deduplication. Increasingly, I am extending this toward cost-aware agentic workflows for selective and efficient data processing under practical budget and latency constraints. Throughout, the aim is the same: data management that is fast enough to use, affordable enough to run, and reliable enough to trust.

ge.lee@uq.edu.au / scholar / dblp / github / linkedin

Publications

2027
1. Alignment-Guided Largest Table Overlap Size Estimation
  Ge Lee, Shixun Huang, Zhifeng Bao, Shazia Sadiq, Yanchang Zhao.
  
  SIGMOD 2027
  
  Large table repositories need fast overlap estimates for blocking and query-by-table retrieval, but exact computation is too expensive and existing estimators struggle under structural variation and domain shift. This work introduces ALORE, an alignment-guided, hypergraph-based estimator that predicts table overlap size accurately and efficiently across heterogeneous repositories.
2026
1. Shape-Agnostic Table Overlap Discovery: A Maximum Common Subhypergraph Approach
  Ge Lee, Shixun Huang, Zhifeng Bao, Felix Naumann, Shazia Sadiq, Yanchang Zhao.
  
  SIGMOD 2026 tr / code
  
  Tables often share content despite reordered rows and columns and missing metadata, but existing rectangular overlap definitions miss many valid matches. This work introduces SALTO, a shape-agnostic notion of table overlap, and HyperSplit, a hypergraph-based algorithm that finds exact cell-level overlaps efficiently for copy detection, deduplication, and version comparison.
2. AgenticScholar: Agentic Data Management with Pipeline Orchestration for Scholarly Corpora
  Hai Lan, Tingting Wang, Zhifeng Bao, Guoliang Li, Daomin Ji, Ge Lee, Feng Luo, Zi Huang, Hailang Qiu, Gang Hua.
  
  SIGMOD 2026 tr / system
  
  Scholarly analysis increasingly requires reasoning across papers, where evidence is scattered across text, tables, figures, code snippets, citations, and bibliographic context, while questions evolve from retrieval into multi-step synthesis, comparison, trend tracing, and idea exploration. AgenticScholar compiles natural-language requests into evidence-grounded executable DAG workflows, supporting paper retrieval, structured extraction, cross-paper synthesis with ranking and inconsistency checking, trend analysis, milestone paper selection, and under-explored problem–method discovery. Its agentic core unifies a structure-aware scholarly knowledge base with hybrid planning and reusable operators, while exposing plans, intermediate results, and data lineage for traceability.
2025
1. Representative Time Series Discovery for Data Exploration
  Ge Lee, Shixun Huang, Zhifeng Bao, Yanchang Zhao.
  
  VLDB 2025 code
  
  Large time series collections need compact summaries for exploration, but existing methods lack controllable similarity-bounded coverage. This work introduces RTSD, which finds the smallest set of representative time series that collectively covers a user-specified proportion of the data, and MLGreedyET, a self-supervised greedy framework that solves it with low time and memory costs.
2024
1. Cost-effective Data Labelling for Graph Neural Networks
  Shixun Huang, Ge Lee, Zhifeng Bao, Shirui Pan.
  
  WWW 2024 code

Education

PhD, Computer Science
RMIT University, 2023–present
Bachelor of Computer Science (Hons)
RMIT University, 2022
Bachelor of Computer Science
RMIT University, 2019–2021

Awards

CSIRO Data61 Top-Up Scholarship, 2023
RMIT Vice-Chancellor's PhD Scholarship, 2023
RMIT Vice-Chancellor's List for Academic Excellence, 2021 & 2022

Teaching

Introduction to Information Systems (INFS7900)
Casual Academic, The University of Queensland, Sem 1 2026

Service

Reviewer, KDD 2024