Ge Lee

Ge — pronounced like the letter `g`, or `guh` as in 李格.

I'm a PhD candidate in Computer Science @ RMIT University, supervised by Prof. Zhifeng Bao and Dr. Shixun Huang. I also work with Dr. Yanchang Zhao @ CSIRO Data61, and I'm currently a visiting researcher in the Data Science discipline @ The University of Queensland.

My research sits at the intersection of data management and machine learning. I study how to make large, uncurated data collections easier to understand and work with. My recent work focuses on tabular data, in particular on discovering overlap between tables to support scalable retrieval and deduplication. Increasingly, I am extending this toward cost-aware agentic workflows for selective and efficient data processing under practical budget and latency constraints. Throughout, the aim is the same: data management that is fast enough to use, affordable enough to run, and reliable enough to trust.

ge.lee@uq.edu.au / scholar / dblp / github / linkedin

Publications

  1. 2027
    1. Alignment-Guided Largest Table Overlap Size Estimation

      Ge Lee, Shixun Huang, Zhifeng Bao, Shazia Sadiq, Yanchang Zhao.

      SIGMOD 2027

      Large table repositories need fast overlap estimates for blocking and query-by-table retrieval, but exact computation is too expensive and existing estimators struggle under structural variation and domain shift. This work introduces ALORE, an alignment-guided, hypergraph-based estimator that predicts table overlap size accurately and efficiently across heterogeneous repositories.

  2. 2026
    1. Shape-Agnostic Table Overlap Discovery: A Maximum Common Subhypergraph Approach

      Ge Lee, Shixun Huang, Zhifeng Bao, Felix Naumann, Shazia Sadiq, Yanchang Zhao.

      SIGMOD 2026 tr / code

      Tables often share content despite reordered rows and columns and missing metadata, but existing rectangular overlap definitions miss many valid matches. This work introduces SALTO, a shape-agnostic notion of table overlap, and HyperSplit, a hypergraph-based algorithm that finds exact cell-level overlaps efficiently for copy detection, deduplication, and version comparison.

    2. AgenticScholar: Agentic Data Management with Pipeline Orchestration for Scholarly Corpora

      Hai Lan, Tingting Wang, Zhifeng Bao, Guoliang Li, Daomin Ji, Ge Lee, Feng Luo, Zi Huang, Hailang Qiu, Gang Hua.

      SIGMOD 2026 tr / system

      Scholarly analysis increasingly requires reasoning across papers, where evidence is scattered across text, tables, figures, code snippets, citations, and bibliographic context, while questions evolve from retrieval into multi-step synthesis, comparison, trend tracing, and idea exploration. AgenticScholar compiles natural-language requests into evidence-grounded executable DAG workflows, supporting paper retrieval, structured extraction, cross-paper synthesis with ranking and inconsistency checking, trend analysis, milestone paper selection, and under-explored problem–method discovery. Its agentic core unifies a structure-aware scholarly knowledge base with hybrid planning and reusable operators, while exposing plans, intermediate results, and data lineage for traceability.

  3. 2025
    1. Representative Time Series Discovery for Data Exploration

      Ge Lee, Shixun Huang, Zhifeng Bao, Yanchang Zhao.

      VLDB 2025 code

      Large time series collections need compact summaries for exploration, but existing methods lack controllable similarity-bounded coverage. This work introduces RTSD, which finds the smallest set of representative time series that collectively covers a user-specified proportion of the data, and MLGreedyET, a self-supervised greedy framework that solves it with low time and memory costs.

  4. 2024
    1. Cost-effective Data Labelling for Graph Neural Networks

      Shixun Huang, Ge Lee, Zhifeng Bao, Shirui Pan.

      WWW 2024 code

Education

Awards

Teaching

Service