Qian Liu (刘乾)

I am a Research Scientist (NLP) at the Sea AI Lab, an industry research lab based in Singapore 🇸🇬. We are actively seeking (Senior) Research Scientists across all directions to join our team; if you are interested, please don't hesitate to send me an email for further details or to express your interest.

My primary research interests are in natural language processing, particularly code generation, table pre-training, and natural language reasoning. I have been fortunate to work with an amazing set of researchers in the past. I completed my Ph.D. through the joint program of Beihang University and Microsoft Research Asia, where I was advised by Jian-Guang Lou and Bei Chen. My Ph.D. research topic was semantic parsing, which aims to translate a user's natural language sentences into machine-executable formal programs to accomplish relevant tasks. My thesis revolves around my efforts to develop cost-effective, generalizable, and interactive semantic parsing systems.

During my Ph.D., three academic achievements have been most significant to me:
  • In 2020, I was honored to be nominated for the Baidu Scholarship (20 nominees worldwide), which motivated me to pursue significant research.
  • In 2021, we (together with 11 other young researchers from around the world) launched the MLNLP community to enhance communication among the Chinese ML and NLP communities.
  • In 2022, our paper TAPEX received the highest score (alongside 31 other awesome papers) in the first round of ICLR 2022 reviews.
Before starting my research career, I served as a teaching assistant (and TA leader) for a number of courses, including computer organization, operating systems, and software engineering. I also led the writing of an operating system lab course manual. In 2017, I founded S.T.A.R., the first undergraduate teaching assistant organization, to share my TA experience with younger TAs.

For more details, check out my CV or drop me an email.

✨ News

[2023.11] Please check out our S3Eval benchmark to see how a synthetic dataset can be used to systematically 🔍 analyze & 🔬 evaluate language models!

[2023.10] 3 papers got accepted to EMNLP 2023! If you'd like to hang out with me during the conference in 🇸🇬, feel free to DM me on Twitter! I will also be at the SSNLP 2023 event!

[2023.10] Please check out our Lemur (70B language model) and OpenAgents (open-source platform) to build your own private ChatGPT Plugins / Advanced Data Analysis platform!

[2023.07] Please check out our LoraHub to see how existing language model adapters can be merged to tackle new tasks!

[2023.05] Please check out our FLARE to see how to build a neat and powerful retrieval-augmented generation system!

[2023.05] Please check out our TapTap to see how language models pre-trained on tables benefit machine learning models!

[2023.05] StarCoder (15.5B), the best open-source code pre-training model, is out!

๐Ÿ“ Selected Publications (Full Publications on Google Scholar)

S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models
Fangyu Lei*, Qian Liu*, Yiming Huang*, Shizhu He, Jun Zhao, Kang Liu (* = Equal Contribution)
The Benchmark with Unlimited Examples and Infinite Context Length
PDF | Github

OpenAgents: An Open Platform for Language Agents in the Wild
Tianbao Xie*, Fan Zhou*, Zhoujun Cheng*, Peng Shi*, Luoxuan Weng*, Yitao Liu*, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, Tao Yu (* = Equal Contribution)
The Open-Source ChatGPT Plugin / Advanced Data Analysis Framework
PDF | Github | Homepage | Video

Lemur: Harmonizing Natural Language and Code for Language Agents
Yiheng Xu*, Hongjin Su*, Chen Xing*, Boyu Mi, Qian Liu, Weijia Shi, Binyuan Hui, Fan Zhou, Yitao Liu, Tianbao Xie, Zhoujun Cheng, Siheng Zhao, Lingpeng Kong, Bailin Wang, Caiming Xiong, Tao Yu (* = Equal Contribution)
The State-of-the-art Open Foundation Models (70B) for Language Agents
PDF | Github | Homepage | Model | Media

LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
Chengsong Huang*, Qian Liu*, Bill Yuchen Lin*, Tianyu Pang, Chao Du, Min Lin (* = Equal Contribution)
Compose Existing LoRA Modules for Novel Tasks
PDF | Github | Homepage | Media | Media (Chinese)
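The core idea of LoraHub can be pictured in a few lines: an adapter for a new task is formed as a weighted combination of existing LoRA weight deltas, with the combination weights found by gradient-free optimization on a few examples from the new task. Below is a minimal, hypothetical toy sketch of just the composition step, using plain nested lists in place of real adapter tensors; all matrices, sizes, and weights are illustrative and not the actual LoraHub implementation.

```python
def compose(adapters, weights):
    """Element-wise weighted sum of LoRA delta matrices (toy version)."""
    rows, cols = len(adapters[0]), len(adapters[0][0])
    return [
        [sum(w * a[i][j] for w, a in zip(weights, adapters)) for j in range(cols)]
        for i in range(rows)
    ]

# Three pre-trained 2x2 LoRA deltas for the same base weight matrix
# (values chosen to be exact in binary floating point).
adapters = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[0.0, 4.0], [4.0, 0.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

# In LoraHub, these weights would be found by gradient-free search on a
# few examples of the new task; here they are fixed for illustration.
merged = compose(adapters, [0.5, 0.25, 0.25])
print(merged)  # [[1.0, 1.25], [1.25, 1.0]]
```

In the real method the merged delta is then added back onto the base model's weights, so composing adapters is cheap compared to fine-tuning a new one from scratch.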

Active Retrieval Augmented Generation
Zhengbao Jiang*, Frank F. Xu*, Luyu Gao*, Zhiqing Sun*, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, Graham Neubig (* = Equal Contribution)
EMNLP 2023 | Empirical Methods in Natural Language Processing
PDF | Github | LangChain Integration

Generative Table Pre-training Empowers Models for Tabular Prediction
Tianping Zhang, Shaowen Wang, Shuicheng Yan, Jian Li, Qian Liu
EMNLP 2023 | Empirical Methods in Natural Language Processing
PDF | Github | Model

StarCoder: may the source be with you!
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries
One of the Most Popular Code Language Models
PDF | Github | Model | Blog

From Zero to Hero: Examining the Power of Symbolic Tasks in Instruction Tuning
Qian Liu*, Fan Zhou*, Zhengbao Jiang, Longxu Dou, Min Lin (* = Equal Contribution)
PDF | Github

Learning on Large-scale Text-attributed Graphs via Variational Inference
Jianan Zhao, Meng Qu, Chaozhuo Li, Hao Yan, Qian Liu, Rui Li, Xing Xie, Jian Tang
ICLR 2023 (Oral) | International Conference on Learning Representations
PDF | Github

SantaCoder: don't reach for the stars!
Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra
DL4C @ ICLR 2023 | Deep Learning for Code Workshop @ International Conference on Learning Representations
Best Paper Award of DL4C Workshop
PDF | Model

Reasoning Like Program Executors
Xinyu Pi*, Qian Liu*, Bei Chen, Morteza Ziyadi, Zeqi Lin, Yan Gao, Qiang Fu, Jian-Guang Lou, Weizhu Chen (* = Equal Contribution)
Distinguished Contribution Award (2/300+) on Microsoft 2022 MLADS Spring
EMNLP 2022 (Oral) | Empirical Methods in Natural Language Processing
PDF | Video

LEMON: Language-Based Environment Manipulation via Execution-Guided Pre-training
Qi Shi, Qian Liu, Bei Chen, Yu Zhang, Ting Liu, Jian-Guang Lou
EMNLP 2022 (Findings) | Empirical Methods in Natural Language Processing
PDF | Github

TAPEX: Table Pre-training via Learning a Neural SQL Executor
Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou
Highest rating in the 1st Round
ICLR 2022 | International Conference on Learning Representations
PDF | Slides | Github | Cite | Homepage | Video(Chinese) | Blog | Model

Awakening Latent Grounding from Pretrained Language Models for Semantic Parsing
Qian Liu*, Dejian Yang*, Jiahui Zhang*, Jiaqi Guo, Bin Zhou, Jian-Guang Lou (* = Equal Contribution)
ACL 2021 (Findings) | Association for Computational Linguistics
PDF | Slides | Cite | Video

ReTraCk: A Flexible and Efficient Framework for Knowledge Base Question Answering
Shuang Chen*, Qian Liu*, Zhiwei Yu*, Chin-Yew Lin, Jian-Guang Lou, Feng Jiang (* = Equal Contribution)
ACL 2021 (Demo) | Association for Computational Linguistics
PDF | Github | Cite | Video

Compositional Generalization by Learning Analytical Expressions
Qian Liu*, Shengnan An*, Jian-Guang Lou, Bei Chen, Zeqi Lin, Yan Gao, Bin Zhou, Nanning Zheng, Dongmei Zhang (* = Equal Contribution)
First Paper to Achieve 100% Accuracy on SCAN
NeurIPS 2020 (Spotlight) | Advances in Neural Information Processing Systems
PDF | Slides | Github | Cite | Video | Video(Chinese) | Blog(Chinese)

You Impress Me: Dialogue Generation via Mutual Persona Perception
Qian Liu, Yihong Chen, Bei Chen, Jian-Guang Lou, Zixuan Chen, Bin Zhou, Dongmei Zhang
ACL 2020 | Association for Computational Linguistics
PDF | Slides | Github | Cite | Blog(Chinese)

💬 Talks

[2022.10-12] Language Pre-training without Natural Language (Invited Talk)
📖 Language models pre-trained on large-scale textual data have been successful, but they lack reasoning ability because text contains limited reasoning data. This talk proposes using programs instead of natural language as the pre-training corpus to improve reasoning on tasks such as tabular, numerical, and spatial reasoning.
Slides | Video

Venue: Carnegie Mellon University (CMU) Host: Frank Xu
Venue: Sigma Computing Host: Madelon Hulsebos
Venue: National University of Singapore (NUS) Host: Prof. Min-Yen Kan
Venue: Singapore University of Technology and Design (SUTD) Host: Prof. Wei Lu
Venue: Nanyang Technological University (NTU) Host: Prof. Luu Anh Tuan

[2022.09] Introduction to Language Models (Tutorial)
📖 The tutorial gives a brief overview of mainstream language model architectures (ELMo, GPT, BERT), giant language models (GPT-3, Chinchilla), retrieval-based language models (REALM, kNN-LM), and interesting trends (scaling laws, instruction following, parameter efficiency).

Venue: Sea AI Lab (SAIL)

[2022.06] Semantic Parsing of Natural Language from Weakly Labeled Data (Ph.D. Defense)
📖 Focuses on compositional and domain generalization of semantic parsing, answer-driven semantic parsing under weak supervision, and conversational semantic parsing under semi-supervision.
Slides(Chinese) | Thesis(Chinese)

Venue: Beihang University (BUAA) Host: Prof. Maosong Sun

[2022.01-02] Towards Data-Efficient Semantic Parsing (Job Talk)
📖 Builds methods to improve semantic parsers' performance and generalization capacity with only program data, task data, or even no data, and integrates the research into the real product Power Apps.

Venue: Sea AI Lab (SAIL) Host: Dr. Min Lin
Venue: Microsoft Research Asia (MSRA) Host: Dr. Jian-Guang Lou

[2022.01] How to Find a Research Job in Industry (Seminar)
📖 Discusses the critical steps in seeking a good job, such as resume preparation, coding exercises, project discussions, and behavioral questions.
Video(Chinese) | Slides(Chinese)

Venue: MLNLP Community Host: Bei Li

[2021.07] On the Future of Semantic Parsing (Seminar)
📖 Discusses the past, present, and future of semantic parsing with other rising stars in the field.
Video(Chinese) | Blog(Chinese)

Venue: AI TIME Speakers: Dr. Pengcheng Yin, Dr. Ziyu Yao, Dr. Bailin Wang

📞 Contact

Please feel free to contact me via my email (left) if you are interested in our papers or my experience, or if you have any research questions I may be able to help with.
Beihang University
2013 - 2017
Microsoft Research Asia
2017 - 2022
Sea AI Lab
2022 - Present