Research Positions in the SNAP Group
Autumn Quarter 2025-26

Welcome to the application page for research positions in the SNAP group under Prof. Jure Leskovec for Autumn Quarter 2025-26!

Our group has open positions for Research Assistants and for students interested in independent studies and research (CS191, CS195, CS199, CS399). These positions are available to Stanford University students only. Some of the possible research projects are listed below. All projects are high-impact: participants perform research on real-world problems and data, and the work leads to research publications or open-source software. Positions are often extended over several quarters. We are looking for highly motivated students with any combination of the following skills: machine learning, data mining, network analysis, algorithms, and computer systems.

Please apply by filling out and submitting the form below. Apply promptly, as the selection process starts soon after this announcement is posted. Thanks for your interest!

If you have any questions, please contact Lata Nair at [email protected].

Application Form

First and Last Name

SUNetID

Your SUNetID is your Stanford CS login name and contact email address: <your_SUNetID>@cs.stanford.edu. If you don't have a SUNetID, use <your_last_name>_<your_first_name>; for example, if your last name is Smith and your first name is John, use smith_john.

Email

Department

Student Status

Project(s)

Please select all the projects that you are interested in. Project descriptions are available below.

Graph Transformer for Tree Reconstruction and Lineage Tracing of Single Cell Datasets [description]
Keywords: Graph Learning, ML Theory, Transformers, Data Processing, Information Retrieval, AI Biology
Foundation Models for Machine Learning [description]
Keywords: In-Context Learning, Pre-Training, Foundation Model, Relational Databases
Scaling Laws for Relational Transformers [description]
Keywords: Scaling Laws, Pre-Training, Foundation Model, Relational Databases
Pretraining Relational Database Foundation Model with Link-Prediction Tasks [description]
Keywords: Relational Databases, Transformers, Pretraining, Foundation Model, Link Prediction, Recommendation
Building User Simulators from Large Language Models [description]
Keywords: Human-centric LLMs, Reasoning, Simulation
Massive Hypothesis Mining with Agentic Verifications [description]
Keywords: AI Agents, Statistical Tests, Data Science, AI for Science
Inverse Perturbation Modeling Using GenAI and Natural Language [description]
Keywords: Multimodal Modeling, Generative AI, Biomedical AI

Position

Please select the position(s) you are interested in; select all that apply.

25% RA
50% RA
Independent study (CS399, CS199, CS191, CS195)

Statement of Purpose

Briefly explain why you would like to participate in this project, why you think you are qualified to work on it, and how you would like to contribute.

Your Resume

Your Transcript

Click the button below to submit.


Projects

Graph Transformer for Tree Reconstruction and Lineage Tracing of Single Cell Datasets

Keywords: Graph Learning, ML Theory, Transformers, Data Processing, Information Retrieval, AI Biology

Trees are a fundamental data structure, and reconstructing trees from leaf features is a classical problem in computer science. Biological data modalities can often be grouped into large tree structures with hundreds of thousands or millions of leaves, presenting a compelling source of training data for machine learning and a challenging, expensive task for classical algorithms. This project seeks to develop a suite of machine learning tools and theory for reconstructing trees with tens to hundreds of thousands of nodes (also called hierarchies or phylogenies) from biological data.
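
For context, the sketch below shows a standard classical baseline for this problem: reconstructing a binary tree over leaves directly from their feature vectors with average-linkage hierarchical clustering in scipy. It is illustrative only and not the learned approach this project targets.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, to_tree

    # Classical baseline: agglomerative clustering builds a binary tree
    # (a hierarchy) over the leaves from pairwise feature distances.
    leaf_features = np.random.rand(200, 64)          # 200 leaves, 64-dim features each
    Z = linkage(leaf_features, method="average")     # average-linkage clustering
    root = to_tree(Z)                                # root of the reconstructed binary tree
    print("root merge distance:", root.dist, "| leaves under root:", root.count)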

We are looking for highly motivated students who have a background in machine learning, deep learning, and CS/ML theory (e.g., CS224W, CS230, CS231N). Experience in computational biology is not required. Strong coding skills and proficiency in PyTorch are required.

Foundation Models for Machine Learning

Keywords: In-Context Learning, Pre-Training, Foundation Model, Relational Databases

Pre-trained relational transformer (RT) models achieve remarkable transfer to unseen datasets. However, we currently rely on fine-tuning to reach satisfactory performance, even though we have overwhelming evidence that RTs are capable of in-context learning (ICL), i.e., learning a new dataset entirely from the context window in a single forward pass. In this project, we will build on this surprising observation and pre-train specifically for strong ICL capabilities by (1) scaling up the context length, (2) scaling up the pre-training data, and (3) retrieving relevant in-context examples, as sketched below.
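
As a rough illustration of item (3), the sketch below retrieves the most relevant in-context examples for a query row by nearest-neighbor search over row embeddings. The function and the random tensors are placeholders, not our actual retrieval pipeline.

    import torch

    def retrieve_in_context_examples(query_emb, pool_embs, k=32):
        """Return indices of the k labeled rows most similar (by cosine
        similarity) to the query row; these rows would be packed into the
        context window for a single in-context-learning forward pass."""
        q = torch.nn.functional.normalize(query_emb, dim=-1)
        p = torch.nn.functional.normalize(pool_embs, dim=-1)
        return torch.topk(p @ q, k).indices

    # Toy usage with random embeddings standing in for real row encodings.
    pool = torch.randn(10_000, 128)     # candidate labeled rows
    query = torch.randn(128)            # unlabeled query row
    ctx_idx = retrieve_in_context_examples(query, pool, k=32)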

This project involves pre-training foundation models at scale. Strong experience with deep learning, foundation models, and systems is required, including proficiency with PyTorch, Python, and data processing. Familiarity with Rust is a plus, as it powers our highly efficient data pipeline.

Scaling Laws for Relational Transformers

Keywords: Scaling Laws, Pre-Training, Foundation Model, Relational Databases

Relational transformer (RT) models as small as 22M parameters, pre-trained on as few as 6 different datasets, already show very strong transfer to completely unseen datasets. However, to pre-train strong relational foundation models, scaling up both the model size and the pre-training data is critical. In this project, we will empirically investigate the scaling behavior of RT models, with the aim of establishing clear recommendations for allocating a fixed compute budget to achieve the best results. While the LLM scaling literature is a heavy influence, scaling in relational settings has unique considerations, for example the role of schema diversity.
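
To make "allocating a fixed compute budget" concrete, here is a minimal sketch of the kind of analysis involved, assuming a Chinchilla-style loss surface L(N, D) = E + A/N^alpha + B/D^beta with made-up constants and the rough estimate C ≈ 6ND; the actual functional form and fitted values for relational transformers are precisely what this project would establish.

    import numpy as np

    E, A, B, alpha, beta = 1.7, 400.0, 410.0, 0.34, 0.28   # illustrative, made-up constants

    def loss(N, D):
        # Assumed loss surface over parameter count N and training tokens D.
        return E + A / N**alpha + B / D**beta

    def optimal_allocation(C, n_grid=10_000):
        """For a fixed compute budget C (FLOPs), grid-search the model size N
        that minimizes loss(N, C / (6 N))."""
        N = np.logspace(6, 11, n_grid)      # 1M to 100B parameters
        D = C / (6.0 * N)                   # tokens implied by the budget
        i = np.argmin(loss(N, D))
        return N[i], D[i]

    N_opt, D_opt = optimal_allocation(C=1e21)
    print(f"N* ~ {N_opt:.2e} params, D* ~ {D_opt:.2e} tokens")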

This project involves large-scale empirical analyses. Important skills include disciplined execution, careful experiment management, scientific rigor, and the curiosity to learn quickly from the relevant literature. A background in machine learning is required.

Pretraining Relational Database Foundation Model with Link-Prediction Tasks

Keywords: Relational Databases, Transformers, Pretraining, Foundation Model, Link Prediction, Recommendation

Relational databases are the backbone of modern data infrastructure, supporting critical applications in domains such as e-commerce, finance, and healthcare. By organizing information across interlinked tables, they capture the structured relationships among real-world entities. Recent advances in Relational Deep Learning (RDL) and Relational Transformers have transformed predictive modeling in this setting by enabling models to operate directly on relational structures. This progress sets the stage for developing a general-purpose foundation model for relational databases, capable of learning from and adapting across diverse schemas and tasks, and unlocking the full potential of relational data for AI.

In this project, we aim to explore pretraining strategies for relational databases using relational transformer architectures. Our focus is on capturing the structural characteristics of complex database schemas, with link prediction as a central task. By modeling the rich relational structure inherent in multi-table data, our goal is to develop a foundation model for relational databases that enables powerful and generalizable recommendation capabilities.
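
As a loose illustration of what a link-prediction pretraining signal can look like (not our actual architecture or loss), the sketch below scores (source, destination) entity pairs with an embedding dot product and trains with a BPR-style ranking loss that pushes observed links above sampled negatives.

    import torch
    import torch.nn.functional as F

    n_src, n_dst, dim = 1_000, 5_000, 64
    src_emb = torch.nn.Embedding(n_src, dim)
    dst_emb = torch.nn.Embedding(n_dst, dim)
    opt = torch.optim.Adam(list(src_emb.parameters()) + list(dst_emb.parameters()), lr=1e-3)

    def bpr_step(src, pos_dst, neg_dst):
        """src, pos_dst, neg_dst: (batch,) index tensors; observed links
        (src, pos_dst) should score higher than sampled (src, neg_dst)."""
        s = src_emb(src)
        pos = (s * dst_emb(pos_dst)).sum(-1)
        neg = (s * dst_emb(neg_dst)).sum(-1)
        loss = -F.logsigmoid(pos - neg).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    # Toy batch: observed links plus uniformly sampled negative destinations.
    src = torch.randint(0, n_src, (256,))
    pos = torch.randint(0, n_dst, (256,))
    neg = torch.randint(0, n_dst, (256,))
    bpr_step(src, pos, neg)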

We are looking for highly motivated students who have a background in machine learning and deep learning (e.g., CS224W, CS230, CS231N, CS336) and experience in data processing. Strong coding skills and proficiency in PyTorch are required. Experience with distributed computing is a plus.

Building User Simulators from Large Language Models

Keywords: Human-centric LLMs, Reasoning, Simulation

Collecting large-scale real-world user data is expensive and slow. Can we use Large Language Models (LLMs) to simulate users for efficient data collection? While LLMs are good at solving problems, they often fail to understand human intentions and to simulate realistic human behavior. This project studies the question: how can we make LLMs think and act adaptively, like real users? Our goal is to build an effective training framework that gives LLMs diverse, human-like reasoning, so that we can collect realistic user data efficiently from these LLMs instead of from real users.

We are looking for highly motivated students with experience in machine learning (ML), natural language processing (NLP), ML systems, and engineering. A strong background in PyTorch is recommended. Experience with LLM training frameworks (e.g., verl, trl, torchrun) and distributed computing would also be beneficial.

Massive Hypothesis Mining with Agentic Verifications

Keywords: AI Agents, Statistical Tests, Data Science, AI for Science

Building on our lab's previous work, this project will develop AI agents that generate scientific hypotheses at scale and rigorously validate them, using adaptive planning, statistical guardrails, and self-assessment loops to move far beyond manual prompting.
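
One example of the kind of statistical guardrail that matters when mining hypotheses at scale is false-discovery-rate control: if an agent proposes thousands of hypotheses, accepting every p < 0.05 yields many false positives. The sketch below is a plain Benjamini-Hochberg procedure over a batch of p-values, independent of any agent framework.

    import numpy as np

    def benjamini_hochberg(pvalues, q=0.05):
        """Return a boolean mask of hypotheses rejected at FDR level q."""
        p = np.asarray(pvalues)
        m = len(p)
        order = np.argsort(p)
        below = p[order] <= q * np.arange(1, m + 1) / m
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max()   # largest sorted index passing its threshold
            reject[order[:k + 1]] = True
        return reject

    # Toy example: 1000 null p-values plus 20 genuine signals.
    rng = np.random.default_rng(0)
    pvals = np.concatenate([rng.uniform(size=1000), rng.uniform(0.0, 1e-4, size=20)])
    print(benjamini_hochberg(pvals).sum(), "of", len(pvals), "hypotheses survive FDR control")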

We are looking for highly motivated students with experience in machine learning (ML), natural language processing (NLP), ML systems, and engineering. A strong background in building AI Agents is recommended.

Inverse Perturbation Modeling Using GenAI and Natural Language

Keywords: Multimodal Modeling, Generative AI, Biomedical AI

Understanding the impact of perturbations on a cell is a challenging and actively studied problem, but there are no suitable methods for the equally interesting inverse problem: predicting the perturbation that caused an observed change in cellular state, which would enable exciting applications in synthetic biology. Recent generative AI methods can learn embeddings of distributions (Meta Flow Matching, Atanackovic et al., 2024), offering an avenue for learning this inverse problem. In this project, we aim to develop probabilistic machine learning methods that decode perturbations, such as gene knockouts, from a source and a target distribution of cell measurements. We will aggregate public and in-house perturbation data and train the first foundation model for perturbation inference. We aim to apply the model to improve cancer therapy methods in collaboration with a lab at the Department of Medicine.
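
Purely to illustrate the input/output structure of the inverse problem (not the project's actual method, which would build on distribution embeddings such as Meta Flow Matching), the sketch below embeds each cell, mean-pools the source and target cell populations into distribution-level embeddings, and classifies which perturbation maps the source onto the target.

    import torch
    import torch.nn as nn

    n_genes, n_perturbations, hidden = 2_000, 50, 128
    cell_encoder = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
    classifier = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_perturbations))

    def predict_perturbation(source_cells, target_cells):
        """source_cells, target_cells: (n_cells, n_genes) expression matrices."""
        src = cell_encoder(source_cells).mean(dim=0)   # embedding of the unperturbed population
        tgt = cell_encoder(target_cells).mean(dim=0)   # embedding of the perturbed population
        return classifier(torch.cat([src, tgt])).softmax(-1)

    # Toy usage with random data standing in for real single-cell measurements.
    probs = predict_perturbation(torch.randn(500, n_genes), torch.randn(400, n_genes))
    print("most likely perturbation index:", probs.argmax().item())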

We are seeking a highly motivated student with experience in probabilistic/generative deep learning (CS229, CS230, CS236) and, optionally, biology (CS273A/B). A strong background in PyTorch is required.

Go to the application form.