Welcome to the application page for research positions in the SNAP group under Prof. Jure Leskovec, Autumn Quarter 2025-26!
Our group has open positions for Research Assistants and students interested in independent studies and research (CS191, CS195, CS199, CS399). These positions are available to Stanford University students only. Below are some of the possible research projects. All projects are high-impact: participants conduct research on real-world problems and data, leading to research publications or open-source software. Positions are often extended over several quarters. We are looking for highly motivated students with any combination of the following skills: machine learning, data mining, network analysis, algorithms, and computer systems.
Please apply by filling out and submitting the form below. Apply quickly since the selection process starts soon after this announcement is posted. Thanks for your interest!
If you have any questions, please contact Lata Nair at [email protected].
Keywords: Graph Learning, ML Theory, Transformers, Data Processing, Information Retrieval, AI Biology
Trees are a fundamental data structure, and reconstructing trees from leaf features is a classical problem in computer science. Biological data modalities can often be grouped into large tree structures with hundreds of thousands or millions of leaves, presenting a compelling source of training data for machine learning and a challenging, expensive task for classical algorithms. This project seeks to develop a suite of machine learning tools and theory for reconstructing trees (also called hierarchies or phylogenies) with tens to hundreds of thousands of nodes from biological data.
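For context, here is a minimal sketch of the kind of classical baseline such an ML approach would be compared against: agglomerative clustering of leaf feature vectors into a binary hierarchy. The leaf count, feature dimension, and data below are illustrative placeholders, not the project's actual data or method.

```python
# Classical baseline sketch: build a tree over leaves by agglomerative
# clustering of their feature vectors (illustrative data only).
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

rng = np.random.default_rng(0)
leaf_features = rng.normal(size=(1000, 64))   # 1,000 leaves, 64-dim features

# Average-linkage clustering yields a binary hierarchy over the leaves.
Z = linkage(leaf_features, method="average", metric="euclidean")
root = to_tree(Z)                              # root of the reconstructed tree

print("leaves under root:", root.get_count())
```

Baselines like this require all pairwise leaf distances, which is what makes classical reconstruction expensive at the scales (tens to hundreds of thousands of leaves) this project targets.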
We are looking for highly motivated students who have a background in machine learning, deep learning, and CS/ML theory (e.g., CS224W, CS230, CS231N). Experience in computational biology is not required. Strong coding skills and proficiency in PyTorch are required.
Keywords: In-Context Learning, Pre-Training, Foundation Model, Relational Databases
Pre-trained relational transformer (RT) models achieve remarkable transfer to unseen datasets. However, we currently rely on fine-tuning to reach satisfactory performance, even though we have overwhelming evidence that RT is capable of In-Context Learning (ICL), i.e., learning a new dataset entirely from the context window in a single forward pass. In this project, we will build on this surprising observation by carefully pre-training for strong ICL capabilities: (1) scaling up the context length, (2) scaling up the pre-training data, and (3) retrieving relevant in-context examples.
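To make point (3) concrete, here is a hedged sketch of retrieving in-context examples by nearest-neighbor search over row embeddings. The embedding source and the way examples are packed into the RT's context window are hypothetical placeholders, not the project's actual pipeline.

```python
# Sketch: pick the k training rows most similar to the query row, to be
# placed in the context window for a single forward pass (no fine-tuning).
import torch

def retrieve_in_context_examples(query_emb, train_embs, k=32):
    """Return indices of the k most similar training rows (cosine similarity)."""
    sims = train_embs @ query_emb            # rows are unit-normalized
    return torch.topk(sims, k=k).indices

query_emb = torch.nn.functional.normalize(torch.randn(128), dim=0)
train_embs = torch.nn.functional.normalize(torch.randn(10_000, 128), dim=1)

context_ids = retrieve_in_context_examples(query_emb, train_embs, k=32)
# The retrieved rows (with labels) would be concatenated with the query row
# and fed to the pre-trained RT in one forward pass -- the ICL setting
# studied in this project.
```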
This project involves pre-training foundation models at scale. Strong experience with deep learning, foundation models, and systems is required, including proficiency with PyTorch, Python, and data processing. Familiarity with Rust is a plus, as it powers our highly efficient data pipeline.
Keywords: Scaling Laws, Pre-Training, Foundation Model, Relational Databases
Pre-trained relational transformer (RT) models as small as 22M parameters and pre-trained on as few as 6 datasets already show very strong transfer to completely unseen datasets. However, to pre-train strong relational foundation models, scaling up both the model size and the pre-training data is critical. In this project, we will empirically investigate the scaling behavior of RT models with the aim of establishing clear recommendations for allocating a fixed compute budget to achieve the best results. While LLM scaling laws are a strong point of reference, scaling in relational settings has unique considerations, for example the role of schema diversity.
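As an illustration of the kind of analysis involved, here is a minimal sketch of fitting a power-law scaling curve to (compute, loss) measurements. The functional form is the standard saturating power law used in scaling-law studies; the numbers below are made up for illustration and are not results.

```python
# Sketch: fit validation loss as a power law in compute,
# L(C) = a * C**(-alpha) + c, from a handful of training runs.
import numpy as np
from scipy.optimize import curve_fit

def power_law(C, a, alpha, c):
    return a * C ** (-alpha) + c

compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])   # e.g. training FLOPs
loss    = np.array([2.10, 1.95, 1.82, 1.73, 1.66])    # illustrative val losses

C = compute / compute.min()                            # rescale for a stable fit
(a, alpha, c), _ = curve_fit(power_law, C, loss, p0=(1.0, 0.3, 1.0))
print(f"fitted exponent alpha = {alpha:.3f}, irreducible loss c = {c:.3f}")
```

Fits like this, repeated across model sizes and data mixes, are what turn raw experiment logs into compute-allocation recommendations.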
This project involves large-scale empirical analyses. Disciplined execution, experiment management, scientific rigor, and the curiosity to learn quickly from the relevant literature are important skills. A background in machine learning is required.
Keywords: Relational Databases, Transformers, Pretraining, Foundation Model, Link Prediction, Recommendation
Relational databases are the backbone of modern data infrastructure, supporting critical applications in domains such as e-commerce, finance, and healthcare. By organizing information across interlinked tables, they capture the structured relationships among real-world entities. Recent advances in Relational Deep Learning (RDL) and Relational Transformers have transformed predictive modeling in this setting by enabling models to operate directly on relational structures. This progress sets the stage for developing a general-purpose foundation model for relational databases, capable of learning from and adapting across diverse schemas and tasks, and unlocking the full potential of relational data for AI. In this project, we aim to explore pretraining strategies for relational databases using relational transformer architectures. Our focus is on capturing the structural characteristics of complex database schemas, with link prediction as a central task. By modeling the rich relational structure inherent in multi-table data, our goal is to develop a foundation model for relational databases that enables powerful and generalizable recommendation capabilities.
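Since link prediction is the central task, here is a hedged sketch of a generic link-prediction training objective: score entity pairs by an embedding dot product and train against sampled negatives. This is a simplified stand-in, not the relational transformer architecture itself; entity counts and dimensions are illustrative.

```python
# Sketch of a link-prediction objective with negative sampling.
import torch
import torch.nn as nn

n_src, n_dst, d = 1_000, 5_000, 64
src_emb = nn.Embedding(n_src, d)     # e.g. customers
dst_emb = nn.Embedding(n_dst, d)     # e.g. products

def link_loss(src_idx, pos_idx, neg_idx):
    s = src_emb(src_idx)
    pos_score = (s * dst_emb(pos_idx)).sum(-1)   # observed links
    neg_score = (s * dst_emb(neg_idx)).sum(-1)   # sampled non-links
    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat([torch.ones_like(pos_score), torch.zeros_like(neg_score)])
    return nn.functional.binary_cross_entropy_with_logits(scores, labels)

# One illustrative step on random indices; in the project, the pair scores
# would come from the pre-trained relational transformer rather than free
# embedding tables.
loss = link_loss(torch.randint(0, n_src, (256,)),
                 torch.randint(0, n_dst, (256,)),
                 torch.randint(0, n_dst, (256,)))
loss.backward()
```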
We are looking for highly motivated students who have a background in machine learning and deep learning (e.g., CS224W, CS230, CS231N, CS336) and experience in data processing. Strong coding skills and proficiency in PyTorch are required. Experience with distributed computing is a plus.
Keywords: Human-centric LLMs, Reasoning, Simulation
Collecting large-scale real-world user data is expensive and slow. Can we use Large Language Models (LLMs) to simulate users for efficient data collection? While LLMs are good at solving problems, they often fail to understand human intentions and to simulate realistic human behavior. This project will study the question: how can we make LLMs think and act adaptively like users? Our goal is to build an effective training framework that gives LLMs diverse, human-like reasoning, so that we can collect realistic user data efficiently from these LLMs instead of from real users.
We are looking for highly motivated students with experience in machine learning (ML), natural language processing (NLP), ML systems, and engineering. A strong background in PyTorch is recommended. Experience with LLM training frameworks (e.g., verl, trl, torchrun) and distributed computing would also be beneficial.
Keywords: AI Agents, Statistical Tests, Data Science, AI for Science
Leveraging our lab's previous work, this project will build AI agents that generate scientific hypotheses at scale and rigorously validate them, using adaptive planning, statistical guardrails, and self-assessment loops to move far beyond manual prompting.
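As one concrete example of a statistical guardrail (illustrative only, not the lab's actual pipeline): when an agent generates hundreds of candidate hypotheses, the false discovery rate should be controlled before any hypothesis is treated as validated. A minimal sketch, assuming one p-value per agent-generated hypothesis:

```python
# Sketch: Benjamini-Hochberg FDR control over many agent-generated tests.
# The p-values here are random placeholders, not real experimental results.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
p_values = rng.uniform(size=500)                  # one test per hypothesis

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"{int(reject.sum())} of {p_values.size} hypotheses survive FDR control")
```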
We are looking for highly motivated students with experience in machine learning (ML), natural language processing (NLP), ML systems, and engineering. A strong background in building AI Agents is recommended.
Keywords: Multimodal Modeling, Generative AI, Biomedical AI
Understanding the impact of perturbations to a cell is a challenging and actively studied problem, but there are no suitable methods for the equally interesting inverse problem: predicting the perturbation that caused an observed change in cellular state, which would enable exciting applications in synthetic biology. Recent generative AI methods provide a means to learn embeddings of distributions (Meta Flow Matching, Atanackovic et al., 2024), offering an avenue to tackle this inverse problem. In this project, we aim to develop probabilistic machine learning methods that decode perturbations, such as gene knockouts, from a source and a target distribution of cell measurements. We will aggregate public and in-house perturbation data and train the first foundation model for perturbation inference. We aim to apply the model to improve cancer therapies with a collaborating lab at the Department of Medicine.
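To illustrate the inverse-problem setup only, here is a hedged sketch of a much simpler discriminative stand-in: encode the source and target cell populations by mean pooling and predict which perturbation caused the shift. The probabilistic/generative methods the project will develop go well beyond this; all sizes and data below are illustrative.

```python
# Sketch: given source and target populations of cell measurements,
# predict the perturbation label (discriminative toy version).
import torch
import torch.nn as nn

n_genes, n_perturbations = 2_000, 50

class PerturbationDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_genes, 512), nn.ReLU(),
            nn.Linear(512, n_perturbations),
        )

    def forward(self, source_cells, target_cells):
        # Mean-pool each population into a single distribution summary.
        src = source_cells.mean(dim=0)
        tgt = target_cells.mean(dim=0)
        return self.mlp(torch.cat([src, tgt]))   # logits over perturbations

model = PerturbationDecoder()
logits = model(torch.randn(300, n_genes), torch.randn(250, n_genes))
print("predicted perturbation:", logits.argmax().item())
```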
We are seeking a highly motivated student with experience in probabilistic/generative deep learning (CS229, CS230, CS236) and, optionally, computational biology (CS273A/B). A strong background in PyTorch is required.
Go to the application form.