Machine Learning in Structural Biology

Workshop at the 38th Conference on Neural Information Processing Systems

15th December 2024

About

Structural biology, the study of the 3D structure or shape of proteins and other biomolecules, has been transformed by breakthroughs from machine learning algorithms. Experimentalists now routinely use machine learning models to predict structures that aid hypothesis generation and experimental design, and to accelerate the experimental process of structure determination (e.g., computer vision algorithms for cryo-electron microscopy). These models have also become a new industry standard for engineering new protein therapeutics (e.g., large language models for protein design). Despite this progress, the field still faces many open challenges, such as modeling protein dynamics, predicting the structure of other classes of biomolecules such as RNA, learning and generalizing the underlying physics driving protein folding, and relating the structure of isolated proteins to the in vivo and contextual nature of their underlying function. These challenges are diverse and interdisciplinary, motivating new kinds of machine learning methods and requiring the development and maturation of standard benchmarks and datasets.

Machine Learning in Structural Biology (MLSB) seeks to bring together field experts, practitioners, and students from across academia, industry research groups, and pharmaceutical companies to focus on these new challenges and opportunities. This year, MLSB aims to bridge the theoretical and the practical by addressing the outstanding computational and experimental problems at the forefront of our field. The intersection of artificial intelligence and structural biology promises to unlock new scientific discoveries and develop powerful design tools.

MLSB will be an in-person NeurIPS workshop on 15th December 2024 in MTG Rooms 11 & 12 at the Vancouver Convention Center.

Please contact the organizers at workshopmlsb@gmail.com with any questions.

Stay updated on changes and workshop news by joining our mailing list.

Presenter Information

Congratulations to all accepted presenters! Below is information on deadlines and expectations leading up to the MLSB workshop.

Posters

We ask all authors to prepare a poster that can be presented as part of our workshop. Posters must be 24 inches wide by 36 inches tall (portrait layout) and will be taped to the wall; poster boards will not be provided at the workshop. We ask for portrait layout specifically because wall space will be tight.

Additionally, a virtual copy of each poster must be uploaded to the NeurIPS poster upload portal by Thursday, December 12. Posters must be PNG files no larger than 5120 pixels wide by 2880 pixels high and no more than 10 MB. Thumbnail images should be PNG files of 320 pixels wide by 256 pixels high and no more than 5 MB. We know these dimensions differ from those we ask for in-person posters; the poster upload dimensions are set by NeurIPS.
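As a convenience, the upload limits above can be checked before submitting. The following is a minimal sketch, not an official NeurIPS tool: the function names are hypothetical, it reads the width and height from the PNG header using only the standard library, and it assumes 1 MB means 1024 x 1024 bytes, which the portal does not specify here.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Width and height are two big-endian 32-bit integers after "IHDR".
    return struct.unpack(">II", data[16:24])


def check_upload(data: bytes, max_w: int, max_h: int, max_mb: float,
                 exact: bool = False) -> list[str]:
    """Return a list of problems; an empty list means the file passes.

    With exact=True the image must match max_w x max_h exactly
    (used here for the thumbnail); otherwise they are upper bounds.
    """
    problems = []
    w, h = png_size(data)
    if exact and (w, h) != (max_w, max_h):
        problems.append(f"size is {w}x{h}, expected {max_w}x{max_h}")
    elif not exact and (w > max_w or h > max_h):
        problems.append(f"size is {w}x{h}, limit is {max_w}x{max_h}")
    # Assumption: 1 MB = 1024 * 1024 bytes.
    if len(data) > max_mb * 1024 * 1024:
        problems.append(f"file is larger than {max_mb} MB")
    return problems


def check_poster_file(path: str) -> list[str]:
    """Check a poster PNG on disk against the 5120x2880, 10 MB limits."""
    return check_upload(Path(path).read_bytes(), 5120, 2880, 10)


def check_thumbnail_file(path: str) -> list[str]:
    """Check a thumbnail PNG on disk against the 320x256, 5 MB spec."""
    return check_upload(Path(path).read_bytes(), 320, 256, 5, exact=True)
```

Calling `check_poster_file("poster.png")` (the file name is only an example) returns an empty list when the file is within the limits.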

Users should log in using the neurips.cc account associated with their CMT email address. If they did not already have a neurips.cc account, then it should have automatically been created and can be accessed by resetting the password.

Paper Camera-Ready

De-anonymized, camera-ready versions of the workshop paper are due on Microsoft CMT by Monday, Dec 2. Papers must indicate that they are NeurIPS MLSB workshop papers by using the modified NeurIPS style file here. Papers should be compiled with the `final` argument, e.g. `\usepackage[final]{neurips_mlsb_2024}`.

We plan to make all submitted camera-ready papers available on the workshop website (https://www.mlsb.io/). If you would prefer that your work not be shared, there is no need to submit a camera-ready version.

Travel Award

This year we will try to cover as many workshop registrations as possible for student/academic attendees with oral presentations or posters who need financial assistance. If you would like to be considered, please fill out the following form by Friday, Nov 15. If you have any questions, please don't hesitate to contact us at workshopmlsb@gmail.com.

Key Dates

Application for Registration Reimbursement: Friday, November 15th, 2024, at 11:59PM, Anywhere on Earth.

Camera-Ready PDF due on Microsoft CMT: Monday, December 2nd, 2024.

Poster due: Thursday, December 12th, 2024.

Call For Papers

We welcome submissions of short papers leveraging machine learning to address problems in structural biology, including but not limited to:

  • Prediction of biomolecular structures, complexes, and interactions
  • Design of or generative models for structure and/or sequence
  • Methods for structure determination / biophysics (Cryo-EM/ET, NMR, crystallography, single-molecule methods, etc.)
  • Geometric and symmetry-aware deep learning
  • Conformational change, ensembles, and dynamics
  • Integration of biomolecular physics
  • Function and property prediction
  • Structural bioinformatics and systems biology
  • Therapeutic screening and design
  • Language models and other implicit representations of protein structure
  • Forward-looking position papers

We request anonymized PDF submissions by Friday, September 20, 2024, at 11:59PM, AoE (anywhere on earth) through our submission website on CMT.
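Because "Anywhere on Earth" deadlines are easy to misread, the sketch below converts one to UTC. It relies on the standard definition of AoE as UTC-12; the helper name `aoe_deadline_utc` is hypothetical and only illustrates the arithmetic.

```python
from datetime import datetime, timedelta, timezone

# Anywhere on Earth (AoE) corresponds to the UTC-12 offset.
AOE = timezone(timedelta(hours=-12), name="AoE")


def aoe_deadline_utc(year: int, month: int, day: int,
                     hour: int = 23, minute: int = 59) -> datetime:
    """Return the UTC instant of a deadline stated in AoE local time."""
    local = datetime(year, month, day, hour, minute, tzinfo=AOE)
    return local.astimezone(timezone.utc)


# Submission deadline: Friday, September 20, 2024, 11:59PM AoE.
print(aoe_deadline_utc(2024, 9, 20).isoformat())  # 2024-09-21T11:59:00+00:00
```

In other words, the submission window stays open until 11:59 UTC on the following calendar day.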

Papers should present novel work that has not been previously accepted at an archival venue at the time of submission. Submissions should be a maximum of 5 pages (excluding references and appendices) in PDF format, using the NeurIPS style files, and fully anonymized as per the requirements of NeurIPS. The NeurIPS checklist can be omitted from the submission. Submissions meeting these criteria will go through a light, double-blind review process. Reviewer comments will be returned to the authors as feedback.

Authors of accepted papers will be invited to present a poster at the workshop, with nominations for spotlight talks at the discretion of the organizers.

New this year, we will have two special tracks for models for predicting protein-protein and protein-ligand interactions, evaluated on two new large-scale benchmarks, PINDER and PLINDER. The highest-performing open-source methods from these two tracks will receive invitations to a spotlight presentation. Stay tuned for more information on how to submit to these tracks.

Like last year, authors who commit to open-sourcing the code, model weights, and datasets used in their work will be given precedence for spotlight talks. This policy only affects consideration for spotlights. Submissions that cannot make this commitment will still be considered for posters and will not be penalized in acceptance decisions.

This workshop is considered non-archival; however, authors of accepted contributions will have the option to make their work available through the workshop website. Presentation of work that is concurrently in submission is welcome. We welcome papers sharing encouraging work-in-progress results or forward-looking position papers that would benefit from feedback and community discussion at our workshop.

Important Dates

Submission Deadline: Friday, September 20th, 2024, at 11:59PM, Anywhere on Earth.

Notification of Acceptance: Wednesday, October 9th, 2024.

Workshop Date: December 15th, 2024, Vancouver, Canada.

Invited Speakers

Erika Alden DeBenedictis

Group Leader, Francis Crick Institute

Gabe Rocklin

Assistant Professor, Department of Pharmacology, Northwestern University.

Jennifer Listgarten

Professor in EECS, UC Berkeley

Milot Mirdita

Postdoctoral Researcher, Seoul National University.

Noelia Ferruz

Group Leader, Center of Genomic Regulation, Barcelona.

AlphaFold3 Team

A model for predicting biomolecular interactions.

Challenge 2024

This year we are running a challenge on the PINDER and PLINDER datasets to evaluate how well current methods from the community perform on protein-protein interaction prediction and protein-ligand complex prediction.

Details of the challenge

To submit your trained model, you will need to build an inference Docker image on Hugging Face Spaces using the following templates:

Rules for model training

  • Participants MUST use the sequences and SMILES in the provided train and validation sets from PINDER or PLINDER. In order to ensure no leakage, external data augmentation is not allowed.
  • If starting structures/conformations need to be generated for the model, this can only be done from the training and validation sequences and SMILES. Note that this applies only to the training and validation sets: no external folding methods or starting structures are allowed for the test set under any circumstances. Only the predicted structures/conformers themselves may be used in this way; the embeddings or models used to generate such predictions may not. For example, it is not valid to “distill” a method that was not trained on PLINDER/PINDER.
  • The PINDER and PLINDER datasets should be used independently; combining the sets is considered augmentation and is not allowed.
  • For inference, only the inputs provided in the evaluation sets may be used: canonical sequences, structures, and MSAs; no alternate templates or sequences are permitted. The inputs that assessors will use for each challenge track are as follows:
    • PLINDER: (SMILES, monomer protein structure, monomer FASTA, monomer MSA)
    • PINDER: (monomer protein structure 1, monomer protein structure 2, FASTA 1, FASTA 2, MSA 1, MSA 2)
  • Model selection must be performed exclusively on the validation set designed for this purpose within the PINDER and PLINDER datasets.
  • Methods relying on any model derivatives or embeddings trained on structures outside the PINDER/PLINDER training set are not permitted (e.g., ESM2, MSA: allowed; ESM3/ESMFold/SAProt/UniMol: not allowed).
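The per-track inputs listed above can be pictured as simple records. The sketch below is purely illustrative: the class and field names are hypothetical and are not taken from the official inference templates, and it assumes each input is passed as a string (a SMILES string or a file path).

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PlinderInput:
    """Hypothetical container for one PLINDER evaluation sample."""
    smiles: str              # ligand SMILES string
    protein_structure: str   # path to the monomer protein structure
    protein_fasta: str       # path to the monomer FASTA
    protein_msa: str         # path to the monomer MSA


@dataclass(frozen=True)
class PinderInput:
    """Hypothetical container for one PINDER evaluation sample."""
    structure_1: str  # path to monomer protein structure 1
    structure_2: str  # path to monomer protein structure 2
    fasta_1: str      # path to FASTA 1
    fasta_2: str      # path to FASTA 2
    msa_1: str        # path to MSA 1
    msa_2: str        # path to MSA 2


def input_names(track: str) -> list[str]:
    """Return the input field names for a challenge track."""
    classes = {"PLINDER": PlinderInput, "PINDER": PinderInput}
    return [f.name for f in fields(classes[track.upper()])]
```

Keeping the inputs in a fixed record like this makes it harder to accidentally pass extra templates or sequences to a model at inference time.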

Please find the technical documentation for how to use the datasets for the challenge:

Rules for valid inference pipeline

The submission system will use Hugging Face Spaces. To qualify for submission, each team must:

  • Provide an MLSB submission ID or a link to a preprint/paper describing their methodology. This publication does not have to specifically report training or evaluation on the P(L)INDER dataset. Previously published methods, such as DiffDock, only need to link their existing paper. Note that entry into this competition does not equate to an MLSB workshop paper submission.
  • Create a copy of the provided inference template. Go to the top right corner of the page, click on the drop-down menu (vertical ellipsis) next to “Community”, and select “Duplicate this space”.
  • Change the files in the newly created space to reflect the specifics of your model:
    • Edit requirements.txt to capture all dependencies.
    • Include an inference_app.py file. This contains a predict function that should be modified to reflect the specifics of inference using your model.
    • Include a train.py file to ensure that training and model selection use only the PINDER/PLINDER datasets and to clearly show any additional hyperparameters used.
    • Provide a LICENSE file that allows for reuse, derivative works, and distribution of the provided software and weights (e.g., MIT or Apache2 license).
    • Modify the Dockerfile as appropriate (including selecting the right base image).
  • Submit to the leaderboard via the designated form.
    • On the submission page, add a reference to the newly created space in the format username/space (e.g., mlsb/alphafold3).
  • How to submit and view results
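Before filling in the submission form, it can help to confirm that the space reference is in the username/space format. The sketch below is a hypothetical helper, not part of the submission system; the allowed character set in the pattern is an assumption, and it also accepts a full Spaces URL and reduces it to the short form.

```python
import re

# Assumption: each part uses letters, digits, '.', '_' or '-'.
SPACE_REF = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*$")
SPACES_URL_PREFIX = "https://huggingface.co/spaces/"


def normalize_space_ref(text: str) -> str:
    """Return a reference in username/space form or raise ValueError."""
    ref = text.strip()
    if ref.startswith(SPACES_URL_PREFIX):
        # Reduce a full Spaces URL to its username/space part.
        ref = ref[len(SPACES_URL_PREFIX):].strip("/")
    if not SPACE_REF.match(ref):
        raise ValueError(f"expected username/space, got {text!r}")
    return ref


print(normalize_space_ref("mlsb/alphafold3"))  # mlsb/alphafold3
```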

    Metrics

    Primary Ranking Metric:
    • PLINDER: lDDT-PLI
    • PINDER: DockQ

    Other metrics computed by PINDER/PLINDER will be displayed on the leaderboard but will not influence the ranking.

    The winners will be invited to present their work at the MLSB workshop.
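To illustrate how a single primary metric can order a leaderboard, here is a minimal sketch. It assumes that per-sample scores are averaged for each submission, which the rules above do not specify; the function name and team names are hypothetical. Both lDDT-PLI and DockQ are higher-is-better scores, so the sort is descending.

```python
from statistics import mean


def rank_submissions(scores: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Rank submissions by mean primary-metric score, best first.

    `scores` maps a submission name to its per-sample metric values
    (lDDT-PLI for PLINDER, DockQ for PINDER). Averaging is an assumption.
    """
    averaged = {name: mean(values) for name, values in scores.items() if values}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


example = {"team_a": [0.5, 0.75], "team_b": [0.25, 0.5]}
print(rank_submissions(example))  # [('team_a', 0.625), ('team_b', 0.375)]
```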

    Evaluation Datasets

    Although the exact composition of the evaluation set will be shared at a future date, below we provide an overview of the dataset and what to expect.

    • Two leaderboards, one for each of PINDER and PLINDER, will be created using a single evaluation set for each.
    • Evaluation sets will be subsets of 150-200 structures from the current PINDER and PLINDER test splits (subsets to enable reasonable eval runtime).
    • Each evaluation sample will contain a predefined input/output to ensure performance assessment is model-dependent, not input-dependent.
    • The focus will be exclusively on flexible docking/co-folding, with a single canonical structure per protein, sampled from apo and predicted structures.
    • Monomer input structures will be sampled from paired structures available in PINDER/PLINDER, balanced between apo and predicted structures and stratified by "flexibility" level according to specified conformational difference thresholds.

    Key Dates

    Training Workshop: September 24th, 2024, virtual (Register here)

    Leaderboard Opens: October 9th, 2024 (following acceptance notifications for MLSB).

    Leaderboard Closes: November 9th, 2024

    Winner Notification: Wednesday, November 27th, 2024

    Questions?

    If you run into trouble, we invite you to join the PINDER/PLINDER Discord server.

    HuggingFace

Organizers

Gabriele Corso
MIT

Gina El Nesr
Stanford University

Vignesh Ram Somnath
ETH Zurich

Zeming Lin
EvolutionaryScale

Simon Duerr
EPFL

Hannah Wayment-Steele
University of Wisconsin–Madison

Sergey Ovchinnikov
MIT