Molecular property prediction using GNNs

Boiling point predictions


Context

Boiling point prediction is crucial for designing and optimizing separation processes like distillation, which is widely used in chemical and petrochemical industries. Accurate predictions help in selecting suitable solvents, designing reactors, and ensuring safety by avoiding unexpected phase changes. It reduces the need for costly and time-consuming experimental measurements. Reliable models also support process simulation and optimization, enhancing efficiency and reducing operational costs.

Figure 1. Example molecule – methanol. Image from Wikimedia

Molecular property prediction is commonly approached with Graph Neural Networks (GNNs), which offer distinct advantages in data representation and learning capacity.

  • Molecules as graphs: GNNs model molecules as graphs, with atoms as nodes and bonds as edges, enabling them to learn molecular features directly from structure without handcrafted descriptors [1,2]. Through iterative message passing, GNNs aggregate information from neighbouring atoms to capture complex interactions and structural dependencies. This allows them to generalize well across diverse chemical spaces, making them particularly effective for large datasets and novel compounds, though they require substantial computational resources and training data.

Task description

In this Kaggle challenge, you are tasked with predicting the boiling points of molecules using GNNs.

Figure 2: Example GNN for predicting boiling point. Figure adapted from Schweidtmann et al.

The prediction pipeline begins with SMILES strings, a textual representation of molecules that encodes atoms and bonds. These are converted into molecular graphs using RDKit, where atoms are treated as nodes and bonds as edges. Each node and edge is initialized with basic chemical features (e.g. atom type, hybridization, bond type).
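The SMILES-to-graph step can be sketched with RDKit as below. This is a minimal illustration, not the competition's reference featurizer: real pipelines typically one-hot encode many more atom and bond descriptors (formal charge, aromaticity, ring membership, etc.).

```python
from rdkit import Chem

def smiles_to_graph(smiles: str):
    """Convert a SMILES string into a simple (nodes, edges) graph.

    Node features: atomic number and hybridization tag.
    Edge features: bond order between the two atom indices.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    # One node per heavy atom (hydrogens are implicit in RDKit by default)
    nodes = [(a.GetAtomicNum(), str(a.GetHybridization())) for a in mol.GetAtoms()]
    # One edge per bond, stored as (begin index, end index, bond order)
    edges = [
        (b.GetBeginAtomIdx(), b.GetEndAtomIdx(), b.GetBondTypeAsDouble())
        for b in mol.GetBonds()
    ]
    return nodes, edges

# Ethanol has 3 heavy atoms (C, C, O) joined by 2 single bonds
nodes, edges = smiles_to_graph("CCO")
```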

A GNN processes this graph through several graph convolution layers, where information is iteratively passed between connected atoms. This allows the model to learn complex local and global chemical patterns (Figure 2). After multiple message-passing steps, the node embeddings are pooled into a single molecular fingerprint vector that captures the structure-property relationship of the entire molecule. Finally, this vector is fed into a fully connected neural network (multilayer perceptron) to predict the boiling point. The entire model is trained end-to-end using backpropagation. This architecture allows the model to learn directly from molecular structure without the need for handcrafted features, making it a powerful tool for molecular property prediction.
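The data flow above (aggregate over neighbours, pool into a fingerprint, read out a scalar) can be illustrated with a toy NumPy forward pass. The weights here are random and untrained, so this only shows the shapes and operations, not a usable model; practical implementations use a GNN library such as PyTorch Geometric.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_forward(X, A, W_msg, w_out):
    """One message-passing layer, mean pooling, and a linear readout.

    X: (n_atoms, d) node feature matrix; A: (n_atoms, n_atoms) adjacency.
    """
    A_hat = A + np.eye(A.shape[0])           # add self-loops so each atom keeps its own state
    deg = A_hat.sum(axis=1, keepdims=True)   # neighbour counts for mean aggregation
    H = np.maximum(0.0, (A_hat / deg) @ X @ W_msg)  # aggregate neighbours, transform, ReLU
    fingerprint = H.mean(axis=0)             # pool node embeddings -> molecular fingerprint
    return float(fingerprint @ w_out)        # linear readout -> scalar prediction

# Toy 3-atom chain (e.g. the heavy-atom skeleton of ethanol), 4 features per atom
X = rng.normal(size=(3, 4))
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
W_msg = rng.normal(size=(4, 8))
w_out = rng.normal(size=8)
pred = gnn_forward(X, A, W_msg, w_out)  # one scalar per molecule
```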

Dataset

https://www.kaggle.com/competitions/molecular-property-prediction-using-gnns/data

Dataset Description

In this part, we utilize the dataset from the paper: Novel method for prediction of normal boiling point and enthalpy of vaporization at normal boiling point of pure refrigerants: A QSPR approach.

Data Split

We split the whole dataset into training, validation, and test sets, which contain 153, 19, and 20 molecules, respectively.

Column Descriptors

Each dataset contains the following features:

| Feature | Datatype | Description |
| --- | --- | --- |
| SMILES | String | The SMILES string of the refrigerant. |
| Boiling point/K | Float | (Target) The experimental boiling point in Kelvin. |

Files

  • train.csv – the training set (153 rows)
  • validation.csv – the validation set (19 rows)
  • test.csv – the test set (20 rows)

  • sample_submission.csv – a sample submission file in the correct format

Citations & License

This dataset is sourced from the QsarDB repository (Archive 10967/128). If you use this dataset, please cite the original authors:

Abooali, D., & Sobati, M. A. (2014). “Novel method for prediction of normal boiling point and enthalpy of vaporization at normal boiling point of pure refrigerants: A QSPR approach.” International Journal of Refrigeration, 40, 282–293. License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Evaluation method

📏 RMSE

RMSE (Root Mean Square Error) is a standard metric used to evaluate regression models. It measures the average magnitude of the prediction errors, with greater penalty on larger errors due to squaring. A lower RMSE indicates better predictive accuracy.

The formula for RMSE is:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2} $$

Where:

  • N is the total number of samples
  • yᵢ is the true value for the i-th sample
  • ŷᵢ is the predicted value for the i-th sample
  • (ŷᵢ − yᵢ)² is the squared prediction error for each sample

RMSE is especially useful when large prediction errors are unacceptable. Since it shares the same unit as the target variable (e.g., Kelvin for boiling points), it’s intuitive for interpreting model performance.
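The metric is a few lines of NumPy; the example values below are hypothetical boiling points chosen only to illustrate the calculation.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error, in the same units as the target (Kelvin here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Errors of 0, 0, and 2 K give RMSE = sqrt(4/3) ≈ 1.155 K
score = rmse([350.0, 360.0, 370.0], [350.0, 360.0, 372.0])
```

Note how the single 2 K error dominates the score: squaring is what makes RMSE penalize large errors more heavily than, say, mean absolute error.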

Submission File

For each SMILES in the test set, you must predict the boiling point in Kelvin (the Boiling point/K column). The file should contain a header and have the following format (prediction values shown are placeholders):

SMILES,Boiling point/K
CCO,351.4
c1ccccc1,353.2
CC(=O)O,391.0
CC@HC(=O)O,390.0
C1=CC=CN=C1,388.4
etc.
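A submission file can be written with the standard-library csv module. The `write_submission` helper and the prediction values below are illustrative, not part of the competition code; only the header must match the sample submission exactly.

```python
import csv

def write_submission(predictions, path="submission.csv"):
    """Write (SMILES, predicted boiling point in K) pairs as a submission CSV.

    `predictions` is an iterable of (smiles, kelvin) tuples.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["SMILES", "Boiling point/K"])  # header must match exactly
        for smiles, bp in predictions:
            writer.writerow([smiles, f"{bp:.2f}"])

# Placeholder predictions for two test molecules
write_submission([("CCO", 351.44), ("c1ccccc1", 353.25)])
```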