Vassilis Gkatsis, Petros Maratos, Christoforos Rekatsinas, George Giannakopoulos and Panagiotis Krokidas
Abstract
Machine learning algorithms often rely on large training datasets to achieve high performance. However, in domains like chemistry and materials science, acquiring such data is an expensive and laborious process that requires highly trained experts and incurs material costs. Therefore, it is crucial to develop strategies that minimize the size of training sets while preserving predictive accuracy. The objective is to select an optimal subset of data points from a larger pool of possible samples, one that is sufficiently
informative to train an effective machine learning model. Active learning (AL) methods, which iteratively annotate data points by querying an oracle (e.g., a scientist conducting experiments), have proven highly effective for such tasks. However, challenges remain, particularly for regression tasks, which are generally considered more complex in the AL framework. This complexity stems from the need for uncertainty estimation and the continuous nature of the output space. In this work, we introduce
density-aware greedy sampling (DAGS), an active learning method for regression that integrates uncertainty estimation with data density, specifically designed for large design spaces (DS). We evaluate DAGS on both synthetic data and multiple real-world datasets of functionalized nanoporous materials, such as metal–organic frameworks (MOFs) and covalent–organic frameworks (COFs), for separation applications. Our results demonstrate that DAGS consistently outperforms both random sampling and
state-of-the-art AL techniques in training regression models effectively with a limited number of data points, even in datasets with a high number of features.
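
As a rough illustration of the pool-based setting described above, the following Python sketch runs a query loop in which each candidate is scored by the product of a novelty term and a density term. It is not the DAGS formulation, which the abstract does not specify. The function names, the Gaussian-kernel density estimate, the `bandwidth` parameter, and the use of distance to the nearest labeled point as a stand-in for model uncertainty are all assumptions made for illustration.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distances between rows of A (n, d) and rows of B (m, d)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def density_weights(X, bandwidth=1.0):
    """Unnormalized Gaussian-kernel density estimate at each pool point (assumed form)."""
    D = pairwise_dist(X, X)
    return np.exp(-(D ** 2) / (2.0 * bandwidth ** 2)).mean(axis=1)

def select_next(X_pool, labeled_idx, density):
    """Return the unlabeled index maximizing (distance to labeled set) * density."""
    labeled = np.asarray(labeled_idx)
    # Distance to the nearest labeled point: a simple proxy for uncertainty.
    novelty = pairwise_dist(X_pool, X_pool[labeled]).min(axis=1)
    score = novelty * density
    score[labeled] = -np.inf  # never re-query an already labeled point
    return int(np.argmax(score))

def active_learning_loop(X_pool, oracle, n_init=2, budget=10, seed=0):
    """Query the oracle until `budget` points in total are labeled."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(X_pool), size=n_init, replace=False)
    labeled_idx = [int(i) for i in init]
    y = {i: oracle(X_pool[i]) for i in labeled_idx}
    density = density_weights(X_pool)  # depends only on the pool, so computed once
    while len(labeled_idx) < budget:
        i = select_next(X_pool, labeled_idx, density)
        labeled_idx.append(i)
        y[i] = oracle(X_pool[i])  # the expensive step, e.g. an experiment
    return labeled_idx, y
```

Here `oracle` is any callable that returns a label for a single feature vector, for instance a simulation or a lookup of a measured property. In practice the novelty term would likely be replaced by an uncertainty estimate from the regression model itself.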

