Automated Feature Engineering Using Meta-Learning for Efficient and Generalizable Data Science Pipelines

Helda Yudhiastuti; Shafiq Hussain; Irfa Shabbir

doi:10.61453/jods.v20260104

Authors

Helda Yudhiastuti, Universitas Binadarma, Palembang, Indonesia
Shafiq Hussain, University of Sahiwal, Sahiwal, Pakistan
Irfa Shabbir, COMSATS University Islamabad, Islamabad, Pakistan

DOI:

https://doi.org/10.61453/jods.v20260104

Keywords:

Automated Machine Learning, Feature Engineering, Meta-Learning, Data Pipelines, AutoML

Abstract

Feature engineering remains one of the most time-intensive and expertise-dependent stages of machine learning pipelines, and it often limits scalability and reproducibility. Despite advances in automated machine learning, existing systems largely emphasize model and hyperparameter optimization while leaving feature construction partially manual and task-specific. This reveals a critical research gap: the absence of a transferable, experience-driven mechanism capable of generalizing feature engineering knowledge across heterogeneous datasets. To address this limitation, this study proposes a meta-learning–based automated feature engineering framework that models transformation selection as a learnable mapping from dataset meta-characteristics to transformation utility. The framework constructs a reusable meta-knowledge layer trained on historical task–transformation–performance relationships and applies ranked transformation strategies to unseen datasets under computational constraints. Experiments on diverse classification and regression datasets show that the proposed approach achieves up to a 4.2% improvement in F1-score and an 8.3% reduction in RMSE compared with raw-feature baselines, while matching or exceeding the performance of manually engineered pipelines. In addition, development time is reduced by up to 55%, and search complexity decreases by approximately 60% through ranking-based pruning. These findings indicate that feature engineering can be formalized as a transferable meta-learning problem, enabling scalable, efficient, and generalizable data science workflows. The study advances the automation of representation construction and supports the integration of intelligent meta-knowledge reuse in next-generation AutoML systems.
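The mapping from dataset meta-characteristics to transformation utility, followed by ranking-based pruning, can be illustrated with a small sketch. This is not the authors' implementation: the transformation catalogue, the four meta-features, the ridge-regression meta-model, and the synthetic history below are all assumptions chosen only to make the idea concrete.

```python
import numpy as np

# Hypothetical transformation catalogue; the names are illustrative only.
TRANSFORMS = ["log", "sqrt", "square", "standardize", "pairwise_product"]

def meta_features(X):
    """Summarize a dataset as a fixed-length vector (assumed meta-characteristics)."""
    n, d = X.shape
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-12          # guard against constant columns
    skew = (((X - mean) / std) ** 3).mean(axis=0)
    # Bias term, log sample count, log feature count, mean absolute skewness.
    return np.array([1.0, np.log(n), np.log(d), np.abs(skew).mean()])

def fit_meta_model(M, G, ridge=1e-3):
    """Ridge regression from meta-features M (tasks x k) to observed gains G
    (tasks x transforms). Stands in for the learned meta-knowledge layer."""
    k = M.shape[1]
    return np.linalg.solve(M.T @ M + ridge * np.eye(k), M.T @ G)

def rank_transforms(W, m_new, budget):
    """Predict a utility per transformation and keep only the top `budget`."""
    scores = m_new @ W
    order = np.argsort(-scores)[:budget]
    return [(TRANSFORMS[i], float(scores[i])) for i in order]

# Synthetic "historical" records: 40 past tasks with made-up gains.
rng = np.random.default_rng(0)
M_hist = np.column_stack([np.ones(40), rng.normal(size=(40, 3))])
W_true = rng.normal(size=(4, len(TRANSFORMS)))
G_hist = M_hist @ W_true + 0.05 * rng.normal(size=(40, len(TRANSFORMS)))

W = fit_meta_model(M_hist, G_hist)

# An unseen dataset: only the top-ranked transformations would be evaluated.
X_new = rng.lognormal(size=(200, 6))
ranked = rank_transforms(W, meta_features(X_new), budget=2)
for name, score in ranked:
    print(f"{name}: predicted utility {score:.3f}")
```

In this toy setting, evaluating 2 of 5 candidates skips 60% of the search space, which shows how a ranking step can prune evaluation cost; the figure is a property of the chosen budget here, not a reproduction of the reported result.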

