Statistical Quality and Reproducibility of Pseudorandom Number Generators in Machine Learning Technologies

Authors

B. Antunes

DOI:

https://doi.org/10.59461/ijdiic.v4i3.214

Keywords:

Reproducibility, Statistical Quality, Pseudorandom Number Generators, Machine Learning

Abstract

Machine learning (ML) frameworks rely heavily on pseudorandom number generators (PRNGs) for tasks such as data shuffling, weight initialization, dropout, and optimization. Yet the statistical quality and reproducibility of these generators, particularly when integrated into frameworks such as PyTorch, TensorFlow, and NumPy, remain underexplored. In this paper, we compare the statistical quality of PRNGs used in ML frameworks (Mersenne Twister, PCG, and Philox) against their original C implementations. Using the rigorous TestU01 BigCrush test suite, we evaluate 896 independent random streams for each generator. Our results challenge claims of statistical robustness, revealing that even generators labelled "crush-resistant" (e.g., PCG, Philox) may fail certain statistical tests. Surprisingly, we observe differences in failure profiles between the native and framework-integrated versions of the same algorithm, pointing to underlying implementation differences. The Mersenne Twister implementations in PyTorch and NumPy do not exhibit exactly the same failure profile as the original C implementation, and the same holds for the TensorFlow implementation of Philox.
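
As a rough illustration of the setup the abstract describes, the sketch below shows one way to pull raw 32-bit outputs from the framework-integrated generators under study (NumPy's MT19937 and PCG64, PyTorch's CPU Mersenne Twister generator, and TensorFlow's Philox-based tf.random.Generator) and write them to files that an external TestU01 BigCrush driver could consume. The seed, stream length, and file layout are assumptions for illustration only; they are not the exact 896-stream harness evaluated in the paper.

```python
# Minimal sketch (not the paper's harness): draw raw 32-bit outputs from the
# framework-integrated PRNGs named in the abstract, so each stream could be
# fed to an external statistical battery such as TestU01 BigCrush.
# Assumptions: NumPy exposes MT19937 and PCG64 bit generators, PyTorch's CPU
# torch.Generator is Mersenne Twister based, and TensorFlow's stateful
# tf.random.Generator supports the Philox algorithm.

import numpy as np
import torch
import tensorflow as tf

N = 1_000_000   # 32-bit draws per stream (illustrative)
SEED = 12345    # per-stream seed; the study uses many independent streams

# NumPy: Mersenne Twister and PCG64 wrapped in the Generator API
mt_stream = np.random.Generator(np.random.MT19937(SEED)).integers(
    0, 2**32, size=N, dtype=np.uint32)
pcg_stream = np.random.Generator(np.random.PCG64(SEED)).integers(
    0, 2**32, size=N, dtype=np.uint32)

# PyTorch: CPU generator (Mersenne Twister based), sampled as 32-bit words
g = torch.Generator()
g.manual_seed(SEED)
torch_stream = (torch.randint(0, 2**32, (N,), generator=g, dtype=torch.int64)
                .numpy().astype(np.uint32))

# TensorFlow: stateful generator using the Philox counter-based algorithm
tf_gen = tf.random.Generator.from_seed(SEED, alg="philox")
tf_stream = tf_gen.uniform_full_int((N,), dtype=tf.uint32).numpy()

# Dump each stream as raw 32-bit words (native byte order); a small C driver
# built on TestU01 would then read these files and run BigCrush on them.
for name, stream in [("numpy_mt19937", mt_stream), ("numpy_pcg64", pcg_stream),
                     ("torch_mt19937", torch_stream), ("tf_philox", tf_stream)]:
    stream.tofile(f"{name}.u32")
```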

References

M. Matsumoto and T. Nishimura, “Mersenne twister,” ACM Trans Model Comput Simul, vol. 8, no. 1, pp. 3–30, Jan. 1998, doi: 10.1145/272991.272995.

J. K. Salmon, M. A. Moraes, R. O. Dror, and D. E. Shaw, “Parallel random numbers,” in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, New York, NY, USA: ACM, Nov. 2011, pp. 1–12. doi: 10.1145/2063384.2063405.

M. E. O’Neill, “PCG: A family of simple fast space-efficient statistically good algorithms for random number generation,” ACM Trans Math Softw, 2014.

P. L’Ecuyer and R. Simard, “TestU01,” ACM Trans Math Softw, vol. 33, no. 4, pp. 1–40, Aug. 2007, doi: 10.1145/1268776.1268777.

M. Saito and M. Matsumoto, “SIMD-Oriented Fast Mersenne Twister: a 128-bit Pseudorandom Number Generator,” in Monte Carlo and Quasi-Monte Carlo Methods 2006, Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 607–622. doi: 10.1007/978-3-540-74496-2_36.

A. Rukhin, J. Soto, J. Nechvatal, M. Smid, E. Barker, S. Leigh, et al., “A statistical test suite for random and pseudorandom number generators for cryptographic applications,” Natl Inst Stand Technol, 2001.

M. Roucairol and T. Cazenave, “Comparing search algorithms on the retrosynthesis problem,” Mol Inform, vol. 43, no. 7, Jul. 2024, doi: 10.1002/minf.202300259.

C. Drummond, “Replicability is not reproducibility: Nor is it good science,” Proc Eval Methods Mach Learn Work, pp. 1–4, 2009.

M. Hart et al., “Trust Not Verify? The Critical Need for Data Curation Standards in Materials Informatics,” Chem Mater, vol. 36, no. 19, pp. 9046–9055, Oct. 2024, doi: 10.1021/acs.chemmater.4c00981.

B. Antunes and D. R. C. Hill, “Reproducibility, Replicability and Repeatability: A survey of reproducible research with a focus on high performance computing,” Comput Sci Rev, vol. 53, p. 100655, Aug. 2024, doi: 10.1016/j.cosrev.2024.100655.

M. Huk, K. Shin, T. Kuboyama, and T. Hashimoto, “Random Number Generators in Training of Contextual Neural Networks,” 2021, pp. 717–730. doi: 10.1007/978-3-030-73280-6_57.

A. Koivu, J.-P. Kakko, S. Mäntyniemi, and M. Sairanen, “Quality of randomness and node dropout regularization for fitting neural networks,” Expert Syst Appl, vol. 207, p. 117938, Nov. 2022, doi: 10.1016/j.eswa.2022.117938.

Y. Lu and S. Y. Meng, “A general analysis of example-selection for stochastic gradient descent,” Int Conf Learn Represent, p. 44, 2022.

J. Antorán, J. U. Allingham, and J. M. Hernández-Lobato, “Depth uncertainty in neural networks,” Adv Neural Inf Process Syst, vol. 33, pp. 10620–10634, 2020.

A. Mumuni and F. Mumuni, “Data augmentation: A comprehensive survey of modern approaches,” Array, vol. 16, p. 100258, Dec. 2022, doi: 10.1016/j.array.2022.100258.

F. Maleki, K. Ovens, R. Gupta, C. Reinhold, A. Spatz, and R. Forghani, “Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls,” Radiol Artif Intell, vol. 5, no. 1, Jan. 2023, doi: 10.1148/ryai.220028.

I. Tsamardinos, E. Greasidou, and G. Borboudakis, “Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation,” Mach Learn, vol. 107, no. 12, pp. 1895–1922, Dec. 2018, doi: 10.1007/s10994-018-5714-4.

Y. Liu, S. Liu, Y. Wang, F. Lombardi, and J. Han, “A Survey of Stochastic Computing Neural Networks for Machine Learning Applications,” IEEE Trans Neural Networks Learn Syst, vol. 32, no. 7, pp. 2809–2824, Jul. 2021, doi: 10.1109/TNNLS.2020.3009047.

M. Magris and A. Iosifidis, “Bayesian learning for neural networks: an algorithmic survey,” Artif Intell Rev, vol. 56, no. 10, pp. 11773–11823, Oct. 2023, doi: 10.1007/s10462-023-10443-1.

R. Wei and A. Mahmood, “Recent Advances in Variational Autoencoders With Representation Learning for Biomedical Informatics: A Survey,” IEEE Access, vol. 9, pp. 4939–4956, 2021, doi: 10.1109/ACCESS.2020.3048309.

P. Ladosz, L. Weng, M. Kim, and H. Oh, “Exploration in deep reinforcement learning: A survey,” Inf Fusion, vol. 85, pp. 1–22, Sep. 2022, doi: 10.1016/j.inffus.2022.03.003.

L. Xiao, Z. Zhang, K. Huang, J. Jiang, and Y. Peng, “Noise Optimization in Artificial Neural Networks,” IEEE Trans Autom Sci Eng, vol. 22, pp. 2780–2793, 2025, doi: 10.1109/TASE.2024.3384409.

K. Kim, J. Kim, J. Yu, J. Seo, J. Lee, and K. Choi, “Dynamic energy-accuracy trade-off using stochastic computing in deep neural networks,” in Proceedings of the 53rd Annual Design Automation Conference, New York, NY, USA: ACM, Jun. 2016, pp. 1–6. doi: 10.1145/2897937.2898011.

Y. Liu, Y. Wang, F. Lombardi, and J. Han, “An Energy-Efficient Online-Learning Stochastic Computational Deep Belief Network,” IEEE J Emerg Sel Top Circuits Syst, vol. 8, no. 3, pp. 454–465, Sep. 2018, doi: 10.1109/JETCAS.2018.2852705.

S. R. Dubey and S. K. Singh, “Transformer-Based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey,” IEEE Trans Artif Intell, vol. 5, no. 10, pp. 4851–4867, Oct. 2024, doi: 10.1109/TAI.2024.3404910.

P. Dahiya, I. Shumailov, and R. Anderson, “Machine Learning needs Better Randomness Standards: Randomized Smoothing and PRNG-based attacks,” Feb. 2024. http://arxiv.org/abs/2306.14043

A. Daniely and G. Vardi, “From local pseudorandom generators to hardness of learning,” Conf Learn Theory, pp. 1358–1394, 2021.

J. Hu et al., “Explainable AI models for predicting drop coalescence in microfluidics device,” Chem Eng J, vol. 481, p. 148465, Feb. 2024, doi: 10.1016/j.cej.2023.148465.

K. Zhu et al., “Analyzing drop coalescence in microfluidic devices with a deep learning generative model,” Phys Chem Chem Phys, vol. 25, no. 23, pp. 15744–15755, 2023, doi: 10.1039/D2CP05975D.

O. E. Gundersen, K. Coakley, C. Kirkpatrick, and Y. Gil, “Sources of Irreproducibility in Machine Learning: A Review,” Apr. 2023. http://arxiv.org/abs/2204.07610

B. Antunes, C. Mazel, and D. Hill, “Identifying Quality Mersenne Twister Streams for Parallel Stochastic Simulations,” in 2023 Winter Simulation Conference (WSC), IEEE, Dec. 2023, pp. 2801–2812. doi: 10.1109/WSC60868.2023.10408699.

Published

20-08-2025

How to Cite

Antunes, B. (2025). Statistical Quality and Reproducibility of Pseudorandom Number Generators in Machine Learning Technologies. International Journal of Data Informatics and Intelligent Computing, 4(3), 23–32. https://doi.org/10.59461/ijdiic.v4i3.214

Issue

Vol. 4 No. 3 (2025)

Section

Regular Issue