The Tabular Playground Series
Due to the large demand for tabular problems, Kaggle staff started an experiment in 2021, launching a monthly contest called the Tabular Playground Series. The contests were based on synthetic datasets that replicated public data or data from previous competitions. The synthetic data was created thanks to a deep learning generative network called CTGAN.
You can find the CTGAN code at https://github.com/sdv-dev/CTGAN. There’s also a relevant paper explaining how it works by modeling the probability distribution of rows in tabular data and then generating realistic synthetic data (see https://arxiv.org/pdf/1907.00503v2.pdf).
Synthetic Data Vault (https://sdv.dev/), an MIT initiative, created the technology behind CTGAN and quite a number of tools around it. The result is a set of open-source software systems built to help enterprises generate synthetic data that mimics real data; it can help data scientists to create anonymous datasets...