EuroSciPy 2019 Bilbao - Constrained Data Synthesis - Nick Radcliffe

EuroSciPy 2019 Bilbao
September 4, Wednesday
Baroja Track. Talk. 14.45

Constrained Data Synthesis
Nick Radcliffe

We introduce a method for creating synthetic data "to order" based on learned (or provided) constraints and data classifications. This includes "good" and "bad" data.

Synthetic data is useful in many contexts, including:

providing "safe", non-private alternatives to data containing personally identifiable information
software and pipeline testing
software and service development
enhancing datasets for machine learning.

Synthetic data is often created on a bespoke basis, and since the advent of generative adversarial networks (GANs) there has been considerable interest in, and experimentation with, using them as the basis for creating synthetic data.

We have taken a different approach. For some years we have been developing methods that automatically find constraints characterising data, which can then be used to test data validity (so-called "test-driven data analysis", TDDA). Such constraints form, by design, a useful characterisation of the data from which they were generated. As a result, methods that generate datasets satisfying the constraints necessarily produce datasets that share many characteristics of the original data from which the constraints were extracted.
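
To make the idea concrete, here is a minimal sketch of the discover-then-synthesize loop in plain Python. It is not the TDDA library's API or the method presented in the talk: the helper names (discover_constraints, synthesize, satisfies), the constraint kinds (numeric range, allowed values) and the sample records are all assumptions made for illustration.

```python
import random

def discover_constraints(rows):
    """Learn simple per-field constraints from a list of dicts.

    Hypothetical helper: numeric fields get an observed min/max,
    all other fields get the set of values seen.
    """
    constraints = {}
    for field in rows[0]:
        values = [row[field] for row in rows if row[field] is not None]
        if all(isinstance(v, (int, float)) for v in values):
            constraints[field] = {
                "type": "number",
                "min": min(values),
                "max": max(values),
                "integer": all(isinstance(v, int) for v in values),
            }
        else:
            constraints[field] = {"type": "category",
                                  "allowed": sorted(set(values))}
    return constraints

def synthesize(constraints, n, seed=0):
    """Generate n rows that satisfy every constraint (uniform sampling)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for field, c in constraints.items():
            if c["type"] == "category":
                row[field] = rng.choice(c["allowed"])
            elif c["integer"]:
                row[field] = rng.randint(c["min"], c["max"])
            else:
                row[field] = rng.uniform(c["min"], c["max"])
        rows.append(row)
    return rows

def satisfies(row, constraints):
    """Check one row against the constraints."""
    for field, c in constraints.items():
        value = row[field]
        if c["type"] == "number":
            if not c["min"] <= value <= c["max"]:
                return False
        elif value not in c["allowed"]:
            return False
    return True

# Illustrative source data (invented for this sketch).
source = [
    {"age": 34, "country": "ES"},
    {"age": 51, "country": "UK"},
    {"age": 27, "country": "ES"},
]
constraints = discover_constraints(source)
synthetic = synthesize(constraints, n=5)
all_valid = all(satisfies(row, constraints) for row in synthetic)
```

Uniform sampling within the constraints is the simplest possible choice here; it guarantees the synthetic rows pass the same checks as the source data, but it does not attempt to reproduce the source distributions.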

An important aspect of datasets is the relationship between "good" (~ valid) and "bad" (~ invalid) data, both of which are typically present. Systems for creating useful, realistic synthetic data generally need to be able to synthesize both kinds, in realistic mixtures.
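
One simple way to obtain such a mixture, sketched below in plain Python, is to generate valid rows and then deliberately break one constraint in a chosen fraction of them. This is an illustrative assumption, not the approach described in the talk: the constraints are written by hand, and the names CONSTRAINTS, good_value, bad_value, is_valid and synthesize_mixture are hypothetical.

```python
import random

# Hand-written constraints, invented for this sketch.
CONSTRAINTS = {
    "age": {"min": 18, "max": 90},
    "country": {"allowed": ["ES", "FR", "UK"]},
}

def good_value(c, rng):
    """Return a value that satisfies the constraint."""
    if "allowed" in c:
        return rng.choice(c["allowed"])
    return rng.randint(c["min"], c["max"])

def bad_value(c, rng):
    """Return a value that deliberately violates the constraint."""
    if "allowed" in c:
        # Corrupt a legal code so it is no longer in the allowed set.
        return rng.choice(c["allowed"]) + "?"
    # Step outside the permitted range, below or above.
    return rng.choice([c["min"] - rng.randint(1, 50),
                       c["max"] + rng.randint(1, 50)])

def is_valid(row, constraints):
    """Check one row against every constraint."""
    for field, c in constraints.items():
        value = row[field]
        if "allowed" in c:
            if value not in c["allowed"]:
                return False
        elif not c["min"] <= value <= c["max"]:
            return False
    return True

def synthesize_mixture(constraints, n, bad_fraction, seed=0):
    """Generate n rows, of which round(n * bad_fraction) break one constraint."""
    rng = random.Random(seed)
    n_bad = round(n * bad_fraction)
    rows = []
    for i in range(n):
        row = {f: good_value(c, rng) for f, c in constraints.items()}
        if i < n_bad:
            field = rng.choice(sorted(constraints))
            row[field] = bad_value(constraints[field], rng)
        rows.append(row)
    rng.shuffle(rows)  # interleave good and bad rows
    return rows

rows = synthesize_mixture(CONSTRAINTS, n=100, bad_fraction=0.1)
n_invalid = sum(not is_valid(row, CONSTRAINTS) for row in rows)
```

With these settings exactly 10 of the 100 rows fail validation. Controlling the bad fraction explicitly makes the output useful for testing validation pipelines, although a realistic mixture would also need realistic kinds of error, which this sketch does not model.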

This talk will discuss data synthesis from constraints, describing what has been achieved so far (which includes synthesizing good and bad data) and future research directions.

Domains – Big Data, Machine Learning, Simulation
Domain Expertise – some
Python Skill Level – professional
Project Homepage / Git –
Abstract as a tweet – Creating good and bad synthetic data to order using constraints

