That's not something I'm really equipped to comment on all that much. My role was mostly software developer, less data scientist. I can talk about how to improve the statistical power of the data you're using, though, since there are a few ways to make synthetic data. Maybe something in here can answer your question.
Dummy data, like what the Faker package (available in various languages) generates, has very little utility beyond testing systems and developing prototypes against a schema similar to the real thing. It can be used as a starting point for making synthetic data, though, and that's what we did.
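Just to make that concrete, here's roughly what that dummy-data starting point looks like with the Python flavour of Faker. The field names are placeholders for illustration, not the schema we actually worked with:

```python
from faker import Faker

fake = Faker()

# Purely structural dummy records: the values look plausible on their own
# but carry no statistical relationship to one another.
dummy_rows = [
    {
        "name": fake.name(),
        "email": fake.email(),
        "date_of_birth": fake.date_of_birth(minimum_age=18, maximum_age=90),
        "job": fake.job(),
    }
    for _ in range(5)
]

for row in dummy_rows:
    print(row)
```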
Getting into synthetic data proper, there are two kinds: sequential and non-sequential. Sequential generation produces a single datum, such as age, then uses that as the starting point for producing the rest of the record. For instance:
age = 29 => income for the age bracket (20 < x < 30) is drawn from the [30, 70] distribution => for incomes in that distribution, draw the next field... etc.
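A minimal sketch of that sequential chain in Python with numpy. The brackets, income ranges, and retirement rates are numbers I'm making up for illustration, not the ones from any real dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

# Income range (in thousands) per age bracket: purely illustrative.
INCOME_BY_AGE = {
    (20, 30): (30, 70),
    (30, 50): (50, 120),
    (50, 70): (40, 100),
}

# Probability of being retired given the age bracket (also illustrative).
RETIRED_BY_AGE = {
    (20, 30): 0.01,
    (30, 50): 0.02,
    (50, 70): 0.40,
}

def sequential_record():
    # Step 1: draw the seed datum.
    age = int(rng.integers(20, 70))
    bracket = next(b for b in INCOME_BY_AGE if b[0] <= age < b[1])
    # Step 2: draw income conditioned on the age bracket.
    low, high = INCOME_BY_AGE[bracket]
    income = float(rng.uniform(low, high))
    # Step 3: draw retirement status conditioned on the same bracket.
    retired = bool(rng.random() < RETIRED_BY_AGE[bracket])
    return {"age": age, "income_k": income, "retired": retired}

print(sequential_record())
```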
With the sequential approach you end up with genuinely high-utility data, and you can use it to build basic models that you then apply to the real data. Non-sequential generation, on the other hand, creates each datum independently according to its own rule, but ignores the interdependence between the rules. For example, where a sequential dataset may contain fewer than 1% retirees among people aged 20 to 30, a non-sequential dataset would draw retirement status from the population-wide average, leaving it with a skewed number of retirees in that age group.
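And here's the non-sequential version of the same fields, where every column comes from its own marginal distribution with no conditioning between them. The ~15% figure is an assumed population-wide retirement rate, again just for illustration; run it and the retired 20-to-30-year-olds show up far more often than they should:

```python
import numpy as np

rng = np.random.default_rng(42)

def non_sequential_record():
    # Each field is drawn from its own marginal distribution,
    # with no conditioning between fields.
    age = int(rng.integers(20, 70))
    income = float(rng.uniform(30, 120))   # population-wide income range
    retired = bool(rng.random() < 0.15)    # population-wide retirement rate
    return {"age": age, "income_k": income, "retired": retired}

# Rate of retired 20-to-30-year-olds in the non-sequential data.
rows = [non_sequential_record() for _ in range(10_000)]
young = [r for r in rows if 20 <= r["age"] < 30]
print(sum(r["retired"] for r in young) / len(young))  # ~0.15, not <0.01
```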