Generating Synthetic Data for Development and Testing

What is Synthetic Data?

Real events are the core of real-time analytics. Apache Druid® ingests both batch data and stream data to enable low-latency, high-concurrency queries at any scale that combine both real-time and historical events.

When developing or testing solutions that use Druid, sometimes it’s difficult or risky to use real data. Maybe the data for production hasn’t yet been collected. Perhaps the production data contains personal information or confidential information that must be carefully secured. Maybe there is a need to test the impact of changing the incoming data stream.

Synthetic data addresses these challenges. It is data that is artificially generated rather than collected from actual events. Because synthetic data isn’t “real”, it can be generated on demand at any scale without exposing production data.

Data Generation Checklist

  • Data Requirements
  • Programmatic Generation

Data Requirements

The first step in generating synthetic data is defining requirements. What is the format of the base data that should be simulated? How should the synthetic data “look”?

The following example of IoT data (from the UNB public dataset) shows network packet data from various devices:

04:15:00.206384 IP ec2-52-22-185-73.compute-1.amazonaws.com.https > 192.168.137.2.59628: Flags [F.], seq 409, ack 1371, win 118, options [nop,nop,TS val 1810148855 ecr 82385791], length 0
04:15:00.206439 IP ec2-52-22-185-73.compute-1.amazonaws.com.https > 192.168.137.2.59628: Flags [.], ack 1372, win 118, options [nop,nop,TS val 1810148855 ecr 82385791], length 0
04:15:00.209178 IP 192.168.137.2.59628 > ec2-52-22-185-73.compute-1.amazonaws.com.https: Flags [R], seq 2883894230, win 0, length 0
04:15:00.209178 IP 192.168.137.2.59628 > ec2-52-22-185-73.compute-1.amazonaws.com.https: Flags [R], seq 2883894231, win 0, length 0
04:15:00.223469 IP 192.168.137.143.45454 > 154.64.123.34.bc.googleusercontent.com.https: Flags [P.], seq 3609391910:3609392000, ack 3972100536, win 3233, options [nop,nop,TS val 32863920 ecr 3379635137], length 90
04:15:00.272557 IP 154.64.123.34.bc.googleusercontent.com.https > 192.168.137.143.45454: Flags [.], ack 90, win 309, options [nop,nop,TS val 3379636165 ecr 32863920], length 0
04:15:00.279800 IP 192.168.137.125.49154 > broadcasthost.6667: UDP, length 172
04:15:00.324697 IP 192.168.137.19.58059 > broadcasthost.6667: UDP, length 172
04:15:00.410865 IP 192.168.137.55 > 192.168.137.1: ICMP echo request, id 49409, seq 65280, length 64
04:15:00.444109 IP 192.168.137.55.56520 > 120.76.210.199.http: Flags [S], seq 1803686720, win 14600, options [mss 1460,sackOK,TS val 32900845 ecr 0,nop,wscale 3], length 0
04:15:00.654531 IP 192.168.137.67.49154 > broadcasthost.6667: UDP, length 188
04:15:00.722132 IP 192.168.137.96.49154 > broadcasthost.6667: UDP, length 188
04:15:00.988691 ARP, Announcement 192.168.137.38, length 46
04:15:01.251996 IP 192.168.137.143.45454 > 154.64.123.34.bc.googleusercontent.com.https: Flags [P.], seq 90:180, ack 1, win 3233, options [nop,nop,TS val 32864024 ecr 3379636165], length 90
04:15:01.301070 IP 154.64.123.34.bc.googleusercontent.com.https > 192.168.137.143.45454: Flags [.], ack 180, win 309, options [nop,nop,TS val 3379637193 ecr 32864024], length 0
04:15:01.406151 IP 192.168.137.19.40623 > ec2-34-213-103-51.us-west-2.compute.amazonaws.com.8886: Flags [P.], seq 81071502:81071683, ack 340974241, win 1954, length 181
04:15:01.414629 IP 192.168.137.141.37020 > 239.255.255.250.37000: UDP, length 176
04:15:01.461416 IP 192.168.137.4.isnetserv > vps-3b3e145c.vps.ovh.ca.ipsec-msft: isakmp-nat-keep-alive
04:15:01.498070 IP ec2-34-213-103-51.us-west-2.compute.amazonaws.com.8886 > 192.168.137.19.40623: Flags [P.], seq 1:70, ack 181, win 24120, length 69
04:15:01.544228 IP 192.168.137.252 > 192.168.137.1: ICMP echo request, id 34844, seq 0, length 8
04:15:01.614912 IP 192.168.137.19.40623 > ec2-34-213-103-51.us-west-2.compute.amazonaws.com.8886: Flags [.], ack 70, win 1885, length 0
04:15:01.627384 ARP, Announcement 192.168.137.48, length 46
04:15:01.913356 IP 192.168.137.168.49154 > broadcasthost.6667: UDP, length 188

Each event includes:

  • Timestamp, with microsecond precision
  • Protocol (IP or ARP)
  • Source address, including port when one is present
  • Target address, including port when one is present
  • Info, which depends on the packet type:
      • For ARP packets, length
      • For UDP/IP packets, the identifier “UDP” and length
      • For ICMP/IP packets, the identifier “ICMP echo request”, id, seq, and length
      • For TCP/IP packets, Flags (an array), seq, ack, win, and length

This defines the shape of the data to generate.
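
As a rough illustration, the shape described above could be captured in a small Python generator that uses only the standard library. This is a minimal sketch, not a faithful traffic model: the names PacketEvent, device_address, make_event, and generate_events are hypothetical, and the address range, ports, lengths, and the even mix of packet types are assumptions loosely based on the sample lines.

```python
import random
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PacketEvent:
    timestamp: datetime
    protocol: str            # "IP" or "ARP"
    source: str              # address, plus ".port" when one applies
    target: Optional[str]    # None for ARP announcements
    info: dict = field(default_factory=dict)


def device_address(rng: random.Random) -> str:
    """Pick a made-up device address in the 192.168.137.0/24 range."""
    return f"192.168.137.{rng.randint(2, 254)}"


def make_event(rng: random.Random, ts: datetime) -> PacketEvent:
    """Build one event; the four packet types are chosen with equal odds."""
    kind = rng.choice(["ARP", "UDP", "ICMP", "TCP"])
    src = device_address(rng)
    if kind == "ARP":
        return PacketEvent(ts, "ARP", src, None, {"length": 46})
    if kind == "UDP":
        info = {"type": "UDP", "length": rng.choice([172, 176, 188])}
        return PacketEvent(ts, "IP", f"{src}.49154", "broadcasthost.6667", info)
    if kind == "ICMP":
        info = {"type": "ICMP echo request", "id": rng.randint(0, 65535),
                "seq": rng.randint(0, 65535), "length": 64}
        return PacketEvent(ts, "IP", src, "192.168.137.1", info)
    # TCP: flags are kept as a list, matching "Flags (an array)" above.
    info = {"flags": rng.choice([["S"], ["."], ["P", "."], ["F", "."], ["R"]]),
            "seq": rng.randint(0, 2**32 - 1), "ack": rng.randint(0, 2**32 - 1),
            "win": rng.randint(0, 65535), "length": rng.choice([0, 69, 90, 181])}
    target = f"{device_address(rng)}.443"
    return PacketEvent(ts, "IP", f"{src}.{rng.randint(1024, 65535)}", target, info)


def generate_events(count, seed=0, start=datetime(2024, 1, 1, 4, 15, 0)):
    """Return `count` events with strictly increasing timestamps."""
    rng = random.Random(seed)   # a fixed seed makes the output repeatable
    ts = start
    events = []
    for _ in range(count):
        ts += timedelta(microseconds=rng.randint(1, 500_000))
        events.append(make_event(rng, ts))
    return events
```

Calling generate_events(1000, seed=42) would return a repeatable list of 1,000 events. Seeding the generator is a deliberate choice here: a test that fails on synthetic data can then be reproduced with exactly the same input.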

For development and testing uses, it’s helpful to have both a historical data set stored as a file and a live data stream to emulate real-time ingestion.
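
One way to get both outputs from the same generator is sketched below: a function that writes a historical file as newline-delimited JSON, and a Python generator that yields events stamped with the current time to imitate a live feed. The function names, the simplified UDP-only event, the file format, and the delay model are all assumptions made for illustration; a real setup would pass the live events to a producer for whichever streaming platform feeds Druid.

```python
import json
import random
import time
from datetime import datetime, timedelta, timezone


def synthetic_event(rng, ts):
    """Build one simplified, hypothetical UDP-style event as a dict."""
    return {
        "timestamp": ts.isoformat(timespec="microseconds"),
        "protocol": "IP",
        "source": f"192.168.137.{rng.randint(2, 254)}.49154",
        "target": "broadcasthost.6667",
        "info": {"type": "UDP", "length": rng.choice([172, 188])},
    }


def write_history(path, count, start, seed=0):
    """Write `count` past events to `path`, one JSON object per line."""
    rng = random.Random(seed)
    ts = start
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(count):
            ts += timedelta(microseconds=rng.randint(1, 500_000))
            f.write(json.dumps(synthetic_event(rng, ts)) + "\n")
    return count


def live_stream(max_events, mean_delay=0.2, seed=0):
    """Yield events stamped with the current time, pausing between them.

    Delays are drawn from an exponential distribution whose mean is
    `mean_delay` seconds, which must be greater than zero.
    """
    rng = random.Random(seed)
    for _ in range(max_events):
        time.sleep(rng.expovariate(1.0 / mean_delay))
        yield synthetic_event(rng, datetime.now(timezone.utc))
```

For example, write_history("history.jsonl", 10000, datetime(2024, 1, 1, tzinfo=timezone.utc)) would produce a file suitable for a batch load, while iterating over live_stream(100) would emit roughly five events per second.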

Programmatic Generation

Once the data definition and requirements are set, a tool is needed to actually generate the data.

Fortunately, there are multiple commercially available and open source tools to generate synthetic data. A few of the more popular open source options include:

Faker is a Python library to generate synthetic data. There are also Faker variants for PHP, Perl, and Ruby.

The Synthetic Data Vault is a collection of Python libraries that generate single tables, multiple tables, or sequential events using a variety of generation algorithms.

Synth is a Rust application that generates tables and streams.

Twinify reads a CSV file and generates a “twin” of synthetic data with matching statistical distributions.

Gretel is a commercial provider of synthetic-data-as-a-service that offers an open source Python library for data generation.
