Creating Realistic Test Data with Python Faker
Testing data pipelines, processing applications, and data management systems all requires data. There are a few ways to get test data: generate your own, or find an existing dataset in the format you need. But what if you need specific data types with many rows for testing, without checking a huge dataset into a git repo and eating up storage?
Your choices:
- Download an existing dataset from somewhere like Kaggle and import it
- Write your own Python script to generate data
- Use pandas + numpy to create a simple dataset
- Use pandas + Python's faker module
Using pandas + numpy to create a fake dataset:
To keep it simple we write the result to a CSV file, but depending on what system you want to test against you can just as easily write to a database using one of Python's many database modules.
import pandas as pd
import numpy as np
import sys
import time
# Create Fake data
def get_dataset(size):
    df = pd.DataFrame()
    df['game'] = np.random.choice(['GameA', 'GameB', 'GameC', 'GameD', 'GameE'], size)
    df['score'] = np.random.randint(1, 9000, size)
    df['team'] = np.random.choice(['taper1', 'bravo7', 'M4', 'bluetree', 'tack satch', '21 jump'], size)
    df['win'] = np.random.choice(['yes', 'no'], size)
    df['prob'] = np.random.uniform(0, 1, size)
    return df
# variables
size = 1_000_000 #! over 100 million rows laptop gets angry
FTIME = time.strftime('%b_%Y%d%H%M%S')
FNAME = "../data/fake_data_" + FTIME + ".csv"
print("\nCreating fake dataset with", size,"rows")
# Generate data with n rows.
df = get_dataset(size)
head1 = df.head()
print("\nSummary:\n", head1.to_string(index=False), "\n\nWriting file:", FNAME)
# Write file
df.to_csv(FNAME, index=False)
print("\nCreated file:", FNAME, "\n" "With", size," rows")
sys.exit()
Running it with a million rows took just under four seconds.
sh-3.2$ time python3 fake_data_quick
Creating fake dataset with 1000000 rows
Summary:
game score group win prob
GameE 3662 M4 yes 0.980694
GameD 2862 taper1 yes 0.707455
GameE 4118 tack satch yes 0.859279
GameB 5696 bluetree yes 0.309197
GameE 4580 taper1 no 0.396701
Writing file: ../data/fake_data_Feb_202316124322.csv
Created file: ../data/fake_data_Feb_202316124322.csv
With 1000000 rows
real 0m3.883s
user 0m3.915s
sys 0m0.369s
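As mentioned above, writing to a CSV is just the simplest option. If the system under test is a database, the same DataFrame can be loaded with pandas' to_sql instead. A minimal sketch, assuming SQLite as the target and reusing the get_dataset function from above; the database path and the fake_games table name are placeholders for this example:

import sqlite3

import pandas as pd

# generate the same fake data as above, then load it into SQLite
# instead of a CSV; any DB-API connection or SQLAlchemy engine
# works the same way with to_sql
df = get_dataset(1_000_000)
with sqlite3.connect("../data/fake_data.db") as conn:
    df.to_sql("fake_games", conn, if_exists="replace", index=False)
    rows = conn.execute("SELECT COUNT(*) FROM fake_games").fetchone()[0]
    print("Loaded", rows, "rows into table fake_games")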
Next I wanted a wide table with columns of different data types. I could keep building my own columns with numpy like above, or I could use Python's faker module.
For the first attempt I made it so I could write the files in different formats, in case I'm testing some processing system, an ETL job, or whatever:
- CSV
- Parquet
- JSON
#! /usr/local/bin/python3
import pandas as pd
from faker import Faker
import argparse
import sys
from tqdm import tqdm
# This script generates data in json, csv or parquet format
# mblue Feb 2023
parser = argparse.ArgumentParser()
parser.add_argument('--file', '-f', nargs='*', dest="data_file", help='name for the file, e.g. user_data')
parser.add_argument('--records','-r', nargs='*', dest="num_records", type=str, help='Enter the amount of records like 100 = 100 rows of data')
parser.add_argument('--format', dest="format_name", type=str, help="Specify the format of the output")
parser.add_argument('--append', '-a', help='Append to an existing file instead of creating a new one', action="store_true")
args = (parser.parse_args())
# arg conditions
if args.data_file:
    FNAME = ''.join(args.data_file)
else:
    # Setting default value
    FNAME = "fake_user_data"
if args.num_records:
    REC = ''.join(args.num_records)
else:
    REC = "100"  # keep the default as a string; it is converted to an int below
# limited error handling
# *TODO: Better handling
if args.format_name is None:
    print("Needs 'csv, json or parquet' format argument:\n"
          "Usage is: python3 data_generator --file mytestdata --records 1000 --format parquet")
    sys.exit()
if args.format_name:
    FORMAT = ''.join(args.format_name)
print("\nGenerating Fake Data with", '\'' + REC + '\'', 'records.... \n')
# Create an instance of Faker
fake = Faker('en-US')
#CPU_CORES = cpu_count() -1
# Convert the record count to an int for the generation loop
REC = int(REC)
# Generate Fake Data
def create_df():
    df = pd.DataFrame()
    for _ in tqdm(range(REC), colour="#18F6F6"):
        # first/last name are reused below to build the email address
        first_name = fake.first_name()
        last_name = fake.last_name()
        gamer_id = last_name + str(fake.random_int())  #* concatenate last name + number for an id
        df = pd.concat([df, pd.DataFrame({
            'Email_Id': f"{first_name}.{last_name}@{fake.domain_name()}",
            'Name': first_name,
            'Gamer_Id': gamer_id,
            'Device': fake.ios_platform_token(),
            'Phone_Number': fake.phone_number(),
            'Address': fake.address(),
            'City': fake.city(),
            'Year': fake.year(),
            'Time': fake.time(),
            'Link': fake.url(),
            'Purchase_Amount': fake.random_int(min=5, max=1000)
        }, index=[0])], ignore_index=True)
    return df
df = create_df()
def head_it():
    h1 = df.head(3)
    print("\nSample Data:\n", h1.to_string(index=False), "\n")
# Global messages
def Message1():
    MESSAGE = "\nCreated file: " + FNAME + "\nWith " + str(REC) + " records of data\n"
    print(MESSAGE)
def Message2():
    MESSAGE = "\nAppended to file: " + FNAME + "\nWith " + str(REC) + " more records of data\n"
    print(MESSAGE)
# Write to CSV
if args.format_name == "csv":
    if args.append:
        head_it()
        FNAME = FNAME + ".csv"
        # append without re-writing the header row
        df.to_csv(FNAME, mode='a', header=False, index=False)
        Message2()
        sys.exit()
    else:
        head_it()
        FNAME = FNAME + ".csv"
        df.to_csv(FNAME, index=False)
        Message1()
        sys.exit()
# Write to Parquet
if args.format_name == "parquet":
    if args.append:
        head_it()
        FNAME = FNAME + ".parquet"
        df.to_parquet(FNAME, engine='fastparquet', append=True)
        Message2()
        sys.exit()
    else:
        FNAME = FNAME + ".parquet"
        df.to_parquet(FNAME)
        head_it()
        Message1()
        sys.exit()
# Write to Json
if args.format_name == "json":
    FNAME = FNAME + ".json"
    df.to_json(FNAME, orient='records')
    head_it()
    Message1()
    sys.exit()
else:
    print("Format not supported, must be 'csv', 'json' or 'parquet'.")
    sys.exit()
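One thing the script above does not do, but is handy to know: Faker can be seeded, so if you need the exact same fake records on every run (for repeatable tests, say) you can seed it before generating anything. A small aside, not part of the script:

from faker import Faker

Faker.seed(4321)        # seed Faker's shared random source so runs are reproducible
fake = Faker('en_US')
print(fake.name())      # prints the same name every time the script runs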
Running this with just 10,000 rows took almost 20 seconds.
sh-3.2$ time python3 fkdata_generator --file ../data/fkuser_data --records 10000 --format parquet
Generating Fake Data with '10000' records....
100%|█████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:18<00:00, 551.32it/s]
Sample Data:
Email_Id Name Gamer_Id Device Phone_Number Address City Year Time Link Purchase_Amount
Erin.Brown@montgomery-nichols.com Erin Brown6307 iPhone; CPU iPhone OS 9_3_5 like Mac OS X (009)211-6735x508 48402 Kelly Port Apt. 431\nNorth Brian, PR 81662 Lucasland 2012 06:12:59 https://www.perez.com/ 544
Dennis.King@avila-odom.com Dennis King3027 iPhone; CPU iPhone OS 14_2_1 like Mac OS X (852)065-5680x15345 92220 Denise Ways Apt. 843\nNorth Sierrashire, WV 39014 East Timothyshire 1976 20:42:58 https://ortega.com/ 97
Tammy.Andrews@garrison.com Tammy Andrews8289 iPhone; CPU iPhone OS 6_1_6 like Mac OS X 001-319-013-0740 6603 Justin Prairie Apt. 652\nMorrisbury, ID 31143 Nicholeville 1987 07:41:59 http://www.anderson.org/ 312
Created file: ../data/fkuser_data.parquet
With 10000 records of data
real 0m19.056s
I added an append option so I can always add to a file later.
time python3 fkdata_generator --file ../data/fkuser_data --records 20000 --format parquet --append
Generating Fake Data with '20000' records....
100%|█████████████████████████████████████████████████████████████████████████████| 20000/20000 [00:55<00:00, 361.19it/s]
Sample Data:
Email_Id Name Gamer_Id Device Phone_Number Address City Year Time Link Purchase_Amount
Jonathan.Ryan@hopkins.info Jonathan Ryan5852 iPad; CPU iPad OS 3_1_3 like Mac OS X (466)654-6869 70790 Yates Ways\nSchmidtview, CA 45881 South Stevenburgh 1979 02:05:06 https://morgan-barber.com/ 317
Calvin.Bennett@harmon.biz Calvin Bennett9929 iPhone; CPU iPhone OS 7_1_2 like Mac OS X 655.995.8999 720 Andrews Roads\nNew Karafurt, MO 41521 Port Daniel 2010 21:29:23 https://www.myers.info/ 433
Christina.Smith@burke-keller.com Christina Smith2436 iPad; CPU iPad OS 9_3_5 like Mac OS X +1-095-350-1935 14657 Hardin Hills\nGutierrezville, CO 22242 Port Nicoleburgh 2005 19:49:27 http://www.figueroa.biz/ 350
Appended to file: ../data/fkuser_data.parquet
With 20000 more records of data
real 0m56.503s
user 0m55.654s
sys 0m0.876s
Verify it added the records.
sh-3.2$ ./duckdb
D select count(*) from '../data/fkuser_data.parquet';
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 30000 │
└──────────────┘
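The same check can be scripted with the duckdb Python module instead of the CLI, assuming it is installed with pip:

import duckdb

# same count(*) check as the DuckDB CLI session above, from Python
con = duckdb.connect()
count = con.execute("SELECT count(*) FROM '../data/fkuser_data.parquet'").fetchone()[0]
print(count)  # 30000 after the append run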
One thing I noticed: doubling the number of rows roughly triples the runtime. Most of that is likely the per-row pd.concat, which rebuilds the DataFrame on every iteration and gets slower as the frame grows. I will add multiprocessing at some point, but for now it does what I need it to do.
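If the per-row pd.concat is the bottleneck, one likely fix before reaching for multiprocessing is to collect the records as plain dicts and build the DataFrame once at the end. A sketch of that variant (create_df_fast is a hypothetical name, and it assumes the same fake, tqdm, and pandas setup as the script above):

def create_df_fast(rec):
    # append plain dicts to a list, then build one DataFrame at the end,
    # instead of calling pd.concat on every iteration
    rows = []
    for _ in tqdm(range(rec), colour="#18F6F6"):
        first_name = fake.first_name()
        last_name = fake.last_name()
        rows.append({
            'Email_Id': f"{first_name}.{last_name}@{fake.domain_name()}",
            'Name': first_name,
            'Gamer_Id': last_name + str(fake.random_int()),
            'Device': fake.ios_platform_token(),
            'Phone_Number': fake.phone_number(),
            'Address': fake.address(),
            'City': fake.city(),
            'Year': fake.year(),
            'Time': fake.time(),
            'Link': fake.url(),
            'Purchase_Amount': fake.random_int(min=5, max=1000),
        })
    return pd.DataFrame(rows)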
You can find this script on GitHub: fkdata_generator
Our first attempt, creating a DataFrame from our own numpy-generated data and writing it to a file, ran much faster. That approach was quicker, but limited in the range of data it could produce. With Faker we can create much richer data from a single function, which was useful for testing DuckDB against a wide table with millions of rows. Writing the output to a JSON, CSV, or Parquet file covers a lot of testing scenarios; an even better option might be to connect directly to a database, which would not be difficult. More to come on this soon.