Introducing pydbgen: A random dataframe/database table generator

Posted by r.silva on 6 May 2019 - 9:12am
pybgen2.jpg
Image by Tirthajyoti Sarkar.

By Tirthajyoti Sarkar, ON Semiconductor.

This post was first published in Towards Data Science blog.

Often, beginners in SQL or data science struggle with the matter of easy access to a large sample database file (.DB or .sqlite) for practicing SQL commands. Would it not be great to have a simple tool or library to generate a large database with multiple tables, filled with data of one’s own choice?

When you start learning and practicing data science, often the biggest worry is not the algorithms or techniques but availability of raw data. Fortunately, there are many high-quality real-life datasets available on the web for trying out cool machine learning techniques. However, from my personal experience, I found that the same is not true when it comes to learning SQL. Now, for data science - having a basic familiarity of SQL is almost as important as knowing how to write code in Python or R. But access to a large enough database with real data (such as name, age, credit card, SSN, address, birthday, etc.) is not nearly as common as access to toy datasets on Kaggle, specifically designed or curated for machine learning task.

Apart from the beginners in data science, even seasoned software testers may find it useful to have a simple tool where with a few lines of code they can generate arbitrarily large data sets with random (fake) yet meaningful entries.

I am glad to introduce a lightweight Python library called pydbgen. You can read in details about the package here. I am going to go over similar details in the short article.

What exactly is pydbgen?

It is a lightweight, pure-python library to generate random useful entries (e.g. name, address, credit card number, date, time, company name, job title, license plate number, etc.) and save them in either Pandas dataframe object, or as a SQLite table in a database file, or in a MS Excel file.

How to install it?

It’s (current version 1.0.5) hosted on PyPI (Python Package Index repository). Remember you need to have Faker installed to make this work. So, just type,

pip install pydbgen

Note, it’s currently only tested on Python 3.6. It won’t work on Python 2 installations.

How to use it?

You have to initiate a pydb object to start using it.

import pydbgen
from pydbgen import pydbgen
myDB=pydbgen.pydb()

After that, you can access the various internal functions exposed by the pydbobject. For example to print random US cities,

myDB.city_real()
'Otterville'
for _ in range(10):
    print(myDB.license_plate())
8NVX937
6YZH485
XBY-564
SCG-2185
XMR-158
6OZZ231
CJN-850
SBL-4272
TPY-658
SZL-0934

If you just say 'city' instead of 'city_real', you will get fictitious city names :)

print(myDB.gen_data_series(num=8,data_type='city'))
New Michelle
Robinborough
Leebury
Kaylatown
Hamiltonfort
Lake Christopher
Hannahstad
West Adamborough

How to generate a Pandas dataframe with random entries?

You can choose how many and what data types to be generated. Note, everything is returned as string/texts.

testdf=myDB.gen_dataframe(5,['name','city','phone','date'])

The resulting dataframe looks like following,

names

city

phone-number

date

0

James Bass

Fife Lake

340-848-7354

2010-12-11

1

Cody Werner

Topanga

326-520-2048

1996-12-22

2

Joshua West

Richwoods

265-159-8349

2012-03-12

3

Kenneth Hanson

Northfields Woods

184-496-6411

1979-07-12

4

Michelle Brown

Lake Beulah

554-703-6417

1986-06-04

How to generate a Database Table?

You can choose how many and what data types to be generated. Note, everything is returned as text/VARCHAR data type for the database. You can specify the database file name and the Table name.

myDB.gen_table(
    db_file='Testdb.DB',
    table_name='People',
    fields=['name','city','street_address','email']
)

This generates a .DB file which can be used with MySQL or SQLite database server. The resulting database table was opened in DB Browser for SQLite and looks like following,

Screenshot of DB Browser for SQLit
Image by Tirthajyoti Sarkar.

How to generate an Excel file?

Similar to above, just use the following code to generate an Excel file with random data. Note the 'simple_phone' set is set to False, thereby generating complex long-form phone numbers. This can come handy to experiment with more involved data extraction codes!

myDB.gen_excel(
    num=20,
    fields=['name','phone','time','country'],
    phone_simple=False,
    filename='TestExcel.xlsx'
)

The resulting file looks like…

Screenshot of Excel.
Image by Tirthajyoti Sarkar.

A cool way to generate random email IDs for scrap use?

One built-in method in pydbgen is realistic_email , which generates random email IDs from a seed name. Can you think of any use of this on the web where you don’t want to give out your real email ID but something close?

myDB.gen_excel(
    num=20,
    fields=['name','phone','time','country'],
    phone_simple=False,
    filename='TestExcel.xlsx'
)
Tirtha_Sarkar@gmail.com
Sarkar.Tirtha@outlook.com
Tirtha_S48@verizon.com
Tirtha_Sarkar62@yahoo.com
Tirtha.S46@yandex.com
Tirtha.S@att.com
Sarkar.Tirtha60@gmail.com
TirthaSarkar@zoho.com
Sarkar.Tirtha@protonmail.com
Tirtha.S@comcast.net

Future Improvements and user contributions

The current version is 1.0.5 and may contain many bugs. If you notice any and your program crashes during execution (except for bad entry by you), please let me know. Also, if you have cool idea to contribute to the source code, the Github repo is all open for you. Some questions readily come to mind,

  • Can we integrate some machine learning/statistical modeling with this random data generator?

  • Should visualization function be added to the generator?

Possibilities are endless and exciting…