Retail Dataset Example

This page shows some typical use-cases of ‘chainstaining’ multiple stainers together to produce several distinct transformed DirtyDFs, based on a retail dataset. We expect these types of procedures to be the most common use-case of this library.

import pandas as pd
import numpy as np
from ddf.stainer import ShuffleStainer, InflectionStainer, NullifyStainer, DatetimeFormatStainer, DatetimeSplitStainer
from ddf.DirtyDF import DirtyDF

We load the dataset and view some basic dataset properties.

retail = pd.read_csv("../data/online_retail_small.csv", parse_dates = ["InvoiceDate"])
retail.info()

Out:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   InvoiceNo    5000 non-null   object
 1   StockCode    5000 non-null   object
 2   Description  4988 non-null   object
 3   Quantity     5000 non-null   int64
 4   InvoiceDate  5000 non-null   datetime64[ns]
 5   UnitPrice    5000 non-null   float64
 6   CustomerID   3795 non-null   float64
 7   Country      5000 non-null   object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 312.6+ KB
retail.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-01-12 08:26:00 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-01-12 08:26:00 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-01-12 08:26:00 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-01-12 08:26:00 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-01-12 08:26:00 3.39 17850.0 United Kingdom


Convert ‘Country’ column to ‘category’ type.

retail["Country"] = retail.Country.astype("category")

We can stain the dataset in various ways; in particular, since there is a datetime component in this dataset, we can use the DatetimeFormatStainer and DatetimeSplitStainer. We can also add simple ShuffleStainer, NullifyStainer, and apply InflectionStainer on the countries as well.

We first view the distribution of the Country column to see if inflection staining is applicable here.

retail.Country.value_counts()

Out:

United Kingdom    4837
Norway              73
Germany             30
EIRE                24
France              20
Australia           14
Netherlands          2
Name: Country, dtype: int64

We can see that lowercase and uppercase inflections are applicable here, aside from the ‘EIRE’ category, which we can ignore.

We now check the numeric distribution of the datetime column to see if datetime staining is applicable here.

retail.InvoiceDate.describe(datetime_is_numeric=True)

Out:

count                          5000
mean     2010-01-24 07:21:03.612000
min             2010-01-12 08:26:00
25%             2010-01-12 13:24:00
50%             2010-01-12 17:06:00
75%             2010-02-12 12:10:00
max             2010-02-12 18:08:00
Name: InvoiceDate, dtype: object

We can see that the entire dataset consists of invoices within a month, and times are included.

We now initiate our stainers. It is possible to change the name of the Stainer to reflect the output seen when printing the history

retail_ddf = DirtyDF(retail, seed = 42) # Create DDF
dt_split_stainer = DatetimeSplitStainer(name = "Date Split", keep_time = False) # Only split the date

Since the col_type of the DatetimeSplitStainer is set to “datetime”, it will automatically identify datetime columns and only execute the stainer on those columns. Note that this only applies when using a DDF. If using the stainer directly, the column number needs to be specified

retail_transformed = retail_ddf.add_stainers(dt_split_stainer).run_stainer()
retail_transformed.get_df().head()

new_retail_df, row_map, col_map = dt_split_stainer.transform(retail, np.random.default_rng(42), col_idx = [4])

Since the DatetimeSpitStainer adds columns, we can check the column mapping to see how the columns were changed

retail_transformed.get_mapping(axis = 1) # or col_map if using the Stainer directly

Out:

{0: [0], 1: [1], 2: [2], 3: [3], 4: [4, 5, 6], 5: [7], 6: [8], 7: [9]}

Total running time of the script: ( 0 minutes 0.288 seconds)

Gallery generated by Sphinx-Gallery