.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples\plot_dirty_df_example.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_dirty_df_example.py:


Basic Usage of DirtyDF with Stainers
====================================

This page shows some basic examples of using DirtyDF and applying stainers to transform it.
We recommend going through the Basic Usage of Stainers (no DirtyDF) example first.

.. GENERATED FROM PYTHON SOURCE LINES 9-14

.. code-block:: default

    import pandas as pd
    import numpy as np
    from ddf.stainer import ShuffleStainer, InflectionStainer, RowDuplicateStainer
    from ddf.DirtyDF import DirtyDF

.. GENERATED FROM PYTHON SOURCE LINES 15-17

Single Stainer Example
^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 19-21

For the first example, let us once again use the basic dataset of 6 rows and 2 columns: an integer ID and an animal class.

.. GENERATED FROM PYTHON SOURCE LINES 21-24

.. code-block:: default

    animal = pd.DataFrame([(0, 'Cat'), (1, 'Dog'), (2, 'Rabbit'), (3, 'Cat'), (4, 'Cat'), (5, 'Dog')],
                          columns=('id', 'class'))

.. GENERATED FROM PYTHON SOURCE LINES 25-27

Let us convert the pandas dataframe into a DirtyDF object. We specify a seed for the numpy random generator, which will be used for the staining.

.. GENERATED FROM PYTHON SOURCE LINES 27-29

.. code-block:: default

    animal_ddf = DirtyDF(animal, seed=123)

.. GENERATED FROM PYTHON SOURCE LINES 30-31

For now, let us use only one stainer: ShuffleStainer.

.. GENERATED FROM PYTHON SOURCE LINES 31-33

.. code-block:: default

    shuffle_stainer = ShuffleStainer()

.. GENERATED FROM PYTHON SOURCE LINES 34-36

Instead of calling the stainer's transform method directly, we now add the stainer to the DirtyDF object, to be used later when calling the DDF.run_stainer() method.

.. GENERATED FROM PYTHON SOURCE LINES 36-38

.. code-block:: default

    animal_ddf2 = animal_ddf.add_stainers(shuffle_stainer)

.. GENERATED FROM PYTHON SOURCE LINES 39-41

Note that the DDF methods return new DDF objects and do not change the DDF in-place. This can be verified by checking the stainers currently stored in each DDF using the .summarise_stainers() method.

.. GENERATED FROM PYTHON SOURCE LINES 41-43

.. code-block:: default

    animal_ddf.summarise_stainers()  # empty

.. GENERATED FROM PYTHON SOURCE LINES 44-46

.. code-block:: default

    animal_ddf2.summarise_stainers()  # ShuffleStainer present

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    1. Shuffle

.. GENERATED FROM PYTHON SOURCE LINES 47-48

We run the stainer by calling the .run_stainer() method.

.. GENERATED FROM PYTHON SOURCE LINES 48-50

.. code-block:: default

    animal_ddf3 = animal_ddf2.run_stainer()

.. GENERATED FROM PYTHON SOURCE LINES 51-53

As before, the above call returns a new DDF object. To view the dataframe content of a DDF object, we can use the .get_df() method.

.. GENERATED FROM PYTHON SOURCE LINES 53-55

.. code-block:: default

    animal_ddf3.get_df()

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

       id   class
    0   4     Cat
    1   0     Cat
    2   2  Rabbit
    3   3     Cat
    4   1     Dog
    5   5     Dog

.. GENERATED FROM PYTHON SOURCE LINES 56-57

Notice that animal_ddf2 still contains the original df and still holds ShuffleStainer, which has not yet been run.

.. GENERATED FROM PYTHON SOURCE LINES 57-59

.. code-block:: default

    animal_ddf2.get_df()

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

       id   class
    0   0     Cat
    1   1     Dog
    2   2  Rabbit
    3   3     Cat
    4   4     Cat
    5   5     Dog

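To make the relationship between the two tables above concrete, here is a minimal sketch in plain pandas that relates each original row to its position after shuffling. It does not call ddf at all: the shuffled frame is typed in by hand from the output shown above, and the names ``id_to_orig_pos`` and ``row_map`` are illustrative only, not part of the library. It also assumes that the 'id' column uniquely identifies each original row, which holds for this small dataset.

```python
import pandas as pd

# Original frame, as defined earlier on this page.
original = pd.DataFrame(
    [(0, "Cat"), (1, "Dog"), (2, "Rabbit"), (3, "Cat"), (4, "Cat"), (5, "Dog")],
    columns=("id", "class"),
)

# Shuffled frame, copied by hand from the animal_ddf3.get_df() output above.
shuffled = pd.DataFrame(
    [(4, "Cat"), (0, "Cat"), (2, "Rabbit"), (3, "Cat"), (1, "Dog"), (5, "Dog")],
    columns=("id", "class"),
)

# Assumption: 'id' is unique, so it can be used to look up a row's original position.
id_to_orig_pos = {row_id: pos for pos, row_id in enumerate(original["id"].tolist())}

# Map each original row position to the list of positions it occupies afterwards.
row_map = {}
for new_pos, row_id in enumerate(shuffled["id"].tolist()):
    row_map.setdefault(id_to_orig_pos[row_id], []).append(new_pos)

# A shuffle only reorders rows, so the ids must be a permutation of the originals.
is_permutation = sorted(shuffled["id"].tolist()) == original["id"].tolist()

print(row_map)         # {4: [0], 0: [1], 2: [2], 3: [3], 1: [4], 5: [5]}
print(is_permutation)  # True
```

Each key is a row position in the original frame and each value lists where that row ended up. For a pure shuffle every list has exactly one entry.
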
.. GENERATED FROM PYTHON SOURCE LINES 60-62

On the other hand, since ShuffleStainer has already been run to obtain animal_ddf3, we can verify that animal_ddf3 no longer contains ShuffleStainer.

.. GENERATED FROM PYTHON SOURCE LINES 62-64

.. code-block:: default

    animal_ddf3.summarise_stainers()  # empty

.. GENERATED FROM PYTHON SOURCE LINES 65-67

We can view the history of stainers that were run to obtain animal_ddf3 (in this case, only ShuffleStainer's history) by using the DDF.print_history() method.

.. GENERATED FROM PYTHON SOURCE LINES 67-69

.. code-block:: default

    animal_ddf3.print_history()

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    1. Shuffle
    Order of rows randomized
    Time taken: 0.0010290145874023438

.. GENERATED FROM PYTHON SOURCE LINES 70-71

We can also obtain the row and column mappings from the original df to the latest transformed df.

.. GENERATED FROM PYTHON SOURCE LINES 71-73

.. code-block:: default

    # index=0 since only 1 stainer was used; axis=0 specifies rows.
    animal_ddf3.get_map_from_history(index=0, axis=0)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    {4: [0], 0: [1], 2: [2], 3: [3], 1: [4], 5: [5]}

.. GENERATED FROM PYTHON SOURCE LINES 74-76

.. code-block:: default

    # axis=1 specifies columns. Note that ShuffleStainer doesn't alter columns.
    animal_ddf3.get_map_from_history(index=0, axis=1)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    {0: [0], 1: [1]}

.. GENERATED FROM PYTHON SOURCE LINES 77-79

Multiple Stainers Example
^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 81-83

Now let's get to the beauty of DirtyDF: using multiple stainers for transformation. For this example, we use 3 stainers, namely ShuffleStainer, InflectionStainer, and RowDuplicateStainer.

.. GENERATED FROM PYTHON SOURCE LINES 83-88

.. code-block:: default

    shuffle_stainer = ShuffleStainer()
    dup_stainer = RowDuplicateStainer(deg=0.6, max_rep=3)
    inflection_stainer = InflectionStainer(num_format=2, formats=['lowercase', 'uppercase'])

.. GENERATED FROM PYTHON SOURCE LINES 89-91

We work with the same dataset as before. However, note that we have to explicitly convert the 'class' column to the 'category' type so that InflectionStainer can detect it as categorical and be applied to it automatically.

.. GENERATED FROM PYTHON SOURCE LINES 91-93

.. code-block:: default

    animal["class"] = animal["class"].astype("category")

.. GENERATED FROM PYTHON SOURCE LINES 94-96

We can add multiple stainers at a time by passing a list of stainers into the .add_stainers() method.

.. GENERATED FROM PYTHON SOURCE LINES 96-100

.. code-block:: default

    animal_ddf_mult = DirtyDF(animal).add_stainers([shuffle_stainer, dup_stainer, inflection_stainer])
    animal_ddf_mult.summarise_stainers()

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    1. Shuffle
    2. Add Duplicates
    3. Inflection

.. GENERATED FROM PYTHON SOURCE LINES 101-102

We can now run the stainers one by one by applying the .run_stainer() method repeatedly.

.. GENERATED FROM PYTHON SOURCE LINES 104-108

.. note::
    Stainers are run in the order in which they were added. This order can be altered using the DDF.reindex_stainer() method, or
    shuffled using the DDF.shuffle_stainer() method. However, not every ordering is valid,
    since some stainers may need to come before or after others.

.. GENERATED FROM PYTHON SOURCE LINES 108-111

.. code-block:: default

    animal_ddf_mult2 = animal_ddf_mult.run_stainer().run_stainer().run_stainer()

.. GENERATED FROM PYTHON SOURCE LINES 112-113

Alternatively, .run_all_stainers() runs all the stainers in sequence with a single call.

.. GENERATED FROM PYTHON SOURCE LINES 113-115

.. code-block:: default

    animal_ddf_mult3 = animal_ddf_mult.run_all_stainers()  # does the same as above

.. GENERATED FROM PYTHON SOURCE LINES 116-118

.. code-block:: default

    animal_ddf_mult3.print_history()

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    1. Shuffle
    Order of rows randomized
    Time taken: 0.000997304916381836

    2. Add Duplicates
    Added Duplicate Rows for 3 rows.
    Each duplicated row should appear a maximum of 3 times.
    Rows added: 6
    Time taken: 0.0019969940185546875

    3. Inflection
    Categorical inflections on:
    {'class': {'Dog': ['DOG', 'dog'], 'Cat': ['CAT', 'cat'], 'Rabbit': ['rabbit', 'RABBIT']}}
    Time taken: 0.0029892921447753906

.. GENERATED FROM PYTHON SOURCE LINES 119-120

We can now view the transformed dataframe.

.. GENERATED FROM PYTHON SOURCE LINES 120-122

.. code-block:: default

    animal_ddf_mult3.get_df()

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

        id   class
    0    5     dog
    1    5     DOG
    2    5     DOG
    3    4     CAT
    4    1     dog
    5    0     cat
    6    0     CAT
    7    0     cat
    8    3     cat
    9    3     CAT
    10   3     CAT
    11   2  rabbit

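As a sanity check on the transformed dataframe above, the following sketch uses plain pandas (no ddf calls) to compare it against the original data. The stained frame is copied by hand from the displayed output, and the variable names are illustrative assumptions rather than anything the library provides. It checks that the row counts agree with the "Rows added: 6" entry in the history, and that undoing the case changes and dropping repeated rows gives back the original six records.

```python
import pandas as pd

# Original data, as defined at the top of this page.
original = pd.DataFrame(
    [(0, "Cat"), (1, "Dog"), (2, "Rabbit"), (3, "Cat"), (4, "Cat"), (5, "Dog")],
    columns=("id", "class"),
)

# Stained data, copied by hand from the animal_ddf_mult3.get_df() output above.
stained = pd.DataFrame(
    [(5, "dog"), (5, "DOG"), (5, "DOG"), (4, "CAT"), (1, "dog"), (0, "cat"),
     (0, "CAT"), (0, "cat"), (3, "cat"), (3, "CAT"), (3, "CAT"), (2, "rabbit")],
    columns=("id", "class"),
)

# How many rows did duplication add, and which ids were duplicated?
rows_added = len(stained) - len(original)
copies_per_id = stained["id"].value_counts()
duplicated_ids = sorted(copies_per_id[copies_per_id > 1].index.tolist())
max_copies = int(copies_per_id.max())

# Undo the stains: normalise case, drop repeated rows, restore the id order.
recovered = (
    stained.assign(**{"class": stained["class"].str.lower()})
    .drop_duplicates()
    .sort_values("id")
    .reset_index(drop=True)
)
expected = original.assign(**{"class": original["class"].str.lower()})

print(rows_added)                  # 6
print(duplicated_ids)              # [0, 3, 5]
print(max_copies)                  # 3
print(recovered.equals(expected))  # True
```

The comparison is made on lower-cased class labels because the inflected output no longer carries the original capitalisation.
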