## Processing Large Data with Pandas

### By Evelyn J. Boettcher

#### DiDacTex, LLC

Py Ohio 2022
### Link to Repo

[https://github.com/ejboettcher/evelyn_j_boettcher_cv](https://github.com/ejboettcher/evelyn_j_boettcher_cv/tree/main/_pandas)

---

## Motivation

Data sets can get large quickly. You can go from looking at a few hundred rows and a handful of columns to a million rows and hundreds of columns.
### If you find yourself saying,

#### "My data is BIG data, so I need a better computer."

Well, this talk is for you. I will show how Python's Pandas library, with smart use of categories, can let you read in large data on a laptop. Luckily, we live in a time where the processing power in a laptop would have made the engineers who put a man on the moon swoon.

---

## Large Data

Big data has been around for a long time.

### Attributes of Large Data

* Access controlled
* Shows what it shows
* Grows
---

## Access controlled: planet datasets

In the 1600s, Johannes Kepler used Tycho Brahe's large dataset of planet observations to prove how planets orbit the Sun.
+ Unfortunately, Kepler had to wait until Tycho Brahe died before he could get a hold of a large data set.
+ Many thought he played a hand in Brahe's death.

---

## Large Data: Shows what it shows

You always have to take time to understand it.

### Eclipse of Io

In 1676, Ole Roemer (Rømer), armed with only paper, pencil, a telescope, and a mechanical watch, calculated the speed of LIGHT! He was within 30%, simply by looking at Jupiter's moon Io.

### Prior to this, people thought that light was instantaneous!

His prediction for Io was off by a few minutes.

---

## Measuring the Speed of Light

This required LOTS of observations over years and years by multiple people. Roemer was calculating when Io would have an eclipse, not the speed of light; so he must have had to recalculate his calculations over and over again by hand to ensure there was not a mistake in the math or the observations.
### Roemer's Conclusion

+ My work is correct,
+ the timing is off because light moves at a finite speed.

---

## Our Large Data

We have it a bit easier, only having to deal with:

* NaNs: Not A Number
* Slow download speeds

We still have to deal with:

* Access control
* Easter eggs in the data
* Data growth
### But, you say,

#### "My data is BIG (larger than my computer's working memory), and I want an answer today, not in a decade."

### Let's talk about the Python library: Pandas.

---

## Python Pandas

Python's Pandas is a high-performance, easy-to-use library for analyzing structured data: CSV files, JSON, SQLite, etc.
Pandas is fast, powerful, and flexible. It enables you to quickly parse data. **But** it is mainly designed to handle data sets of roughly <100 MB.
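To see where you stand, you can ask a DataFrame how much memory it actually uses. A minimal sketch (the DataFrame below is just a stand-in for your own data):

```Python
import pandas as pd

# Stand-in DataFrame; substitute the one you loaded from your own file
df = pd.DataFrame({"a": range(1_000_000), "b": ["some text"] * 1_000_000})

df.info(memory_usage="deep")                          # per-column dtypes plus a total
print(df.memory_usage(deep=True).sum() / 1e6, "MB")   # same total, as a number
```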
There are other tools like Spark to handle LARGE data sets (100 gigabytes to terabytes), but...

#### Pandas does an amazing job at cleaning messy / real data.

---

## How does one handle large data with Pandas?

When you have a gigabyte of real-world data and you want to:

* Explore it
* Use your laptop, and
* Not switch to Spark.
### Step One:

Use old programming tricks, like setting numbers to ***int8***, ***float16***, or ***float32***, to reduce the memory size of your DataFrame.
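As a rough sketch of this first step (the DataFrame and column names here are made up for illustration), downcasting with Pandas might look like:

```Python
import numpy as np
import pandas as pd

# Hypothetical DataFrame that Pandas builds with 64-bit dtypes by default
df = pd.DataFrame({
    "counts": np.random.randint(1, 5, size=1_000_000),   # values 1-4 easily fit in int8
    "reading": np.random.rand(1_000_000) * 3 + 1,         # float64 by default
})
print(df.memory_usage(deep=True).sum() / 1e6, "MB before")

# Downcast to the smallest dtype that still holds the values
df["counts"] = pd.to_numeric(df["counts"], downcast="integer")  # -> int8 here
df["reading"] = df["reading"].astype("float32")                 # half the bytes of float64

print(df.memory_usage(deep=True).sum() / 1e6, "MB after")
```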
---

## Step Two: Use Categories

If you have strings that ***repeat***, SWITCH to

### categories

### String Examples

#### Non-repeating strings

```Python
string_list = ['Hello', 'World', 'More Strings', 'Evelyn', 'Boettcher']
```

#### Repeating strings

Many times string data will be repetitive, like days of the week.

```Python
val_days = ['Monday', 'Tuesday', 'Monday', 'Wednesday', 'Monday', 'Thursday',
            'Friday', 'Saturday', 'Monday', 'Monday', 'Sunday']
```

---

## Data Memory Footprint (sans overhead)

#### Numbers

| memory usage | int | uint | float | bool | complex |
|:---:|:----|:----|:----|:----|:----|
| 1 byte  | int8 (-128 to 127) | uint8 (0 to 255) | | bool | |
| 2 bytes | int16 (-32768 to 32767) | uint16 (0 to 65535) | float16 (half precision) | | |
| 4 bytes | int32 | uint32 | float32 (single precision) | | |
| 8 bytes | int64 | uint64 | float64 | | complex64 (rep. by 2 32-bit floats) |

#### Strings

Python uses three kinds of internal representations for Unicode strings:

* 1 byte per char (Latin-1 encoding)
* 2 bytes per char (UCS-2 encoding)
* 4 bytes per char (UCS-4 encoding)

---

## How do Pandas categories work?

The Pandas category type uses integer values to map to the raw values in a column.
This mapping is useful whenever a column contains a limited set of values. So instead of storing

#### df.mydays = ["Sunday", "Sunday", "Sunday"]

Pandas categories say

#### Sunday = 1

and the DataFrame **in memory** is effectively now

#### df.mydays = [1, 1, 1]

---

## Convert Data to Categories

To convert a column of data to the category type, we set the data type (dtype).

```Python
df['column name'] = df['column name'].astype('category')
```

![SundaySunday df](_pandas/images/SundaySunday.png)

### But wait...

Converting to categories is not always helpful.
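One way to check, before committing, is simply to measure. A minimal sketch with a made-up column:

```Python
import pandas as pd

# Hypothetical repeating-string column
days = pd.Series(["Sunday", "Monday", "Sunday", "Sunday", "Monday"] * 2_000)
as_cat = days.astype("category")

# Peek at the mapping Pandas keeps in memory
print(as_cat.cat.categories.tolist())    # ['Monday', 'Sunday']
print(as_cat.cat.codes.head().tolist())  # integer codes, e.g. [1, 0, 1, 1, 0]

# Measure before and after: categories are not always a win
print("object dtype:", days.memory_usage(deep=True) / 1e6, "MB")
print("category    :", as_cat.memory_usage(deep=True) / 1e6, "MB")
```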
The following examples will show the power and pitfalls of categories.

---

## Examples

The following scripts are located here: [Pandas Scripts](https://github.com/ejboettcher/evelyn_j_boettcher_cv/tree/main/_pandas/scripts)

You can run these scripts and see for yourself. The scripts build the DataFrame and then measure its size before and after conversion to categories.

---

## Big Data: Numbers Only

* Script: int_floats_cats.py (a simplified sketch of the idea follows, then the script's output)
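The repo script is the source of truth; this is only a simplified sketch of the same experiment, with random ints 1-4 stored as int64, int8, and float64, then each column converted to a category and measured:

```Python
import numpy as np
import pandas as pd

n = 10_000
rng = np.random.default_rng()

df = pd.DataFrame({
    "INT":   rng.integers(1, 5, size=n),                 # int64 by default
    "INT8":  rng.integers(1, 5, size=n).astype("int8"),
    "FLOAT": rng.uniform(1, 4, size=n),                  # float64
})

def mb(series):
    """Memory use of a Series, in MB."""
    return series.memory_usage(deep=True) / 1e6

for col in df.columns:
    plain = mb(df[col])
    as_cat = mb(df[col].astype("category"))
    print(f"{col:6s} {plain:.4f}MB -> {as_cat:.4f}MB as category")
```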
```bash
Building three DF with:
Range of numbers 1 - 4
Length of DF 10_000

Top 10 rows of data
   INT  INT8     FLOAT
1    2     4  3.329446
2    2     4  1.478631
3    2     3  1.338759
4    3     1  1.846993
5    1     2  2.787727
6    4     3  1.116634
7    1     2  1.064875
8    3     3  3.146007
9    3     2  2.201730

Size of the DF we just made
INT    0.0764MB
INT8   0.0096MB
FLOAT  0.0764MB
____________________
As categories (plain df, category, SAVINGS):
INT df  ---> 0.0764MB , 0.0098MB , 87 %
INT8 df ---> 0.0096MB , 0.0098MB , -2 %
___NOTE______NOTE______NOTE______NOTE___
Categories reduced the size INT df!!
BUT because of the overhead it did not reduce the int8 size

Now, lets try this with random Floats
Float          ---> 0.0764MB
Float category ---> 0.4079MB
Categories only made the DF memory use worse
All Done
```

---

## Large Data: Numbers cont.

### Increase SIZE

```bash
python int_floats_cats.py -r 4 -n 1_000_000
```

```Bash
Building three DF with:
range of numbers 1 - 4
length of DF 1_000_000

Top 10 rows of data
   INT  INT8     FLOAT
1    2     2  2.684429
2    2     4  1.696423
3    1     2  2.607256
4    3     1  1.247054
5    4     3  2.413925
6    2     3  3.141839
7    1     1  3.438698
8    2     2  3.584329
9    1     3  1.019402

Getting size of the DF we just made
INT    7.6295MB
INT8   0.9538MB
FLOAT  7.6295MB
____________________
As categories (plain df, category, SAVINGS)
INT df  ---> 7.6295MB , 0.9539MB , 87 %
INT8 df ---> 0.9538MB , 0.9539MB , 0 %
___NOTE______NOTE______NOTE______NOTE___
Categories reduced the size INT df!!
BUT because of the overhead it did not reduce the int8 size

Now, lets try this with random Floats
Float          ---> 7.6295MB
Float category ---> 51.4442MB
Categories only made the DF memory use worse
All Done
```

---

## Large Data: Strings

So let's see how we can reduce the size of STRING arrays. If you have a column of strings that repeats, say days of the week, states, etc., then you might save memory if you switch to categories.

```bash
python strings_cat.py
```

```Bash
Building three DF with:
length of random string in a row 1 - 4
length of DF 10_000

Top 10 rows of data
        Days  HELLO      Locations     Days_c HELLO_c    Locations_c Random_String
1    Tuesday  World    Beavercreek    Tuesday   World    Beavercreek          xhka
2  Wednesday  Hello        Oakwood  Wednesday   Hello        Oakwood          wiwh
3   Thursday  World      Fairfield   Thursday   World      Fairfield          gahf
4     Friday  Hello  Huber Heights     Friday   Hello  Huber Heights          ldou
5   Saturday  World      Riverdale   Saturday   World      Riverdale          rynl
6     Sunday  Hello         Dayton     Sunday   Hello         Dayton          oasv
7     Monday  World    Beavercreek     Monday   World    Beavercreek          nuym
8    Tuesday  Hello        Oakwood    Tuesday   Hello        Oakwood          gjvs
9  Wednesday  World      Fairfield  Wednesday   World      Fairfield          nmwn

Getting size of the DF we just made
String NO Categories      2.64MB
String WITH Categories    0.0306MB
Random String (1 column)  0.5MB
____________________
As categories
String Columns: HELLO, Locations, Days
(plain df, category, SAVINGS)
NO CAT to category ---> 2.67MB , 0.03MB , 98 %
Cat df to category ---> 0.0306MB , 0.0306MB , 0 %
___NOTE______NOTE______NOTE______NOTE___

Now, lets try this with random STRINGS
String: Random          ---> 0.5818MB
String: Random category ---> 0.9068MB
Categories only made the DF memory use worse
.
```

---

## Large Data: STRINGS

#### Now, let's make this BIGGGG

* 1 million rows
* 4 chars long in each cell

```bash
python strings_cat.py -n 1_000_000 -r 4
```

```Bash
Building three DF with:
length of random string in a row 1 - 4
length of DF 1_000_000

Top 10 rows of data
      Days  HELLO      Locations     Days_c HELLO_c    Locations_c Random_String
   Tuesday  World    Beavercreek    Tuesday   World    Beavercreek          qbzd
 Wednesday  Hello        Oakwood  Wednesday   Hello        Oakwood          wixv
  Thursday  World      Fairfield   Thursday   World      Fairfield          vwjs
    Friday  Hello  Huber Heights     Friday   Hello  Huber Heights          xiiz
  Saturday  World      Riverdale   Saturday   World      Riverdale          owon
    Sunday  Hello         Dayton     Sunday   Hello         Dayton          jihl
    Monday  World    Beavercreek     Monday   World    Beavercreek          xwon
   Tuesday  Hello        Oakwood    Tuesday   Hello        Oakwood          zchj
 Wednesday  World      Fairfield  Wednesday   World      Fairfield          hnjt

Size of the DF we just made
String NO Categories      267.2MB
String WITH Categories    2.8MB
Random String (1 column)  58.17MB
____________________
As categories
String Columns: HELLO, Locations, Days
(plain df, category, SAVINGS)
NO CAT to category ---> 267.2MB , 2.8MB , 98 %
Cat df to category ---> 2.8MB , 2.8MB , 0 %
___NOTE______NOTE______NOTE______NOTE___

Now, lets try this with random STRINGS
String: Random          ---> 58.1MB
String: Random category ---> 47.3MB
____________________________________________________
WHAT....... There was an improvement!!!
HOW DID THAT HAPPEN????
____________________________________________________
.
```

#### There was an improvement!!! HOW DID THAT HAPPEN????

---

#### Interesting Fact:

When we have a random string of characters of length 4 (26 chars in the alphabet),
there are only 26 x 26 x 26 x 26 = 456,976 possible strings, which is fewer than our 1,000,000 rows.

Therefore, over half of the strings must repeat!

---

## Larger Data: strings cont.

```bash
python strings_cat.py -n 10_000_000 -s 1 -r 6
```

This should produce two csv files:
* *my_awesome.csv* with a size of 550.6MB and
* *my_awesome_cat.csv* with a size of 486.2MB (no random string column)
---

```bash
Starting
Building three DF with:
length of random string in a row 1 - 6
length of DF 10_000_000

Top 10 rows of data
        Days  HELLO  ...    Locations_c Random_String
1    Tuesday  World  ...    Beavercreek        ahlcks
2  Wednesday  Hello  ...        Oakwood        rwccxh
3   Thursday  World  ...      Fairfield        ihyieo
4     Friday  Hello  ...  Huber Heights        rtxevt
5   Saturday  World  ...      Riverdale        whpjhe
6     Sunday  Hello  ...         Dayton        vktted
7     Monday  World  ...    Beavercreek        klajfi
8    Tuesday  Hello  ...        Oakwood        eneums
9  Wednesday  World  ...      Fairfield        jvruya

[9 rows x 7 columns]

Size of the DF we just made
String NO Categories      2672.3MB
String WITH Categories    28.6MB
Random String (1 column)  600.8MB
____________________
As categories
String Columns: HELLO, Locations, Days
(plain df, category, SAVINGS)
NO CAT to category ---> 2672.8MB , 28.6122MB , 98 %
Cat df to category ---> 28.6MB , 28.6123MB , 0 %
___NOTE______NOTE______NOTE______NOTE___

Now, lets try this with random STRINGS
String: Random          ---> 600.8MB
String: Random category ---> 949.3MB
Categories only made the DF memory use worse
.
```

---

## Read in... `my_awesome_cat.csv`

(File size: ~486MB)

* Same data as my_awesome.csv, but **without** the column of random strings

#### WOW

The data in memory is **LESS** than the file size, by almost 90%!

```Bash
USING CATEGORIES for columns that make sense.
__________________________________________________
It took this many seconds to read in the csv file
7.85
__________________________________________________
Now lets look at size
Memory of Df is: 57.2MB

Top 4 rows of data
   HELLO      Locations       Days HELLO_c    Locations_c     Days_c
1  World    Beavercreek    Tuesday   World    Beavercreek    Tuesday
2  Hello        Oakwood  Wednesday   Hello        Oakwood  Wednesday
3  World      Fairfield   Thursday   World      Fairfield   Thursday
4  Hello  Huber Heights     Friday   Hello  Huber Heights     Friday

13    Sunday
20    Sunday
27    Sunday
Name: Days, dtype: category
Categories (7, object): [Friday, Monday, Saturday, Sunday, Thursday, Tuesday, Wednesday]
__________________________________________________
```
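For reference, reading a csv straight into categories can be done by passing a dtype mapping to `pd.read_csv`. A sketch using the column names from the output above (the exact read call in the repo script may differ):

```Python
import pandas as pd

# Columns that repeat, so they are safe to store as categories
category_cols = ["HELLO", "Locations", "Days", "HELLO_c", "Locations_c", "Days_c"]

df = pd.read_csv(
    "my_awesome_cat.csv",
    dtype={col: "category" for col in category_cols},
)

print(df.memory_usage(deep=True).sum() / 1e6, "MB in memory")
print(df["Days"].dtype)  # category
```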
---

## Conclusion

* Pandas can handle data that most *GUI* readers (LibreOffice, Excel) cannot read in.
* We reduced the size of data in working memory to LESS than the file size at rest (sometimes)!
* Reduction can be as large as 90%
* Categories can be **Helpful** or **Hurtful** when we are dealing with large data.
* Please use Pandas with care

---

## Questions

### Prayer to PyLint

Code, don't fail again
CI in the cloud please take
pylint be my friend
### Link to Repo

[https://github.com/ejboettcher/evelyn_j_boettcher_cv/tree/main/_pandas](https://github.com/ejboettcher/evelyn_j_boettcher_cv/tree/main/_pandas)

---

## References

* https://www.dataquest.io/blog/pandas-big-data/
* https://pandas.pydata.org/
* https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html

### Slides

These slides are written in Markdown and rendered into HTML via the Jekyll framework using RevealJS. They use, with modification, the following:

* [RevealJS](https://revealjs.com/)
* [RevealJS in Jekyll Framework](https://github.com/dploeger/jekyll-revealjs)