# Looping Over Data Sets
---
questions:
- "How can I process many data sets with a single command?"
objectives:
- "Be able to read and write globbing expressions that match sets of files."
- "Use glob to create lists of files."
- "Write for loops to perform operations on files given their names in a list."
keypoints:
- "Use a `for` loop to process files given a list of their names."
- "Use `glob.glob` to find sets of files whose names match a pattern."
- "Use `glob` and `for` to process batches of files."

## Use a `for` loop to process files given a list of their names.

*   A filename is a character string.
*   Lists can contain character strings, so a list can hold filenames.


```python
import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())
```

```
data/gapminder_gdp_africa.csv gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
gdpPercap_1972    464.099504
gdpPercap_1977    502.319733
gdpPercap_1982    462.211415
gdpPercap_1987    389.876185
gdpPercap_1992    410.896824
gdpPercap_1997    312.188423
gdpPercap_2002    241.165877
gdpPercap_2007    277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952    331.0
gdpPercap_1957    350.0
gdpPercap_1962    388.0
gdpPercap_1967    349.0
gdpPercap_1972    357.0
gdpPercap_1977    371.0
gdpPercap_1982    424.0
gdpPercap_1987    385.0
gdpPercap_1992    347.0
gdpPercap_1997    415.0
gdpPercap_2002    611.0
gdpPercap_2007    944.0
dtype: float64
```


## Use [`glob.glob`](https://docs.python.org/3/library/glob.html#glob.glob) to find sets of files whose names match a pattern.

*   In Unix, the term "globbing" means "matching a set of files with a pattern".
*   The most common patterns are:
    *   `*` meaning "match zero or more characters"
    *   `?` meaning "match exactly one character"
*   Python's standard library contains the [`glob`](https://docs.python.org/3/library/glob.html) module to provide pattern matching functionality.
*   The [`glob`](https://docs.python.org/3/library/glob.html) module contains a function, also called `glob`, that matches file patterns.
*   E.g., `glob.glob('*.txt')` matches all files in the current directory
    whose names end with `.txt`.
*   The result is a (possibly empty) list of character strings.
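
To see how `*` and `?` differ, the following sketch creates a few empty files in a temporary directory and matches them with each wildcard. The file names (`run1.txt` and so on) are made up for illustration and are not part of the lesson data. `sorted` is used because `glob.glob` returns its matches in arbitrary order.

```python
import glob
import os
import tempfile

# Hypothetical file names, used only to illustrate the wildcards.
names = ['run1.txt', 'run2.txt', 'run10.txt', 'notes.md']

with tempfile.TemporaryDirectory() as tmpdir:
    for name in names:
        # Create an empty file with this name.
        open(os.path.join(tmpdir, name), 'w').close()

    # `*` matches zero or more characters.
    star_paths = glob.glob(os.path.join(tmpdir, 'run*.txt'))
    # `?` matches exactly one character, so 'run10.txt' is excluded.
    question_paths = glob.glob(os.path.join(tmpdir, 'run?.txt'))

# Keep only the file names so the output does not depend on the temporary path.
star = sorted(os.path.basename(path) for path in star_paths)
question = sorted(os.path.basename(path) for path in question_paths)

print('run*.txt matches:', star)
print('run?.txt matches:', question)
```

Neither pattern matches `notes.md`, because both require the name to start with `run` and end with `.txt`.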


```python
import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))
```

```
all csv files in data directory: ['data\\gapminder_all.csv', 'data\\gapminder_gdp_africa.csv', 'data\\gapminder_gdp_americas.csv', 'data\\gapminder_gdp_asia.csv', 'data\\gapminder_gdp_europe.csv', 'data\\gapminder_gdp_oceania.csv']
```

*   The backslashes in this output come from running the example on Windows;
    on macOS and Linux the paths contain forward slashes instead.


```python
print('all PDB files:', glob.glob('*.pdb'))
```

```
all PDB files: []
```


## Use `glob` and `for` to process batches of files.

*   Batch processing is much easier if the files are named and stored systematically and consistently,
    so that simple patterns will find the right data.

```python
for filename in glob.glob('data/gapminder_*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())
```

```
data\gapminder_all.csv 298.8462121
data\gapminder_gdp_africa.csv 298.8462121
data\gapminder_gdp_americas.csv 1397.7171369999999
data\gapminder_gdp_asia.csv 331.0
data\gapminder_gdp_europe.csv 973.5331947999999
data\gapminder_gdp_oceania.csv 10039.595640000001
```


*   The pattern above matches the complete data set (`gapminder_all.csv`) as well as the per-region files.
*   The exercises use a more specific pattern to exclude the complete data set.
*   Note that the minimum of the complete data set equals the minimum of one of the regional data sets,
    which is a useful check on correctness.
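
The correctness check described above can be sketched in code. This example does not use the lesson's files: it writes two tiny regional files and one combined file, all with made-up values, to a temporary directory, and it assumes the narrower pattern `gapminder_gdp_*.csv` is a reasonable way to skip the combined file.

```python
import glob
import os
import tempfile

import pandas as pd

# Made-up stand-ins for the lesson's files: two regional tables.
regions = {
    'africa': pd.DataFrame({'country': ['A', 'B'], 'gdpPercap_1952': [300.0, 450.0]}),
    'asia': pd.DataFrame({'country': ['C', 'D'], 'gdpPercap_1952': [331.0, 900.0]}),
}

with tempfile.TemporaryDirectory() as tmpdir:
    # Write one file per region plus one combined file.
    for region, frame in regions.items():
        frame.to_csv(os.path.join(tmpdir, f'gapminder_gdp_{region}.csv'), index=False)
    combined = pd.concat(list(regions.values()))
    combined.to_csv(os.path.join(tmpdir, 'gapminder_all.csv'), index=False)

    # 'gapminder_gdp_*.csv' matches the regional files but not 'gapminder_all.csv'.
    regional_files = sorted(glob.glob(os.path.join(tmpdir, 'gapminder_gdp_*.csv')))
    regional_minima = [pd.read_csv(path)['gdpPercap_1952'].min() for path in regional_files]

    all_data = pd.read_csv(os.path.join(tmpdir, 'gapminder_all.csv'))
    overall_minimum = all_data['gdpPercap_1952'].min()

# The smallest regional minimum should equal the minimum of the combined file.
print(min(regional_minima) == overall_minimum)
```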

:::{admonition} Exercise: Determining Matches
Which of these files is *not* matched by the expression `glob.glob('data/*as*.csv')`?
1. `data/gapminder_gdp_africa.csv`
2. `data/gapminder_gdp_americas.csv`
3. `data/gapminder_gdp_asia.csv`
4. 1 and 2 are not matched.                
:::
            
:::{admonition} See Solution
:class: tip, dropdown
1 is not matched by the glob.
:::
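
:::{admonition} Checking the Answer in Code
:class: tip, dropdown
One way to check an answer like this without any files on disk is the standard library's `fnmatch` module, which applies the same wildcard rules to plain strings. This is only a sketch: unlike `glob`, `fnmatch` does not treat the path separator specially, which makes no difference for this pattern.

```python
import fnmatch

candidates = [
    'data/gapminder_gdp_africa.csv',
    'data/gapminder_gdp_americas.csv',
    'data/gapminder_gdp_asia.csv',
]

# Keep the names that match the pattern from the exercise.
matched = fnmatch.filter(candidates, 'data/*as*.csv')
not_matched = [name for name in candidates if name not in matched]

print('matched:', matched)
print('not matched:', not_matched)
```
:::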

:::{admonition} Exercise: Minimum File Size
Modify this program so that it prints the number of records in the file that has the fewest records.
```python
import glob
import pandas as pd
fewest = ____
for filename in glob.glob('data/*.csv'):
    dataframe = pd.____(filename)
    fewest = min(____, dataframe.shape[0])
print('smallest file has', fewest, 'records')
```
Note that the [`shape` attribute](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) is a tuple containing the number of rows and columns of the data frame.
:::

:::{admonition} See Solution
:class: tip, dropdown
```python
import glob
import pandas as pd
fewest = float('Inf')
for filename in glob.glob('data/*.csv'):
    dataframe = pd.read_csv(filename)
    fewest = min(fewest, dataframe.shape[0])
print('smallest file has', fewest, 'records')
```
:::

:::{admonition} Exercise: Comparing Data
Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart.

:::

:::{admonition} See Solution
:class: tip, dropdown
This solution builds a useful legend by using the string [`split`](https://docs.python.org/3/library/stdtypes.html#str.split) method to
extract the `region` from the path `'data/gapminder_gdp_a_specific_region.csv'`. The [`pathlib`](https://docs.python.org/3/library/pathlib.html) module also provides useful abstractions for file and path manipulation, such as returning the name of a file without the file extension.
```python
import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
    dataframe = pd.read_csv(filename)
    # Extract <region> from the filename, which is expected to have the form
    # 'data/gapminder_gdp_<region>.csv': split on `_`, take the last piece
    # (`<region>.csv`), and drop the 4-character `.csv` extension.
    region = filename.split('_')[-1][:-4]
    # The `country` column is not numeric, so restrict the mean to numeric columns;
    # recent pandas versions raise an error otherwise.
    dataframe.mean(numeric_only=True).plot(ax=ax, label=region)
plt.legend()
plt.show()
```
:::
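
:::{admonition} Extracting the Region with `pathlib`
:class: tip, dropdown
As a possible alternative to slicing off the last four characters, the region can be extracted with `pathlib`. This is a minimal sketch that assumes the same `gapminder_gdp_<region>.csv` naming scheme; the example path is hard-coded so the snippet runs without the data files.

```python
from pathlib import Path

filename = 'data/gapminder_gdp_asia.csv'

# `stem` is the final path component without its extension.
stem = Path(filename).stem
# The region is the last piece after splitting on `_`.
region = stem.split('_')[-1]

print(stem)
print(region)
```

Because `stem` drops the extension whatever its length, this version does not depend on the extension being exactly four characters long.
:::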