Gapminder data wrangling and visualisation using R

class: center, middle, inverse, title-slide

.title[
# Gapminder data wrangling and visualisation using R
]
.author[
### Zahid Asghar
]
.date[
### 18 July 2022
]

---
class: left inverse title-slide
background-image: url(75_yr_pk.png)
background-position:  50% 0%
background-size:  10%
##  Agenda: important verbs for handling data

####   View, glimpse, structure, head, tail

####  select() for column selection

####  filter() for data filtering

####  arrange() for data ordering

####  mutate() for creating derived columns

####  summarise() for calculating summary statistics

####  group_by() for grouped operations

---

I have discussed the [Gapminder dataset](https://cran.r-project.org/web/packages/gapminder/index.html) in my videos, and we shall use it throughout this training. It is available from CRAN, so make sure to install it first. The following code loads all the required packages:

```r
# install.packages("gapminder")  # run once if gapminder is not installed
library(tidyverse)
library(knitr)
library(kableExtra)
library(hrbrthemes)
library(viridis)
library(plotly)
library(gridExtra)
library(ggrepel)
options(knitr.table.format = "html")
```
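---

Before loading, it can help to check which packages are missing. This is a minimal sketch, assuming the package list matches the `library()` calls used in this training; `missing_packages()` is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "knitr", "kableExtra", "gapminder", "hrbrthemes",
            "viridis", "plotly", "gridExtra", "ggrepel")
to_install <- missing_packages(needed)

# Only report what is missing; installing is left to the user
if (length(to_install) > 0) {
  message("Run: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
}
```

The sketch prints the `install.packages()` call instead of running it, so nothing is installed without you noticing.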

---
# The dataset is provided in the gapminder library

```r
library(gapminder)

gapminder %>%
  filter(country == "Sweden") %>%
  mutate(gdpPercap = round(gdpPercap, 0), lifeExp = round(lifeExp, 2)) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
```

<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> country </th>
   <th style="text-align:left;"> continent </th>
   <th style="text-align:right;"> year </th>
   <th style="text-align:right;"> lifeExp </th>
   <th style="text-align:right;"> pop </th>
   <th style="text-align:right;"> gdpPercap </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Sweden </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 1952 </td>
   <td style="text-align:right;"> 71.86 </td>
   <td style="text-align:right;"> 7124673 </td>
   <td style="text-align:right;"> 8528 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sweden </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 1957 </td>
   <td style="text-align:right;"> 72.49 </td>
   <td style="text-align:right;"> 7363802 </td>
   <td style="text-align:right;"> 9912 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sweden </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 1962 </td>
   <td style="text-align:right;"> 73.37 </td>
   <td style="text-align:right;"> 7561588 </td>
   <td style="text-align:right;"> 12329 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sweden </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 1967 </td>
   <td style="text-align:right;"> 74.16 </td>
   <td style="text-align:right;"> 7867931 </td>
   <td style="text-align:right;"> 15258 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sweden </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 1972 </td>
   <td style="text-align:right;"> 74.72 </td>
   <td style="text-align:right;"> 8122293 </td>
   <td style="text-align:right;"> 17832 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sweden </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 1977 </td>
   <td style="text-align:right;"> 75.44 </td>
   <td style="text-align:right;"> 8251648 </td>
   <td style="text-align:right;"> 18856 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sweden </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 1982 </td>
   <td style="text-align:right;"> 76.42 </td>
   <td style="text-align:right;"> 8325260 </td>
   <td style="text-align:right;"> 20667 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sweden </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 1987 </td>
   <td style="text-align:right;"> 77.19 </td>
   <td style="text-align:right;"> 8421403 </td>
   <td style="text-align:right;"> 23587 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sweden </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 1992 </td>
   <td style="text-align:right;"> 78.16 </td>
   <td style="text-align:right;"> 8718867 </td>
   <td style="text-align:right;"> 23880 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sweden </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 1997 </td>
   <td style="text-align:right;"> 79.39 </td>
   <td style="text-align:right;"> 8897619 </td>
   <td style="text-align:right;"> 25267 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sweden </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 2002 </td>
   <td style="text-align:right;"> 80.04 </td>
   <td style="text-align:right;"> 8954175 </td>
   <td style="text-align:right;"> 29342 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sweden </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 80.88 </td>
   <td style="text-align:right;"> 9031088 </td>
   <td style="text-align:right;"> 33860 </td>
  </tr>
</tbody>
</table>

---
## Information in **gapminder** data

`View()` opens the data in a new spreadsheet-style window, while `glimpse()` lists the type of each variable (numeric, character, factor, ...) along with the total number of rows and columns.

```r
glimpse(gapminder) # 1,704 rows and 6 columns, plus the type of each variable
```

```
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
```

```r
#View(gapminder)    # This opens up full data in a new window
```
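---

The agenda also lists structure, head and tail. Here is a short sketch using base R functions, assuming "structure" refers to `str()`:

```r
library(gapminder)

str(gapminder)        # compact structure: type and first values of each column
head(gapminder, 3)    # first three rows
tail(gapminder, 3)    # last three rows
dim(gapminder)        # number of rows and columns
```

These functions only display information; none of them modifies `gapminder`.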

---

```r
summary(gapminder) 
```

```
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
```

---
# dplyr features

### `filter()` to keep selected observations
### `select()` to keep selected variables
### `arrange()` to reorder observations by a value
### `mutate()` to create new variables
### `summarize()` to create summary statistics
### `group_by()` for performing operations by group

---

# `select()`
## Column Selection
For example, some `PDHS` files have more than 5,000 columns, of which perhaps 40 or 50, or even fewer, are needed for your analysis. The __`select()`__ function from R's dplyr package selects the columns of interest.

```r
gapminder %>% select(country, pop, lifeExp)
```

```
## # A tibble: 1,704 × 3
##    country          pop lifeExp
##    <fct>          <int>   <dbl>
##  1 Afghanistan  8425333    28.8
##  2 Afghanistan  9240934    30.3
##  3 Afghanistan 10267083    32.0
##  4 Afghanistan 11537966    34.0
##  5 Afghanistan 13079460    36.1
##  6 Afghanistan 14880372    38.4
##  7 Afghanistan 12881816    39.9
##  8 Afghanistan 13867957    40.8
##  9 Afghanistan 16317921    41.7
## 10 Afghanistan 22227415    41.8
## # … with 1,694 more rows
```

---
If you want to keep most of the variables and drop one or two, proceed as follows:

```r
gapminder %>% select(-gdpPercap)
```

```
## # A tibble: 1,704 × 5
##    country     continent  year lifeExp      pop
##    <fct>       <fct>     <int>   <dbl>    <int>
##  1 Afghanistan Asia       1952    28.8  8425333
##  2 Afghanistan Asia       1957    30.3  9240934
##  3 Afghanistan Asia       1962    32.0 10267083
##  4 Afghanistan Asia       1967    34.0 11537966
##  5 Afghanistan Asia       1972    36.1 13079460
##  6 Afghanistan Asia       1977    38.4 14880372
##  7 Afghanistan Asia       1982    39.9 12881816
##  8 Afghanistan Asia       1987    40.8 13867957
##  9 Afghanistan Asia       1992    41.7 16317921
## 10 Afghanistan Asia       1997    41.8 22227415
## # … with 1,694 more rows
```
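---

`select()` also accepts ranges and helper functions, which is convenient when a file has many columns. Here is a small sketch using the gapminder columns; the object names are only illustrative:

```r
library(dplyr)
library(gapminder)

# A contiguous range of columns, by name
first_three <- gapminder %>% select(country:year)

# Columns whose names start with "c"
c_columns <- gapminder %>% select(starts_with("c"))

# Drop several columns at once
dropped_two <- gapminder %>% select(-c(pop, gdpPercap))

names(first_three)
names(c_columns)
names(dropped_two)
```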

---
## Data Filtering

### `filter()` function

```r
gapminder_07 <- gapminder %>% filter(year == 2007)
kbl(gapminder_07[1:10, ]) %>% kable_styling(fixed_thead = TRUE)
```

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> country </th>
   <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> continent </th>
   <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> year </th>
   <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> lifeExp </th>
   <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> pop </th>
   <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> gdpPercap </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Afghanistan </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 43.828 </td>
   <td style="text-align:right;"> 31889923 </td>
   <td style="text-align:right;"> 974.5803 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Albania </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 76.423 </td>
   <td style="text-align:right;"> 3600523 </td>
   <td style="text-align:right;"> 5937.0295 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Algeria </td>
   <td style="text-align:left;"> Africa </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 72.301 </td>
   <td style="text-align:right;"> 33333216 </td>
   <td style="text-align:right;"> 6223.3675 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Angola </td>
   <td style="text-align:left;"> Africa </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 42.731 </td>
   <td style="text-align:right;"> 12420476 </td>
   <td style="text-align:right;"> 4797.2313 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Argentina </td>
   <td style="text-align:left;"> Americas </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 75.320 </td>
   <td style="text-align:right;"> 40301927 </td>
   <td style="text-align:right;"> 12779.3796 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Australia </td>
   <td style="text-align:left;"> Oceania </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 81.235 </td>
   <td style="text-align:right;"> 20434176 </td>
   <td style="text-align:right;"> 34435.3674 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Austria </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 79.829 </td>
   <td style="text-align:right;"> 8199783 </td>
   <td style="text-align:right;"> 36126.4927 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bahrain </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 75.635 </td>
   <td style="text-align:right;"> 708573 </td>
   <td style="text-align:right;"> 29796.0483 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bangladesh </td>
   <td style="text-align:left;"> Asia </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 64.062 </td>
   <td style="text-align:right;"> 150448339 </td>
   <td style="text-align:right;"> 1391.2538 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Belgium </td>
   <td style="text-align:left;"> Europe </td>
   <td style="text-align:right;"> 2007 </td>
   <td style="text-align:right;"> 79.441 </td>
   <td style="text-align:right;"> 10392226 </td>
   <td style="text-align:right;"> 33692.6051 </td>
  </tr>
</tbody>
</table>

---
## Have we accidentally deleted all other rows? The answer is no.

### If you don't believe me, try entering `gapminder` at the console.

```r
gapminder %>% filter(year==2007)
```

```
## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # … with 132 more rows
```
---
## Filtering with respect to two variables

### One can apply multiple `filters`

```r
gapminder %>% filter(year==2007,country=="Sri Lanka")
```

```
## # A tibble: 1 × 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Sri Lanka Asia       2007    72.4 20378239     3970.
```

```r
gapminder %>% filter(year==2007, country=="Pakistan")
```

```
## # A tibble: 1 × 6
##   country  continent  year lifeExp       pop gdpPercap
##   <fct>    <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Pakistan Asia       2007    65.5 169270617     2606.
```

---

### Now we are selecting multiple countries for year 2007.

```r
gapminder %>% filter(year==2007, country %in% c("India", "Pakistan","Bangladesh", "Afghanistan", "Iran"))
```

```
## # A tibble: 5 × 6
##   country     continent  year lifeExp        pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>      <int>     <dbl>
## 1 Afghanistan Asia       2007    43.8   31889923      975.
## 2 Bangladesh  Asia       2007    64.1  150448339     1391.
## 3 India       Asia       2007    64.7 1110396331     2452.
## 4 Iran        Asia       2007    71.0   69453570    11606.
## 5 Pakistan    Asia       2007    65.5  169270617     2606.
```
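---

Conditions separated by a comma inside `filter()` are combined with AND. Here is a brief sketch of the equivalent `&` form and of `|` for OR; the object names are only illustrative:

```r
library(dplyr)
library(gapminder)

# A comma between conditions means AND, the same as `&`
with_comma <- gapminder %>% filter(year == 2007, country == "Pakistan")
with_and   <- gapminder %>% filter(year == 2007 & country == "Pakistan")
all.equal(as.data.frame(with_comma), as.data.frame(with_and))

# `|` means OR: rows for either country in 2007
either <- gapminder %>%
  filter(year == 2007, country == "India" | country == "Pakistan")
either
```

For more than two or three alternatives, `%in%` is usually easier to read than a chain of `|`.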

---
## Filtering data for South Asian countries

```r
# Bhutan and Maldives are not in gapminder, so only six countries (72 rows) match
gapminderSA <- gapminder %>%
  filter(country %in% c("Bangladesh", "India", "Pakistan", "Sri Lanka",
                        "Nepal", "Afghanistan", "Bhutan", "Maldives"))
gapminderSA
```

```
## # A tibble: 72 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 62 more rows
```
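---

The result has 72 rows, that is, 6 countries with 12 years each, although eight names were requested. `filter()` with `%in%` silently ignores names that do not exist in the data, so it is worth checking. A minimal sketch:

```r
library(gapminder)

south_asia <- c("Bangladesh", "India", "Pakistan", "Sri Lanka",
                "Nepal", "Afghanistan", "Bhutan", "Maldives")

# Names we asked for that do not appear among the country levels
not_found <- setdiff(south_asia, levels(gapminder$country))
not_found
```

The same check also catches spelling differences between your list and the dataset.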

---
## Sort data with `arrange()`

```r
gapminderSA %>% arrange(gdpPercap)
```

```
## # A tibble: 72 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Nepal       Asia       1952    36.2   9182536      546.
##  2 India       Asia       1952    37.4 372000000      547.
##  3 India       Asia       1957    40.2 409000000      590.
##  4 Nepal       Asia       1957    37.7   9682338      598.
##  5 Bangladesh  Asia       1972    45.3  70759295      630.
##  6 Afghanistan Asia       1997    41.8  22227415      635.
##  7 Afghanistan Asia       1992    41.7  16317921      649.
##  8 Nepal       Asia       1962    39.4  10332057      652.
##  9 India       Asia       1962    43.6 454000000      658.
## 10 Bangladesh  Asia       1977    46.9  80428306      660.
## # … with 62 more rows
```
---
### Note that by default `arrange()` sorts in ascending order. If we want to sort in descending order, we use the function `desc()`.

```r
gapminderSA %>% arrange(desc(gdpPercap))
```

```
## # A tibble: 72 × 6
##    country   continent  year lifeExp        pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Sri Lanka Asia       2007    72.4   20378239     3970.
##  2 Sri Lanka Asia       2002    70.8   19576783     3015.
##  3 Sri Lanka Asia       1997    70.5   18698655     2664.
##  4 Pakistan  Asia       2007    65.5  169270617     2606.
##  5 India     Asia       2007    64.7 1110396331     2452.
##  6 Sri Lanka Asia       1992    70.4   17587060     2154.
##  7 Pakistan  Asia       2002    63.6  153403524     2093.
##  8 Pakistan  Asia       1997    61.8  135564834     2049.
##  9 Pakistan  Asia       1992    60.8  120065004     1972.
## 10 Sri Lanka Asia       1987    69.0   16495304     1877.
## # … with 62 more rows
```

---
## Life Expectancy in South Asia in 2007

What are the lowest and highest life expectancies among South Asian countries?

```r
gapminderSA %>% filter(year==2007) %>%  arrange(lifeExp)
```

```
## # A tibble: 6 × 6
##   country     continent  year lifeExp        pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>      <int>     <dbl>
## 1 Afghanistan Asia       2007    43.8   31889923      975.
## 2 Nepal       Asia       2007    63.8   28901790     1091.
## 3 Bangladesh  Asia       2007    64.1  150448339     1391.
## 4 India       Asia       2007    64.7 1110396331     2452.
## 5 Pakistan    Asia       2007    65.5  169270617     2606.
## 6 Sri Lanka   Asia       2007    72.4   20378239     3970.
```

### What was it in 1952?
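---

To answer this, the same pipeline can be pointed at 1952. The sketch below rebuilds the South Asian subset so that it runs on its own; only the six countries present in gapminder are listed, and `sa_1952` is an illustrative name.

```r
library(dplyr)
library(gapminder)

south_asia <- c("Bangladesh", "India", "Pakistan", "Sri Lanka",
                "Nepal", "Afghanistan")

sa_1952 <- gapminder %>%
  filter(country %in% south_asia, year == 1952) %>%
  arrange(lifeExp)          # lowest life expectancy first
sa_1952
```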
---
## `mutate()` to change an existing variable or create a new one

```r
gapminderSA %>% mutate(pop=pop/1000000)
```

```
## # A tibble: 72 × 6
##    country     continent  year lifeExp   pop gdpPercap
##    <fct>       <fct>     <int>   <dbl> <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8.43      779.
##  2 Afghanistan Asia       1957    30.3  9.24      821.
##  3 Afghanistan Asia       1962    32.0 10.3       853.
##  4 Afghanistan Asia       1967    34.0 11.5       836.
##  5 Afghanistan Asia       1972    36.1 13.1       740.
##  6 Afghanistan Asia       1977    38.4 14.9       786.
##  7 Afghanistan Asia       1982    39.9 12.9       978.
##  8 Afghanistan Asia       1987    40.8 13.9       852.
##  9 Afghanistan Asia       1992    41.7 16.3       649.
## 10 Afghanistan Asia       1997    41.8 22.2       635.
## # … with 62 more rows
```

---
If we want to calculate GDP, we need to multiply gdpPercap by pop.

But wait! Didn't we just change `pop` so that it is expressed in millions?
No: we never stored the result of the previous command, we simply displayed it. As discussed above, the original dataset stays unchanged unless you overwrite it. With this in mind, we can create the `gdp` variable as follows:

```r
gapminderSA %>% mutate(gdp = pop * gdpPercap)
```

```
## # A tibble: 72 × 7
##    country     continent  year lifeExp      pop gdpPercap          gdp
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # … with 62 more rows
```
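---

To keep a derived column, the result has to be assigned to an object. The sketch below shows the difference, using the 2007 data; the object and column names are only illustrative.

```r
library(dplyr)
library(gapminder)

gapminder_2007 <- gapminder %>% filter(year == 2007)

# Without assignment the result is only printed; gapminder_2007 is unchanged
gapminder_2007 %>% mutate(gdp = pop * gdpPercap)
ncol(gapminder_2007)

# With assignment the new columns are kept in a new object
gapminder_2007_gdp <- gapminder_2007 %>%
  mutate(pop_millions = pop / 1e6,
         gdp = pop * gdpPercap)
ncol(gapminder_2007_gdp)
```

Assigning to a new name rather than overwriting the original keeps the raw data available for later steps.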

---
## How to calculate new variables
As mentioned above, `mutate()` is used to calculate a new variable. Here we calculate a new variable `gdp`, sort the data with `arrange()`, and keep the ten largest economies with `top_n(10)` to see whether higher `lifeExp` and higher `gdp` go together.

```r
gapminder %>% filter(year==2007) %>% 
  mutate(gdp=gdpPercap*pop) %>% 
  arrange(desc(gdp)) %>% 
  top_n(10, wt = gdp)  # rank explicitly by gdp instead of relying on the last column
```

```
## # A tibble: 10 × 7
##    country        continent  year lifeExp        pop gdpPercap     gdp
##    <fct>          <fct>     <int>   <dbl>      <int>     <dbl>   <dbl>
##  1 United States  Americas   2007    78.2  301139947    42952. 1.29e13
##  2 China          Asia       2007    73.0 1318683096     4959. 6.54e12
##  3 Japan          Asia       2007    82.6  127467972    31656. 4.04e12
##  4 India          Asia       2007    64.7 1110396331     2452. 2.72e12
##  5 Germany        Europe     2007    79.4   82400996    32170. 2.65e12
##  6 United Kingdom Europe     2007    79.4   60776238    33203. 2.02e12
##  7 France         Europe     2007    80.7   61083916    30470. 1.86e12
##  8 Brazil         Americas   2007    72.4  190010647     9066. 1.72e12
##  9 Italy          Europe     2007    80.5   58147733    28570. 1.66e12
## 10 Mexico         Americas   2007    76.2  108700891    11978. 1.30e12
```
---
`transmute()` keeps only the derived column. Let's use it in the example from above:

```r
gapminder %>% filter(year==2007) %>% 
  transmute(gdp=gdpPercap*pop) %>% 
  arrange(desc(gdp)) %>% 
  top_n(10, wt = gdp)
```

```
## # A tibble: 10 × 1
##        gdp
##      <dbl>
##  1 1.29e13
##  2 6.54e12
##  3 4.04e12
##  4 2.72e12
##  5 2.65e12
##  6 2.02e12
##  7 1.86e12
##  8 1.72e12
##  9 1.66e12
## 10 1.30e12
```
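---

A column of GDP values without country names is hard to read. `transmute()` can also carry existing columns along if you name them, as in this small sketch (`gdp_2007` is an illustrative name):

```r
library(dplyr)
library(gapminder)

gdp_2007 <- gapminder %>%
  filter(year == 2007) %>%
  transmute(country, gdp = gdpPercap * pop) %>%   # keep country next to gdp
  arrange(desc(gdp))
head(gdp_2007, 3)
```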
---
## Ordering

To arrange data by life expectancy, we use the `arrange()` function:

```r
gapminder %>% 
  select(country, year,lifeExp) %>% 
  filter(year==2007) %>% 
  arrange(lifeExp)
```

```
## # A tibble: 142 × 3
##    country                   year lifeExp
##    <fct>                    <int>   <dbl>
##  1 Swaziland                 2007    39.6
##  2 Mozambique                2007    42.1
##  3 Zambia                    2007    42.4
##  4 Sierra Leone              2007    42.6
##  5 Lesotho                   2007    42.6
##  6 Angola                    2007    42.7
##  7 Zimbabwe                  2007    43.5
##  8 Afghanistan               2007    43.8
##  9 Central African Republic  2007    44.7
## 10 Liberia                   2007    45.7
## # … with 132 more rows
```
---
To order from top to bottom, use `arrange(desc())` as follows:

```r
gapminder %>% 
  select(country, year,lifeExp) %>% 
  filter(year==2007) %>% 
  arrange(desc(lifeExp))
```

```
## # A tibble: 142 × 3
##    country           year lifeExp
##    <fct>            <int>   <dbl>
##  1 Japan             2007    82.6
##  2 Hong Kong, China  2007    82.2
##  3 Iceland           2007    81.8
##  4 Switzerland       2007    81.7
##  5 Australia         2007    81.2
##  6 Spain             2007    80.9
##  7 Sweden            2007    80.9
##  8 Israel            2007    80.7
##  9 France            2007    80.7
## 10 Canada            2007    80.7
## # … with 132 more rows
```
---
### Top 5

```r
gapminder %>% 
  select(country, year,lifeExp) %>% 
  filter(year==2007) %>% 
  arrange(desc(lifeExp)) %>% 
  top_n(5, wt = lifeExp)
```

```
## # A tibble: 5 × 3
##   country           year lifeExp
##   <fct>            <int>   <dbl>
## 1 Japan             2007    82.6
## 2 Hong Kong, China  2007    82.2
## 3 Iceland           2007    81.8
## 4 Switzerland       2007    81.7
## 5 Australia         2007    81.2
```
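---

In dplyr 1.0.0 and later, `top_n()` is superseded by `slice_max()` and `slice_min()`, which name the ranking variable explicitly. Here is a sketch of the same top 5, plus the bottom 5, assuming a dplyr version that provides these functions:

```r
library(dplyr)
library(gapminder)

life_2007 <- gapminder %>%
  filter(year == 2007) %>%
  select(country, year, lifeExp)

top5    <- life_2007 %>% slice_max(lifeExp, n = 5)   # five highest values
bottom5 <- life_2007 %>% slice_min(lifeExp, n = 5)   # five lowest values

top5
bottom5
```

By default both functions keep ties, so they can return more than `n` rows when values are equal.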
---
# Summarising data

Another feature of dplyr is `summarise()`, which collapses data into summary statistics:

```r
gapminder %>% filter(year==2007) %>% group_by(continent) %>% summarise(mean=mean(lifeExp),min=min(lifeExp),max=max(lifeExp))
```

```
## # A tibble: 5 × 4
##   continent  mean   min   max
##   <fct>     <dbl> <dbl> <dbl>
## 1 Africa     54.8  39.6  76.4
## 2 Americas   73.6  60.9  80.7
## 3 Asia       70.7  43.8  82.6
## 4 Europe     77.6  71.8  81.8
## 5 Oceania    80.7  80.2  81.2
```
---

```r
gapminder %>% 
  summarise(avglifeExp=mean(lifeExp))
```

```
## # A tibble: 1 × 1
##   avglifeExp
##        <dbl>
## 1       59.5
```
---
## Summarising data by groups

```r
gapminder %>%
  filter(year == 2007, continent == "Asia") %>%
  summarize(avgLifeExp = mean(lifeExp)) 
```

```
## # A tibble: 1 × 1
##   avgLifeExp
##        <dbl>
## 1       70.7
```

---

```r
gapminder %>% 
  group_by(continent) %>% 
  filter(year==2007) %>% 
  summarize(avglife=mean(lifeExp))
```

```
## # A tibble: 5 × 2
##   continent avglife
##   <fct>       <dbl>
## 1 Africa       54.8
## 2 Americas     73.6
## 3 Asia         70.7
## 4 Europe       77.6
## 5 Oceania      80.7
```
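---

A grouped summary can return several statistics at once, including the group size via `n()`. The sketch below uses illustrative column names:

```r
library(dplyr)
library(gapminder)

continent_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(countries   = n(),              # number of rows (countries) per group
            median_life = median(lifeExp),
            sd_life     = sd(lifeExp))
continent_2007
```

Reporting the group size alongside a mean is a useful habit: Oceania, for instance, has only two countries in this dataset.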
---
## `if_else()` along with `mutate()`

```r
gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(avgLifeExp = mean(lifeExp)) %>%
  mutate(over70 = if_else(avgLifeExp > 70, "Y", "N"))
```

```
## # A tibble: 5 × 3
##   continent avgLifeExp over70
##   <fct>          <dbl> <chr> 
## 1 Africa          54.8 N     
## 2 Americas        73.6 Y     
## 3 Asia            70.7 Y     
## 4 Europe          77.6 Y     
## 5 Oceania         80.7 Y
```
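---

`if_else()` handles a two-way split. For three or more categories, dplyr's `case_when()` is a common alternative. In this sketch the cut-offs of 60 and 75 years and the labels are arbitrary choices made for illustration:

```r
library(dplyr)
library(gapminder)

life_bands <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(avgLifeExp = mean(lifeExp)) %>%
  mutate(band = case_when(
    avgLifeExp < 60 ~ "low",      # conditions are checked in order
    avgLifeExp < 75 ~ "medium",
    TRUE            ~ "high"      # everything else
  ))
life_bands
```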
---
## Total Population by Continent in 2007

```r
gapminder %>% 
  filter(year==2007) %>% 
  group_by(continent) %>% 
  summarize(tot_pop=sum(pop)) 
```

```
## # A tibble: 5 × 2
##   continent    tot_pop
##   <fct>          <dbl>
## 1 Africa     929539692
## 2 Americas   898871184
## 3 Asia      3811953827
## 4 Europe     586098529
## 5 Oceania     24549947
```
---
## Percentiles

It is generally assumed that the higher the GDP, the higher the life expectancy. To explore this assumption, let's calculate percentiles of `lifeExp`. A country's percentile indicates roughly what share of countries have a lower life expectancy.

```r
gapminder %>% select(country,year, lifeExp, gdpPercap) %>% 
  filter(year == 2007) %>%
  mutate(percentile = ntile(lifeExp, 100)) %>%
  arrange(desc(gdpPercap))
```

```
## # A tibble: 142 × 5
##    country           year lifeExp gdpPercap percentile
##    <fct>            <int>   <dbl>     <dbl>      <int>
##  1 Norway            2007    80.2    49357.         88
##  2 Kuwait            2007    77.6    47307.         68
##  3 Singapore         2007    80.0    47143.         87
##  4 United States     2007    78.2    42952.         71
##  5 Ireland           2007    78.9    40676.         79
##  6 Hong Kong, China  2007    82.2    39725.         99
##  7 Switzerland       2007    81.7    37506.         97
##  8 Netherlands       2007    79.8    36798.         85
##  9 Canada            2007    80.7    36319.         91
## 10 Iceland           2007    81.8    36181.         98
## # … with 132 more rows
```

Notice that the ten countries with the highest GDP per capita all sit well above the 60th percentile of life expectancy.

Before you conclude, let's look at the bottom end.
---
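The ten countries with the lowest GDP per capita all fall at or below the 21st percentile of life expectancy, so it makes sense that the higher the GDP, the higher the life expectancy. This is not a formal test, but the exploratory analysis is quite suggestive here.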

```r
gapminder %>% select(country,year, lifeExp, gdpPercap) %>% 
  filter(year == 2007) %>%
  mutate(percentile = ntile(lifeExp, 100)) %>%
  arrange(gdpPercap)
```

```
## # A tibble: 142 × 5
##    country                   year lifeExp gdpPercap percentile
##    <fct>                    <int>   <dbl>     <dbl>      <int>
##  1 Congo, Dem. Rep.          2007    46.5      278.          7
##  2 Liberia                   2007    45.7      415.          5
##  3 Burundi                   2007    49.6      430.         10
##  4 Zimbabwe                  2007    43.5      470.          4
##  5 Guinea-Bissau             2007    46.4      579.          6
##  6 Niger                     2007    56.9      620.         18
##  7 Eritrea                   2007    58.0      641.         19
##  8 Ethiopia                  2007    52.9      691.         14
##  9 Central African Republic  2007    44.7      706.          5
## 10 Gambia                    2007    59.4      753.         21
## # … with 132 more rows
```
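---

A single number can complement the two tables. The sketch below computes a Spearman rank correlation between GDP per capita and life expectancy in 2007. This is still exploratory rather than a formal test, and the choice of Spearman's method is an assumption made here because GDP per capita is strongly skewed.

```r
library(dplyr)
library(gapminder)

gapminder_2007 <- gapminder %>% filter(year == 2007)

# Spearman rank correlation: does the ordering by GDP per capita
# tend to agree with the ordering by life expectancy?
rank_cor <- cor(gapminder_2007$gdpPercap, gapminder_2007$lifeExp,
                method = "spearman")
rank_cor
```

A value close to 1 would mean the two rankings largely agree; a value near 0 would mean they are unrelated.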
---
# Advanced Analysis

The filtering done in the introductory analysis may seem difficult if you are not yet familiar with these basic verbs. But once you have worked with dplyr for some time, there is nothing very advanced or difficult about it.

For example, let's say you have to find the top 10 countries above the 90th percentile of life expectancy in 2007. You can reuse some of the logic from the previous sections, but answering this question alone requires several filtering and subsetting steps:

```r
gapminder %>% filter(year==2007) %>% 
  mutate(percentile=ntile(lifeExp,100)) %>% 
  filter(percentile>90) %>% 
  arrange(desc(percentile)) %>% 
  top_n(10,wt=percentile) %>% 
  select(country,continent,lifeExp,gdpPercap)
```

```
## # A tibble: 10 × 4
##    country          continent lifeExp gdpPercap
##    <fct>            <fct>       <dbl>     <dbl>
##  1 Japan            Asia         82.6    31656.
##  2 Hong Kong, China Asia         82.2    39725.
##  3 Iceland          Europe       81.8    36181.
##  4 Switzerland      Europe       81.7    37506.
##  5 Australia        Oceania      81.2    34435.
##  6 Spain            Europe       80.9    28821.
##  7 Sweden           Europe       80.9    33860.
##  8 Israel           Asia         80.7    25523.
##  9 France           Europe       80.7    30470.
## 10 Canada           Americas     80.7    36319.
```
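---

A note on `ntile()`: it splits the rows into buckets of roughly equal size, so with 142 countries and 100 buckets each bucket holds one or two countries and the value is only an approximate percentile. dplyr's `percent_rank()` is an alternative that rescales ranks to the range 0 to 1. Here is a toy sketch on a made-up vector:

```r
library(dplyr)

scores <- c(10, 20, 30, 40, 50, 60)   # made-up values for illustration

ntile(scores, 3)        # three buckets of two values each
percent_rank(scores)    # rank rescaled to the range 0 to 1
```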
---
If you are interested in the bottom 10 (the countries with the lowest life expectancy), use `top_n()` with `-10`.

```r
gapminder %>% filter(year==2007) %>% 
  mutate(percentile=ntile(lifeExp,100)) %>% 
  filter(percentile<10) %>% 
  arrange(percentile) %>% 
  top_n(-10,wt=percentile) %>% 
  select(country,continent,lifeExp,gdpPercap)
```

```
## # A tibble: 10 × 4
##    country                  continent lifeExp gdpPercap
##    <fct>                    <fct>       <dbl>     <dbl>
##  1 Mozambique               Africa       42.1      824.
##  2 Swaziland                Africa       39.6     4513.
##  3 Sierra Leone             Africa       42.6      863.
##  4 Zambia                   Africa       42.4     1271.
##  5 Angola                   Africa       42.7     4797.
##  6 Lesotho                  Africa       42.6     1569.
##  7 Afghanistan              Asia         43.8      975.
##  8 Zimbabwe                 Africa       43.5      470.
##  9 Central African Republic Africa       44.7      706.
## 10 Liberia                  Africa       45.7      415.
```
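
Recent versions of dplyr mark `top_n()` as superseded in favour of `slice_max()` and `slice_min()`. The following is a minimal sketch of the same two queries, assuming dplyr 1.0.0 or later and ranking directly on `lifeExp` rather than on the `ntile()` percentile. The names `gap2007`, `top10` and `bottom10` are only illustrative.

```r
library(dplyr)
library(gapminder)

gap2007 <- gapminder %>% filter(year == 2007)

# Ten countries with the highest life expectancy in 2007
top10 <- gap2007 %>%
  slice_max(lifeExp, n = 10, with_ties = FALSE) %>%
  select(country, continent, lifeExp, gdpPercap)

# Ten countries with the lowest life expectancy in 2007
bottom10 <- gap2007 %>%
  slice_min(lifeExp, n = 10, with_ties = FALSE) %>%
  select(country, continent, lifeExp, gdpPercap)

top10
bottom10
```

Because the ranking is on `lifeExp` itself, the rows come back sorted by life expectancy, so the order of the bottom 10 may differ from the percentile-based table, where countries sharing a percentile are not sorted among themselves.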

---
# Visualizing data to get data insight
Visualizing data is one of the most important ways of gaining insight, and a good plot may tell you more than a complicated model. Visualizing large data sets used to be difficult, so researchers relied on mathematical and econometric/regression models instead. `ggplot2`, which is part of the `tidyverse`, is probably one of the best tools for data visualization in `R`. In the following sections we are going to visualize the `gapminder` data.

A statistical graphic is a mapping of variables to `aes`thetic attributes of `geom`etric objects.
---
## 3 Essential components of `ggplot2`

-   data: the dataset containing the variables of interest
-   geom: the geometric object in question (line, point, bar)
-   aes:  the aesthetic attributes of an object (x/y position, color, shape, size)
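
As a rough illustration of how the three components combine, the sketch below declares the data and the aesthetic mapping once and then swaps only the geom. The object names `base`, `p_points` and `p_box` are made up for this example.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gapminder2007 <- gapminder %>% filter(year == 2007)

# data + aes: declared once and shared by every layer added later
base <- ggplot(data = gapminder2007,
               mapping = aes(x = continent, y = lifeExp))

# geom: decides how the same mapping is drawn
p_points <- base + geom_point()    # one point per country
p_box    <- base + geom_boxplot()  # one box per continent

p_points
p_box
```

The same data and mapping give one mark per country with `geom_point()` and one box per continent with `geom_boxplot()`.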

---
## Scatter plot

```r
gapminder2007 <- gapminder %>% filter(year == 2007)
p1 <- ggplot(data = gapminder2007,
             mapping = aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point()
p1 + facet_wrap(~continent)
```

```r
p1 + labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are countries in 2007",
       caption = "Source: Gapminder.")
```

---
## Bubbleplot

---
If you just want to show the relationship between GDP per capita and life expectancy, you've probably done most of the work by now. However, it is good practice to highlight a few interesting dots in the chart to give the reader more insight:
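
One possible way to highlight a few dots is to label them with `geom_text_repel()` from the `ggrepel` package. This is only a sketch: the choice of the five most populous countries and the names `highlight` and `p_highlight` are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)
library(ggrepel)
library(gapminder)

gapminder2007 <- gapminder %>% filter(year == 2007)

# Hypothetical selection: the five most populous countries in 2007
highlight <- gapminder2007 %>%
  slice_max(pop, n = 5, with_ties = FALSE)

p_highlight <- ggplot(gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent), alpha = 0.5) +
  # Label only the highlighted rows; ggrepel keeps labels from overlapping
  geom_text_repel(data = highlight, aes(label = country)) +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP per capita", y = "Life expectancy in years")

p_highlight
```

Any other subset (for example a handful of countries picked by name with `filter()`) can be passed as `data` to the label layer in the same way.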

---

```r
##This is a table of data about a large number of countries, each observed over several years. Let's make a scatterplot with it.
P<-ggplot(data=gapminder,mapping = aes(x=gdpPercap,y=lifeExp))

P+geom_point()+geom_smooth()
```

```r
P+geom_point()+geom_smooth(method = "lm")
```

```r
P+geom_point()+geom_smooth(method = "gam")+scale_x_log10()
```

```r
P+geom_point()+geom_smooth(method = "gam")+scale_x_log10(labels=scales::dollar)
```

```r
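# Note: color = "purple" inside aes() maps a constant to the colour scale, so points are not purple.
# To set a fixed colour, use geom_point(color = "purple") outside aes() instead.
P<-ggplot(data=gapminder,mapping = aes(gdpPercap,y=lifeExp,color="purple"))
P+geom_point()+geom_smooth(method = "loess")+scale_x_log10()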
```

---

# With a proper title

```r
P<-ggplot(data=gapminder,mapping = aes(gdpPercap,y=lifeExp))
P+geom_point(alpha=0.3)+
  geom_smooth(method = "gam")+
  scale_x_log10(labels=scales::dollar)+
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.")
```

---
<img src="index_files/figure-html/unnamed-chunk-38-1.png" width="100%" />

---

```r
p + geom_point(mapping = aes(color = log(pop))) +
  scale_x_log10()  
```

<img src="index_files/figure-html/unnamed-chunk-40-1.png" width="100%" />
---

## Comparison of 2007 vs 1952 by continent
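
One way such a comparison could be set up is sketched below, assuming median life expectancy is the quantity of interest. The names `comparison` and `p_compare` are illustrative only.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Median life expectancy per continent in the two years being compared
comparison <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(continent, year) %>%
  summarise(median_life_exp = median(lifeExp), .groups = "drop")

comparison

# Side-by-side bars: one pair of bars per continent
p_compare <- ggplot(comparison,
                    aes(x = continent, y = median_life_exp, fill = factor(year))) +
  geom_col(position = "dodge") +
  labs(x = "", y = "Median life expectancy", fill = "Year")

p_compare
```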

---

```r
p1 + labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are countries in 2007",
       caption = "Source: Gapminder.")
```

---

```r
gapminder %>%  
  select(-pop, -gdpPercap) %>%  
  filter(year == 2007) %>%  
  group_by(continent) %>%  
  summarise(mean_life_exp =  
              mean(lifeExp),median_life_exp=median(lifeExp)) 
```

```
## # A tibble: 5 × 3
##   continent mean_life_exp median_life_exp
##   <fct>             <dbl>           <dbl>
## 1 Africa             54.8            52.9
## 2 Americas           73.6            72.9
## 3 Asia               70.7            72.4
## 4 Europe             77.6            78.6
## 5 Oceania            80.7            80.7
```
---

```r
gapminder %>%  
  select(-pop, -gdpPercap) %>%  
  filter(year == 2007) %>%  
  group_by(continent) %>%  
  summarise(median_life_exp =  
              median(lifeExp)) %>%  
  ggplot() +  
  aes(x = continent) +  
  aes(y = median_life_exp) +  
  geom_point(color = "blue", size = 3,  
             alpha = .4) +  
  scale_y_continuous(limits = c(0,85)) +  
  labs(title = "Median life expectancy by continent in 2007") +  
  labs(subtitle = "Data Source: Gapminder package in R") +  
  labs(x = "") +  
  labs(y = "Median life expectancy") +  
  labs(caption = "Zahid Asghar for 'QM4SSH'") +  
  theme_minimal()
```

---
.pull-left[

```r
gapminder %>%  
  filter(year == 1997) %>%  
  select(country, continent, lifeExp) %>%  
  mutate(  
    life_cats =  
      case_when(lifeExp >= 70 ~ "70+",  
                lifeExp < 70 ~ "<70")) %>%  
  ggplot() +  
  aes(x = continent, y = life_cats) +  
  geom_jitter(width = .25, height = .25)+  
  aes(col = continent) +  
  scale_color_discrete(guide = "none") +  
  theme_dark() +  
  labs(x = "", y = "") +  
  labs(title = "Life expectancy beyond 70 in 1997")
```
]

---

```r
gapminder %>%  
  mutate(gdp_billions =  
           gdpPercap *  
           pop/1000000000) %>%  
  ggplot() +  
  aes(x = year) +  
  aes(y = gdp_billions) +  
  geom_line() +  
  aes(group = country) +  
  scale_y_log10() +  
  aes(col = continent) +  
  facet_wrap( ~ continent) +  
  scale_color_discrete(guide = "none") +  
  theme_minimal()
```

---

```r
gapminder %>%  
  filter(year == 2007) %>%  
  ggplot() +  
  aes(x = continent, y = lifeExp) +  
  geom_boxplot() +  
  geom_jitter(height = 0, width = .2) +  
  stat_summary(fun = mean,
               geom = "point",
               col = "goldenrod3",
               size = 5)
```

---
## Pakistan

```r
gapminder %>%  
  filter(country == "Pakistan") %>%  
  ggplot() +  
  aes(x = year, y = lifeExp) +  
  geom_point() +  
  geom_line() +  
  aes(alpha = year) +  
  aes(col = year) +  
  scale_color_viridis_c() +  
  theme_classic()
```

---

```r
gapminder %>%  
  filter(year == 2007) %>%  
  ggplot() +  
  aes(x = gdpPercap) +  
  aes(y = lifeExp) +  
  geom_point(alpha = .5) +  
  geom_rug(size = 1) +  
  aes(col = continent) +  
  aes(col = lifeExp) +  
  scale_x_log10() +  
  aes(size = gdpPercap) +  
  aes(size = pop) +  
  geom_point(col = "darkgreen", size = 1) +  
  facet_wrap(~ continent)
```

---