class: center, middle, inverse, title-slide .title[ # Gapminder data wrangling and visualisation using R ] .author[ ### Zahid Asghar ] .date[ ### 18 July 2022 ] ---
---
class: left inverse title-slide
background-image: url(75_yr_pk.png)
background-position: 50% 0%
background-size: 10%

## Agenda: 6 important verbs for handling data
--
#### View, glimpse, structure, head, tail
--
#### select() for column selection
--
#### filter() for data filtering
--
#### arrange() Data Ordering
--
#### mutate() Creating Derived Columns
--
#### summarise() Calculating Summary Statistics
--
#### group_by()

---

I have discussed the [Gapminder dataset](https://cran.r-project.org/web/packages/gapminder/index.html) in my videos, and we shall use it throughout this training. It is available from CRAN, so make sure to install it first. Here is how to load all the required packages:

```r
library(tidyverse)
library(knitr)
library(kableExtra)
#install.packages("gapminder")
library(hrbrthemes)
library(viridis)
options(knitr.table.format = "html")
library(plotly)
library(gridExtra)
library(ggrepel)
```

---

# The dataset is provided in the gapminder library

```r
library(gapminder)
gapminder %>%
  filter(country == "Sweden") %>%
  mutate(gdpPercap = round(gdpPercap, 0), lifeExp = round(lifeExp, 2)) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
```

<table class="table table-striped" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> lifeExp </th> <th style="text-align:right;"> pop </th> <th style="text-align:right;"> gdpPercap </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Sweden </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1952 </td> <td style="text-align:right;"> 71.86 </td> <td style="text-align:right;"> 7124673 </td> <td style="text-align:right;"> 8528 </td> </tr> <tr> <td style="text-align:left;"> Sweden </td> <td style="text-align:left;"> Europe </td> <td
style="text-align:right;"> 1957 </td> <td style="text-align:right;"> 72.49 </td> <td style="text-align:right;"> 7363802 </td> <td style="text-align:right;"> 9912 </td> </tr> <tr> <td style="text-align:left;"> Sweden </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1962 </td> <td style="text-align:right;"> 73.37 </td> <td style="text-align:right;"> 7561588 </td> <td style="text-align:right;"> 12329 </td> </tr> <tr> <td style="text-align:left;"> Sweden </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1967 </td> <td style="text-align:right;"> 74.16 </td> <td style="text-align:right;"> 7867931 </td> <td style="text-align:right;"> 15258 </td> </tr> <tr> <td style="text-align:left;"> Sweden </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1972 </td> <td style="text-align:right;"> 74.72 </td> <td style="text-align:right;"> 8122293 </td> <td style="text-align:right;"> 17832 </td> </tr> <tr> <td style="text-align:left;"> Sweden </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1977 </td> <td style="text-align:right;"> 75.44 </td> <td style="text-align:right;"> 8251648 </td> <td style="text-align:right;"> 18856 </td> </tr> <tr> <td style="text-align:left;"> Sweden </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1982 </td> <td style="text-align:right;"> 76.42 </td> <td style="text-align:right;"> 8325260 </td> <td style="text-align:right;"> 20667 </td> </tr> <tr> <td style="text-align:left;"> Sweden </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1987 </td> <td style="text-align:right;"> 77.19 </td> <td style="text-align:right;"> 8421403 </td> <td style="text-align:right;"> 23587 </td> </tr> <tr> <td style="text-align:left;"> Sweden </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:right;"> 78.16 </td> <td 
style="text-align:right;"> 8718867 </td> <td style="text-align:right;"> 23880 </td> </tr> <tr> <td style="text-align:left;"> Sweden </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1997 </td> <td style="text-align:right;"> 79.39 </td> <td style="text-align:right;"> 8897619 </td> <td style="text-align:right;"> 25267 </td> </tr> <tr> <td style="text-align:left;"> Sweden </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 80.04 </td> <td style="text-align:right;"> 8954175 </td> <td style="text-align:right;"> 29342 </td> </tr> <tr> <td style="text-align:left;"> Sweden </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 80.88 </td> <td style="text-align:right;"> 9031088 </td> <td style="text-align:right;"> 33860 </td> </tr> </tbody> </table>

---

## Information in **gapminder** data

`View()` opens the data in a spreadsheet-style viewer, while `glimpse()` lists the type of each variable (numeric, character, factor, ...) along with the total number of rows and columns.

```r
glimpse(gapminder) # 1,704 rows and 6 columns, plus the type of each variable
```

```
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
```

```r
#View(gapminder) # This opens the full data in a new window
```

---

```r
summary(gapminder)
```

```
##         country        continent        year         lifeExp
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85
##  Australia  :  12                  Max.   :2007   Max.   :82.60
##  (Other)    :1632
##       pop              gdpPercap
##  Min.   :6.001e+04   Min.   :   241.2
##  1st Qu.:2.794e+06   1st Qu.:  1202.1
##  Median :7.024e+06   Median :  3531.8
##  Mean   :2.960e+07   Mean   :  7215.3
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5
##  Max.   :1.319e+09   Max.   :113523.1
##
```

---

# dplyr features

### `filter()` to keep selected observations

### `select()` to keep selected variables

### `arrange()` to reorder observations by a value

### `mutate()` to create new variables

### `summarize()` to create summary statistics

### `group_by()` for performing operations by group

---

# `select()`

## Column Selection

For example, `PDHS` files can have more than 5,000 columns, of which perhaps 40 or 50, or even fewer, are needed for your analysis. The __`select()`__ function from R's dplyr package picks out the columns of interest. Note that R is case-sensitive, so the function is `select()`, not `Select()`.

```r
gapminder %>%
  select(country, pop, lifeExp)
```

```
## # A tibble: 1,704 × 3
##    country          pop lifeExp
##    <fct>          <int>   <dbl>
##  1 Afghanistan  8425333    28.8
##  2 Afghanistan  9240934    30.3
##  3 Afghanistan 10267083    32.0
##  4 Afghanistan 11537966    34.0
##  5 Afghanistan 13079460    36.1
##  6 Afghanistan 14880372    38.4
##  7 Afghanistan 12881816    39.9
##  8 Afghanistan 13867957    40.8
##  9 Afghanistan 16317921    41.7
## 10 Afghanistan 22227415    41.8
## # … with 1,694 more rows
```

---

To keep most of the variables and drop just one or two, prefix the unwanted columns with a minus sign:

```r
gapminder %>%
  select(-gdpPercap)
```

```
## # A tibble: 1,704 × 5
##    country     continent  year lifeExp      pop
##    <fct>       <fct>     <int>   <dbl>    <int>
##  1 Afghanistan Asia       1952    28.8  8425333
##  2 Afghanistan Asia       1957    30.3  9240934
##  3 Afghanistan Asia       1962    32.0 10267083
##  4 Afghanistan Asia       1967    34.0 11537966
##  5 Afghanistan Asia       1972    36.1 13079460
##  6 Afghanistan Asia       1977    38.4 14880372
##  7 Afghanistan Asia       1982    39.9 12881816
##  8 Afghanistan Asia       1987    40.8 13867957
##  9 Afghanistan Asia       1992    41.7 16317921
## 10 Afghanistan Asia       1997    41.8 22227415
## # … with 1,694 more rows
```

---

## Data Filtering

### `filter()` function

```r
gapminder_07 <- gapminder %>%
  filter(year == 2007)
kbl(gapminder_07[1:10, ]) %>%
  kable_styling(fixed_thead = TRUE)
```

<table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> country </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> continent </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> year </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> lifeExp </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> pop </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> gdpPercap </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 43.828 </td> <td style="text-align:right;"> 31889923 </td> <td style="text-align:right;"> 974.5803 </td> </tr> <tr> <td style="text-align:left;"> Albania </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 76.423 </td> <td style="text-align:right;"> 3600523 </td> <td style="text-align:right;"> 5937.0295 </td> </tr> <tr> <td style="text-align:left;"> Algeria </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 72.301 </td> <td style="text-align:right;"> 33333216 </td> <td style="text-align:right;"> 6223.3675 </td> </tr> <tr> <td style="text-align:left;"> Angola </td> <td style="text-align:left;"> Africa </td> <td
style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 42.731 </td> <td style="text-align:right;"> 12420476 </td> <td style="text-align:right;"> 4797.2313 </td> </tr> <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 75.320 </td> <td style="text-align:right;"> 40301927 </td> <td style="text-align:right;"> 12779.3796 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> Oceania </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 81.235 </td> <td style="text-align:right;"> 20434176 </td> <td style="text-align:right;"> 34435.3674 </td> </tr> <tr> <td style="text-align:left;"> Austria </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 79.829 </td> <td style="text-align:right;"> 8199783 </td> <td style="text-align:right;"> 36126.4927 </td> </tr> <tr> <td style="text-align:left;"> Bahrain </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 75.635 </td> <td style="text-align:right;"> 708573 </td> <td style="text-align:right;"> 29796.0483 </td> </tr> <tr> <td style="text-align:left;"> Bangladesh </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 64.062 </td> <td style="text-align:right;"> 150448339 </td> <td style="text-align:right;"> 1391.2538 </td> </tr> <tr> <td style="text-align:left;"> Belgium </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 79.441 </td> <td style="text-align:right;"> 10392226 </td> <td style="text-align:right;"> 33692.6051 </td> </tr> </tbody> </table>

---

## Have we accidentally deleted all other rows? The answer is no.

### Nope: if you don't believe me, try entering `gapminder` at the console.

```r
gapminder %>%
  filter(year == 2007)
```

```
## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # … with 132 more rows
```

---

## Filtering with respect to two variables

### One can supply multiple conditions to `filter()`

```r
gapminder %>%
  filter(year == 2007, country == "Sri Lanka")
```

```
## # A tibble: 1 × 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Sri Lanka Asia       2007    72.4 20378239     3970.
```

```r
gapminder %>%
  filter(year == 2007, country == "Pakistan")
```

```
## # A tibble: 1 × 6
##   country  continent  year lifeExp       pop gdpPercap
##   <fct>    <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Pakistan Asia       2007    65.5 169270617     2606.
```

---

### Now we select multiple countries for the year 2007.

```r
gapminder %>%
  filter(year == 2007,
         country %in% c("India", "Pakistan", "Bangladesh", "Afghanistan", "Iran"))
```

```
## # A tibble: 5 × 6
##   country     continent  year lifeExp        pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>      <int>     <dbl>
## 1 Afghanistan Asia       2007    43.8   31889923      975.
## 2 Bangladesh  Asia       2007    64.1  150448339     1391.
## 3 India       Asia       2007    64.7 1110396331     2452.
## 4 Iran        Asia       2007    71.0   69453570    11606.
## 5 Pakistan    Asia       2007    65.5  169270617     2606.
```

---

## Filtering data for South Asian countries

```r
# Bhutan and Maldives are not in gapminder, so only 6 countries (72 rows) match
gapminderSA <- gapminder %>%
  filter(country %in% c("Bangladesh", "India", "Pakistan", "Sri Lanka",
                        "Nepal", "Afghanistan", "Bhutan", "Maldives"))
gapminderSA
```

```
## # A tibble: 72 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 62 more rows
```

---

## Sort data with `arrange()`

```r
gapminderSA %>%
  arrange(gdpPercap)
```

```
## # A tibble: 72 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Nepal       Asia       1952    36.2   9182536      546.
##  2 India       Asia       1952    37.4 372000000      547.
##  3 India       Asia       1957    40.2 409000000      590.
##  4 Nepal       Asia       1957    37.7   9682338      598.
##  5 Bangladesh  Asia       1972    45.3  70759295      630.
##  6 Afghanistan Asia       1997    41.8  22227415      635.
##  7 Afghanistan Asia       1992    41.7  16317921      649.
##  8 Nepal       Asia       1962    39.4  10332057      652.
##  9 India       Asia       1962    43.6 454000000      658.
## 10 Bangladesh  Asia       1977    46.9  80428306      660.
## # … with 62 more rows
```

---

### Note that by default `arrange()` sorts in ascending order. To sort in descending order, wrap the column in `desc()`.

```r
gapminderSA %>%
  arrange(desc(gdpPercap))
```

```
## # A tibble: 72 × 6
##    country   continent  year lifeExp        pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Sri Lanka Asia       2007    72.4   20378239     3970.
##  2 Sri Lanka Asia       2002    70.8   19576783     3015.
##  3 Sri Lanka Asia       1997    70.5   18698655     2664.
##  4 Pakistan  Asia       2007    65.5  169270617     2606.
##  5 India     Asia       2007    64.7 1110396331     2452.
##  6 Sri Lanka Asia       1992    70.4   17587060     2154.
##  7 Pakistan  Asia       2002    63.6  153403524     2093.
##  8 Pakistan  Asia       1997    61.8  135564834     2049.
##  9 Pakistan  Asia       1992    60.8  120065004     1972.
## 10 Sri Lanka Asia       1987    69.0   16495304     1877.
## # … with 62 more rows
```

---

## Life Expectancy in South Asia in 2007

What are the lowest and highest life expectancies among South Asian countries?

```r
gapminderSA %>%
  filter(year == 2007) %>%
  arrange(lifeExp)
```

```
## # A tibble: 6 × 6
##   country     continent  year lifeExp        pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>      <int>     <dbl>
## 1 Afghanistan Asia       2007    43.8   31889923      975.
## 2 Nepal       Asia       2007    63.8   28901790     1091.
## 3 Bangladesh  Asia       2007    64.1  150448339     1391.
## 4 India       Asia       2007    64.7 1110396331     2452.
## 5 Pakistan    Asia       2007    65.5  169270617     2606.
## 6 Sri Lanka   Asia       2007    72.4   20378239     3970.
```

### What was it in 1952?

---

## `mutate()` to change an existing variable or create a new one

```r
gapminderSA %>%
  mutate(pop = pop / 1000000)
```

```
## # A tibble: 72 × 6
##    country     continent  year lifeExp   pop gdpPercap
##    <fct>       <fct>     <int>   <dbl> <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8.43      779.
##  2 Afghanistan Asia       1957    30.3  9.24      821.
##  3 Afghanistan Asia       1962    32.0 10.3       853.
##  4 Afghanistan Asia       1967    34.0 11.5       836.
##  5 Afghanistan Asia       1972    36.1 13.1       740.
##  6 Afghanistan Asia       1977    38.4 14.9       786.
##  7 Afghanistan Asia       1982    39.9 12.9       978.
##  8 Afghanistan Asia       1987    40.8 13.9       852.
##  9 Afghanistan Asia       1992    41.7 16.3       649.
## 10 Afghanistan Asia       1997    41.8 22.2       635.
## # … with 62 more rows
```

---

If we want to calculate GDP, we need to multiply `gdpPercap` by `pop`. But wait! Didn't we just change `pop` so that it is expressed in millions? No: we never stored the result of the previous command, we simply displayed it. As discussed above, unless you overwrite it, the original dataset remains unchanged.
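A minimal sketch of this point, using three 2007 population figures copied from the output shown earlier. The object names `sa_2007` and `sa_2007_millions` are made up for this illustration; the idea is only that a pipeline displays its result and keeps nothing unless you assign it with `<-`.

```r
library(dplyr)

# Hypothetical toy table: populations copied from the 2007 rows shown earlier
sa_2007 <- tibble(
  country = c("Afghanistan", "Pakistan", "Sri Lanka"),
  pop     = c(31889923, 169270617, 20378239)
)

# Displayed only: the rescaled column is printed and then discarded
sa_2007 %>% mutate(pop = pop / 1e6)

# sa_2007 itself still holds the raw head counts
sa_2007$pop[1]

# Stored: the assignment is what keeps the rescaled result
sa_2007_millions <- sa_2007 %>% mutate(pop = pop / 1e6)
sa_2007_millions$pop[1]
```

Storing the result under a new name, rather than overwriting `sa_2007`, keeps the raw values available for later steps.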
With this in mind, we can create the `gdp` variable as follows:

```r
gapminderSA %>%
  mutate(gdp = pop * gdpPercap)
```

```
## # A tibble: 72 × 7
##    country     continent  year lifeExp      pop gdpPercap          gdp
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # … with 62 more rows
```

---

## How to calculate new variables

As mentioned above, `mutate()` is used to calculate new variables. Here we compute a new variable `gdp`, sort the data with `arrange()`, and keep the 10 largest economies with `top_n(10)` to see whether higher `lifeExp` and higher `gdp` go together.

```r
gapminder %>%
  filter(year == 2007) %>%
  mutate(gdp = gdpPercap * pop) %>%
  arrange(desc(gdp)) %>%
  top_n(10) # with no `wt`, top_n() ranks by the last column, here gdp
```

```
## # A tibble: 10 × 7
##    country        continent  year lifeExp        pop gdpPercap     gdp
##    <fct>          <fct>     <int>   <dbl>      <int>     <dbl>   <dbl>
##  1 United States  Americas   2007    78.2  301139947    42952. 1.29e13
##  2 China          Asia       2007    73.0 1318683096     4959. 6.54e12
##  3 Japan          Asia       2007    82.6  127467972    31656. 4.04e12
##  4 India          Asia       2007    64.7 1110396331     2452. 2.72e12
##  5 Germany        Europe     2007    79.4   82400996    32170. 2.65e12
##  6 United Kingdom Europe     2007    79.4   60776238    33203. 2.02e12
##  7 France         Europe     2007    80.7   61083916    30470. 1.86e12
##  8 Brazil         Americas   2007    72.4  190010647     9066. 1.72e12
##  9 Italy          Europe     2007    80.5   58147733    28570. 1.66e12
## 10 Mexico         Americas   2007    76.2  108700891    11978. 1.30e12
```

---

`transmute()` keeps only the derived column.
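A small sketch of the difference between the two verbs, assuming a hypothetical two-row table `toy` whose values are copied (rounded as printed) from the 2007 output shown earlier. The names `toy`, `with_mutate` and `with_transmute` are illustrative only.

```r
library(dplyr)

# Hypothetical toy table: two 2007 rows with values as printed earlier
toy <- tibble(
  country   = c("Pakistan", "Sri Lanka"),
  pop       = c(169270617, 20378239),
  gdpPercap = c(2606, 3970)
)

# mutate() appends gdp and keeps every existing column
with_mutate <- toy %>% mutate(gdp = gdpPercap * pop)
names(with_mutate)

# transmute() returns gdp alone; country, pop and gdpPercap are dropped
with_transmute <- toy %>% transmute(gdp = gdpPercap * pop)
names(with_transmute)
```

Because `transmute()` drops the identifying columns, it is mostly useful when only the derived values matter.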
Let's use `transmute()` in the example from above:

```r
gapminder %>%
  filter(year == 2007) %>%
  transmute(gdp = gdpPercap * pop) %>%
  arrange(desc(gdp)) %>%
  top_n(10)
```

```
## # A tibble: 10 × 1
##        gdp
##      <dbl>
##  1 1.29e13
##  2 6.54e12
##  3 4.04e12
##  4 2.72e12
##  5 2.65e12
##  6 2.02e12
##  7 1.86e12
##  8 1.72e12
##  9 1.66e12
## 10 1.30e12
```

---

## Ordering data by life expectancy with `arrange()`

```r
gapminder %>%
  select(country, year, lifeExp) %>%
  filter(year == 2007) %>%
  arrange(lifeExp)
```

```
## # A tibble: 142 × 3
##    country                   year lifeExp
##    <fct>                    <int>   <dbl>
##  1 Swaziland                 2007    39.6
##  2 Mozambique                2007    42.1
##  3 Zambia                    2007    42.4
##  4 Sierra Leone              2007    42.6
##  5 Lesotho                   2007    42.6
##  6 Angola                    2007    42.7
##  7 Zimbabwe                  2007    43.5
##  8 Afghanistan               2007    43.8
##  9 Central African Republic  2007    44.7
## 10 Liberia                   2007    45.7
## # … with 132 more rows
```

---

To order from top to bottom, use `arrange(desc())` as follows:

```r
gapminder %>%
  select(country, year, lifeExp) %>%
  filter(year == 2007) %>%
  arrange(desc(lifeExp))
```

```
## # A tibble: 142 × 3
##    country           year lifeExp
##    <fct>            <int>   <dbl>
##  1 Japan             2007    82.6
##  2 Hong Kong, China  2007    82.2
##  3 Iceland           2007    81.8
##  4 Switzerland       2007    81.7
##  5 Australia         2007    81.2
##  6 Spain             2007    80.9
##  7 Sweden            2007    80.9
##  8 Israel            2007    80.7
##  9 France            2007    80.7
## 10 Canada            2007    80.7
## # … with 132 more rows
```

---

### Top 5

```r
gapminder %>%
  select(country, year, lifeExp) %>%
  filter(year == 2007) %>%
  arrange(desc(lifeExp)) %>%
  top_n(5)
```

```
## # A tibble: 5 × 3
##   country           year lifeExp
##   <fct>            <int>   <dbl>
## 1 Japan             2007    82.6
## 2 Hong Kong, China  2007    82.2
## 3 Iceland           2007    81.8
## 4 Switzerland       2007    81.7
## 5 Australia         2007    81.2
```

---

# Summarising data

Another feature of dplyr is `summarise()`, which collapses data into summary statistics.

```r
gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(mean = mean(lifeExp), min = min(lifeExp), max = max(lifeExp))
```

```
## # A tibble: 5 × 4
##   continent  mean   min   max
##   <fct>     <dbl> <dbl> <dbl>
## 1 Africa     54.8  39.6  76.4
## 2 Americas   73.6  60.9  80.7
## 3 Asia       70.7  43.8  82.6
## 4 Europe     77.6  71.8  81.8
## 5 Oceania    80.7  80.2  81.2
```

---

```r
gapminder %>%
  summarise(avglifeExp = mean(lifeExp))
```

```
## # A tibble: 1 × 1
##   avglifeExp
##        <dbl>
## 1       59.5
```

---

## Summarising data by groups

```r
gapminder %>%
  filter(year == 2007, continent == "Asia") %>%
  summarize(avgLifeExp = mean(lifeExp))
```

```
## # A tibble: 1 × 1
##   avgLifeExp
##        <dbl>
## 1       70.7
```

---

```r
gapminder %>%
  group_by(continent) %>%
  filter(year == 2007) %>%
  summarize(avglife = mean(lifeExp))
```

```
## # A tibble: 5 × 2
##   continent avglife
##   <fct>       <dbl>
## 1 Africa       54.8
## 2 Americas     73.6
## 3 Asia         70.7
## 4 Europe       77.6
## 5 Oceania      80.7
```

---

## `if_else()` along with `mutate()`

```r
gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(avgLifeExp = mean(lifeExp)) %>%
  mutate(over70 = if_else(avgLifeExp > 70, "Y", "N")) # flag named after the 70-year threshold it tests
```

```
## # A tibble: 5 × 3
##   continent avgLifeExp over70
##   <fct>          <dbl> <chr>
## 1 Africa          54.8 N
## 2 Americas        73.6 Y
## 3 Asia            70.7 Y
## 4 Europe          77.6 Y
## 5 Oceania         80.7 Y
```

---

## Total Population by Continent in 2007

```r
gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(tot_pop = sum(pop))
```

```
## # A tibble: 5 × 2
##   continent    tot_pop
##   <fct>          <dbl>
## 1 Africa     929539692
## 2 Americas   898871184
## 3 Asia      3811953827
## 4 Europe     586098529
## 5 Oceania     24549947
```

---

## Percentiles

In general, it is assumed that the higher the GDP, the higher the life expectancy. To examine this assumption, let's calculate percentiles of `lifeExp`. A country's percentile indicates roughly what share of countries rank below it.

```r
gapminder %>%
  select(country, year, lifeExp, gdpPercap) %>%
  filter(year == 2007) %>%
  mutate(percentile = ntile(lifeExp, 100)) %>%
  arrange(desc(gdpPercap))
```

```
## # A tibble: 142 × 5
##    country           year lifeExp gdpPercap percentile
##    <fct>            <int>   <dbl>     <dbl>      <int>
##  1 Norway            2007    80.2    49357.         88
##  2 Kuwait            2007    77.6    47307.         68
##  3 Singapore         2007    80.0    47143.         87
##  4 United States     2007    78.2    42952.         71
##  5 Ireland           2007    78.9    40676.         79
##  6 Hong Kong, China  2007    82.2    39725.         99
##  7 Switzerland       2007    81.7    37506.         97
##  8 Netherlands       2007    79.8    36798.         85
##  9 Canada            2007    80.7    36319.         91
## 10 Iceland           2007    81.8    36181.         98
## # … with 132 more rows
```

Notice that all ten of the richest countries by GDP per capita sit well above the 60th percentile of life expectancy. Before drawing a conclusion, let's look at the bottom end.

---

So it makes sense that the higher the GDP, the higher the life expectancy. This is not a formal test, but the exploratory analysis is quite suggestive here.

```r
gapminder %>%
  select(country, year, lifeExp, gdpPercap) %>%
  filter(year == 2007) %>%
  mutate(percentile = ntile(lifeExp, 100)) %>%
  arrange(gdpPercap)
```

```
## # A tibble: 142 × 5
##    country                   year lifeExp gdpPercap percentile
##    <fct>                    <int>   <dbl>     <dbl>      <int>
##  1 Congo, Dem. Rep.          2007    46.5      278.          7
##  2 Liberia                   2007    45.7      415.          5
##  3 Burundi                   2007    49.6      430.         10
##  4 Zimbabwe                  2007    43.5      470.          4
##  5 Guinea-Bissau             2007    46.4      579.          6
##  6 Niger                     2007    56.9      620.         18
##  7 Eritrea                   2007    58.0      641.         19
##  8 Ethiopia                  2007    52.9      691.         14
##  9 Central African Republic  2007    44.7      706.          5
## 10 Gambia                    2007    59.4      753.         21
## # … with 132 more rows
```

---

# Advanced Analysis

The filtering done in the introductory analysis can seem difficult if you are not yet familiar with these basic verbs. Once you have worked with dplyr for some time, though, nothing here is particularly advanced or difficult. For example, suppose you have to find the top 10 countries above the 90th percentile of life expectancy in 2007.
You can reuse some of the logic from the previous sections, but answering this question requires several filtering and subsetting steps:

```r
gapminder %>%
  filter(year == 2007) %>%
  mutate(percentile = ntile(lifeExp, 100)) %>%
  filter(percentile > 90) %>%
  arrange(desc(percentile)) %>%
  top_n(10, wt = percentile) %>%
  select(country, continent, lifeExp, gdpPercap)
```

```
## # A tibble: 10 × 4
##    country          continent lifeExp gdpPercap
##    <fct>            <fct>       <dbl>     <dbl>
##  1 Japan            Asia         82.6    31656.
##  2 Hong Kong, China Asia         82.2    39725.
##  3 Iceland          Europe       81.8    36181.
##  4 Switzerland      Europe       81.7    37506.
##  5 Australia        Oceania      81.2    34435.
##  6 Spain            Europe       80.9    28821.
##  7 Sweden           Europe       80.9    33860.
##  8 Israel           Asia         80.7    25523.
##  9 France           Europe       80.7    30470.
## 10 Canada           Americas     80.7    36319.
```

---

If you are interested in the bottom 10 (the countries with the lowest life expectancy), use `top_n()` with `-10`.

```r
gapminder %>%
  filter(year == 2007) %>%
  mutate(percentile = ntile(lifeExp, 100)) %>%
  filter(percentile < 10) %>%
  arrange(percentile) %>%
  top_n(-10, wt = percentile) %>%
  select(country, continent, lifeExp, gdpPercap)
```

```
## # A tibble: 10 × 4
##    country                  continent lifeExp gdpPercap
##    <fct>                    <fct>       <dbl>     <dbl>
##  1 Mozambique               Africa       42.1      824.
##  2 Swaziland                Africa       39.6     4513.
##  3 Sierra Leone             Africa       42.6      863.
##  4 Zambia                   Africa       42.4     1271.
##  5 Angola                   Africa       42.7     4797.
##  6 Lesotho                  Africa       42.6     1569.
##  7 Afghanistan              Asia         43.8      975.
##  8 Zimbabwe                 Africa       43.5      470.
##  9 Central African Republic Africa       44.7      706.
## 10 Liberia                  Africa       45.7      415.
```

---

# Visualizing data to get data insight

Visualizing data is one of the most important aspects of gaining insight from data, and a good plot may reveal more than a complicated model. Visualizing large data sets used to be difficult, so researchers relied on mathematical and core econometric/regression models. `ggplot2`, which is part of the `tidyverse`, is probably one of the greatest tools for data visualization in `R`.
In the following sections we are going to visualize the `gapminder` data. A statistical graphic is a mapping of variables to `aes`thetic attributes of `geom`etric objects.

---

## 3 Essential components of `ggplot2`

- data: the dataset containing the variables of interest
- geom: the geometric object in question (line, point, bars)
- aes: aesthetic attributes of an object (x/y position, colors, shape, size)

---

## Scatter plot

```r
gapminder2007 <- gapminder %>%
  filter(year == 2007)
p1 <- ggplot(data = gapminder2007,
             mapping = aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point()
p1 + facet_wrap(~continent)
```

<img src="index_files/figure-html/unnamed-chunk-31-1.png" width="100%" />

```r
p1 + labs(x = "GDP Per Capita",
          y = "Life Expectancy in Years",
          title = "Economic Growth and Life Expectancy",
          subtitle = "Data points are country-years",
          caption = "Source: Gapminder.")
```

<img src="index_files/figure-html/unnamed-chunk-31-2.png" width="100%" />

---

## Bubbleplot

<img src="index_files/figure-html/unnamed-chunk-32-1.png" width="100%" />

---

If you just want to highlight the relationship between GDP per capita and life expectancy, you have probably done most of the work by now. However, it is good practice to highlight a few interesting dots in the chart to make the plot more informative:

<img src="index_files/figure-html/unnamed-chunk-33-1.png" width="100%" />

---

```r
## This is a table of data about a large number of countries, each observed over several years. Let's make a scatterplot with it.
P <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
P + geom_point() + geom_smooth()
```

<img src="index_files/figure-html/unnamed-chunk-34-1.png" width="100%" />

```r
P + geom_point() + geom_smooth(method = "lm")
```

<img src="index_files/figure-html/unnamed-chunk-34-2.png" width="100%" />

```r
P + geom_point() + geom_smooth(method = "gam") + scale_x_log10()
```

<img src="index_files/figure-html/unnamed-chunk-34-3.png" width="100%" />

```r
P + geom_point() + geom_smooth(method = "gam") + scale_x_log10(labels = scales::dollar)
```

<img src="index_files/figure-html/unnamed-chunk-34-4.png" width="100%" />

```r
# Note: color = "purple" inside aes() maps a constant to the colour scale,
# so the points get a legend entry labelled "purple" rather than a purple colour
P <- ggplot(data = gapminder, mapping = aes(gdpPercap, y = lifeExp, color = "purple"))
P + geom_point() + geom_smooth(method = "loess") + scale_x_log10()
```

<img src="index_files/figure-html/unnamed-chunk-34-5.png" width="100%" />

---

<img src="index_files/figure-html/unnamed-chunk-35-1.png" width="100%" /><img src="index_files/figure-html/unnamed-chunk-35-2.png" width="100%" />

---

# With proper title

```r
P <- ggplot(data = gapminder, mapping = aes(gdpPercap, y = lifeExp))
P + geom_point(alpha = 0.3) +
  geom_smooth(method = "gam") +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita",
       y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.")
```

<img src="index_files/figure-html/unnamed-chunk-36-1.png" width="100%" />

---

<img src="index_files/figure-html/unnamed-chunk-37-1.png" width="100%" />

---

<img src="index_files/figure-html/unnamed-chunk-38-1.png" width="100%" />

---

<img src="index_files/figure-html/unnamed-chunk-39-1.png" width="100%" />

```r
p + geom_point(mapping = aes(color = log(pop))) + scale_x_log10()
```

<img src="index_files/figure-html/unnamed-chunk-40-1.png" width="100%" />

---

## Comparison of 2007 vs 1952 continent-wise

<img src="index_files/figure-html/unnamed-chunk-41-1.png" width="100%" />

---

```r
p1 + labs(x = "GDP Per Capita",
          y = "Life Expectancy in Years",
          title = "Economic Growth and Life Expectancy",
          subtitle = "Data points are country-years",
          caption = "Source: Gapminder.")
```

<img src="index_files/figure-html/unnamed-chunk-42-1.png" width="100%" />

---

```r
gapminder %>%
  select(-pop, -gdpPercap) %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(mean_life_exp = mean(lifeExp), median_life_exp = median(lifeExp))
```

```
## # A tibble: 5 × 3
##   continent mean_life_exp median_life_exp
##   <fct>             <dbl>           <dbl>
## 1 Africa             54.8            52.9
## 2 Americas           73.6            72.9
## 3 Asia               70.7            72.4
## 4 Europe             77.6            78.6
## 5 Oceania            80.7            80.7
```

---

```r
gapminder %>%
  select(-pop, -gdpPercap) %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(median_life_exp = median(lifeExp)) %>%
  ggplot() +
  aes(x = continent) +
  aes(y = median_life_exp) +
  geom_point(color = "blue", size = 3, alpha = .4) +
  scale_y_continuous(limits = c(0, 85)) +
  labs(title = "Median life expectancy by continent in 2007") +
  labs(subtitle = "Data Source: Gapminder package in R") +
  labs(x = "") +
  labs(y = "Median life expectancy") +
  labs(caption = "Zahid Asghar for 'QM4SSH'") +
  theme_minimal()
```

<img src="index_files/figure-html/unnamed-chunk-44-1.png" width="100%" />

---

.pull-left[
```r
gapminder %>%
  filter(year == 1997) %>%
  select(country, continent, lifeExp) %>%
  mutate(life_cats = case_when(lifeExp >= 70 ~ "70+",
                               lifeExp < 70 ~ "<70")) %>%
  ggplot() +
  aes(x = continent, y = life_cats) +
  geom_jitter(width = .25, height = .25) +
  aes(col = continent) +
  scale_color_discrete(guide = "none") + # guide = FALSE is deprecated
  theme_dark() +
  labs(x = "", y = "") +
  labs(title = "Life expectancy beyond 70 in 1997")
```

<img src="index_files/figure-html/unnamed-chunk-45-1.png" width="100%" />
]

---

```r
gapminder %>%
  mutate(gdp_billions = gdpPercap * pop / 1000000000) %>%
  ggplot() +
  aes(x = year) +
  aes(y = gdp_billions) +
  geom_line() +
  aes(group = country) +
  scale_y_log10() +
  aes(col = continent) +
  facet_wrap(~ continent) +
  scale_color_discrete(guide = "none") + # guide = FALSE is deprecated
  theme_minimal()
```

<img src="index_files/figure-html/unnamed-chunk-46-1.png" width="100%" />

---

```r
gapminder %>%
  filter(year == 2007) %>%
  ggplot() +
  aes(x = continent, y = lifeExp) +
  geom_boxplot() +
  geom_jitter(height = 0, width = .2) +
  stat_summary(fun = mean, geom = "point", # `fun.y` is deprecated; use `fun`
               col = "goldenrod3", size = 5)
```

<img src="index_files/figure-html/unnamed-chunk-47-1.png" width="100%" />

---

## Pakistan

```r
gapminder %>%
  filter(country == "Pakistan") %>%
  ggplot() +
  aes(x = year, y = lifeExp) +
  geom_point() +
  geom_line() +
  aes(alpha = year) +
  aes(col = year) +
  scale_color_viridis_c() +
  theme_classic()
```

<img src="index_files/figure-html/unnamed-chunk-48-1.png" width="100%" />

---

```r
gapminder %>%
  filter(year == 2007) %>%
  ggplot() +
  aes(x = gdpPercap) +
  aes(y = lifeExp) +
  geom_point(alpha = .5) +
  geom_rug(size = 1) +
  aes(col = continent) +
  aes(col = lifeExp) +
  scale_x_log10() +
  aes(size = gdpPercap) +
  aes(size = pop) +
  geom_point(col = "darkgreen", size = 1) +
  facet_wrap(~ continent)
```

<img src="index_files/figure-html/unnamed-chunk-49-1.png" width="100%" />

---
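The last chunk adds several `aes()` calls for the same aesthetic. A rough sketch of what happens in that case, assuming a hypothetical three-row data frame `toy` (values rounded from the 2007 table shown earlier) and a made-up object name `p_layers`: when two plot-level `aes()` calls set the same aesthetic, the later one replaces the earlier one.

```r
library(ggplot2)

# Hypothetical toy data: three 2007 rows, rounded as printed earlier
toy <- data.frame(
  continent = c("Asia", "Europe", "Africa"),
  lifeExp   = c(43.8, 76.4, 72.3),
  gdpPercap = c(975, 5937, 6223),
  pop       = c(31889923, 3600523, 33333216)
)

p_layers <- ggplot(toy) +
  aes(x = gdpPercap, y = lifeExp) +
  aes(col = continent) +   # first colour mapping
  aes(col = lifeExp) +     # replaces the colour mapping above
  aes(size = gdpPercap) +  # first size mapping
  aes(size = pop)          # replaces the size mapping above

# Inspect the plot-level mappings that survive
p_layers$mapping
```

Only the last mapping for each aesthetic remains at the plot level, so in a plot built this way colour follows `lifeExp` and size follows `pop`, while a layer with fixed values such as `geom_point(col = "darkgreen", size = 1)` ignores both.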