Exploring Worldwide Trends in Women's Fertility, Per Capita Income, and Life Expectancy¶

Table of Contents¶

  • Introduction
  • Data Wrangling
  • Exploratory Data Analysis
  • Conclusions
  • Resources

Introduction¶

The Gapminder World dataset contains data points on various socioeconomic metrics for the world's countries over the past 200 years. The exploration described below focuses on trends in women's fertility, per capita income, and life expectancy. The specific questions explored are:

  • which countries currently have the highest and lowest fertility rates?
  • how have the fertility rates for those countries been trending in the last 20 years?
  • which sub-regions of the world are currently the richest, and which are the poorest?
  • how has that wealth pattern changed over the past 200 years?
  • which sub-regions of the world currently have the highest and lowest life expectancies?
  • how does life expectancy relate to income over the past 100 years?

NOTE: This exploration was not intended to imply causation between any of the examined metrics. Any trends shown here are at best correlation rather than causation.

Data Wrangling¶

General Properties¶

In [1]:
import pandas as pd
import numpy as np
from numpy.polynomial import Polynomial
import matplotlib.pyplot as plt
import seaborn as sns
import math

%matplotlib inline
In [2]:
def read_data(filename: str, sep: str = ','):
    return pd.read_csv(f'datasets/{filename}', sep=sep)

Fertility¶

In [3]:
fertility_df = read_data('children_per_woman_total_fertility.csv')
fertility_df.head()
Out[3]:
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100
0 Aruba 5.64 5.64 5.64 5.64 5.64 5.64 5.64 5.64 5.64 ... 1.82 1.82 1.82 1.82 1.82 1.82 1.82 1.82 1.83 1.83
1 Afghanistan 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 ... 1.74 1.74 1.74 1.74 1.74 1.74 1.74 1.74 1.74 1.74
2 Angola 6.93 6.93 6.93 6.93 6.93 6.93 6.93 6.94 6.94 ... 2.54 2.52 2.50 2.48 2.47 2.45 2.43 2.42 2.40 2.40
3 Albania 4.60 4.60 4.60 4.60 4.60 4.60 4.60 4.60 4.60 ... 1.78 1.78 1.78 1.79 1.79 1.79 1.79 1.79 1.79 1.79
4 Netherlands Antilles 5.80 5.80 5.80 5.80 5.80 5.80 5.80 5.80 5.80 ... 2.00 2.00 2.01 2.01 2.01 2.01 2.01 2.02 2.02 2.02

5 rows × 302 columns

In [4]:
fertility_df.shape
Out[4]:
(202, 302)

The dataset contains data for up to 202 countries, and includes projected fertility rates going out to 2100.

In [5]:
fertility_df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202 entries, 0 to 201
Data columns (total 302 columns):
 #    Column   Non-Null Count  Dtype  
---   ------   --------------  -----  
 0    country  202 non-null    object 
 1    1800     202 non-null    float64
 2    1801     202 non-null    float64
 3    1802     202 non-null    float64
 4    1803     202 non-null    float64
 5    1804     202 non-null    float64
 6    1805     202 non-null    float64
 7    1806     202 non-null    float64
 8    1807     202 non-null    float64
 9    1808     202 non-null    float64
 10   1809     202 non-null    float64
 11   1810     202 non-null    float64
 12   1811     202 non-null    float64
 13   1812     202 non-null    float64
 14   1813     202 non-null    float64
 15   1814     202 non-null    float64
 16   1815     202 non-null    float64
 17   1816     202 non-null    float64
 18   1817     202 non-null    float64
 19   1818     202 non-null    float64
 20   1819     202 non-null    float64
 21   1820     202 non-null    float64
 22   1821     202 non-null    float64
 23   1822     202 non-null    float64
 24   1823     202 non-null    float64
 25   1824     202 non-null    float64
 26   1825     202 non-null    float64
 27   1826     202 non-null    float64
 28   1827     202 non-null    float64
 29   1828     202 non-null    float64
 30   1829     202 non-null    float64
 31   1830     202 non-null    float64
 32   1831     202 non-null    float64
 33   1832     202 non-null    float64
 34   1833     202 non-null    float64
 35   1834     202 non-null    float64
 36   1835     202 non-null    float64
 37   1836     202 non-null    float64
 38   1837     202 non-null    float64
 39   1838     202 non-null    float64
 40   1839     202 non-null    float64
 41   1840     202 non-null    float64
 42   1841     202 non-null    float64
 43   1842     202 non-null    float64
 44   1843     202 non-null    float64
 45   1844     202 non-null    float64
 46   1845     202 non-null    float64
 47   1846     202 non-null    float64
 48   1847     202 non-null    float64
 49   1848     202 non-null    float64
 50   1849     202 non-null    float64
 51   1850     202 non-null    float64
 52   1851     202 non-null    float64
 53   1852     202 non-null    float64
 54   1853     202 non-null    float64
 55   1854     202 non-null    float64
 56   1855     202 non-null    float64
 57   1856     202 non-null    float64
 58   1857     202 non-null    float64
 59   1858     202 non-null    float64
 60   1859     202 non-null    float64
 61   1860     202 non-null    float64
 62   1861     202 non-null    float64
 63   1862     202 non-null    float64
 64   1863     202 non-null    float64
 65   1864     202 non-null    float64
 66   1865     202 non-null    float64
 67   1866     202 non-null    float64
 68   1867     202 non-null    float64
 69   1868     202 non-null    float64
 70   1869     202 non-null    float64
 71   1870     202 non-null    float64
 72   1871     202 non-null    float64
 73   1872     202 non-null    float64
 74   1873     202 non-null    float64
 75   1874     202 non-null    float64
 76   1875     202 non-null    float64
 77   1876     202 non-null    float64
 78   1877     202 non-null    float64
 79   1878     202 non-null    float64
 80   1879     202 non-null    float64
 81   1880     202 non-null    float64
 82   1881     202 non-null    float64
 83   1882     202 non-null    float64
 84   1883     202 non-null    float64
 85   1884     202 non-null    float64
 86   1885     202 non-null    float64
 87   1886     202 non-null    float64
 88   1887     202 non-null    float64
 89   1888     202 non-null    float64
 90   1889     202 non-null    float64
 91   1890     202 non-null    float64
 92   1891     202 non-null    float64
 93   1892     202 non-null    float64
 94   1893     202 non-null    float64
 95   1894     202 non-null    float64
 96   1895     202 non-null    float64
 97   1896     202 non-null    float64
 98   1897     202 non-null    float64
 99   1898     202 non-null    float64
 100  1899     202 non-null    float64
 101  1900     202 non-null    float64
 102  1901     202 non-null    float64
 103  1902     202 non-null    float64
 104  1903     202 non-null    float64
 105  1904     202 non-null    float64
 106  1905     202 non-null    float64
 107  1906     202 non-null    float64
 108  1907     202 non-null    float64
 109  1908     202 non-null    float64
 110  1909     202 non-null    float64
 111  1910     202 non-null    float64
 112  1911     202 non-null    float64
 113  1912     202 non-null    float64
 114  1913     202 non-null    float64
 115  1914     202 non-null    float64
 116  1915     202 non-null    float64
 117  1916     202 non-null    float64
 118  1917     202 non-null    float64
 119  1918     202 non-null    float64
 120  1919     202 non-null    float64
 121  1920     202 non-null    float64
 122  1921     202 non-null    float64
 123  1922     202 non-null    float64
 124  1923     202 non-null    float64
 125  1924     202 non-null    float64
 126  1925     202 non-null    float64
 127  1926     202 non-null    float64
 128  1927     202 non-null    float64
 129  1928     202 non-null    float64
 130  1929     202 non-null    float64
 131  1930     202 non-null    float64
 132  1931     202 non-null    float64
 133  1932     202 non-null    float64
 134  1933     202 non-null    float64
 135  1934     202 non-null    float64
 136  1935     202 non-null    float64
 137  1936     202 non-null    float64
 138  1937     202 non-null    float64
 139  1938     202 non-null    float64
 140  1939     202 non-null    float64
 141  1940     202 non-null    float64
 142  1941     202 non-null    float64
 143  1942     202 non-null    float64
 144  1943     202 non-null    float64
 145  1944     202 non-null    float64
 146  1945     202 non-null    float64
 147  1946     202 non-null    float64
 148  1947     202 non-null    float64
 149  1948     202 non-null    float64
 150  1949     202 non-null    float64
 151  1950     202 non-null    float64
 152  1951     202 non-null    float64
 153  1952     202 non-null    float64
 154  1953     202 non-null    float64
 155  1954     202 non-null    float64
 156  1955     202 non-null    float64
 157  1956     202 non-null    float64
 158  1957     202 non-null    float64
 159  1958     202 non-null    float64
 160  1959     202 non-null    float64
 161  1960     202 non-null    float64
 162  1961     202 non-null    float64
 163  1962     202 non-null    float64
 164  1963     202 non-null    float64
 165  1964     202 non-null    float64
 166  1965     202 non-null    float64
 167  1966     202 non-null    float64
 168  1967     202 non-null    float64
 169  1968     202 non-null    float64
 170  1969     202 non-null    float64
 171  1970     202 non-null    float64
 172  1971     202 non-null    float64
 173  1972     202 non-null    float64
 174  1973     202 non-null    float64
 175  1974     202 non-null    float64
 176  1975     202 non-null    float64
 177  1976     202 non-null    float64
 178  1977     202 non-null    float64
 179  1978     202 non-null    float64
 180  1979     202 non-null    float64
 181  1980     202 non-null    float64
 182  1981     202 non-null    float64
 183  1982     202 non-null    float64
 184  1983     202 non-null    float64
 185  1984     202 non-null    float64
 186  1985     202 non-null    float64
 187  1986     202 non-null    float64
 188  1987     202 non-null    float64
 189  1988     202 non-null    float64
 190  1989     202 non-null    float64
 191  1990     202 non-null    float64
 192  1991     202 non-null    float64
 193  1992     202 non-null    float64
 194  1993     202 non-null    float64
 195  1994     202 non-null    float64
 196  1995     202 non-null    float64
 197  1996     202 non-null    float64
 198  1997     202 non-null    float64
 199  1998     202 non-null    float64
 200  1999     202 non-null    float64
 201  2000     202 non-null    float64
 202  2001     202 non-null    float64
 203  2002     202 non-null    float64
 204  2003     202 non-null    float64
 205  2004     202 non-null    float64
 206  2005     202 non-null    float64
 207  2006     202 non-null    float64
 208  2007     202 non-null    float64
 209  2008     202 non-null    float64
 210  2009     202 non-null    float64
 211  2010     202 non-null    float64
 212  2011     202 non-null    float64
 213  2012     202 non-null    float64
 214  2013     202 non-null    float64
 215  2014     201 non-null    float64
 216  2015     201 non-null    float64
 217  2016     201 non-null    float64
 218  2017     201 non-null    float64
 219  2018     201 non-null    float64
 220  2019     201 non-null    float64
 221  2020     201 non-null    float64
 222  2021     201 non-null    float64
 223  2022     201 non-null    float64
 224  2023     201 non-null    float64
 225  2024     201 non-null    float64
 226  2025     201 non-null    float64
 227  2026     201 non-null    float64
 228  2027     201 non-null    float64
 229  2028     201 non-null    float64
 230  2029     201 non-null    float64
 231  2030     201 non-null    float64
 232  2031     201 non-null    float64
 233  2032     201 non-null    float64
 234  2033     201 non-null    float64
 235  2034     201 non-null    float64
 236  2035     201 non-null    float64
 237  2036     201 non-null    float64
 238  2037     201 non-null    float64
 239  2038     201 non-null    float64
 240  2039     201 non-null    float64
 241  2040     201 non-null    float64
 242  2041     201 non-null    float64
 243  2042     201 non-null    float64
 244  2043     201 non-null    float64
 245  2044     201 non-null    float64
 246  2045     201 non-null    float64
 247  2046     201 non-null    float64
 248  2047     201 non-null    float64
 249  2048     201 non-null    float64
 250  2049     201 non-null    float64
 251  2050     201 non-null    float64
 252  2051     201 non-null    float64
 253  2052     201 non-null    float64
 254  2053     201 non-null    float64
 255  2054     201 non-null    float64
 256  2055     201 non-null    float64
 257  2056     201 non-null    float64
 258  2057     201 non-null    float64
 259  2058     201 non-null    float64
 260  2059     201 non-null    float64
 261  2060     201 non-null    float64
 262  2061     201 non-null    float64
 263  2062     201 non-null    float64
 264  2063     201 non-null    float64
 265  2064     201 non-null    float64
 266  2065     201 non-null    float64
 267  2066     201 non-null    float64
 268  2067     201 non-null    float64
 269  2068     201 non-null    float64
 270  2069     201 non-null    float64
 271  2070     201 non-null    float64
 272  2071     201 non-null    float64
 273  2072     201 non-null    float64
 274  2073     201 non-null    float64
 275  2074     201 non-null    float64
 276  2075     201 non-null    float64
 277  2076     201 non-null    float64
 278  2077     201 non-null    float64
 279  2078     201 non-null    float64
 280  2079     201 non-null    float64
 281  2080     201 non-null    float64
 282  2081     201 non-null    float64
 283  2082     201 non-null    float64
 284  2083     201 non-null    float64
 285  2084     201 non-null    float64
 286  2085     201 non-null    float64
 287  2086     201 non-null    float64
 288  2087     201 non-null    float64
 289  2088     201 non-null    float64
 290  2089     201 non-null    float64
 291  2090     201 non-null    float64
 292  2091     201 non-null    float64
 293  2092     201 non-null    float64
 294  2093     201 non-null    float64
 295  2094     201 non-null    float64
 296  2095     201 non-null    float64
 297  2096     201 non-null    float64
 298  2097     201 non-null    float64
 299  2098     201 non-null    float64
 300  2099     201 non-null    float64
 301  2100     201 non-null    float64
dtypes: float64(301), object(1)
memory usage: 476.7+ KB

All the numbers are already floats, and there's very little missing data (only one country doesn't have data starting from 2014).

In [6]:
fertility_df.duplicated().sum()
Out[6]:
0

There are no duplicated rows.

Per Capita Income¶

In [ ]:
 
In [7]:
income_per_person_df = read_data('income_per_person_gdppercapita_ppp_inflation_adjusted.csv')
income_per_person_df.head()
Out[7]:
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050
0 Afghanistan 683 683 683 683 683 683 683 683 683 ... 2690 2750 2810 2870 2930 2990 3060 3120 3190 3260
1 Angola 700 702 705 709 711 714 718 721 725 ... 8000 8170 8350 8530 8710 8900 9090 9280 9480 9690
2 Albania 755 755 755 755 755 756 756 756 756 ... 25.1k 25.6k 26.2k 26.7k 27.3k 27.9k 28.5k 29.1k 29.7k 30.4k
3 Andorra 1360 1360 1360 1360 1370 1370 1370 1370 1380 ... 68.9k 70.4k 71.9k 73.4k 75k 76.6k 78.3k 80k 81.7k 83.4k
4 United Arab Emirates 1130 1130 1140 1140 1150 1150 1160 1160 1160 ... 101k 103k 105k 107k 110k 112k 114k 117k 119k 122k

5 rows × 252 columns

In [8]:
income_per_person_df.shape
Out[8]:
(195, 252)

The dataset contains data for up to 195 countries, and includes projected income levels going out to 2050.

In [9]:
income_per_person_df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 252 columns):
 #    Column   Non-Null Count  Dtype 
---   ------   --------------  ----- 
 0    country  195 non-null    object
 1    1800     195 non-null    int64 
 2    1801     195 non-null    int64 
 3    1802     195 non-null    int64 
 4    1803     195 non-null    int64 
 5    1804     195 non-null    int64 
 6    1805     195 non-null    int64 
 7    1806     195 non-null    int64 
 8    1807     195 non-null    int64 
 9    1808     195 non-null    int64 
 10   1809     195 non-null    int64 
 11   1810     195 non-null    int64 
 12   1811     195 non-null    int64 
 13   1812     195 non-null    int64 
 14   1813     195 non-null    int64 
 15   1814     195 non-null    int64 
 16   1815     195 non-null    int64 
 17   1816     195 non-null    int64 
 18   1817     195 non-null    int64 
 19   1818     195 non-null    int64 
 20   1819     195 non-null    int64 
 21   1820     195 non-null    int64 
 22   1821     195 non-null    int64 
 23   1822     195 non-null    int64 
 24   1823     195 non-null    int64 
 25   1824     195 non-null    int64 
 26   1825     195 non-null    int64 
 27   1826     195 non-null    int64 
 28   1827     195 non-null    int64 
 29   1828     195 non-null    int64 
 30   1829     195 non-null    int64 
 31   1830     195 non-null    int64 
 32   1831     195 non-null    int64 
 33   1832     195 non-null    int64 
 34   1833     195 non-null    int64 
 35   1834     195 non-null    int64 
 36   1835     195 non-null    int64 
 37   1836     195 non-null    int64 
 38   1837     195 non-null    int64 
 39   1838     195 non-null    int64 
 40   1839     195 non-null    int64 
 41   1840     195 non-null    int64 
 42   1841     195 non-null    int64 
 43   1842     195 non-null    int64 
 44   1843     195 non-null    int64 
 45   1844     195 non-null    int64 
 46   1845     195 non-null    int64 
 47   1846     195 non-null    int64 
 48   1847     195 non-null    int64 
 49   1848     195 non-null    int64 
 50   1849     195 non-null    int64 
 51   1850     195 non-null    int64 
 52   1851     195 non-null    int64 
 53   1852     195 non-null    int64 
 54   1853     195 non-null    int64 
 55   1854     195 non-null    int64 
 56   1855     195 non-null    int64 
 57   1856     195 non-null    int64 
 58   1857     195 non-null    int64 
 59   1858     195 non-null    int64 
 60   1859     195 non-null    int64 
 61   1860     195 non-null    int64 
 62   1861     195 non-null    int64 
 63   1862     195 non-null    int64 
 64   1863     195 non-null    int64 
 65   1864     195 non-null    int64 
 66   1865     195 non-null    int64 
 67   1866     195 non-null    int64 
 68   1867     195 non-null    int64 
 69   1868     195 non-null    int64 
 70   1869     195 non-null    int64 
 71   1870     195 non-null    int64 
 72   1871     195 non-null    int64 
 73   1872     195 non-null    int64 
 74   1873     195 non-null    int64 
 75   1874     195 non-null    int64 
 76   1875     195 non-null    int64 
 77   1876     195 non-null    int64 
 78   1877     195 non-null    int64 
 79   1878     195 non-null    int64 
 80   1879     195 non-null    int64 
 81   1880     195 non-null    int64 
 82   1881     195 non-null    int64 
 83   1882     195 non-null    int64 
 84   1883     195 non-null    int64 
 85   1884     195 non-null    int64 
 86   1885     195 non-null    int64 
 87   1886     195 non-null    int64 
 88   1887     195 non-null    int64 
 89   1888     195 non-null    int64 
 90   1889     195 non-null    int64 
 91   1890     195 non-null    int64 
 92   1891     195 non-null    int64 
 93   1892     195 non-null    int64 
 94   1893     195 non-null    int64 
 95   1894     195 non-null    int64 
 96   1895     195 non-null    int64 
 97   1896     195 non-null    int64 
 98   1897     195 non-null    int64 
 99   1898     195 non-null    int64 
 100  1899     195 non-null    int64 
 101  1900     195 non-null    int64 
 102  1901     195 non-null    object
 103  1902     195 non-null    object
 104  1903     195 non-null    object
 105  1904     195 non-null    object
 106  1905     195 non-null    object
 107  1906     195 non-null    object
 108  1907     195 non-null    object
 109  1908     195 non-null    object
 110  1909     195 non-null    object
 111  1910     195 non-null    object
 112  1911     195 non-null    object
 113  1912     195 non-null    object
 114  1913     195 non-null    object
 115  1914     195 non-null    object
 116  1915     195 non-null    object
 117  1916     195 non-null    object
 118  1917     195 non-null    object
 119  1918     195 non-null    object
 120  1919     195 non-null    object
 121  1920     195 non-null    object
 122  1921     195 non-null    object
 123  1922     195 non-null    object
 124  1923     195 non-null    object
 125  1924     195 non-null    object
 126  1925     195 non-null    object
 127  1926     195 non-null    object
 128  1927     195 non-null    object
 129  1928     195 non-null    object
 130  1929     195 non-null    object
 131  1930     195 non-null    object
 132  1931     195 non-null    object
 133  1932     195 non-null    object
 134  1933     195 non-null    object
 135  1934     195 non-null    object
 136  1935     195 non-null    object
 137  1936     195 non-null    object
 138  1937     195 non-null    object
 139  1938     195 non-null    object
 140  1939     195 non-null    object
 141  1940     195 non-null    object
 142  1941     195 non-null    object
 143  1942     195 non-null    object
 144  1943     195 non-null    object
 145  1944     195 non-null    object
 146  1945     195 non-null    object
 147  1946     195 non-null    object
 148  1947     195 non-null    object
 149  1948     195 non-null    object
 150  1949     195 non-null    object
 151  1950     195 non-null    object
 152  1951     195 non-null    object
 153  1952     195 non-null    object
 154  1953     195 non-null    object
 155  1954     195 non-null    object
 156  1955     195 non-null    object
 157  1956     195 non-null    object
 158  1957     195 non-null    object
 159  1958     195 non-null    object
 160  1959     195 non-null    object
 161  1960     195 non-null    object
 162  1961     195 non-null    object
 163  1962     195 non-null    object
 164  1963     195 non-null    object
 165  1964     195 non-null    object
 166  1965     195 non-null    object
 167  1966     195 non-null    object
 168  1967     195 non-null    object
 169  1968     195 non-null    object
 170  1969     195 non-null    object
 171  1970     195 non-null    object
 172  1971     195 non-null    object
 173  1972     195 non-null    object
 174  1973     195 non-null    object
 175  1974     195 non-null    object
 176  1975     195 non-null    object
 177  1976     195 non-null    object
 178  1977     195 non-null    object
 179  1978     195 non-null    object
 180  1979     195 non-null    object
 181  1980     195 non-null    object
 182  1981     195 non-null    object
 183  1982     195 non-null    object
 184  1983     195 non-null    object
 185  1984     195 non-null    object
 186  1985     195 non-null    object
 187  1986     195 non-null    object
 188  1987     195 non-null    object
 189  1988     195 non-null    object
 190  1989     195 non-null    object
 191  1990     195 non-null    object
 192  1991     195 non-null    object
 193  1992     195 non-null    object
 194  1993     195 non-null    object
 195  1994     195 non-null    object
 196  1995     195 non-null    object
 197  1996     195 non-null    object
 198  1997     195 non-null    object
 199  1998     195 non-null    object
 200  1999     195 non-null    object
 201  2000     195 non-null    object
 202  2001     195 non-null    object
 203  2002     195 non-null    object
 204  2003     195 non-null    object
 205  2004     195 non-null    object
 206  2005     195 non-null    object
 207  2006     195 non-null    object
 208  2007     195 non-null    object
 209  2008     195 non-null    object
 210  2009     195 non-null    object
 211  2010     195 non-null    object
 212  2011     195 non-null    object
 213  2012     195 non-null    object
 214  2013     195 non-null    object
 215  2014     195 non-null    object
 216  2015     195 non-null    object
 217  2016     195 non-null    object
 218  2017     195 non-null    object
 219  2018     195 non-null    object
 220  2019     195 non-null    object
 221  2020     195 non-null    object
 222  2021     195 non-null    object
 223  2022     195 non-null    object
 224  2023     195 non-null    object
 225  2024     195 non-null    object
 226  2025     195 non-null    object
 227  2026     195 non-null    object
 228  2027     195 non-null    object
 229  2028     195 non-null    object
 230  2029     195 non-null    object
 231  2030     195 non-null    object
 232  2031     195 non-null    object
 233  2032     195 non-null    object
 234  2033     195 non-null    object
 235  2034     195 non-null    object
 236  2035     195 non-null    object
 237  2036     195 non-null    object
 238  2037     195 non-null    object
 239  2038     195 non-null    object
 240  2039     195 non-null    object
 241  2040     195 non-null    object
 242  2041     195 non-null    object
 243  2042     195 non-null    object
 244  2043     195 non-null    object
 245  2044     195 non-null    object
 246  2045     195 non-null    object
 247  2046     195 non-null    object
 248  2047     195 non-null    object
 249  2048     195 non-null    object
 250  2049     195 non-null    object
 251  2050     195 non-null    object
dtypes: int64(101), object(151)
memory usage: 384.0+ KB

There appear to be no empty values, though the data types change from integers to strings beginning from 1901.

In [10]:
income_per_person_df['1808'].unique()
Out[10]:
array([ 683,  725,  756, 1380, 1160, 1770,  582,  858,  968, 2160,  878,
        477, 2730,  676,  543,  992, 1230, 1450, 1640,  757,  691,  655,
       1010, 1040, 1710,  713,  451,  481, 1560, 2780, 1050,  678,  710,
        567,  666,  795,  600,  769,  967, 2170, 1830,  873,  752, 2310,
        796, 1120,  603, 1760, 1070,  592, 1460,  890, 2090,  589,  443,
       3780,  616,  792,  511,  925,  642,  406, 1780, 1090,  971, 1210,
        764, 1310,  718, 1420, 1060, 1650, 1020, 2860, 3210, 1300, 1530,
        991,  576,  436,  628, 1320,  948, 1280,  980, 1250,  903, 1190,
        400,  809, 2230,  704,  649,  954, 1330,  614,  782,  885,  933,
       1200,  671,  444,  598,  396,  612,  506,  978, 1100, 3060, 1580,
        745,  810,  960, 1080,  951,  619,  863,  281, 1680,  946,  765,
        225,  492,  918,  588,  564,  411,  832, 1110, 1610,  791, 1540,
        575,  977, 1850, 1620, 1600, 1490,  555, 1260,  474,  773,  632,
        599,  751, 1400,  799,  814,  637,  525,  872, 2200, 2940,  569,
        950,  976,  663, 1590,  993])
In [11]:
income_per_person_df['1908'].unique()
Out[11]:
array(['1120', '774', '1410', '3980', '5770', '6780', '1090', '1480',
       '8700', '6150', '1740', '523', '7290', '1040', '601', '1160',
       '2060', '3070', '2970', '1340', '1190', '1880', '1100', '2310',
       '850', '689', '736', '6620', '9930', '4670', '1070', '682', '1200',
       '857', '883', '1350', '880', '2350', '2130', '2740', '5020',
       '6030', '1360', '6580', '1220', '2040', '1510', '1870', '756',
       '3390', '2340', '611', '5640', '627', '877', '8670', '1140',
       '1030', '993', '691', '637', '2960', '1580', '1990', '2490',
       '2540', '2170', '3630', '1760', '3580', '1150', '4820', '1430',
       '3220', '3160', '4410', '1110', '1780', '2840', '1250', '497',
       '686', '2390', '1260', '3140', '1210', '2560', '2920', '1330',
       '2260', '423', '1280', '6300', '1380', '994', '1060', '652',
       '1860', '705', '1080', '2240', '855', '471', '738', '1630', '403',
       '2280', '2320', '6540', '4200', '976', '961', '8960', '2910',
       '1800', '1440', '835', '3130', '385', '2850', '2800', '732',
       '3560', '559', '900', '863', '2890', '562', '1790', '2940', '871',
       '3660', '881', '1180', '3360', '3740', '3800', '4600', '1370',
       '2300', '726', '1540', '1270', '730', '1010', '2250', '1570',
       '1670', '617', '805', '1900', '5060', '10.4k', '920', '1720',
       '1970', '1290', '895', '2140', '975'], dtype=object)

Comparing the unique integer values for 1808 with the unique string values for 1908, it becomes clear that some of the larger numbers are shortened (eg 10.4k instead of 10400). This will require further processing to achieve consistency.

In [12]:
income_per_person_df.duplicated().sum()
Out[12]:
0

There are no duplicated rows.

Life Expectancy¶

In [13]:
life_expectancy_df = read_data('life_expectancy_years.csv')
life_expectancy_df.head()
Out[13]:
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100
0 Afghanistan 28.2 28.2 28.2 28.2 28.2 28.2 28.1 28.1 28.1 ... 75.5 75.7 75.8 76.0 76.1 76.2 76.4 76.5 76.6 76.8
1 Angola 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 ... 78.8 79.0 79.1 79.2 79.3 79.5 79.6 79.7 79.9 80.0
2 Albania 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 ... 87.4 87.5 87.6 87.7 87.8 87.9 88.0 88.2 88.3 88.4
3 Andorra NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 United Arab Emirates 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 ... 82.4 82.5 82.6 82.7 82.8 82.9 83.0 83.1 83.2 83.3

5 rows × 302 columns

In [14]:
life_expectancy_df.shape
Out[14]:
(195, 302)

The dataset contains data for up to 195 countries, and includes projected life expectancy levels going out to 2100.

In [15]:
life_expectancy_df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 302 columns):
 #    Column   Non-Null Count  Dtype  
---   ------   --------------  -----  
 0    country  195 non-null    object 
 1    1800     186 non-null    float64
 2    1801     186 non-null    float64
 3    1802     186 non-null    float64
 4    1803     186 non-null    float64
 5    1804     186 non-null    float64
 6    1805     186 non-null    float64
 7    1806     186 non-null    float64
 8    1807     186 non-null    float64
 9    1808     186 non-null    float64
 10   1809     186 non-null    float64
 11   1810     186 non-null    float64
 12   1811     186 non-null    float64
 13   1812     186 non-null    float64
 14   1813     186 non-null    float64
 15   1814     186 non-null    float64
 16   1815     186 non-null    float64
 17   1816     186 non-null    float64
 18   1817     186 non-null    float64
 19   1818     186 non-null    float64
 20   1819     186 non-null    float64
 21   1820     186 non-null    float64
 22   1821     186 non-null    float64
 23   1822     186 non-null    float64
 24   1823     186 non-null    float64
 25   1824     186 non-null    float64
 26   1825     186 non-null    float64
 27   1826     186 non-null    float64
 28   1827     186 non-null    float64
 29   1828     186 non-null    float64
 30   1829     186 non-null    float64
 31   1830     186 non-null    float64
 32   1831     186 non-null    float64
 33   1832     186 non-null    float64
 34   1833     186 non-null    float64
 35   1834     186 non-null    float64
 36   1835     186 non-null    float64
 37   1836     186 non-null    float64
 38   1837     186 non-null    float64
 39   1838     186 non-null    float64
 40   1839     186 non-null    float64
 41   1840     186 non-null    float64
 42   1841     186 non-null    float64
 43   1842     186 non-null    float64
 44   1843     186 non-null    float64
 45   1844     186 non-null    float64
 46   1845     186 non-null    float64
 47   1846     186 non-null    float64
 48   1847     186 non-null    float64
 49   1848     186 non-null    float64
 50   1849     186 non-null    float64
 51   1850     186 non-null    float64
 52   1851     186 non-null    float64
 53   1852     186 non-null    float64
 54   1853     186 non-null    float64
 55   1854     186 non-null    float64
 56   1855     186 non-null    float64
 57   1856     186 non-null    float64
 58   1857     186 non-null    float64
 59   1858     186 non-null    float64
 60   1859     186 non-null    float64
 61   1860     186 non-null    float64
 62   1861     186 non-null    float64
 63   1862     186 non-null    float64
 64   1863     186 non-null    float64
 65   1864     186 non-null    float64
 66   1865     186 non-null    float64
 67   1866     186 non-null    float64
 68   1867     186 non-null    float64
 69   1868     186 non-null    float64
 70   1869     186 non-null    float64
 71   1870     186 non-null    float64
 72   1871     186 non-null    float64
 73   1872     186 non-null    float64
 74   1873     186 non-null    float64
 75   1874     186 non-null    float64
 76   1875     186 non-null    float64
 77   1876     186 non-null    float64
 78   1877     186 non-null    float64
 79   1878     186 non-null    float64
 80   1879     186 non-null    float64
 81   1880     186 non-null    float64
 82   1881     186 non-null    float64
 83   1882     186 non-null    float64
 84   1883     186 non-null    float64
 85   1884     186 non-null    float64
 86   1885     186 non-null    float64
 87   1886     186 non-null    float64
 88   1887     186 non-null    float64
 89   1888     186 non-null    float64
 90   1889     186 non-null    float64
 91   1890     186 non-null    float64
 92   1891     186 non-null    float64
 93   1892     186 non-null    float64
 94   1893     186 non-null    float64
 95   1894     186 non-null    float64
 96   1895     186 non-null    float64
 97   1896     186 non-null    float64
 98   1897     186 non-null    float64
 99   1898     186 non-null    float64
 100  1899     186 non-null    float64
 101  1900     186 non-null    float64
 102  1901     186 non-null    float64
 103  1902     186 non-null    float64
 104  1903     186 non-null    float64
 105  1904     186 non-null    float64
 106  1905     186 non-null    float64
 107  1906     186 non-null    float64
 108  1907     186 non-null    float64
 109  1908     186 non-null    float64
 110  1909     186 non-null    float64
 111  1910     186 non-null    float64
 112  1911     186 non-null    float64
 113  1912     186 non-null    float64
 114  1913     186 non-null    float64
 115  1914     186 non-null    float64
 116  1915     186 non-null    float64
 117  1916     186 non-null    float64
 118  1917     186 non-null    float64
 119  1918     186 non-null    float64
 120  1919     186 non-null    float64
 121  1920     186 non-null    float64
 122  1921     186 non-null    float64
 123  1922     186 non-null    float64
 124  1923     186 non-null    float64
 125  1924     186 non-null    float64
 126  1925     186 non-null    float64
 127  1926     186 non-null    float64
 128  1927     186 non-null    float64
 129  1928     186 non-null    float64
 130  1929     186 non-null    float64
 131  1930     186 non-null    float64
 132  1931     186 non-null    float64
 133  1932     186 non-null    float64
 134  1933     186 non-null    float64
 135  1934     186 non-null    float64
 136  1935     186 non-null    float64
 137  1936     186 non-null    float64
 138  1937     186 non-null    float64
 139  1938     186 non-null    float64
 140  1939     186 non-null    float64
 141  1940     186 non-null    float64
 142  1941     186 non-null    float64
 143  1942     186 non-null    float64
 144  1943     186 non-null    float64
 145  1944     186 non-null    float64
 146  1945     186 non-null    float64
 147  1946     186 non-null    float64
 148  1947     186 non-null    float64
 149  1948     186 non-null    float64
 150  1949     186 non-null    float64
 151  1950     195 non-null    float64
 152  1951     195 non-null    float64
 153  1952     195 non-null    float64
 154  1953     195 non-null    float64
 155  1954     195 non-null    float64
 156  1955     195 non-null    float64
 157  1956     195 non-null    float64
 158  1957     195 non-null    float64
 159  1958     195 non-null    float64
 160  1959     195 non-null    float64
 161  1960     195 non-null    float64
 162  1961     195 non-null    float64
 163  1962     195 non-null    float64
 164  1963     195 non-null    float64
 165  1964     195 non-null    float64
 166  1965     195 non-null    float64
 167  1966     195 non-null    float64
 168  1967     195 non-null    float64
 169  1968     195 non-null    float64
 170  1969     195 non-null    float64
 171  1970     195 non-null    float64
 172  1971     195 non-null    float64
 173  1972     195 non-null    float64
 174  1973     195 non-null    float64
 175  1974     195 non-null    float64
 176  1975     195 non-null    float64
 177  1976     195 non-null    float64
 178  1977     195 non-null    float64
 179  1978     195 non-null    float64
 180  1979     195 non-null    float64
 181  1980     195 non-null    float64
 182  1981     195 non-null    float64
 183  1982     195 non-null    float64
 184  1983     195 non-null    float64
 185  1984     195 non-null    float64
 186  1985     195 non-null    float64
 187  1986     195 non-null    float64
 188  1987     195 non-null    float64
 189  1988     195 non-null    float64
 190  1989     195 non-null    float64
 191  1990     195 non-null    float64
 192  1991     195 non-null    float64
 193  1992     195 non-null    float64
 194  1993     195 non-null    float64
 195  1994     195 non-null    float64
 196  1995     195 non-null    float64
 197  1996     195 non-null    float64
 198  1997     195 non-null    float64
 199  1998     195 non-null    float64
 200  1999     195 non-null    float64
 201  2000     195 non-null    float64
 202  2001     195 non-null    float64
 203  2002     195 non-null    float64
 204  2003     195 non-null    float64
 205  2004     195 non-null    float64
 206  2005     195 non-null    float64
 207  2006     195 non-null    float64
 208  2007     195 non-null    float64
 209  2008     195 non-null    float64
 210  2009     195 non-null    float64
 211  2010     195 non-null    float64
 212  2011     195 non-null    float64
 213  2012     195 non-null    float64
 214  2013     195 non-null    float64
 215  2014     195 non-null    float64
 216  2015     195 non-null    float64
 217  2016     195 non-null    float64
 218  2017     195 non-null    float64
 219  2018     195 non-null    float64
 220  2019     195 non-null    float64
 221  2020     186 non-null    float64
 222  2021     186 non-null    float64
 223  2022     186 non-null    float64
 224  2023     186 non-null    float64
 225  2024     186 non-null    float64
 226  2025     186 non-null    float64
 227  2026     186 non-null    float64
 228  2027     186 non-null    float64
 229  2028     186 non-null    float64
 230  2029     186 non-null    float64
 231  2030     186 non-null    float64
 232  2031     186 non-null    float64
 233  2032     186 non-null    float64
 234  2033     186 non-null    float64
 235  2034     186 non-null    float64
 236  2035     186 non-null    float64
 237  2036     186 non-null    float64
 238  2037     186 non-null    float64
 239  2038     186 non-null    float64
 240  2039     186 non-null    float64
 241  2040     186 non-null    float64
 242  2041     186 non-null    float64
 243  2042     186 non-null    float64
 244  2043     186 non-null    float64
 245  2044     186 non-null    float64
 246  2045     186 non-null    float64
 247  2046     186 non-null    float64
 248  2047     186 non-null    float64
 249  2048     186 non-null    float64
 250  2049     186 non-null    float64
 251  2050     186 non-null    float64
 252  2051     186 non-null    float64
 253  2052     186 non-null    float64
 254  2053     186 non-null    float64
 255  2054     186 non-null    float64
 256  2055     186 non-null    float64
 257  2056     186 non-null    float64
 258  2057     186 non-null    float64
 259  2058     186 non-null    float64
 260  2059     186 non-null    float64
 261  2060     186 non-null    float64
 262  2061     186 non-null    float64
 263  2062     186 non-null    float64
 264  2063     186 non-null    float64
 265  2064     186 non-null    float64
 266  2065     186 non-null    float64
 267  2066     186 non-null    float64
 268  2067     186 non-null    float64
 269  2068     186 non-null    float64
 270  2069     186 non-null    float64
 271  2070     186 non-null    float64
 272  2071     186 non-null    float64
 273  2072     186 non-null    float64
 274  2073     186 non-null    float64
 275  2074     186 non-null    float64
 276  2075     186 non-null    float64
 277  2076     186 non-null    float64
 278  2077     186 non-null    float64
 279  2078     186 non-null    float64
 280  2079     186 non-null    float64
 281  2080     186 non-null    float64
 282  2081     186 non-null    float64
 283  2082     186 non-null    float64
 284  2083     186 non-null    float64
 285  2084     186 non-null    float64
 286  2085     186 non-null    float64
 287  2086     186 non-null    float64
 288  2087     186 non-null    float64
 289  2088     186 non-null    float64
 290  2089     186 non-null    float64
 291  2090     186 non-null    float64
 292  2091     186 non-null    float64
 293  2092     186 non-null    float64
 294  2093     186 non-null    float64
 295  2094     186 non-null    float64
 296  2095     186 non-null    float64
 297  2096     186 non-null    float64
 298  2097     186 non-null    float64
 299  2098     186 non-null    float64
 300  2099     186 non-null    float64
 301  2100     186 non-null    float64
dtypes: float64(301), object(1)
memory usage: 460.2+ KB

A number of countries are missing data for the years before 1950 and after 2019. All the numbers are already formatted as floats.

In [16]:
life_expectancy_df.duplicated().sum()
Out[16]:
0

There are no duplicated rows.

Further Data Processing¶

Fertility¶

The first step is to identify the row(s) that contain missing data.

In [17]:
fertility_df[np.isnan(fertility_df['2014'])].loc[:,'2014':]
Out[17]:
2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 ... 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100
73 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

1 rows × 87 columns

There's only one row that has missing data, so it can be dropped from the dataset.

In [18]:
fertility_df.dropna(inplace=True)
fertility_df.shape
Out[18]:
(201, 302)
In [19]:
fertility_df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 201
Data columns (total 302 columns):
 #    Column   Non-Null Count  Dtype  
---   ------   --------------  -----  
 0    country  201 non-null    object 
 1    1800     201 non-null    float64
 2    1801     201 non-null    float64
 3    1802     201 non-null    float64
 4    1803     201 non-null    float64
 5    1804     201 non-null    float64
 6    1805     201 non-null    float64
 7    1806     201 non-null    float64
 8    1807     201 non-null    float64
 9    1808     201 non-null    float64
 10   1809     201 non-null    float64
 11   1810     201 non-null    float64
 12   1811     201 non-null    float64
 13   1812     201 non-null    float64
 14   1813     201 non-null    float64
 15   1814     201 non-null    float64
 16   1815     201 non-null    float64
 17   1816     201 non-null    float64
 18   1817     201 non-null    float64
 19   1818     201 non-null    float64
 20   1819     201 non-null    float64
 21   1820     201 non-null    float64
 22   1821     201 non-null    float64
 23   1822     201 non-null    float64
 24   1823     201 non-null    float64
 25   1824     201 non-null    float64
 26   1825     201 non-null    float64
 27   1826     201 non-null    float64
 28   1827     201 non-null    float64
 29   1828     201 non-null    float64
 30   1829     201 non-null    float64
 31   1830     201 non-null    float64
 32   1831     201 non-null    float64
 33   1832     201 non-null    float64
 34   1833     201 non-null    float64
 35   1834     201 non-null    float64
 36   1835     201 non-null    float64
 37   1836     201 non-null    float64
 38   1837     201 non-null    float64
 39   1838     201 non-null    float64
 40   1839     201 non-null    float64
 41   1840     201 non-null    float64
 42   1841     201 non-null    float64
 43   1842     201 non-null    float64
 44   1843     201 non-null    float64
 45   1844     201 non-null    float64
 46   1845     201 non-null    float64
 47   1846     201 non-null    float64
 48   1847     201 non-null    float64
 49   1848     201 non-null    float64
 50   1849     201 non-null    float64
 51   1850     201 non-null    float64
 52   1851     201 non-null    float64
 53   1852     201 non-null    float64
 54   1853     201 non-null    float64
 55   1854     201 non-null    float64
 56   1855     201 non-null    float64
 57   1856     201 non-null    float64
 58   1857     201 non-null    float64
 59   1858     201 non-null    float64
 60   1859     201 non-null    float64
 61   1860     201 non-null    float64
 62   1861     201 non-null    float64
 63   1862     201 non-null    float64
 64   1863     201 non-null    float64
 65   1864     201 non-null    float64
 66   1865     201 non-null    float64
 67   1866     201 non-null    float64
 68   1867     201 non-null    float64
 69   1868     201 non-null    float64
 70   1869     201 non-null    float64
 71   1870     201 non-null    float64
 72   1871     201 non-null    float64
 73   1872     201 non-null    float64
 74   1873     201 non-null    float64
 75   1874     201 non-null    float64
 76   1875     201 non-null    float64
 77   1876     201 non-null    float64
 78   1877     201 non-null    float64
 79   1878     201 non-null    float64
 80   1879     201 non-null    float64
 81   1880     201 non-null    float64
 82   1881     201 non-null    float64
 83   1882     201 non-null    float64
 84   1883     201 non-null    float64
 85   1884     201 non-null    float64
 86   1885     201 non-null    float64
 87   1886     201 non-null    float64
 88   1887     201 non-null    float64
 89   1888     201 non-null    float64
 90   1889     201 non-null    float64
 91   1890     201 non-null    float64
 92   1891     201 non-null    float64
 93   1892     201 non-null    float64
 94   1893     201 non-null    float64
 95   1894     201 non-null    float64
 96   1895     201 non-null    float64
 97   1896     201 non-null    float64
 98   1897     201 non-null    float64
 99   1898     201 non-null    float64
 100  1899     201 non-null    float64
 101  1900     201 non-null    float64
 102  1901     201 non-null    float64
 103  1902     201 non-null    float64
 104  1903     201 non-null    float64
 105  1904     201 non-null    float64
 106  1905     201 non-null    float64
 107  1906     201 non-null    float64
 108  1907     201 non-null    float64
 109  1908     201 non-null    float64
 110  1909     201 non-null    float64
 111  1910     201 non-null    float64
 112  1911     201 non-null    float64
 113  1912     201 non-null    float64
 114  1913     201 non-null    float64
 115  1914     201 non-null    float64
 116  1915     201 non-null    float64
 117  1916     201 non-null    float64
 118  1917     201 non-null    float64
 119  1918     201 non-null    float64
 120  1919     201 non-null    float64
 121  1920     201 non-null    float64
 122  1921     201 non-null    float64
 123  1922     201 non-null    float64
 124  1923     201 non-null    float64
 125  1924     201 non-null    float64
 126  1925     201 non-null    float64
 127  1926     201 non-null    float64
 128  1927     201 non-null    float64
 129  1928     201 non-null    float64
 130  1929     201 non-null    float64
 131  1930     201 non-null    float64
 132  1931     201 non-null    float64
 133  1932     201 non-null    float64
 134  1933     201 non-null    float64
 135  1934     201 non-null    float64
 136  1935     201 non-null    float64
 137  1936     201 non-null    float64
 138  1937     201 non-null    float64
 139  1938     201 non-null    float64
 140  1939     201 non-null    float64
 141  1940     201 non-null    float64
 142  1941     201 non-null    float64
 143  1942     201 non-null    float64
 144  1943     201 non-null    float64
 145  1944     201 non-null    float64
 146  1945     201 non-null    float64
 147  1946     201 non-null    float64
 148  1947     201 non-null    float64
 149  1948     201 non-null    float64
 150  1949     201 non-null    float64
 151  1950     201 non-null    float64
 152  1951     201 non-null    float64
 153  1952     201 non-null    float64
 154  1953     201 non-null    float64
 155  1954     201 non-null    float64
 156  1955     201 non-null    float64
 157  1956     201 non-null    float64
 158  1957     201 non-null    float64
 159  1958     201 non-null    float64
 160  1959     201 non-null    float64
 161  1960     201 non-null    float64
 162  1961     201 non-null    float64
 163  1962     201 non-null    float64
 164  1963     201 non-null    float64
 165  1964     201 non-null    float64
 166  1965     201 non-null    float64
 167  1966     201 non-null    float64
 168  1967     201 non-null    float64
 169  1968     201 non-null    float64
 170  1969     201 non-null    float64
 171  1970     201 non-null    float64
 172  1971     201 non-null    float64
 173  1972     201 non-null    float64
 174  1973     201 non-null    float64
 175  1974     201 non-null    float64
 176  1975     201 non-null    float64
 177  1976     201 non-null    float64
 178  1977     201 non-null    float64
 179  1978     201 non-null    float64
 180  1979     201 non-null    float64
 181  1980     201 non-null    float64
 182  1981     201 non-null    float64
 183  1982     201 non-null    float64
 184  1983     201 non-null    float64
 185  1984     201 non-null    float64
 186  1985     201 non-null    float64
 187  1986     201 non-null    float64
 188  1987     201 non-null    float64
 189  1988     201 non-null    float64
 190  1989     201 non-null    float64
 191  1990     201 non-null    float64
 192  1991     201 non-null    float64
 193  1992     201 non-null    float64
 194  1993     201 non-null    float64
 195  1994     201 non-null    float64
 196  1995     201 non-null    float64
 197  1996     201 non-null    float64
 198  1997     201 non-null    float64
 199  1998     201 non-null    float64
 200  1999     201 non-null    float64
 201  2000     201 non-null    float64
 202  2001     201 non-null    float64
 203  2002     201 non-null    float64
 204  2003     201 non-null    float64
 205  2004     201 non-null    float64
 206  2005     201 non-null    float64
 207  2006     201 non-null    float64
 208  2007     201 non-null    float64
 209  2008     201 non-null    float64
 210  2009     201 non-null    float64
 211  2010     201 non-null    float64
 212  2011     201 non-null    float64
 213  2012     201 non-null    float64
 214  2013     201 non-null    float64
 215  2014     201 non-null    float64
 216  2015     201 non-null    float64
 217  2016     201 non-null    float64
 218  2017     201 non-null    float64
 219  2018     201 non-null    float64
 220  2019     201 non-null    float64
 221  2020     201 non-null    float64
 222  2021     201 non-null    float64
 223  2022     201 non-null    float64
 224  2023     201 non-null    float64
 225  2024     201 non-null    float64
 226  2025     201 non-null    float64
 227  2026     201 non-null    float64
 228  2027     201 non-null    float64
 229  2028     201 non-null    float64
 230  2029     201 non-null    float64
 231  2030     201 non-null    float64
 232  2031     201 non-null    float64
 233  2032     201 non-null    float64
 234  2033     201 non-null    float64
 235  2034     201 non-null    float64
 236  2035     201 non-null    float64
 237  2036     201 non-null    float64
 238  2037     201 non-null    float64
 239  2038     201 non-null    float64
 240  2039     201 non-null    float64
 241  2040     201 non-null    float64
 242  2041     201 non-null    float64
 243  2042     201 non-null    float64
 244  2043     201 non-null    float64
 245  2044     201 non-null    float64
 246  2045     201 non-null    float64
 247  2046     201 non-null    float64
 248  2047     201 non-null    float64
 249  2048     201 non-null    float64
 250  2049     201 non-null    float64
 251  2050     201 non-null    float64
 252  2051     201 non-null    float64
 253  2052     201 non-null    float64
 254  2053     201 non-null    float64
 255  2054     201 non-null    float64
 256  2055     201 non-null    float64
 257  2056     201 non-null    float64
 258  2057     201 non-null    float64
 259  2058     201 non-null    float64
 260  2059     201 non-null    float64
 261  2060     201 non-null    float64
 262  2061     201 non-null    float64
 263  2062     201 non-null    float64
 264  2063     201 non-null    float64
 265  2064     201 non-null    float64
 266  2065     201 non-null    float64
 267  2066     201 non-null    float64
 268  2067     201 non-null    float64
 269  2068     201 non-null    float64
 270  2069     201 non-null    float64
 271  2070     201 non-null    float64
 272  2071     201 non-null    float64
 273  2072     201 non-null    float64
 274  2073     201 non-null    float64
 275  2074     201 non-null    float64
 276  2075     201 non-null    float64
 277  2076     201 non-null    float64
 278  2077     201 non-null    float64
 279  2078     201 non-null    float64
 280  2079     201 non-null    float64
 281  2080     201 non-null    float64
 282  2081     201 non-null    float64
 283  2082     201 non-null    float64
 284  2083     201 non-null    float64
 285  2084     201 non-null    float64
 286  2085     201 non-null    float64
 287  2086     201 non-null    float64
 288  2087     201 non-null    float64
 289  2088     201 non-null    float64
 290  2089     201 non-null    float64
 291  2090     201 non-null    float64
 292  2091     201 non-null    float64
 293  2092     201 non-null    float64
 294  2093     201 non-null    float64
 295  2094     201 non-null    float64
 296  2095     201 non-null    float64
 297  2096     201 non-null    float64
 298  2097     201 non-null    float64
 299  2098     201 non-null    float64
 300  2099     201 non-null    float64
 301  2100     201 non-null    float64
dtypes: float64(301), object(1)
memory usage: 475.8+ KB

Now there's data for all 201 countries remaining. This state of the dataset will be sufficient to answer the questions for fertility rate.

In [20]:
def countries_to_lowercase(country_name: str) -> str:
    return country_name.strip().lower().replace(' ', '_').replace('-', '_')
In [21]:
fertility_df.country = fertility_df.country.apply(countries_to_lowercase)
fertility_df.head()
Out[21]:
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100
0 aruba 5.64 5.64 5.64 5.64 5.64 5.64 5.64 5.64 5.64 ... 1.82 1.82 1.82 1.82 1.82 1.82 1.82 1.82 1.83 1.83
1 afghanistan 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 7.00 ... 1.74 1.74 1.74 1.74 1.74 1.74 1.74 1.74 1.74 1.74
2 angola 6.93 6.93 6.93 6.93 6.93 6.93 6.93 6.94 6.94 ... 2.54 2.52 2.50 2.48 2.47 2.45 2.43 2.42 2.40 2.40
3 albania 4.60 4.60 4.60 4.60 4.60 4.60 4.60 4.60 4.60 ... 1.78 1.78 1.78 1.79 1.79 1.79 1.79 1.79 1.79 1.79
4 netherlands_antilles 5.80 5.80 5.80 5.80 5.80 5.80 5.80 5.80 5.80 ... 2.00 2.00 2.01 2.01 2.01 2.01 2.01 2.02 2.02 2.02

5 rows × 302 columns

The country names have been converted to lowercase, and the spaces and dashes in them have been converted to underscores to enable easier manipulation with the code.

Income Per Capita¶

First of all, the differences in number format need to be rectified by converting all the strings to integers.

In [22]:
from typing import Union

def has_million(value: Union[str, int]) -> bool:
    return value.endswith('m') if isinstance(value, str) else False

def has_thousands(value: Union[str, int]) -> bool:
    return value.endswith('k') if isinstance(value, str) else False
In [23]:
print('Number of cells containing millions formatted with "m" suffix:', sum([np.fromiter((val for val in income_per_person_df[col].unique() if has_million(val)), dtype='S128').size for col in income_per_person_df.columns[1:]]))
Number of cells containing millions formatted with "m" suffix: 0
In [24]:
print('Number of cells containing thousands formatted with "k" suffix:', sum([np.fromiter((val for val in income_per_person_df[col].unique() if has_thousands(val)), dtype='S128').size for col in income_per_person_df.columns[1:]]))
Number of cells containing thousands formatted with "k" suffix: 8087

So only values containing thousands need to be reformatted.

In [25]:
income_per_person_df.iloc[:,1:].dtypes.unique()
Out[25]:
array([dtype('int64'), dtype('O')], dtype=object)

Those are the unique datatypes before reformatting, where dtype('O') represents strings and dtype('int64') represents integers.

In [26]:
for col in income_per_person_df.columns[1:]:
    if income_per_person_df[col].dtype != int:
        income_per_person_df[col] = income_per_person_df[col].apply(lambda x: int(float(x[:-1]) * 1000) if has_thousands(x) else int(x))
        
income_per_person_df.iloc[:, 1:].dtypes.unique()
Out[26]:
array([dtype('int64')], dtype=object)

The loop above iterates over all the year columns and converts the values in the column to integers if they're currently strings. The final output after the loop shows that integer is the only unique data type now.

Next, the sub-regions need to be added to this dataset. Currently only countries are specified, so both the sub-region and country code will be added.

In [27]:
income_per_person_df.country = income_per_person_df.country.apply(countries_to_lowercase)

The country names have been converted to lowercase, and the spaces and dashes in them have been converted to underscores to enable easier manipulation with the code.

In [28]:
country_df = read_data('unsd-methodology.csv', sep=';')
country_df.head()
Out[28]:
Global Code Global Name Region Code Region Name Sub-region Code Sub-region Name Intermediate Region Code Intermediate Region Name Country or Area M49 Code ISO-alpha2 Code ISO-alpha3 Code Least Developed Countries (LDC) Land Locked Developing Countries (LLDC) Small Island Developing States (SIDS)
0 1 World NaN NaN NaN NaN NaN NaN Antarctica 10 AQ ATA NaN NaN NaN
1 1 World 2.0 Africa 15.0 Northern Africa NaN NaN Algeria 12 DZ DZA NaN NaN NaN
2 1 World 2.0 Africa 15.0 Northern Africa NaN NaN Egypt 818 EG EGY NaN NaN NaN
3 1 World 2.0 Africa 15.0 Northern Africa NaN NaN Libya 434 LY LBY NaN NaN NaN
4 1 World 2.0 Africa 15.0 Northern Africa NaN NaN Morocco 504 MA MAR NaN NaN NaN

To get the country sub-regions, a new dataset containing the country name, sub-region, and country code has been loaded.

In [29]:
country_df.rename(columns=lambda x: x.strip().lower().replace(' ', '_').replace('-', '_'), inplace=True)
country_df = country_df.loc[:,['sub_region_name', 'country_or_area', 'iso_alpha3_code']]
country_df.country_or_area = country_df.country_or_area.apply(countries_to_lowercase)
country_df.head()
Out[29]:
sub_region_name country_or_area iso_alpha3_code
0 NaN antarctica ATA
1 Northern Africa algeria DZA
2 Northern Africa egypt EGY
3 Northern Africa libya LBY
4 Northern Africa morocco MAR

The code above does the following in order:

  • changes the column names in the country dataframe to lowercase
  • removes all columns except the sub-region, name and country code
  • converts the country names to lowercase and converts the dashes and spaces to underscores

These steps leave data that can be more easily merged with the income dataframe to add the needed sub-regions.

Before performing the merge, a check is needed to confirm which of the country names can be merged smoothly.

In [30]:
non_common_countries = (set(income_per_person_df.country) | set(country_df.country_or_area)) - (set(income_per_person_df.country) & set(country_df.country_or_area))
non_common_countries
Out[30]:
{'american_samoa',
 'anguilla',
 'antarctica',
 'aruba',
 'bermuda',
 'bolivia',
 'bolivia_(plurinational_state_of)',
 'bonaire,_sint_eustatius_and_saba',
 'bouvet_island',
 'british_indian_ocean_territory',
 'british_virgin_islands',
 'brunei',
 'brunei_darussalam',
 'cabo_verde',
 'cape_verde',
 'cayman_islands',
 'china,_hong_kong_special_administrative_region',
 'china,_macao_special_administrative_region',
 'christmas_island',
 'cocos_(keeling)_islands',
 'congo',
 'congo,_dem._rep.',
 'congo,_rep.',
 'cook_islands',
 "cote_d'ivoire",
 'curaçao',
 'czech_republic',
 'czechia',
 'côte_d’ivoire',
 "democratic_people's_republic_of_korea",
 'democratic_republic_of_the_congo',
 'falkland_islands_(malvinas)',
 'faroe_islands',
 'french_guiana',
 'french_polynesia',
 'french_southern_territories',
 'gibraltar',
 'greenland',
 'guadeloupe',
 'guam',
 'guernsey',
 'heard_island_and_mcdonald_islands',
 'holy_see',
 'hong_kong,_china',
 'iran',
 'iran_(islamic_republic_of)',
 'isle_of_man',
 'jersey',
 'kyrgyz_republic',
 'kyrgyzstan',
 'lao',
 "lao_people's_democratic_republic",
 'liechtenstein',
 'martinique',
 'mayotte',
 'micronesia,_fed._sts.',
 'micronesia_(federated_states_of)',
 'moldova',
 'montserrat',
 'new_caledonia',
 'niue',
 'norfolk_island',
 'north_korea',
 'northern_mariana_islands',
 'palestine',
 'pitcairn',
 'puerto_rico',
 'republic_of_korea',
 'republic_of_moldova',
 'russia',
 'russian_federation',
 'réunion',
 'saint_barthélemy',
 'saint_helena',
 'saint_kitts_and_nevis',
 'saint_lucia',
 'saint_martin_(french_part)',
 'saint_pierre_and_miquelon',
 'saint_vincent_and_the_grenadines',
 'sark',
 'sint_maarten_(dutch_part)',
 'slovak_republic',
 'slovakia',
 'south_georgia_and_the_south_sandwich_islands',
 'south_korea',
 'st._kitts_and_nevis',
 'st._lucia',
 'st._vincent_and_the_grenadines',
 'state_of_palestine',
 'svalbard_and_jan_mayen_islands',
 'syria',
 'syrian_arab_republic',
 'taiwan',
 'tanzania',
 'tokelau',
 'turkey',
 'turks_and_caicos_islands',
 'türkiye',
 'united_kingdom',
 'united_kingdom_of_great_britain_and_northern_ireland',
 'united_republic_of_tanzania',
 'united_states',
 'united_states_minor_outlying_islands',
 'united_states_of_america',
 'united_states_virgin_islands',
 'venezuela',
 'venezuela_(bolivarian_republic_of)',
 'viet_nam',
 'vietnam',
 'wallis_and_futuna_islands',
 'western_sahara',
 'åland_islands'}
In [31]:
non_common_from_income = []
non_common_from_country = []
for country_name in non_common_countries:
    if income_per_person_df.query(f'country == "{country_name}"').size > 0:
        non_common_from_income.append(country_name)
    else:
        non_common_from_country.append(country_name)
        
print('Non common names from the country dataframe:\n', non_common_from_country, '\n\n')
print('Non common names from the income dataframe:\n', non_common_from_income)
Non common names from the country dataframe:
 ['saint_helena', 'aruba', 'french_guiana', 'state_of_palestine', 'czechia', 'montserrat', 'holy_see', 'united_states_minor_outlying_islands', 'british_indian_ocean_territory', 'mayotte', 'congo', 'bolivia_(plurinational_state_of)', 'democratic_republic_of_the_congo', 'russian_federation', 'cocos_(keeling)_islands', 'saint_lucia', 'isle_of_man', 'cabo_verde', 'south_georgia_and_the_south_sandwich_islands', 'viet_nam', 'micronesia_(federated_states_of)', 'western_sahara', "democratic_people's_republic_of_korea", 'united_republic_of_tanzania', 'wallis_and_futuna_islands', 'saint_martin_(french_part)', 'cook_islands', 'puerto_rico', 'saint_vincent_and_the_grenadines', 'réunion', 'american_samoa', 'sint_maarten_(dutch_part)', 'guernsey', 'saint_barthélemy', 'china,_macao_special_administrative_region', 'slovakia', 'brunei_darussalam', 'antarctica', 'tokelau', 'republic_of_korea', 'french_southern_territories', 'gibraltar', 'kyrgyzstan', 'new_caledonia', 'iran_(islamic_republic_of)', 'bermuda', 'falkland_islands_(malvinas)', 'saint_pierre_and_miquelon', 'united_kingdom_of_great_britain_and_northern_ireland', 'faroe_islands', 'svalbard_and_jan_mayen_islands', "lao_people's_democratic_republic", 'pitcairn', 'liechtenstein', 'british_virgin_islands', 'saint_kitts_and_nevis', 'christmas_island', 'republic_of_moldova', 'guadeloupe', 'greenland', 'bonaire,_sint_eustatius_and_saba', 'martinique', 'venezuela_(bolivarian_republic_of)', 'bouvet_island', 'united_states_of_america', 'united_states_virgin_islands', 'curaçao', 'syrian_arab_republic', 'åland_islands', 'norfolk_island', 'heard_island_and_mcdonald_islands', 'china,_hong_kong_special_administrative_region', 'türkiye', 'guam', 'cayman_islands', 'northern_mariana_islands', 'anguilla', 'turks_and_caicos_islands', 'jersey', 'niue', 'sark', 'côte_d’ivoire', 'french_polynesia'] 


Non common names from the income dataframe:
 ['palestine', 'st._vincent_and_the_grenadines', 'czech_republic', 'venezuela', 'turkey', 'moldova', 'vietnam', "cote_d'ivoire", 'kyrgyz_republic', 'st._kitts_and_nevis', 'slovak_republic', 'lao', 'russia', 'south_korea', 'st._lucia', 'taiwan', 'congo,_rep.', 'bolivia', 'cape_verde', 'iran', 'tanzania', 'syria', 'united_kingdom', 'congo,_dem._rep.', 'north_korea', 'hong_kong,_china', 'united_states', 'brunei', 'micronesia,_fed._sts.']

The above are the country names that are present in only one of the dataframes. After comparing both lists to identify slight spelling differences, the following map of the matching names was created:

In [32]:
different_spellings = {
    'bolivia': 'bolivia_(plurinational_state_of)',
    'brunei': 'brunei_darussalam',
    'cape_verde': 'cabo_verde',
    "cote_d'ivoire": "côte_d’ivoire",
    'hong_kong,_china': 'china,_hong_kong_special_administrative_region',
    'congo,_rep.': 'congo',
    'congo,_dem._rep.': 'democratic_republic_of_the_congo',
    'czech_republic': 'czechia',
    'south_korea': 'republic_of_korea',
    'north_korea': "democratic_people's_republic_of_korea",
    'iran': 'iran_(islamic_republic_of)',
    'kyrgyz_republic': 'kyrgyzstan',
    'lao': "lao_people's_democratic_republic",
    'micronesia,_fed._sts.': 'micronesia_(federated_states_of)',
    'moldova': 'republic_of_moldova',
    'russia': 'russian_federation',
    'st._lucia': 'saint_lucia',
    'st._kitts_and_nevis': 'saint_kitts_and_nevis',
    'st._vincent_and_the_grenadines': 'saint_vincent_and_the_grenadines',
    'syria': 'syrian_arab_republic',
    'slovak_republic': 'slovakia',
    'palestine': 'state_of_palestine',
    'turkey': 'türkiye',
    'united_kingdom': 'united_kingdom_of_great_britain_and_northern_ireland',
    'tanzania': 'united_republic_of_tanzania',
    'united_states': 'united_states_of_america',
    'vietnam': 'viet_nam',
    'venezuela': 'venezuela_(bolivarian_republic_of)',
}

Now the names need to be updated:

In [33]:
# reverse the keys and values to fit what we actually need
different_spellings = {val: key for key, val in different_spellings.items()}

# change the names in the countries df
country_df.country_or_area = country_df.country_or_area.apply(lambda x: different_spellings[x] if x in different_spellings else x)

Now the merge can proceed:

In [34]:
merged_income_country = pd.merge(left=income_per_person_df, right=country_df, left_on='country', right_on='country_or_area', how='outer')
merged_income_country.head()
Out[34]:
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2044 2045 2046 2047 2048 2049 2050 sub_region_name country_or_area iso_alpha3_code
0 afghanistan 683.0 683.0 683.0 683.0 683.0 683.0 683.0 683.0 683.0 ... 2870.0 2930.0 2990.0 3060.0 3120.0 3190.0 3260.0 Southern Asia afghanistan AFG
1 angola 700.0 702.0 705.0 709.0 711.0 714.0 718.0 721.0 725.0 ... 8530.0 8710.0 8900.0 9090.0 9280.0 9480.0 9690.0 Sub-Saharan Africa angola AGO
2 albania 755.0 755.0 755.0 755.0 755.0 756.0 756.0 756.0 756.0 ... 26700.0 27300.0 27900.0 28500.0 29100.0 29700.0 30400.0 Southern Europe albania ALB
3 andorra 1360.0 1360.0 1360.0 1360.0 1370.0 1370.0 1370.0 1370.0 1380.0 ... 73400.0 75000.0 76600.0 78300.0 80000.0 81700.0 83400.0 Southern Europe andorra AND
4 united_arab_emirates 1130.0 1130.0 1140.0 1140.0 1150.0 1150.0 1160.0 1160.0 1160.0 ... 107000.0 110000.0 112000.0 114000.0 117000.0 119000.0 122000.0 Western Asia united_arab_emirates ARE

5 rows × 255 columns

The merge is an outer merge to preserve the rows that don't merge smoothly.

It appears the country and country_or_area fields merged as expected.

In [35]:
missing_country_or_area = merged_income_country[merged_income_country.country_or_area.isna()].loc[:, ['country', 'country_or_area']]
missing_country_or_area.shape
Out[35]:
(1, 2)

However it looks like one country wasn't merged successfully because it's only present in the income dataframe.

In [36]:
missing_country = merged_income_country[merged_income_country.country.isna()].loc[:, ['country', 'country_or_area']].sort_values(by='country_or_area')
missing_country.shape
Out[36]:
(55, 2)

Similarly, 55 other countries were also not merged successfully because they're only present in the country dataframe.

In [37]:
rows_with_incomplete_country_info = pd.concat((missing_country, missing_country_or_area))
merged_income_country.drop(index=rows_with_incomplete_country_info.index, inplace=True)
print('Number of rows with missing "country" field:', merged_income_country.country.isna().sum())
print('Number of rows with missing "country_or_area" field:', merged_income_country.country_or_area.isna().sum())
Number of rows with missing "country" field: 0
Number of rows with missing "country_or_area" field: 0

The code above generates a list of the row numbers of countries that were not merged successfully, and then drops them from the dataframe.

Now the country_or_area column can be dropped because it's effectively now a duplicate of the country column.

In [38]:
merged_income_country.drop(columns='country_or_area', inplace=True)

Now the sub-regions can be reformatted to lowercase:

In [39]:
merged_income_country.sub_region_name = merged_income_country.sub_region_name.apply(countries_to_lowercase)
merged_income_country.head()
Out[39]:
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2043 2044 2045 2046 2047 2048 2049 2050 sub_region_name iso_alpha3_code
0 afghanistan 683.0 683.0 683.0 683.0 683.0 683.0 683.0 683.0 683.0 ... 2810.0 2870.0 2930.0 2990.0 3060.0 3120.0 3190.0 3260.0 southern_asia AFG
1 angola 700.0 702.0 705.0 709.0 711.0 714.0 718.0 721.0 725.0 ... 8350.0 8530.0 8710.0 8900.0 9090.0 9280.0 9480.0 9690.0 sub_saharan_africa AGO
2 albania 755.0 755.0 755.0 755.0 755.0 756.0 756.0 756.0 756.0 ... 26200.0 26700.0 27300.0 27900.0 28500.0 29100.0 29700.0 30400.0 southern_europe ALB
3 andorra 1360.0 1360.0 1360.0 1360.0 1370.0 1370.0 1370.0 1370.0 1380.0 ... 71900.0 73400.0 75000.0 76600.0 78300.0 80000.0 81700.0 83400.0 southern_europe AND
4 united_arab_emirates 1130.0 1130.0 1140.0 1140.0 1150.0 1150.0 1160.0 1160.0 1160.0 ... 105000.0 107000.0 110000.0 112000.0 114000.0 117000.0 119000.0 122000.0 western_asia ARE

5 rows × 254 columns

Life Expectancy¶

First, the country names need to be renamed as done for the other dataframes.

In [40]:
life_expectancy_df.country = life_expectancy_df.country.apply(countries_to_lowercase)

Now the null values can be removed.

In [41]:
life_expectancy_df.dropna(inplace=True)
life_expectancy_df.shape
Out[41]:
(186, 302)

Next, the life expectancy dataframe also needs the sub-region column.

In [42]:
# first create a dataframe that only contains the country fields
country_name_region = merged_income_country.loc[:, ['country', 'sub_region_name', 'iso_alpha3_code']].rename(columns={'country':'tmp_country_name'})

# then merge it with the life expectancy dataframe
merged_life_exp_country_df = pd.merge(left=life_expectancy_df, right=country_name_region, left_on='country', right_on='tmp_country_name', how='inner')
merged_life_exp_country_df.head()
Out[42]:
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2094 2095 2096 2097 2098 2099 2100 tmp_country_name sub_region_name iso_alpha3_code
0 afghanistan 28.2 28.2 28.2 28.2 28.2 28.2 28.1 28.1 28.1 ... 76.0 76.1 76.2 76.4 76.5 76.6 76.8 afghanistan southern_asia AFG
1 angola 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 ... 79.2 79.3 79.5 79.6 79.7 79.9 80.0 angola sub_saharan_africa AGO
2 albania 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 ... 87.7 87.8 87.9 88.0 88.2 88.3 88.4 albania southern_europe ALB
3 united_arab_emirates 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 ... 82.7 82.8 82.9 83.0 83.1 83.2 83.3 united_arab_emirates western_asia ARE
4 argentina 33.2 33.2 33.2 33.2 33.2 33.2 33.2 33.2 33.2 ... 86.5 86.7 86.8 86.9 87.0 87.1 87.2 argentina latin_america_and_the_caribbean ARG

5 rows × 305 columns

The code above first renames the country field in the temporary dataframe to avoid conflicts with the life expectancy dataframe, then an inner merge is performed to make sure that only the rows in the life expectancy dataframe are preserved.

In [43]:
merged_life_exp_country_df.drop(columns='tmp_country_name', inplace=True)

The tmp_country_name column has been removed.

In [44]:
merged_life_exp_country_df.head()
Out[44]:
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2093 2094 2095 2096 2097 2098 2099 2100 sub_region_name iso_alpha3_code
0 afghanistan 28.2 28.2 28.2 28.2 28.2 28.2 28.1 28.1 28.1 ... 75.8 76.0 76.1 76.2 76.4 76.5 76.6 76.8 southern_asia AFG
1 angola 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 ... 79.1 79.2 79.3 79.5 79.6 79.7 79.9 80.0 sub_saharan_africa AGO
2 albania 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 ... 87.6 87.7 87.8 87.9 88.0 88.2 88.3 88.4 southern_europe ALB
3 united_arab_emirates 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 ... 82.6 82.7 82.8 82.9 83.0 83.1 83.2 83.3 western_asia ARE
4 argentina 33.2 33.2 33.2 33.2 33.2 33.2 33.2 33.2 33.2 ... 86.5 86.5 86.7 86.8 86.9 87.0 87.1 87.2 latin_america_and_the_caribbean ARG

5 rows × 304 columns

In [45]:
merged_life_exp_country_df.shape
Out[45]:
(185, 304)

This shows that one country was not successfully merged because the previous row count was 186.

In [46]:
merged_life_exp_country_df.isna().sum().sum()
Out[46]:
0

This confirms that there's no null value left in this merged dataframe.

To answer the final question (looking for a relationship between life expectancy and income per capita), the data of the corresponding dataframes need to be merged into 1 dataframe.

In [47]:
alt_merged_income_country_df = merged_income_country.rename(columns=lambda x: f'income_{x}')
alt_merged_income_country_df.columns
Out[47]:
Index(['income_country', 'income_1800', 'income_1801', 'income_1802',
       'income_1803', 'income_1804', 'income_1805', 'income_1806',
       'income_1807', 'income_1808',
       ...
       'income_2043', 'income_2044', 'income_2045', 'income_2046',
       'income_2047', 'income_2048', 'income_2049', 'income_2050',
       'income_sub_region_name', 'income_iso_alpha3_code'],
      dtype='object', length=254)

The code above creates a new dataframe where the column names from the income dataframe all have income_ prepended to them, to distinguish them from those of the life expectancy dataframe.

In [48]:
merged_income_life_exp_df = pd.merge(left=merged_life_exp_country_df, right=alt_merged_income_country_df, left_on='country', right_on='income_country', how='inner')
merged_income_life_exp_df.head()
Out[48]:
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... income_2043 income_2044 income_2045 income_2046 income_2047 income_2048 income_2049 income_2050 income_sub_region_name income_iso_alpha3_code
0 afghanistan 28.2 28.2 28.2 28.2 28.2 28.2 28.1 28.1 28.1 ... 2810.0 2870.0 2930.0 2990.0 3060.0 3120.0 3190.0 3260.0 southern_asia AFG
1 angola 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 ... 8350.0 8530.0 8710.0 8900.0 9090.0 9280.0 9480.0 9690.0 sub_saharan_africa AGO
2 albania 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 ... 26200.0 26700.0 27300.0 27900.0 28500.0 29100.0 29700.0 30400.0 southern_europe ALB
3 united_arab_emirates 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 ... 105000.0 107000.0 110000.0 112000.0 114000.0 117000.0 119000.0 122000.0 western_asia ARE
4 argentina 33.2 33.2 33.2 33.2 33.2 33.2 33.2 33.2 33.2 ... 31800.0 32500.0 33200.0 33900.0 34700.0 35400.0 36200.0 36900.0 latin_america_and_the_caribbean ARG

5 rows × 558 columns

Next, the 2 dataframes are merged to create a new one where the countries are shared and the column names indicate which cells represent income data. The merge is an inner merge to make sure that only rows of matching countries are preserved.

In [49]:
merged_income_life_exp_df.shape
Out[49]:
(185, 558)

The number of rows did not change, so this means all rows were successfully merged.

Exploratory Data Analysis¶

Data Distribution¶

In [50]:
current_year = '2022'
last_20_years = np.arange(int(current_year) - 19, int(current_year) + 1)

Fertility¶

In [51]:
fertility_df.loc[:,str(int(last_20_years[0])):current_year].boxplot()
plt.title('Distribution of fertility rates for the last 20 years')
plt.xlabel('Year')
plt.ylabel('Fertility (babies per woman)')
plt.xticks(rotation='vertical');

The chart above shows some key details on the distribution of fertility rates over the last twenty years. The green bar in the middle of each box represents the median fertility rate for that year, which appears to have been slowly reducing over the last 20 years. The bars connected to lines above and below the each box represent the range that should contain the majority of the data, so the values outside that range (the empty circles) represent outliers. There was at least one outlier in each of the last 20 years. Interestingly, the range appears to be contracting during this period, with the lowest birthrate countries increasing their fertility rates and vice versa for the highest birthrate countries. The data is skewed right throughout the period, which indicates that the majority of countries are closer to the lower end of the range than to the higher end. The charts below can be used to better visualize the skew:

In [52]:
for year in last_20_years:
    values = fertility_df[str(year)]
    values.hist(density=True, bins=math.floor(values.max()), figsize=(5,5))
    plt.xlabel('Birth rate (babies per woman)')
    plt.ylabel('Ratio of total countries with birth rate')
    plt.title(f'Distribution of national birthrates for {year}')
    plt.show();

Each bar is intended to roughly approximate the increment of each additional baby to the fertility rate. The charts show that the 2 leftmost bars (corresponding to the lowest recorded birth rates) have consistently contained the birthrates for the highest percentage of countries.

Per Capita Income¶

In [53]:
period_size = 50
period_buckets = [
    (str(int(current_year) - period_size * 1), current_year),
    (str(int(current_year) - period_size * 2), str(int(current_year) - period_size * 1)),
    (str(int(current_year) - period_size * 3), str(int(current_year) - period_size * 2)),
    (str(int(current_year) - period_size * 4), str(int(current_year) - period_size * 3)),
]
period_buckets
Out[53]:
[('1972', '2022'), ('1922', '1972'), ('1872', '1922'), ('1822', '1872')]

The code above creates 4 buckets with a size of 50 years each, starting from the current year and going back 200 years.

In [54]:
for period_start, period_end in period_buckets:
    period_start = str(int(period_start) + 1)
    merged_income_country.groupby('sub_region_name').mean(numeric_only=True).loc[:,period_start:period_end].boxplot(figsize=(10,10))
    plt.ylabel('Per Capita Income (inflation-adjusted USD)')
    plt.xlabel('Year')
    plt.title(f'Distribution of Per Capita Income from {period_start} to {period_end}')
    plt.xticks(rotation='vertical')
    plt.show();

The charts above show the distribution of per capita income over the last 200 years, in 50 year buckets. One striking observation is that the number of outliers (the empty circles) has reduced as the years go by. That would seem to suggest that the world's wealth has become more evenly distributed over this period. Another interesting observation is that the range that should contain most values (the top and bottom dashes above and below each box) has grown noticeably beginning from about 1950. That suggests a widening gap between the richest and poorest countries, even though overall rates have been on the rise throughout that period.

Life Expectancy¶

In [55]:
for period_start, period_end in period_buckets:
    period_start = str(int(period_start) + 1)
    merged_income_life_exp_df.loc[:,period_start:period_end].boxplot(figsize=(10,10))
    plt.ylabel('Life expectancy(years)')
    plt.xlabel('Year')
    plt.title(f'Distribution of Life Expectancy from {period_start} to {period_end}')
    plt.xticks(rotation='vertical')
    plt.show();

The charts above show the distribution of life expectancy over the past 200 years, in 50 year buckets. The first striking observation is the relative flatness of life expectancy over the first 100 years of the period, followed by a sizeable rise (more than a doubling for the highest value!) in the second 100 years of the period. The second striking observation is that the number of outliers also drops noticeably in the second 100 years, suggesting that in recent years the majority of countries have benefitted from the driver's of higher life expectancy.

Q1: Which countries currently have the highest and lowest fertility rates?¶

Highest Fertility Rates¶

In [56]:
current_fertility = fertility_df[current_year]
current_fertility = current_fertility.set_axis(fertility_df.country, copy=False)
current_fertility = current_fertility.sort_values(ascending=False)
current_fertility.head()
Out[56]:
country
niger               6.86
somalia             5.71
congo,_dem._rep.    5.52
mali                5.51
chad                5.36
Name: 2022, dtype: float64
In [57]:
current_fertility_highest = current_fertility[:10]

plt.bar(x=np.arange(10), height=current_fertility_highest)
plt.xticks(np.arange(10), current_fertility_highest.keys(), rotation='vertical')
for i, val in enumerate(current_fertility_highest):
    plt.text(i, val * 1.05, f'{val:.1f}', horizontalalignment='center', rotation='vertical')
plt.title(f'10 highest fertility rates in the world by country for the year {current_year}')
plt.ylim(0, current_fertility_highest.max() * 1.2)
plt.xlabel('Country')
plt.ylabel('Fertility (babies per woman)')
plt.show();

The chart above indicates high population growth rates in some countries, with 5 babies per woman being the smallest rate of the top 10 countries.

Lowest Fertility Rates¶

In [58]:
current_fertility_lowest = current_fertility[-10:]
current_fertility_lowest.sort_values(ascending=True, inplace=True)

plt.bar(x=np.arange(10), height=current_fertility_lowest)
plt.xticks(np.arange(10), current_fertility_lowest.keys(), rotation='vertical')
for i, val in enumerate(current_fertility_lowest):
    plt.text(i, val * 1.05, f'{val:.1f}', horizontalalignment='center', rotation='vertical')
plt.title(f'10 lowest fertility rates in the world by country for the year {current_year}')
plt.ylim(0, current_fertility_lowest.max() * 1.2)
plt.xlabel('Country')
plt.ylabel('Fertility (babies per woman)')
plt.show();

The chart above indicates that some countries have declining populations. 2.1 is the approximate threshold needed to maintain a population's size, but all the countries shown in the chart have population growth rates of 1.4 or less.

Q2: How have the fertility rates for those countries been trending in the last 20 years?¶

Countries with highest current fertility rates¶

In [59]:
fertility_df_highest_current = fertility_df.query('country in ["' + '","'.join(list(current_fertility_highest.keys())) + '"]')
fertility_df_highest_current = fertility_df_highest_current.loc[:, ['country'] + [str(year) for year in last_20_years]]
fertility_df_highest_current
Out[59]:
country 2003 2004 2005 2006 2007 2008 2009 2010 2011 ... 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
2 angola 6.54 6.50 6.47 6.42 6.37 6.31 6.24 6.16 6.08 ... 5.92 5.84 5.77 5.69 5.62 5.55 5.48 5.41 5.34 5.28
12 burundi 6.82 6.76 6.69 6.62 6.53 6.45 6.35 6.26 6.16 ... 5.96 5.87 5.78 5.70 5.62 5.53 5.45 5.38 5.30 5.22
37 congo,_dem._rep. 6.81 6.76 6.71 6.66 6.61 6.57 6.52 6.47 6.42 ... 6.30 6.24 6.17 6.09 6.02 5.92 5.82 5.72 5.62 5.52
68 gambia 5.88 5.86 5.84 5.81 5.79 5.76 5.74 5.71 5.68 ... 5.60 5.54 5.49 5.43 5.36 5.29 5.22 5.14 5.07 4.99
118 mali 6.84 6.82 6.79 6.76 6.72 6.67 6.61 6.55 6.47 ... 6.32 6.23 6.15 6.06 5.97 5.88 5.79 5.70 5.60 5.51
132 niger 7.64 7.63 7.61 7.59 7.57 7.54 7.52 7.49 7.46 ... 7.38 7.34 7.29 7.24 7.18 7.13 7.07 7.00 6.93 6.86
133 nigeria 6.04 6.01 5.99 5.96 5.93 5.90 5.87 5.84 5.80 ... 5.71 5.65 5.59 5.53 5.46 5.39 5.32 5.25 5.18 5.11
164 somalia 7.45 7.38 7.31 7.23 7.14 7.06 6.96 6.87 6.77 ... 6.56 6.46 6.37 6.27 6.17 6.08 5.98 5.89 5.80 5.71
175 chad 7.21 7.15 7.07 6.99 6.90 6.80 6.70 6.59 6.48 ... 6.26 6.16 6.05 5.95 5.85 5.75 5.65 5.55 5.45 5.36
187 uganda 6.72 6.65 6.58 6.51 6.42 6.34 6.25 6.15 6.06 ... 5.87 5.78 5.68 5.59 5.50 5.41 5.32 5.24 5.15 5.06

10 rows × 21 columns

First, the fertility rates of the last 20 years are collected for the countries with the highest current fertility rates.

In [60]:
fertility_df_highest_current.transpose().iloc[1:,:].plot(figsize=(8,8)).legend(fertility_df_highest_current.country)
plt.title(f'20 year trend of fertility rates for the countries with the highest rates in {current_year}')
plt.xlabel('Year')
plt.ylabel('Fertility rate (babies per woman)');

The chart above suggests that fertility rates have recently been dropping for all the countries. Note that the dataframe had to be transposed (i.e. the axes had to be flipped) in order to generate this chart.

Countries with lowest current fertility rates¶

In [61]:
fertility_df_lowest_current = fertility_df.query('country in ["' + '","'.join(list(current_fertility_lowest.keys())) + '"]')
fertility_df_lowest_current = fertility_df_lowest_current.loc[:, ['country'] + [str(year) for year in range(int(current_year) - 19, int(current_year) + 1)]]
fertility_df_lowest_current
Out[61]:
country 2003 2004 2005 2006 2007 2008 2009 2010 2011 ... 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
44 cyprus 1.58 1.55 1.52 1.50 1.48 1.46 1.44 1.42 1.41 ... 1.37 1.36 1.35 1.34 1.34 1.34 1.33 1.33 1.34 1.34
71 greece 1.36 1.38 1.40 1.42 1.44 1.44 1.43 1.41 1.39 ... 1.35 1.33 1.31 1.31 1.30 1.30 1.30 1.30 1.30 1.31
78 hong_kong,_china 0.95 0.96 0.97 0.99 1.02 1.05 1.08 1.12 1.15 ... 1.22 1.24 1.27 1.30 1.32 1.34 1.36 1.38 1.40 1.42
99 south_korea 1.21 1.18 1.17 1.16 1.16 1.17 1.18 1.19 1.21 ... 1.24 1.26 1.28 1.30 1.31 1.33 1.35 1.36 1.38 1.39
105 st._lucia 1.83 1.74 1.68 1.63 1.60 1.57 1.56 1.54 1.52 ... 1.50 1.48 1.47 1.46 1.45 1.44 1.43 1.42 1.42 1.41
113 moldova 1.25 1.22 1.22 1.23 1.24 1.26 1.27 1.27 1.27 ... 1.26 1.26 1.25 1.24 1.23 1.23 1.23 1.23 1.23 1.24
145 poland 1.28 1.28 1.30 1.32 1.34 1.36 1.37 1.37 1.36 ... 1.33 1.32 1.31 1.30 1.29 1.29 1.29 1.29 1.29 1.29
148 portugal 1.44 1.43 1.42 1.40 1.38 1.36 1.34 1.33 1.31 ... 1.28 1.26 1.25 1.25 1.24 1.24 1.24 1.24 1.24 1.25
160 singapore 1.27 1.26 1.26 1.28 1.29 1.28 1.26 1.26 1.26 ... 1.25 1.25 1.24 1.25 1.25 1.26 1.27 1.27 1.28 1.28
185 taiwan 1.24 1.18 1.12 1.12 1.10 1.05 1.03 0.90 1.07 ... 1.07 1.15 1.17 1.19 1.21 1.22 1.24 1.25 1.27 1.28

10 rows × 21 columns

In [62]:
# set ylim to a range of 2.5 for uniformity with the plot for highest fertility countries
fertility_df_lowest_current.transpose().iloc[1:,:].plot(figsize=(8,8), ylim=(0.5, 3)).legend(fertility_df_lowest_current.country)
plt.title(f'20 year trend of fertility rates for the countries with the lowest rates in {current_year}')
plt.xlabel('Year')
plt.ylabel('Fertility rate (babies per woman)');

The chart above suggests a mix of trends for the countries with the current lowest fertility rates. Most notably, St. Lucia and Cyprus appear to still have droppping rates whereas Hong Kong's rate appears to be on the rise.

Q3: Which sub-regions of the world are currently the richest, and which are the poorest?¶

In [63]:
current_richest_regions = merged_income_country.groupby('sub_region_name').mean(numeric_only=True)[current_year].sort_values(ascending=False)
In [64]:
current_richest_regions_ticks = np.arange(len(current_richest_regions))
plt.bar(x=current_richest_regions_ticks, height=current_richest_regions)
plt.xticks(current_richest_regions_ticks, current_richest_regions.index, rotation='vertical')
for i, val in enumerate(current_richest_regions):
    plt.text(i, val + 750, f'{val:.1f}', rotation='vertical', horizontalalignment='center', verticalalignment='bottom')

plt.ylim(0, current_richest_regions.max() * 1.2)
plt.title(f'Per capita income in worldwide sub-regions for {current_year}')
plt.xlabel('Sub-region')
plt.ylabel('Per capita income (inflation adjusted USD)');

The chart above suggests that the world's richest societies are concentrated in the area colloquially known as the "West" i.e. Europe, North America, Australia and New Zealand. Sub-Saharan Africa and island nations around Australia (Polynesia, Micronesia, and Melanesia) have by far the poorest societies, with Asia and the Middle East occupying the middle of the range alongside Latin America.

Q4: How has that wealth pattern changed over the past 200 years?¶

In [65]:
for period_start, period_end in period_buckets:
    period_start = str(int(period_start) + 1)
    period_mean_income_per_sub_region = merged_income_country.groupby('sub_region_name').mean(numeric_only=True).loc[:, period_start:period_end].mean(axis=1).sort_values(ascending=False)
    bar_ticks = np.arange(len(period_mean_income_per_sub_region))
    plt.bar(x=bar_ticks, height=period_mean_income_per_sub_region)
    plt.xticks(bar_ticks, period_mean_income_per_sub_region.index, rotation='vertical')
    for i, val in enumerate(period_mean_income_per_sub_region):
        # add 2% of the value for some space above the top of the bar
        plt.text(i, val * 1.02, f'{val:.1f}', verticalalignment='bottom', horizontalalignment='center', rotation='vertical')
    
    # add 20% of the max to the y scale so the highest text label doesn't overflow the top of the chart
    plt.ylim(0, period_mean_income_per_sub_region.max() * 1.2)
    plt.title(f'Per capita income in worldwide sub-regions between {period_start} and {period_end}')
    plt.xlabel('Sub-region')
    plt.ylabel('Per capita income (inflation adjusted USD)')
    plt.figure(figsize=(20,20))
    plt.show();
<Figure size 2000x2000 with 0 Axes>
<Figure size 2000x2000 with 0 Axes>
<Figure size 2000x2000 with 0 Axes>
<Figure size 2000x2000 with 0 Axes>

The charts above indicate that the composition of the extremes have not changed much in the last 200 years. The "West" has always recorded the highest per capita income during that period, and Sub-Saharan Africa and Melanesia have recorded the lowest. However, some interesting changes occurred over that period.

Micronesia¶

Micronesia has experienced quite a swing in its fortunes compared to other regions. For the oldest and second oldest buckets (1823 - 1872 and 1873 - 1922, respectively) it was the second poorest and poorest region, respectively. However, from 1923 to 1972 it shot up to the top of the mid range, behind only the "West" and western Asia. The countries within Micronesia are shown below:

In [66]:
print(merged_income_country[merged_income_country.sub_region_name == 'micronesia'].country.unique())
['micronesia,_fed._sts.' 'kiribati' 'marshall_islands' 'nauru' 'palau']

A decent history lesson is likely necessary to fully investigate the reasons for such an impressive jump, nonetheless it's certainly noteworthy! More so, further investigation is also needed to determine the spread of that newfound wealth within the population of that region during that period.

Polynesia¶

Polynesia appears to have experienced the reverse change in relative fortunes. In the oldest bucket Polynesia was firmly in the middle of the pack, but it steadily fell lower until it became the second poorest region in the most recent bucket, ahead of only Sub-Saharan Africa. The countries in Polynesia are:

In [67]:
print(merged_income_country[merged_income_country.sub_region_name == 'polynesia'].country.unique())
['tonga' 'tuvalu' 'samoa']

Similar to Micronesia, detecting the full story of what happened here is beyond the scope of this exploration.

Q5: Which sub-regions of the world currently have the highest and lowest life expectancies?¶

In [68]:
current_highest_life_exp = merged_life_exp_country_df.groupby('sub_region_name').mean(numeric_only=True).loc[:, current_year].sort_values(ascending=False)
In [69]:
current_highest_life_exp_ticks = np.arange(len(current_highest_life_exp))
plt.bar(x=current_highest_life_exp_ticks, height=current_highest_life_exp)
plt.xticks(current_highest_life_exp_ticks, labels=current_highest_life_exp.index, rotation='vertical')
plt.ylim((0, current_highest_life_exp.max() * 1.2))
for i, val in enumerate(current_highest_life_exp):
    plt.text(i, val * 1.02, f'{val:.1f}', verticalalignment='bottom', horizontalalignment='center', rotation='vertical')
plt.title(f"Life expectancy of the world's sub-regions for {current_year}")
plt.xlabel('Sub-region')
plt.ylabel('Life expectancy (years)')
plt.show();

The chart above suggests that the distribution of life expectancy follows a similar pattern to what was seen with per capita income. The "West" has the highest life expectancy, and Sub-Saharan Africa, Melanesia and Micronesia have the lowest.

Q6: How does life expectancy relate to income over the past 100 years?¶

In [70]:
income_life_exp_period_size = 20
income_life_exp_periods = [
    (str(int(current_year) - income_life_exp_period_size * 1), current_year),
    (str(int(current_year) - income_life_exp_period_size * 2), str(int(current_year) - income_life_exp_period_size * 1)),
    (str(int(current_year) - income_life_exp_period_size * 3), str(int(current_year) - income_life_exp_period_size * 2)),
    (str(int(current_year) - income_life_exp_period_size * 4), str(int(current_year) - income_life_exp_period_size * 3)),
    (str(int(current_year) - income_life_exp_period_size * 5), str(int(current_year) - income_life_exp_period_size * 4)),
]
income_life_exp_periods
Out[70]:
[('2002', '2022'),
 ('1982', '2002'),
 ('1962', '1982'),
 ('1942', '1962'),
 ('1922', '1942')]

The code above creates 5 buckets with a size of 20 years each, starting from the current year and going back 100 years.

In [71]:
for period_start, period_end in income_life_exp_periods:
    period_start = str(int(period_start) + 1)
    life_exp_cols = [str(num) for num in np.arange(int(period_start), int(period_end) + 1)]
    income_cols = [f'income_{str(num)}' for num in np.arange(int(period_start), int(period_end) + 1)]
    mean_period_life_exp_df = merged_income_life_exp_df.loc[:, life_exp_cols].mean(axis=1)
    mean_period_income_df = merged_income_life_exp_df.loc[:, income_cols].mean(axis=1)
    plt.scatter(mean_period_income_df, mean_period_life_exp_df);
    plt.title(f'Life expectancy vs per capita income from {period_start} to {period_end}')
    plt.xlabel('Per capita income (inflation-adjusted USD)')
    plt.ylabel('Life expectancy (years)')
    lin_reg_1_deg = np.polyfit(mean_period_income_df, mean_period_life_exp_df, 1)
    poly_1_deg = np.poly1d(lin_reg_1_deg)
    predicted_1_deg = poly_1_deg(mean_period_income_df)
    corr_matrix_1_deg = np.corrcoef(mean_period_life_exp_df, predicted_1_deg)
    coef_1_deg = corr_matrix_1_deg[0,1]**2
    plt.plot(mean_period_income_df, predicted_1_deg, 'r--')
    plt.text(mean_period_income_df.max() * 0.8, mean_period_life_exp_df.max() * 1.05, f'r_sq={coef_1_deg:.6f}')
    plt.show();

Looking at the R-square values in the charts above, it appears that there was never a strong linear correlation between per capita income and life expectancy during the last 100 years. In fact, in all the charts the highest life expectancy occurred around the middle of the range of income values. However, by visually inspecting the charts, it does seem as though a strong linear relationship exists at lower levels of per capita income, but the effect diminishes beyond a certain income level.

Interestingly, the charts also indicate that there are some notable outliers with income levels far above those of the other countries. In particular, the periods 1963-1982 and 2003-2022 each have a country with income levels above $100k per capita. That's very high, particularly from 1963 to 1982! What countries could those possibly be?

In [72]:
period_cols_1963_1982 = ['country'] + [f'income_{str(num)}' for num in np.arange(1963, 1983)]
period_income_life_exp_df_1963_1982 = merged_income_life_exp_df.loc[:, period_cols_1963_1982]
period_income_life_exp_df_1963_1982[period_income_life_exp_df_1963_1982.iloc[:, 1:].mean(axis=1) > 100000]
Out[72]:
country income_1963 income_1964 income_1965 income_1966 income_1967 income_1968 income_1969 income_1970 income_1971 ... income_1973 income_1974 income_1975 income_1976 income_1977 income_1978 income_1979 income_1980 income_1981 income_1982
138 qatar 32700.0 40800.0 50900.0 63500.0 79100.0 98700.0 123000.0 153000.0 154000.0 ... 157000.0 153000.0 150000.0 153000.0 134000.0 137000.0 134000.0 121000.0 109000.0 90500.0

1 rows × 21 columns

For the 1963-1982 period, Qatar was the country! Some quick research reveals that this period coincided with the discovery of oil reserves in Qatar, so perhaps that boost in national revenue being applied to a small population was responsible for such a high number.

In [73]:
period_cols_2003_2022 = ['country'] + [f'income_{str(num)}' for num in np.arange(2003, 2023)]
period_income_life_exp_df_2003_2022 = merged_income_life_exp_df.loc[:, period_cols_2003_2022]
period_income_life_exp_df_2003_2022[period_income_life_exp_df_2003_2022.iloc[:, 1:].mean(axis=1) > 100000]
Out[73]:
country income_2003 income_2004 income_2005 income_2006 income_2007 income_2008 income_2009 income_2010 income_2011 ... income_2013 income_2014 income_2015 income_2016 income_2017 income_2018 income_2019 income_2020 income_2021 income_2022
101 luxembourg 105000.0 108000.0 109000.0 113000.0 121000.0 118000.0 112000.0 114000.0 113000.0 ... 113000.0 113000.0 113000.0 116000.0 115000.0 115000.0 117000.0 112000.0 118000.0 119000.0

1 rows × 21 columns

For the 2003-2022 period, Luxembourg was the country! Some quick research reveals that Luxembourg has a large share of workers who work across borders and therefore are not considered residents even though their income contributes to the country's GDP (read here). In other words, the per capita income calculation unintentionally assigns the income of those cross-border workers to the remaining citizens who're full-time residents of the country.

Conclusions¶

This exploration has revealed some interesting trends in the levels of fertility, income, and life expectancy in different parts of the world. In summary:

  • fertility rates are still very high (>= 5 babies per woman) in some parts of the world, whereas in other parts of the world fertility rates are well below the threshold needed to maintain population levels (~2.1 babies per woman).
  • the majority of countries have fertility rates that are closer to the lower end of the worldwide range than to the higher end.
  • more countries have broken into the upper income bracket in the past 200 years, although the gap between richest and poorest countries widened during that same period.
  • the distribution of income in the world has skewed in favor of the "West" (North America, Europe, Australia, and New Zealand) for the last 200 years, whereas Sub-Saharan Africa and Melanesia have remained among the poorest regions over the same period of time.
  • life expectancy was relatively flat between 1823 and 1922, then began to rise quite dramatically afterwards. More so, an increasing portion of countries appear to have benefitted from the gains in life expectancy.
  • life expectancy is currently highest in the "West" and lowest in Sub-Saharan Africa, Melanesia, Micronesia.
  • life expectancy appears to have only a partial correlation with per capita income. Increases in income level from a low starting point appear correlated to higher life expectancy, but only up to a point.

It must be restated that this exploration was not intended to imply causation between any of the examined metrics. Any trends shown here are at best correlation rather than causation.

Resources¶

  • Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/index.html
  • Numpy documentation: https://numpy.org/doc/stable/reference/index.html
  • Eurostat: https://ec.europa.eu/eurostat/web/products-eurostat-news/-/ddn-20211215-1#:~:text=For%20more%20information%3A,to%20calculate%20GDP%20per%20capita.
  • Wikipedia: https://en.wikipedia.org/wiki/Economy_of_Qatar#:~:text=The%20exploration%20of%20oil%20and,downturn%20from%201982%20to%201989.
  • Gapminder World Dataset: https://www.gapminder.org/data/
In [ ]: