The Gapminder World dataset contains data points on various socioeconomic metrics for the world's countries over the past 200 years. The exploration described below focuses on trends in women's fertility, per capita income, and life expectancy. The specific questions explored are:
NOTE: This exploration was not intended to imply causation between any of the examined metrics. Any trends shown here are at best correlation rather than causation.
import pandas as pd
import numpy as np
from numpy.polynomial import Polynomial
import matplotlib.pyplot as plt
import seaborn as sns
import math
%matplotlib inline
def read_data(filename: str, sep: str = ','):
return pd.read_csv(f'datasets/{filename}', sep=sep)
fertility_df = read_data('children_per_woman_total_fertility.csv')
fertility_df.head()
country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aruba | 5.64 | 5.64 | 5.64 | 5.64 | 5.64 | 5.64 | 5.64 | 5.64 | 5.64 | ... | 1.82 | 1.82 | 1.82 | 1.82 | 1.82 | 1.82 | 1.82 | 1.82 | 1.83 | 1.83 |
1 | Afghanistan | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | ... | 1.74 | 1.74 | 1.74 | 1.74 | 1.74 | 1.74 | 1.74 | 1.74 | 1.74 | 1.74 |
2 | Angola | 6.93 | 6.93 | 6.93 | 6.93 | 6.93 | 6.93 | 6.93 | 6.94 | 6.94 | ... | 2.54 | 2.52 | 2.50 | 2.48 | 2.47 | 2.45 | 2.43 | 2.42 | 2.40 | 2.40 |
3 | Albania | 4.60 | 4.60 | 4.60 | 4.60 | 4.60 | 4.60 | 4.60 | 4.60 | 4.60 | ... | 1.78 | 1.78 | 1.78 | 1.79 | 1.79 | 1.79 | 1.79 | 1.79 | 1.79 | 1.79 |
4 | Netherlands Antilles | 5.80 | 5.80 | 5.80 | 5.80 | 5.80 | 5.80 | 5.80 | 5.80 | 5.80 | ... | 2.00 | 2.00 | 2.01 | 2.01 | 2.01 | 2.01 | 2.01 | 2.02 | 2.02 | 2.02 |
5 rows × 302 columns
fertility_df.shape
(202, 302)
The dataset contains data for up to 202 countries, and includes projected fertility rates going out to 2100.
fertility_df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 202 entries, 0 to 201 Data columns (total 302 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 country 202 non-null object 1 1800 202 non-null float64 2 1801 202 non-null float64 3 1802 202 non-null float64 4 1803 202 non-null float64 5 1804 202 non-null float64 6 1805 202 non-null float64 7 1806 202 non-null float64 8 1807 202 non-null float64 9 1808 202 non-null float64 10 1809 202 non-null float64 11 1810 202 non-null float64 12 1811 202 non-null float64 13 1812 202 non-null float64 14 1813 202 non-null float64 15 1814 202 non-null float64 16 1815 202 non-null float64 17 1816 202 non-null float64 18 1817 202 non-null float64 19 1818 202 non-null float64 20 1819 202 non-null float64 21 1820 202 non-null float64 22 1821 202 non-null float64 23 1822 202 non-null float64 24 1823 202 non-null float64 25 1824 202 non-null float64 26 1825 202 non-null float64 27 1826 202 non-null float64 28 1827 202 non-null float64 29 1828 202 non-null float64 30 1829 202 non-null float64 31 1830 202 non-null float64 32 1831 202 non-null float64 33 1832 202 non-null float64 34 1833 202 non-null float64 35 1834 202 non-null float64 36 1835 202 non-null float64 37 1836 202 non-null float64 38 1837 202 non-null float64 39 1838 202 non-null float64 40 1839 202 non-null float64 41 1840 202 non-null float64 42 1841 202 non-null float64 43 1842 202 non-null float64 44 1843 202 non-null float64 45 1844 202 non-null float64 46 1845 202 non-null float64 47 1846 202 non-null float64 48 1847 202 non-null float64 49 1848 202 non-null float64 50 1849 202 non-null float64 51 1850 202 non-null float64 52 1851 202 non-null float64 53 1852 202 non-null float64 54 1853 202 non-null float64 55 1854 202 non-null float64 56 1855 202 non-null float64 57 1856 202 non-null float64 58 1857 202 non-null float64 59 1858 202 non-null float64 60 1859 202 non-null float64 61 1860 202 non-null float64 62 1861 202 non-null float64 63 1862 202 non-null float64 64 1863 202 non-null float64 65 1864 202 non-null float64 66 1865 202 non-null float64 67 1866 202 non-null float64 68 1867 202 non-null float64 69 1868 202 non-null float64 70 1869 202 non-null float64 71 1870 202 non-null float64 72 1871 202 non-null float64 73 1872 202 non-null float64 74 1873 202 non-null float64 75 1874 202 non-null float64 76 1875 202 non-null float64 77 1876 202 non-null float64 78 1877 202 non-null float64 79 1878 202 non-null float64 80 1879 202 non-null float64 81 1880 202 non-null float64 82 1881 202 non-null float64 83 1882 202 non-null float64 84 1883 202 non-null float64 85 1884 202 non-null float64 86 1885 202 non-null float64 87 1886 202 non-null float64 88 1887 202 non-null float64 89 1888 202 non-null float64 90 1889 202 non-null float64 91 1890 202 non-null float64 92 1891 202 non-null float64 93 1892 202 non-null float64 94 1893 202 non-null float64 95 1894 202 non-null float64 96 1895 202 non-null float64 97 1896 202 non-null float64 98 1897 202 non-null float64 99 1898 202 non-null float64 100 1899 202 non-null float64 101 1900 202 non-null float64 102 1901 202 non-null float64 103 1902 202 non-null float64 104 1903 202 non-null float64 105 1904 202 non-null float64 106 1905 202 non-null float64 107 1906 202 non-null float64 108 1907 202 non-null float64 109 1908 202 non-null float64 110 1909 202 non-null float64 111 1910 202 non-null float64 112 1911 202 non-null float64 113 1912 202 non-null float64 114 1913 202 non-null float64 115 1914 202 non-null float64 116 1915 202 non-null float64 117 1916 202 non-null float64 118 1917 202 non-null float64 119 1918 202 non-null float64 120 1919 202 non-null float64 121 1920 202 non-null float64 122 1921 202 non-null float64 123 1922 202 non-null float64 124 1923 202 non-null float64 125 1924 202 non-null float64 126 1925 202 non-null float64 127 1926 202 non-null float64 128 1927 202 non-null float64 129 1928 202 non-null float64 130 1929 202 non-null float64 131 1930 202 non-null float64 132 1931 202 non-null float64 133 1932 202 non-null float64 134 1933 202 non-null float64 135 1934 202 non-null float64 136 1935 202 non-null float64 137 1936 202 non-null float64 138 1937 202 non-null float64 139 1938 202 non-null float64 140 1939 202 non-null float64 141 1940 202 non-null float64 142 1941 202 non-null float64 143 1942 202 non-null float64 144 1943 202 non-null float64 145 1944 202 non-null float64 146 1945 202 non-null float64 147 1946 202 non-null float64 148 1947 202 non-null float64 149 1948 202 non-null float64 150 1949 202 non-null float64 151 1950 202 non-null float64 152 1951 202 non-null float64 153 1952 202 non-null float64 154 1953 202 non-null float64 155 1954 202 non-null float64 156 1955 202 non-null float64 157 1956 202 non-null float64 158 1957 202 non-null float64 159 1958 202 non-null float64 160 1959 202 non-null float64 161 1960 202 non-null float64 162 1961 202 non-null float64 163 1962 202 non-null float64 164 1963 202 non-null float64 165 1964 202 non-null float64 166 1965 202 non-null float64 167 1966 202 non-null float64 168 1967 202 non-null float64 169 1968 202 non-null float64 170 1969 202 non-null float64 171 1970 202 non-null float64 172 1971 202 non-null float64 173 1972 202 non-null float64 174 1973 202 non-null float64 175 1974 202 non-null float64 176 1975 202 non-null float64 177 1976 202 non-null float64 178 1977 202 non-null float64 179 1978 202 non-null float64 180 1979 202 non-null float64 181 1980 202 non-null float64 182 1981 202 non-null float64 183 1982 202 non-null float64 184 1983 202 non-null float64 185 1984 202 non-null float64 186 1985 202 non-null float64 187 1986 202 non-null float64 188 1987 202 non-null float64 189 1988 202 non-null float64 190 1989 202 non-null float64 191 1990 202 non-null float64 192 1991 202 non-null float64 193 1992 202 non-null float64 194 1993 202 non-null float64 195 1994 202 non-null float64 196 1995 202 non-null float64 197 1996 202 non-null float64 198 1997 202 non-null float64 199 1998 202 non-null float64 200 1999 202 non-null float64 201 2000 202 non-null float64 202 2001 202 non-null float64 203 2002 202 non-null float64 204 2003 202 non-null float64 205 2004 202 non-null float64 206 2005 202 non-null float64 207 2006 202 non-null float64 208 2007 202 non-null float64 209 2008 202 non-null float64 210 2009 202 non-null float64 211 2010 202 non-null float64 212 2011 202 non-null float64 213 2012 202 non-null float64 214 2013 202 non-null float64 215 2014 201 non-null float64 216 2015 201 non-null float64 217 2016 201 non-null float64 218 2017 201 non-null float64 219 2018 201 non-null float64 220 2019 201 non-null float64 221 2020 201 non-null float64 222 2021 201 non-null float64 223 2022 201 non-null float64 224 2023 201 non-null float64 225 2024 201 non-null float64 226 2025 201 non-null float64 227 2026 201 non-null float64 228 2027 201 non-null float64 229 2028 201 non-null float64 230 2029 201 non-null float64 231 2030 201 non-null float64 232 2031 201 non-null float64 233 2032 201 non-null float64 234 2033 201 non-null float64 235 2034 201 non-null float64 236 2035 201 non-null float64 237 2036 201 non-null float64 238 2037 201 non-null float64 239 2038 201 non-null float64 240 2039 201 non-null float64 241 2040 201 non-null float64 242 2041 201 non-null float64 243 2042 201 non-null float64 244 2043 201 non-null float64 245 2044 201 non-null float64 246 2045 201 non-null float64 247 2046 201 non-null float64 248 2047 201 non-null float64 249 2048 201 non-null float64 250 2049 201 non-null float64 251 2050 201 non-null float64 252 2051 201 non-null float64 253 2052 201 non-null float64 254 2053 201 non-null float64 255 2054 201 non-null float64 256 2055 201 non-null float64 257 2056 201 non-null float64 258 2057 201 non-null float64 259 2058 201 non-null float64 260 2059 201 non-null float64 261 2060 201 non-null float64 262 2061 201 non-null float64 263 2062 201 non-null float64 264 2063 201 non-null float64 265 2064 201 non-null float64 266 2065 201 non-null float64 267 2066 201 non-null float64 268 2067 201 non-null float64 269 2068 201 non-null float64 270 2069 201 non-null float64 271 2070 201 non-null float64 272 2071 201 non-null float64 273 2072 201 non-null float64 274 2073 201 non-null float64 275 2074 201 non-null float64 276 2075 201 non-null float64 277 2076 201 non-null float64 278 2077 201 non-null float64 279 2078 201 non-null float64 280 2079 201 non-null float64 281 2080 201 non-null float64 282 2081 201 non-null float64 283 2082 201 non-null float64 284 2083 201 non-null float64 285 2084 201 non-null float64 286 2085 201 non-null float64 287 2086 201 non-null float64 288 2087 201 non-null float64 289 2088 201 non-null float64 290 2089 201 non-null float64 291 2090 201 non-null float64 292 2091 201 non-null float64 293 2092 201 non-null float64 294 2093 201 non-null float64 295 2094 201 non-null float64 296 2095 201 non-null float64 297 2096 201 non-null float64 298 2097 201 non-null float64 299 2098 201 non-null float64 300 2099 201 non-null float64 301 2100 201 non-null float64 dtypes: float64(301), object(1) memory usage: 476.7+ KB
All the numbers are already floats, and there's very little missing data (only one country doesn't have data starting from 2014).
fertility_df.duplicated().sum()
0
There are no duplicated rows.
income_per_person_df = read_data('income_per_person_gdppercapita_ppp_inflation_adjusted.csv')
income_per_person_df.head()
country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2041 | 2042 | 2043 | 2044 | 2045 | 2046 | 2047 | 2048 | 2049 | 2050 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 683 | 683 | 683 | 683 | 683 | 683 | 683 | 683 | 683 | ... | 2690 | 2750 | 2810 | 2870 | 2930 | 2990 | 3060 | 3120 | 3190 | 3260 |
1 | Angola | 700 | 702 | 705 | 709 | 711 | 714 | 718 | 721 | 725 | ... | 8000 | 8170 | 8350 | 8530 | 8710 | 8900 | 9090 | 9280 | 9480 | 9690 |
2 | Albania | 755 | 755 | 755 | 755 | 755 | 756 | 756 | 756 | 756 | ... | 25.1k | 25.6k | 26.2k | 26.7k | 27.3k | 27.9k | 28.5k | 29.1k | 29.7k | 30.4k |
3 | Andorra | 1360 | 1360 | 1360 | 1360 | 1370 | 1370 | 1370 | 1370 | 1380 | ... | 68.9k | 70.4k | 71.9k | 73.4k | 75k | 76.6k | 78.3k | 80k | 81.7k | 83.4k |
4 | United Arab Emirates | 1130 | 1130 | 1140 | 1140 | 1150 | 1150 | 1160 | 1160 | 1160 | ... | 101k | 103k | 105k | 107k | 110k | 112k | 114k | 117k | 119k | 122k |
5 rows × 252 columns
income_per_person_df.shape
(195, 252)
The dataset contains data for up to 195 countries, and includes projected income levels going out to 2050.
income_per_person_df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 195 entries, 0 to 194 Data columns (total 252 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 country 195 non-null object 1 1800 195 non-null int64 2 1801 195 non-null int64 3 1802 195 non-null int64 4 1803 195 non-null int64 5 1804 195 non-null int64 6 1805 195 non-null int64 7 1806 195 non-null int64 8 1807 195 non-null int64 9 1808 195 non-null int64 10 1809 195 non-null int64 11 1810 195 non-null int64 12 1811 195 non-null int64 13 1812 195 non-null int64 14 1813 195 non-null int64 15 1814 195 non-null int64 16 1815 195 non-null int64 17 1816 195 non-null int64 18 1817 195 non-null int64 19 1818 195 non-null int64 20 1819 195 non-null int64 21 1820 195 non-null int64 22 1821 195 non-null int64 23 1822 195 non-null int64 24 1823 195 non-null int64 25 1824 195 non-null int64 26 1825 195 non-null int64 27 1826 195 non-null int64 28 1827 195 non-null int64 29 1828 195 non-null int64 30 1829 195 non-null int64 31 1830 195 non-null int64 32 1831 195 non-null int64 33 1832 195 non-null int64 34 1833 195 non-null int64 35 1834 195 non-null int64 36 1835 195 non-null int64 37 1836 195 non-null int64 38 1837 195 non-null int64 39 1838 195 non-null int64 40 1839 195 non-null int64 41 1840 195 non-null int64 42 1841 195 non-null int64 43 1842 195 non-null int64 44 1843 195 non-null int64 45 1844 195 non-null int64 46 1845 195 non-null int64 47 1846 195 non-null int64 48 1847 195 non-null int64 49 1848 195 non-null int64 50 1849 195 non-null int64 51 1850 195 non-null int64 52 1851 195 non-null int64 53 1852 195 non-null int64 54 1853 195 non-null int64 55 1854 195 non-null int64 56 1855 195 non-null int64 57 1856 195 non-null int64 58 1857 195 non-null int64 59 1858 195 non-null int64 60 1859 195 non-null int64 61 1860 195 non-null int64 62 1861 195 non-null int64 63 1862 195 non-null int64 64 1863 195 non-null int64 65 1864 195 non-null int64 66 1865 195 non-null int64 67 1866 195 non-null int64 68 1867 195 non-null int64 69 1868 195 non-null int64 70 1869 195 non-null int64 71 1870 195 non-null int64 72 1871 195 non-null int64 73 1872 195 non-null int64 74 1873 195 non-null int64 75 1874 195 non-null int64 76 1875 195 non-null int64 77 1876 195 non-null int64 78 1877 195 non-null int64 79 1878 195 non-null int64 80 1879 195 non-null int64 81 1880 195 non-null int64 82 1881 195 non-null int64 83 1882 195 non-null int64 84 1883 195 non-null int64 85 1884 195 non-null int64 86 1885 195 non-null int64 87 1886 195 non-null int64 88 1887 195 non-null int64 89 1888 195 non-null int64 90 1889 195 non-null int64 91 1890 195 non-null int64 92 1891 195 non-null int64 93 1892 195 non-null int64 94 1893 195 non-null int64 95 1894 195 non-null int64 96 1895 195 non-null int64 97 1896 195 non-null int64 98 1897 195 non-null int64 99 1898 195 non-null int64 100 1899 195 non-null int64 101 1900 195 non-null int64 102 1901 195 non-null object 103 1902 195 non-null object 104 1903 195 non-null object 105 1904 195 non-null object 106 1905 195 non-null object 107 1906 195 non-null object 108 1907 195 non-null object 109 1908 195 non-null object 110 1909 195 non-null object 111 1910 195 non-null object 112 1911 195 non-null object 113 1912 195 non-null object 114 1913 195 non-null object 115 1914 195 non-null object 116 1915 195 non-null object 117 1916 195 non-null object 118 1917 195 non-null object 119 1918 195 non-null object 120 1919 195 non-null object 121 1920 195 non-null object 122 1921 195 non-null object 123 1922 195 non-null object 124 1923 195 non-null object 125 1924 195 non-null object 126 1925 195 non-null object 127 1926 195 non-null object 128 1927 195 non-null object 129 1928 195 non-null object 130 1929 195 non-null object 131 1930 195 non-null object 132 1931 195 non-null object 133 1932 195 non-null object 134 1933 195 non-null object 135 1934 195 non-null object 136 1935 195 non-null object 137 1936 195 non-null object 138 1937 195 non-null object 139 1938 195 non-null object 140 1939 195 non-null object 141 1940 195 non-null object 142 1941 195 non-null object 143 1942 195 non-null object 144 1943 195 non-null object 145 1944 195 non-null object 146 1945 195 non-null object 147 1946 195 non-null object 148 1947 195 non-null object 149 1948 195 non-null object 150 1949 195 non-null object 151 1950 195 non-null object 152 1951 195 non-null object 153 1952 195 non-null object 154 1953 195 non-null object 155 1954 195 non-null object 156 1955 195 non-null object 157 1956 195 non-null object 158 1957 195 non-null object 159 1958 195 non-null object 160 1959 195 non-null object 161 1960 195 non-null object 162 1961 195 non-null object 163 1962 195 non-null object 164 1963 195 non-null object 165 1964 195 non-null object 166 1965 195 non-null object 167 1966 195 non-null object 168 1967 195 non-null object 169 1968 195 non-null object 170 1969 195 non-null object 171 1970 195 non-null object 172 1971 195 non-null object 173 1972 195 non-null object 174 1973 195 non-null object 175 1974 195 non-null object 176 1975 195 non-null object 177 1976 195 non-null object 178 1977 195 non-null object 179 1978 195 non-null object 180 1979 195 non-null object 181 1980 195 non-null object 182 1981 195 non-null object 183 1982 195 non-null object 184 1983 195 non-null object 185 1984 195 non-null object 186 1985 195 non-null object 187 1986 195 non-null object 188 1987 195 non-null object 189 1988 195 non-null object 190 1989 195 non-null object 191 1990 195 non-null object 192 1991 195 non-null object 193 1992 195 non-null object 194 1993 195 non-null object 195 1994 195 non-null object 196 1995 195 non-null object 197 1996 195 non-null object 198 1997 195 non-null object 199 1998 195 non-null object 200 1999 195 non-null object 201 2000 195 non-null object 202 2001 195 non-null object 203 2002 195 non-null object 204 2003 195 non-null object 205 2004 195 non-null object 206 2005 195 non-null object 207 2006 195 non-null object 208 2007 195 non-null object 209 2008 195 non-null object 210 2009 195 non-null object 211 2010 195 non-null object 212 2011 195 non-null object 213 2012 195 non-null object 214 2013 195 non-null object 215 2014 195 non-null object 216 2015 195 non-null object 217 2016 195 non-null object 218 2017 195 non-null object 219 2018 195 non-null object 220 2019 195 non-null object 221 2020 195 non-null object 222 2021 195 non-null object 223 2022 195 non-null object 224 2023 195 non-null object 225 2024 195 non-null object 226 2025 195 non-null object 227 2026 195 non-null object 228 2027 195 non-null object 229 2028 195 non-null object 230 2029 195 non-null object 231 2030 195 non-null object 232 2031 195 non-null object 233 2032 195 non-null object 234 2033 195 non-null object 235 2034 195 non-null object 236 2035 195 non-null object 237 2036 195 non-null object 238 2037 195 non-null object 239 2038 195 non-null object 240 2039 195 non-null object 241 2040 195 non-null object 242 2041 195 non-null object 243 2042 195 non-null object 244 2043 195 non-null object 245 2044 195 non-null object 246 2045 195 non-null object 247 2046 195 non-null object 248 2047 195 non-null object 249 2048 195 non-null object 250 2049 195 non-null object 251 2050 195 non-null object dtypes: int64(101), object(151) memory usage: 384.0+ KB
There appear to be no empty values, though the data types change from integers to strings beginning from 1901.
income_per_person_df['1808'].unique()
array([ 683, 725, 756, 1380, 1160, 1770, 582, 858, 968, 2160, 878, 477, 2730, 676, 543, 992, 1230, 1450, 1640, 757, 691, 655, 1010, 1040, 1710, 713, 451, 481, 1560, 2780, 1050, 678, 710, 567, 666, 795, 600, 769, 967, 2170, 1830, 873, 752, 2310, 796, 1120, 603, 1760, 1070, 592, 1460, 890, 2090, 589, 443, 3780, 616, 792, 511, 925, 642, 406, 1780, 1090, 971, 1210, 764, 1310, 718, 1420, 1060, 1650, 1020, 2860, 3210, 1300, 1530, 991, 576, 436, 628, 1320, 948, 1280, 980, 1250, 903, 1190, 400, 809, 2230, 704, 649, 954, 1330, 614, 782, 885, 933, 1200, 671, 444, 598, 396, 612, 506, 978, 1100, 3060, 1580, 745, 810, 960, 1080, 951, 619, 863, 281, 1680, 946, 765, 225, 492, 918, 588, 564, 411, 832, 1110, 1610, 791, 1540, 575, 977, 1850, 1620, 1600, 1490, 555, 1260, 474, 773, 632, 599, 751, 1400, 799, 814, 637, 525, 872, 2200, 2940, 569, 950, 976, 663, 1590, 993])
income_per_person_df['1908'].unique()
array(['1120', '774', '1410', '3980', '5770', '6780', '1090', '1480', '8700', '6150', '1740', '523', '7290', '1040', '601', '1160', '2060', '3070', '2970', '1340', '1190', '1880', '1100', '2310', '850', '689', '736', '6620', '9930', '4670', '1070', '682', '1200', '857', '883', '1350', '880', '2350', '2130', '2740', '5020', '6030', '1360', '6580', '1220', '2040', '1510', '1870', '756', '3390', '2340', '611', '5640', '627', '877', '8670', '1140', '1030', '993', '691', '637', '2960', '1580', '1990', '2490', '2540', '2170', '3630', '1760', '3580', '1150', '4820', '1430', '3220', '3160', '4410', '1110', '1780', '2840', '1250', '497', '686', '2390', '1260', '3140', '1210', '2560', '2920', '1330', '2260', '423', '1280', '6300', '1380', '994', '1060', '652', '1860', '705', '1080', '2240', '855', '471', '738', '1630', '403', '2280', '2320', '6540', '4200', '976', '961', '8960', '2910', '1800', '1440', '835', '3130', '385', '2850', '2800', '732', '3560', '559', '900', '863', '2890', '562', '1790', '2940', '871', '3660', '881', '1180', '3360', '3740', '3800', '4600', '1370', '2300', '726', '1540', '1270', '730', '1010', '2250', '1570', '1670', '617', '805', '1900', '5060', '10.4k', '920', '1720', '1970', '1290', '895', '2140', '975'], dtype=object)
Comparing the unique integer values for 1808 with the unique string values for 1908, it becomes clear that some of the larger numbers are shortened (eg 10.4k
instead of 10400
). This will require further processing to achieve consistency.
income_per_person_df.duplicated().sum()
0
There are no duplicated rows.
life_expectancy_df = read_data('life_expectancy_years.csv')
life_expectancy_df.head()
country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.1 | 28.1 | 28.1 | ... | 75.5 | 75.7 | 75.8 | 76.0 | 76.1 | 76.2 | 76.4 | 76.5 | 76.6 | 76.8 |
1 | Angola | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | ... | 78.8 | 79.0 | 79.1 | 79.2 | 79.3 | 79.5 | 79.6 | 79.7 | 79.9 | 80.0 |
2 | Albania | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | ... | 87.4 | 87.5 | 87.6 | 87.7 | 87.8 | 87.9 | 88.0 | 88.2 | 88.3 | 88.4 |
3 | Andorra | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | United Arab Emirates | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | ... | 82.4 | 82.5 | 82.6 | 82.7 | 82.8 | 82.9 | 83.0 | 83.1 | 83.2 | 83.3 |
5 rows × 302 columns
life_expectancy_df.shape
(195, 302)
The dataset contains data for up to 195 countries, and includes projected life expectancy levels going out to 2100.
life_expectancy_df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 195 entries, 0 to 194 Data columns (total 302 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 country 195 non-null object 1 1800 186 non-null float64 2 1801 186 non-null float64 3 1802 186 non-null float64 4 1803 186 non-null float64 5 1804 186 non-null float64 6 1805 186 non-null float64 7 1806 186 non-null float64 8 1807 186 non-null float64 9 1808 186 non-null float64 10 1809 186 non-null float64 11 1810 186 non-null float64 12 1811 186 non-null float64 13 1812 186 non-null float64 14 1813 186 non-null float64 15 1814 186 non-null float64 16 1815 186 non-null float64 17 1816 186 non-null float64 18 1817 186 non-null float64 19 1818 186 non-null float64 20 1819 186 non-null float64 21 1820 186 non-null float64 22 1821 186 non-null float64 23 1822 186 non-null float64 24 1823 186 non-null float64 25 1824 186 non-null float64 26 1825 186 non-null float64 27 1826 186 non-null float64 28 1827 186 non-null float64 29 1828 186 non-null float64 30 1829 186 non-null float64 31 1830 186 non-null float64 32 1831 186 non-null float64 33 1832 186 non-null float64 34 1833 186 non-null float64 35 1834 186 non-null float64 36 1835 186 non-null float64 37 1836 186 non-null float64 38 1837 186 non-null float64 39 1838 186 non-null float64 40 1839 186 non-null float64 41 1840 186 non-null float64 42 1841 186 non-null float64 43 1842 186 non-null float64 44 1843 186 non-null float64 45 1844 186 non-null float64 46 1845 186 non-null float64 47 1846 186 non-null float64 48 1847 186 non-null float64 49 1848 186 non-null float64 50 1849 186 non-null float64 51 1850 186 non-null float64 52 1851 186 non-null float64 53 1852 186 non-null float64 54 1853 186 non-null float64 55 1854 186 non-null float64 56 1855 186 non-null float64 57 1856 186 non-null float64 58 1857 186 non-null float64 59 1858 186 non-null float64 60 1859 186 non-null float64 61 1860 186 non-null float64 62 1861 186 non-null float64 63 1862 186 non-null float64 64 1863 186 non-null float64 65 1864 186 non-null float64 66 1865 186 non-null float64 67 1866 186 non-null float64 68 1867 186 non-null float64 69 1868 186 non-null float64 70 1869 186 non-null float64 71 1870 186 non-null float64 72 1871 186 non-null float64 73 1872 186 non-null float64 74 1873 186 non-null float64 75 1874 186 non-null float64 76 1875 186 non-null float64 77 1876 186 non-null float64 78 1877 186 non-null float64 79 1878 186 non-null float64 80 1879 186 non-null float64 81 1880 186 non-null float64 82 1881 186 non-null float64 83 1882 186 non-null float64 84 1883 186 non-null float64 85 1884 186 non-null float64 86 1885 186 non-null float64 87 1886 186 non-null float64 88 1887 186 non-null float64 89 1888 186 non-null float64 90 1889 186 non-null float64 91 1890 186 non-null float64 92 1891 186 non-null float64 93 1892 186 non-null float64 94 1893 186 non-null float64 95 1894 186 non-null float64 96 1895 186 non-null float64 97 1896 186 non-null float64 98 1897 186 non-null float64 99 1898 186 non-null float64 100 1899 186 non-null float64 101 1900 186 non-null float64 102 1901 186 non-null float64 103 1902 186 non-null float64 104 1903 186 non-null float64 105 1904 186 non-null float64 106 1905 186 non-null float64 107 1906 186 non-null float64 108 1907 186 non-null float64 109 1908 186 non-null float64 110 1909 186 non-null float64 111 1910 186 non-null float64 112 1911 186 non-null float64 113 1912 186 non-null float64 114 1913 186 non-null float64 115 1914 186 non-null float64 116 1915 186 non-null float64 117 1916 186 non-null float64 118 1917 186 non-null float64 119 1918 186 non-null float64 120 1919 186 non-null float64 121 1920 186 non-null float64 122 1921 186 non-null float64 123 1922 186 non-null float64 124 1923 186 non-null float64 125 1924 186 non-null float64 126 1925 186 non-null float64 127 1926 186 non-null float64 128 1927 186 non-null float64 129 1928 186 non-null float64 130 1929 186 non-null float64 131 1930 186 non-null float64 132 1931 186 non-null float64 133 1932 186 non-null float64 134 1933 186 non-null float64 135 1934 186 non-null float64 136 1935 186 non-null float64 137 1936 186 non-null float64 138 1937 186 non-null float64 139 1938 186 non-null float64 140 1939 186 non-null float64 141 1940 186 non-null float64 142 1941 186 non-null float64 143 1942 186 non-null float64 144 1943 186 non-null float64 145 1944 186 non-null float64 146 1945 186 non-null float64 147 1946 186 non-null float64 148 1947 186 non-null float64 149 1948 186 non-null float64 150 1949 186 non-null float64 151 1950 195 non-null float64 152 1951 195 non-null float64 153 1952 195 non-null float64 154 1953 195 non-null float64 155 1954 195 non-null float64 156 1955 195 non-null float64 157 1956 195 non-null float64 158 1957 195 non-null float64 159 1958 195 non-null float64 160 1959 195 non-null float64 161 1960 195 non-null float64 162 1961 195 non-null float64 163 1962 195 non-null float64 164 1963 195 non-null float64 165 1964 195 non-null float64 166 1965 195 non-null float64 167 1966 195 non-null float64 168 1967 195 non-null float64 169 1968 195 non-null float64 170 1969 195 non-null float64 171 1970 195 non-null float64 172 1971 195 non-null float64 173 1972 195 non-null float64 174 1973 195 non-null float64 175 1974 195 non-null float64 176 1975 195 non-null float64 177 1976 195 non-null float64 178 1977 195 non-null float64 179 1978 195 non-null float64 180 1979 195 non-null float64 181 1980 195 non-null float64 182 1981 195 non-null float64 183 1982 195 non-null float64 184 1983 195 non-null float64 185 1984 195 non-null float64 186 1985 195 non-null float64 187 1986 195 non-null float64 188 1987 195 non-null float64 189 1988 195 non-null float64 190 1989 195 non-null float64 191 1990 195 non-null float64 192 1991 195 non-null float64 193 1992 195 non-null float64 194 1993 195 non-null float64 195 1994 195 non-null float64 196 1995 195 non-null float64 197 1996 195 non-null float64 198 1997 195 non-null float64 199 1998 195 non-null float64 200 1999 195 non-null float64 201 2000 195 non-null float64 202 2001 195 non-null float64 203 2002 195 non-null float64 204 2003 195 non-null float64 205 2004 195 non-null float64 206 2005 195 non-null float64 207 2006 195 non-null float64 208 2007 195 non-null float64 209 2008 195 non-null float64 210 2009 195 non-null float64 211 2010 195 non-null float64 212 2011 195 non-null float64 213 2012 195 non-null float64 214 2013 195 non-null float64 215 2014 195 non-null float64 216 2015 195 non-null float64 217 2016 195 non-null float64 218 2017 195 non-null float64 219 2018 195 non-null float64 220 2019 195 non-null float64 221 2020 186 non-null float64 222 2021 186 non-null float64 223 2022 186 non-null float64 224 2023 186 non-null float64 225 2024 186 non-null float64 226 2025 186 non-null float64 227 2026 186 non-null float64 228 2027 186 non-null float64 229 2028 186 non-null float64 230 2029 186 non-null float64 231 2030 186 non-null float64 232 2031 186 non-null float64 233 2032 186 non-null float64 234 2033 186 non-null float64 235 2034 186 non-null float64 236 2035 186 non-null float64 237 2036 186 non-null float64 238 2037 186 non-null float64 239 2038 186 non-null float64 240 2039 186 non-null float64 241 2040 186 non-null float64 242 2041 186 non-null float64 243 2042 186 non-null float64 244 2043 186 non-null float64 245 2044 186 non-null float64 246 2045 186 non-null float64 247 2046 186 non-null float64 248 2047 186 non-null float64 249 2048 186 non-null float64 250 2049 186 non-null float64 251 2050 186 non-null float64 252 2051 186 non-null float64 253 2052 186 non-null float64 254 2053 186 non-null float64 255 2054 186 non-null float64 256 2055 186 non-null float64 257 2056 186 non-null float64 258 2057 186 non-null float64 259 2058 186 non-null float64 260 2059 186 non-null float64 261 2060 186 non-null float64 262 2061 186 non-null float64 263 2062 186 non-null float64 264 2063 186 non-null float64 265 2064 186 non-null float64 266 2065 186 non-null float64 267 2066 186 non-null float64 268 2067 186 non-null float64 269 2068 186 non-null float64 270 2069 186 non-null float64 271 2070 186 non-null float64 272 2071 186 non-null float64 273 2072 186 non-null float64 274 2073 186 non-null float64 275 2074 186 non-null float64 276 2075 186 non-null float64 277 2076 186 non-null float64 278 2077 186 non-null float64 279 2078 186 non-null float64 280 2079 186 non-null float64 281 2080 186 non-null float64 282 2081 186 non-null float64 283 2082 186 non-null float64 284 2083 186 non-null float64 285 2084 186 non-null float64 286 2085 186 non-null float64 287 2086 186 non-null float64 288 2087 186 non-null float64 289 2088 186 non-null float64 290 2089 186 non-null float64 291 2090 186 non-null float64 292 2091 186 non-null float64 293 2092 186 non-null float64 294 2093 186 non-null float64 295 2094 186 non-null float64 296 2095 186 non-null float64 297 2096 186 non-null float64 298 2097 186 non-null float64 299 2098 186 non-null float64 300 2099 186 non-null float64 301 2100 186 non-null float64 dtypes: float64(301), object(1) memory usage: 460.2+ KB
A number of countries are missing data for the years before 1950 and after 2019. All the numbers are already formatted as floats.
life_expectancy_df.duplicated().sum()
0
There are no duplicated rows.
The first step is to identify the row(s) that contain missing data.
fertility_df[np.isnan(fertility_df['2014'])].loc[:,'2014':]
2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
73 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 87 columns
There's only one row that has missing data, so it can be dropped from the dataset.
fertility_df.dropna(inplace=True)
fertility_df.shape
(201, 302)
fertility_df.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'> Int64Index: 201 entries, 0 to 201 Data columns (total 302 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 country 201 non-null object 1 1800 201 non-null float64 2 1801 201 non-null float64 3 1802 201 non-null float64 4 1803 201 non-null float64 5 1804 201 non-null float64 6 1805 201 non-null float64 7 1806 201 non-null float64 8 1807 201 non-null float64 9 1808 201 non-null float64 10 1809 201 non-null float64 11 1810 201 non-null float64 12 1811 201 non-null float64 13 1812 201 non-null float64 14 1813 201 non-null float64 15 1814 201 non-null float64 16 1815 201 non-null float64 17 1816 201 non-null float64 18 1817 201 non-null float64 19 1818 201 non-null float64 20 1819 201 non-null float64 21 1820 201 non-null float64 22 1821 201 non-null float64 23 1822 201 non-null float64 24 1823 201 non-null float64 25 1824 201 non-null float64 26 1825 201 non-null float64 27 1826 201 non-null float64 28 1827 201 non-null float64 29 1828 201 non-null float64 30 1829 201 non-null float64 31 1830 201 non-null float64 32 1831 201 non-null float64 33 1832 201 non-null float64 34 1833 201 non-null float64 35 1834 201 non-null float64 36 1835 201 non-null float64 37 1836 201 non-null float64 38 1837 201 non-null float64 39 1838 201 non-null float64 40 1839 201 non-null float64 41 1840 201 non-null float64 42 1841 201 non-null float64 43 1842 201 non-null float64 44 1843 201 non-null float64 45 1844 201 non-null float64 46 1845 201 non-null float64 47 1846 201 non-null float64 48 1847 201 non-null float64 49 1848 201 non-null float64 50 1849 201 non-null float64 51 1850 201 non-null float64 52 1851 201 non-null float64 53 1852 201 non-null float64 54 1853 201 non-null float64 55 1854 201 non-null float64 56 1855 201 non-null float64 57 1856 201 non-null float64 58 1857 201 non-null float64 59 1858 201 non-null float64 60 1859 201 non-null float64 61 1860 201 non-null float64 62 1861 201 non-null float64 63 1862 201 non-null float64 64 1863 201 non-null float64 65 1864 201 non-null float64 66 1865 201 non-null float64 67 1866 201 non-null float64 68 1867 201 non-null float64 69 1868 201 non-null float64 70 1869 201 non-null float64 71 1870 201 non-null float64 72 1871 201 non-null float64 73 1872 201 non-null float64 74 1873 201 non-null float64 75 1874 201 non-null float64 76 1875 201 non-null float64 77 1876 201 non-null float64 78 1877 201 non-null float64 79 1878 201 non-null float64 80 1879 201 non-null float64 81 1880 201 non-null float64 82 1881 201 non-null float64 83 1882 201 non-null float64 84 1883 201 non-null float64 85 1884 201 non-null float64 86 1885 201 non-null float64 87 1886 201 non-null float64 88 1887 201 non-null float64 89 1888 201 non-null float64 90 1889 201 non-null float64 91 1890 201 non-null float64 92 1891 201 non-null float64 93 1892 201 non-null float64 94 1893 201 non-null float64 95 1894 201 non-null float64 96 1895 201 non-null float64 97 1896 201 non-null float64 98 1897 201 non-null float64 99 1898 201 non-null float64 100 1899 201 non-null float64 101 1900 201 non-null float64 102 1901 201 non-null float64 103 1902 201 non-null float64 104 1903 201 non-null float64 105 1904 201 non-null float64 106 1905 201 non-null float64 107 1906 201 non-null float64 108 1907 201 non-null float64 109 1908 201 non-null float64 110 1909 201 non-null float64 111 1910 201 non-null float64 112 1911 201 non-null float64 113 1912 201 non-null float64 114 1913 201 non-null float64 115 1914 201 non-null float64 116 1915 201 non-null float64 117 1916 201 non-null float64 118 1917 201 non-null float64 119 1918 201 non-null float64 120 1919 201 non-null float64 121 1920 201 non-null float64 122 1921 201 non-null float64 123 1922 201 non-null float64 124 1923 201 non-null float64 125 1924 201 non-null float64 126 1925 201 non-null float64 127 1926 201 non-null float64 128 1927 201 non-null float64 129 1928 201 non-null float64 130 1929 201 non-null float64 131 1930 201 non-null float64 132 1931 201 non-null float64 133 1932 201 non-null float64 134 1933 201 non-null float64 135 1934 201 non-null float64 136 1935 201 non-null float64 137 1936 201 non-null float64 138 1937 201 non-null float64 139 1938 201 non-null float64 140 1939 201 non-null float64 141 1940 201 non-null float64 142 1941 201 non-null float64 143 1942 201 non-null float64 144 1943 201 non-null float64 145 1944 201 non-null float64 146 1945 201 non-null float64 147 1946 201 non-null float64 148 1947 201 non-null float64 149 1948 201 non-null float64 150 1949 201 non-null float64 151 1950 201 non-null float64 152 1951 201 non-null float64 153 1952 201 non-null float64 154 1953 201 non-null float64 155 1954 201 non-null float64 156 1955 201 non-null float64 157 1956 201 non-null float64 158 1957 201 non-null float64 159 1958 201 non-null float64 160 1959 201 non-null float64 161 1960 201 non-null float64 162 1961 201 non-null float64 163 1962 201 non-null float64 164 1963 201 non-null float64 165 1964 201 non-null float64 166 1965 201 non-null float64 167 1966 201 non-null float64 168 1967 201 non-null float64 169 1968 201 non-null float64 170 1969 201 non-null float64 171 1970 201 non-null float64 172 1971 201 non-null float64 173 1972 201 non-null float64 174 1973 201 non-null float64 175 1974 201 non-null float64 176 1975 201 non-null float64 177 1976 201 non-null float64 178 1977 201 non-null float64 179 1978 201 non-null float64 180 1979 201 non-null float64 181 1980 201 non-null float64 182 1981 201 non-null float64 183 1982 201 non-null float64 184 1983 201 non-null float64 185 1984 201 non-null float64 186 1985 201 non-null float64 187 1986 201 non-null float64 188 1987 201 non-null float64 189 1988 201 non-null float64 190 1989 201 non-null float64 191 1990 201 non-null float64 192 1991 201 non-null float64 193 1992 201 non-null float64 194 1993 201 non-null float64 195 1994 201 non-null float64 196 1995 201 non-null float64 197 1996 201 non-null float64 198 1997 201 non-null float64 199 1998 201 non-null float64 200 1999 201 non-null float64 201 2000 201 non-null float64 202 2001 201 non-null float64 203 2002 201 non-null float64 204 2003 201 non-null float64 205 2004 201 non-null float64 206 2005 201 non-null float64 207 2006 201 non-null float64 208 2007 201 non-null float64 209 2008 201 non-null float64 210 2009 201 non-null float64 211 2010 201 non-null float64 212 2011 201 non-null float64 213 2012 201 non-null float64 214 2013 201 non-null float64 215 2014 201 non-null float64 216 2015 201 non-null float64 217 2016 201 non-null float64 218 2017 201 non-null float64 219 2018 201 non-null float64 220 2019 201 non-null float64 221 2020 201 non-null float64 222 2021 201 non-null float64 223 2022 201 non-null float64 224 2023 201 non-null float64 225 2024 201 non-null float64 226 2025 201 non-null float64 227 2026 201 non-null float64 228 2027 201 non-null float64 229 2028 201 non-null float64 230 2029 201 non-null float64 231 2030 201 non-null float64 232 2031 201 non-null float64 233 2032 201 non-null float64 234 2033 201 non-null float64 235 2034 201 non-null float64 236 2035 201 non-null float64 237 2036 201 non-null float64 238 2037 201 non-null float64 239 2038 201 non-null float64 240 2039 201 non-null float64 241 2040 201 non-null float64 242 2041 201 non-null float64 243 2042 201 non-null float64 244 2043 201 non-null float64 245 2044 201 non-null float64 246 2045 201 non-null float64 247 2046 201 non-null float64 248 2047 201 non-null float64 249 2048 201 non-null float64 250 2049 201 non-null float64 251 2050 201 non-null float64 252 2051 201 non-null float64 253 2052 201 non-null float64 254 2053 201 non-null float64 255 2054 201 non-null float64 256 2055 201 non-null float64 257 2056 201 non-null float64 258 2057 201 non-null float64 259 2058 201 non-null float64 260 2059 201 non-null float64 261 2060 201 non-null float64 262 2061 201 non-null float64 263 2062 201 non-null float64 264 2063 201 non-null float64 265 2064 201 non-null float64 266 2065 201 non-null float64 267 2066 201 non-null float64 268 2067 201 non-null float64 269 2068 201 non-null float64 270 2069 201 non-null float64 271 2070 201 non-null float64 272 2071 201 non-null float64 273 2072 201 non-null float64 274 2073 201 non-null float64 275 2074 201 non-null float64 276 2075 201 non-null float64 277 2076 201 non-null float64 278 2077 201 non-null float64 279 2078 201 non-null float64 280 2079 201 non-null float64 281 2080 201 non-null float64 282 2081 201 non-null float64 283 2082 201 non-null float64 284 2083 201 non-null float64 285 2084 201 non-null float64 286 2085 201 non-null float64 287 2086 201 non-null float64 288 2087 201 non-null float64 289 2088 201 non-null float64 290 2089 201 non-null float64 291 2090 201 non-null float64 292 2091 201 non-null float64 293 2092 201 non-null float64 294 2093 201 non-null float64 295 2094 201 non-null float64 296 2095 201 non-null float64 297 2096 201 non-null float64 298 2097 201 non-null float64 299 2098 201 non-null float64 300 2099 201 non-null float64 301 2100 201 non-null float64 dtypes: float64(301), object(1) memory usage: 475.8+ KB
Now there's data for all 201 countries remaining. This state of the dataset will be sufficient to answer the questions for fertility rate.
def countries_to_lowercase(country_name: str) -> str:
return country_name.strip().lower().replace(' ', '_').replace('-', '_')
fertility_df.country = fertility_df.country.apply(countries_to_lowercase)
fertility_df.head()
country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | aruba | 5.64 | 5.64 | 5.64 | 5.64 | 5.64 | 5.64 | 5.64 | 5.64 | 5.64 | ... | 1.82 | 1.82 | 1.82 | 1.82 | 1.82 | 1.82 | 1.82 | 1.82 | 1.83 | 1.83 |
1 | afghanistan | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | 7.00 | ... | 1.74 | 1.74 | 1.74 | 1.74 | 1.74 | 1.74 | 1.74 | 1.74 | 1.74 | 1.74 |
2 | angola | 6.93 | 6.93 | 6.93 | 6.93 | 6.93 | 6.93 | 6.93 | 6.94 | 6.94 | ... | 2.54 | 2.52 | 2.50 | 2.48 | 2.47 | 2.45 | 2.43 | 2.42 | 2.40 | 2.40 |
3 | albania | 4.60 | 4.60 | 4.60 | 4.60 | 4.60 | 4.60 | 4.60 | 4.60 | 4.60 | ... | 1.78 | 1.78 | 1.78 | 1.79 | 1.79 | 1.79 | 1.79 | 1.79 | 1.79 | 1.79 |
4 | netherlands_antilles | 5.80 | 5.80 | 5.80 | 5.80 | 5.80 | 5.80 | 5.80 | 5.80 | 5.80 | ... | 2.00 | 2.00 | 2.01 | 2.01 | 2.01 | 2.01 | 2.01 | 2.02 | 2.02 | 2.02 |
5 rows × 302 columns
The country names have been converted to lowercase, and the spaces and dashes in them have been converted to underscores to enable easier manipulation with the code.
First of all, the differences in number format need to be rectified by converting all the strings to integers.
from typing import Union
def has_million(value: Union[str, int]) -> bool:
return value.endswith('m') if isinstance(value, str) else False
def has_thousands(value: Union[str, int]) -> bool:
return value.endswith('k') if isinstance(value, str) else False
print('Number of cells containing millions formatted with "m" suffix:', sum([np.fromiter((val for val in income_per_person_df[col].unique() if has_million(val)), dtype='S128').size for col in income_per_person_df.columns[1:]]))
Number of cells containing millions formatted with "m" suffix: 0
print('Number of cells containing thousands formatted with "k" suffix:', sum([np.fromiter((val for val in income_per_person_df[col].unique() if has_thousands(val)), dtype='S128').size for col in income_per_person_df.columns[1:]]))
Number of cells containing thousands formatted with "k" suffix: 8087
So only values containing thousands need to be reformatted.
income_per_person_df.iloc[:,1:].dtypes.unique()
array([dtype('int64'), dtype('O')], dtype=object)
Those are the unique datatypes before reformatting, where dtype('O')
represents strings and dtype('int64')
represents integers.
for col in income_per_person_df.columns[1:]:
if income_per_person_df[col].dtype != int:
income_per_person_df[col] = income_per_person_df[col].apply(lambda x: int(float(x[:-1]) * 1000) if has_thousands(x) else int(x))
income_per_person_df.iloc[:, 1:].dtypes.unique()
array([dtype('int64')], dtype=object)
The loop above iterates over all the year columns and converts the values in the column to integers if they're currently strings. The final output after the loop shows that integer is the only unique data type now.
Next, the sub-regions need to be added to this dataset. Currently only countries are specified, so both the sub-region and country code will be added.
income_per_person_df.country = income_per_person_df.country.apply(countries_to_lowercase)
The country names have been converted to lowercase, and the spaces and dashes in them have been converted to underscores to enable easier manipulation with the code.
country_df = read_data('unsd-methodology.csv', sep=';')
country_df.head()
Global Code | Global Name | Region Code | Region Name | Sub-region Code | Sub-region Name | Intermediate Region Code | Intermediate Region Name | Country or Area | M49 Code | ISO-alpha2 Code | ISO-alpha3 Code | Least Developed Countries (LDC) | Land Locked Developing Countries (LLDC) | Small Island Developing States (SIDS) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | World | NaN | NaN | NaN | NaN | NaN | NaN | Antarctica | 10 | AQ | ATA | NaN | NaN | NaN |
1 | 1 | World | 2.0 | Africa | 15.0 | Northern Africa | NaN | NaN | Algeria | 12 | DZ | DZA | NaN | NaN | NaN |
2 | 1 | World | 2.0 | Africa | 15.0 | Northern Africa | NaN | NaN | Egypt | 818 | EG | EGY | NaN | NaN | NaN |
3 | 1 | World | 2.0 | Africa | 15.0 | Northern Africa | NaN | NaN | Libya | 434 | LY | LBY | NaN | NaN | NaN |
4 | 1 | World | 2.0 | Africa | 15.0 | Northern Africa | NaN | NaN | Morocco | 504 | MA | MAR | NaN | NaN | NaN |
To get the country sub-regions, a new dataset containing the country name, sub-region, and country code has been loaded.
country_df.rename(columns=lambda x: x.strip().lower().replace(' ', '_').replace('-', '_'), inplace=True)
country_df = country_df.loc[:,['sub_region_name', 'country_or_area', 'iso_alpha3_code']]
country_df.country_or_area = country_df.country_or_area.apply(countries_to_lowercase)
country_df.head()
sub_region_name | country_or_area | iso_alpha3_code | |
---|---|---|---|
0 | NaN | antarctica | ATA |
1 | Northern Africa | algeria | DZA |
2 | Northern Africa | egypt | EGY |
3 | Northern Africa | libya | LBY |
4 | Northern Africa | morocco | MAR |
The code above does the following in order:
These steps leave data that can be more easily merged with the income dataframe to add the needed sub-regions.
Before performing the merge, a check is needed to confirm which of the country names can be merged smoothly.
non_common_countries = (set(income_per_person_df.country) | set(country_df.country_or_area)) - (set(income_per_person_df.country) & set(country_df.country_or_area))
non_common_countries
{'american_samoa', 'anguilla', 'antarctica', 'aruba', 'bermuda', 'bolivia', 'bolivia_(plurinational_state_of)', 'bonaire,_sint_eustatius_and_saba', 'bouvet_island', 'british_indian_ocean_territory', 'british_virgin_islands', 'brunei', 'brunei_darussalam', 'cabo_verde', 'cape_verde', 'cayman_islands', 'china,_hong_kong_special_administrative_region', 'china,_macao_special_administrative_region', 'christmas_island', 'cocos_(keeling)_islands', 'congo', 'congo,_dem._rep.', 'congo,_rep.', 'cook_islands', "cote_d'ivoire", 'curaçao', 'czech_republic', 'czechia', 'côte_d’ivoire', "democratic_people's_republic_of_korea", 'democratic_republic_of_the_congo', 'falkland_islands_(malvinas)', 'faroe_islands', 'french_guiana', 'french_polynesia', 'french_southern_territories', 'gibraltar', 'greenland', 'guadeloupe', 'guam', 'guernsey', 'heard_island_and_mcdonald_islands', 'holy_see', 'hong_kong,_china', 'iran', 'iran_(islamic_republic_of)', 'isle_of_man', 'jersey', 'kyrgyz_republic', 'kyrgyzstan', 'lao', "lao_people's_democratic_republic", 'liechtenstein', 'martinique', 'mayotte', 'micronesia,_fed._sts.', 'micronesia_(federated_states_of)', 'moldova', 'montserrat', 'new_caledonia', 'niue', 'norfolk_island', 'north_korea', 'northern_mariana_islands', 'palestine', 'pitcairn', 'puerto_rico', 'republic_of_korea', 'republic_of_moldova', 'russia', 'russian_federation', 'réunion', 'saint_barthélemy', 'saint_helena', 'saint_kitts_and_nevis', 'saint_lucia', 'saint_martin_(french_part)', 'saint_pierre_and_miquelon', 'saint_vincent_and_the_grenadines', 'sark', 'sint_maarten_(dutch_part)', 'slovak_republic', 'slovakia', 'south_georgia_and_the_south_sandwich_islands', 'south_korea', 'st._kitts_and_nevis', 'st._lucia', 'st._vincent_and_the_grenadines', 'state_of_palestine', 'svalbard_and_jan_mayen_islands', 'syria', 'syrian_arab_republic', 'taiwan', 'tanzania', 'tokelau', 'turkey', 'turks_and_caicos_islands', 'türkiye', 'united_kingdom', 'united_kingdom_of_great_britain_and_northern_ireland', 'united_republic_of_tanzania', 'united_states', 'united_states_minor_outlying_islands', 'united_states_of_america', 'united_states_virgin_islands', 'venezuela', 'venezuela_(bolivarian_republic_of)', 'viet_nam', 'vietnam', 'wallis_and_futuna_islands', 'western_sahara', 'åland_islands'}
non_common_from_income = []
non_common_from_country = []
for country_name in non_common_countries:
if income_per_person_df.query(f'country == "{country_name}"').size > 0:
non_common_from_income.append(country_name)
else:
non_common_from_country.append(country_name)
print('Non common names from the country dataframe:\n', non_common_from_country, '\n\n')
print('Non common names from the income dataframe:\n', non_common_from_income)
Non common names from the country dataframe: ['saint_helena', 'aruba', 'french_guiana', 'state_of_palestine', 'czechia', 'montserrat', 'holy_see', 'united_states_minor_outlying_islands', 'british_indian_ocean_territory', 'mayotte', 'congo', 'bolivia_(plurinational_state_of)', 'democratic_republic_of_the_congo', 'russian_federation', 'cocos_(keeling)_islands', 'saint_lucia', 'isle_of_man', 'cabo_verde', 'south_georgia_and_the_south_sandwich_islands', 'viet_nam', 'micronesia_(federated_states_of)', 'western_sahara', "democratic_people's_republic_of_korea", 'united_republic_of_tanzania', 'wallis_and_futuna_islands', 'saint_martin_(french_part)', 'cook_islands', 'puerto_rico', 'saint_vincent_and_the_grenadines', 'réunion', 'american_samoa', 'sint_maarten_(dutch_part)', 'guernsey', 'saint_barthélemy', 'china,_macao_special_administrative_region', 'slovakia', 'brunei_darussalam', 'antarctica', 'tokelau', 'republic_of_korea', 'french_southern_territories', 'gibraltar', 'kyrgyzstan', 'new_caledonia', 'iran_(islamic_republic_of)', 'bermuda', 'falkland_islands_(malvinas)', 'saint_pierre_and_miquelon', 'united_kingdom_of_great_britain_and_northern_ireland', 'faroe_islands', 'svalbard_and_jan_mayen_islands', "lao_people's_democratic_republic", 'pitcairn', 'liechtenstein', 'british_virgin_islands', 'saint_kitts_and_nevis', 'christmas_island', 'republic_of_moldova', 'guadeloupe', 'greenland', 'bonaire,_sint_eustatius_and_saba', 'martinique', 'venezuela_(bolivarian_republic_of)', 'bouvet_island', 'united_states_of_america', 'united_states_virgin_islands', 'curaçao', 'syrian_arab_republic', 'åland_islands', 'norfolk_island', 'heard_island_and_mcdonald_islands', 'china,_hong_kong_special_administrative_region', 'türkiye', 'guam', 'cayman_islands', 'northern_mariana_islands', 'anguilla', 'turks_and_caicos_islands', 'jersey', 'niue', 'sark', 'côte_d’ivoire', 'french_polynesia'] Non common names from the income dataframe: ['palestine', 'st._vincent_and_the_grenadines', 'czech_republic', 'venezuela', 'turkey', 'moldova', 'vietnam', "cote_d'ivoire", 'kyrgyz_republic', 'st._kitts_and_nevis', 'slovak_republic', 'lao', 'russia', 'south_korea', 'st._lucia', 'taiwan', 'congo,_rep.', 'bolivia', 'cape_verde', 'iran', 'tanzania', 'syria', 'united_kingdom', 'congo,_dem._rep.', 'north_korea', 'hong_kong,_china', 'united_states', 'brunei', 'micronesia,_fed._sts.']
The above are the country names that are present in only one of the dataframes. After comparing both lists to identify slight spelling differences, the following map of the matching names was created:
different_spellings = {
'bolivia': 'bolivia_(plurinational_state_of)',
'brunei': 'brunei_darussalam',
'cape_verde': 'cabo_verde',
"cote_d'ivoire": "côte_d’ivoire",
'hong_kong,_china': 'china,_hong_kong_special_administrative_region',
'congo,_rep.': 'congo',
'congo,_dem._rep.': 'democratic_republic_of_the_congo',
'czech_republic': 'czechia',
'south_korea': 'republic_of_korea',
'north_korea': "democratic_people's_republic_of_korea",
'iran': 'iran_(islamic_republic_of)',
'kyrgyz_republic': 'kyrgyzstan',
'lao': "lao_people's_democratic_republic",
'micronesia,_fed._sts.': 'micronesia_(federated_states_of)',
'moldova': 'republic_of_moldova',
'russia': 'russian_federation',
'st._lucia': 'saint_lucia',
'st._kitts_and_nevis': 'saint_kitts_and_nevis',
'st._vincent_and_the_grenadines': 'saint_vincent_and_the_grenadines',
'syria': 'syrian_arab_republic',
'slovak_republic': 'slovakia',
'palestine': 'state_of_palestine',
'turkey': 'türkiye',
'united_kingdom': 'united_kingdom_of_great_britain_and_northern_ireland',
'tanzania': 'united_republic_of_tanzania',
'united_states': 'united_states_of_america',
'vietnam': 'viet_nam',
'venezuela': 'venezuela_(bolivarian_republic_of)',
}
Now the names need to be updated:
# reverse the keys and values to fit what we actually need
different_spellings = {val: key for key, val in different_spellings.items()}
# change the names in the countries df
country_df.country_or_area = country_df.country_or_area.apply(lambda x: different_spellings[x] if x in different_spellings else x)
Now the merge can proceed:
merged_income_country = pd.merge(left=income_per_person_df, right=country_df, left_on='country', right_on='country_or_area', how='outer')
merged_income_country.head()
country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2044 | 2045 | 2046 | 2047 | 2048 | 2049 | 2050 | sub_region_name | country_or_area | iso_alpha3_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | afghanistan | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | ... | 2870.0 | 2930.0 | 2990.0 | 3060.0 | 3120.0 | 3190.0 | 3260.0 | Southern Asia | afghanistan | AFG |
1 | angola | 700.0 | 702.0 | 705.0 | 709.0 | 711.0 | 714.0 | 718.0 | 721.0 | 725.0 | ... | 8530.0 | 8710.0 | 8900.0 | 9090.0 | 9280.0 | 9480.0 | 9690.0 | Sub-Saharan Africa | angola | AGO |
2 | albania | 755.0 | 755.0 | 755.0 | 755.0 | 755.0 | 756.0 | 756.0 | 756.0 | 756.0 | ... | 26700.0 | 27300.0 | 27900.0 | 28500.0 | 29100.0 | 29700.0 | 30400.0 | Southern Europe | albania | ALB |
3 | andorra | 1360.0 | 1360.0 | 1360.0 | 1360.0 | 1370.0 | 1370.0 | 1370.0 | 1370.0 | 1380.0 | ... | 73400.0 | 75000.0 | 76600.0 | 78300.0 | 80000.0 | 81700.0 | 83400.0 | Southern Europe | andorra | AND |
4 | united_arab_emirates | 1130.0 | 1130.0 | 1140.0 | 1140.0 | 1150.0 | 1150.0 | 1160.0 | 1160.0 | 1160.0 | ... | 107000.0 | 110000.0 | 112000.0 | 114000.0 | 117000.0 | 119000.0 | 122000.0 | Western Asia | united_arab_emirates | ARE |
5 rows × 255 columns
The merge is an outer merge to preserve the rows that don't merge smoothly.
It appears the country
and country_or_area
fields merged as expected.
missing_country_or_area = merged_income_country[merged_income_country.country_or_area.isna()].loc[:, ['country', 'country_or_area']]
missing_country_or_area.shape
(1, 2)
However it looks like one country wasn't merged successfully because it's only present in the income dataframe.
missing_country = merged_income_country[merged_income_country.country.isna()].loc[:, ['country', 'country_or_area']].sort_values(by='country_or_area')
missing_country.shape
(55, 2)
Similarly, 55 other countries were also not merged successfully because they're only present in the country dataframe.
rows_with_incomplete_country_info = pd.concat((missing_country, missing_country_or_area))
merged_income_country.drop(index=rows_with_incomplete_country_info.index, inplace=True)
print('Number of rows with missing "country" field:', merged_income_country.country.isna().sum())
print('Number of rows with missing "country_or_area" field:', merged_income_country.country_or_area.isna().sum())
Number of rows with missing "country" field: 0 Number of rows with missing "country_or_area" field: 0
The code above generates a list of the row numbers of countries that were not merged successfully, and then drops them from the dataframe.
Now the country_or_area
column can be dropped because it's effectively now a duplicate of the country
column.
merged_income_country.drop(columns='country_or_area', inplace=True)
Now the sub-regions can be reformatted to lowercase:
merged_income_country.sub_region_name = merged_income_country.sub_region_name.apply(countries_to_lowercase)
merged_income_country.head()
country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2043 | 2044 | 2045 | 2046 | 2047 | 2048 | 2049 | 2050 | sub_region_name | iso_alpha3_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | afghanistan | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | ... | 2810.0 | 2870.0 | 2930.0 | 2990.0 | 3060.0 | 3120.0 | 3190.0 | 3260.0 | southern_asia | AFG |
1 | angola | 700.0 | 702.0 | 705.0 | 709.0 | 711.0 | 714.0 | 718.0 | 721.0 | 725.0 | ... | 8350.0 | 8530.0 | 8710.0 | 8900.0 | 9090.0 | 9280.0 | 9480.0 | 9690.0 | sub_saharan_africa | AGO |
2 | albania | 755.0 | 755.0 | 755.0 | 755.0 | 755.0 | 756.0 | 756.0 | 756.0 | 756.0 | ... | 26200.0 | 26700.0 | 27300.0 | 27900.0 | 28500.0 | 29100.0 | 29700.0 | 30400.0 | southern_europe | ALB |
3 | andorra | 1360.0 | 1360.0 | 1360.0 | 1360.0 | 1370.0 | 1370.0 | 1370.0 | 1370.0 | 1380.0 | ... | 71900.0 | 73400.0 | 75000.0 | 76600.0 | 78300.0 | 80000.0 | 81700.0 | 83400.0 | southern_europe | AND |
4 | united_arab_emirates | 1130.0 | 1130.0 | 1140.0 | 1140.0 | 1150.0 | 1150.0 | 1160.0 | 1160.0 | 1160.0 | ... | 105000.0 | 107000.0 | 110000.0 | 112000.0 | 114000.0 | 117000.0 | 119000.0 | 122000.0 | western_asia | ARE |
5 rows × 254 columns
First, the country names need to be renamed as done for the other dataframes.
life_expectancy_df.country = life_expectancy_df.country.apply(countries_to_lowercase)
Now the null values can be removed.
life_expectancy_df.dropna(inplace=True)
life_expectancy_df.shape
(186, 302)
Next, the life expectancy dataframe also needs the sub-region column.
# first create a dataframe that only contains the country fields
country_name_region = merged_income_country.loc[:, ['country', 'sub_region_name', 'iso_alpha3_code']].rename(columns={'country':'tmp_country_name'})
# then merge it with the life expectancy dataframe
merged_life_exp_country_df = pd.merge(left=life_expectancy_df, right=country_name_region, left_on='country', right_on='tmp_country_name', how='inner')
merged_life_exp_country_df.head()
country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 | tmp_country_name | sub_region_name | iso_alpha3_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | afghanistan | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.1 | 28.1 | 28.1 | ... | 76.0 | 76.1 | 76.2 | 76.4 | 76.5 | 76.6 | 76.8 | afghanistan | southern_asia | AFG |
1 | angola | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | ... | 79.2 | 79.3 | 79.5 | 79.6 | 79.7 | 79.9 | 80.0 | angola | sub_saharan_africa | AGO |
2 | albania | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | ... | 87.7 | 87.8 | 87.9 | 88.0 | 88.2 | 88.3 | 88.4 | albania | southern_europe | ALB |
3 | united_arab_emirates | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | ... | 82.7 | 82.8 | 82.9 | 83.0 | 83.1 | 83.2 | 83.3 | united_arab_emirates | western_asia | ARE |
4 | argentina | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | ... | 86.5 | 86.7 | 86.8 | 86.9 | 87.0 | 87.1 | 87.2 | argentina | latin_america_and_the_caribbean | ARG |
5 rows × 305 columns
The code above first renames the country
field in the temporary dataframe to avoid conflicts with the life expectancy dataframe, then an inner merge is performed to make sure that only the rows in the life expectancy dataframe are preserved.
merged_life_exp_country_df.drop(columns='tmp_country_name', inplace=True)
The tmp_country_name
column has been removed.
merged_life_exp_country_df.head()
country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 | sub_region_name | iso_alpha3_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | afghanistan | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.1 | 28.1 | 28.1 | ... | 75.8 | 76.0 | 76.1 | 76.2 | 76.4 | 76.5 | 76.6 | 76.8 | southern_asia | AFG |
1 | angola | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | ... | 79.1 | 79.2 | 79.3 | 79.5 | 79.6 | 79.7 | 79.9 | 80.0 | sub_saharan_africa | AGO |
2 | albania | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | ... | 87.6 | 87.7 | 87.8 | 87.9 | 88.0 | 88.2 | 88.3 | 88.4 | southern_europe | ALB |
3 | united_arab_emirates | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | ... | 82.6 | 82.7 | 82.8 | 82.9 | 83.0 | 83.1 | 83.2 | 83.3 | western_asia | ARE |
4 | argentina | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | ... | 86.5 | 86.5 | 86.7 | 86.8 | 86.9 | 87.0 | 87.1 | 87.2 | latin_america_and_the_caribbean | ARG |
5 rows × 304 columns
merged_life_exp_country_df.shape
(185, 304)
This shows that one country was not successfully merged because the previous row count was 186
.
merged_life_exp_country_df.isna().sum().sum()
0
This confirms that there's no null value left in this merged dataframe.
To answer the final question (looking for a relationship between life expectancy and income per capita), the data of the corresponding dataframes need to be merged into 1 dataframe.
alt_merged_income_country_df = merged_income_country.rename(columns=lambda x: f'income_{x}')
alt_merged_income_country_df.columns
Index(['income_country', 'income_1800', 'income_1801', 'income_1802', 'income_1803', 'income_1804', 'income_1805', 'income_1806', 'income_1807', 'income_1808', ... 'income_2043', 'income_2044', 'income_2045', 'income_2046', 'income_2047', 'income_2048', 'income_2049', 'income_2050', 'income_sub_region_name', 'income_iso_alpha3_code'], dtype='object', length=254)
The code above creates a new dataframe where the column names from the income dataframe all have income_
prepended to them, to distinguish them from those of the life expectancy dataframe.
merged_income_life_exp_df = pd.merge(left=merged_life_exp_country_df, right=alt_merged_income_country_df, left_on='country', right_on='income_country', how='inner')
merged_income_life_exp_df.head()
country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | income_2043 | income_2044 | income_2045 | income_2046 | income_2047 | income_2048 | income_2049 | income_2050 | income_sub_region_name | income_iso_alpha3_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | afghanistan | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.2 | 28.1 | 28.1 | 28.1 | ... | 2810.0 | 2870.0 | 2930.0 | 2990.0 | 3060.0 | 3120.0 | 3190.0 | 3260.0 | southern_asia | AFG |
1 | angola | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | ... | 8350.0 | 8530.0 | 8710.0 | 8900.0 | 9090.0 | 9280.0 | 9480.0 | 9690.0 | sub_saharan_africa | AGO |
2 | albania | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | 35.4 | ... | 26200.0 | 26700.0 | 27300.0 | 27900.0 | 28500.0 | 29100.0 | 29700.0 | 30400.0 | southern_europe | ALB |
3 | united_arab_emirates | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | 30.7 | ... | 105000.0 | 107000.0 | 110000.0 | 112000.0 | 114000.0 | 117000.0 | 119000.0 | 122000.0 | western_asia | ARE |
4 | argentina | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | 33.2 | ... | 31800.0 | 32500.0 | 33200.0 | 33900.0 | 34700.0 | 35400.0 | 36200.0 | 36900.0 | latin_america_and_the_caribbean | ARG |
5 rows × 558 columns
Next, the 2 dataframes are merged to create a new one where the countries are shared and the column names indicate which cells represent income data. The merge is an inner merge to make sure that only rows of matching countries are preserved.
merged_income_life_exp_df.shape
(185, 558)
The number of rows did not change, so this means all rows were successfully merged.
current_year = '2022'
last_20_years = np.arange(int(current_year) - 19, int(current_year) + 1)
fertility_df.loc[:,str(int(last_20_years[0])):current_year].boxplot()
plt.title('Distribution of fertility rates for the last 20 years')
plt.xlabel('Year')
plt.ylabel('Fertility (babies per woman)')
plt.xticks(rotation='vertical');
The chart above shows some key details on the distribution of fertility rates over the last twenty years. The green bar in the middle of each box represents the median fertility rate for that year, which appears to have been slowly reducing over the last 20 years. The bars connected to lines above and below the each box represent the range that should contain the majority of the data, so the values outside that range (the empty circles) represent outliers. There was at least one outlier in each of the last 20 years. Interestingly, the range appears to be contracting during this period, with the lowest birthrate countries increasing their fertility rates and vice versa for the highest birthrate countries. The data is skewed right throughout the period, which indicates that the majority of countries are closer to the lower end of the range than to the higher end. The charts below can be used to better visualize the skew:
for year in last_20_years:
values = fertility_df[str(year)]
values.hist(density=True, bins=math.floor(values.max()), figsize=(5,5))
plt.xlabel('Birth rate (babies per woman)')
plt.ylabel('Ratio of total countries with birth rate')
plt.title(f'Distribution of national birthrates for {year}')
plt.show();
Each bar is intended to roughly approximate the increment of each additional baby to the fertility rate. The charts show that the 2 leftmost bars (corresponding to the lowest recorded birth rates) have consistently contained the birthrates for the highest percentage of countries.
period_size = 50
period_buckets = [
(str(int(current_year) - period_size * 1), current_year),
(str(int(current_year) - period_size * 2), str(int(current_year) - period_size * 1)),
(str(int(current_year) - period_size * 3), str(int(current_year) - period_size * 2)),
(str(int(current_year) - period_size * 4), str(int(current_year) - period_size * 3)),
]
period_buckets
[('1972', '2022'), ('1922', '1972'), ('1872', '1922'), ('1822', '1872')]
The code above creates 4 buckets with a size of 50 years each, starting from the current year and going back 200 years.
for period_start, period_end in period_buckets:
period_start = str(int(period_start) + 1)
merged_income_country.groupby('sub_region_name').mean(numeric_only=True).loc[:,period_start:period_end].boxplot(figsize=(10,10))
plt.ylabel('Per Capita Income (inflation-adjusted USD)')
plt.xlabel('Year')
plt.title(f'Distribution of Per Capita Income from {period_start} to {period_end}')
plt.xticks(rotation='vertical')
plt.show();
The charts above show the distribution of per capita income over the last 200 years, in 50 year buckets. One striking observation is that the number of outliers (the empty circles) has reduced as the years go by. That would seem to suggest that the world's wealth has become more evenly distributed over this period. Another interesting observation is that the range that should contain most values (the top and bottom dashes above and below each box) has grown noticeably beginning from about 1950. That suggests a widening gap between the richest and poorest countries, even though overall rates have been on the rise throughout that period.
for period_start, period_end in period_buckets:
period_start = str(int(period_start) + 1)
merged_income_life_exp_df.loc[:,period_start:period_end].boxplot(figsize=(10,10))
plt.ylabel('Life expectancy(years)')
plt.xlabel('Year')
plt.title(f'Distribution of Life Expectancy from {period_start} to {period_end}')
plt.xticks(rotation='vertical')
plt.show();
The charts above show the distribution of life expectancy over the past 200 years, in 50 year buckets. The first striking observation is the relative flatness of life expectancy over the first 100 years of the period, followed by a sizeable rise (more than a doubling for the highest value!) in the second 100 years of the period. The second striking observation is that the number of outliers also drops noticeably in the second 100 years, suggesting that in recent years the majority of countries have benefitted from the driver's of higher life expectancy.
current_fertility = fertility_df[current_year]
current_fertility = current_fertility.set_axis(fertility_df.country, copy=False)
current_fertility = current_fertility.sort_values(ascending=False)
current_fertility.head()
country niger 6.86 somalia 5.71 congo,_dem._rep. 5.52 mali 5.51 chad 5.36 Name: 2022, dtype: float64
current_fertility_highest = current_fertility[:10]
plt.bar(x=np.arange(10), height=current_fertility_highest)
plt.xticks(np.arange(10), current_fertility_highest.keys(), rotation='vertical')
for i, val in enumerate(current_fertility_highest):
plt.text(i, val * 1.05, f'{val:.1f}', horizontalalignment='center', rotation='vertical')
plt.title(f'10 highest fertility rates in the world by country for the year {current_year}')
plt.ylim(0, current_fertility_highest.max() * 1.2)
plt.xlabel('Country')
plt.ylabel('Fertility (babies per woman)')
plt.show();
The chart above indicates high population growth rates in some countries, with 5 babies per woman being the smallest rate of the top 10 countries.
current_fertility_lowest = current_fertility[-10:]
current_fertility_lowest.sort_values(ascending=True, inplace=True)
plt.bar(x=np.arange(10), height=current_fertility_lowest)
plt.xticks(np.arange(10), current_fertility_lowest.keys(), rotation='vertical')
for i, val in enumerate(current_fertility_lowest):
plt.text(i, val * 1.05, f'{val:.1f}', horizontalalignment='center', rotation='vertical')
plt.title(f'10 lowest fertility rates in the world by country for the year {current_year}')
plt.ylim(0, current_fertility_lowest.max() * 1.2)
plt.xlabel('Country')
plt.ylabel('Fertility (babies per woman)')
plt.show();
The chart above indicates that some countries have declining populations. 2.1
is the approximate threshold needed to maintain a population's size, but all the countries shown in the chart have population growth rates of 1.4 or less.
fertility_df_highest_current = fertility_df.query('country in ["' + '","'.join(list(current_fertility_highest.keys())) + '"]')
fertility_df_highest_current = fertility_df_highest_current.loc[:, ['country'] + [str(year) for year in last_20_years]]
fertility_df_highest_current
country | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | ... | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | angola | 6.54 | 6.50 | 6.47 | 6.42 | 6.37 | 6.31 | 6.24 | 6.16 | 6.08 | ... | 5.92 | 5.84 | 5.77 | 5.69 | 5.62 | 5.55 | 5.48 | 5.41 | 5.34 | 5.28 |
12 | burundi | 6.82 | 6.76 | 6.69 | 6.62 | 6.53 | 6.45 | 6.35 | 6.26 | 6.16 | ... | 5.96 | 5.87 | 5.78 | 5.70 | 5.62 | 5.53 | 5.45 | 5.38 | 5.30 | 5.22 |
37 | congo,_dem._rep. | 6.81 | 6.76 | 6.71 | 6.66 | 6.61 | 6.57 | 6.52 | 6.47 | 6.42 | ... | 6.30 | 6.24 | 6.17 | 6.09 | 6.02 | 5.92 | 5.82 | 5.72 | 5.62 | 5.52 |
68 | gambia | 5.88 | 5.86 | 5.84 | 5.81 | 5.79 | 5.76 | 5.74 | 5.71 | 5.68 | ... | 5.60 | 5.54 | 5.49 | 5.43 | 5.36 | 5.29 | 5.22 | 5.14 | 5.07 | 4.99 |
118 | mali | 6.84 | 6.82 | 6.79 | 6.76 | 6.72 | 6.67 | 6.61 | 6.55 | 6.47 | ... | 6.32 | 6.23 | 6.15 | 6.06 | 5.97 | 5.88 | 5.79 | 5.70 | 5.60 | 5.51 |
132 | niger | 7.64 | 7.63 | 7.61 | 7.59 | 7.57 | 7.54 | 7.52 | 7.49 | 7.46 | ... | 7.38 | 7.34 | 7.29 | 7.24 | 7.18 | 7.13 | 7.07 | 7.00 | 6.93 | 6.86 |
133 | nigeria | 6.04 | 6.01 | 5.99 | 5.96 | 5.93 | 5.90 | 5.87 | 5.84 | 5.80 | ... | 5.71 | 5.65 | 5.59 | 5.53 | 5.46 | 5.39 | 5.32 | 5.25 | 5.18 | 5.11 |
164 | somalia | 7.45 | 7.38 | 7.31 | 7.23 | 7.14 | 7.06 | 6.96 | 6.87 | 6.77 | ... | 6.56 | 6.46 | 6.37 | 6.27 | 6.17 | 6.08 | 5.98 | 5.89 | 5.80 | 5.71 |
175 | chad | 7.21 | 7.15 | 7.07 | 6.99 | 6.90 | 6.80 | 6.70 | 6.59 | 6.48 | ... | 6.26 | 6.16 | 6.05 | 5.95 | 5.85 | 5.75 | 5.65 | 5.55 | 5.45 | 5.36 |
187 | uganda | 6.72 | 6.65 | 6.58 | 6.51 | 6.42 | 6.34 | 6.25 | 6.15 | 6.06 | ... | 5.87 | 5.78 | 5.68 | 5.59 | 5.50 | 5.41 | 5.32 | 5.24 | 5.15 | 5.06 |
10 rows × 21 columns
First, the fertility rates of the last 20 years are collected for the countries with the highest current fertility rates.
fertility_df_highest_current.transpose().iloc[1:,:].plot(figsize=(8,8)).legend(fertility_df_highest_current.country)
plt.title(f'20 year trend of fertility rates for the countries with the highest rates in {current_year}')
plt.xlabel('Year')
plt.ylabel('Fertility rate (babies per woman)');
The chart above suggests that fertility rates have recently been dropping for all the countries. Note that the dataframe had to be transposed (i.e. the axes had to be flipped) in order to generate this chart.
fertility_df_lowest_current = fertility_df.query('country in ["' + '","'.join(list(current_fertility_lowest.keys())) + '"]')
fertility_df_lowest_current = fertility_df_lowest_current.loc[:, ['country'] + [str(year) for year in range(int(current_year) - 19, int(current_year) + 1)]]
fertility_df_lowest_current
country | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | ... | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
44 | cyprus | 1.58 | 1.55 | 1.52 | 1.50 | 1.48 | 1.46 | 1.44 | 1.42 | 1.41 | ... | 1.37 | 1.36 | 1.35 | 1.34 | 1.34 | 1.34 | 1.33 | 1.33 | 1.34 | 1.34 |
71 | greece | 1.36 | 1.38 | 1.40 | 1.42 | 1.44 | 1.44 | 1.43 | 1.41 | 1.39 | ... | 1.35 | 1.33 | 1.31 | 1.31 | 1.30 | 1.30 | 1.30 | 1.30 | 1.30 | 1.31 |
78 | hong_kong,_china | 0.95 | 0.96 | 0.97 | 0.99 | 1.02 | 1.05 | 1.08 | 1.12 | 1.15 | ... | 1.22 | 1.24 | 1.27 | 1.30 | 1.32 | 1.34 | 1.36 | 1.38 | 1.40 | 1.42 |
99 | south_korea | 1.21 | 1.18 | 1.17 | 1.16 | 1.16 | 1.17 | 1.18 | 1.19 | 1.21 | ... | 1.24 | 1.26 | 1.28 | 1.30 | 1.31 | 1.33 | 1.35 | 1.36 | 1.38 | 1.39 |
105 | st._lucia | 1.83 | 1.74 | 1.68 | 1.63 | 1.60 | 1.57 | 1.56 | 1.54 | 1.52 | ... | 1.50 | 1.48 | 1.47 | 1.46 | 1.45 | 1.44 | 1.43 | 1.42 | 1.42 | 1.41 |
113 | moldova | 1.25 | 1.22 | 1.22 | 1.23 | 1.24 | 1.26 | 1.27 | 1.27 | 1.27 | ... | 1.26 | 1.26 | 1.25 | 1.24 | 1.23 | 1.23 | 1.23 | 1.23 | 1.23 | 1.24 |
145 | poland | 1.28 | 1.28 | 1.30 | 1.32 | 1.34 | 1.36 | 1.37 | 1.37 | 1.36 | ... | 1.33 | 1.32 | 1.31 | 1.30 | 1.29 | 1.29 | 1.29 | 1.29 | 1.29 | 1.29 |
148 | portugal | 1.44 | 1.43 | 1.42 | 1.40 | 1.38 | 1.36 | 1.34 | 1.33 | 1.31 | ... | 1.28 | 1.26 | 1.25 | 1.25 | 1.24 | 1.24 | 1.24 | 1.24 | 1.24 | 1.25 |
160 | singapore | 1.27 | 1.26 | 1.26 | 1.28 | 1.29 | 1.28 | 1.26 | 1.26 | 1.26 | ... | 1.25 | 1.25 | 1.24 | 1.25 | 1.25 | 1.26 | 1.27 | 1.27 | 1.28 | 1.28 |
185 | taiwan | 1.24 | 1.18 | 1.12 | 1.12 | 1.10 | 1.05 | 1.03 | 0.90 | 1.07 | ... | 1.07 | 1.15 | 1.17 | 1.19 | 1.21 | 1.22 | 1.24 | 1.25 | 1.27 | 1.28 |
10 rows × 21 columns
# set ylim to a range of 2.5 for uniformity with the plot for highest fertility countries
fertility_df_lowest_current.transpose().iloc[1:,:].plot(figsize=(8,8), ylim=(0.5, 3)).legend(fertility_df_lowest_current.country)
plt.title(f'20 year trend of fertility rates for the countries with the lowest rates in {current_year}')
plt.xlabel('Year')
plt.ylabel('Fertility rate (babies per woman)');
The chart above suggests a mix of trends for the countries with the current lowest fertility rates. Most notably, St. Lucia and Cyprus appear to still have droppping rates whereas Hong Kong's rate appears to be on the rise.
current_richest_regions = merged_income_country.groupby('sub_region_name').mean(numeric_only=True)[current_year].sort_values(ascending=False)
current_richest_regions_ticks = np.arange(len(current_richest_regions))
plt.bar(x=current_richest_regions_ticks, height=current_richest_regions)
plt.xticks(current_richest_regions_ticks, current_richest_regions.index, rotation='vertical')
for i, val in enumerate(current_richest_regions):
plt.text(i, val + 750, f'{val:.1f}', rotation='vertical', horizontalalignment='center', verticalalignment='bottom')
plt.ylim(0, current_richest_regions.max() * 1.2)
plt.title(f'Per capita income in worldwide sub-regions for {current_year}')
plt.xlabel('Sub-region')
plt.ylabel('Per capita income (inflation adjusted USD)');
The chart above suggests that the world's richest societies are concentrated in the area colloquially known as the "West" i.e. Europe, North America, Australia and New Zealand. Sub-Saharan Africa and island nations around Australia (Polynesia, Micronesia, and Melanesia) have by far the poorest societies, with Asia and the Middle East occupying the middle of the range alongside Latin America.
for period_start, period_end in period_buckets:
period_start = str(int(period_start) + 1)
period_mean_income_per_sub_region = merged_income_country.groupby('sub_region_name').mean(numeric_only=True).loc[:, period_start:period_end].mean(axis=1).sort_values(ascending=False)
bar_ticks = np.arange(len(period_mean_income_per_sub_region))
plt.bar(x=bar_ticks, height=period_mean_income_per_sub_region)
plt.xticks(bar_ticks, period_mean_income_per_sub_region.index, rotation='vertical')
for i, val in enumerate(period_mean_income_per_sub_region):
# add 2% of the value for some space above the top of the bar
plt.text(i, val * 1.02, f'{val:.1f}', verticalalignment='bottom', horizontalalignment='center', rotation='vertical')
# add 20% of the max to the y scale so the highest text label doesn't overflow the top of the chart
plt.ylim(0, period_mean_income_per_sub_region.max() * 1.2)
plt.title(f'Per capita income in worldwide sub-regions between {period_start} and {period_end}')
plt.xlabel('Sub-region')
plt.ylabel('Per capita income (inflation adjusted USD)')
plt.figure(figsize=(20,20))
plt.show();
<Figure size 2000x2000 with 0 Axes>
<Figure size 2000x2000 with 0 Axes>
<Figure size 2000x2000 with 0 Axes>
<Figure size 2000x2000 with 0 Axes>
The charts above indicate that the composition of the extremes have not changed much in the last 200 years. The "West" has always recorded the highest per capita income during that period, and Sub-Saharan Africa and Melanesia have recorded the lowest. However, some interesting changes occurred over that period.
Micronesia has experienced quite a swing in its fortunes compared to other regions. For the oldest and second oldest buckets (1823 - 1872 and 1873 - 1922, respectively) it was the second poorest and poorest region, respectively. However, from 1923 to 1972 it shot up to the top of the mid range, behind only the "West" and western Asia. The countries within Micronesia are shown below:
print(merged_income_country[merged_income_country.sub_region_name == 'micronesia'].country.unique())
['micronesia,_fed._sts.' 'kiribati' 'marshall_islands' 'nauru' 'palau']
A decent history lesson is likely necessary to fully investigate the reasons for such an impressive jump, nonetheless it's certainly noteworthy! More so, further investigation is also needed to determine the spread of that newfound wealth within the population of that region during that period.
Polynesia appears to have experienced the reverse change in relative fortunes. In the oldest bucket Polynesia was firmly in the middle of the pack, but it steadily fell lower until it became the second poorest region in the most recent bucket, ahead of only Sub-Saharan Africa. The countries in Polynesia are:
print(merged_income_country[merged_income_country.sub_region_name == 'polynesia'].country.unique())
['tonga' 'tuvalu' 'samoa']
Similar to Micronesia, detecting the full story of what happened here is beyond the scope of this exploration.
current_highest_life_exp = merged_life_exp_country_df.groupby('sub_region_name').mean(numeric_only=True).loc[:, current_year].sort_values(ascending=False)
current_highest_life_exp_ticks = np.arange(len(current_highest_life_exp))
plt.bar(x=current_highest_life_exp_ticks, height=current_highest_life_exp)
plt.xticks(current_highest_life_exp_ticks, labels=current_highest_life_exp.index, rotation='vertical')
plt.ylim((0, current_highest_life_exp.max() * 1.2))
for i, val in enumerate(current_highest_life_exp):
plt.text(i, val * 1.02, f'{val:.1f}', verticalalignment='bottom', horizontalalignment='center', rotation='vertical')
plt.title(f"Life expectancy of the world's sub-regions for {current_year}")
plt.xlabel('Sub-region')
plt.ylabel('Life expectancy (years)')
plt.show();
The chart above suggests that the distribution of life expectancy follows a similar pattern to what was seen with per capita income. The "West" has the highest life expectancy, and Sub-Saharan Africa, Melanesia and Micronesia have the lowest.
income_life_exp_period_size = 20
income_life_exp_periods = [
(str(int(current_year) - income_life_exp_period_size * 1), current_year),
(str(int(current_year) - income_life_exp_period_size * 2), str(int(current_year) - income_life_exp_period_size * 1)),
(str(int(current_year) - income_life_exp_period_size * 3), str(int(current_year) - income_life_exp_period_size * 2)),
(str(int(current_year) - income_life_exp_period_size * 4), str(int(current_year) - income_life_exp_period_size * 3)),
(str(int(current_year) - income_life_exp_period_size * 5), str(int(current_year) - income_life_exp_period_size * 4)),
]
income_life_exp_periods
[('2002', '2022'), ('1982', '2002'), ('1962', '1982'), ('1942', '1962'), ('1922', '1942')]
The code above creates 5 buckets with a size of 20 years each, starting from the current year and going back 100 years.
for period_start, period_end in income_life_exp_periods:
period_start = str(int(period_start) + 1)
life_exp_cols = [str(num) for num in np.arange(int(period_start), int(period_end) + 1)]
income_cols = [f'income_{str(num)}' for num in np.arange(int(period_start), int(period_end) + 1)]
mean_period_life_exp_df = merged_income_life_exp_df.loc[:, life_exp_cols].mean(axis=1)
mean_period_income_df = merged_income_life_exp_df.loc[:, income_cols].mean(axis=1)
plt.scatter(mean_period_income_df, mean_period_life_exp_df);
plt.title(f'Life expectancy vs per capita income from {period_start} to {period_end}')
plt.xlabel('Per capita income (inflation-adjusted USD)')
plt.ylabel('Life expectancy (years)')
lin_reg_1_deg = np.polyfit(mean_period_income_df, mean_period_life_exp_df, 1)
poly_1_deg = np.poly1d(lin_reg_1_deg)
predicted_1_deg = poly_1_deg(mean_period_income_df)
corr_matrix_1_deg = np.corrcoef(mean_period_life_exp_df, predicted_1_deg)
coef_1_deg = corr_matrix_1_deg[0,1]**2
plt.plot(mean_period_income_df, predicted_1_deg, 'r--')
plt.text(mean_period_income_df.max() * 0.8, mean_period_life_exp_df.max() * 1.05, f'r_sq={coef_1_deg:.6f}')
plt.show();
Looking at the R-square values in the charts above, it appears that there was never a strong linear correlation between per capita income and life expectancy during the last 100 years. In fact, in all the charts the highest life expectancy occurred around the middle of the range of income values. However, by visually inspecting the charts, it does seem as though a strong linear relationship exists at lower levels of per capita income, but the effect diminishes beyond a certain income level.
Interestingly, the charts also indicate that there are some notable outliers with income levels far above those of the other countries. In particular, the periods 1963-1982 and 2003-2022 each have a country with income levels above $100k per capita. That's very high, particularly from 1963 to 1982! What countries could those possibly be?
period_cols_1963_1982 = ['country'] + [f'income_{str(num)}' for num in np.arange(1963, 1983)]
period_income_life_exp_df_1963_1982 = merged_income_life_exp_df.loc[:, period_cols_1963_1982]
period_income_life_exp_df_1963_1982[period_income_life_exp_df_1963_1982.iloc[:, 1:].mean(axis=1) > 100000]
country | income_1963 | income_1964 | income_1965 | income_1966 | income_1967 | income_1968 | income_1969 | income_1970 | income_1971 | ... | income_1973 | income_1974 | income_1975 | income_1976 | income_1977 | income_1978 | income_1979 | income_1980 | income_1981 | income_1982 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
138 | qatar | 32700.0 | 40800.0 | 50900.0 | 63500.0 | 79100.0 | 98700.0 | 123000.0 | 153000.0 | 154000.0 | ... | 157000.0 | 153000.0 | 150000.0 | 153000.0 | 134000.0 | 137000.0 | 134000.0 | 121000.0 | 109000.0 | 90500.0 |
1 rows × 21 columns
For the 1963-1982 period, Qatar was the country! Some quick research reveals that this period coincided with the discovery of oil reserves in Qatar, so perhaps that boost in national revenue being applied to a small population was responsible for such a high number.
period_cols_2003_2022 = ['country'] + [f'income_{str(num)}' for num in np.arange(2003, 2023)]
period_income_life_exp_df_2003_2022 = merged_income_life_exp_df.loc[:, period_cols_2003_2022]
period_income_life_exp_df_2003_2022[period_income_life_exp_df_2003_2022.iloc[:, 1:].mean(axis=1) > 100000]
country | income_2003 | income_2004 | income_2005 | income_2006 | income_2007 | income_2008 | income_2009 | income_2010 | income_2011 | ... | income_2013 | income_2014 | income_2015 | income_2016 | income_2017 | income_2018 | income_2019 | income_2020 | income_2021 | income_2022 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
101 | luxembourg | 105000.0 | 108000.0 | 109000.0 | 113000.0 | 121000.0 | 118000.0 | 112000.0 | 114000.0 | 113000.0 | ... | 113000.0 | 113000.0 | 113000.0 | 116000.0 | 115000.0 | 115000.0 | 117000.0 | 112000.0 | 118000.0 | 119000.0 |
1 rows × 21 columns
For the 2003-2022 period, Luxembourg was the country! Some quick research reveals that Luxembourg has a large share of workers who work across borders and therefore are not considered residents even though their income contributes to the country's GDP (read here). In other words, the per capita income calculation unintentionally assigns the income of those cross-border workers to the remaining citizens who're full-time residents of the country.
This exploration has revealed some interesting trends in the levels of fertility, income, and life expectancy in different parts of the world. In summary:
It must be restated that this exploration was not intended to imply causation between any of the examined metrics. Any trends shown here are at best correlation rather than causation.