Exploratory Data Analysis

Note

See the beginning of this chapter to load the required libraries and the data.

A quick way to check the data types is as follows:

glimpse(dt)
Rows: 1,200
Columns: 10
$ black   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
$ educ    <int> 13, 14, 18, 13, 13, 16, 16, 18, 21, 14, 18, 12, 12, 12, 16, 16, 16, 16, 20, 14, 12, 12, 14, 12, 12, 12, 8, 12, 12, 14, 18, 16, 12, 18, 18, 18, 14, 13, 12, 12, 12, 16, 12, 12, 12, 12, 12, 13, 13, 13, 12, 12, 12, 16, 13, 16, 16, 12, 20, 16, 14, 16, 12, 14, 12, 16, 12, 21, 16, 18, 18, 13, 16, 12, 13, 21, 12, 14, 18, 12, 12, 12, 10, 12, 18, 16, 16, 16, 20, 12, 12, 12, 18, 18, 18, 11, 13, 16, 14, 12, 18, 16, 16, 16, 12, 16, 13, 16, 13, 12, 16, 16, 8, 18, 13, 18, 12, 16, 13, 16, 16, 16, 14, 18, 18, 16, 12, 13, 12, 13, 16, 13, 12, 12, 16, 21, 12, 16, 13, 13, 16, 13, 18, 13, 14, 16, 13, 9, 18, 16, 12, 12, 20, 12, 12, 16, 12, 18, 13, 16, 13, 12, 13, 13, 5, 13, 14, 11, 16, 12, 20, 5, 18, 14, 16, 13, 12, 18, 12, 18, 12, 14, 14, 16, 18, 14, 11, 10, 16, 12, 13, 13, 12, 21, 12, 13, 16, 16, 12, 12, 14, 14, 18, 12, 16, 12, 13, 16, 18, 13, 18, 12, 13, 21, 16, 13, 12, 9, 14, 16, 13, 16, 14, 12, 12, 12, 13, 16, 12, 16, 13, 12, 12, 12, 16, 16, 14, 12, 13, 18, 12, 16, 18, 16, 18, 13, 12,…
$ exper   <int> 45, 25, 27, 42, 41, 26, 11, 15, 32, 12, 8, 40, 43, 23, 19, 33, 20, 30, 0, 5, 37, 32, 5, 5, 20, 7, 41, 32, 24, 20, 36, 7, 40, 29, 3, 13, 28, 10, 42, 37, 49, 20, 46, 43, 47, 47, 26, 21, 33, 6, 36, 39, 25, 22, 34, 16, 4, 13, 35, 12, 10, 20, 26, 17, 25, 26, 10, 13, 4, 14, 6, 11, 46, 7, 8, 25, 47, 5, 49, 20, 29, 10, 52, 22, 25, 5, 30, 0, 27, 23, 39, 8, 19, 11, 22, 25, 40, 6, 35, 39, 28, 45, 41, 30, 24, 30, 2, 6, 35, 49, 10, 15, 26, 20, 41, 14, 19, 22, 34, 4, 22, 15, 7, 5, 21, 36, 19, 29, 0, 16, 52, 4, 27, 31, 1, 21, 43, 35, 37, 5, 11, 8, 19, 29, 38, 17, 13, 36, 35, 31, 22, 29, 29, 53, 36, 5, 30, 40, 35, 34, 44, 37, 30, 33, 46, 30, 45, 43, 36, 21, 12, 47, 34, 3, 5, 32, 46, 1, 35, 15, 41, 12, 31, 20, 40, 38, 13, 13, 29, 14, 24, 46, 12, 33, 31, 8, 11, 4, 26, 39, 17, 27, 14, 33, 7, 27, 4, 24, 19, 3, 24, 13, 25, 7, 23, 3, 20, 9, 21, 12, 7, 28, 28, 42, 42, 12, 14, 38, 39, 27, 28, 10, 17, 32, 7, 11, 19, 25, 14, 33, 33, 24, 50, 6, 12, 16, 5, 25, 22, 14, 20, 32, 13, 45, 36, 38, 33…
$ faminc  <int> 0, 45351, 91946, 48370, 10000, 151308, 110461, 0, 67084, 14000, 0, 25000, 0, 0, 50808, 63169, 162, 134370, 0, 77280, 80337, 23472, 18200, 86400, 0, 0, 0, 30669, 0, 0, 49400, 20400, 50000, 2395, 0, 0, 0, 17544, 0, 50169, 9000, 10000, 45993, 83258, 24019, 51475, 19906, 9904, 95000, 40002, 0, 0, 15100, 530, 5000, 25000, 30000, 3090, 0, 33162, 0, 58540, 5000, 0, 20000, 0, 34000, 0, 115500, 66000, 0, 0, 50500, 7700, 30000, 96056, 10799, 20000, 9599, 0, 32000, 0, 30000, 30003, 117240, 0, 112075, 117580, 66600, 30032, 25000, 0, 0, 90056, 24700, 104000, 0, 0, 1815, 33200, 0, 23886, 42992, 30000, 14240, 0, 0, 28962, 24002, 25002, 60000, 100081, 13600, 0, 65112, 0, 0, 68015, 70000, 0, 96010, 22679, 42500, 50500, 38062, 135088, 20000, 1220, 21256, 0, 44549, 0, 31400, 52002, 0, 60000, 41748, 112429, 118570, 20002, 35000, 30002, 24600, 71510, 113000, 40405, 3300, 23000, 42940, 130422, 8000, 71300, 61685, 5, 0, 41050, 113113, 0, 0, 100001, 0, 25000, 106309, 91820, 0, 13769, 22920,…
$ female  <int> 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, …
$ metro   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, …
$ midwest <int> 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, …
$ south   <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, …
$ wage    <dbl> 44.44, 16.00, 15.38, 13.54, 25.00, 24.05, 40.95, 26.45, 30.76, 34.57, 20.67, 21.50, 11.77, 24.35, 55.00, 11.00, 57.70, 58.00, 5.75, 11.50, 33.65, 10.50, 14.50, 8.25, 8.65, 13.00, 17.00, 16.33, 7.20, 10.00, 60.83, 17.80, 30.38, 12.50, 49.50, 23.08, 28.95, 9.20, 29.00, 15.00, 14.70, 23.66, 8.00, 13.66, 15.00, 10.50, 7.75, 11.64, 37.03, 9.15, 17.00, 15.42, 43.28, 19.23, 16.00, 31.92, 30.00, 9.50, 17.30, 17.92, 12.50, 16.00, 13.71, 18.12, 26.00, 19.30, 10.00, 38.08, 12.00, 42.16, 28.40, 9.75, 36.06, 18.75, 10.50, 33.65, 17.00, 24.47, 9.80, 22.72, 11.88, 11.00, 8.76, 24.02, 14.73, 13.33, 41.83, 15.83, 42.73, 31.25, 17.00, 15.00, 25.00, 34.63, 9.50, 16.60, 23.00, 16.48, 48.08, 17.23, 34.60, 27.68, 21.00, 16.83, 36.06, 57.70, 11.50, 8.00, 14.00, 21.63, 19.23, 19.44, 10.85, 23.55, 30.00, 40.47, 11.77, 27.50, 22.00, 9.48, 15.38, 17.30, 17.40, 21.63, 27.47, 18.00, 28.83, 20.00, 9.00, 21.00, 64.09, 14.00, 18.00, 11.35, 6.17, 26.22, 18.00, 18.65, 45.01, 11.00, 44.51, 32.00, 43.2…
$ west    <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
dt.info(verbose = True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   black    1200 non-null   int64  
 1   educ     1200 non-null   int64  
 2   exper    1200 non-null   int64  
 3   faminc   1200 non-null   int64  
 4   female   1200 non-null   int64  
 5   metro    1200 non-null   int64  
 6   midwest  1200 non-null   int64  
 7   south    1200 non-null   int64  
 8   wage     1200 non-null   float64
 9   west     1200 non-null   int64  
dtypes: float64(1), int64(9)
memory usage: 93.9 KB

And we can examine the summary statistics of each variable:

dt %>% summary() %>% print()
     black             educ          exper           faminc           female         metro           midwest           south            wage             west       
 Min.   :0.0000   Min.   : 0.0   Min.   : 0.00   Min.   :     0   Min.   :0.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :  3.94   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:12.0   1st Qu.:12.00   1st Qu.:     0   1st Qu.:0.00   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.: 13.00   1st Qu.:0.0000  
 Median :0.0000   Median :14.0   Median :24.00   Median : 23679   Median :0.00   Median :1.0000   Median :0.0000   Median :0.000   Median : 19.30   Median :0.0000  
 Mean   :0.0875   Mean   :14.2   Mean   :23.37   Mean   : 35304   Mean   :0.44   Mean   :0.8217   Mean   :0.2475   Mean   :0.325   Mean   : 23.64   Mean   :0.2525  
 3rd Qu.:0.0000   3rd Qu.:16.0   3rd Qu.:34.00   3rd Qu.: 53029   3rd Qu.:1.00   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.: 29.80   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :21.0   Max.   :62.00   Max.   :469000   Max.   :1.00   Max.   :1.0000   Max.   :1.0000   Max.   :1.000   Max.   :221.10   Max.   :1.0000  
dt.describe()
             black         educ        exper         faminc       female        metro     midwest       south         wage         west
count  1200.000000  1200.000000  1200.000000    1200.000000  1200.000000  1200.000000  1200.00000  1200.00000  1200.000000  1200.000000
mean      0.087500    14.202500    23.374167   35304.421667     0.440000     0.821667     0.24750     0.32500    23.640042     0.252500
std       0.282684     2.890811    13.269296   45026.488224     0.496594     0.382953     0.43174     0.46857    15.216554     0.434628
min       0.000000     0.000000     0.000000       0.000000     0.000000     0.000000     0.00000     0.00000     3.940000     0.000000
25%       0.000000    12.000000    12.000000       0.000000     0.000000     1.000000     0.00000     0.00000    13.000000     0.000000
50%       0.000000    14.000000    24.000000   23679.000000     0.000000     1.000000     0.00000     0.00000    19.300000     0.000000
75%       0.000000    16.000000    34.000000   53029.000000     1.000000     1.000000     0.00000     1.00000    29.800000     1.000000
max       1.000000    21.000000    62.000000  469000.000000     1.000000     1.000000     1.00000     1.00000   221.100000     1.000000

We see that black, female, metro, midwest, south and west are indicator variables, while educ, exper, faminc and wage are real-valued.
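This classification can also be checked programmatically. A minimal sketch (using a toy two-column frame in place of dt, which is loaded at the start of the chapter): a column is an indicator variable if its unique values form a subset of {0, 1}:

```python
import pandas as pd

# Toy frame standing in for dt (illustrative values only)
toy = pd.DataFrame({
    "female": [1, 0, 1, 0],
    "wage":   [44.44, 16.00, 15.38, 13.54],
})

# A column is an indicator variable if its unique values are a subset of {0, 1}
is_indicator = {col: set(toy[col].unique()) <= {0, 1} for col in toy.columns}
print(is_indicator)  # {'female': True, 'wage': False}
```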

A more compact summary of the data
dt %>% skimr::skim() %>% print()
── Data Summary ────────────────────────
                           Values    
Name                       Piped data
Number of rows             1200      
Number of columns          10        
_______________________              
Column type frequency:               
  numeric                  10        
________________________             
Group variables            None      

── Variable type: numeric ──────────────────────────────────────────────────────
   skim_variable n_missing complete_rate       mean        sd   p0 p25     p50     p75    p100 hist 
 1 black                 0             1     0.0875     0.283 0      0     0       0        1  ▇▁▁▁▁
 2 educ                  0             1    14.2        2.89  0     12    14      16       21  ▁▁▅▇▂
 3 exper                 0             1    23.4       13.3   0     12    24      34       62  ▆▇▇▃▁
 4 faminc                0             1 35304.     45026.    0      0 23679   53029   469000  ▇▁▁▁▁
 5 female                0             1     0.44       0.497 0      0     0       1        1  ▇▁▁▁▆
 6 metro                 0             1     0.822      0.383 0      1     1       1        1  ▂▁▁▁▇
 7 midwest               0             1     0.248      0.432 0      0     0       0        1  ▇▁▁▁▂
 8 south                 0             1     0.325      0.469 0      0     0       1        1  ▇▁▁▁▃
 9 wage                  0             1    23.6       15.2   3.94  13    19.3    29.8    221. ▇▁▁▁▁
10 west                  0             1     0.252      0.435 0      0     0       1        1  ▇▁▁▁▃
print(skimpy.skim(dt))
╭─────────────────────────────── skimpy summary ───────────────────────────────╮
│          Data Summary                Data Types                              │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                       │
│ ┃ dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃                       │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                       │
│ │ Number of rows    │ 1200   │ │ int64       │ 9     │                       │
│ │ Number of columns │ 10     │ │ float64     │ 1     │                       │
│ └───────────────────┴────────┘ └─────────────┴───────┘                       │
│                                   number                                     │
│ ┏━━━━━━┳━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━┳━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━┳━━━━━━┓  │
│ ┃ colu ┃    ┃      ┃      ┃      ┃     ┃     ┃      ┃      ┃       ┃      ┃  │
│ ┃ mn_n ┃    ┃      ┃      ┃      ┃     ┃     ┃      ┃      ┃       ┃      ┃  │
│ ┃ ame  ┃ NA ┃ NA % ┃ mean ┃ sd   ┃ p0  ┃ p25 ┃ p50  ┃ p75  ┃ p100  ┃ hist ┃  │
│ ┡━━━━━━╇━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━╇━━━━━╇━━━━━━╇━━━━━━╇━━━━━━━╇━━━━━━┩  │
│ │ blac │  0 │    0 │ 0.08 │ 0.28 │   0 │   0 │    0 │    0 │     1 │  ▇   │  │
│ │ k    │    │      │    7 │      │     │     │      │      │       │  ▁   │  │
│ │ educ │  0 │    0 │   14 │  2.9 │   0 │  12 │   14 │   16 │    21 │    ▇ │  │
│ │      │    │      │      │      │     │     │      │      │       │  ▇▃  │  │
│ │ expe │  0 │    0 │   23 │   13 │   0 │  12 │   24 │   34 │    62 │ ▇▇▇▇ │  │
│ │ r    │    │      │      │      │     │     │      │      │       │  ▂   │  │
│ │ fami │  0 │    0 │ 3500 │ 4500 │   0 │   0 │ 2400 │ 5300 │ 47000 │  ▇▁  │  │
│ │ nc   │    │      │    0 │    0 │     │     │    0 │    0 │     0 │      │  │
│ │ fema │  0 │    0 │ 0.44 │  0.5 │   0 │   0 │    0 │    1 │     1 │  ▇   │  │
│ │ le   │    │      │      │      │     │     │      │      │       │  ▆   │  │
│ │ metr │  0 │    0 │ 0.82 │ 0.38 │   0 │   1 │    1 │    1 │     1 │  ▂   │  │
│ │ o    │    │      │      │      │     │     │      │      │       │  ▇   │  │
│ │ midw │  0 │    0 │ 0.25 │ 0.43 │   0 │   0 │    0 │    0 │     1 │  ▇   │  │
│ │ est  │    │      │      │      │     │     │      │      │       │  ▃   │  │
│ │ sout │  0 │    0 │ 0.33 │ 0.47 │   0 │   0 │    0 │    1 │     1 │  ▇   │  │
│ │ h    │    │      │      │      │     │     │      │      │       │  ▃   │  │
│ │ wage │  0 │    0 │   24 │   15 │ 3.9 │  13 │   19 │   30 │   220 │  ▇▁  │  │
│ │ west │  0 │    0 │ 0.25 │ 0.43 │   0 │   0 │    0 │    1 │     1 │  ▇   │  │
│ │      │    │      │      │      │     │     │      │      │       │  ▃   │  │
│ └──────┴────┴──────┴──────┴──────┴─────┴─────┴──────┴──────┴───────┴──────┘  │
╰──────────────────────────────────── End ─────────────────────────────────────╯
None
Tip

We note that there are some outliers in our data - the 3rd quartile of wage is much smaller than its maximum value.
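One common way to quantify this observation is the 1.5 × IQR rule. A hedged sketch on a hypothetical wage sample (illustrative values, not the actual data):

```python
import numpy as np

# Hypothetical wage values with one extreme observation
wage = np.array([13.0, 19.3, 29.8, 221.1])

q1, q3 = np.percentile(wage, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)  # classic 1.5 * IQR rule
outliers = wage[wage > upper_fence]
print(outliers)  # [221.1]
```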

While the descriptive statistics give some initial insight into the magnitude of our data values - they say nothing about the relationships between variables. This is why we will move on to analysing the pairwise scatter-plots of the non-indicator variables:

dt %>% 
  ggpairs(columns = c(2:4, 9), aes(alpha = 0.5),
          diag = list(continuous = "barDiag"),
          progress = FALSE) + theme_bw()

def reg_coef(x, y, label=None, color=None, **kwargs):
    # A modified version of https://stackoverflow.com/a/63433499
    ax = plt.gca()
    r,p = scipy.stats.pearsonr(x, y)
    val = 'r = {:.3f}'.format(r)
    if p <= 0.001:
        val = val + "***"  
    elif p <= 0.01:
        val = val + "**"  
    elif p <= 0.05:
        val = val + "*"    
    ax.annotate(val, xy=(0.5, 0.5), xycoords='axes fraction', ha='center')
    ax.set_axis_off()
# sns.pairplot(dt.iloc[:, [1, 2, 3, 8]])
g = sns.PairGrid(dt.iloc[:, [1, 2, 3, 8]], diag_sharey = False)
tmp_plt = g.map_diag(sns.histplot)
tmp_plt = g.map_lower(sns.scatterplot)
tmp_plt = g.map_upper(reg_coef)
plt.show()

The diagonal elements are the histograms of the variables, the upper triangle contains the pairwise correlations, and the lower triangle contains the scatter-plots of the same variables.

From the plots we can make a few observations.

Taking a closer look at the histogram of wage - its distribution appears to be strongly right-skewed. Since a linear regression model is used to describe a linear relationship, it may be better to use \(\log(wage)\). Comparing the two:

p1 <- dt %>% ggplot(aes(x = wage)) + 
  geom_histogram(bins = 30) + theme_bw()
p2 <- dt %>% ggplot(aes(x = log(wage))) + 
  geom_histogram(bins = 30) + theme_bw()
print(p1)
print(p2)

p1 = (pl.ggplot(data = dt, mapping = pl.aes(x = "wage")) +
   pl.geom_histogram(bins = 30) + pl.theme_bw())
p2 = (pl.ggplot(data = dt, mapping = pl.aes(x = "np.log(wage)")) +
   pl.geom_histogram(bins = 30) + pl.theme_bw())
print(p1)
print(p2)

We see that \(\log(wage)\) is more bell-shaped: while it still does not perfectly resemble a normal distribution, it is much closer to one than the original values.
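The visual impression can be backed by a skewness calculation. A sketch using simulated right-skewed data (a lognormal sample is an assumption here, standing in for the actual wage values):

```python
import numpy as np

def sample_skew(v):
    # Third standardized moment of the sample
    v = np.asarray(v, dtype=float)
    return np.mean((v - v.mean()) ** 3) / v.std() ** 3

rng = np.random.default_rng(42)
x = rng.lognormal(mean=3, sigma=0.5, size=1000)  # right-skewed, like wage

print(sample_skew(x))          # clearly positive (right-skewed)
print(sample_skew(np.log(x)))  # close to 0 (roughly symmetric)
```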

In order to identify, whether our wage is different depending on gender/race or other indicator variables - we can look at the following boxplots:

p3 <- dt %>% 
  dplyr::select(wage, black, female, metro, midwest, south, west) %>%
  pivot_longer(cols = !matches("wage"), names_to = "variable") %>%
  ggplot(aes(x = factor(value), y = wage)) + 
  geom_boxplot(aes(fill = variable)) +
  facet_wrap(~variable) + 
  theme_bw() + theme(legend.position = "none") +
  coord_cartesian(ylim = quantile(dt$wage, c(0.01, 0.99)))
print(p3)

p3 = dt[['wage', 'black', 'female', 'metro', 'midwest', 'south', 'west']]
p3 = p3.melt(id_vars = "wage", value_vars = [col for col in p3.columns if col != "wage"], var_name = "variable")
p3 = (pl.ggplot(data = p3, mapping = pl.aes(x ="factor(value)", y = "wage")) +\
        pl.geom_boxplot(mapping = pl.aes(fill = "variable")) +\
        pl.facet_wrap("~variable") +\
        pl.theme_bw() + pl.theme(legend_position = "none") +\
        pl.coord_cartesian(ylim = np.quantile(dt["wage"], q = [0.01, 0.99])))
print(p3)

We note that black, female and metro may have wage differences. However, these boxplots do not control for other variables - this is where our regression models will be quite useful.

We will quickly check whether the regional indicator variables cover all of the regions in the dataset. Since we have 3 geographic indicator variables - we will create a single categorical variable, where each region gets an appropriate text label. If it turns out that there are more regions - they will be assigned to the “other” category. If there are not - the “other” category should be empty.

dt <- dt %>% 
  mutate(geography = case_when(midwest == 1 ~ "midwest",
                               south == 1 ~ "south",
                               west == 1 ~ "west",
                               TRUE ~ "other"
                               )
                    )
print(table(dt$geography))

midwest   other   south    west 
    297     210     390     303 
dt["geography"] = "other"
# Note: Series.case_when() is available starting from version 2.2.0.
dt["geography"] = np.where(dt["midwest"] == 1, "midwest", 
                           np.where(dt["south"] == 1, "south", 
                                    np.where(dt["west"] == 1, "west", "other")
                                   )
                          )
print(dt["geography"].value_counts())
geography
south      390
west       303
midwest    297
other      210
Name: count, dtype: int64

We see that there are more regions than the 3 covered by dummy variables. This means that if we include the west, south and midwest dummy variables - they will be compared against the base region. If we instead use the geography variable - the base region will be chosen automatically in alphabetical order. To avoid this, we transform it into a factor variable:

dt <- dt %>% 
  mutate(geography = factor(geography, levels = c("other", "south", "west", "midwest")))
print(head(dt$geography, 10))
 [1] west    midwest other   midwest west    other   south   midwest midwest west   
Levels: other south west midwest
dt["geography"] = dt["geography"].astype("category")
dt["geography"] = dt["geography"].cat.reorder_categories(["other", "south", "west", "midwest"])
print(dt["geography"].head(10))
0       west
1    midwest
2      other
3    midwest
4       west
5      other
6      south
7    midwest
8    midwest
9       west
Name: geography, dtype: category
Categories (4, object): ['other', 'south', 'west', 'midwest']

Note the additional information on the Levels / Categories at the end of the output. This information provides the unique labels (categories) of this factor (categorical) variable. The very first category is now other - it will be used as the base group, if we ever decide to use this variable.

In fact, we can already use it to further enhance our plots:

For R, we will use the ggpairs() function.


For Python - we need to modify our previous reg_coef function. The sns.pairplot function doesn’t plot the correlation values in the upper corner - you can try this with the following code:

tmp_plt = sns.pairplot(dt.iloc[:, [1, 2, 3, 8, 10]], hue = "geography")
plt.show()

To make our plots more informative (similarly to R’s ggpairs()) - we need to modify our previous function as follows:

def reg_coef_grouped(x, y, label=None, color=None, **kwargs):
    ax = plt.gca()
    r,p = scipy.stats.pearsonr(x, y)
    val = 'r = {:.3f}'.format(r)
    if p <= 0.001:
        val = val + "***"  
    elif p <= 0.01:
        val = val + "**"  
    elif p <= 0.05:
        val = val + "*"
    b = next(tmp_groups)
    ax.annotate(val, xy=(0.5, 0.2 + 0.2 * b), xycoords='axes fraction', ha='center', color = color)
    ax.set_axis_off()

The key points are:

PairGrid can define the number of groups with the hue argument. However, the map_diag, map_upper and map_lower functions will call reg_coef for each color and each subplot separately. Unfortunately, those functions don’t pass any information on which variable and which group is being plotted - only its color value. Therefore, we need a way to iterate through the number of groups and the number of subplots - we do this via the tmp_groups iterator variable, which we define outside the plot function.

Unlike standard arrays, where we need to specify an index to access a specific value, we can use the next() function to traverse an iterator. Therefore, when our map_ functions call reg_coef - we take the next element from the tmp_groups variable, which contains the same group ordering pattern for each upper-triangle subplot.

For example, the columns for which we want to generate the pairplots are:

print(dt.iloc[:, [1, 2, 3, 8]].columns)
Index(['educ', 'exper', 'faminc', 'wage'], dtype='object')
print(len(dt.iloc[:, [1, 2, 3, 8]].columns))
4

The number of unique categories of the geography variable is:

print(dt["geography"].unique())
['west', 'midwest', 'other', 'south']
Categories (4, object): ['other', 'south', 'west', 'midwest']
print(len(dt["geography"].unique()))
4

Therefore, for each subplot, we need to have the following values, indicating which category is going to be called:

# Get the number of unique groups:
tmp_groups = list(range(len(dt["geography"].unique())))
print(tmp_groups)
[0, 1, 2, 3]

The number of off-diagonal plots that we will have is:

tmp_count = int(len(dt.iloc[:, [1, 2, 3, 8]].columns) * (len(dt.iloc[:, [1, 2, 3, 8]].columns) - 1) / 2)
print(tmp_count)
6

Therefore, we will need to repeat the same geography group pattern for each subplot as follows:

tmp_groups = tmp_groups * tmp_count
print(tmp_groups)
[0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]

Finally, we can create this variable as an iterator:

tmp_groups = iter(tmp_groups)
print(tmp_groups)
<list_iterator object at 0x7f3a1a75fe20>

and we can access the values as follows:

print(next(tmp_groups))
0
print(next(tmp_groups))
1

Note that we can only iterate forward.

We summarize the above explanation and create the iterator variable as follows:

n_vars = len(dt.iloc[:, [1, 2, 3, 8]].columns)
# Get the number of unique groups:
tmp_groups = list(range(len(dt["geography"].unique())))
# Since they will be added to the off-diagonal elements - we need the same ordering for each off-diagonal plot
# So, we repeat this pattern by the number of off-diagonal plots (N * (N-1) / 2):
tmp_groups = tmp_groups * int(n_vars * (n_vars - 1) / 2)
# Finally, we create an iterator, since the way this works is that seaborn makes repeated calls to the
# plot function, but does not pass any index values:
tmp_groups = iter(tmp_groups)

Important: we need to re-create this variable every time we want to re-draw the plot, since we cannot reset the iterator variable.
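The “cannot reset” caveat can be seen with a minimal example:

```python
groups = iter([0, 1])
print(next(groups))  # 0
print(next(groups))  # 1
# A third next(groups) call would raise StopIteration - the only way to
# "reset" the iterator is to re-create it:
groups = iter([0, 1])
print(next(groups))  # 0
```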

dt %>% 
  ggpairs(columns = c(2:4, 9), 
          aes(color = factor(geography), alpha = 0.5),
          progress = FALSE) + theme_bw()

g = sns.PairGrid(dt.iloc[:, [1, 2, 3, 8, 10]], diag_sharey = False, hue = "geography")
tmp_plt = g.map_diag(sns.kdeplot, fill = True)
tmp_plt = g.map_lower(sns.scatterplot)
tmp_plt = g.map_upper(reg_coef_grouped)
tmp_plt = g.add_legend()
plt.show()

However, in this case the plots are hardly informative, as it is difficult to see any differences between the regions.

We can also focus just on the variable (excluding the geography categorical variable) correlations by examining the correlation heatmap:

myPanel <- function(x, y, z, ...){
  lattice::panel.levelplot(x,y,z,...)
  my_text <- ifelse(!is.na(z), paste0(round(z, 4)), "")
  lattice::panel.text(x, y, my_text)
}
# base colors: terrain.colors(), rainbow(), heat.colors(), topo.colors() or cm.colors()
# Viridis: viridis, magma, inferno, plasma
lattice::levelplot(cor(dt %>% dplyr::select(-geography), use = "complete.obs"), panel = myPanel, 
                   col.regions = viridisLite::viridis(100), main = 'Correlation of numerical variables')

corr_mat = dt.drop(columns = ['geography']).corr()
fig = plt.figure(figsize = (10, 10))
tmp_plt = plt.title('Correlation of numerical variables', size = 25)
tmp_plt = sns.heatmap(corr_mat, annot = True)
tmp_plt = plt.ylim((len(corr_mat), 0)) # See bug on bottom cutoff: https://github.com/mwaskom/seaborn/issues/1773
#
plt.show()

Caution

south, midwest and west appear to be negatively correlated; however, this is to be expected - they are indicator variables for the geographical location, so an observation can have a value of \(1\) for only one of them at a time. We should avoid using indicator variables when calculating correlations and only be concerned with the correlation matrix of the actual numerical variables.
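The mechanical negative correlation between mutually exclusive dummies is easy to demonstrate on a toy example (illustrative values, not the actual data):

```python
import numpy as np

# Mutually exclusive dummies: at most one of them is 1 in any row
south = np.array([1, 0, 0, 0, 1, 0])
west  = np.array([0, 1, 0, 1, 0, 0])

# Whenever south is 1, west must be 0 - so the correlation is negative
print(np.corrcoef(south, west)[0, 1])  # -0.5
```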

To make it a bit more readable, we only plot the lower triangle of the numerical variables:

mask = cor(dt[, c('educ', 'exper', 'faminc', 'wage')], use = "complete.obs")
mask[upper.tri(mask, diag = TRUE)] <- NA
lattice::levelplot(mask, 
                   panel = myPanel, 
                   col.regions = viridisLite::viridis(100), 
                   main = 'Correlation of numerical variables')

corr_mat = dt[['educ', 'exper', 'faminc', 'wage']].corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr_mat, dtype = bool)
# indices for the upper-triangle
mask[np.triu_indices_from(mask)] = True 
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap = True)
#
fig = plt.figure(figsize = (5, 5))
tmp_plt = plt.title('Correlation of numerical variables', size = 15)
tmp_plt = sns.heatmap(corr_mat, mask = mask, cmap = cmap, annot = True)
tmp_plt = plt.ylim((len(corr_mat), 0)) # See bug on bottom cutoff: https://github.com/mwaskom/seaborn/issues/1773
plt.show()

We might suspect that the metro dummy variable could also be part of the regional categorical variable. Since the indicator variables from a single categorical variable sum to \(1\) (as each observation can have only one characteristic at a time), we can verify this:

dt %>% reframe(south + west + midwest) %>% table() %>% print()
south + west + midwest
  0   1 
210 990 
dt %>% reframe(south + west + midwest + metro) %>% table() %>% print()
south + west + midwest + metro
  0   1   2 
 36 352 812 
print(dt[["south", "west", "midwest"]].sum(axis = 1).value_counts())
1    990
0    210
Name: count, dtype: int64
print(dt[["south", "west", "midwest", "metro"]].sum(axis = 1).value_counts())
2    812
1    352
0     36
Name: count, dtype: int64

We see that south, west and midwest always sum up to either \(0\) or \(1\), meaning that a person (i.e. an observation) can only be from one region at a time. So, south, west, midwest and some other base region (possibly covering multiple regions) are levels of the same categorical variable.

Since metro is not part of the regional categorical variable - a person can be from a specific region but not necessarily from a metropolitan area - there are cases where metro and some region dummy variable are both equal to \(1\). Hence, metro belongs to a different categorical variable.
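The same check can be written as explicit assertions: if the region dummies are levels of one categorical variable, their row sums can never exceed \(1\), while adding metro can push the sum to \(2\). A small sketch with made-up data:

```python
import pandas as pd

# Toy data: each row has at most one region flag, metro varies freely
toy = pd.DataFrame({
    'south':   [1, 0, 0, 0],
    'west':    [0, 1, 0, 0],
    'midwest': [0, 0, 1, 0],
    'metro':   [1, 0, 1, 1],
})

region_sum = toy[['south', 'west', 'midwest']].sum(axis=1)
full_sum   = toy[['south', 'west', 'midwest', 'metro']].sum(axis=1)

# Regions are mutually exclusive: the row sum is always 0 or 1
assert region_sum.isin([0, 1]).all()
# metro is a separate characteristic, so the combined sum can reach 2
assert full_sum.max() == 2
```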

Before continuing on, we will divide our data into a training and a test set.

One way to do this is to randomly take \(80\%\) of the data as follows:

set.seed(123)
#
dt_index   <- 1:nrow(dt)
smpl_index <- sample(dt_index, size = floor(0.8 * length(dt_index)), replace = FALSE)
print(head(smpl_index, 5))
[1] 415 463 179 526 195
rng = np.random.default_rng(123)
#
dt_index   = list(dt.index)
smpl_index = rng.choice(dt_index, size = int(np.floor(0.8 * len(dt_index))), replace = False)
print(smpl_index[:5])
[329  85 947 468 731]

We want the results to be comparable between R and Python. So, we will assume that the observations in our dataset are already randomly ordered. This means that we can simply take the first \(80\%\) of observations:

smpl_index <- 1:floor(0.8 * length(dt_index))
smpl_index = list(range(0, int(np.floor(0.8 * len(dt_index)))))

Next, we can create the train and test sets as follows:

dt_train <- dt[smpl_index, ]
dt_test  <- dt[setdiff(dt_index, smpl_index), ]
dt_train = dt.loc[smpl_index, :].reset_index(drop=True)
dt_test  = dt.loc[list(set(dt_index).difference(set(smpl_index))), :]
dt_test.sort_index(inplace=True)
dt_test  = dt_test.reset_index(drop=True)
print(nrow(dt_train))
[1] 960
print(nrow(dt_test))
[1] 240
print(len(dt_train.index))
960
print(len(dt_test.index))
240
print(head(dt_train, 5))
# A tibble: 5 × 11
  black  educ exper faminc female metro midwest south  wage  west geography
  <int> <int> <int>  <int>  <int> <int>   <int> <int> <dbl> <int> <fct>    
1     0    13    45      0      1     1       0     0  44.4     1 west     
2     0    14    25  45351      1     1       1     0  16       0 midwest  
3     0    18    27  91946      1     1       0     0  15.4     0 other    
4     0    13    42  48370      0     1       1     0  13.5     0 midwest  
5     0    13    41  10000      1     1       0     0  25       1 west     
print(head(dt_test, 5))
# A tibble: 5 × 11
  black  educ exper faminc female metro midwest south  wage  west geography
  <int> <int> <int>  <int>  <int> <int>   <int> <int> <dbl> <int> <fct>    
1     0    13    17   3894      0     1       0     0 27        0 other    
2     0    12    22  60646      1     1       1     0 24        0 midwest  
3     0    12    14      0      1     1       1     0  9        0 midwest  
4     0     5    52   8376      0     1       0     1  7.57     0 south    
5     0    12     1  29540      1     1       0     0  8        1 west     
print(dt_train.head(5))
   black  educ  exper  faminc  female  metro  midwest  south   wage  west geography
0      0    13     45       0       1      1        0      0  44.44     1      west
1      0    14     25   45351       1      1        1      0  16.00     0   midwest
2      0    18     27   91946       1      1        0      0  15.38     0     other
3      0    13     42   48370       0      1        1      0  13.54     0   midwest
4      0    13     41   10000       1      1        0      0  25.00     1      west
print(dt_test.head(5))
   black  educ  exper  faminc  female  metro  midwest  south   wage  west geography
0      0    13     17    3894       0      1        0      0  27.00     0     other
1      0    12     22   60646       1      1        1      0  24.00     0   midwest
2      0    12     14       0       1      1        1      0   9.00     0   midwest
3      0     5     52    8376       0      1        0      1   7.57     0     south
4      0    12      1   29540       1      1        0      0   8.00     1      west
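An equivalent way to take the first \(80\%\) in Python, without building an index list, is positional slicing with `iloc`. A sketch - `n_train` is a name introduced here for illustration, and the one-column frame stands in for the real dataset:

```python
import pandas as pd

dt = pd.DataFrame({'x': range(10)})  # stand-in for the real dataset

n_train  = int(0.8 * len(dt))
dt_train = dt.iloc[:n_train].reset_index(drop=True)
dt_test  = dt.iloc[n_train:].reset_index(drop=True)

print(len(dt_train), len(dt_test))  # 8 2
```

Because `iloc` slices by position rather than by label, this avoids the `set(...).difference(...)` step and the subsequent re-sorting of the test set.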