Jason Greenberg

Research Question: Starting as early as their undergraduate career, do women face barriers to entry into or have strong preferences against entering STEM related fields, specifically the sciences and engineering, due to their gender? Also, does region within the United States impact this gender disparity?

Changing gender expectations and increasingly equal societal treatment of women have led researchers from different disciplines to analyze what may contribute to the lingering gender gap in earned wages. Many factors influence a worker's salary, including job industry, experience, inherent ability, lifestyle preferences, and performance. Many of these predictors are subjective and difficult to measure. This presentation will not focus on explaining the causes of higher median incomes for men, but will instead examine the gender disparity in college major choice, which is an indicator of eventual career track and earnings. Systematic gender preferences for and against science and engineering majors are visible in undergraduate major data from the 2015 American Community Survey. Moreover, while the gap between the number of male and female science and engineering majors is strikingly constant across states, the percentage of male degree holders who majored in science and engineering, compared with the corresponding percentage for women, suggests that different parts of the country face varying levels of gender disparity in the sciences at the undergraduate level.

library(ggplot2)
library(maps)
library(RColorBrewer)

Warning message:
"package 'maps' was built under R version 3.3.3"

The R packages above are needed to produce the graphics that support the argument developed below.

bachelors <- read.csv("bachelors.csv", header = TRUE, stringsAsFactors = FALSE)
 dim(bachelors)
 head(bachelors)

The "bachelors.csv" file includes information on bachelor's degree holders from 2015 for men and women across the United States and Puerto Rico in various geographical regions. As this was a dataset used for Problem Set 2 of the class, no major data cleaning was necessary. For the purposes of this presentation, only the combined state figures and not the urban, rural, or city specific data will be used.

desired_columns <- c(3, 4, 16, 28, 40, 52, 64)
desired_rows <- seq(2,53) #all states, Washington DC, and Puerto Rico
subsetTotal <- bachelors[desired_rows, desired_columns] 
colnames(subsetTotal) <- c("State","Total", "SciEng", "SciEngRelated", "Business", "Education", "HumArts")
dim(subsetTotal)
subsetTotal

This dataframe presents the total number of bachelor's degrees for those aged 25 or older from the year 2015 for all 50 states, Washington DC, and Puerto Rico. Five categories of majors are included. The US Census Bureau American Community Survey defines science and engineering related majors to include nursing, architecture, and mathematics teacher education degrees, while the science and engineering category includes biology, chemistry, physics, mathematics, computer science, and social science degrees.

colnames(subsetTotal) <- c("State","Total", "SciEng", "SciEngRelated", "Business", "Education", "HumArts")
bandNames <- colnames(subsetTotal[,3:7])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
 for(j in 1:5){ 
    hist(as.numeric(subsetTotal[,j+2])/as.numeric(subsetTotal[,2]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
        axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
    box()
    text(x = .33, y=40, label = bandNames[j])
 }

Before beginning an analysis of gender and region bias in college major selection, it is helpful to see the distributions of majors for the combined figures that include both men and women for all observation points. The visual above includes five histograms, one for each category of college major. The x-axis measures the percentage of degree holders with that major and ranges from 0 to 70%, as defined by the "breaks = seq(0,0.7, by=0.05)" code, while the y-axis indicates frequency in terms of number of states, which ranges from 0 to 50. Each bar represents a bin range of 5%. The use of "par" and the for loop generates the grouped set of histograms, and the individual histogram titles are taken from the column names of the dataframe "subsetTotal" through the "colnames" function. For this non-region-specific state data for all degree holders over the age of 25, science and engineering majors represented the highest percentage of total degrees. This shows in the location and shape of the distribution: in more than 20 states, science and engineering degrees make up around 30% of all degrees, and the distribution is relatively even and bell shaped. Meanwhile, science and engineering related fields (about 30 states with between 5 and 10 percent of degree holders) and education (about 20 states with 10 to 15 percent) sit lower, which signifies lower popularity.
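As a minimal sketch of what each histogram computes, the share of degrees in a major can be calculated per state and then counted into 5% bins. The three "states" and their counts below are made up for illustration and are not taken from "bachelors.csv".

```r
# Hypothetical toy data: three made-up states, counts stored as text like the CSV columns
toy <- data.frame(
  State  = c("A", "B", "C"),
  Total  = c("1000", "2000", "4000"),
  SciEng = c("320", "740", "1120"),
  stringsAsFactors = FALSE
)

# Share of all degrees that are in science and engineering, per state
share <- as.numeric(toy$SciEng) / as.numeric(toy$Total)

# Same 5% bins as the plots above; plot = FALSE returns the bin counts only
bins <- hist(share, breaks = seq(0, 0.7, by = 0.05), plot = FALSE)
bins$counts
```

Each entry of "bins$counts" is the height of one bar, that is, the number of states whose share falls in that 5% range.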

Desired_columnsMale <- c(8, 20, 32, 44, 56, 68) #men totals
Desired_rowsMale <- seq(2,53) #all states
SubsetMale <- bachelors[Desired_rowsMale, Desired_columnsMale] 
colnames(SubsetMale) <- c("TotalMale", "SciEngMale", "SciEngRelatedMale", "BusinessMale", 
                          "EducationMale", "HumArtsMale")
bandNames <- colnames(SubsetMale[,-1])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
 for(j in 1:5){ 
    hist(as.numeric(SubsetMale[,j+1])/as.numeric(SubsetMale[,1]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
        axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
    box()
    text(x = .33, y=40, label = bandNames[j])
 }

Desired_columnsFemale <- c(12, 24, 36, 48, 60, 72) #women totals
Desired_rowsFemale <- seq(2,53) #all states
SubsetFemale <- bachelors[Desired_rowsFemale, Desired_columnsFemale] 
colnames(SubsetFemale) <- c("TotalFemale", "SciEngFemale", "SciEngRelatedFemale", "BusinessFemale", 
                            "EducationFemale", "HumArtsFemale")
bandNames <- colnames(SubsetFemale[,-1])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
 for(j in 1:5){ 
    hist(as.numeric(SubsetFemale[,j+1])/as.numeric(SubsetFemale[,1]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
        axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
    box()
    text(x = .33, y=40, label = bandNames[j])
 }

These two sets of five histograms, divided by gender, display the differences in the distribution of major choice for men and women. The same coding technique and structure were used as in the non-gender-specific set of histograms above. This time, the new subsets "SubsetMale" and "SubsetFemale" were used instead of "subsetTotal," with the data drawn from the gender-specific columns of the original CSV file. One of the most drastic differences in the distributions exists in the science and engineering major histograms. The center of the male SciEng distribution is about 20 percentage points higher than the center of the female SciEng distribution. No other college major type faces this sort of gender disparity. Further statistical analysis will help clarify the relationship between the number of men with science and engineering degrees and the number of women with science and engineering degrees.

desired_columns <- c(3,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72)
desired_rows <- seq(2,53) #all states
subset <- bachelors[desired_rows, desired_columns] 
colnames(subset) <- c("State","Total","TotalMale","TotalFemale",
                      "SciEngTotal", "SciEngMale", "SciEngFemale",
                      "SciEngRelatedTotal", "SciEngRelatedMale", "SciEngRelatedFemale",
                      "BusinessTotal","BusinessMale","BusinessFemale",
                      "EducationTotal","EducationMale","EducationFemale",
                     "HumanitiesTotal","HumanitiesMale","HumanitiesFemale")

subset$percentMaleTotal <- as.numeric(subset$TotalMale)/as.numeric(subset$Total)*100
subset$percentFemaleTotal <- as.numeric(subset$TotalFemale)/as.numeric(subset$Total)*100

subset$percentOfMaleInSciEng <- as.numeric(subset$SciEngMale)/as.numeric(subset$SciEngTotal)*100     #this pair of percentages will sum to 100%
subset$percentOfFemaleInSciEng <- as.numeric(subset$SciEngFemale)/as.numeric(subset$SciEngTotal)*100

subset$percentSciEngMale <- as.numeric(subset$SciEngMale)/as.numeric(subset$TotalMale)*100           #while this pair has no immediate, direct relationship
subset$percentSciEngFemale <- as.numeric(subset$SciEngFemale)/as.numeric(subset$TotalFemale)*100

subset$percentSciEngRelatedMale <- as.numeric(subset$SciEngRelatedMale)/as.numeric(subset$TotalMale)*100          
subset$percentSciEngRelatedFemale <- as.numeric(subset$SciEngRelatedFemale)/as.numeric(subset$TotalFemale)*100

subset$percentBusinessMale <- as.numeric(subset$BusinessMale)/as.numeric(subset$TotalMale)*100
subset$percentBusinessFemale <- as.numeric(subset$BusinessFemale)/as.numeric(subset$TotalFemale)*100

subset$percentEducationMale <- as.numeric(subset$EducationMale)/as.numeric(subset$TotalMale)*100
subset$percentEducationFemale <- as.numeric(subset$EducationFemale)/as.numeric(subset$TotalFemale)*100

subset$percentHumanitiesMale <- as.numeric(subset$HumanitiesMale)/as.numeric(subset$TotalMale)*100
subset$percentHumanitiesFemale <- as.numeric(subset$HumanitiesFemale)/as.numeric(subset$TotalFemale)*100

subset$SciEngRatio <- subset$percentSciEngFemale/subset$percentSciEngMale

head(subset)

Dividing the number of degree holders in each major for men and women by the total number of degree holders of each gender gives a percentage of degree holders for each state, major, and gender. Having access to both raw counts and relative figures is important for a more complete analysis. To clarify the annotations in the code for the second and third pairs of calculations above, "percentOfMaleInSciEng" and "percentOfFemaleInSciEng" give the share of all science and engineering degree holders who are men and who are women, respectively. This is why the two values sum to 100%. Meanwhile, "percentSciEngMale" and "percentSciEngFemale" give the percentage of each gender's degree holders whose degree is in science and engineering rather than another field, which is why these two percentages will most likely not sum to 100%. Both computations are important for understanding the relationship between men, women, and degree choice.
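The difference between the two kinds of percentage can be checked on a single made-up state. The counts below are hypothetical, and the sketch assumes the science and engineering total is simply the sum of the male and female counts.

```r
# Hypothetical counts for one made-up state
toy <- data.frame(
  TotalMale = 1000, TotalFemale = 1200,
  SciEngMale = 400, SciEngFemale = 300
)
# Assumption: the combined SciEng count is the sum of the two genders
toy$SciEngTotal <- toy$SciEngMale + toy$SciEngFemale

# Pair 1: each gender's share of all SciEng degree holders (sums to 100)
percentOfMaleInSciEng   <- toy$SciEngMale   / toy$SciEngTotal * 100
percentOfFemaleInSciEng <- toy$SciEngFemale / toy$SciEngTotal * 100
percentOfMaleInSciEng + percentOfFemaleInSciEng

# Pair 2: share of each gender's own degrees that are SciEng (need not sum to 100)
percentSciEngMale   <- toy$SciEngMale   / toy$TotalMale   * 100
percentSciEngFemale <- toy$SciEngFemale / toy$TotalFemale * 100
percentSciEngMale + percentSciEngFemale

# Female-to-male ratio of the second pair, as in SciEngRatio
percentSciEngFemale / percentSciEngMale
```

In this toy case the first pair sums to 100 while the second pair sums to 65, and the ratio of the second pair is 0.625.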

median(as.numeric(subset$TotalMale))
median(as.numeric(subset$TotalFemale))

median(as.numeric(subset$SciEngMale))
median(as.numeric(subset$SciEngFemale))

Calculating the median number of total degree holders per state for men and women shows that women hold more degrees than men in the median state. The ratio of women degree holders to men degree holders is about 1.2 to 1.0, while the ratio of female science and engineering degree holders to male science and engineering degree holders is about 1.0 to 1.59, which indicates that even with higher female totals in general, men hold many more degrees in science and engineering than women.
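The science and engineering ratio can be reproduced from the two state medians reported later in this presentation (134,349 for men and 84,527 for women). The variable names are illustrative only.

```r
# State medians of SciEng degree holders, as reported in this presentation
median_male_scieng   <- 134349
median_female_scieng <- 84527

# Male SciEng degree holders per female SciEng degree holder at the median
ratio <- median_male_scieng / median_female_scieng
round(ratio, 2)  # 1.59
```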

sum(as.numeric(subset$TotalMale))
sum(as.numeric(subset$TotalFemale))

sum(as.numeric(subset$SciEngMale))
sum(as.numeric(subset$SciEngFemale))

Computing the same ratios for all state counts summed together, the women to men total ratio is about 1.1 to 1.0, while the parallel science and engineering totals ratio is about 1.0 to 1.5. Therefore, for both the state median ratios and the total sum ratios, women have more degrees in general, but the relative difference between number of male and female science and engineering degrees is even greater and in the opposite relationship.

median(subset$percentMaleTotal)
median(subset$percentFemaleTotal)

median(subset$percentOfMaleInSciEng )
median(subset$percentOfFemaleInSciEng)

median(subset$percentSciEngMale)
median(subset$percentSciEngFemale)
median(subset$percentSciEngMale)/median(subset$percentSciEngFemale)

median(subset$percentSciEngRelatedMale)/median(subset$percentSciEngRelatedFemale) 

median(subset$percentBusinessMale)/median(subset$percentBusinessFemale) 

median(subset$percentEducationMale)/median(subset$percentEducationFemale) 

median(subset$percentHumanitiesMale)/median(subset$percentHumanitiesFemale)

Looking at the percentage calculations, women held about 52.5% of all bachelor's degrees in 2015 among those over the age of 25. Yet they held only about 39.4% of science and engineering degrees. In line with the ratios examined just previously, men had a stronger inclination to go into the sciences and engineering than women. In the median state, 42.3% of all degrees held by men were in science and engineering, while only 24.1% of degrees held by women were. The ratio of those percentages was 1.75, which was the largest of all the college majors, with the business major having the second greatest at 1.46. The discrepancies in total number of degrees, state medians, and state median percentages all point towards some societal trend for women not to enter the sciences or engineering. One paper from 2009 that analyzed a survey of 161 students from Northwestern University determined through econometric modeling that the most significant reason for women deciding not to enter particular majors was linked to expectations of enjoyment of the coursework. The author suggested that the difference in expectations between men and women for particular department courses might be linked to gender discrimination in society (Zafar 29). In an article from 1984 written using data from the National Longitudinal Studies of the High School Class of 1972, the authors argued that "substantial differences appear in their preferences as of [the students'] senior year in high school, for various types of work and in their subsequent preparation for the labor market during college" (Daymont 414). Again, an economic regression model based on survey answers determined which factors contributed most to the gender gap.
If the preferences and impressions of students on their career choice impact college major selection and therefore career path and earnings, then the data seen in the American Community Survey from 2015 add support to the idea that the gender gap begins to take form even before students finish their education.

options(scipen=2000000) #a large scipen penalty makes R print numbers in fixed rather than scientific notation
summary(lm(as.numeric(subset$SciEngFemale) ~ as.numeric(subset$SciEngMale)))

Call:
lm(formula = as.numeric(subset$SciEngFemale) ~ as.numeric(subset$SciEngMale))

Residuals:
   Min     1Q Median     3Q    Max 
-78668  -9063   -421   6422  97610 

Coefficients:
                                  Estimate   Std. Error t value
(Intercept)                   -3391.997658  4208.916304  -0.806
as.numeric(subset$SciEngMale)     0.676498     0.009741  69.447
                                         Pr(>|t|)    
(Intercept)                                 0.424    
as.numeric(subset$SciEngMale) <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23820 on 50 degrees of freedom
Multiple R-squared:  0.9897,	Adjusted R-squared:  0.9895 
F-statistic:  4823 on 1 and 50 DF,  p-value: < 0.00000000000000022

The code above outputs a linear regression summary in which the raw count of female-held science and engineering degrees is regressed on the raw count of male-held science and engineering degrees for each state, DC, and Puerto Rico. The same summary is then produced for the four other types of majors. The summary statistics that support the claim that women systematically avoid majoring in science or engineering are analyzed further on.

summary(lm(as.numeric(subset$SciEngRelatedFemale) ~ as.numeric(subset$SciEngRelatedMale)))

Call:
lm(formula = as.numeric(subset$SciEngRelatedFemale) ~ as.numeric(subset$SciEngRelatedMale))

Residuals:
   Min     1Q Median     3Q    Max 
-30447  -5203  -2189   6499  34268 

Coefficients:
                                       Estimate Std. Error t value
(Intercept)                          7139.04527 2119.41889   3.368
as.numeric(subset$SciEngRelatedMale)    2.32544    0.04122  56.413
                                                 Pr(>|t|)    
(Intercept)                                       0.00146 ** 
as.numeric(subset$SciEngRelatedMale) < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11470 on 50 degrees of freedom
Multiple R-squared:  0.9845,	Adjusted R-squared:  0.9842 
F-statistic:  3182 on 1 and 50 DF,  p-value: < 0.00000000000000022

summary(lm(as.numeric(subset$BusinessFemale) ~ as.numeric(subset$BusinessMale)))

Call:
lm(formula = as.numeric(subset$BusinessFemale) ~ as.numeric(subset$BusinessMale))

Residuals:
   Min     1Q Median     3Q    Max 
-42580  -3227    412   2940  56153 

Coefficients:
                                   Estimate  Std. Error t value
(Intercept)                     -1341.27859  2532.33558   -0.53
as.numeric(subset$BusinessMale)     0.80450     0.01122   71.68
                                           Pr(>|t|)    
(Intercept)                                   0.599    
as.numeric(subset$BusinessMale) <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13810 on 50 degrees of freedom
Multiple R-squared:  0.9904,	Adjusted R-squared:  0.9902 
F-statistic:  5138 on 1 and 50 DF,  p-value: < 0.00000000000000022

summary(lm(as.numeric(subset$EducationFemale) ~ as.numeric(subset$EducationMale)))

Call:
lm(formula = as.numeric(subset$EducationFemale) ~ as.numeric(subset$EducationMale))

Residuals:
   Min     1Q Median     3Q    Max 
-62856  -8567  -1395   7101  82946 

Coefficients:
                                    Estimate  Std. Error t value
(Intercept)                      -4432.65497  4627.01232  -0.958
as.numeric(subset$EducationMale)     3.35557     0.08909  37.666
                                            Pr(>|t|)    
(Intercept)                                    0.343    
as.numeric(subset$EducationMale) <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22270 on 50 degrees of freedom
Multiple R-squared:  0.966,	Adjusted R-squared:  0.9653 
F-statistic:  1419 on 1 and 50 DF,  p-value: < 0.00000000000000022

summary(lm(as.numeric(subset$HumanitiesFemale) ~ as.numeric(subset$HumanitiesMale)))

Call:
lm(formula = as.numeric(subset$HumanitiesFemale) ~ as.numeric(subset$HumanitiesMale))

Residuals:
   Min     1Q Median     3Q    Max 
-60060  -6516   2465   6738  59227 

Coefficients:
                                     Estimate  Std. Error t value
(Intercept)                       -8813.81804  2759.08699  -3.194
as.numeric(subset$HumanitiesMale)     1.39733     0.01402  99.692
                                              Pr(>|t|)    
(Intercept)                                    0.00243 ** 
as.numeric(subset$HumanitiesMale) < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15380 on 50 degrees of freedom
Multiple R-squared:  0.995,	Adjusted R-squared:  0.9949 
F-statistic:  9938 on 1 and 50 DF,  p-value: < 0.00000000000000022

The relationship between the number of female degree holders and the number of male degree holders is positive for each of the five majors. Looking at the raw counts of bachelor's degree holders, this result is not surprising: states with more degree holders of one gender tend to have more of the other as well. However, the smallest slope was for science and engineering at 0.68, while the other slopes were 0.80 (business), 1.40 (humanities and arts), 2.33 (science and engineering related), and 3.36 (education). Because the p-value for each slope is negligibly close to zero, the slope coefficients are all statistically significant.
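The five regressions can also be run in a loop so that the slopes land side by side in one vector. The sketch below is one possible way to do this, shown on simulated data with two made-up majors; the data frame "toy" and its true slopes of 0.7 and 0.8 are assumptions for illustration, not the survey data.

```r
set.seed(1)
n <- 52  # same number of observations as the state-level data

# Simulated male counts and female counts with known slopes (hypothetical data)
toy <- data.frame(
  SciEngMale   = runif(n, 1e4, 1e6),
  BusinessMale = runif(n, 1e4, 1e6)
)
toy$SciEngFemale   <- 0.7 * toy$SciEngMale   + rnorm(n, sd = 5000)
toy$BusinessFemale <- 0.8 * toy$BusinessMale + rnorm(n, sd = 5000)

majors <- c("SciEng", "Business")

# Fit Female ~ Male for each major and keep only the slope
slopes <- sapply(majors, function(m) {
  f <- reformulate(paste0(m, "Male"), response = paste0(m, "Female"))
  unname(coef(lm(f, data = toy))[2])
})
slopes
```

With the real "subset" dataframe, the count columns would first need to be converted with "as.numeric", as in the regressions above.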

ggplot(subset, aes(as.numeric(SciEngMale), as.numeric(SciEngFemale)))  +
      geom_point()+
      #scale_x_continuous(name="Total Male SciEng Degree Holders", limits=c(0, 150000)) +
      #scale_y_continuous(name="Total Female SciEng Degree Holders", limits=c(0, 150000))+
      labs(x= "Total Male SciEng Degree Holders by State") +
      labs(y = "Total Female SciEng Degree Holders by State")+ ylim(0,2000000)+
      labs(title= "Relationship Between Male and Female SciEng Degree Holders") + 
      stat_smooth(method = lm, se = FALSE, color = "black") +
      geom_vline(xintercept = 134349, linetype="dotted", colour="red")+
      geom_hline(yintercept =  84527, linetype="dotted", colour="red")+
      geom_vline(xintercept = 0)+
      geom_hline(yintercept = 0)+
      annotate("text", label = "r^2 == 0.9895", parse = TRUE,x= 1400000, y = 1500000) +
      annotate("text", label = "slope = 0.676498", x= 1475000, y = 1250000)



#qplot(as.numeric(SciEngMale), as.numeric(SciEngFemale), data = subset, color = I("darkblue"),
#      xlab = "Total Male SciEng Degree Holders", ylab = "Total Female SciEng Degree Holders", 
#      main = "Relationship Between Total Male and Female SciEng Degree Holders") + geom_smooth(method = "lm", se = FALSE)
#qplot version of the above ggplot

The ggplot above is a visual representation of the first of the five linear regressions run earlier. The x-axis represents the total number of male science or engineering bachelor's degree holders aged twenty-five and older from each of the 50 states, DC, and Puerto Rico. The y-axis represents the same figure for women. The median number of male degree holders with a major in science or engineering was 134,349, while the median for females was 84,527. The dotted crosshairs intersect at this point. Also, the r-squared value indicates that almost ninety-nine percent of the variability in the number of female science and engineering degree holders is accounted for by variability in the number of male science or engineering degree holders. This value is very high for a regression with just one independent variable, but the r-squared values for the regressions on the other college major categories are similarly high. Outside of data distortion possibilities, this indicates that the number of male degree holders in a state for a particular major in 2015 was a very precise indicator of the number of female degree holders for that major.
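For a regression with a single predictor, r-squared is the squared correlation between the two variables. The sketch below illustrates this on simulated counts; the vectors "male" and "female", the slope of 0.68, and the noise level are assumptions chosen only to resemble the scale of the state data.

```r
set.seed(42)

# Simulated state-level counts (hypothetical, not the survey data)
male   <- runif(52, 1e4, 2e6)
female <- 0.68 * male + rnorm(52, sd = 25000)

fit <- lm(female ~ male)
r2  <- summary(fit)$r.squared

# In simple linear regression, r-squared equals the squared correlation
all.equal(r2, cor(male, female)^2)
```

When counts range from tens of thousands to millions across states, even moderately noisy data can produce an r-squared this close to one, which is one reason the percentage-based regression is also worth examining.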

summary(lm(subset$percentSciEngFemale ~ subset$percentSciEngMale))
ggplot(subset, aes(percentSciEngMale, percentSciEngFemale))  +
      geom_point()+
      labs(x= "Percentage of Male SciEng Degree Holders by State") + xlim(32.5,52.5)+ 
      labs(y = "Percentage of Female SciEng Degree Holders by State")+ ylim(15.25,35.25)+
      labs(title= "Relationship Between Male and Female SciEng Degree Holders by Percentage") + 
      stat_smooth(method = lm, se = FALSE, color = "black")

Call:
lm(formula = subset$percentSciEngFemale ~ subset$percentSciEngMale)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.9425 -1.5519  0.5676  1.1299  7.3309 

Coefficients:
                          Estimate Std. Error t value            Pr(>|t|)    
(Intercept)              -17.19788    3.77537  -4.555 0.00003383556251666 ***
subset$percentSciEngMale   0.99056    0.08823  11.227 0.00000000000000284 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.589 on 50 degrees of freedom
Multiple R-squared:  0.716,	Adjusted R-squared:  0.7103 
F-statistic: 126.1 on 1 and 50 DF,  p-value: 0.000000000000002836

Warning message:
"Removed 1 rows containing non-finite values (stat_smooth)."
Warning message:
"Removed 1 rows containing missing values (geom_point)."

Looking at the percentage of degree holders who majored in science and engineering for each state, as opposed to the raw counts, a nearly one-to-one slope is seen, computed in the regression above as about 0.99. This relationship may at first appear to conflict with the interpretation of the raw counts, but because the observations are now percentages rather than counts, the slope actually further supports the notion that the share of women in science and engineering is significantly lower than that of men. For every one percentage point increase in the percentage of male bachelor's degree holders who majored in science and engineering, the percentage of female degree holders with a major in science and engineering is expected to increase by about one percentage point as well. However, the percentages for women, as seen on the y-axis, are all lower by roughly the 18 percentage point difference between the median state values for the two genders, 42.3% for men and 24.1% for women. Importantly, this relationship indicates that states with higher percentages of both men and women holding science degrees have more similar percentages, in relative terms, than states with lower figures: with a roughly constant gap between the two, a larger numerator and denominator give a fraction closer to one. This will be more visually apparent later on by taking the ratio of these two variables and mapping the state values with a choropleth.
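The effect on the ratio can be worked through with the fitted coefficients from the regression output above. The three male percentages below are illustrative values within the plotted range, not particular states.

```r
# Coefficients from the percentage regression output above
intercept <- -17.19788
slope     <- 0.99056

# Illustrative male SciEng percentages (low, median, high)
male_pct <- c(35, 42.3, 50)

# Predicted female percentage and the resulting female-to-male ratio
female_pct <- intercept + slope * male_pct
ratio <- female_pct / male_pct
round(ratio, 2)  # 0.50 0.58 0.65
```

The gap stays near 17 to 18 percentage points at every level, but the predicted ratio rises as the male percentage rises.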

Initializing state longitude and latitude data and creating choropleth graphs will help to get a better sense of the regional implications of these findings.

states <- map_data("state")
head(states)
dim(states)
head(subset)

names(subset) <- tolower(names(subset))
subset$region <- tolower(subset$state)
head(subset)

The code above renames the columns of the original "subset" data in lowercase and adds a final column named "region" that holds each state's name in lowercase, matching the "region" column of the map data.

choro_df <- merge(states, subset, by = "region") #merge(df1,df2,by="column vector")
head(choro_df)

With a matching "region" column in both dataframes, they can be merged with the "merge" command; the result is then sorted by the "order" column so that the state outlines are drawn correctly.
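A small sketch of the merge step on made-up data is given below. The data frames "polygons" and "values" are hypothetical stand-ins for the map data and the "subset" dataframe. Because "merge" keeps only regions present in both dataframes by default, any region missing from the map data drops out; the "state" map covers the 48 contiguous states and DC, so Alaska, Hawaii, and Puerto Rico would not appear.

```r
# Hypothetical stand-in for map_data("state"): a few outline points per region
polygons <- data.frame(
  region = c("ohio", "ohio", "utah", "utah"),
  order  = c(2, 1, 4, 3),
  long   = c(-84, -83, -112, -111),
  lat    = c(40, 41, 39, 40),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the state-level figures
values <- data.frame(
  State       = c("Ohio", "Utah", "Puerto Rico"),
  sciengratio = c(0.55, 0.60, 0.50),
  stringsAsFactors = FALSE
)
values$region <- tolower(values$State)  # lowercase key to match the map data

merged <- merge(polygons, values, by = "region")  # inner join on "region"
merged <- merged[order(merged$order), ]           # restore drawing order

# Regions with data but no polygons are silently dropped by the merge
setdiff(values$region, polygons$region)
```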

The next two maps entitled "Women Degree Holders with a Major in Science/Engineering" and "Men Degree Holders with a Major in Science/Engineering" display the findings of the raw count statistical regression analysis conducted earlier. States with more men with science and engineering degrees also have more women with science and engineering degrees. The legend next to each will indicate the relative gap between the two genders in number of degrees. In a regional context, because of the strong direct connection between the two counts of science and engineering degrees, the maps look very similar.

choro <- choro_df[order(choro_df$order),]  #order by "order" column
head(choro)

choro$breaks <- cut(as.numeric(choro$sciengfemale),breaks = seq(0,1400000, by = 100000), include.lowest = TRUE, 
                    labels = c("0-100,000","100,001-200,000","200,001-300,000","300,001-400,000","400,001-500,000",
                              "500,001-600,000","600,001-700,000","700,001-800,000","800,001-900,000","900,001-1,000,000",
                              "1,000,001-1,100,000","1,100,001-1,200,000","1,200,001-1,300,000","1,300,001-1,400,000"))
                              
#choro$breaks <- cut(as.numeric(choro$sciengfemale),breaks = seq(0,1500000, by = 250000), include.lowest = TRUE, 
#                    labels = c("0-250,000","250,000-500,000","250,000-500,000",
#                              "500,000-750,000","750,000-1,000,000","1,000,000-1,250,000"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Women Degree Holders with a Major in Science/Engineering") + 
    scale_fill_brewer(name = "Number of Degrees", palette = "Reds")

For the raw degree count map for women above, bins of 100,000 were used, while bins of 200,000 are used for the men's map that follows. Even with the difference in intervals, the trend of all states having about 1.5 times as many degrees in the sciences for men is apparent through the coloration similarities.
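The bin labels can also be generated from the break points instead of being typed by hand, which keeps the two in step when the interval width changes. This is a sketch of one way to do it; the vector "counts" holds made-up values apart from 1,386,950, the largest female count that appears in the code later in this presentation.

```r
breaks <- seq(0, 1400000, by = 100000)

lower <- head(breaks, -1)   # left edge of each bin
upper <- tail(breaks, -1)   # right edge of each bin

# Label style used above: "0-100,000", "100,001-200,000", ...
lower_label <- ifelse(lower == 0, 0, lower + 1)
labels <- paste0(
  formatC(lower_label, format = "d", big.mark = ","), "-",
  formatC(upper,       format = "d", big.mark = ",")
)

# Made-up counts, plus the largest female SciEng count
counts <- c(50000, 100000, 250000, 1386950)
binned <- cut(counts, breaks = breaks, include.lowest = TRUE, labels = labels)
as.character(binned)
```

The same helper code works for the men's map by changing the "seq" call to the 200,000-wide breaks.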

choro$breaks <- cut(as.numeric(choro$sciengmale),breaks = seq(0,2200000, by = 200000), include.lowest = TRUE, 
                    labels = c("0-200,000","200,001-400,000","400,001-600,000","600,001-800,000","800,001-1,000,000",
                              "1,000,001-1,200,000","1,200,001-1,400,000","1,400,001-1,600,000","1,600,001-1,800,000","1,800,001-2,000,000",
                              "2,000,001-2,200,000"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Men Degree Holders with a Major in Science/Engineering") + 
    scale_fill_brewer(name = "Number of Degrees", palette = "Blues")

While the raw counts suggest that there is no regional difference in degree counts for men and women in the sciences, looking at the percentage of degrees for each state that are in science or engineering paints a different picture. Comparing the two maps, the regions with high percentages for men are not always the regions with high percentages for women. Some states do have high percentages for both, like California and New York, but others, like Wyoming and Florida, have high percentages for men and relatively lower percentages for women. Aside from the regional differences, the maps further the argument that more men are in the sciences than women. The degree rates are significantly lower for women, as shown earlier by the median percentage of men and women with science and engineering degrees.

choro$breaks <- cut(choro$percentsciengfemale,breaks = seq(15,45, by = 5), include.lowest = TRUE, 
                    labels = c("15%-20%","20%-25%","25%-30%",
                              "30%-35%","35%-40%","40%-45%"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Percentage of Women Degree Holders with a Major in Science/Engineering") + 
    scale_fill_brewer(name = "Degree Rates", palette = "Reds")

choro$breaks <- cut(choro$percentsciengmale,breaks = seq(30,55, by = 5), include.lowest = TRUE, 
                    labels = c("30%-35%","35%-40%","40%-45%",
                              "45%-50%","50%-55%"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Percentage of Men Degree Holders with a Major in Science/Engineering") + 
    scale_fill_brewer(name = "Degree Rates", palette = "Blues")

Taking the ratio of the percentage of women with science and engineering degrees to the percentage of men with science and engineering degrees, we can see from one map, rather than two, how region affects the relative rates of science and engineering degrees for men and women. Darker states indicate higher ratios of women to men, but the ratio never reaches the value of one. A general trend is that the west and east coasts have higher ratios than the rest of the country. Unlike the first two maps, which displayed very similar regional patterns, this ratio map distinctly displays how different parts of the country have different magnitudes of college major gender bias.

choro$breaks <- cut(choro$percentsciengfemale/choro$percentsciengmale,breaks = seq(0.35,0.85,by = 0.05), include.lowest = TRUE, 
                    labels = c("0.35-0.40","0.40-0.45","0.45-0.50","0.50-0.55",
                              "0.55-0.60","0.60-0.65","0.65-0.70","0.70-0.75","0.75-0.80","0.80-0.85"))
#choro$breaks = cut(choro$percentsciengfemale/choro$percentsciengmale, 6)
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Ratio of Percentages for Women to Men with Major in Science/Engineering") + 
    scale_fill_brewer(name = "Ratio of Percents",
                       palette = "Purples")

desired_columns <- c(3,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72)  #re-inputting the original "subset" dataframe not changed by the choropleth merging
desired_rows <- seq(2,53) #all states
subset <- bachelors[desired_rows, desired_columns] 
colnames(subset) <- c("State","Total","TotalMale","TotalFemale",
                      "SciEngTotal", "SciEngMale", "SciEngFemale",
                      "SciEngRelatedTotal", "SciEngRelatedMale", "SciEngRelatedFemale",
                      "BusinessTotal","BusinessMale","BusinessFemale",
                      "EducationTotal","EducationMale","EducationFemale",
                     "HumanitiesTotal","HumanitiesMale","HumanitiesFemale")

# Percentage of each gender's degree holders with a science/engineering major
subset$percentSciEngMale <- as.numeric(subset$SciEngMale)/as.numeric(subset$TotalMale)*100
subset$percentSciEngFemale <- as.numeric(subset$SciEngFemale)/as.numeric(subset$TotalFemale)*100

# Female-to-male ratio of those percentages
subset$SciEngRatio <- subset$percentSciEngFemale/subset$percentSciEngMale

head(subset)

sort(as.numeric(subset$SciEngMale), decreasing = TRUE)[1:10]
subset[which(subset$SciEngMale == "2040615"), "State"]
subset[which(subset$SciEngMale == "1074765"), "State"]
subset[which(subset$SciEngMale == "912146"), "State"]

sort(as.numeric(subset$SciEngFemale), decreasing = TRUE)[1:10]
subset[which(subset$SciEngFemale == "1386950"), "State"]
subset[which(subset$SciEngFemale == "711283"), "State"]
subset[which(subset$SciEngFemale == "645017"), "State"]

By sorting the science and engineering degree counts in descending order, it is possible to identify which states the largest counts belong to using the "which" function. California has the most male and female science and engineering degree holders. New York and Texas have the next highest counts for both.
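The lookups above match on values copied by hand from the sort output, which is fragile if the data changes. A possible alternative, sketched here under the assumption of a toy data frame and a hypothetical helper named `top_states` (neither is part of the original analysis), is to let order() return the row positions and read the state names from them directly.

```r
# Toy data: six states from the head(subset) output, with counts stored as
# character strings to mimic how the columns arrive from read.csv in this notebook.
toy <- data.frame(
  State        = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  SciEngMale   = c("146493", "29875", "262739", "78847", "2040615", "331019"),
  SciEngFemale = c("86455", "20712", "153871", "47663", "1386950", "222477"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the n states with the largest values in `column`.
top_states <- function(df, column, n = 3) {
  values <- as.numeric(df[[column]])                 # convert text counts to numbers
  n <- min(n, nrow(df))                              # guard against short data frames
  idx <- order(values, decreasing = TRUE)[seq_len(n)]
  data.frame(State = df$State[idx], Value = values[idx], stringsAsFactors = FALSE)
}

top_male   <- top_states(toy, "SciEngMale")
top_female <- top_states(toy, "SciEngFemale")
print(top_male)
print(top_female)
```

With the full data frame, the same call would be `top_states(subset, "SciEngMale")`, assuming the column names assigned earlier in the notebook.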

sort(as.numeric(subset$percentSciEngMale), decreasing = TRUE)[1:10]
subset[which(subset$percentSciEngMale == "51.9724320009824"), "State"]
subset[which(subset$percentSciEngMale == "50.63696225976"), "State"]
subset[which(subset$percentSciEngMale == "49.9569814248343"), "State"]

sort(as.numeric(subset$percentSciEngFemale), decreasing = TRUE)[1:10]
subset[which(subset$percentSciEngFemale == "41.6149338278437"), "State"]
subset[which(subset$percentSciEngFemale == "33.1419517414361"), "State"]
subset[which(subset$percentSciEngFemale == "32.9437484466618"), "State"]

Washington, DC has the highest percentage of science and engineering degree holders among both men and women. Washington (state) and Maryland have high concentrations for men, while Massachusetts and Virginia have high levels for women.

subset[which(subset$percentSciEngMale > 46), "State"]
subset[which(as.numeric(subset$SciEngMale) > 410000), "State"]

order(subset$percentSciEngMale)[42:52]
order(as.numeric(subset$SciEngMale))[42:52]

subset[which(subset$percentSciEngFemale > 29), "State"]
subset[which(as.numeric(subset$SciEngFemale) > 290000), "State"]

order(subset$percentSciEngFemale)[42:52]
order(as.numeric(subset$SciEngFemale))[42:52]
order(subset$SciEngRatio)[42:52]

Looking at the state names and row numbers returned by the "which" and "order" functions, the states with the highest concentrations of science and engineering majors for both men and women do not necessarily match the states with the highest degree counts. Likewise, the states with the highest female-to-male percentage ratios do not necessarily match the states with the highest female percentages. What these figures indicate is that although there appears to be a relatively constant 3:2 ratio of male to female science and engineering degree totals across the country, the percentages of men and women in the sciences at the state level are not nearly as consistent. Therefore, the level of gender disparity in the sciences depends on the region of the country. Note that the "which" commands list states in row order, which is alphabetical, while the "order" commands return row positions sorted by the chosen variable in ascending order, which is why positions 42 to 52 are used to index the largest values.
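The difference between "which" and "order", and the contrast between the count ratio and the percentage ratio, can be seen on a small example. This is a sketch only, assuming a toy data frame (`toy`, a hypothetical name) holding the six states from the head(subset) output; six states cannot confirm or refute the nationwide pattern.

```r
# Toy data: six states copied from the head(subset) output in this notebook.
toy <- data.frame(
  State        = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  TotalMale    = c(366201, 67843, 621477, 198339, 4123037, 705075),
  TotalFemale  = c(426675, 71573, 635972, 235042, 4292653, 735701),
  SciEngMale   = c(146493, 29875, 262739, 78847, 2040615, 331019),
  SciEngFemale = c(86455, 20712, 153871, 47663, 1386950, 222477),
  stringsAsFactors = FALSE
)

# Male-to-female ratio of raw degree counts (the "3:2" comparison in the text).
toy$countRatio <- toy$SciEngMale / toy$SciEngFemale

# Female-to-male ratio of within-gender percentages (the quantity mapped earlier).
toy$pctMale   <- toy$SciEngMale / toy$TotalMale * 100
toy$pctFemale <- toy$SciEngFemale / toy$TotalFemale * 100
toy$pctRatio  <- toy$pctFemale / toy$pctMale

# which() returns matching row positions in row order (alphabetical here).
above_29 <- which(toy$pctFemale > 29)

# order() returns row positions sorted by value, smallest first,
# so the largest values sit at the end of the result.
ascending <- order(toy$pctFemale)
largest_two <- toy$State[tail(ascending, 2)]

print(toy[, c("State", "countRatio", "pctRatio")])
print(toy$State[above_29])
print(largest_two)
```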

summary(lm(log(as.numeric(subset$SciEngRatio)) ~ log(as.numeric(subset$SciEngTotal))))

Call:
lm(formula = log(as.numeric(subset$SciEngRatio)) ~ log(as.numeric(subset$SciEngTotal)))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.35757 -0.07210 -0.01002  0.07399  0.34999 

Coefficients:
                                    Estimate Std. Error t value    Pr(>|t|)    
(Intercept)                         -1.02118    0.18148  -5.627 0.000000827 ***
log(as.numeric(subset$SciEngTotal))  0.03825    0.01454   2.630      0.0113 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1118 on 50 degrees of freedom
Multiple R-squared:  0.1215,	Adjusted R-squared:  0.104 
F-statistic: 6.918 on 1 and 50 DF,  p-value: 0.01131

The regression above serves to establish that the ratio map is not strongly correlated with the total number of science and engineering degrees per state. This was an area of concern because the science and engineering degree count maps and the ratio map look similar. If the ratio map were simply showing that states with more science and engineering degree holders had higher percentages of women in the sciences relative to men, it might have indicated nothing more than that more degree holders equates to more equitable percentages. The log was taken of both the dependent variable, the science and engineering major gender ratio, and the independent variable, the total number of science and engineering degrees per state, so that the slope measures the percentage change in the ratio associated with a percentage change in the number of degrees. Otherwise, the difference in units would have made the coefficient difficult to interpret. Although the slope of 0.03825 is statistically significant at the 5% level, it is small, and the R-squared of 0.12 shows that degree totals explain little of the variation in the ratio. There is therefore only a weak relationship between the number of degrees in a state and how much of a gender bias there is in the percentages of men and women majoring in the sciences and engineering. A one percent increase in the number of degrees in a state is associated with only about a 0.038% increase in the ratio.
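In a log-log model the slope can be read as an elasticity: scaling the independent variable by some multiplier scales the predicted dependent variable by that multiplier raised to the slope. The sketch below applies this to the reported coefficient; the helper name `pct_change_in_ratio` is hypothetical and introduced only for illustration.

```r
# Coefficient on log(SciEngTotal) taken from the regression output above.
slope <- 0.03825

# Hypothetical helper: predicted percentage change in the gender ratio when the
# degree total is multiplied by `multiplier`, under the fitted log-log relationship
# ratio_new / ratio_old = multiplier^slope.
pct_change_in_ratio <- function(multiplier, slope) {
  (multiplier^slope - 1) * 100
}

one_percent <- pct_change_in_ratio(1.01, slope)  # 1% more degrees
doubling    <- pct_change_in_ratio(2, slope)     # twice as many degrees

cat(sprintf("1%% increase in degrees -> %.3f%% change in ratio\n", one_percent))
cat(sprintf("doubling of degrees    -> %.2f%% change in ratio\n", doubling))
```

Under these assumptions a 1% increase in degrees corresponds to roughly a 0.038% change in the ratio, and even a doubling of degrees corresponds to a change of under 3%.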

There is not a lot of literature that analyzes region, college major choice, and gender together in the same context. However, given that preferences are a significant factor in why women choose not to major in fields like engineering, the next step in building on the findings of this presentation is to bring regional influence into the narrative by proposing that certain parts of the country create environments in which women form more positive impressions of the sciences. In his Federal Reserve Bank of New York Staff Report, Basit Zafar found through his econometric model mentioned earlier that "60% of the gender gap in engineering is due to differences in preferences, while 30% is due to differences in how much females and males believe they will enjoy studying engineering" (Zafar 4). Other explanations have been offered for why students choose one major over another, including a strong emphasis on the connection between major choice and political ideology. One 2006 paper found that "liberal students [are] more likely to choose a non-science major" (Porter and Umbach 2006). This explanation seems to run counter to the finding that the typically more liberal coastal regions of the United States have high percentages of science degrees relative to total degrees for both men and women. However, the survey used for Porter and Umbach's paper covered only one highly selective liberal arts college, and the authors acknowledged that the results cannot be extrapolated to a larger sample of students from different types of schools. Similarly, the 2015 data examined in this paper is a single snapshot in time of the relationship between gender, major, and region. Further temporal analysis should be considered to determine how the landscape of gender imbalance in science and engineering major selection is changing.

The complexities of gender gap analysis go beyond data limitations. The chosen scope of the inspection inherently changes the range of possible interpretations. In a study that examined not only gender but also socioeconomic status (SES), Ma found "that women from lower SES backgrounds are as likely as their male counterparts to choose a lucrative college major" and "the role of lucrative college major choice in potentially uplifting students’ and their families’ SES outweighs the traditional gender role socialization that contributes to the divergent career paths toward which men and women are oriented" (Ma 228). In a paper on citizenship status, Nores found "a higher propensity to enroll in SEM [Science, Engineering and Math] fields for foreign-born populations and a lower propensity to enroll in social sciences compared to citizens" (Nores 138). In order to completely disaggregate all of the possible effects on the gender gap in college major choice, all conceivable variables would have to be included in the analysis.

Although the spatial maps and calculated ratios suggest that different parts of the country experience different magnitudes of gender bias in college major choice, proving a geographic cause is not within the scope of this presentation. However, if government policies, educational backgrounds, or cultural differences are attached to region, then the analysis conducted may be a starting point in identifying why women across the country are systematically choosing, to varying degrees, not to go into the sciences or engineering during their undergraduate careers. Additionally, college students do not necessarily come from the state they study in. State bias might then indicate quality discrepancies between academic institutions in particular states rather than any differences in gender equality. Better schools might have more resources for scientific and engineering research. What the US Census Bureau American Community Survey data do show is that women hold science and engineering degrees at lower rates than men in every state, and that there is at least an indirect relationship between region and the level of gender disparity.

Bibliography

Daymont, Thomas N., and Paul J. Andrisani. "Job Preferences, College Major, and the Gender Gap in Earnings." The Journal of Human Resources 19, no. 3 (1984): 408-28. doi:10.2307/145880.

Ma, Yingyi. "Family Socioeconomic Status, Parental Involvement, and College Major Choices—Gender, Race/Ethnic, and Nativity Patterns." Sociological Perspectives 52, no. 2 (2009): 211-34. doi:10.1525/sop.2009.52.2.211.

Nores, Milagros. "Differences in College Major Choice by Citizenship Status." The Annals of the American Academy of Political and Social Science 627 (2010): 125-41. http://www.jstor.org/stable/40607409.

Porter, Stephen R., and Paul D. Umbach. "College Major Choice: An Analysis of Person–Environment Fit." Research in Higher Education 47, no. 4 (2006): 429-49. doi:10.1007/s11162-005-9002-3.

United States Census Bureau. (2015). American Community Survey [bachelors.csv]. Retrieved from http://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml

Zafar, Basit. "College Major Choice and the Gender Gap." SSRN Electronic Journal (2013): 1-50. doi:10.2139/ssrn.1348219.

GEO.id	GEO.id2	GEO.display.label	HC01_EST_VC01	HC01_MOE_VC01	HC02_EST_VC01	HC02_MOE_VC01	HC03_EST_VC01	HC03_MOE_VC01	HC04_EST_VC01	...	HC02_EST_VC27	HC02_MOE_VC27	HC03_EST_VC27	HC03_MOE_VC27	HC04_EST_VC27	HC04_MOE_VC27	HC05_EST_VC27	HC05_MOE_VC27	HC06_EST_VC27	HC06_MOE_VC27
Id	Id2	Geography	Total; Estimate; Total population 25 years and over with a Bachelor's degree or higher	Total; Margin of Error; Total population 25 years and over with a Bachelor's degree or higher	Percent; Estimate; Total population 25 years and over with a Bachelor's degree or higher	Percent; Margin of Error; Total population 25 years and over with a Bachelor's degree or higher	Males; Estimate; Total population 25 years and over with a Bachelor's degree or higher	Males; Margin of Error; Total population 25 years and over with a Bachelor's degree or higher	Percent Males; Estimate; Total population 25 years and over with a Bachelor's degree or higher	...	Percent; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Percent; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Males; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Males; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Percent Males; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Percent Males; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Females; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Females; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Percent Females; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Percent Females; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others
0400000US01	1	Alabama	792876	14677	(X)	(X)	366201	9531	(X)	...	17.8	1.6	14284	1696	16.4	1.8	14435	1926	19.4	2.4
0400000US02	2	Alaska	139416	4807	(X)	(X)	67843	3222	(X)	...	18.9	3.6	1951	565	17.1	4.4	2090	580	20.9	5.2
0400000US04	4	Arizona	1257449	16239	(X)	(X)	621477	10626	(X)	...	18.4	1	26996	2115	15.6	1.1	30518	2468	21.9	1.6
0400000US05	5	Arkansas	433381	7690	(X)	(X)	198339	5631	(X)	...	15.8	1.7	6777	1168	14	2.4	7330	1299	17.9	2.6
0400000US06	6	California	8415690	37555	(X)	(X)	4123037	23131	(X)	...	23.6	0.5	157982	5589	18.9	0.6	211503	6420	29.1	0.9

	State	Total	SciEng	SciEngRelated	Business	Education	HumArts
2	Alabama	792876	232948	79424	182750	135842	161912
3	Alaska	139416	50587	12854	22495	20385	33095
4	Arizona	1257449	416610	119456	271892	183272	266219
5	Arkansas	433381	126510	42437	95297	83480	85657
6	California	8415690	3427565	674278	1586921	563581	2163345
7	Colorado	1440776	553496	121176	286139	147138	332827
8	Connecticut	948044	342674	80873	187022	100734	236741
9	Delaware	201929	70332	21046	44771	28364	37416
10	District of Columbia	268345	125167	11869	33555	11676	86078
11	Florida	4092338	1283693	414601	995869	589792	808383
12	Georgia	2000113	641664	178239	485769	273021	421420
13	Hawaii	309194	110773	27619	61107	41200	68495
14	Idaho	276912	92786	30241	50346	45560	57979
15	Illinois	2853540	932408	269244	619650	384915	647323
16	Indiana	1088120	305153	134069	226250	185666	236982
17	Iowa	556591	169312	54940	116686	103090	112563
18	Kansas	599063	175589	64756	129649	105871	123198
19	Kentucky	696174	203689	78840	137156	119107	157382
20	Louisiana	718058	200267	88487	144280	121677	163347
21	Maine	289553	102732	28277	39879	45189	73476
22	Maryland	1591614	645663	138044	295580	157419	354908
23	Massachusetts	1951689	780836	158139	364241	175035	473438
24	Michigan	1870473	609561	194451	397446	277764	391251
25	Minnesota	1284007	429808	123550	252884	185440	292325
26	Mississippi	406599	100307	55009	84900	87213	79170
27	Missouri	1140860	342021	113391	246469	184485	254494
28	Montana	216174	71556	23757	35787	39252	45822
29	Nebraska	372288	103636	40424	84020	71497	72711
30	Nevada	463681	147490	42672	105846	61003	106670
31	New Hampshire	334313	123948	31664	63342	42482	72877
32	New Jersey	2318073	854811	195024	527062	259221	481955
33	New Mexico	364462	127327	32978	57214	58421	88522
34	New York	4778463	1623429	415315	895311	548835	1295573
35	North Carolina	1991057	687074	182823	402340	269287	449533
36	North Dakota	143403	40867	20868	27786	27259	26623
37	Ohio	2115116	650627	233599	448383	342930	439577
38	Oklahoma	630004	182916	61146	143828	122057	120057
39	Oregon	901667	347808	79655	131021	100430	242753
40	Pennsylvania	2641023	874686	276665	530252	396210	563210
41	Rhode Island	238818	80596	22368	45589	29037	61228
42	South Carolina	890241	283678	85236	197700	131685	191942
43	South Dakota	154885	47184	15976	29897	33840	27988
44	Tennessee	1151080	342842	117853	256625	171712	262048
45	Texas	4955374	1719782	445895	1164460	632224	993013
46	Utah	554712	180989	55741	104661	77687	135634
47	Vermont	162072	60132	12850	18942	22690	47458
48	Virginia	2102044	863355	159477	397325	201524	480363
49	Washington	1670893	685505	146243	274943	166411	397791
50	West Virginia	254414	74797	32048	44434	51203	51932
51	Wisconsin	1112458	340019	125112	225032	182504	239791
52	Wyoming	102034	35634	10228	14729	21140	20303
53	Puerto Rico	590228	146944	63591	190620	112350	76723

	State	Total	TotalMale	TotalFemale	SciEngTotal	SciEngMale	SciEngFemale	SciEngRelatedTotal	SciEngRelatedMale	SciEngRelatedFemale	...	percentSciEngFemale	percentSciEngRelatedMale	percentSciEngRelatedFemale	percentBusinessMale	percentBusinessFemale	percentEducationMale	percentEducationFemale	percentHumanitiesMale	percentHumanitiesFemale	SciEngRatio
2	Alabama	792876	366201	426675	232948	146493	86455	79424	20420	59004	...	20.26249	5.576173	13.82879	27.62663	19.12017	7.468576	25.42732	19.32518	21.36122	0.5065188
3	Alaska	139416	67843	71573	50587	29875	20712	12854	3428	9426	...	28.93829	5.052843	13.16977	18.57819	13.81946	9.125481	19.83150	23.20799	24.24098	0.6571582
4	Arizona	1257449	621477	635972	416610	262739	153871	119456	34583	84873	...	24.19462	5.564647	13.34540	25.39499	17.93601	7.067840	21.91087	19.69598	22.61310	0.5722941
5	Arkansas	433381	198339	235042	126510	78847	47663	42437	9224	33213	...	20.27850	4.650623	14.13067	27.88357	17.01526	8.265142	28.54256	19.44701	20.03302	0.5101041
6	California	8415690	4123037	4292653	3427565	2040615	1386950	674278	209270	465008	...	32.30986	5.075628	10.83265	20.86023	16.93233	3.194441	10.06075	21.37669	29.86442	0.6528166
7	Colorado	1440776	705075	735701	553496	331019	222477	121176	36832	84344	...	30.24014	5.223841	11.46444	23.05031	16.80261	5.171365	15.04361	19.60642	26.44920	0.6441191

long	lat	group	order	region	subregion
-87.46201	30.38968	1	1	alabama	NA
-87.48493	30.37249	1	2	alabama	NA
-87.52503	30.37249	1	3	alabama	NA
-87.53076	30.33239	1	4	alabama	NA
-87.57087	30.32665	1	5	alabama	NA
-87.58806	30.32665	1	6	alabama	NA
