Analyzing College Data About First-Generation Students in the United States

Introduction

I decided to look at the College Scorecard data and focus on the columns that contain information about first-generation college students. As a first-generation student of color, I have noticed that many other first-generation students take time off and sometimes do not return to Bowdoin. This observation inspired me to look at the completion rates of first-generation students at various types of institutions. I chose to look at colleges and universities whose highest degree awarded is a bachelor's degree. Restricting the data to these four-year institutions narrows it down and removes vocational institutions and institutions that award only associate's degrees. Given that many students do not graduate within four years, I look at how likely first-generation students are to graduate within six years.

Information About The Topic

First-generation students have to overcome many obstacles when transitioning into college. Since many first-generation students come from disadvantaged backgrounds, it is often difficult for them to complete college within four years. For low-income first-generation students, moving into college can be quite a culture shock. The demands from home and the financial constraints are often difficult to balance in college. These added stresses can lead first-generation students to feel distressed, to feel like they do not belong, and to give up on school altogether. Most of students' stress comes from financial constraints, so some studies have predicted that making college more affordable could help increase retention and completion rates among first-generation students. Despite this suggestion, first-generation students continue to face disadvantages that prevent them from completing college.

Research Question

What states have the lowest rates of first-generation students that graduate with a bachelor's degree within six years?

Hypothesis

I think it is logical to assume that bigger states such as Texas and California will have more first-generation students because they have a bigger pool of college-age students. Having a bigger pool of students means that it is more difficult to have higher completion rates when compared to other states. My prediction is that big states like Texas and California, along with the border states of New Mexico and Arizona, will have higher rates of first-generation students who graduate with a bachelor's degree within six years. Similarly, I expect that smaller states such as North Dakota, the District of Columbia, Vermont, and Connecticut will have lower completion rates because they have smaller populations.

Part One: Code and Subsetting the Data

\1. The code below loads the packages that I will be using in my notebook. They are especially important for my visuals and for merging the College Scorecard data with the states data.

In [1]:
library(ggplot2)
library(maps)
library(RColorBrewer)
library(rgdal)
library(sp)
library(rgeos)
library(maptools)
Warning message:
"package 'maps' was built under R version 3.3.3"
Warning message:
"package 'rgdal' was built under R version 3.3.3"
Loading required package: sp
Warning message:
"package 'sp' was built under R version 3.3.3"
rgdal: version: 1.2-6, (SVN revision 651)
 Geospatial Data Abstraction Library extensions to R successfully loaded
 Loaded GDAL runtime: GDAL 2.0.1, released 2015/09/15
 Path to GDAL shared files: C:/Users/Karla/Documents/R/win-library/3.3/rgdal/gdal
 Loaded PROJ.4 runtime: Rel. 4.9.2, 08 September 2015, [PJ_VERSION: 492]
 Path to PROJ.4 shared files: C:/Users/Karla/Documents/R/win-library/3.3/rgdal/proj
 Linking to sp version: 1.2-4
Warning message:
"package 'rgeos' was built under R version 3.3.3"
rgeos version: 0.3-23, (SVN revision 546)
 GEOS runtime version: 3.5.0-CAPI-1.9.0 r4084
 Linking to sp version: 1.2-4
 Polygon checking: TRUE

Warning message:
"package 'maptools' was built under R version 3.3.3"
Checking rgeos availability: TRUE

\2. The code below creates a data frame called states from the state map data in the maps package and then shows a table of its first six rows.

In [2]:
states <- map_data("state")
head(states)
long       lat       group  order  region   subregion
-87.46201  30.38968  1      1      alabama  NA
-87.48493  30.37249  1      2      alabama  NA
-87.52503  30.37249  1      3      alabama  NA
-87.53076  30.33239  1      4      alabama  NA
-87.57087  30.32665  1      5      alabama  NA
-87.58806  30.32665  1      6      alabama  NA

\3. I created a data frame called csc by loading a new spreadsheet, saved as a CSV file, that I created. It contains the following column variables:


INSTNM = Institution Name


STABBR = Abbreviated State Name


CONTROL = 1 for a Public School, 2 for a Private nonprofit, 3 for a Private for-profit School


LATITUDE


LONGITUDE


UGDS_HISP = Total Share of Enrollment of Undergraduate Degree-Seeking Students who are Hispanic


FIRSTGEN_COMP_ORIG_YR6_RT = Percent of First-Generation Students who Completed Within 6 Years at Original Institution


FIRST_GEN = Share/ Percentage of First-Generation Students


HIGHDEG = 1 for a Certificate, 2 for an Associate's Degree, 3 for a Bachelor's Degree, and 4 for a Graduate Degree


REGION2 = 1 for New England (CT, ME, MA, NH, RI, VT), 2 Mid East (DE, DC, MD, NJ, NY, PA), 3 Great Lakes (IL, IN, MI, OH, WI), 4 Plains (IA, KS, MN, MO, NE, ND, SD), 5 Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV), 6 Southwest (AZ, NM, OK, TX), 7 Rocky Mountains (CO, ID, MT, UT, WY), 8 Far West (AK, CA, HI, NV, OR, WA), 9 Outlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)

In [3]:
csc <- read.csv("College_Data_FirstGen.csv", header = TRUE, stringsAsFactors = FALSE)
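The tables later in this notebook show that missing rates are stored as the text values NULL and PrivacySuppressed, which is why the rate columns have to be converted with as.numeric further down. As a minimal sketch of an alternative, assuming a small inline CSV in place of College_Data_FirstGen.csv, read.csv can be told to treat those strings as NA when the file is loaded. The rows and the names csv_text and demo are made up for illustration and are not part of my data.

```r
# Toy stand-in for College_Data_FirstGen.csv (hypothetical rows, not real data)
csv_text <- "INSTNM,STABBR,FIRST_GEN,FIRSTGEN_COMP_ORIG_YR6_RT
College A,TX,0.40,0.38
College B,TX,0.50,PrivacySuppressed
College C,AL,NULL,0.58"

# na.strings lists the text values that read.csv should read as NA
demo <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE,
                 na.strings = c("NA", "NULL", "PrivacySuppressed"))

# The rate columns now arrive as numeric instead of character
str(demo)

# Averages can be taken directly, skipping the missing values
mean(demo$FIRSTGEN_COMP_ORIG_YR6_RT, na.rm = TRUE)
```

With this approach the later as.numeric calls would no longer produce coercion warnings, because the suppressed values are already NA.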

\4. The code below defines a function that turns abbreviated state names, such as those in the STABBR column, into lowercase full state names so that they can be matched with the "region" column in the map data.

In [4]:
# 'x' is the column of a data.frame that holds two-letter state codes
stateFromLower <- function(x) {
  # read 52 state codes into a local variable (includes DC, Washington D.C., and PR, Puerto Rico)
  st.codes<-data.frame(
                      state=as.factor(c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
                                         "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME",
                                         "MI", "MN", "MO", "MS",  "MT", "NC", "ND", "NE", "NH", "NJ", "NM",
                                         "NV", "NY", "OH", "OK", "OR", "PA", "PR", "RI", "SC", "SD", "TN",
                                         "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY")),
                      full=as.factor(c("alaska","alabama","arkansas","arizona","california","colorado",
                                       "connecticut","district of columbia","delaware","florida","georgia",
                                       "hawaii","iowa","idaho","illinois","indiana","kansas","kentucky",
                                       "louisiana","massachusetts","maryland","maine","michigan","minnesota",
                                       "missouri","mississippi","montana","north carolina","north dakota",
                                       "nebraska","new hampshire","new jersey","new mexico","nevada",
                                       "new york","ohio","oklahoma","oregon","pennsylvania","puerto rico",
                                       "rhode island","south carolina","south dakota","tennessee","texas",
                                       "utah","virginia","vermont","washington","wisconsin",
                                       "west virginia","wyoming"))
                       )
  # create an n x 1 data.frame of state codes from the source column
  st.x<-data.frame(state=x)
  # match source codes with codes from the 'st.codes' local variable and use them to look up the full state names
  refac.x<-st.codes$full[match(st.x$state,st.codes$state)]
  # return the full state names in the same order in which they appeared in the original source
  return(refac.x)
}
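The same lookup can be sketched more compactly with the state.abb and state.name vectors that ship with R. This is only an illustration: abbr_to_region is a hypothetical name, the built-in vectors cover the 50 states only, so DC and Puerto Rico are added by hand, and the function returns a character vector rather than the factor that stateFromLower returns.

```r
# Hypothetical alternative to stateFromLower(), built on R's bundled state.abb / state.name
abbr_to_region <- function(x) {
  # state.abb and state.name cover the 50 states; DC and PR are appended manually
  codes <- c(state.abb, "DC", "PR")
  names_lower <- tolower(c(state.name, "District of Columbia", "Puerto Rico"))
  # match() gives NA for codes that are not in the lookup (for example other territories)
  names_lower[match(x, codes)]
}

abbr_to_region(c("TX", "DC", "PR", "GU"))
```

Codes that are not in the lookup, such as GU, come back as NA, which is the same behavior as the function above.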

\5. I created a new column in the csc data called "region" that holds the lowercase names of the states. I then print out the first ten state names in the region column of the csc data.

In [5]:
csc$region <- stateFromLower(csc$STABBR)
csc$region[1:10]
  1. alabama
  2. alabama
  3. alabama
  4. alabama
  5. alabama
  6. alabama
  7. alabama
  8. alabama
  9. alabama
  10. alabama

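\6. I created a new data frame below called csc_df that merges the csc and states data on their shared region column. Because states has one row for every point on a state's outline, each college is repeated once for each outline point of its state. I then print out the first six rows of csc_df.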

In [6]:
csc_df <- merge(csc, states, by = "region")
head(csc_df)
region   UNITID  OPEID   OPEID6  INSTNM                         CITY  STABBR  ZIP         CONTROL  LATITUDE   ...  UGDS_HISP  FIRSTGEN_COMP_ORIG_YR6_RT  FIRST_GEN    HIGHDEG  REGION2  long       lat       group  order  subregion
alabama  102076  103800  1038    Snead State Community College  Boaz  AL      35957-0734  1        34.201247  ...  0.0825     0.070063694                0.545154911  2        5        -87.46201  30.38968  1      1      NA
alabama  102076  103800  1038    Snead State Community College  Boaz  AL      35957-0734  1        34.201247  ...  0.0825     0.070063694                0.545154911  2        5        -87.48493  30.37249  1      2      NA
alabama  102076  103800  1038    Snead State Community College  Boaz  AL      35957-0734  1        34.201247  ...  0.0825     0.070063694                0.545154911  2        5        -87.52503  30.37249  1      3      NA
alabama  102076  103800  1038    Snead State Community College  Boaz  AL      35957-0734  1        34.201247  ...  0.0825     0.070063694                0.545154911  2        5        -87.53076  30.33239  1      4      NA
alabama  102076  103800  1038    Snead State Community College  Boaz  AL      35957-0734  1        34.201247  ...  0.0825     0.070063694                0.545154911  2        5        -87.57087  30.32665  1      5      NA
alabama  102076  103800  1038    Snead State Community College  Boaz  AL      35957-0734  1        34.201247  ...  0.0825     0.070063694                0.545154911  2        5        -87.58806  30.32665  1      6      NA
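The repeated Snead State rows above come from how merge pairs rows: every college row is matched with every outline row that has the same region. A minimal sketch with made-up data (the names colleges and outline are hypothetical) shows the row count multiplying.

```r
# Toy data: two colleges in one state, and a three-point outline for that state
colleges <- data.frame(region = c("alabama", "alabama"),
                       INSTNM = c("College A", "College B"),
                       stringsAsFactors = FALSE)
outline <- data.frame(region = "alabama",
                      long = c(-87.46, -87.48, -87.53),
                      lat = c(30.39, 30.37, 30.37),
                      stringsAsFactors = FALSE)

# Each college is paired with each outline point that shares its region
merged <- merge(colleges, outline, by = "region")
nrow(merged)   # 2 colleges x 3 outline points = 6 rows
```

This is why a merged table like csc_df is useful for drawing but not for averaging: a summary taken over it would count each college once per outline point.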

\7. The code below creates a new data frame called csc2 that subsets the csc data to colleges whose highest degree awarded is a bachelor's degree (HIGHDEG equal to 3). The head function prints out the first six rows of csc2.

In [7]:
csc2 <- csc[csc$HIGHDEG == 3,]
head(csc2)
    UNITID  OPEID    OPEID6  INSTNM                       CITY        STABBR  ZIP         CONTROL  LATITUDE   LONGITUDE   ADM_RATE_ALL  UGDS_HISP  FIRSTGEN_COMP_ORIG_YR6_RT  FIRST_GEN    HIGHDEG  REGION2  region
8   100812  100800   1008    Athens State University      Athens      AL      35611       1        34.805625  -86.96514   NULL          0.0191     0.579741379                0.471594798  3        5        alabama
11  100937  101200   1012    Birmingham Southern College  Birmingham  AL      35254       2        33.515453  -86.853636  0.533935018   0.0195     0.238095238                0.2          3        5        alabama
13  101073  1055400  10554   Concordia College Alabama    Selma       AL      36701       2        32.42443   -87.023531  0.532846715   0.0373     PrivacySuppressed          0.533477322  3        5        alabama
24  101435  101900   1019    Huntingdon College           Montgomery  AL      36106-2148  2        32.350939  -86.285313  0.583855254   0.0252     0.524137931                0.327559055  3        5        alabama
31  101541  102300   1023    Judson College               Marion      AL      36756       2        32.630526  -87.316127  0.652542373   0.016      0.314285714                0.460580913  3        5        alabama
36  101675  102800   1028    Miles College                Fairfield   AL      35064-2621  2        33.481306  -86.908605  NULL          0.0028     0.193211488                0.42406015   3        5        alabama

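\8. Here I created a logical vector called tx that is TRUE for colleges in csc2 that are in Texas. The tx2 data frame keeps the private nonprofit colleges (CONTROL equal to 2) from csc2 in every state. The s data frame subsets the Texas rows of csc2 to the columns listed below. The first six rows are shown in the table below.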

In [8]:
tx <- csc2$region == "texas"
tx2 <- csc2[csc2$CONTROL == 2,]
s <- csc2[tx,c("UGDS_HISP", "FIRST_GEN", "FIRSTGEN_COMP_ORIG_YR6_RT", "INSTNM", "CONTROL")]

head(s)
      UGDS_HISP  FIRST_GEN    FIRSTGEN_COMP_ORIG_YR6_RT  INSTNM                             CONTROL
3648  0.3275     0.402479339  0.382417582                The Art Institute of Houston       3
3659  0.4805     0.573844316  0.630573248                Remington College-Dallas Campus    2
3661  0.3714     0.502638522  PrivacySuppressed          Brazosport College                 1
3675  0.1845     0.415        0.302325581                Dallas Christian College           2
3680  0.3629     0.557563242  0.431472081                Career Point College               3
3711  0.2269     0.497285751  0.353021354                ITT Technical Institute-Arlington  3

\9. I created a logical vector called complete that flags the rows with no NAs after the UGDS_HISP, FIRST_GEN, and FIRSTGEN_COMP_ORIG_YR6_RT columns are converted to numbers. Non-numeric entries such as PrivacySuppressed become NA during the conversion, which is what the coercion warnings below report. I then subset s with complete and print the first five values of complete and the first six rows of s to check that the non-numeric values are gone.

In [9]:
# Only the three rate columns are converted; INSTNM (column 4) is text and is left out
complete <- complete.cases(cbind(as.numeric(s[,1]), as.numeric(s[,2]), as.numeric(s[,3])))
complete[1:5]

s <- s[complete, c("UGDS_HISP", "FIRST_GEN", "FIRSTGEN_COMP_ORIG_YR6_RT", "INSTNM", "CONTROL")]
head(s)
Warning message in cbind(as.numeric(s[, 1]), as.numeric(s[, 2]), as.numeric(s[, :
"NAs introduced by coercion"
Warning message in cbind(as.numeric(s[, 1]), as.numeric(s[, 2]), as.numeric(s[, :
"NAs introduced by coercion"
  1. TRUE
  2. TRUE
  3. FALSE
  4. TRUE
  5. TRUE
      UGDS_HISP  FIRST_GEN    FIRSTGEN_COMP_ORIG_YR6_RT  INSTNM                                CONTROL
3648  0.3275     0.402479339  0.382417582                The Art Institute of Houston          3
3659  0.4805     0.573844316  0.630573248                Remington College-Dallas Campus       2
3675  0.1845     0.415        0.302325581                Dallas Christian College              2
3680  0.3629     0.557563242  0.431472081                Career Point College                  3
3711  0.2269     0.497285751  0.353021354                ITT Technical Institute-Arlington     3
3712  0.322      0.497285751  0.353021354                ITT Technical Institute-Houston West  3
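The filtering step above can also be sketched with column names instead of column positions, which makes it harder to convert the wrong column by accident. The data frame toy and its values are made up for illustration, and the coercion warning is silenced on purpose because turning suppressed text into NA is exactly what is wanted here.

```r
# Toy version of s (hypothetical values), with the rates stored as text
toy <- data.frame(UGDS_HISP = c("0.33", "0.48", "0.37"),
                  FIRST_GEN = c("0.40", "0.57", "0.50"),
                  FIRSTGEN_COMP_ORIG_YR6_RT = c("0.38", "0.63", "PrivacySuppressed"),
                  INSTNM = c("College A", "College B", "College C"),
                  stringsAsFactors = FALSE)

num_cols <- c("UGDS_HISP", "FIRST_GEN", "FIRSTGEN_COMP_ORIG_YR6_RT")

# Convert only the rate columns; non-numeric text becomes NA
toy[num_cols] <- lapply(toy[num_cols], function(col) suppressWarnings(as.numeric(col)))

# Keep the rows where all three rates are present
toy_complete <- toy[complete.cases(toy[num_cols]), ]
toy_complete$INSTNM
```

After this step the rate columns are numeric in the data frame itself, so later plotting calls would not need to convert them again.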

\10. Here I created a vector called cexVals that repeats a point size of 0.5 for every row in the csc2 data and then sets the size to 1 for schools in Texas. The pchVals vector gives every school a plus sign (pch 3) and Texas schools a filled circle (pch 19). The colVals vector colors every school medium grey and Texas schools dark grey.

In [10]:
cexVals <- rep(0.5, nrow(csc2))
cexVals[csc2$region == "texas"] = 1
pchVals <- rep(3, nrow(csc2))
pchVals[csc2$region == "texas"] = 19
colVals <- rep(grey(0.5), nrow(csc2))
colVals[csc2$region == "texas"] <- grey(0.1)

\11. Below I created two data frames that subset s, which holds the data for Texas colleges. sub contains the public Texas colleges and sub2 contains the private for-profit Texas colleges.

In [11]:
sub <- s[s$CONTROL == 1, c("UGDS_HISP", "FIRST_GEN", "FIRSTGEN_COMP_ORIG_YR6_RT", "INSTNM", "CONTROL")]
head(sub)

sub2 <- s[s$CONTROL == 3, c("UGDS_HISP", "FIRST_GEN", "FIRSTGEN_COMP_ORIG_YR6_RT", "INSTNM", "CONTROL")]
head(sub2)
      UGDS_HISP  FIRST_GEN    FIRSTGEN_COMP_ORIG_YR6_RT  INSTNM               CONTROL
3734  0.5169     0.532457496  0.141333333                Midland College      1
4855  0.9401     0.633025431  0.110193974                South Texas College  1

      UGDS_HISP  FIRST_GEN    FIRSTGEN_COMP_ORIG_YR6_RT  INSTNM                                CONTROL
3648  0.3275     0.402479339  0.382417582                The Art Institute of Houston          3
3680  0.3629     0.557563242  0.431472081                Career Point College                  3
3711  0.2269     0.497285751  0.353021354                ITT Technical Institute-Arlington     3
3712  0.322      0.497285751  0.353021354                ITT Technical Institute-Houston West  3
3737  0.1546     0.347560976  0.581395349                Wade College                          3
4471  0.2785     0.497285751  0.353021354                ITT Technical Institute-Austin        3

\12. Using the plot function, I created a scatterplot of the share of first-generation students (x-axis) against the share of first-generation students who complete a bachelor's degree within six years (y-axis) for the private nonprofit colleges in tx2. Texas schools are drawn as dark filled circles and all other schools as grey plus signs, matching the shape and color choices from step 10. I labeled the x- and y-axes, labeled the Texas schools in s by name, and added a reference line with an intercept of zero and a slope of one. The points function then adds red points for public institutions in Texas and blue points for private for-profit institutions in Texas.

In [17]:
# Build the Texas highlight from tx2 itself so the styles line up with the plotted rows
isTx <- tx2$region %in% "texas"
plot(tx2$FIRST_GEN, tx2$FIRSTGEN_COMP_ORIG_YR6_RT, col=ifelse(isTx, grey(0.1), grey(0.5)), pch=ifelse(isTx, 19, 3), xlab="PercFirstGen", ylab="FirstGenComp6yr", main="First-Generation Students in Private Nonprofit Colleges in Texas")
# Label each Texas school at (share first-gen, six-year completion rate)
text(as.numeric(s$FIRST_GEN), as.numeric(s$FIRSTGEN_COMP_ORIG_YR6_RT)+0.001, labels = s$INSTNM, pos = 1, cex = 0.5)
abline(0,1)

points(sub$FIRST_GEN, sub$FIRSTGEN_COMP_ORIG_YR6_RT, col="red")
points(sub2$FIRST_GEN, sub2$FIRSTGEN_COMP_ORIG_YR6_RT, col="blue")
Warning message in xy.coords(x, y, xlabel, ylabel, log):
"NAs introduced by coercion"
Warning message in xy.coords(x, y, xlabel, ylabel, log):
"NAs introduced by coercion"

Scatterplot Argument

The scatterplot above shows us that public Texas colleges have the highest percentages of first-generation students, at around 53% and 63%, but completion rates under 20%. Private for-profit Texas colleges also have a high percentage of first-generation students, but they have relatively high completion rates for first-generation students, ranging from 20% to 70%.

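\13. The code below creates a logical vector called logic that flags values in the completion-rate column that are already NA. The rates are stored as text, so suppressed entries such as PrivacySuppressed only become NA when as.numeric converts them, which is what the warning below reports. The perc vector uses the tapply function to average the rates within each state, and na.rm=TRUE leaves the NAs out of each average.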

In [18]:
#pg46
logic <- is.na(csc2$FIRSTGEN_COMP_ORIG_YR6_RT)
perc <- tapply(as.numeric(csc2$FIRSTGEN_COMP_ORIG_YR6_RT[!logic]), INDEX=csc2$region, FUN=mean, na.rm=TRUE)
perc
Warning message in tapply(as.numeric(csc2$FIRSTGEN_COMP_ORIG_YR6_RT[!logic]), INDEX = csc2$region, :
"NAs introduced by coercion"
alabama
0.328441442333333
alaska
0.514705882
arizona
0.411028257789474
arkansas
0.337667442333333
california
0.421706313458333
colorado
0.290303224066667
connecticut
0.427514162
delaware
NA
district of columbia
0.243660381
florida
0.379257108578125
georgia
0.320167089214286
hawaii
0.43164708625
idaho
0.371392428333333
illinois
0.39901491676
indiana
0.4464657747
iowa
0.493563651363636
kansas
0.309323142916667
kentucky
0.396936149875
louisiana
0.369227350666667
maine
0.331471098166667
maryland
0.353021354
massachusetts
0.389826531133333
michigan
0.393532005117647
minnesota
0.409438251233333
mississippi
0.2181226265
missouri
0.413737188961538
montana
0.312135579
nebraska
0.364716189
nevada
0.287939139857143
new hampshire
0.48343016425
new jersey
0.270175344
new mexico
0.303975701571429
new york
0.3443760653125
north carolina
0.37912340975
north dakota
0.3344215155
ohio
0.414234273634146
oklahoma
0.329145369833333
oregon
0.518339954
pennsylvania
0.528095511485714
puerto rico
0.14506740575
rhode island
0.474604966
south carolina
0.331111150916667
south dakota
0.2543162576
tennessee
0.340833050764706
texas
0.341924800833333
utah
0.3476715014375
vermont
0.504045520333333
virginia
0.390121723413793
washington
0.243661417461538
west virginia
0.335320375166667
wisconsin
0.407847932210526
wyoming
0.340229885
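One detail of the tapply call is that the values and the grouping vector must have the same length, so any filter has to be applied to both. The following is a minimal sketch with made-up data that converts the text first and then uses one mask for both vectors. The names region, rate_text, keep, and state_mean are hypothetical.

```r
# Toy data (hypothetical): rates stored as text, with one suppressed value
region <- c("texas", "texas", "alabama", "alabama")
rate_text <- c("0.30", "0.40", "0.50", "PrivacySuppressed")

# Convert first, so suppressed entries become real NAs before filtering
rate <- suppressWarnings(as.numeric(rate_text))
keep <- !is.na(rate)

# The same mask subsets both the values and the groups, so the lengths always match
state_mean <- tapply(rate[keep], INDEX = region[keep], FUN = mean)
state_mean
```

Here Texas averages its two rates and Alabama keeps only its one reported rate, which mirrors what na.rm=TRUE does in the code above.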

\14. I created a new data frame called df_perc using the perc vector from the code above. Then I created a column called region in the new data frame that holds the row names of df_perc, and I print df_perc to see how the data frame looks.

In [19]:
df_perc <- as.data.frame(perc)
df_perc$region <- rownames(df_perc)
df_perc
                      perc       region
alabama               0.3284414  alabama
alaska                0.5147059  alaska
arizona               0.4110283  arizona
arkansas              0.3376674  arkansas
california            0.4217063  california
colorado              0.2903032  colorado
connecticut           0.4275142  connecticut
delaware              NA         delaware
district of columbia  0.2436604  district of columbia
florida               0.3792571  florida
georgia               0.3201671  georgia
hawaii                0.4316471  hawaii
idaho                 0.3713924  idaho
illinois              0.3990149  illinois
indiana               0.4464658  indiana
iowa                  0.4935637  iowa
kansas                0.3093231  kansas
kentucky              0.3969361  kentucky
louisiana             0.3692274  louisiana
maine                 0.3314711  maine
maryland              0.3530214  maryland
massachusetts         0.3898265  massachusetts
michigan              0.3935320  michigan
minnesota             0.4094383  minnesota
mississippi           0.2181226  mississippi
missouri              0.4137372  missouri
montana               0.3121356  montana
nebraska              0.3647162  nebraska
nevada                0.2879391  nevada
new hampshire         0.4834302  new hampshire
new jersey            0.2701753  new jersey
new mexico            0.3039757  new mexico
new york              0.3443761  new york
north carolina        0.3791234  north carolina
north dakota          0.3344215  north dakota
ohio                  0.4142343  ohio
oklahoma              0.3291454  oklahoma
oregon                0.5183400  oregon
pennsylvania          0.5280955  pennsylvania
puerto rico           0.1450674  puerto rico
rhode island          0.4746050  rhode island
south carolina        0.3311112  south carolina
south dakota          0.2543163  south dakota
tennessee             0.3408331  tennessee
texas                 0.3419248  texas
utah                  0.3476715  utah
vermont               0.5040455  vermont
virginia              0.3901217  virginia
washington            0.2436614  washington
west virginia         0.3353204  west virginia
wisconsin             0.4078479  wisconsin
wyoming               0.3402299  wyoming

\15. The logic2 vector below flags the NAs in the perc column of df_perc. Subsetting the perc column with logic2 then changes those NA values to 0, which here affects only Delaware.

In [20]:
logic2 <- is.na(df_perc$perc)
df_perc$perc[logic2] <- 0
df_perc
                      perc       region
alabama               0.3284414  alabama
alaska                0.5147059  alaska
arizona               0.4110283  arizona
arkansas              0.3376674  arkansas
california            0.4217063  california
colorado              0.2903032  colorado
connecticut           0.4275142  connecticut
delaware              0.0000000  delaware
district of columbia  0.2436604  district of columbia
florida               0.3792571  florida
georgia               0.3201671  georgia
hawaii                0.4316471  hawaii
idaho                 0.3713924  idaho
illinois              0.3990149  illinois
indiana               0.4464658  indiana
iowa                  0.4935637  iowa
kansas                0.3093231  kansas
kentucky              0.3969361  kentucky
louisiana             0.3692274  louisiana
maine                 0.3314711  maine
maryland              0.3530214  maryland
massachusetts         0.3898265  massachusetts
michigan              0.3935320  michigan
minnesota             0.4094383  minnesota
mississippi           0.2181226  mississippi
missouri              0.4137372  missouri
montana               0.3121356  montana
nebraska              0.3647162  nebraska
nevada                0.2879391  nevada
new hampshire         0.4834302  new hampshire
new jersey            0.2701753  new jersey
new mexico            0.3039757  new mexico
new york              0.3443761  new york
north carolina        0.3791234  north carolina
north dakota          0.3344215  north dakota
ohio                  0.4142343  ohio
oklahoma              0.3291454  oklahoma
oregon                0.5183400  oregon
pennsylvania          0.5280955  pennsylvania
puerto rico           0.1450674  puerto rico
rhode island          0.4746050  rhode island
south carolina        0.3311112  south carolina
south dakota          0.2543163  south dakota
tennessee             0.3408331  tennessee
texas                 0.3419248  texas
utah                  0.3476715  utah
vermont               0.5040455  vermont
virginia              0.3901217  virginia
washington            0.2436614  washington
west virginia         0.3353204  west virginia
wisconsin             0.4078479  wisconsin
wyoming               0.3402299  wyoming

\16. I checked the summary of the variable for the percent of first-generation students who complete college within 6 years. The hist function creates a histogram with twenty breaks, a labeled x-axis, and a title.

In [22]:
summary(as.numeric(csc2$FIRSTGEN_COMP_ORIG_YR6_RT))
hist(as.numeric(csc2$FIRSTGEN_COMP_ORIG_YR6_RT), breaks=20, xlab= "Percent of First-Gen Students", main="First-Gen Completion Rates Within Six Years")
Warning message in summary(as.numeric(csc2$FIRSTGEN_COMP_ORIG_YR6_RT)):
"NAs introduced by coercion"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
0.02766 0.27360 0.35300 0.38000 0.47460 0.85420     195 
Warning message in hist(as.numeric(csc2$FIRSTGEN_COMP_ORIG_YR6_RT), breaks = 20, :
"NAs introduced by coercion"

\17. The histogram above shows the distribution of the percentage of first-generation students who graduate from college with a bachelor's degree within 6 years. The distribution looks relatively normal. Here is a description of which states are in each region:

1 for New England (CT, ME, MA, NH, RI, VT)


2 Mid East (DE, DC, MD, NJ, NY, PA)


3 Great Lakes (IL, IN, MI, OH, WI)


4 Plains (IA, KS, MN, MO, NE, ND, SD)


5 Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)


6 Southwest (AZ, NM, OK, TX)


7 Rocky Mountains (CO, ID, MT, UT, WY)


8 Far West (AK, CA, HI, NV, OR, WA)


9 Outlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)

In [32]:
# stat='identity' stacks each institution's rate, so bar height is the sum of rates in a region
ggplot(csc2, aes(x=factor(REGION2), y=as.numeric(FIRSTGEN_COMP_ORIG_YR6_RT), fill = factor(REGION2))) + geom_bar(stat='identity') +
    labs(x="Region") +
    labs(y="Sum of Completion Rates") +
    labs(title="Summed First-Gen Six-Year Completion Rates by Region in the U.S.")
Warning message in eval(expr, envir, enclos):
"NAs introduced by coercion"
Warning message in eval(expr, envir, enclos):
"NAs introduced by coercion"
Warning message:
"Removed 195 rows containing missing values (position_stack)."

\18. The bar chart above shows that Region 5 has the tallest bar and Region 9 the shortest. Because geom_bar with stat='identity' stacks the completion rate of every institution in a region, the height of each bar is the sum of institution-level rates rather than a count of students, so a region with more colleges will tend to have a taller bar. This is an interesting observation considering that Region 5 contains AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, and WV, which is more states than any other region.
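To separate how many colleges a region has from how well its first-generation students do, the regional sum can be compared with the regional mean. The sketch below uses made-up numbers (the data frame toy and its values are hypothetical) in which one region has more colleges but lower rates.

```r
# Toy data (hypothetical): region 5 has more colleges but lower rates than region 1
toy <- data.frame(REGION2 = c(5, 5, 5, 5, 1, 1),
                  rate    = c(0.30, 0.30, 0.40, 0.40, 0.50, 0.60))

# What a stacked stat = "identity" bar shows: the sum of the rates in each region
sum_by_region  <- tapply(toy$rate, toy$REGION2, sum)
# A summary that does not grow with the number of colleges
mean_by_region <- tapply(toy$rate, toy$REGION2, mean)
n_by_region    <- table(toy$REGION2)

data.frame(region = names(sum_by_region),
           n      = as.vector(n_by_region),
           sum    = as.vector(sum_by_region),
           mean   = as.vector(mean_by_region))
```

In this toy case region 5 has the larger sum while region 1 has the larger mean, so the two summaries can rank regions differently.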

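\19. The code below sets any percentage less than 0 equal to 0 as a safeguard against negative values. The interval vector then uses cut to split the perc column into four equal-width intervals and prints the distinct intervals below. cut widens the range slightly, which is why the lowest interval starts just below zero.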

In [33]:
df_perc$perc[df_perc$perc<0] = 0
interval <- unique(cut(df_perc$perc, 4))
interval
  1. (0.264,0.396]
  2. (0.396,0.529]
  3. (-0.000528,0.132]
  4. (0.132,0.264]

\20. The next set of code creates a breaks column from df_perc$perc, with labels that correspond to the intervals created above.

In [90]:
df_perc$breaks = cut(df_perc$perc, 4, labels = c("0-.132", ".132-.264", ".264-.396", ".396-.529"))
head(df_perc)
            perc       region      breaks
alabama     0.3284414  alabama     .264-.396
alaska      0.5147059  alaska      .396-.529
arizona     0.4110283  arizona     .396-.529
arkansas    0.3376674  arkansas    .264-.396
california  0.4217063  california  .396-.529
colorado    0.2903032  colorado    .264-.396
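The labels above are typed by hand to match the intervals that cut computed. A minimal sketch of an alternative is to pass the break points explicitly, so the labels and the bins are defined in the same place. The break points here are the rounded values from the printed intervals, which is an assumption: a state sitting very close to a boundary could land in a different bin than with cut(df_perc$perc, 4).

```r
# A few of the state averages printed earlier, plus the 0 assigned to Delaware
perc <- c(alabama = 0.3284414, alaska = 0.5147059, delaware = 0,
          mississippi = 0.2181226)

# Explicit break points make each label's range visible in the code;
# include.lowest = TRUE keeps the value 0 inside the first bin
bins <- cut(perc,
            breaks = c(0, 0.132, 0.264, 0.396, 0.529),
            labels = c("0-.132", ".132-.264", ".264-.396", ".396-.529"),
            include.lowest = TRUE)

data.frame(perc, bins)
```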

\21. choro_df is created by merging the states data with the df_perc data on the region column, and the first six rows are printed.

In [86]:
choro_df <- merge(states, df_perc, by = "region")
head(choro_df)
region   long       lat       group  order  subregion  perc       breaks
alabama  -87.46201  30.38968  1      1      NA         0.3284414  .264-.396
alabama  -87.48493  30.37249  1      2      NA         0.3284414  .264-.396
alabama  -87.52503  30.37249  1      3      NA         0.3284414  .264-.396
alabama  -87.53076  30.33239  1      4      NA         0.3284414  .264-.396
alabama  -87.57087  30.32665  1      5      NA         0.3284414  .264-.396
alabama  -87.58806  30.32665  1      6      NA         0.3284414  .264-.396

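\22. Next, choro_df is sorted by its order column so that each state's outline is drawn in the correct point order, and the result is saved as choro. The first six rows are printed.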

In [87]:
choro <- choro_df[order(choro_df$order), ]
head(choro)
region   long       lat       group  order  subregion  perc       breaks
alabama  -87.46201  30.38968  1      1      NA         0.3284414  .264-.396
alabama  -87.48493  30.37249  1      2      NA         0.3284414  .264-.396
alabama  -87.52503  30.37249  1      3      NA         0.3284414  .264-.396
alabama  -87.53076  30.33239  1      4      NA         0.3284414  .264-.396
alabama  -87.57087  30.32665  1      5      NA         0.3284414  .264-.396
alabama  -87.58806  30.32665  1      6      NA         0.3284414  .264-.396

\23. Now that the data is cleaned, we are finally ready to plot it on a map. I used qplot with the longitude and latitude from the choro data and filled each state according to the breaks created earlier. I create a title using main, draw a border around each state so that the states are easier to find, and use the Spectral palette to give each interval its own color.

In [111]:
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "College Completion Rates for First-Generation Students") +  borders("state", size = 0.5) +
    scale_fill_brewer(name = "College Completion", palette = "Spectral")
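Recent ggplot2 releases mark qplot as deprecated, so the same kind of map can also be sketched with ggplot and geom_polygon. The example below is only an illustration of the structure: toy_choro is a hypothetical stand-in for choro made of four unit squares, one per interval, so that it runs without the map data.

```r
library(ggplot2)

# Toy stand-in for choro (hypothetical): four unit squares, one per interval label
squares <- lapply(0:3, function(k) {
  data.frame(long = c(0, 1, 1, 0) + k, lat = c(0, 0, 1, 1), group = k + 1)
})
toy_choro <- do.call(rbind, squares)
bin_labels <- c("0-.132", ".132-.264", ".264-.396", ".396-.529")
toy_choro$breaks <- factor(bin_labels[toy_choro$group], levels = bin_labels)

# group keeps each polygon separate; fill colors it by its interval
p <- ggplot(toy_choro, aes(x = long, y = lat, group = group, fill = breaks)) +
  geom_polygon(colour = "black") +
  scale_fill_brewer(name = "College Completion", palette = "Spectral") +
  labs(title = "College Completion Rates for First-Generation Students")
p
```

With the real data, choro would take the place of toy_choro, and the rows would still need to be sorted by the order column first.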

Map Analysis

Red = Delaware


Orange = Washington, South Dakota, and Mississippi


Green = Montana, Idaho, Wyoming, North Dakota, Nevada, Utah, Colorado, New Mexico, Texas, Oklahoma, Kansas, Nebraska, Arkansas, Louisiana, Michigan, Maine, New York, Massachusetts, New Jersey, Maryland, Virginia, West Virginia, North Carolina, Tennessee, South Carolina, Georgia, Alabama, and Florida


Blue = Oregon, California, Arizona, Minnesota, Iowa, Missouri, Wisconsin, Illinois, Indiana, Kentucky, Ohio, Pennsylvania, Connecticut, Rhode Island, Vermont, and New Hampshire

I decided to focus my time on analyzing the red and orange states and looking into why these states have rates between 0 and 26%. First-generation students tend to be racial minorities, from low-income families, and often from single-parent households. These characteristics make it more difficult for first-generation students to complete college. Many first-generation students feel pressure to drop out of school because of family money problems, stress and anxiety, a sense of not belonging, and off-campus employment. It is easier to get to the root of why completion rates are low for first-generation students in general than to explain why the rates are especially low in certain states.

Conclusion

After looking closely at my data, I found that Delaware has no colleges in my subset whose highest degree is a bachelor's degree. This, together with my replacing its missing average with 0, is likely the main reason why the state appears to have the lowest rate of first-generation students completing college. As for the orange states, which have completion rates between 13% and 26%, there is enough data in the College Scorecard data for 4-year institutions. The Robert B Miller College in Washington must have pulled the average completion rate up with its rate of 53%, while Seattle Central College has a completion rate of less than 1% even though first-generation students make up 43% of its student population. In South Dakota, 30% of first-generation students at Presentation College graduate. In Mississippi, one out of three colleges did not release information about the percentage of first-generation students who completed college, and Rust College has the lowest percentage of first-generation students who complete college, at 15%. I definitely limited my data by only looking at 4-year institutions, but I think the state averages accurately represent each state.

Bibliography

Boyd, Vivian S., Linda K. Gast, Patricia F. Hunt, Alice Mitchell, and Wendy Wilson. "Why Some Students Leave College During Their Senior Year." Journal of College Student Development 53.5 (2012): 737-42. Web.


Riggs, Liz. "First-Generation College-Goers: Unprepared and Behind." The Atlantic, 31 Dec. 2014, http://www.theatlantic.com/education/archive/2014/12/the-added-pressure-faced-by-first-generation-students/384139/. Accessed 7 May 2017.


Wilbur, T. G., and V. J. Roscigno. "First-generation Disadvantage and College Enrollment/Completion." Socius: Sociological Research for a Dynamic World 2.0 (2016): 1-11. Web.


Wolfman-Arent, Avi. "First Year, First Generation: Overwhelmed by demands, buoyed by encouragement." newsworks, 28 Jun. 2016, http://www.newsworks.org/index.php/local/education/94947-first-year-first-generation-seans-spot. Accessed 7 May 2017.


Zinshteyn, Mikhail. "How to Help First-Generation Students Succeed." The Atlantic, 13 Mar. 2016, http://www.theatlantic.com/education/archive/2016/03/how-to-help-first-generation-students-succeed/473502/. Accessed 7 May 2017.
