The Monitoring The Future surveys have emerged as a vital tool for measuring the values, behaviors, and lifestyle orientations of American youth, offering a unique glimpse into the ever-changing landscape of 12th grade students. There are approximately 1,400 variables across all of the questionnaires; given the vastness of the available data, our exploration focuses on a subset of variables deemed particularly relevant to our research.
One of the critical aspects examined in the surveys is the frequency of drug use among students, covering a wide array of illicit and recreational substances. In this analysis, we focus on the survey items that record whether students have ever used heroin or other opioid narcotics.
While the investigation of drug use patterns holds significance in understanding the landscape of contemporary American youth, the Monitoring The Future surveys provide a broader canvas for exploration. Our analysis incorporates other topics including, but not limited to, students' perspectives on religion, educational goals, family life dynamics, and work habits. By examining these multidimensional facets, we may gain a holistic understanding of the myriad factors influencing the lives, aspirations, and substance use risks of American students.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sys
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import recall_score, precision_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.utils import resample
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
As in our previous work, we exclude a number of variables because they are strongly correlated with others. For instance, among the survey respondents we observed an expected connection between the number of alcoholic drinks consumed this year and the number consumed in the last 30 days. Because we believe the latter variable is the more robust indicator of opiate use, we removed the former to streamline the analysis.
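As an illustration of that screening step, a pairwise correlation check along the following lines can flag redundant variable pairs; the 0.8 threshold and the example columns are hypothetical, not the exact procedure or cutoff used in our work.
# Illustrative sketch (hypothetical threshold and columns): list variable
# pairs whose absolute correlation exceeds a cutoff so one of each pair
# can be dropped before modeling.
def correlated_pairs(frame, threshold=0.8):
    corr = frame.corr().abs()
    cols = corr.columns
    return [(cols[i], cols[j], corr.iloc[i, j])
            for i in range(len(cols))
            for j in range(i + 1, len(cols))
            if corr.iloc[i, j] > threshold]
# Example call once the survey data is loaded, e.g. (hypothetical column names):
# correlated_pairs(df[['AlcoholicDrinks/Year', 'AlcoholicDrinks/30Days']])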
After removing a number of observations that were missing data, we were left with approximately 42,000 observations spanning the 10 years of survey data.
Most of the data below is recorded either as a binary outcome (1/yes or 0/no), as an ordinal value (e.g., 0 = no drug use, 1 = some drug use, 2 = frequent drug use), or as a categorical feature. The categorical data is one-hot encoded into binary indicator columns, and the ordinal values are scaled so that all of the data falls within the range of 0 to 1.
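As a toy illustration of that encoding (the values below are made up, not actual survey codes), an ordinal 0/1/2 scale min-max scales to 0/0.5/1, and a categorical feature becomes a set of binary indicator columns:
# Toy example with made-up values: min-max scaling of an ordinal column
# and one-hot encoding of a categorical column.
toy = pd.DataFrame({'DrugUse': [0, 1, 2], 'Region': ['NE', 'S', 'W']})
toy['DrugUse'] = (toy['DrugUse'] - toy['DrugUse'].min()) / (toy['DrugUse'].max() - toy['DrugUse'].min())
toy = pd.get_dummies(toy, columns=['Region'], drop_first=True)
toy  # DrugUse becomes 0.0/0.5/1.0; Region becomes Region_S and Region_W indicators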
random_seed=2023
variable_dict = {
'V1': 'Year',
'RESPONDENT_AGE': 'Over18',
'V13': 'SchoolRegion',
'V49': 'NumberOfSiblings',
'V2102': 'Cigarettes/30Days',
'V2106': 'AlcoholicDrinks/30Days',
'V2117': 'Marijuana/30Days',
'V2118': 'LSD/Lifetime',
'V2121': 'Psychedelics/Lifetime',
'V2124': 'Cocaine/Lifetime',
'V2127': 'Amphetamines/Lifetime',
'V2133': 'Sedatives/Lifetime',
'V2136': 'Tranquilizers/Lifetime',
'V2139': 'Heroin/Lifetime',
'V2142': 'OpioidNarcotics/Lifetime',
'V2150': 'Sex',
'V2151': 'Race',
'V2152': 'RaisedWhere',
'V2153': 'MaritalStatus',
'V2155': 'LivesWithFather',
'V2156': 'LivesWithMother',
'V2157': 'LivesWithSiblings',
'V2163': 'FatherEduLvl',
'V2164': 'MotherEduLvl',
'V2165': 'MotherHadPaidJobWhileGrowingUp',
'V2166': 'PoliticalPreference',
'V2167': 'PoliticalBeliefs',
'V2169': 'ReligiousServiceAttendanceWkly',
'V2170': 'ReligionImportance',
'V2172': 'HighSchoolProgram',
'V2174': 'SelfRateIntelligence',
'V2175': 'SchoolDaysMissedIllness/4Weeks',
'V2176': 'SchoolDaysMissedSkipped/4Weeks',
'V2177': 'SchoolDaysMissedOther/4Weeks',
'V2178': 'SkippedClass/4Weeks',
'V2179': 'AverageGradeHS',
'V2180': 'LikelyToAttendVocationalSchl',
'V2181': 'LikelyToServeInMilitary',
'V2182': 'LikelyToGraduate2YrCollege',
'V2183': 'LikelyToGraduate4YrCollege',
'V2184': 'LikelyToAttendGraduateSchl',
'V2185': 'WantToDoVocationalSchl',
'V2186': 'WantToServeInMilitary',
'V2187': 'WantToDo2YrCollege',
'V2188': 'WantToDo4YrCollege',
'V2189': 'WantToDoGradSchl',
'V2190': 'WantToDoNo2ndEd',
'V2191': 'HrsWorkedPerWeek',
'V2193': 'MoneyFromOtherSource',
'V2194': 'EveningsOutPerWeek',
'V2195': 'DatesHowOften',
'V2196': 'MilesDrivenPerWeek',
'V2197': 'DrivingTickets',
'V2201': 'CarAccidentsLast12Mo',
'V2459': 'Crack/Lifetime',
}
df_2012 = pd.read_csv('Grade12/ICPSR_34861/DS0001/34861-0001-Data.tsv', sep='\t')
df_2013 = pd.read_csv('Grade12/ICPSR_35218/DS0001/35218-0001-Data.tsv', sep='\t')
df_2014 = pd.read_csv('Grade12/ICPSR_36263/DS0001/36263-0001-Data.tsv', sep='\t')
df_2015 = pd.read_csv('Grade12/ICPSR_36408/DS0001/36408-0001-Data.tsv', sep='\t')
df_2016 = pd.read_csv('Grade12/ICPSR_36798/DS0001/36798-0001-Data.tsv', sep='\t')
df_2017 = pd.read_csv('Grade12/ICPSR_37182/DS0001/37182-0001-Data.tsv', sep='\t')
df_2018 = pd.read_csv('Grade12/ICPSR_37416/DS0001/37416-0001-Data.tsv', sep='\t')
df_2019 = pd.read_csv('Grade12/ICPSR_37841/DS0001/37841-0001-Data.tsv', sep='\t')
df_2020 = pd.read_csv('Grade12/ICPSR_38156/DS0001/38156-0001-Data.tsv', sep='\t')
df_2021 = pd.read_csv('Grade12/ICPSR_38503/DS0001/38503-0001-Data.tsv', sep='\t')
df_list = [df_2012, df_2013, df_2014, df_2015, df_2016, df_2017, df_2018, df_2019, df_2020, df_2021]
year_list = [2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
# Keep only the variables present in every year's questionnaire
common_vars = set(df_2012.columns)
for df in df_list:
    common_vars &= set(df.columns)
variables = list(common_vars)
df = pd.concat([x[variables] for x in df_list], ignore_index=True)
df = df[variable_dict.keys()]
# Remove missing data
# Keep only columns where fewer than 30% of responses are missing (-9)
df = df.loc[:, (df == -9).sum() < 0.3 * len(df.index)]
# Keep only rows with no remaining missing responses
df = df[(df == -9).sum(axis=1) == 0]
# Combine Opiate Use data: flag students reporting any lifetime use of heroin (V2139)
# or other opioid narcotics (V2142); a response of 1 indicates no lifetime use
df['OpiateUse'] = ((df['V2142'] != 1) | (df['V2139'] != 1)).astype(int)
df = df.drop(['V2142', 'V2139'], axis=1)
# Rename columns using data dictionary
df.rename(columns=variable_dict, inplace=True)
# Factor categorical data
dummy_cols = ['SchoolRegion', 'Race', 'RaisedWhere', 'MaritalStatus', 'PoliticalPreference', 'PoliticalBeliefs', 'HighSchoolProgram']
dummies = pd.get_dummies(df[dummy_cols], columns=dummy_cols, drop_first=True)
df = pd.concat([df, dummies], axis=1)
df = df.drop(dummy_cols, axis=1)
# Normalize data
df.replace({False: 0, True: 1}, inplace=True)
years_vec = df['Year']
df = (df-df.min())/(df.max()-df.min())
df['Year'] = years_vec
df = df.reset_index(drop=True)
df.head()
Year | Over18 | NumberOfSiblings | Cigarettes/30Days | AlcoholicDrinks/30Days | Marijuana/30Days | LSD/Lifetime | Psychedelics/Lifetime | Cocaine/Lifetime | Amphetamines/Lifetime | ... | PoliticalPreference_8 | PoliticalBeliefs_2 | PoliticalBeliefs_3 | PoliticalBeliefs_4 | PoliticalBeliefs_5 | PoliticalBeliefs_6 | PoliticalBeliefs_8 | HighSchoolProgram_2 | HighSchoolProgram_3 | HighSchoolProgram_4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | 0.0 | 0.666667 | 0.333333 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
1 | 2012 | 0.0 | 1.000000 | 0.166667 | 0.166667 | 0.166667 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 2012 | 1.0 | 0.333333 | 0.000000 | 0.166667 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 2012 | 0.0 | 0.333333 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
4 | 2012 | 1.0 | 0.333333 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 79 columns
Here we find a steady decline in opiate use rates amongst the surveyed 12th graders. There is a significant drop in opiate abuse in 2021; this abnormal decrease could stem from a number of external conditions, such as the COVID-19 pandemic and the subsequent stay-at-home orders and remote schooling policies.
series = (df.groupby('Year')['OpiateUse'].sum() / df.groupby('Year')['Year'].count())
ax = sns.barplot(x=list(series.index), y=list(series.array), color='purple')
ax.set(xlabel='Year', ylabel='Opiate Use Rate')
plt.show()
Below we can see that more than half of all those surveyed who had ever used tranquilizers, cocaine, or psychedelics had also used opiates or opioid narcotics at some point in their lives. The rates of coinciding sedative, amphetamine, and LSD use are not far behind.
drug_use_df = df[['Tranquilizers/Lifetime', 'Amphetamines/Lifetime', 'Sedatives/Lifetime',
'Marijuana/30Days', 'Cocaine/Lifetime', 'Cigarettes/30Days', 'Psychedelics/Lifetime',
'AlcoholicDrinks/30Days', 'LSD/Lifetime', 'OpiateUse']]
ratio_dict = {}
for col in drug_use_df.columns:
    ratio = drug_use_df.loc[(drug_use_df[col] > 0) & (drug_use_df['OpiateUse'] > 0)].shape[0] / drug_use_df.loc[drug_use_df[col] > 0].shape[0]
    ratio_dict[col] = ratio
ratio_dict.pop('OpiateUse', None)
ratio_dict
{'Tranquilizers/Lifetime': 0.5361467889908257, 'Amphetamines/Lifetime': 0.4222972972972973, 'Sedatives/Lifetime': 0.4824810606060606, 'Marijuana/30Days': 0.21971732274267025, 'Cocaine/Lifetime': 0.6027928626842514, 'Cigarettes/30Days': 0.2929608938547486, 'Psychedelics/Lifetime': 0.5102379634753735, 'AlcoholicDrinks/30Days': 0.14715653703341802, 'LSD/Lifetime': 0.4817658349328215}
Readers familiar with the previous analysis on the HEAL platform, Monitoring The Future: 12th Grader Opiate Use - Part 1, will recall that the machine learning models used in that notebook identified a strong connection between opioid abuse and other drug use for 12th graders in 2019. While that connection appears to exist again here, let's examine whether the same decrease seen in opioid use is also present in other drug use amongst 12th graders.
for col, label in [('Cocaine/Lifetime', 'Cocaine Use Rate'),
                   ('Tranquilizers/Lifetime', 'Tranquilizer Use Rate'),
                   ('Sedatives/Lifetime', 'Sedatives Use Rate'),
                   ('Amphetamines/Lifetime', 'Amphetamines Use Rate')]:
    series = (df.groupby('Year')[col].sum() / df.groupby('Year')['Year'].count())
    ax = sns.barplot(x=list(series.index), y=list(series.array), color='purple')
    ax.set(xlabel='Year', ylabel=label)
    plt.show()
While there is a substantial decrease in the use of amphetamines, sedatives, tranquilizers, and cocaine in 2021, we do not see the same slow decline in their usage from 2012 to 2020. It is, however, worth noting the significant degree to which opioids are abused compared to the other illicit substances. Opiate use peaked at over 11% in 2012, while all of the other substances have peaks between 1% and 6%.
*Opiate use is a combined measurement of the heroin and opioid narcotic use survey results.
The work done in the first example analysis notebook in this series used SMOTE (Synthetic Minority Oversampling Technique) and several simpler classification models, such as logistic regression and random forest classification. This notebook will use the XGB (Extreme Gradient Boosting) classification algorithm. Unlike the simpler models used previously, the XGBClassifier does not require the use of SMOTE. Instead, we will use the XGBClassifier hyperparameter 'scale_pos_weight'. When classes are imbalanced, 'scale_pos_weight' is typically set to the ratio of the number of overrepresented observations (no opiate use) to the number of underrepresented observations (opiate use). This hyperparameter scales the gradient of the positive, underrepresented class, thus scaling the errors of the positive class during model training.
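As a sketch of that convention (an illustration, not a step from the original notebook), the weight can be computed directly from the class counts; the fitted models below instead use the total-to-positive ratio, which is nearly identical when the positive class is rare.
# Sketch of the conventional weighting: negatives (no opiate use) divided by
# positives (opiate use). The models below use total/positive instead, which
# is nearly the same value when the positive class is rare.
negatives = (df['OpiateUse'] == 0).sum()
positives = (df['OpiateUse'] == 1).sum()
suggested_scale_pos_weight = negatives / positives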
While 'scale_pos_weight' is set at its suggested value, most of the other XGBClassifier hyperparameters were tuned using a grid search with 10-fold cross validation. In other words, we exhaustively tested combinations of hyperparameters drawn from a subset of possible values and verified the optimal combination using 10-fold cross validation.
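A minimal sketch of that tuning step is shown below, assuming scikit-learn's GridSearchCV; the candidate values and the recall scoring choice are illustrative, not the exact grid or scoring we searched.
# Illustrative grid search with 10-fold cross validation; the parameter
# values and scoring metric below are examples, not the exact grid used.
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7],
    'gamma': [0.4, 0.6, 0.8],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8],
}
search = GridSearchCV(
    XGBClassifier(objective='binary:logistic', learning_rate=0.01,
                  n_estimators=200, seed=random_seed),
    param_grid, scoring='recall', cv=10, n_jobs=-1)
# search.fit(df.drop(['Year', 'OpiateUse'], axis=1), df['OpiateUse'])
# search.best_params_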
Below you will find the function used for cross validation and the final model used with the tuned hyperparameters.
def pipeline_cross_validation(data, k, pipeline_steps):
    # Manual k-fold cross validation: split the data into k folds, hold each
    # fold out once as the test set, and average the resulting scores.
    folds = np.array_split(data, k)
    accuracySum = 0
    precisionSum = 0
    recallSum = 0
    for i in range(k):
        train = folds.copy()
        test = folds[i]
        del train[i]
        train = pd.concat(train, sort=False)
        y_train = train.OpiateUse
        X_train = train.drop('OpiateUse', axis=1)
        y_test = test.OpiateUse
        X_test = test.drop('OpiateUse', axis=1)
        pipeline = Pipeline(pipeline_steps).fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        accuracySum += accuracy_score(y_test, y_pred)
        precisionSum += precision_score(y_test, y_pred)
        recallSum += recall_score(y_test, y_pred)
    return [accuracySum/k, precisionSum/k, recallSum/k]
df1 = df.drop(['Year'], axis=1)
model = XGBClassifier(
    learning_rate=0.01,
    n_estimators=200,
    max_depth=5,
    min_child_weight=1,
    gamma=0.6,
    subsample=0.8,
    colsample_bytree=0.7,
    objective='binary:logistic',
    seed=random_seed,
    scale_pos_weight=len(df1)/sum(df1['OpiateUse']))
steps = [('model', model)]
score = pipeline_cross_validation(df1, k=10, pipeline_steps=steps)
print(f'Accuracy: {score[0]} | Precision: {score[1]} | Recall: {score[2]}')
Accuracy: 0.8616077788418954 | Precision: 0.31930083188943403 | Recall: 0.7636644739387797
This first iteration of model training, testing, and evaluation will use all 10 years of data, spanning 2012 to 2021. The model is the same XGBClassifier that was tuned and validated above. After the model is trained, we will examine its performance by looking at its accuracy, precision, and recall metrics. Accuracy tells us the rate at which the model correctly classifies students in both groups (those who have and have not ever abused opiates); recall tells us the rate at which the model correctly identifies the students who have abused opiates; and precision tells us the share of students the model flags as opiate users who actually have used opiates.
After examining the model's performance and the corresponding confusion matrix, we will look at what the model considers to be the most and least relevant features to capturing students' opiate use behavior.
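For reference, these three metrics can be written directly in terms of confusion-matrix cells; the helper below is a generic illustration, not output from the fitted model.
# Generic illustration of the three metrics in terms of confusion-matrix
# cells (tp/fp/fn/tn are placeholders, not results from this model).
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # correct calls over all students
    precision = tp / (tp + fp)                  # true users among students flagged as users
    recall = tp / (tp + fn)                     # flagged users among all true users
    return accuracy, precision, recall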
X = df1.drop('OpiateUse', axis=1)
y = df1.OpiateUse
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_seed)
model = XGBClassifier(
    learning_rate=0.01,
    n_estimators=200,
    max_depth=5,
    min_child_weight=1,
    gamma=0.6,
    subsample=0.8,
    colsample_bytree=0.7,
    objective='binary:logistic',
    seed=random_seed,
    scale_pos_weight=len(df1)/sum(df1['OpiateUse']))
steps = [('model', model)]
boosted_model = model.fit(X_train, y_train)
y_pred = boosted_model.predict(X_test)
The following model scores are scaled between 0 and 1, with 1 being perfect performance. The scores and the corresponding confusion matrix tell us that the model performs relatively well at identifying students who have ever used opiates illicitly, catching almost 78% of them, and correctly classifies roughly 86% of all students overall. However, the model does a poor job of flagging only those students who have actually abused opioids: almost 70% of the students the model says have used opioids illicitly have not.
print(f'Accuracy: {accuracy_score(y_test, y_pred)} | Precision: {precision_score(y_test, y_pred)} | Recall: {recall_score(y_test, y_pred)}')
Accuracy: 0.8562589243217515 | Precision: 0.3153846153846154 | Recall: 0.7784810126582279
disp = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred))
disp.plot()
disp.im_.colorbar.remove()
plt.show()
feature_importance = pd.DataFrame(index=df1.columns.drop('OpiateUse'))
feature_importance[f'Feature Importance'] = (boosted_model.feature_importances_ - min(boosted_model.feature_importances_)) / (max(boosted_model.feature_importances_) - min(boosted_model.feature_importances_))
feature_importance.sort_values(by=['Feature Importance'], ascending=False).head(15)
Feature Importance | |
---|---|
Amphetamines/Lifetime | 1.000000 |
Tranquilizers/Lifetime | 0.546011 |
Marijuana/30Days | 0.349006 |
Sedatives/Lifetime | 0.169479 |
AlcoholicDrinks/30Days | 0.159707 |
Cigarettes/30Days | 0.107511 |
LivesWithSiblings | 0.066077 |
Cocaine/Lifetime | 0.057510 |
LSD/Lifetime | 0.045208 |
Psychedelics/Lifetime | 0.039737 |
MilesDrivenPerWeek | 0.039215 |
SchoolDaysMissedSkipped/4Weeks | 0.031901 |
DrivingTickets | 0.029383 |
WantToServeInMilitary | 0.028945 |
LikelyToAttendGraduateSchl | 0.024677 |
feature_importance.sort_values(by=['Feature Importance'], ascending=False).tail(15)
Feature Importance | |
---|---|
WantToDo2YrCollege | 0.004603 |
HighSchoolProgram_3 | 0.003934 |
FatherEduLvl | 0.003796 |
MotherEduLvl | 0.003511 |
RaisedWhere_8 | 0.003493 |
SkippedClass/4Weeks | 0.003340 |
HighSchoolProgram_4 | 0.003050 |
PoliticalBeliefs_2 | 0.002843 |
MaritalStatus_2 | 0.002649 |
PoliticalPreference_2 | 0.002466 |
HighSchoolProgram_2 | 0.002207 |
RaisedWhere_6 | 0.002152 |
Race_2 | 0.000881 |
PoliticalPreference_3 | 0.000039 |
Race_3 | 0.000000 |
Above we see the 15 most and least important features, ranked by their normalized feature importance scores. These scores indicate the relative amount each feature improves the model's performance when the model makes a decision along that feature. As in our work in part one, we find that students' use of other illicit substances is the most telling indicator of whether a student will use opiates or opioid narcotics. We also find that students' race, political preferences and beliefs, and where they were raised give little to no indication of whether they will abuse opioids.
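The scores above come from the fitted model's feature_importances_ attribute; as a cross-check (a sketch, not part of the original analysis), the per-feature gain can also be read directly from the underlying booster:
# Sketch of an alternative importance view (not part of the original analysis):
# average gain per feature taken directly from the trained booster.
gain = boosted_model.get_booster().get_score(importance_type='gain')
pd.Series(gain).sort_values(ascending=False).head(15)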
Next we will rerun this analysis but focus on only one year of data at a time. We will use the same hyperparameters as before for these newly trained models, since the computation for hyperparameter tuning takes a significant amount of time. After training the 10 models for the 10 years of data, we will again examine the models' performance and ranked feature importance.
feature_importance = pd.DataFrame(index=df1.columns.drop('OpiateUse'))
classification_scores = pd.DataFrame({'Metric': ['Accuracy', 'Precision', 'Recall']})
for year in year_list:
    df_annual = df[df['Year'] == year]
    df_annual = df_annual.drop(['Year'], axis=1)
    X = df_annual.drop('OpiateUse', axis=1)
    y = df_annual.OpiateUse
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_seed)
    model = XGBClassifier(
        learning_rate=0.01,
        n_estimators=200,
        max_depth=5,
        min_child_weight=1,
        gamma=0.6,
        subsample=0.8,
        colsample_bytree=0.7,
        objective='binary:logistic',
        seed=random_seed,
        scale_pos_weight=len(df_annual)/sum(df_annual['OpiateUse']))
    boosted_model = model.fit(X_train, y_train)
    y_pred = boosted_model.predict(X_test)
    feature_importance[f'Feature Importance {year}'] = (boosted_model.feature_importances_ - min(boosted_model.feature_importances_)) / (max(boosted_model.feature_importances_) - min(boosted_model.feature_importances_))
    classification_scores[f'Model Performance {year}'] = [accuracy_score(y_test, y_pred), precision_score(y_test, y_pred), recall_score(y_test, y_pred)]
Below we can see that for the first few years, the annual models' performance is only slightly below that of the model trained on all 10 years of data. In 2020 and 2021, the models do not perform nearly as well as the models trained on prior annual data or the model trained using all of the data. The underperformance of the annual models relative to our first model is to be expected, as they are trained on much smaller subsets of data. However, the poor performance of the later models cannot be completely explained by the difference in training data size. The lower performance of the 2020 and 2021 models may be explained by the disruption to behavior caused by the COVID-19 pandemic, but may also be explained by the massive decrease in the reported opioid use rate of the 12th graders surveyed in 2021.
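The per-year sample sizes behind that point can be checked directly; this one-line check is an addition for illustration, not part of the original notebook.
# Number of observations available to each annual model (before the train/test split).
df.groupby('Year').size()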
classification_scores
Metric | Model Performance 2012 | Model Performance 2013 | Model Performance 2014 | Model Performance 2015 | Model Performance 2016 | Model Performance 2017 | Model Performance 2018 | Model Performance 2019 | Model Performance 2020 | Model Performance 2021 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Accuracy | 0.864261 | 0.875000 | 0.886261 | 0.900881 | 0.885468 | 0.918782 | 0.912470 | 0.920582 | 0.929688 | 0.962552 |
1 | Precision | 0.443439 | 0.431694 | 0.433566 | 0.357798 | 0.369748 | 0.494118 | 0.337079 | 0.392857 | 0.428571 | 0.344828 |
2 | Recall | 0.736842 | 0.669492 | 0.756098 | 0.661017 | 0.709677 | 0.666667 | 0.681818 | 0.622642 | 0.375000 | 0.555556 |
As before, we see the same relationship between general illicit substance use and illicit opiate use. It is worth noting that tranquilizer and amphetamine use typically took the top two positions among the most influential features until 2021, when they fell to the 4th and 5th rankings. This is another indication that something significant shifted in our model's classification of 12th grader opiate use.
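That shift can be verified with a small check (an addition for illustration, not in the original notebook) by ranking each year's importance column and pulling out the two features in question:
# Rank of each feature within each annual model (1 = most important), shown
# for the two features discussed above.
ranks = feature_importance.rank(ascending=False)
ranks.loc[['Tranquilizers/Lifetime', 'Amphetamines/Lifetime']]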
top_features = []
for year in year_list:
    col = f'Feature Importance {year}'
    annual_top = pd.DataFrame(feature_importance.sort_values(by=[col], ascending=False)[col]).head(10).reset_index(names=['Feature'])
    annual_top[col] = annual_top['Feature'] + " - " + annual_top[col].astype(str)
    top_features.append(annual_top.drop(['Feature'], axis=1))
pd.concat(top_features, axis=1)
Feature Importance 2012 | Feature Importance 2013 | Feature Importance 2014 | Feature Importance 2015 | Feature Importance 2016 | Feature Importance 2017 | Feature Importance 2018 | Feature Importance 2019 | Feature Importance 2020 | Feature Importance 2021 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Amphetamines/Lifetime - 1.0 | Amphetamines/Lifetime - 1.0 | Amphetamines/Lifetime - 1.0 | Amphetamines/Lifetime - 1.0 | Amphetamines/Lifetime - 1.0 | Tranquilizers/Lifetime - 1.0 | Tranquilizers/Lifetime - 1.0 | Tranquilizers/Lifetime - 1.0 | Tranquilizers/Lifetime - 1.0 | Sedatives/Lifetime - 1.0 |
1 | Marijuana/30Days - 0.80361795 | Tranquilizers/Lifetime - 0.53648746 | Tranquilizers/Lifetime - 0.42579514 | Tranquilizers/Lifetime - 0.40697834 | Tranquilizers/Lifetime - 0.5532162 | Amphetamines/Lifetime - 0.8978286 | Amphetamines/Lifetime - 0.47637126 | Amphetamines/Lifetime - 0.5347534 | Amphetamines/Lifetime - 0.75638884 | Psychedelics/Lifetime - 0.96621937 |
2 | Tranquilizers/Lifetime - 0.7582158 | Sedatives/Lifetime - 0.33624035 | Marijuana/30Days - 0.26776016 | Cigarettes/30Days - 0.2702855 | Marijuana/30Days - 0.3660885 | Marijuana/30Days - 0.46705022 | Marijuana/30Days - 0.23403315 | LSD/Lifetime - 0.28337172 | Sedatives/Lifetime - 0.57344633 | LSD/Lifetime - 0.69477534 |
3 | AlcoholicDrinks/30Days - 0.24490957 | Cigarettes/30Days - 0.25384387 | Sedatives/Lifetime - 0.26318496 | Marijuana/30Days - 0.23505515 | Sedatives/Lifetime - 0.19532187 | Cigarettes/30Days - 0.339489 | Sedatives/Lifetime - 0.15149848 | AlcoholicDrinks/30Days - 0.27817625 | WantToDo2YrCollege - 0.34889314 | Tranquilizers/Lifetime - 0.6590325 |
4 | Sedatives/Lifetime - 0.24046983 | Marijuana/30Days - 0.20469971 | AlcoholicDrinks/30Days - 0.20253478 | Sedatives/Lifetime - 0.20434554 | LSD/Lifetime - 0.12840125 | Sedatives/Lifetime - 0.23474461 | LivesWithSiblings - 0.14690088 | Marijuana/30Days - 0.27359134 | LSD/Lifetime - 0.33024985 | Amphetamines/Lifetime - 0.5497037 |
5 | Cigarettes/30Days - 0.21149834 | AlcoholicDrinks/30Days - 0.13245568 | Cocaine/Lifetime - 0.14404559 | PoliticalBeliefs_6 - 0.141219 | AlcoholicDrinks/30Days - 0.12600905 | DrivingTickets - 0.21802866 | DrivingTickets - 0.14502884 | WantToDoVocationalSchl - 0.19407636 | Marijuana/30Days - 0.3265674 | Marijuana/30Days - 0.52923715 |
6 | LivesWithSiblings - 0.17189188 | Cocaine/Lifetime - 0.09957072 | Psychedelics/Lifetime - 0.13863444 | Psychedelics/Lifetime - 0.13314435 | PoliticalBeliefs_8 - 0.12189335 | Crack/Lifetime - 0.19706848 | LSD/Lifetime - 0.14053832 | WantToServeInMilitary - 0.17626849 | RaisedWhere_1 - 0.28275934 | WantToDoVocationalSchl - 0.40517214 |
7 | WantToServeInMilitary - 0.1589078 | SkippedClass/4Weeks - 0.09621134 | PoliticalPreference_7 - 0.12886047 | Cocaine/Lifetime - 0.09898375 | Crack/Lifetime - 0.12058827 | Cocaine/Lifetime - 0.19672854 | PoliticalPreference_6 - 0.13798892 | Psychedelics/Lifetime - 0.17309545 | SkippedClass/4Weeks - 0.27557352 | RaisedWhere_7 - 0.40456265 |
8 | Cocaine/Lifetime - 0.15065162 | Psychedelics/Lifetime - 0.09555236 | PoliticalBeliefs_6 - 0.110919036 | SchoolRegion_2 - 0.084943004 | LivesWithSiblings - 0.11986844 | LSD/Lifetime - 0.19137083 | Cocaine/Lifetime - 0.13504994 | SkippedClass/4Weeks - 0.16032512 | SchoolDaysMissedIllness/4Weeks - 0.27026558 | SchoolDaysMissedOther/4Weeks - 0.3853811 |
9 | Psychedelics/Lifetime - 0.15017651 | PoliticalPreference_7 - 0.09405832 | Cigarettes/30Days - 0.102482185 | SchoolDaysMissedIllness/4Weeks - 0.08461291 | RaisedWhere_6 - 0.10401539 | SkippedClass/4Weeks - 0.17320769 | MoneyFromOtherSource - 0.11899159 | PoliticalBeliefs_8 - 0.15851824 | LikelyToAttendVocationalSchl - 0.26334405 | DatesHowOften - 0.3577521 |
Using the XGBoost (extreme gradient boosting) classifier algorithm and cross-validation techniques, we successfully tuned, trained, and tested an XGBClassifier model on survey data from 2012 to 2021 to classify whether a surveyed student had ever used opioids illicitly. The model trained using all 10 years of data achieved a fairly high accuracy score of 0.86 and a recall score of 0.77 during cross-validation. In other words, the model correctly identified whether a student had or had not ever illicitly used opioids 86% of the time and correctly identified 77% of the students who had used opioids illicitly.
Following our model fitting, testing, and feature analysis on all 10 years of data, we refit the model on each year of data separately, keeping the same hyperparameter tunings. While many of these annual models did not perform as well as the initial model trained on all 10 years' worth of data, this can largely be explained by the differences in the size of the training datasets. However, the models for 2020 and 2021 performed poorly even when compared to the models trained on previous years' data; this suggests some additional influence on the survey data for 2020 and 2021. The two most obvious candidates are the significant decrease in the reported rate of opioid use in 2021 and the possible effects of the COVID-19 pandemic on the survey methodology and/or on the actual behavior of the surveyed 12th graders.
As with our previous work in part one of this series, many of the identified important variables aligned with expectations, both when modeling the annual datasets and the entire dataset spanning 2012 to 2021. It was not surprising to find that students who engaged in various forms of illicit drug use and underage drinking were at higher risk of using opiates and heroin. It is worth noting that from 2012 to 2016, amphetamine use, followed by tranquilizer use, was considered by the models to be the best indicator of opioid use. From 2017 to 2020, those positions switched. In 2021, amphetamine and tranquilizer use became significantly less influential indicators.
Future work is planned to further analyze the lifestyle, behaviors, and opiate use of the survey subjects from 2021. Particular emphasis will be placed on examining what other behavioral and lifestyle changes occurred amongst 12th graders that might coincide with the decreased rate of illicit opiate use found amongst the MTF survey participants.