Monitoring The Future: 12th Grader Opiate Use From 2012 To 2021 - Part 2¶

By J Montgomery Maxwell - Data Science, Sr. Analyst - CTDS¶

Analysis Introduction¶

About The Data¶

The Monitoring The Future surveys have emerged as a vital tool for measuring the values, behaviors, and lifestyle orientations of American youth. These studies offer a unique glimpse into the ever-changing landscape of American 12th graders' lives. There are approximately 1,400 variables across all of the questionnaires; while recognizing the vastness of the available data, our exploration focuses primarily on a subset of variables deemed particularly relevant to our research.

One of the critical aspects examined in the surveys is the frequency of drug use among students, encompassing a wide array of illicit and recreational substances. In this analysis, attention is limited to the survey items that record instances of students using heroin or other opioid narcotics.

While the investigation of drug use patterns holds significance in understanding the landscape of contemporary American youth, the Monitoring The Future surveys provide a broader canvas for exploration. Our analysis incorporates other topics including, but not limited to, students' perspectives on religion, educational goals, family life dynamics, and work habits. By examining these multidimensional facets, we may gain a holistic understanding of the myriad factors influencing the lives, aspirations, and substance use risks of American students.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import sys

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.metrics import recall_score, precision_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.utils import resample

from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

Data Preparation¶

As in our previous work, we exclude a number of variables due to strong correlations between them. For instance, we observed among the survey respondents an expected connection between the number of alcoholic drinks consumed in the past year and the number consumed in the last 30 days. We believe the latter variable will serve as a more robust indicator for opiate use, so to streamline our analysis and strengthen the validity of our work, we removed the former variable.
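
The screening itself is not reproduced in this notebook, but a minimal sketch of the kind of correlation check used to flag such redundant pairs might look like the following (the 0.8 cutoff and the helper name are illustrative assumptions, not the exact procedure used):

import pandas as pd

# Illustrative sketch: list pairs of columns whose absolute Pearson
# correlation exceeds an assumed cutoff of 0.8
def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8):
    corr = frame.corr().abs()
    cols = corr.columns
    return [(a, b, corr.loc[a, b])
            for i, a in enumerate(cols)
            for b in cols[i + 1:]
            if corr.loc[a, b] > threshold]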

After removing a number of observations that were missing data, we were left with approximately 42,000 observations spanning the 10 years of survey data.

Most of the data you see below is recorded either as a binary outcome (1/yes or 0/no), as an ordinal value (e.g., 0 = no drug use, 1 = some drug use, 2 = frequent drug use), or as a categorical feature. The categorical data is encoded and represented as binary indicator columns, and the ordinal values were scaled down so that all of the data falls within the range of 0 to 1.
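
As a toy illustration of that scaling (the values below are fabricated, not survey data), an ordinal 0/1/2 response maps onto 0, 0.5, and 1:

import pandas as pd

# Hypothetical ordinal drug-use column: 0 = none, 1 = some, 2 = frequent
ordinal = pd.Series([0, 1, 2, 1])
scaled = (ordinal - ordinal.min()) / (ordinal.max() - ordinal.min())
print(scaled.tolist())  # [0.0, 0.5, 1.0, 0.5]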

In [2]:
random_seed=2023

variable_dict = {
 'V1': 'Year', 
 'RESPONDENT_AGE': 'Over18', 
 'V13': 'SchoolRegion',  
 'V49': 'NumberOfSiblings', 
 'V2102': 'Cigarettes/30Days', 
 'V2106': 'AlcoholicDrinks/30Days', 
 'V2117': 'Marijuana/30Days',  
 'V2118': 'LSD/Lifetime', 
 'V2121': 'Psychedelics/Lifetime', 
 'V2124': 'Cocaine/Lifetime', 
 'V2127': 'Amphetamines/Lifetime', 
 'V2133': 'Sedatives/Lifetime', 
 'V2136': 'Tranquilizers/Lifetime', 
 'V2139': 'Heroin/Lifetime', 
 'V2142': 'OpioidNarcotics/Lifetime', 
 'V2150': 'Sex', 
 'V2151': 'Race', 
 'V2152': 'RaisedWhere', 
 'V2153': 'MaritalStatus', 
 'V2155': 'LivesWithFather', 
 'V2156': 'LivesWithMother', 
 'V2157': 'LivesWithSiblings', 
 'V2163': 'FatherEduLvl', 
 'V2164': 'MotherEduLvl', 
 'V2165': 'MotherHadPaidJobWhileGrowingUp', 
 'V2166': 'PoliticalPreference', 
 'V2167': 'PoliticalBeliefs', 
 'V2169': 'ReligiousServiceAttendanceWkly',  
 'V2170': 'ReligionImportance', 
 'V2172': 'HighSchoolProgram', 
 'V2174': 'SelfRateIntelligence', 
 'V2175': 'SchoolDaysMissedIllness/4Weeks', 
 'V2176': 'SchoolDaysMissedSkipped/4Weeks', 
 'V2177': 'SchoolDaysMissedOther/4Weeks', 
 'V2178': 'SkippedClass/4Weeks', 
 'V2179': 'AverageGradeHS', 
 'V2180': 'LikelyToAttendVocationalSchl',
 'V2181': 'LikelyToServeInMilitary', 
 'V2182': 'LikelyToGraduate2YrCollege', 
 'V2183': 'LikelyToGraduate4YrCollege', 
 'V2184': 'LikelyToAttendGraduateSchl', 
 'V2185': 'WantToDoVocationalSchl', 
 'V2186': 'WantToServeInMilitary', 
 'V2187': 'WantToDo2YrCollege', 
 'V2188': 'WantToDo4YrCollege', 
 'V2189': 'WantToDoGradSchl', 
 'V2190': 'WantToDoNo2ndEd', 
 'V2191': 'HrsWorkedPerWeek', 
 'V2193': 'MoneyFromOtherSource', 
 'V2194': 'EveningsOutPerWeek', 
 'V2195': 'DatesHowOften', 
 'V2196': 'MilesDrivenPerWeek', 
 'V2197': 'DrivingTickets', 
 'V2201': 'CarAccidentsLast12Mo',  
 'V2459': 'Crack/Lifetime', 
}

df_2012 = pd.read_csv('Grade12/ICPSR_34861/DS0001/34861-0001-Data.tsv', sep='\t')
df_2013 = pd.read_csv('Grade12/ICPSR_35218/DS0001/35218-0001-Data.tsv', sep='\t')
df_2014 = pd.read_csv('Grade12/ICPSR_36263/DS0001/36263-0001-Data.tsv', sep='\t')
df_2015 = pd.read_csv('Grade12/ICPSR_36408/DS0001/36408-0001-Data.tsv', sep='\t')
df_2016 = pd.read_csv('Grade12/ICPSR_36798/DS0001/36798-0001-Data.tsv', sep='\t')
df_2017 = pd.read_csv('Grade12/ICPSR_37182/DS0001/37182-0001-Data.tsv', sep='\t')
df_2018 = pd.read_csv('Grade12/ICPSR_37416/DS0001/37416-0001-Data.tsv', sep='\t')
df_2019 = pd.read_csv('Grade12/ICPSR_37841/DS0001/37841-0001-Data.tsv', sep='\t')
df_2020 = pd.read_csv('Grade12/ICPSR_38156/DS0001/38156-0001-Data.tsv', sep='\t')
df_2021 = pd.read_csv('Grade12/ICPSR_38503/DS0001/38503-0001-Data.tsv', sep='\t')

df_list = [df_2012, df_2013, df_2014, df_2015, df_2016, df_2017, df_2018, df_2019, df_2020, df_2021]
year_list = [2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
In [3]:
# Keep only the survey variables common to all 10 years of data
common_vars = set(df_2012.columns)
for annual_df in df_list:
    common_vars &= set(annual_df.columns)

variables = list(common_vars)
df = pd.concat([x[variables] for x in df_list], ignore_index=True)

df = df[variable_dict.keys()]

# Remove missing data (-9 is the missing-value code): drop columns where
# 30% or more of the values are missing...
missing_criteria = (df == -9).sum() < 0.3*len(df.index)
df = df[missing_criteria.index[missing_criteria]]

# ...then drop every remaining row that contains any missing value
df = df[(df == -9).sum(axis=1) == 0]

# Combine opiate use data: flag any lifetime use of heroin (V2139) or other
# opioid narcotics (V2142); a coded response of 1 corresponds to no use
df['OpiateUse'] = ((df['V2142'] != 1) | (df['V2139'] != 1)).astype(int)
df = df.drop(['V2142', 'V2139'], axis=1)

# Rename columns using data dictionary 
df.rename(columns=variable_dict, inplace=True)

# Factor categorical data
dummy_cols = ['SchoolRegion', 'Race', 'RaisedWhere', 'MaritalStatus', 'PoliticalPreference', 'PoliticalBeliefs', 'HighSchoolProgram']
dummies = pd.get_dummies(df[dummy_cols], columns=dummy_cols, drop_first=True) 
df = pd.concat([df, dummies], axis=1)
df = df.drop(dummy_cols, axis=1)

# Normalize data: min-max scale every column into [0, 1], preserving the raw Year values
df.replace({False: 0, True: 1}, inplace=True)
years_vec = df['Year']
df = (df-df.min())/(df.max()-df.min())
df['Year'] = years_vec

df = df.reset_index(drop=True)
df.head()
Out[3]:
Year Over18 NumberOfSiblings Cigarettes/30Days AlcoholicDrinks/30Days Marijuana/30Days LSD/Lifetime Psychedelics/Lifetime Cocaine/Lifetime Amphetamines/Lifetime ... PoliticalPreference_8 PoliticalBeliefs_2 PoliticalBeliefs_3 PoliticalBeliefs_4 PoliticalBeliefs_5 PoliticalBeliefs_6 PoliticalBeliefs_8 HighSchoolProgram_2 HighSchoolProgram_3 HighSchoolProgram_4
0 2012 0.0 0.666667 0.333333 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
1 2012 0.0 1.000000 0.166667 0.166667 0.166667 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 2012 1.0 0.333333 0.000000 0.166667 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 2012 0.0 0.333333 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
4 2012 1.0 0.333333 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 79 columns

Analysis And Visualization¶

Annual Opiate Use Rate¶

Here we find a steady decline in opiate use rates amongst the surveyed 12th graders. There is a significant drop in opiate abuse in 2021; however, this abnormal decrease could stem from a number of outlying conditions, such as the COVID-19 pandemic and the subsequent stay-at-home orders and remote schooling policies.

In [4]:
series = (df.groupby('Year')['OpiateUse'].sum() / df.groupby('Year')['Year'].count())
ax = sns.barplot(x=list(series.index), y=list(series.array), color='purple')
ax.set(xlabel='Year', ylabel='Opiate Use Rate')
plt.show()

Rate Of Illicit Substance Use Coinciding With Opiate Abuse¶

Below we can see that more than half of all those surveyed who had ever used tranquilizers, cocaine, or psychedelics had also abused opiates or opioid narcotics at some point in their lives. Rates of coinciding sedative, amphetamine, and LSD use are not far behind.

In [5]:
drug_use_df = df[['Tranquilizers/Lifetime', 'Amphetamines/Lifetime', 'Sedatives/Lifetime',
    'Marijuana/30Days', 'Cocaine/Lifetime', 'Cigarettes/30Days', 'Psychedelics/Lifetime', 
    'AlcoholicDrinks/30Days', 'LSD/Lifetime', 'OpiateUse']]

ratio_dict = {}
for col in drug_use_df.columns:
    ratio = drug_use_df.loc[((drug_use_df[col] > 0) & (drug_use_df['OpiateUse'] > 0))].shape[0] / drug_use_df.loc[(drug_use_df[col] > 0)].shape[0]
    ratio_dict[col] = ratio
ratio_dict.pop('OpiateUse', None)
ratio_dict
Out[5]:
{'Tranquilizers/Lifetime': 0.5361467889908257,
 'Amphetamines/Lifetime': 0.4222972972972973,
 'Sedatives/Lifetime': 0.4824810606060606,
 'Marijuana/30Days': 0.21971732274267025,
 'Cocaine/Lifetime': 0.6027928626842514,
 'Cigarettes/30Days': 0.2929608938547486,
 'Psychedelics/Lifetime': 0.5102379634753735,
 'AlcoholicDrinks/30Days': 0.14715653703341802,
 'LSD/Lifetime': 0.4817658349328215}

Readers familiar with the previous analysis on the HEAL platform, Monitoring The Future: 12th Grader Opiate Use - Part 1, will recall that the machine learning models used in that notebook identified an undeniable connection between opioid abuse and other drug use for 12th graders in 2019. While that connection appears to exist again, let's examine whether the same decrease seen in opioid use is present in other drug use amongst 12th graders.

Annual Cocaine Use Rate¶

In [6]:
series = (df.groupby('Year')['Cocaine/Lifetime'].sum() / df.groupby('Year')['Year'].count())
ax = sns.barplot(x=list(series.index), y=list(series.array), color='purple')
ax.set(xlabel='Year', ylabel='Cocaine Use Rate')
plt.show()

Annual Tranquilizer Use Rate¶

In [7]:
series = (df.groupby('Year')['Tranquilizers/Lifetime'].sum() / df.groupby('Year')['Year'].count())
ax = sns.barplot(x=list(series.index), y=list(series.array), color='purple')
ax.set(xlabel='Year', ylabel='Tranquilizer Use Rate')
plt.show()

Annual Sedatives Use Rate¶

In [8]:
series = (df.groupby('Year')['Sedatives/Lifetime'].sum() / df.groupby('Year')['Year'].count())
ax = sns.barplot(x=list(series.index), y=list(series.array), color='purple')
ax.set(xlabel='Year', ylabel='Sedatives Use Rate')
plt.show()

Annual Amphetamines Use Rate¶

In [9]:
series = (df.groupby('Year')['Amphetamines/Lifetime'].sum() / df.groupby('Year')['Year'].count())
ax = sns.barplot(x=list(series.index), y=list(series.array), color='purple')
ax.set(xlabel='Year', ylabel='Amphetamines Use Rate')
plt.show()

While there is a substantial decrease in the use of amphetamines, sedatives, tranquilizers, and cocaine in 2021, we do not see the same slow decline in their usage from 2012 to 2020. It is, however, worth noting the significant degree to which opioids are abused compared to the other illicit substances. Opiate use peaked at over 11% in 2012, while all of the other substances have peaks between 1% and 6%.

*Opiate use is a combined measurement of the heroin and opioid narcotic use survey results.

Cross Validation, Hyperparameter Tuning, And Model Selection¶

The work done in the first example analysis notebook in this series used SMOTE (Synthetic Minority Oversampling Technique) and multiple, simpler classification models, such as logistic regression and random forest classification. This notebook will use the XGB (Extreme Gradient Boosting) classification algorithm. Unlike with the simpler models used previously, the XGBClassifier does not require the use of SMOTE. Instead, we will use the XGBClassifier hyperparameter, 'scale_pos_weight'. When classes are imbalanced, 'scale_pos_weight' is typically set as the ratio of the number of overrepresented observations (no opiate use) to the number of underrepresented observations (opiate use). This hyperparameter scales the gradient of the positive, underrepresented class, thus scaling the errors of the positive class during the model training step.
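
As a quick illustration of that convention (the labels below are fabricated), the suggested value is simply the negative count divided by the positive count:

import pandas as pd

# Toy labels with an 8:2 class imbalance
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
neg, pos = (y == 0).sum(), (y == 1).sum()
print(neg / pos)  # 4.0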

While 'scale_pos_weight' is taken at the suggested value, most of the other hyperparameters of the XGBClassifier model were tuned using grid search and 10-fold cross validation. In other words, we exhaustively tested combinations of hyperparameters drawn from a subset of possible values and verified the optimal combination using 10-fold cross validation.
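
The search itself is computationally expensive and is not rerun here; a minimal sketch of the procedure, with an illustrative (not the actual) parameter grid and an assumed scoring metric, might look like this:

from sklearn.model_selection import GridSearchCV

# Assumed, illustrative grid; the values actually searched were likely broader
param_grid = {
    'max_depth': [3, 5, 7],
    'gamma': [0.4, 0.6, 0.8],
    'subsample': [0.7, 0.8, 0.9],
}
search = GridSearchCV(
    XGBClassifier(objective='binary:logistic', seed=random_seed),
    param_grid,
    cv=10,             # 10-fold cross validation
    scoring='recall',  # assumed scoring metric
    n_jobs=-1,
)
# search.fit(X_train, y_train); search.best_params_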

Below you will find the function used for cross validation and the final model used with the tuned hyperparameters.

In [10]:
def pipeline_cross_validation(data, k, pipeline_steps):

    # Split the data into k folds; each fold serves as the test set once
    folds = np.array_split(data, k)
    accuracySum = 0
    precisionSum = 0
    recallSum = 0

    for i in range(k):
        # Hold out fold i for testing; train on the remaining k-1 folds
        train = folds.copy()
        test = folds[i]
        del train[i]
        train = pd.concat(train, sort=False)

        y_train = train.OpiateUse
        X_train = train.drop('OpiateUse', axis=1)

        y_test = test.OpiateUse
        X_test = test.drop('OpiateUse', axis=1)

        pipeline = Pipeline(pipeline_steps).fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)

        accuracySum += accuracy_score(y_test, y_pred)
        precisionSum += precision_score(y_test, y_pred)
        recallSum += recall_score(y_test, y_pred)

    # Average each metric across the k folds
    return [accuracySum/k, precisionSum/k, recallSum/k]
In [11]:
df1 = df.drop(['Year'], axis=1)

model = XGBClassifier(
 learning_rate =0.01,
 n_estimators=200,
 max_depth=5,
 min_child_weight=1,
 gamma=0.6,
 subsample=0.8,
 colsample_bytree=0.7,
 objective= 'binary:logistic',
 seed=random_seed,
 scale_pos_weight = len(df1)/sum(df1['OpiateUse']))  # total-to-positive ratio, close to the suggested negative-to-positive ratio here

steps =  [('model', model)]   
score = pipeline_cross_validation(df1, k=10, pipeline_steps = steps)
print(f'Accuracy: {score[0]} | Precision: {score[1]} | Recall: {score[2]}')
Accuracy: 0.8616077788418954 | Precision: 0.31930083188943403 | Recall: 0.7636644739387797

Opioid Use Classification¶

This first iteration of model training, testing, and evaluation will use all 10 years of data spanning from 2012 to 2021. This model will be the same XGBClassifier that was tuned and validated above. After the model is trained, we will examine its performance by looking at the accuracy, precision, and recall metrics. Model accuracy will tell us the model's rate of correctly classifying both the students who have and have not ever abused opiates; model recall will tell us the model's rate of correctly identifying all of the students who have abused opiates; and model precision will tell us the model's rate of correctly identifying students who have abused opiates amongst all students the model identifies as having used opiates.
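
As a quick arithmetic illustration of those three metrics with fabricated confusion-matrix counts:

# Fabricated counts, chosen only to illustrate the formulas
tp, fn, fp, tn = 70, 30, 140, 760           # true/false positives and negatives

accuracy = (tp + tn) / (tp + fn + fp + tn)  # 0.83
recall = tp / (tp + fn)                     # 0.70
precision = tp / (tp + fp)                  # ~0.33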

After examining the model's performance and the corresponding confusion matrix, we will look at what the model considers to be the most and least relevant features to capturing students' opiate use behavior.

In [12]:
X = df1.drop('OpiateUse', axis=1)
y = df1.OpiateUse
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_seed)

model = XGBClassifier(
 learning_rate =0.01,
 n_estimators=200,
 max_depth=5,
 min_child_weight=1,
 gamma=0.6,
 subsample=0.8,
 colsample_bytree=0.7,
 objective= 'binary:logistic',
 seed=random_seed,
 scale_pos_weight = len(df1)/sum(df1['OpiateUse']))

boosted_model = model.fit(X_train, y_train)
y_pred = boosted_model.predict(X_test)

Model Performance¶

The following model scores are scaled between 0 and 1, with 1 being perfect performance. The scores and the corresponding confusion matrix tell us that the model performs relatively well at identifying students who have ever used opiates illicitly, finding almost 78% of them, and performs well at correctly classifying about 86% of all students in either category. However, the model does a poor job of singling out only those who abused opioids amongst the student population. In fact, almost 70% of the students our model says have used opioids illicitly have not.
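
For reference, that last figure is simply one minus the model's precision, which can be checked directly from the predictions computed above:

# Share of predicted opiate users who are actually false positives (~68.5%)
print(f'{1 - precision_score(y_test, y_pred):.1%}')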

Model Scores¶

In [13]:
print(f'Accuracy: {accuracy_score(y_test, y_pred)} | Precision: {precision_score(y_test, y_pred)} | Recall: {recall_score(y_test, y_pred)}')
Accuracy: 0.8562589243217515 | Precision: 0.3153846153846154 | Recall: 0.7784810126582279

Confusion Matrix¶

In [14]:
disp = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred))
disp.plot()
disp.im_.colorbar.remove()
plt.show()

Most Important Features¶

In [15]:
feature_importance = pd.DataFrame(index=df1.columns.drop('OpiateUse'))
# Min-max normalize the importance scores so the top feature scores 1.0
feature_importance['Feature Importance'] = (boosted_model.feature_importances_ - min(boosted_model.feature_importances_)) / (max(boosted_model.feature_importances_) - min(boosted_model.feature_importances_))
feature_importance.sort_values(by=['Feature Importance'], ascending=False).head(15)
Out[15]:
Feature Importance
Amphetamines/Lifetime 1.000000
Tranquilizers/Lifetime 0.546011
Marijuana/30Days 0.349006
Sedatives/Lifetime 0.169479
AlcoholicDrinks/30Days 0.159707
Cigarettes/30Days 0.107511
LivesWithSiblings 0.066077
Cocaine/Lifetime 0.057510
LSD/Lifetime 0.045208
Psychedelics/Lifetime 0.039737
MilesDrivenPerWeek 0.039215
SchoolDaysMissedSkipped/4Weeks 0.031901
DrivingTickets 0.029383
WantToServeInMilitary 0.028945
LikelyToAttendGraduateSchl 0.024677

Least Important Features¶

In [16]:
feature_importance.sort_values(by=['Feature Importance'], ascending=False).tail(15)
Out[16]:
Feature Importance
WantToDo2YrCollege 0.004603
HighSchoolProgram_3 0.003934
FatherEduLvl 0.003796
MotherEduLvl 0.003511
RaisedWhere_8 0.003493
SkippedClass/4Weeks 0.003340
HighSchoolProgram_4 0.003050
PoliticalBeliefs_2 0.002843
MaritalStatus_2 0.002649
PoliticalPreference_2 0.002466
HighSchoolProgram_2 0.002207
RaisedWhere_6 0.002152
Race_2 0.000881
PoliticalPreference_3 0.000039
Race_3 0.000000

Above we see the 15 most and least important features ranked by their normalized feature importance scores. These scores indicate the relative amount each feature improves the performance of the model when making a decision along the given feature. As with our work in part one, we find that students' use of other illicit substances is the most telling indicator of whether a student will use opiates or opioid narcotics. We also find that students' race, political preference or beliefs, and where students were raised give little to no indication of whether they will abuse opioids.
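
For readers who prefer a visual comparison, a quick horizontal bar chart of the top 15 scores (not part of the original output, reusing the plotting style from earlier cells) could be produced as follows:

# Optional visualization: top 15 normalized feature importance scores
top15 = feature_importance.sort_values(by=['Feature Importance'], ascending=False).head(15)
ax = sns.barplot(x=top15['Feature Importance'], y=top15.index, color='purple')
ax.set(xlabel='Normalized Feature Importance', ylabel='Feature')
plt.show()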

Annual Opioid Use Classification¶

Next we will rerun this analysis but focus on only one year of data at a time. We will use the same hyperparameters as before for these newly trained models, since the computation for hyperparameter tuning takes a significant amount of time. After training the 10 models for the 10 years of data, we will again examine the models' performance and ranked feature importance.

Training And Testing Model On Annual Data¶

In [17]:
feature_importance = pd.DataFrame(index=df1.columns.drop('OpiateUse'))
classification_scores = pd.DataFrame({'Metric': ['Accuracy', 'Precision', 'Recall']}) 

for year in year_list:

    df_annual = df[df['Year'] == year]
    df_annual = df_annual.drop(['Year'], axis=1)

    X = df_annual.drop('OpiateUse', axis=1)
    y = df_annual.OpiateUse
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_seed)
    
    
    model = XGBClassifier(
        learning_rate =0.01,
        n_estimators=200,
        max_depth=5,
        min_child_weight=1,
        gamma=0.6,
        subsample=0.8,
        colsample_bytree=0.7,
        objective= 'binary:logistic',
        seed=random_seed,
        scale_pos_weight = len(df_annual)/sum(df_annual['OpiateUse']))
    
    boosted_model = model.fit(X_train, y_train)
    y_pred = boosted_model.predict(X_test)

    feature_importance[f'Feature Importance {year}'] = (boosted_model.feature_importances_ - min(boosted_model.feature_importances_)) / (max(boosted_model.feature_importances_) - min(boosted_model.feature_importances_))
    classification_scores[f'Model Performance {year}'] = [accuracy_score(y_test, y_pred),  precision_score(y_test, y_pred), recall_score(y_test, y_pred)]

Annual Accuracy And Precision Scores¶

Below we can see that for the first few years, the annual models' performance is only slightly below that of the model trained using all 10 years of data. In 2020 and 2021, the models do not perform nearly as well as the models trained on prior annual data or the model trained using all of the data. The underperformance of the annual models compared to our first model is to be expected, as each is trained on a much smaller subset of data. However, the poor performance of the later models cannot be completely explained by the difference in training data size. The lower performance of the 2020 and 2021 models may be explained by the disruption to behavior caused by the COVID-19 pandemic, but may also be explained by the massive decrease in the opioid use rate of the 12th graders surveyed in 2021.

In [18]:
classification_scores
Out[18]:
Metric Model Performance 2012 Model Performance 2013 Model Performance 2014 Model Performance 2015 Model Performance 2016 Model Performance 2017 Model Performance 2018 Model Performance 2019 Model Performance 2020 Model Performance 2021
0 Accuracy 0.864261 0.875000 0.886261 0.900881 0.885468 0.918782 0.912470 0.920582 0.929688 0.962552
1 Precision 0.443439 0.431694 0.433566 0.357798 0.369748 0.494118 0.337079 0.392857 0.428571 0.344828
2 Recall 0.736842 0.669492 0.756098 0.661017 0.709677 0.666667 0.681818 0.622642 0.375000 0.555556

Most Important Features Each Year¶

Like before, we see the same relation between general illicit substance use and illicit opiate use. It is worth noting that the use of tranquilizers and amphetamines typically took the top two positions for most influential features until 2021, when they fell to the 4th and 5th rankings. This is another indicator that something significant shifted in our model's classification of 12th grader opiate use.

In [19]:
# For each year, rank the features by importance and format the top 10 as
# "feature - normalized importance" strings, then line the years up side by side
annual_top_features = []
for year in year_list:
    col = f'Feature Importance {year}'
    ranked = pd.DataFrame(feature_importance.sort_values(by=[col], ascending=False)[col]).head(10).reset_index(names=['Feature'])
    ranked[col] = ranked['Feature'] + " - " + ranked[col].astype(str)
    annual_top_features.append(ranked.drop(['Feature'], axis=1))

pd.concat(annual_top_features, axis=1)
Out[19]:
Feature Importance 2012 Feature Importance 2013 Feature Importance 2014 Feature Importance 2015 Feature Importance 2016 Feature Importance 2017 Feature Importance 2018 Feature Importance 2019 Feature Importance 2020 Feature Importance 2021
0 Amphetamines/Lifetime - 1.0 Amphetamines/Lifetime - 1.0 Amphetamines/Lifetime - 1.0 Amphetamines/Lifetime - 1.0 Amphetamines/Lifetime - 1.0 Tranquilizers/Lifetime - 1.0 Tranquilizers/Lifetime - 1.0 Tranquilizers/Lifetime - 1.0 Tranquilizers/Lifetime - 1.0 Sedatives/Lifetime - 1.0
1 Marijuana/30Days - 0.80361795 Tranquilizers/Lifetime - 0.53648746 Tranquilizers/Lifetime - 0.42579514 Tranquilizers/Lifetime - 0.40697834 Tranquilizers/Lifetime - 0.5532162 Amphetamines/Lifetime - 0.8978286 Amphetamines/Lifetime - 0.47637126 Amphetamines/Lifetime - 0.5347534 Amphetamines/Lifetime - 0.75638884 Psychedelics/Lifetime - 0.96621937
2 Tranquilizers/Lifetime - 0.7582158 Sedatives/Lifetime - 0.33624035 Marijuana/30Days - 0.26776016 Cigarettes/30Days - 0.2702855 Marijuana/30Days - 0.3660885 Marijuana/30Days - 0.46705022 Marijuana/30Days - 0.23403315 LSD/Lifetime - 0.28337172 Sedatives/Lifetime - 0.57344633 LSD/Lifetime - 0.69477534
3 AlcoholicDrinks/30Days - 0.24490957 Cigarettes/30Days - 0.25384387 Sedatives/Lifetime - 0.26318496 Marijuana/30Days - 0.23505515 Sedatives/Lifetime - 0.19532187 Cigarettes/30Days - 0.339489 Sedatives/Lifetime - 0.15149848 AlcoholicDrinks/30Days - 0.27817625 WantToDo2YrCollege - 0.34889314 Tranquilizers/Lifetime - 0.6590325
4 Sedatives/Lifetime - 0.24046983 Marijuana/30Days - 0.20469971 AlcoholicDrinks/30Days - 0.20253478 Sedatives/Lifetime - 0.20434554 LSD/Lifetime - 0.12840125 Sedatives/Lifetime - 0.23474461 LivesWithSiblings - 0.14690088 Marijuana/30Days - 0.27359134 LSD/Lifetime - 0.33024985 Amphetamines/Lifetime - 0.5497037
5 Cigarettes/30Days - 0.21149834 AlcoholicDrinks/30Days - 0.13245568 Cocaine/Lifetime - 0.14404559 PoliticalBeliefs_6 - 0.141219 AlcoholicDrinks/30Days - 0.12600905 DrivingTickets - 0.21802866 DrivingTickets - 0.14502884 WantToDoVocationalSchl - 0.19407636 Marijuana/30Days - 0.3265674 Marijuana/30Days - 0.52923715
6 LivesWithSiblings - 0.17189188 Cocaine/Lifetime - 0.09957072 Psychedelics/Lifetime - 0.13863444 Psychedelics/Lifetime - 0.13314435 PoliticalBeliefs_8 - 0.12189335 Crack/Lifetime - 0.19706848 LSD/Lifetime - 0.14053832 WantToServeInMilitary - 0.17626849 RaisedWhere_1 - 0.28275934 WantToDoVocationalSchl - 0.40517214
7 WantToServeInMilitary - 0.1589078 SkippedClass/4Weeks - 0.09621134 PoliticalPreference_7 - 0.12886047 Cocaine/Lifetime - 0.09898375 Crack/Lifetime - 0.12058827 Cocaine/Lifetime - 0.19672854 PoliticalPreference_6 - 0.13798892 Psychedelics/Lifetime - 0.17309545 SkippedClass/4Weeks - 0.27557352 RaisedWhere_7 - 0.40456265
8 Cocaine/Lifetime - 0.15065162 Psychedelics/Lifetime - 0.09555236 PoliticalBeliefs_6 - 0.110919036 SchoolRegion_2 - 0.084943004 LivesWithSiblings - 0.11986844 LSD/Lifetime - 0.19137083 Cocaine/Lifetime - 0.13504994 SkippedClass/4Weeks - 0.16032512 SchoolDaysMissedIllness/4Weeks - 0.27026558 SchoolDaysMissedOther/4Weeks - 0.3853811
9 Psychedelics/Lifetime - 0.15017651 PoliticalPreference_7 - 0.09405832 Cigarettes/30Days - 0.102482185 SchoolDaysMissedIllness/4Weeks - 0.08461291 RaisedWhere_6 - 0.10401539 SkippedClass/4Weeks - 0.17320769 MoneyFromOtherSource - 0.11899159 PoliticalBeliefs_8 - 0.15851824 LikelyToAttendVocationalSchl - 0.26334405 DatesHowOften - 0.3577521

Conclusion¶

Using the XGBoost (extreme gradient boosting) classifier algorithm and cross-validation techniques, we successfully tuned, trained, and tested an XGBClassifier model on survey data from 2012 to 2021 for classifying whether a surveyed student had ever used opioids illicitly. The model trained using all 10 years of data had a fairly high accuracy score of 0.86 and a recall score of 0.76 during cross-validation. In other words, the model correctly identified whether a student had or had not ever illicitly used opioids 86% of the time and correctly identified 76% of the students who had used opioids illicitly.

Following our model fitting, testing, and feature analysis on all 10 years of data, we refit our model on each year of data separately, keeping the same hyperparameter tunings. While many of these annual models did not perform as well as the initial model trained on all 10 years' worth of data, this can easily be explained by differences in the size of the training datasets. However, the models for 2020 and 2021 performed poorly even when compared to the models trained on the previous years' data; this suggests some additional influence is acting on the survey data for 2020 and 2021. The two most obvious influences would be the significant decrease in the reported rate of opioid use in 2021 and the possible effects of the COVID-19 pandemic on the survey methodology and/or changes in the actual behavior of the surveyed 12th graders.

As with our previous work in part one of this series, many of the identified important variables aligned with expectations, both when modeling the annual datasets and the entire dataset spanning 2012 to 2021. It was not surprising to find that students who engaged in various forms of illicit drug use and underage drinking were at higher risk of using opiates and heroin. It is worth noting that from 2012 to 2016, amphetamine use, followed by tranquilizer use, was considered by the models to be the best indicator of opioid use. From 2017 to 2020, those positions switched. In 2021, amphetamine and tranquilizer use became significantly less influential indicators.

Future Work¶

Future work is planned to further analyze the lifestyle, behaviors, and opiate use of the 2021 survey subjects. Particular emphasis will be placed on examining what other behavioral and lifestyle changes occurred amongst 12th graders that might coincide with the decreased rate of illicit opiate use found amongst the MTF survey participants.
