Building an Artificial Intelligence tool to predict test scores

You can find the tool here.

Using several different variables, how reliably can we predict test scores? For this project, the answer was a model that explains roughly 40% of the variance in scores (R² ≈ 0.40).

Project Explanation

This project is focused on using three different datasets to predict the ScaleScore for students on their Tennessee Comprehensive Assessment Program (TCAP) exam. The data sources used for this project include NWEA MAP data, attendance data, and TCAP data from the 2020-2021 and 2021-2022 school years. NWEA MAP is a benchmarking test series that provides a snapshot of student progress through a grade level and is also used to predict achievement levels. Attendance data shows how many days a student attended school in a year and includes demographic fields that add additional variables for multiple regression analysis. Finally, TCAP data includes the ScaleScore for each student. The data files are merged using the student's state ID as the primary key.

The datasets used for this project include:

  • 2020-2021 Absentee Data: 2021_abs_data.csv

  • 2021-2022 Absentee Data: 2022_abs_data.csv

  • 2020-2021 MAP Test Data: MAP2021W_results.csv

  • 2021-2022 MAP Test Data: MAP2122W_results.csv

  • 2020-2021 TCAP Test Data: TCAP2021_results.csv

  • 2021-2022 TCAP Test Data: TCAP2122_results.csv

This document will cover the following sections:

  1. Project Explanation (this section)

  2. Data Cleaning

  3. Data Visualization

  4. Data Training

  5. Data Predictions

  6. Website Encoding

  7. Reflection

Data Cleaning

When cleaning the data, I opted to do some of it manually in Excel by removing unnecessary columns and creating a new column in the absentee data files for the number of days present. I did this to ensure that any identifying information was removed from the files and to get a more accurate picture of the relationship between attendance and Scale Scores. However, I understand that it's best practice to do all the cleaning in the code, and I did the remaining cleaning in Python.
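Those manual Excel steps could also be reproduced in pandas. Here is a minimal sketch; the raw column names (`n_instructional_days`, `FirstName`) are assumptions for illustration, not the real headers:

```python
import pandas as pd

# A tiny stand-in for one year of absentee data; values are made up.
abs_df = pd.DataFrame({
    "StateID": [101, 102],
    "FirstName": ["A", "B"],              # identifying column to remove
    "n_instructional_days": [180, 180],
    "n_absences": [4, 12],
})

# Recreate the column added in Excel: days present = days in session minus absences
abs_df["n_days_present"] = abs_df["n_instructional_days"] - abs_df["n_absences"]

# Drop identifying and unnecessary columns in code rather than by hand
abs_df = abs_df.drop(columns=["FirstName"])
```

Doing this in code keeps the cleaning reproducible if the raw files ever change.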

When I tried to merge the files, I ran into some challenges because I overcomplicated the process. At first, I thought an outer join would work, then I considered a left join, but both approaches produced a lot of NaN values and over 140k rows of data. I eventually realized that I only needed to concatenate the two years of each file type and then merge the Attendance, MAP, and TCAP files together, because the year of the test was not important for the final result.
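The concatenate-then-merge approach can be sketched with toy frames; `StateID` stands in for the state ID key and all values are made up:

```python
import pandas as pd

# Toy one-row frames standing in for each year's files
abs_y1 = pd.DataFrame({"StateID": [1], "n_days_present": [170]})
abs_y2 = pd.DataFrame({"StateID": [2], "n_days_present": [165]})
map_y1 = pd.DataFrame({"StateID": [1], "TestRITScore": [210]})
map_y2 = pd.DataFrame({"StateID": [2], "TestRITScore": [220]})
tcap_y1 = pd.DataFrame({"StateID": [1], "ScaleScore": [330]})
tcap_y2 = pd.DataFrame({"StateID": [2], "ScaleScore": [345]})

# Stack the two years of each file type first ...
attendance = pd.concat([abs_y1, abs_y2], ignore_index=True)
map_data = pd.concat([map_y1, map_y2], ignore_index=True)
tcap = pd.concat([tcap_y1, tcap_y2], ignore_index=True)

# ... then an inner merge on StateID keeps only matched students,
# avoiding the NaN-filled rows an outer or left join produced
merged = attendance.merge(map_data, on="StateID").merge(tcap, on="StateID")
```

`merge` defaults to an inner join, which is why the mismatched rows disappear.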

To prepare the data for analysis, I made a few changes. First, I converted the Course and ContentAreaCode columns from strings into numeric data, mapping Math to 1 and English to 2. Next, I removed the Science and Social Studies rows, because my analysis was only interested in Math and English scores. Since I didn't need both the Course and ContentAreaCode columns, I then deleted the latter. However, the Course column didn't show a strong correlation with the target variable, so I ultimately excluded it from the analysis as well.
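Filtering the subjects and encoding the Course column might look like this in pandas; the values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Course": ["Math", "English", "Science", "Social Studies"],
    "ScaleScore": [340, 325, 310, 300],
})

# Keep only the Math and English rows
df = df[df["Course"].isin(["Math", "English"])].copy()

# Encode the Course strings numerically: Math -> 1, English -> 2
df["Course"] = df["Course"].map({"Math": 1, "English": 2})
```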

In total, I deleted more than 100 columns from the original files in Excel. While best practice favors keeping cleaning steps in code so they are reproducible, this project only needed one-off files, so it was faster to prune them in Excel than to script every deletion in Python. Although this approach wouldn't scale to an automated pipeline, it suited my project's purposes.

Data Visualization

In the visualization stage of the project, I performed a Pearson correlation to determine the correlation between each variable and the TCAP Scale Score, which is the target variable. Based on the results (see below), only a few variables showed any meaningful correlation with the Scale Score: TestDurationMinutes, TestRITScore, EnrolledGrade, ED, SWD, and n_days_present. However, I excluded TestDurationMinutes, since it was not logical to use data from the MAP test session itself to predict the TCAP score. Ultimately, I focused on RIT Score, grade level, the number of days present, economically disadvantaged (ED) status, and student with disabilities (SWD) status, as these showed the strongest correlations with the Scale Score.

Course                 0.013974
TestDurationMinutes    0.199303
TestRITScore           0.523997
SchoolNumber          -0.048202
EnrolledGrade         -0.125896
TestGrade             -0.126014
ScaleScore             1.000000
school                -0.067635
n_absences            -0.166537
Black                 -0.037727
Hispanic              -0.029100
Native                 0.029105
BHN                   -0.041469
HPI                   -0.012874
Asian                  0.033111
White                  0.026005
ED                    -0.158510
SWD                   -0.316250
EL                    -0.029312
n_days_present         0.172471
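A correlation column like the one above comes straight out of pandas; a minimal sketch with toy data:

```python
import pandas as pd

# Toy frame; the real merged dataset has roughly 20 numeric columns
df = pd.DataFrame({
    "ScaleScore": [300, 320, 340, 360],
    "TestRITScore": [200, 210, 225, 235],
    "n_days_present": [150, 160, 170, 155],
})

# Pearson correlation of every numeric column with the target variable
corrs = df.corr(method="pearson")["ScaleScore"].sort_values(ascending=False)
print(corrs)
```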

After I selected the variables I wanted to focus on, I created scatterplots to visualize the relationships in the data. I focused specifically on the attendance and MAP data, as grade level, SWD, and ED wouldn't be suitable for a scatterplot. The scatterplot for TestRITScore and ScaleScore displayed a clear linear relationship, while the attendance data was more scattered, with more outliers in ScaleScore than in RITScore. These visualizations helped me confirm that my data was suitable for analysis and gave me confidence in my approach.

Scatterplot
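A scatterplot like the RIT-vs-ScaleScore one can be produced with matplotlib; the values here are toy stand-ins, and the Agg backend keeps the sketch runnable without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"TestRITScore": [200, 210, 225, 235],
                   "ScaleScore": [300, 320, 340, 360]})

fig, ax = plt.subplots()
ax.scatter(df["TestRITScore"], df["ScaleScore"], alpha=0.5)
ax.set_xlabel("MAP RIT Score")
ax.set_ylabel("TCAP Scale Score")
fig.savefig("rit_vs_scale.png")
```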

For the other data, since ED and SWD are binary (0 or 1), and since grade levels fall on a 3-8 scale, I chose box plots to visualize them. A 0 means the student is not Economically Disadvantaged or a Student with Disabilities.

Economically Disadvantaged BoxPlot

Students with Disabilities BoxPlot

Grade-Level BoxPlot
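One of these box plots (ED, for example) can be sketched with matplotlib; the scores below are made up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"ED": [0, 0, 1, 1],
                   "ScaleScore": [340, 350, 310, 320]})

# One box per group: 0 = not economically disadvantaged, 1 = ED
groups = [g["ScaleScore"].values for _, g in df.groupby("ED")]
fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticklabels(["Not ED", "ED"])
ax.set_ylabel("TCAP Scale Score")
fig.savefig("ed_boxplot.png")
```

The SWD and grade-level plots follow the same pattern with a different grouping column.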

I also created a heat map that shows how the correlations among these variables compare.

Heat Map
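A correlation heat map is just a plot of the full correlation matrix; here is a sketch using matplotlib's `imshow` (seaborn's `heatmap` would work similarly), with toy data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"ScaleScore": [300, 320, 340, 360],
                   "TestRITScore": [200, 210, 225, 235],
                   "n_days_present": [150, 160, 170, 155]})
corr = df.corr()

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, ax=ax)
fig.savefig("corr_heatmap.png", bbox_inches="tight")
```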

Data Training

To prepare the data for machine learning, I needed to split it into two sets: a training set and a testing set. With 63,655 rows of data, the training set was composed of 20% (12,731 rows) and the testing set of the remaining 80% (50,924 rows). I plan to experiment with different splits to see how they affect the results of the machine learning model.
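With scikit-learn the split is a single call; `train_size=0.2` reproduces the 20% train / 80% test split described above (toy arrays stand in for the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy rows with 2 features each stand in for the 63,655 real rows
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# train_size=0.2 gives a 20% training set and an 80% testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, random_state=42)
```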

During the process of splitting the data into training and testing sets, I needed to ensure that there were no NaN values present in either set. It took me a while to troubleshoot the issue because a particular variable kept showing NaN and Infinite values, and I couldn't locate them for some reason. Eventually, I realized that I had forgotten to remove the NaN values from the x_train and x_pred variables, even though they were not present in other sets. This was a valuable lesson for me to learn in terms of checking all relevant variables for data inconsistencies.
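The fix amounted to applying the same NaN/infinity cleanup to every array, including x_train and x_pred. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Toy training features containing the kinds of bad values I kept hitting
X_train = pd.DataFrame({"TestRITScore": [210.0, np.nan, 225.0],
                        "n_days_present": [170.0, 165.0, np.inf]})

# Convert infinities to NaN, then drop any row containing a NaN;
# this is the step I had forgotten to apply to x_train and x_pred
X_train = X_train.replace([np.inf, -np.inf], np.nan).dropna()
```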

I ended up running a multiple regression, and here is the data it returned:

Training/Testing split: 20%/80%
Coefficients: [0.26169216, -7.37082399, 1.17469616, -2.25918833, -4.18241256]
Intercept: 70.8358039289374
R-Squared: 0.39895715901860773
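The regression itself is a scikit-learn `LinearRegression` fit. Here is a minimal sketch on synthetic data; the "true" coefficients below are made up to mimic the reported ones, not taken from the real dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic predictors in the order n_days_present, EnrolledGrade,
# TestRITScore, ED, SWD, with noise around a known linear relationship
X = rng.random((100, 5))
true_coefs = np.array([0.26, -7.37, 1.17, -2.26, -4.18])
y = X @ true_coefs + 70.8 + rng.normal(0, 0.5, size=100)

# Fit the multiple regression and report its weights, intercept, and R^2
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_, model.score(X, y))
```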

The coefficients represent the weights assigned to each variable in the multiple regression model that was trained on the 20% data set. In other words, each coefficient indicates how much the predicted target variable (TCAP Scale Score) changes for a one-unit change in that predictor; because the predictors are on different scales, the raw coefficients aren't directly comparable as measures of importance.

In this case, the coefficients are for n_days_present, EnrolledGrade, TestRITScore, ED, and SWD in that order. A positive coefficient indicates that the variable has a positive effect on the target variable, while a negative coefficient indicates that the variable has a negative effect on the target variable.

For example, the coefficient for n_days_present is 0.2617, which means that for each additional day a student is present in school, their predicted TCAP Scale Score increases by 0.2617 points, holding the other variables constant. Similarly, the coefficient for EnrolledGrade is -7.3708, which means that each one-grade increase in a student's enrolled grade decreases their predicted TCAP Scale Score by 7.3708 points.

The intercept value of 70.8358 represents the predicted TCAP Scale Score when all of the other variables in the model are equal to zero (a hypothetical baseline rather than a realistic student).
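Plugging the coefficients and intercept into the regression equation reproduces a prediction by hand; the student profile below is hypothetical:

```python
# Fitted values reported above, in the order:
# n_days_present, EnrolledGrade, TestRITScore, ED, SWD
coefs = [0.26169216, -7.37082399, 1.17469616, -2.25918833, -4.18241256]
intercept = 70.8358039289374

# Hypothetical 5th grader: 170 days present, RIT score 215, ED, not SWD
student = [170, 5, 215, 1, 0]

predicted = intercept + sum(c * x for c, x in zip(coefs, student))
print(round(predicted, 1))  # -> 328.8
```

This is exactly the arithmetic the website's form performs with user-supplied inputs.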

The R-squared value of 0.3990 indicates that the model explains 39.9% of the variance in the target variable, which means that the model is moderately accurate in predicting TCAP Scale Scores based on the selected variables.

Website Encoding

Although I had prior experience with HTML dating back to the late 1990s, I struggled to implement my project on my website using Squarespace due to my limited coding skills. Consequently, I sought assistance from ChatGPT to create a JavaScript feature that could take input from a form and use the coefficients to generate a predicted score. The feature turned out to be a great addition to my website, and it worked seamlessly.

Reflection

As I reflect upon the completion of this project, I have gained valuable insights in a relatively short span of time about utilizing Python for data analysis and visualization. This experience has been truly enjoyable and has revealed an intriguing parallel that I had never considered before. Just as English is my native language, Microsoft Excel is my primary data language. I tend to visualize all data in Excel sheets in the same way that I perceive foreign languages through the framework of English. When learning Spanish, I contemplate its relation to French and subsequently to English. Even while studying French, despite being proficient enough to think in the language, I occasionally revert to my native tongue during the thought process. This phenomenon is identical to my experience with Excel, which is why I opted to modify my CSV files in Excel prior to working with them in Python. If I had learned Python first, I might have preferred it for handling data. This unexpected mental revelation has left me wondering when I will begin to perceive the world through Python. Experiencing my first Python dream last night was a noteworthy milestone, as dreaming in a foreign language is often an indicator of the formation of appropriate neural pathways in the brain.

I have thoroughly enjoyed this project and eagerly anticipate creating more with an expanded range of variables. This approach can provide substantial insights into the mathematical dynamics of a student cohort, and incorporating additional years and data types will enable us to further train and test the model, ultimately achieving a greater degree of certainty in our predictions. The moderate correlation (r = 0.52) between RIT and TCAP is already promising, indicating the reliability of RIT as a benchmark exam. I am enthusiastic about broadening the scope of this project over time and discovering new possibilities in the realm of school data analysis.