Web-based Clustering Application for Determining and Understanding Student Engagement Levels in Virtual Learning Environments

ABSTRACT

During the peak of the COVID-19 pandemic, many non-essential professionals were required to work from home.4 Consequently, higher educational institutions had to transition from traditional in-person learning to online learning.5 This transition required students to rely more heavily on Virtual Learning Environments (VLEs) than ever before.6 UNESCO estimated that approximately 1.5 billion students worldwide were unable to attend universities or schools during this period, with over 91% of students affected by nationwide closures.7 Consequently, both educators and students had to become more accustomed to using VLEs and other applications such as Microsoft Teams, Zoom, and Google Meet.8 A VLE is a web-based platform designed for educational purposes.9 VLEs enable educators to disseminate learning materials, conduct student surveys and assessments, create collaborative glossaries, and manage grades, among other functions.10 In addition to these educational services, VLEs collect and store the online behavior data of each user, including both educators and students, resulting in extensive and diverse datasets.11 This opens opportunities for analyzing and interpreting student behavior patterns within VLEs.12 One significant challenge associated with VLEs is the characterization of new student behavior patterns that emerge due to changes in instructional methods (IM).13 Prior research has employed Educational Data Mining (EDM) techniques, specifically unsupervised learning algorithms, to characterize student behavior patterns within VLEs.14 However, there have been limited efforts to apply unsupervised learning algorithms in web-based applications to characterize student learning behavior amidst evolving instructional methods. In this context, unsupervised learning algorithms refer to the use of clustering methods for knowledge discovery.15
Clustering methods are a collection of techniques that group observations so that members of the same group are similar to one another and differ from observations in other groups.16 EDM, in turn, emerged in response to the need to analyze large and diverse datasets derived from educational systems.17 The use of clustering methods within EDM to characterize student learning behavior has proven effective in discerning student engagement levels, providing personalized interventions for students, and offering insights to enhance the efficacy of IM within VLEs.18 Consequently, these methods are increasingly being adopted in studies aiming to characterize student behavioral patterns, specifically in terms of engagement levels, within VLEs.
In summary, IM within Higher Educational Institutions are not static but dynamic, reflecting the varying levels of student engagement within VLEs.19 Recognizing this dynamism in IM and student engagement on VLEs underscores the need for solutions capable of capturing it. Therefore, this research aims to develop a web-based application that employs cluster analysis to capture the fluctuations in student engagement levels based on individual student behavior in VLEs. The objectives of this study align with this aim and include:
1. Identifying a clustering method for use in a web-based application, determined through considerations of clustering time, the Silhouette coefficient, and the Calinski-Harabasz and Davies-Bouldin indexes.
2. Determining and understanding student engagement levels in a virtual learning environment using the most effective clustering method.
3. Recognizing IM and student characteristics associated with the identified student engagement levels.
4. Developing a clustering web-based application that adapts to the dynamic nature of virtual learning environment data.

LITERATURE REVIEW

Educational Data Mining in Web-based Educational Platforms
Web-based educational platforms were initially created for online learning, but presently, many colleges and universities are integrating them as a supplementary tool for in-person instruction.20 These platforms are swiftly becoming integrated into higher education to improve student learning in diverse formats, including E-learning, VLEs, Massive Open Online Courses (MOOCs), and Learning Management Systems (LMS).21 Learning Management Systems, in particular, prioritize the development of VLEs for educational purposes, so web-based platforms like Moodle, Blackboard, and Canvas can be regarded as VLEs as well.22 Numerous factors favor the adoption of VLEs for educational purposes. VLEs offer flexibility concerning both time and space, promote resource reusability, and facilitate enhanced interaction between educators and students.23 Additionally, VLE platforms enable functions like content management, curriculum mapping and planning, learner engagement and administration, communication, and collaboration, as well as real-time interaction between educators and students.24 Alongside this array of services, VLE platforms also accumulate extensive and diverse data, including system logs documenting student activities within the platform (such as browsing time, login times, and click counts).25 This data collection extends to personal information like user profiles and academic performance.26 The escalating volume of data generated by VLEs creates a need to extract valuable insights from this vast dataset.27 EDM represents a knowledge discovery approach designed to transform raw data from VLEs into valuable insights. Its primary objective is to assist Higher Educational Institutions (HEIs) in resource management enhancement, optimization of learning processes, and the refinement of instructional methods, including monitoring, evaluation, and personalization of teaching procedures.28
EDM places a strong emphasis on developing data mining algorithms that can delve into educational data, uncover hidden patterns, and use these discoveries to make predictions and informed decisions within educational settings.29 Applications of EDM encompass activities such as data clustering in education, the creation of e-learning systems, and predicting student dropouts and performance within VLEs. Among these applications, clustering stands out as the most widely employed.30 EDM assumes a critical role within HEIs, evolving in response to the need to extract value from the data generated on web-based educational platforms.

The Use of Clustering for Student Engagement Levels
Student learning behaviors within VLEs are primarily characterized by indicators linked to actions, specifically, information regarding students' activities in VLEs.31 These action-based indicators are often presented in summarized formats, encompassing metrics such as clicks per session, the number of file downloads, session duration, login frequencies, artifact production quantities, and time allocated to specific tasks.32 One can determine student engagement levels by examining these action-based indicators of interest.33 Among these indicators, a frequently employed measure for gauging student engagement levels in VLEs is clicks per session or the aggregated sum of clicks.34 The task of identifying student engagement levels within a VLE is commonly treated as an unsupervised machine learning task.35 In unsupervised machine learning, the data lacks labels and contains only input information.36 Due to remarkable shifts in student learning patterns or the expansion of educational data, the composition of data within VLEs changes.37 Typically, it becomes challenging to determine the precise level of engagement within VLEs.38 As the COVID-19 pandemic showed, established standards for characterizing engagement levels often lose their relevance. Consequently, the primary data available for constructing models to assess student engagement levels is the input data, which comprises action-based indicators.39 In such scenarios, the most suitable approach involves the application of clustering methods.40 These methods excel in situations where only input data is accessible, as they aim to identify natural groupings through similarity metrics.41

METHODOLOGY
To determine and understand student engagement levels and create a web-based clustering application, the Knowledge Discovery in Databases (KDD) methodology is adopted. The methodology involves the following steps: (a) data selection and understanding; (b) data pre-processing and transformation; (c) modelling; (d) evaluation; and (e) web-based application development. KDD is a widely employed methodology.42

Data Selection and Understanding
This study relies on the Open University Learning Analytics Dataset (OULAD), which was developed by Kuzilek, Hlosta, and Zdrahal to facilitate research in EDM.43 What sets this dataset apart from other educational datasets is its inclusion of demographic information combined with aggregated clickstream data detailing student interactions within the VLE. This combination enables the analysis of student behavior as reflected in their actions within the VLE. The dataset encompasses a total of 22 module presentations and covers 32,593 students for the years 2013 and 2014. Kuzilek, Hlosta, and Zdrahal constructed this dataset in compliance with the ethical and privacy guidelines of the Open University.44 They anonymized the data to eliminate any personally identifiable information concerning the students. The structure and tables of the OULAD dataset are depicted in Figure 1. The dataset is categorized into three distinct groups: student demographics, performance, and learning behavior, further subdivided into seven tables (Figure 1, on the right). As illustrated in Figure 1 (on the left), the dataset is centered around students, making it well-suited for this study, as part of the research objective is to determine and understand student engagement levels. To assess student engagement levels, the focal metric is drawn from Table 1, specifically, the cumulative clicks recorded in the student VLE table. This metric represents the frequency of a student's interactions with the instructional materials presented within the VLE. Table 1 offers an overview of the dataset, including the associated quantity of observations.

Source: Open University Learning Analytics dataset
Given that this research is structured as an unsupervised machine learning task, the training dataset will be denoted as $X = \{x_i\}_{i=1}^{N}$, where $x$ represents the sum of clicks (input data) extracted from the student VLE table, $N$ signifies the count of training instances, and $X$ constitutes the training set. Within $X$, each observation indicates how frequently a student interacted with VLE materials (such as quizzes, forums, URLs, etc.). In the context of this study, the student engagement level(s) corresponds to cluster(s). The training data is presented in Table 2.

Source: Open University Learning Analytics dataset
Data Pre-processing and Transformation
Data pre-processing and transformation represent crucial initial phases conducted prior to data modeling.46 These procedures yield clean data, potentially improving the performance of machine learning models.47 The action-based indicator of interest (sum of clicks) underwent the following data pre-processing and transformation steps. Initially, missing values and outliers were identified and addressed. Subsequently, the data underwent standardization to ensure that all observations were on a consistent scale. Standardization is mathematically defined as follows:

$$ z = \frac{x - \bar{x}}{s} $$

Here, $x$ denotes the observations of the indicator, while $\bar{x}$ and $s$ represent the sample mean and standard deviation, respectively.
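As a small illustration of the standardization step, the following Python sketch computes z-scores using only the standard library. The click values are invented for the example and are not taken from OULAD, and the function name `standardize` is hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (x - sample mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if s == 0:
        raise ValueError("standard deviation is zero; cannot standardize")
    return [(x - m) / s for x in values]


# Hypothetical sum-of-clicks values for five students (illustrative only).
clicks = [120, 300, 450, 80, 1050]
z_scores = standardize(clicks)
print([round(z, 3) for z in z_scores])
```

After this transformation the values have a sample mean of zero and a sample standard deviation of one, which is the consistent scale the text refers to.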

Modelling
Two distinct clustering methods, namely the GMM and K-means, were chosen for the clustering of the preprocessed and transformed input data $X$ (sum of clicks). Firstly, K-means, a conventional clustering method often employed in EDM, was included due to its simplicity in visualization and interpretation.48 Its widespread use in the EDM community makes it a useful baseline for comparison and a straightforward means to characterize clusters. In contrast, the GMM was introduced as it represents a probabilistic model belonging to the soft clustering approach, which is the counterpart of hard clustering, to which K-means belongs.49 In hard clustering, each data point belongs to exactly one cluster, while in soft clustering, as with GMM, data points can belong to multiple clusters with associated probabilities. By incorporating probabilistic modelling, GMM offers a more nuanced understanding of complex data structures, which can be particularly advantageous when dealing with intricate and overlapping patterns.
To determine the optimal number of clusters, which directly translates to the number of engagement levels in this study, well-established methods were relied upon. The application of the elbow method for K-means aligns with recommendations from previous research.50 Additionally, the Bayesian Information Criterion (BIC) is a robust measure for determining the optimal number of components within GMMs.51

Elbow Method
The Elbow method is employed to identify the optimal number of clusters in K-means through data visualization.52 It involves plotting the distortion value against the number of clusters and locating the point at which the decline in distortion slows sharply, resembling an elbow or bend. This point serves as an indicator for determining the appropriate number of clusters.53 The distortion value is mathematically defined as follows:

$$ J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2 $$

Here, $x_i$ represents a data point belonging to the cluster $C_k$, and $\mu_k$ is the mean value of the data points assigned to the cluster $C_k$.

Bayesian Information Criterion (BIC)
The Bayesian Information Criterion is an analytical method used to assess the goodness-of-fit of statistical models when compared to each other, given a specific dataset.54 It also quantifies the model's ability to generalize and accurately represent future data generated by the same process that produced the current data.55 Generally, models with lower BIC scores are preferred. The BIC score is calculated as follows:

$$ \mathrm{BIC} = p \ln(n) - 2 \ln(\hat{L}) $$

Here, $n$ represents the number of data points, $p$ signifies the number of parameters, and $\hat{L}$ is the maximum likelihood of the model.
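To make the BIC formula concrete, the sketch below evaluates it for the simplest possible case: a single one-dimensional Gaussian fitted by maximum likelihood, which has two parameters (mean and variance). The data values and the function names are hypothetical, and a real GMM with several components would need its own fitted log-likelihood and parameter count.

```python
import math


def gaussian_log_likelihood(data):
    """Maximized log-likelihood of one 1-D Gaussian fitted by maximum likelihood."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n  # MLE variance divides by n
    # At the MLE, the log-likelihood simplifies to -n/2 * (ln(2*pi*var) + 1).
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def bic(n_points, n_params, log_likelihood):
    """BIC = p * ln(n) - 2 * ln(L_hat); lower values are preferred."""
    return n_params * math.log(n_points) - 2 * log_likelihood


# Hypothetical standardized click values (illustrative only).
data = [-1.2, -0.4, 0.1, 0.3, 1.2]
log_lik = gaussian_log_likelihood(data)
score = bic(len(data), 2, log_lik)
print(round(score, 3))
```

The first term penalizes additional parameters, so a model with more components is only preferred when its log-likelihood improves enough to offset that penalty.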

K-means Clustering
K-means is a clustering method that groups data points based on their proximity to a central point known as the cluster centroid.56 In this study, K-means clustering is applied to the observations in $X$ to create distinct groupings referred to as student engagement levels (clusters). These clusters represent student VLE data with similar characteristics. Given that $x$ is a numerical variable, each cluster is characterized by a centroid, which is the mean of the sum of clicks within that cluster. To measure the similarity between observations, the squared Euclidean distance is employed, defined as follows:

$$ d(x, y) = \lVert x - y \rVert^2 = \sum_{j=1}^{D} (x_j - y_j)^2 $$

In this equation, $x$ and $y$ represent observations within the D-dimensional training dataset. Determining the student engagement levels involved minimizing the Sum of Squared Errors (SSE), which comprises the squared error between each observation and its nearest centroid. The SSE is expressed as follows:

$$ \mathrm{SSE} = \sum_{i=1}^{N} \sum_{k=1}^{K} w_{i,k} \lVert x_i - \mu_k \rVert^2 $$

Here, $\mu_k$ represents the centroid of the $k$-th student engagement level, and $w_{i,k} = 0$ if an observation $x_i$ does not belong to the student engagement level $k$, while $w_{i,k} = 1$ if $x_i$ is part of the student engagement level $k$.
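The following Python sketch shows one way K-means could be run on a one-dimensional indicator such as the sum of clicks, and how the SSE behaves as the number of levels grows. It is a minimal illustration of Lloyd's algorithm, not the implementation used in the study: the data are invented, the function name `kmeans_1d` is hypothetical, and the deterministic initialization from evenly spaced order statistics is an assumption made to keep the example reproducible.

```python
def kmeans_1d(data, k, n_iter=100):
    """Lloyd's algorithm for 1-D data; returns (centroids, labels, sse)."""
    ordered = sorted(data)
    # Assumed initialization: evenly spaced order statistics of the data.
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = [0] * len(data)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: (x - centroids[j]) ** 2) for x in data
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [x for x, lab in zip(data, new_labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    sse = sum((x - centroids[lab]) ** 2 for x, lab in zip(data, labels))
    return centroids, labels, sse


# Hypothetical sum-of-clicks values with three visible groups (illustrative only).
clicks = [10, 12, 15, 200, 210, 220, 1000, 1010, 1050]

# SSE for K = 1..4; plotting these values against K gives the elbow curve.
elbow_curve = [(k, kmeans_1d(clicks, k)[2]) for k in range(1, 5)]
for k, sse in elbow_curve:
    print(k, round(sse, 1))
```

On data like this, the SSE falls steeply up to the number of natural groups and only marginally afterwards, which is the bend that the Elbow method looks for.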

Gaussian Mixture Model
The Gaussian Mixture Model (GMM) is a clustering method that assigns data points to clusters in a probabilistic manner, with each cluster being characterized by a distinct Gaussian distribution.57 This Gaussian distribution is mathematically defined as:

$$ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} \lvert \Sigma \rvert^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right) $$

In this equation, $\lvert \Sigma \rvert$ represents the determinant of the covariance matrix $\Sigma$, $x$ is a D-dimensional vector, $\mu$ is the mean, and the shape of the Gaussian is determined by $\Sigma$, which is a $D \times D$ covariance matrix. Since each student engagement level is modeled as a Gaussian distribution (GD), the GMM can be represented as a linear combination of these Gaussian probability distributions, defined as:

$$ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) $$

Here, $\pi_k$ denotes the mixing coefficient, which gives the weight of each Gaussian student engagement level in the mixture, and $K$ represents the number of student engagement levels in the mixture model (MM). The student engagement level within the MM is characterized by $\mathcal{N}(x \mid \mu_k, \Sigma_k)$, which is the Gaussian density. Ultimately, each student engagement level $k$ is expressed as a GD with covariance $\Sigma_k$ and mean $\mu_k$, with $\pi_k$ serving as the mixing coefficient.

55-57. Patel and Kushwaha, "Clustering Cloud Workloads: K-Means vs Gaussian Mixture Model."

Evaluation
The evaluation of the two clustering methods, namely GMM and K-means, was based on their execution time and three cluster evaluation metrics: the Calinski-Harabasz index, the Davies-Bouldin index, and the Silhouette coefficient. These cluster metrics gauge the degree of separation between clusters and the variation within each cluster,58 while the execution time quantifies the duration it takes for the clustering method to fit the training data.

Calinski-Harabasz Index
The Calinski-Harabasz index score is the ratio of between-cluster dispersion to within-cluster dispersion, summed over all clusters. A higher score indicates a model with well-defined clusters.59 Mathematically, the Calinski-Harabasz index score is expressed as:

$$ s = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{n_E - k}{k - 1} $$

Here, $E$ represents a dataset of size $n_E$ grouped into $k$ clusters. The parameters $\mathrm{tr}(W_k)$ and $\mathrm{tr}(B_k)$ denote the trace of the within-cluster and between-cluster dispersion matrices, respectively. These matrices are defined as follows:

$$ W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^{\top}, \qquad B_k = \sum_{q=1}^{k} n_q (c_q - c_E)(c_q - c_E)^{\top} $$

Where $n_q$ is the number of points in cluster $q$, $c_q$ represents the center of cluster $q$, $c_E$ is the center of $E$, and $C_q$ is the set of points in cluster $q$.
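For a one-dimensional indicator the dispersion matrices reduce to scalars, so the index can be computed directly. The sketch below is a hedged illustration on invented data; the function name `calinski_harabasz_1d` is hypothetical and the code is not the evaluation routine used in the study.

```python
def calinski_harabasz_1d(data, labels):
    """Calinski-Harabasz score for 1-D data and one cluster label per point."""
    n = len(data)
    clusters = {}
    for x, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(x)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")
    overall_center = sum(data) / n
    within = 0.0   # tr(W_k): squared deviations from each cluster center
    between = 0.0  # tr(B_k): size-weighted squared deviations of the centers
    for members in clusters.values():
        center = sum(members) / len(members)
        within += sum((x - center) ** 2 for x in members)
        between += len(members) * (center - overall_center) ** 2
    if within == 0:
        raise ValueError("within-cluster dispersion is zero; score is undefined")
    return (between / within) * (n - k) / (k - 1)


# Hypothetical values forming two tight groups (illustrative only).
values = [1, 2, 3, 11, 12, 13]
good = calinski_harabasz_1d(values, [0, 0, 0, 1, 1, 1])
poor = calinski_harabasz_1d(values, [0, 1, 0, 1, 0, 1])
print(good, round(poor, 4))
```

The labelling that follows the two natural groups scores far higher than the interleaved one, matching the reading that larger values indicate better-defined clusters.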

Davies-Bouldin Index
The Davies-Bouldin index metric calculates the average similarity between each cluster $C_i$ for $i = 1, \ldots, k$ and its most similar one $C_j$. Values closer to zero indicate better partitioning of clusters [24]. It is defined as:

$$ DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} R_{ij} $$

Where $R_{ij}$ is a similarity measure that considers the trade-off between $s_i$ and $d_{ij}$, calculated as:

$$ R_{ij} = \frac{s_i + s_j}{d_{ij}} $$

Here, $d_{ij}$ represents the distance between cluster centroids $i$ and $j$, and $s_i$ is the average distance between the centroid of cluster $i$ and each point in cluster $i$.
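A minimal Python sketch of the Davies-Bouldin computation for one-dimensional data is given below, using absolute differences as the distance. The data are invented, the function name `davies_bouldin_1d` is hypothetical, and the code assumes that no two clusters share exactly the same centroid.

```python
def davies_bouldin_1d(data, labels):
    """Davies-Bouldin index for 1-D data and one cluster label per point."""
    clusters = {}
    for x, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        raise ValueError("need at least 2 clusters")
    centers = {lab: sum(m) / len(m) for lab, m in clusters.items()}
    # s_i: average distance between the centroid of cluster i and its points.
    spreads = {
        lab: sum(abs(x - centers[lab]) for x in m) / len(m)
        for lab, m in clusters.items()
    }
    total = 0.0
    for i in clusters:
        # R_ij = (s_i + s_j) / d_ij, maximized over the other clusters j.
        total += max(
            (spreads[i] + spreads[j]) / abs(centers[i] - centers[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Hypothetical values forming two tight groups (illustrative only).
values = [1, 2, 3, 11, 12, 13]
good = davies_bouldin_1d(values, [0, 0, 0, 1, 1, 1])
poor = davies_bouldin_1d(values, [0, 1, 0, 1, 0, 1])
print(round(good, 4), round(poor, 4))
```

Compact, well-separated clusters give a small numerator and a large denominator in each ratio, which is why values closer to zero indicate a better partition.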

Silhouette Coefficient
The silhouette coefficient compares, for each data point, the average distance to points in its own cluster with the average distance to points in the nearest neighboring cluster, in order to assess the density and separation of clusters. It ranges from -1 to 1, with values closer to 1 indicating appropriate cluster configuration.60 The formula for the Silhouette coefficient is:

$$ s = \frac{b - a}{\max(a, b)} $$

Where $b$ is the average distance between a data point and all other data points in the nearest neighboring cluster, and $a$ is the average distance between a data point and all other data points in the same cluster.
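The sketch below computes the mean silhouette coefficient for one-dimensional data with the standard library only. It is an illustration on invented values; the function name `silhouette_1d` is hypothetical, and assigning a score of zero to points in single-member clusters is a common convention assumed here rather than something stated in the text.

```python
def silhouette_1d(data, labels):
    """Mean silhouette coefficient for 1-D data and one cluster label per point."""
    clusters = {}
    for x, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(x)
    if len(clusters) < 2:
        raise ValueError("need at least 2 clusters")
    scores = []
    for x, lab in zip(data, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other points in the same cluster
        # (the point's zero distance to itself is excluded by dividing by len - 1).
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(abs(x - y) for y in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical values forming two tight groups (illustrative only).
values = [1, 2, 3, 11, 12, 13]
score = silhouette_1d(values, [0, 0, 0, 1, 1, 1])
print(round(score, 4))
```

Because every point here is much closer to its own group than to the other one, each individual score, and therefore the mean, lies near the upper end of the range.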

Web-based Application Development
Following the identification of a clustering method with favorable results during the evaluation phase, a web-based application was developed using the Shiny framework. Shiny is an open-source package for the R programming language, offering a robust web framework for creating interactive web applications.61 A Shiny application is composed of two fundamental components: a user interface (UI) and a server. The server acts as the backend of the application, housing a set of instructions for executing tasks such as data processing, cluster model creation, and data visualization. The UI, on the other hand, serves as the front end, encompassing instructions for presenting results to users within a web browser.62 The server segment of the Shiny application was programmed to provide the following interactive functionalities:
• Offer a step-by-step user guide tutorial on clustering VLE data and comprehending cluster analysis.
• Present a user information modal on application tabs.
• Enable users to filter VLE data and select the desired number of clusters (engagement levels).
• Generate visualizations and tabular outputs regarding student engagement levels based on user selections in cluster analysis.
• Generate interactive visualizations that offer insights into IM and student characteristics associated with each student engagement level.
• Allow users to produce a downloadable report delivering insights into student engagement levels derived from cluster analysis.
• Permit user(s) to update cluster analysis with new inputs.
The UI aspect of the Shiny application was designed to display results and insights generated from the cluster analysis to the user(s).

FINDINGS AND DISCUSSION

Data Pre-processing and Transformation
The sum of clicks, extracted from the student VLE table, was employed as an action-based indicator to depict student engagement. Descriptive details regarding the sum of clicks are presented in Tables 3 and 4, both before and after aggregation. To prevent student redundancy and alleviate computational load, the sum of clicks per student since the beginning of the semester was aggregated (summed).
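The aggregation described above amounts to a group-by-and-sum over the click log. The Python sketch below illustrates the idea on a handful of invented rows; the field order (student identifier, day, clicks) is modeled on the `id_student`, `date`, and `sum_click` columns of the public OULAD student VLE table, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (id_student, date, sum_click). Values are illustrative only.
rows = [
    (11391, -5, 4),
    (11391, 0, 12),
    (28400, 0, 3),
    (11391, 3, 7),
    (28400, 10, 9),
]


def total_clicks_per_student(records):
    """Collapse the click log to one total per student."""
    totals = defaultdict(int)
    for student_id, _day, clicks in records:
        totals[student_id] += clicks
    return dict(totals)


totals = total_clicks_per_student(rows)
print(totals)
```

After this step each student appears exactly once, which removes the redundancy mentioned in the text and shrinks the input passed to the clustering methods.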

Student Engagement Clustering
Before determining the distinct engagement levels, it was essential to establish the optimal number of engagement levels for K-means and GMM. The optimal number of engagement levels for K-means and GMM is depicted in Figure 2. In Figure 2 (left), the graph reveals a significant reduction in distortion until the number of engagement levels reaches 7, after which the distortion stabilizes. Consequently, the optimal number of engagement levels for K-means is determined to be 7. Figure 2 (right) shows that the optimal number of engagement levels for GMM is 25, as the lowest BIC score is associated with 25 components. K-means and GMM were each modeled with both 7 and 25 engagement levels, facilitating a fair comparison of engagement level means. These means signify the average number of times a student engages with the VLE, indicating that students within the same engagement level exhibit similar VLE engagement patterns.
The engagement level means, as presented in Table 5, have been arranged in ascending order to enhance interpretability. Consequently, lower engagement levels correspond to reduced engagement, while higher levels signify heightened engagement. The engagement level means derived from K-means for 7 and 25 levels fall within the ranges of [299, 16,330] and [79, 26,602], respectively. For GMM, these values span [276, 13,153] and [49, 26,570] for 7 and 25 levels. Notably, these ranges align closely with the minimum and maximum values of the sum of clicks (Table 4) when modeling more levels for both methods. It is worth noting that GMM, with its optimal number of engagement levels substantially higher than that of K-means, captures a broader spectrum of engagement patterns, including both minimal and maximal engagement. Consequently, GMM offers the potential for more comprehensive groupings of VLE engagement.

Student Engagement Clustering Evaluation
The K-means and GMM methods were compared using execution time and three clustering metrics. The results for both clustering methods are presented in Table 6. Notably, the GMM method demonstrated superior performance in terms of execution time for the OULAD dataset. This suggests that GMM is more efficient than K-means at categorizing 26,074 students into 7 and 25 engagement levels. However, although the execution time of K-means is slightly longer, it excels in creating well-separated engagement levels, as evident from its higher Calinski-Harabasz index scores. When modeling 7 engagement levels, K-means proves to be particularly effective at generating well-defined engagement levels, with a slightly higher Silhouette coefficient and a lower Davies-Bouldin index score. This trend does not hold for 25 engagement levels, where K-means and GMM exhibit similar Silhouette coefficients and Davies-Bouldin index scores. As demonstrated in Table 5, a higher number of engagement levels captures the full range of VLE engagement, further emphasizing the importance of a method capable of efficiently establishing well-defined and separated engagement levels. Considering the lack of a clear winner between K-means and GMM on these objective measures, the GMM method was ultimately chosen for implementation in the web-based application. This choice was driven by several factors, including its probabilistic nature and its capacity to provide comprehensive groupings of VLE engagements. The probabilistic nature of GMMs allows for modeling the uncertainty associated with the number of engagement levels selected and the probability of a student belonging to each engagement level.63 This capability addresses the complexities arising from students potentially belonging to multiple engagement levels.

Web-based Clustering Application
A web-based clustering application using the Gaussian Mixture Model (GMM) was developed through the R Shiny framework. The GMM clustering application is accessible online via the following link: https://dsanalytics.shinyapps.io/Student-Segmentation/. The application comprises three distinct tabs: (1) the GMM Data Analysis tab, (2) the IM tab, and (3) the Student Characteristics tab. Additionally, an informative modal feature has been incorporated, offering users detailed insights into each tab's functionality, as visually depicted in Figure 3.

GMM Data Analysis Tab
Upon initial loading, the application defaults to the GMM Data Analysis tab, where users can engage in cluster analysis of student VLE data using the GMM method. This tab is equipped with a user guide feature designed to offer step-by-step instructions on performing GMM cluster analysis and interpreting the resultant outcomes. This instructional element has been integrated to assist users who may not possess technical expertise.
As illustrated in Figure 4, the query box prompts users to input specific parameters, including the academic year and semester, a range for the number of days from the start to the end of a course, and the desired number of engagement levels. Various combinations of these inputs allow the GMM method to reveal diverse engagement patterns, so that the dynamic nature of student VLE data can be captured effectively. This is particularly valuable as it enables continuous querying of new VLE data, facilitating the extraction of fresh insights as they become available.

Figure 5. GMM data analysis tab
To visually represent student engagement data, two bar charts are included, illustrating the average interaction frequency of students within each engagement level with the VLE (left bar chart) and the distribution of students across the different engagement levels (right bar chart). Lastly, a table is presented, featuring columns that display: (1) the probability of a student's affiliation with each engagement level, (2) the specific engagement level to which a student belongs, (3) the percentage of uncertainty regarding a student's engagement level assignment, and (4) the count of VLE interactions undertaken by a student during the selected academic period.
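The table columns described above can be derived from a fitted mixture by applying Bayes' rule to each student's click total. The Python sketch below illustrates this for a one-dimensional mixture with hand-picked parameters; the weights, means, and standard deviations are invented rather than fitted to OULAD, the function names are hypothetical, and defining uncertainty as one minus the largest membership probability is an assumption, since the text does not state how the application computes it.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def membership(x, weights, means, stds):
    """Return (probabilities per level, assigned level, uncertainty) for one value."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)  # mixture density p(x); assumed to be non-zero
    probs = [j / total for j in joint]
    best = max(range(len(probs)), key=probs.__getitem__)
    # Levels are numbered from 1; uncertainty = 1 - max probability (assumed).
    return probs, best + 1, 1.0 - probs[best]


# Hypothetical three-level mixture over the sum of clicks (illustrative only).
weights = [0.5, 0.3, 0.2]
means = [200.0, 1500.0, 6000.0]
stds = [150.0, 600.0, 2500.0]

for clicks in (250, 900, 7000):
    probs, level, uncertainty = membership(clicks, weights, means, stds)
    print(clicks, level, [round(p, 3) for p in probs], round(100 * uncertainty, 1))
```

A student whose click total lies between two component means receives split probabilities and a higher uncertainty, which is the kind of overlap that hard assignments from K-means cannot express.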

Instructional Methods Tab
The charts in Figure 6 are presented to visually represent the activities and modules that students within each engagement level predominantly engage with, as well as those they engage with less frequently on the VLE. This analysis aids in the identification of the most frequently accessed modules and activities within the VLE.

Student Characteristics Tab
Additional exploration can be conducted based on the profiling of engagement levels, examining various dimensions of student data such as (1) gender, (2) age group, (3) disability status, (4) previous course attempts, (5) final academic performance, and (6) geographic region of residence. This extended analysis aims to provide deeper insights into the characteristics of students within each engagement level. Figure 7 illustrates the distribution of engagement levels across these different student attributes using bar charts.

RECOMMENDATIONS
This research emphasizes the utilization of the Gaussian Mixture Model (GMM) as the preferred method for integration into web-based applications designed to analyze student engagement levels within VLEs. The choice of GMM is substantiated by its probabilistic nature, efficiency in clustering time, and its capacity to offer comprehensive characterizations of student engagement.
A key recommendation arising from this study is the grouping of student engagement levels into three primary categories: low-engaged (level 1), mid-engaged (level $K/2$), and high-engaged (level $K$), where $K$ is the number of engagement levels. This categorization simplifies the process of distinguishing between students with varying degrees of engagement in VLEs, providing a practical framework for educators and institutions.
In addition to determining engagement levels, it is advised to expand the analysis by incorporating various facets of student information. This may include the types of VLE activities and modules accessed and demographic factors such as gender, age group, disability status, historical course attempts, final academic performance, and geographic region of residence. Examining the distribution of engagement levels across these dimensions can yield valuable insights into instructional strategies and student characteristics.
Given the dynamic nature of VLE data, it is suggested that web-based clustering applications incorporate query boxes. This feature enables academic institutions to proactively extract fresh insights into student engagement over time. By facilitating ongoing exploration of engagement patterns, institutions can make timely adjustments to improve the learning experience.
Future research directions should focus on combining the GMM method with interpretable machine learning models. This approach goes beyond descriptive analytics and aims to provide diagnostic analytics for students within each engagement level. This would enable educators to gain deeper insights into the factors influencing student engagement and take targeted actions.
While this study primarily employed K-means and GMM, future studies should explore a broader range of clustering techniques. These may include Latent Class Analysis, DBSCAN, K-medians, Mean Shift, K-prototypes, Fuzzy C-means, and Hierarchical clustering. Diversifying the clustering methods can offer alternative perspectives on student interactions within VLEs.
Lastly, there is an opportunity for research to delve into the application of cloud platforms within the domain of EDM. Investigating best practices for deploying EDM applications on cloud infrastructure can enhance scalability, accessibility, and the overall effectiveness of using VLE data for educational improvement. This avenue holds the potential to transform the landscape of data-driven decision-making in education.

CONCLUSION
This research showcases the creation of a web-based clustering application tailored for the purpose of determining and understanding student engagement levels within a virtual learning environment (VLE). The findings and insights presented herein stand as a valuable asset to educational institutions, practitioners in the field of EDM, and researchers. This research not only exemplifies the practical application of EDM principles but also provides essential guidance for the development and implementation of EDM applications using VLE data. Its contributions serve to enrich the landscape of data-driven decision-making in education, fostering improved learning experiences and outcomes for students.

Figure 2. Elbow method for K-means (left) and BIC scores for GMMs (right)

Figure 3. Web-based application with a modal

Figure 4. Web-based application with a user guide pop-up

Table 4. Sum of clicks statistics before and after aggregation.

61. Rachma Hermawati and Imas Sukaesih Sitanggang, "Web-Based Clustering Application Using Shiny Framework and DBSCAN Algorithm for Hotspots Data in Peatland in Sumatra," Procedia Environmental Sciences 33 (2016): 317-23.
62. Hermawati and Sitanggang, "Web-Based Clustering Application Using Shiny Framework and DBSCAN Algorithm for Hotspots Data in Peatland in Sumatra."