Data analytics steps for supporting project planning
Analytical Steps is a tool developed at CSC for identifying data analytics capabilities. It was inspired by the maturity models used in software development. In Analytical Steps, data analytics is divided into five steps: collect, describe, discover, predict and interact. The steps illustrate the different levels of analytics and help identify both the current state and the objectives of an analytics project. The underlying principle is that you cannot move to a higher step until the lower steps are in place.
It is a good idea to start planning an analytics project by defining the problem you want to solve with analytics and artificial intelligence. Do you want to understand sales trends, predict device failures or identify in advance students who are likely to discontinue their studies? Perhaps the problem to be solved is the social exclusion of young people, and the aim is to better understand the indicators that predict exclusion. Once the problem is known, you can consider how to achieve the desired result. The Analytical Steps tool supports this stage of the process: the steps make it easier to assess both the level at which the problem should be examined and the level you are currently at.
Collect data
The first step is data collection. At this level, the essential question is what data is needed to solve the problem. In addition to traditional numerical data, it is worth considering text, image and audio data as possible sources. Next, examine whether the data has already been collected. At the same time, it is important to plan an appropriate way of storing the source material. Small amounts of data can perhaps be stored on one's own computer, but larger datasets require more capable solutions, and material that includes sensitive personal data requires a different solution than publicly available data does. If data collection and storage are not yet in place, it may be necessary to establish a data warehouse project alongside the analytics project.
An example of data collection would be the collection and storage of student study data or the results of schools participating in a sports project. The sports project could, for example, involve activity or performance measurements for several different sports. The lowest step does not yet involve analytics, but the collected data already makes it possible to answer simple questions, such as what grade a student received in an English course or whether a school participated in the sports project.
Describe the data
The next analytical step is describing the data. The description phase covers a wide range of data pre-processing and aggregation, as well as the calculation of simple indicators such as the mean and range. The final output of the description phase could be a graph of the current year's sales or an interactive report on the users of a system, also known as a dashboard. This phase often takes more time than anticipated: by a common rule of thumb, up to 80% of analytics time goes into pre-processing the data, including detecting and correcting errors, removing overlapping variables and aggregating the data. Once this stage is complete, however, the data can yield a lot of useful information about the phenomenon being studied, which can then be used in operational planning.
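As a minimal sketch of what the description phase can look like in practice, the following Python snippet cleans a hypothetical grade file and calculates simple indicators. The file name and the column names (grade, course) are illustrative assumptions, not data from any actual project.

```python
import pandas as pd

# Illustrative course grade data; the file and column names are assumptions.
grades = pd.read_csv("course_grades.csv")

# Typical pre-processing: drop exact duplicates and rows with missing grades.
grades = grades.drop_duplicates().dropna(subset=["grade"])

# Simple descriptive indicators: mean and range of the grades.
print("mean:", grades["grade"].mean())
print("range:", grades["grade"].max() - grades["grade"].min())

# Grade distribution per course, a typical input for a dashboard.
print(grades.groupby("course")["grade"].value_counts().unstack(fill_value=0))
```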
Student data can be used, for example, to calculate the number of students discontinuing their studies each year, course grade distributions and key figures on students' course progress and completion. Similarly, the sports project results can be used to calculate the proportion of young people in each participating school who engage in physical activity for more than an hour per day. The data can also be aggregated at the municipal level and further enriched with other municipal data, such as municipal finance figures, the number of mental health client relationships among young people, the results of the School Health Promotion Study or even library utilisation rates. In this way, key figures on the well-being of young people can be calculated for each municipality and potential development targets identified.
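A rough sketch of this kind of municipal-level aggregation and enrichment might look like the following; the survey and finance files and their columns (municipality, daily_activity_min) are invented for illustration.

```python
import pandas as pd

# Illustrative datasets; the file and column names are assumptions.
survey = pd.read_csv("youth_survey.csv")        # one row per respondent
finance = pd.read_csv("municipal_finance.csv")  # one row per municipality

# Share of respondents who are physically active for over an hour a day.
survey["active_over_hour"] = survey["daily_activity_min"] > 60
by_municipality = survey.groupby("municipality", as_index=False).agg(
    share_active=("active_over_hour", "mean"),
)

# Enrich the aggregated figures with other municipal data.
enriched = by_municipality.merge(finance, on="municipality", how="left")
print(enriched.sort_values("share_active", ascending=False).head())
```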
Discover the phenomenon
The aim of the third step is to explore the phenomenon being studied in more depth and to detect anomalies, connections and groupings in the data. This phase may include both statistical analyses and machine learning. Typically, the final output is some kind of situational picture of the current state of the phenomenon and its properties.
Statistical methods can be used to observe the behaviour of the phenomenon and to raise alerts about events that deviate from the norm, such as a student receiving a failing grade in several courses in succession. Correlation analysis increases understanding of how the phenomenon is related to other variables. For example, physical activity or performance measurements can be used to calculate the correlation between a person's physical condition and various background variables, or to examine at the municipal level whether there is a connection between sports and exercise expenditure and young people's level of physical activity.
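As an illustration, a correlation analysis of this kind could be sketched in Python roughly as follows; the data file and variable names are assumptions made for the example.

```python
import pandas as pd

# Illustrative municipal-level data; the file and columns are assumptions.
df = pd.read_csv("municipal_data.csv")

# Pearson correlation between per-capita sports expenditure and the
# share of physically active young people in each municipality.
corr = df["sports_expenditure_per_capita"].corr(df["share_active"])
print(f"correlation: {corr:.2f}")

# A correlation matrix gives a quick overview of several variables at once.
print(df[["sports_expenditure_per_capita", "share_active",
          "library_usage_rate"]].corr())
```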
Clustering can be used to group the studied observations into separate clusters. The physical activity and well-being data of young people can be examined with municipal-level clustering, which helps identify similar municipalities and their shared characteristics. Are young people in some municipalities doing particularly well? If so, what other characteristics do those municipalities have? Clustering can also be carried out at the individual level by grouping young people based on their responses. This can be done with anonymised data, so that no individual respondent can be identified, while the cluster analysis still provides information and understanding about the factors influencing young people's well-being.
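A municipal-level clustering of this kind could be sketched, for example, with k-means; the indicators and the file below are hypothetical, and the choice of clustering method and cluster count would depend on the actual data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative municipal indicators; the file and columns are assumptions.
df = pd.read_csv("municipal_data.csv")
features = ["share_active", "mental_health_clients_per_1000",
            "sports_expenditure_per_capita"]

# Standardise the features so no single indicator dominates the distances.
X = StandardScaler().fit_transform(df[features])

# Group the municipalities into four clusters; the number of clusters is a
# modelling choice that should be validated, e.g. with silhouette scores.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(X)

# The cluster means show what characterises each group of municipalities.
print(df.groupby("cluster")[features].mean())
```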
Predict
The fourth step is prediction. This stage involves machine learning and statistical modelling. Prediction is based on a mathematical model that describes the phenomenon and its interactions as well as possible, or at least to a sufficient level. Building a predictive model typically requires both historical data and domain expertise.
In machine learning, the model learns the characteristics of the phenomenon being studied from the data fed to it. In other words, if we want to build a model that predicts the risk of exclusion among young people, we need to train it with data on both marginalised and non-marginalised young people. If the training data does not contain enough examples from both groups, prediction becomes more difficult and the model will not be reliable. Similarly, if the goal is to classify youth organisations' grant applications into five pre-defined categories, comprehensive examples of all five categories are needed. By contrast, there was no historical data on the spread of the coronavirus, so modelling of the infection situation has been based on expert knowledge of the behaviour of other viruses and of vaccination processes.
Because a machine learning model is built on previously collected data about the phenomenon being examined, it is important to make sure that the data used for modelling actually describes the phenomenon being predicted. In many cases, challenges arise because the training data does not contain enough observations of rare events. If you want to predict the risk of a student discontinuing their studies, but the training data contains almost exclusively successful students, the resulting model will probably not be reliable. Problems also arise if there is a systematic error in the training data: for example, if a group of people has been discriminated against in past grant decisions, the predictive model will learn this bias as well.
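To make the pitfall of rare events concrete, the following sketch trains a simple dropout classifier and weights the rare class more heavily. The student data file, feature columns and label are assumptions, and a real project would of course involve far more careful feature selection and validation.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Illustrative student data; the file, columns and label are assumptions.
df = pd.read_csv("student_data.csv")
features = ["credits_per_term", "failed_courses", "avg_grade"]
X, y = df[features], df["dropped_out"]

# Hold out a test set so the model is evaluated on unseen students;
# stratification keeps the rare dropout class present in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# class_weight="balanced" compensates for the rarity of dropouts in the
# training data, one of the pitfalls discussed above.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Report per-class precision and recall; overall accuracy alone is
# misleading when one class is rare.
print(classification_report(y_test, model.predict(X_test)))
```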
Interact and create something new
The fifth step aims to interact with the phenomenon and shape new ways of acting. In the field of technology, this is called a feedback mechanism: a process in which activities are modified on the basis of the results obtained from the previous step. If the prediction step was used to identify young people who may be at risk of exclusion, this step is used to implement support measures that prevent exclusion.
Such interaction can be either automatic or based on human action. For example, a self-driving vehicle stops automatically when it detects a person stepping onto a pedestrian crossing. Similarly, an automated function could concern grant decisions: a machine learning model could be trained on previous applications and decisions to prepare automatic decision proposals, which government officials would then check. This could speed up decision-making and, at best, reshape the whole process. The means of exerting influence are not always easy to define, however. Even if the number of coronavirus infections is predicted to increase, there is no certainty about which measures would affect the numbers.
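Returning to the grant example, a human-in-the-loop arrangement of this kind could be sketched as follows: confident predictions become automatic decision proposals, while borderline cases are routed to an official. The function, thresholds and probabilities below are purely illustrative and would have to be set per decision process.

```python
import numpy as np

def triage(probabilities, approve_at=0.9, reject_at=0.1):
    """Split predicted approval probabilities into automatic decision
    proposals and cases routed to a human reviewer."""
    p = np.asarray(probabilities)
    auto_approve = p >= approve_at
    auto_reject = p <= reject_at
    needs_review = ~(auto_approve | auto_reject)
    return auto_approve, auto_reject, needs_review

# Hypothetical model outputs for five grant applications.
approve, reject, review = triage([0.97, 0.55, 0.08, 0.91, 0.42])
print("flagged for manual review:", review)
```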
It is important to note that decisions affecting the phenomenon can also be taken as part of the other four steps. Often, even rough information about the phenomenon helps steer decisions in the right direction. For example, if an increase in the number of marginalised young people has been observed in the description phase, decisions can be made on this basis to influence future figures. A predictive model, on the other hand, can provide more detailed information on the phenomenon and its causes, and can even assess the risk of exclusion for an individual young person, enabling support measures to be targeted more precisely.
In conclusion
There are many levels to analytics work, and it is not possible to jump directly from the bottom step to the top. When planning an analytics project, it is a good idea to consider which level is required to achieve sufficient benefits. Moving from one level to the next can be expensive, and success is not guaranteed, so it makes sense to consider whether a lower level would suffice. Small analytical experiments are a good way to map out the benefits and opportunities of the next level. They can also be used to develop one's understanding of, and competence in, the opportunities and risks of data analytics.
Aino Ropponen
The author is a data analytics specialist in CSC’s AI and data analytics group. Email aino.ropponen@csc.fi.
Aleksi Kallio
The author is the manager of CSC’s data analytics group, coordinating development of machine learning and data engineering based services. Email aleksi.kallio@csc.fi.