Using IBM’s AutoAI: How I built a salary prediction system

Developing a prediction model is not an easy task. There are a lot of processes to go through to produce the best model. Data plays an important role in building the best prediction model, since the essence of machine learning is learning from data. However, real-world data is often unorganized, incomplete, or noisy. Consequently, we need to clean, prepare, and manipulate our data to make sure it is a suitable sample.

Then, we need to conduct experiments with different algorithms. We are also required to calculate errors, try optimization techniques, and compare the results until we get the most accurate model. For these reasons, we need a way to automate this set of tasks.

That way, a data scientist can focus on discovering useful insights rather than spending too much time on repetitive tasks.
AutoAI — Drake’s Meme

AutoAI with IBM Watson Studio provides an alternative that addresses this issue. It not only automates AI experiments, but also provides a lot of useful tools to meet industrial needs in a fast-paced and highly collaborative work environment.

With this product, the process of the experiments and deployment could easily run in minutes. In this article, I will elaborate on how I utilize AutoAI with IBM Cloud’s Watson Studio comprehensively.

The goals of this article are:

(1) demonstrating how to utilize AutoAI to build the most accurate prediction model;

(2) discovering some useful insight from IT Professionals salary survey data to support career decision in the EU region;

(3) demonstrating how to integrate the deployed machine learning prediction model with a Progressive Web Application (PWA).

Web Application Preview

Before we get into the process, I will show you a demo clip of the end product (the web application). It shows the flow of the app, the input data, and the expected prediction result.

This web application asks 8 multiple-choice questions, a textbox input, and a date-picker input, which become the inputs of the prediction model. This web app is developed with NodeJS, Nuxt, and Vuetify. (Demo: techpath | Source code: GitHub repository).

End-product: Web Application — Preview

Users can also download the PDF File of their prediction summary, as shown in the snapshot below.

Prediction Summary in PDF

I. Project Overview

1.1. Background

Shortages of workers with relevant qualifications have become a major challenge affecting European competitiveness. In the context of rapid technological change, Europe’s declining population and aging workforce mean that labor shortages are expected to increase in the future (source: EMN).

The demand for Information and Technology (IT) professionals in the EU region is very high. Although rapid growth in the IT sector creates some 120,000 new jobs a year, Europe could face a shortage of more than 800,000 skilled IT workers (source: gov.ie).

Recruiting skilled non-EU workers is one of the solutions to this problem, as Angela Dorothea Merkel, the Chancellor of Germany, said at this event.

Some countries experience an oversupply of IT professionals. For instance, according to the ILO, there was an oversupply of IT professionals compared with job opportunities in India. Labor migration has played a major role in the Indian economy.

Case Study: Salary Prediction to Support IT Professionals’ Decisions If They Expect to Work in the EU

This project addresses the socio-economic issue above by giving insight to millions of IT professionals, so they can make a well-considered decision about their future if they expect to start a career in the EU region.

Knowing their predicted salary will help them plan their future personal finances. Moreover, they can compare several results and gather other useful insights to support their decision. For example, they can find the city with the highest salary for certain parameters.

Salary prediction is also useful as a reference for the employer.

1.2. Data

This project uses the following data: IT Salary Survey for EU region, Annual Anonymous IT Salary Survey for the European region conducted in 2018–2020. The ready-to-use data (after the data preparation step) is also available here.

Columns below will become features (X):

Column of X (features)

The Yearly salary column will become the prediction target (Y):

Column of Y (target)

Data condition: needs cleansing (contains empty cells, wrong types, and outliers)

1.3. Methodology

This project follows the steps of the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology, and the AutoAI workflow with Watson Studio is well aligned with it. Every step of this process is shown in the following graphic:

II. Using AutoAI with IBM Watson Studio

In this section, I will show you a step-by-step guide on how I use AutoAI to support this project. It covers all the steps shown in the methodology section, from creating a project to deployment.

If you have an IBM Cloud account, you can try to use AutoAI on Watson Studio for free with a Lite Plan.

2.1. Creating a project

To create a new project, go to your IBM Watson Studio console. Then click on create a project.

IBM Watson Studio Console

Afterward, click on Create an empty project and fill in the form accordingly.

Creating a project

Once the project has been created, go to the settings page, scroll down to the associated service, then add the Watson Machine Learning service. Adding this service is mandatory to build the prediction model.

Settings Tab in Project Page
Service Added Successfully

After that, let’s step into the Data Preparation process.

2.2. Data Preparation

Although AutoAI automates the pre-processing, we can also refine our data with a lot of useful Refine tools for an advanced touch-up. To use these tools, we need to add the dataset to the project first. Go to the Assets tab, then simply browse for the files or drop them into the dialog on the right.

Dataset Ready

Once the dataset is ready, we can perform the data cleansing operations. Go to the data refine page by hovering over the data asset row > click on the three-dots action button > then click Refine.

Refine Data

You can choose any operation you need. IBM provided a lot of useful operations to make this step much easier.

For this project, I removed any rows that contain empty cells, converted any string column that actually contains integer values, and removed any outliers. To do that, simply select the operation you need and follow the guidance.
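For readers who want to see what these three cleansing operations amount to outside the Refine UI, here is a minimal sketch in plain JavaScript. It is not what Watson Studio runs internally. The column names and sample rows are hypothetical, and the 1.5 × IQR rule is only one common way to define an outlier (the article does not state which rule was used).

```javascript
// Hypothetical sample rows; the column names are illustrative only.
const sampleRows = [
  { city: 'Berlin', yearsOfExperience: '6', yearlySalary: 65000 },
  { city: 'Munich', yearsOfExperience: '3', yearlySalary: 50000 },
  { city: '', yearsOfExperience: '4', yearlySalary: 62000 }, // empty cell
  { city: 'Berlin', yearsOfExperience: '8', yearlySalary: 75000 },
  { city: 'Cologne', yearsOfExperience: '5', yearlySalary: 60000 },
  { city: 'Berlin', yearsOfExperience: '7', yearlySalary: 70000 },
  { city: 'Berlin', yearsOfExperience: '2', yearlySalary: 900000 } // outlier
];

// 1. Drop any row that has an empty cell (null, undefined, or blank string).
function dropIncompleteRows(rows) {
  return rows.filter(row =>
    Object.values(row).every(
      value => value !== null && value !== undefined && String(value).trim() !== ''
    )
  );
}

// 2. Convert string cells that hold integers ("6") into numbers (6).
function coerceIntegerStrings(rows, column) {
  return rows.map(row => {
    const value = row[column];
    const isIntegerString = typeof value === 'string' && /^-?\d+$/.test(value.trim());
    return isIntegerString ? { ...row, [column]: Number(value.trim()) } : row;
  });
}

// Linearly interpolated quantile of a sorted, non-empty numeric array.
function quantile(sorted, q) {
  const position = (sorted.length - 1) * q;
  const lower = Math.floor(position);
  const upper = Math.ceil(position);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (position - lower);
}

// 3. Keep only rows whose value lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
function removeOutliers(rows, column) {
  const sorted = rows.map(row => row[column]).sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  return rows.filter(
    row => row[column] >= q1 - 1.5 * iqr && row[column] <= q3 + 1.5 * iqr
  );
}

const cleanedRows = removeOutliers(
  coerceIntegerStrings(dropIncompleteRows(sampleRows), 'yearsOfExperience'),
  'yearlySalary'
);
```

The order matters in this sketch: incomplete rows are dropped first so that the quantiles are computed only from usable values.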

Clip of data cleansing

Once the cleansing process is finished, you can save it. Afterward, you will see a new file with the name ending with “_shaped” as shown in the picture below. We will use this new dataset to build the prediction model.

Shaped Data Asset

We have finished the data preparation process! Let’s move on to the Modeling step.

2.3. Modeling

To make a new experiment, click the New AutoAI experiment button in the Assets tab, fill the required form, then select the Data Assets.

Once the Data Asset is loaded, you need to choose which column you want to predict. AutoAI will automatically detect the prediction type according to your target (Y) column. For example, in this case, I want to predict Yearly salary, which is an integer. Therefore, Regression is automatically chosen.

Experiment Settings demo

If you want to set up more advanced prediction settings, then click the Experiment settings tab. You will find more useful choices like optimized metrics, algorithms to include, etc.

You can also choose how many algorithms to use. AutoAI will test the specified algorithms and use the top performers to create model pipelines.

Each algorithm generates 4 pipelines and more algorithms increase the runtime. Moreover, you can optionally adjust the percentage of your data source for creating, optimizing, and validating pipelines. You can also select what features to include.

AutoAI — Regression: Experiment Progress (with Cross Validation Scoring and ranked by RMSE)

When you start the experiment, you will see the progress map and the relationship map as shown in the clip above. You will realize how beautiful the visualization is.

After the experiment is finished, AutoAI ranks the models so you can choose which one to save. The leaderboard of pipelines is discussed in the next section.

2.4. Evaluation

The image below shows the leaderboard of every pipeline. According to the cross-validation scoring, Pipeline 7, which uses the LGBMRegressor algorithm with HPO-1 and FE optimization, has the highest (which is the best) value of R². This indicates that this model fits the data better than the other pipelines.

It also has the smallest (which is the best) value of Root Mean Squared Error (RMSE). So I chose to save this model and deploy it to the deployment space.
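To make the leaderboard metrics more concrete, here is a minimal sketch of how RMSE (lower is better) and R², a common higher-is-better regression score, are computed from actual and predicted values. The function names are my own and the numbers are made up. AutoAI computes its scores internally, so this is only an illustration of the definitions.

```javascript
// Root Mean Squared Error: square root of the mean squared residual.
function rmse(actual, predicted) {
  const sumOfSquares = actual.reduce(
    (sum, y, i) => sum + (y - predicted[i]) ** 2,
    0
  );
  return Math.sqrt(sumOfSquares / actual.length);
}

// R² (coefficient of determination): 1 - SS_residual / SS_total.
// A value of 1 means the predictions match the actual values exactly.
function rSquared(actual, predicted) {
  const mean = actual.reduce((sum, y) => sum + y, 0) / actual.length;
  const residualSum = actual.reduce(
    (sum, y, i) => sum + (y - predicted[i]) ** 2,
    0
  );
  const totalSum = actual.reduce((sum, y) => sum + (y - mean) ** 2, 0);
  return 1 - residualSum / totalSum;
}

// Made-up salaries (in euros) and predictions, for illustration only.
const actualSalaries = [50000, 60000, 70000];
const predictedSalaries = [52000, 59000, 71000];
console.log(rmse(actualSalaries, predictedSalaries));
console.log(rSquared(actualSalaries, predictedSalaries));
```

A pipeline that ranks first on both metrics, as Pipeline 7 does here, has the smallest typical error and explains the largest share of the variance in the target.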

On the pipeline comparison, you can choose different testing methods and error estimators. The leaderboard will change according to your preference.

To see more detail, you can click on any pipeline to view its evaluation measures, model information, feature transformations, and other useful evaluation data.

Pipeline Comparison Preview

To create the model, hover over any model on the pipeline leaderboard, then click on the Save As button.

Saving as a Model
Saving the model (2)

At this checkpoint, we can see how AutoAI from IBM Cloud’s Watson Studio could save up to 80% of our time by automating the experiment. We as data scientists, computer scientists, or developers just need to make small adjustments to the data using the tools provided by IBM, start the experiment, then wait a couple of minutes for the results.

2.5. Deployment

Deployment of the selected model is fast and without hassle. After you save the model, you can create the deployment space then deploy the model to that space in just a few clicks.

To deploy the model, go to your project page, scroll down to the Model section, then open the model by clicking it.

Saved Model

Afterward, click on the Promote to deployment space as shown in the picture below.

Then, select the target space. If you haven’t created any space, you can create a new one.

Promote to space

To create the deployment space, simply fill the form accordingly.

Create a deployment space

Then scroll down and Select storage service and Select machine learning service as shown on the snapshot below. Once it’s done, you can click the create button.

Selecting storage and service

After the deployment space is ready, we can promote the model to that space.

Deployed Model

Just like that! 🎉 our model is ready to use now.

The images below are showing the deployed model with API endpoint and the code snippet so developers can easily copy the code or adjust it to match their own programming language, framework, or plugins.

Online Deployed Model

2.6. Testing

In this step, I tested the deployed model. The image below shows the prediction input and output. In this case, a Senior Backend Developer with 6 years of experience, who mainly uses NodeJS, uses English as the main language, and works in a Product Company with 51–100 employees in Berlin, is predicted to have a yearly Brutto salary of €64.647.

Testing Result

III. Web App Integration

To provide an easy and user-friendly way to test the prediction model, I developed a Progressive Web Application (PWA) called techpath. This PWA uses NodeJS, Nuxt, and Vuetify. In this section, I will demonstrate how I integrate the deployed web service with this PWA. You can also visit this repository if you want to view the full code of this Web App. The integration flow is shown in the image below.

Integration flow of the Deployed Model and Web Application

To request a prediction result from the endpoint of the deployed model, we need to retrieve an access_token first. We do this by sending a POST request that includes our API_KEY to this endpoint: https://iam.cloud.ibm.com/identity/token.
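The token exchange described above can be sketched as follows. This is a minimal, dependency-free version that uses the standard fetch and URLSearchParams globals (available in Node 18+ and browsers) rather than Axios and qs. The function names are my own, and the API key is passed in as a parameter instead of being hard-coded.

```javascript
const IAM_TOKEN_URL = 'https://iam.cloud.ibm.com/identity/token';

// Build the form-encoded request that exchanges an API key for a token.
function buildTokenRequest(apiKey) {
  const body = new URLSearchParams({
    grant_type: 'urn:ibm:params:oauth:grant-type:apikey',
    apikey: apiKey
  });
  return {
    url: IAM_TOKEN_URL,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        Accept: 'application/json'
      },
      body: body.toString()
    }
  };
}

// Send the request and return the access_token field of the JSON response.
// fetchImpl can be swapped for a stub, which is handy in tests.
async function fetchAccessToken(apiKey, fetchImpl = fetch) {
  const { url, options } = buildTokenRequest(apiKey);
  const response = await fetchImpl(url, options);
  if (!response.ok) {
    throw new Error(`Token request failed with status ${response.status}`);
  }
  const data = await response.json();
  return data.access_token;
}
```

The returned token is then sent as a Bearer token in the Authorization header of the prediction request. Tokens expire, so a long-running server would need to request a fresh one periodically.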

3.1. Creating an API KEY

To get an API_KEY, go to your IBM Cloud console > On the navigation bar, click on Manage > Then click on Access (IAM). Or you can simply visit https://cloud.ibm.com/iam.

The snapshot below might help you to find the IAM.

Access (IAM) Location

After opening the IAM page, click on API keys on the left sidebar. Then click on Create an IBM Cloud API key as shown in the snapshot below.

Creating an API Key (1)

Afterward, you can fill in the name and description as shown in the image below.

Creating an API Key (2)

Once the API KEY is successfully created, you can copy it. I also recommend downloading it, because for security reasons you might be unable to see the API KEY again.

API key: successfully created

Great! Now we have an API KEY so we can use it to retrieve access_token.

3.2. Setting Server Middleware

Since I developed this project using NuxtJS, I will use the serverMiddleware feature to retrieve the token. NuxtJS internally creates a connect instance to which you can add your own custom middleware. This allows us to register additional routes (in this case, an API route) without the need for an external server. To start with serverMiddleware, we first need to create a new directory named server-middleware, then add a JS file named rest.js (you can use your own name).

NuxtJS Scaffolding with serverMiddleware

Then we need to register it on the nuxt.config.js as shown in the gist embed below.

Registering path: ‘/server-middleware’
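Since the embedded gist is not reproduced here, the following is a minimal sketch of what such a registration can look like in a Nuxt 2 project, assuming the directory and file names mentioned above. The rest of the real configuration is omitted.

```javascript
// nuxt.config.js (sketch): mount rest.js under the /server-middleware path.
const nuxtConfig = {
  serverMiddleware: [
    { path: '/server-middleware', handler: '~/server-middleware/rest.js' }
  ]
};

// In a real nuxt.config.js this object is the file's default export:
// export default nuxtConfig
```

With this entry, a route defined as /request-autoai-prediction inside rest.js becomes reachable at /server-middleware/request-autoai-prediction.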

After creating the serverMiddleware, we need to import all the required packages into the rest.js file. For this project, I use Express, body-parser, Axios, and qs, as shown in the gist embed below (you need to install them all beforehand).

3.3. Server Middleware Endpoint: Creating an API route for Retrieving access_token and request for AutoAI prediction

Access_token is mandatory to access the endpoint of the deployed prediction model. In this demo, I used the Axios module to handle the HTTP request.

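For security reasons, we need to store our API KEY in an environment variable (.env) rather than in the source code. Because the key is only read inside serverMiddleware, it can be accessed through process.env on the server (for example, after loading the .env file with a package such as dotenv) and should never be exposed to the browser. Note that the “VUE_APP_” prefix is a Vue CLI convention that embeds the variable in the client-side bundle, so it is not suitable for a secret such as an API KEY.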

Next, we need to create an endpoint whose purpose is to retrieve the access_token and request the AutoAI prediction. The following gist embed shows how I retrieve the token from the serverMiddleware of NuxtJS.
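As the gist is not reproduced here, below is a rough sketch of such an endpoint, not the author's exact code. It is written as a factory that returns an Express/connect-style handler. `getToken` is a hypothetical async function that performs the IAM token exchange using the API key from the environment, `scoringUrl` stands for the prediction endpoint shown on your deployment page, and fetch is used in place of Axios to keep the example dependency-free.

```javascript
// Factory for an Express/connect-style route handler.
// getToken:   hypothetical async function that resolves to an access_token.
// scoringUrl: prediction endpoint copied from the deployment page.
// fetchImpl:  HTTP client; defaults to the global fetch (Node 18+).
function createPredictionHandler({ getToken, scoringUrl, fetchImpl = fetch }) {
  return async function handlePrediction(req, res) {
    try {
      const token = await getToken();
      // Forward the request body (already parsed as JSON) to the model.
      const response = await fetchImpl(scoringUrl, {
        method: 'POST',
        headers: {
          Authorization: `Bearer ${token}`,
          'Content-Type': 'application/json',
          Accept: 'application/json'
        },
        body: JSON.stringify(req.body)
      });
      const result = await response.json();
      res.status(response.status).json(result);
    } catch (error) {
      res.status(500).json({ error: error.message });
    }
  };
}

// Possible wiring inside rest.js (assumes an Express app with JSON body parsing):
// app.post('/request-autoai-prediction', createPredictionHandler({ getToken, scoringUrl }))
```

Keeping the token exchange on the server means the browser only ever talks to the middleware route and never sees the API KEY or the access_token.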

3.4. On NuxtJS PageComponent: POST input data to serverMiddleware endpoint and retrieve the prediction result

After the serverMiddleware endpoint is ready, then we can consume it on the NuxtJS Page Component. First, we need to define the variable of input data. This variable must contain the fields and values as shown in the embed below.

defining input data
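As a rough illustration of the fields-and-values structure, here is a minimal sketch that builds the request payload from a plain object. The `input_data`, `fields`, and `values` keys follow the Watson Machine Learning online scoring format. The field names and answers below are hypothetical, and the real ones must match the column names of the training dataset.

```javascript
// Hypothetical answers collected from the form. The keys must match the
// column names of the dataset the model was trained on.
const formAnswers = {
  Age: 28,
  City: 'Berlin',
  Position: 'Backend Developer',
  'Years of experience': 6,
  'Seniority level': 'Senior'
};

// Wrap one record in the fields/values structure expected by the model:
// one list of column names and one row of values in the same order.
function buildInputData(record) {
  return {
    input_data: [
      {
        fields: Object.keys(record),
        values: [Object.values(record)]
      }
    ]
  };
}

const inputData = buildInputData(formAnswers);
console.log(JSON.stringify(inputData, null, 2));
```

Each inner array in `values` is one row to score, so several candidates could be predicted in a single request by adding more rows.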

Then we can send the POST request to the ‘/server-middleware/request-autoai-prediction’ endpoint.
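A minimal sketch of that request from the page component might look like the following. It uses fetch instead of Axios, the function names are my own, and it assumes the middleware returns the model's response unchanged, in the form `{ predictions: [{ fields: [...], values: [[...]] }] }`.

```javascript
const PREDICTION_ROUTE = '/server-middleware/request-autoai-prediction';

// Pull the first predicted value out of the scoring response.
// Assumes the shape { predictions: [{ fields: [...], values: [[number]] }] }.
function extractPrediction(result) {
  return result.predictions[0].values[0][0];
}

// POST the input data to the middleware route and return the prediction.
// The relative URL works in the browser; fetchImpl can be replaced by a stub.
async function requestPrediction(inputData, fetchImpl = fetch) {
  const response = await fetchImpl(PREDICTION_ROUTE, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(inputData)
  });
  if (!response.ok) {
    throw new Error(`Prediction request failed with status ${response.status}`);
  }
  return extractPrediction(await response.json());
}
```

In a page component, the resolved value could be stored in the component's data and rendered as the predicted yearly salary.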

It’s nicely done! 🎉 here is the result:

Prediction Result

IV. End-Product Experiment

In this section, I conduct a three-part experiment with the product (the techpath web app). Each part produces an interesting insight and shows how different feature values affect the salary prediction.

4.1. Experiment 1: If Andi moves to Cologne, how much is his projected salary? Andi is a Senior Mobile Developer (Swift) with 7 years of experience.

In this experiment, I tried the following input:

Experiment 1: Input
Experiment 1: Result

Insight: According to the prediction results, if a 28-year-old Senior Mobile Developer who uses Swift as his main programming language and has 7 years of experience plans to work in a medium-to-large Product company that uses English for daily communication, then he is predicted to get a yearly Brutto salary of €66.822 in Cologne.

4.2. Experiment 2: Between Frankfurt and Amsterdam, in which city is a Senior Backend Developer (PHP) projected to get the higher salary?

In this experiment, I tried 2 different input cases consisting of the following sets:

Fixed sets:

Experiment 2: fixed sets

Variation value of City with Experiment results for each case:

Experiment 2: Results

Insight: According to the prediction results, if a Senior Backend Developer who uses PHP as his main programming language, has 10 years of experience, and plans to work in a small-to-medium Consulting company that uses English for daily communication expects to work in the EU, then Frankfurt, Germany is recommended over Amsterdam, with a predicted yearly Brutto salary of €75.063.

4.3. Experiment 3: In Berlin, what is the salary difference between Junior, Middle, Senior, and Lead Machine Learning Engineers who use Python?

In this experiment, I tried 4 different cases regarding the Machine Learning engineer position.

Fixed sets:

Experiment 3: Fixed Sets

Variation value of Years of experience, Age, and Seniority Level with Experiment results:

Experiment 3: Results

Insight: According to the prediction results, for a Machine Learning Engineer in Berlin, the difference in predicted salary between a Junior (with no experience) and a Lead (with 25 years of experience) is €135.093. The predicted result for a Junior with no experience is €21.671, and for a Lead with 25 years of experience it is €156.764. We can conclude that years of experience, age, and seniority level significantly affect the salary.
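The salaries in this article use a dot as the thousands separator, so €156.764 means 156,764 euros. As a small check of the difference quoted above, here is a sketch with a hypothetical helper, `parseEuro`, that only handles whole-euro amounts written in this format.

```javascript
// Parse a whole-euro amount such as "€156.764" (dot = thousands separator).
// Assumption: no cents are present, so every non-digit can be dropped.
function parseEuro(text) {
  return Number(text.replace(/\D/g, ''));
}

const leadSalary = parseEuro('€156.764');
const juniorSalary = parseEuro('€21.671');
const salaryGap = leadSalary - juniorSalary;

console.log(salaryGap); // 135093
```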

V. Conclusion

5.1. Conclusion of using AutoAI from IBM Cloud’s Watson Studio

After trying out this product (AutoAI) back and forth, here are some conclusions from my experience that might help you when considering Watson Studio:

AutoAI from IBM Cloud’s Watson Studio clearly understands the pain points of the Machine Learning & Data Mining process. With the data asset Refine feature, we can easily touch up our selected data. The Refine feature has many useful data preparation tools such as data filtering, cleansing, replacing, etc.

This product has complete and useful options for the experiment. We can select the algorithms, define the training data split, select the optimization metric, choose the model, and use other options to suit our needs.

The experiment visualization is gorgeous. The visual display of the pipelines is really beautiful and easy to understand.

All in one. From the moment I signed up for the Watson Studio service, through every step from adding data to deployment, I didn’t need to leave Watson Studio.

This product has a Lite Plan, which means we can use it for FREE with some limitations.

5.2. My Experience with IBM Cloud Support Team

A few days ago, I needed to upgrade my plan because my Lite plan had reached the Capacity Unit-Hours (CUH) limit, due to my curiosity about this product. But I encountered a payment error saying that my credit card had been rejected. So I contacted the IBM Cloud Support Team through email, and they replied and solved the problem really quickly.

Experience with IBM Cloud’s Support team

5.3. Conclusion of predicting Yearly Salary of IT Professionals in the EU region

As mentioned in the background section, the EU region is projected to have a high demand for labor. For IT professionals who plan to work in the EU region, predicting their salary according to some important parameters helps them assess their skillset and future financial condition (income and expenses), and adds more insight for their consideration, so they can make a well-considered decision before they start working in the EU region.

Based on the product experiment results, we can conclude that the prediction results can support their decision-making. For example, a senior back-end developer who uses PHP might choose Frankfurt over Amsterdam because the predicted salary there is higher.

Suggestion

To make this prediction model more accurate, it would be better to have a dataset with more varied features that provide more detail about skillsets. Instead of only choosing the main programming language, it would be better to capture more detailed technologies. For example, if the person is a Backend Developer, they should be able to input their experience with any modern framework or library such as Eloquent, Django, or Rails.

Thank you! See you in the next article.

GIS Front-End Developer & Computer Scientist based in Bandung, Indonesia
