The goal of this project is to develop software to predict revenue from the Lead Generation (LeadGen) business model at startup.

The LeadGen revenue is fluctuating because of various factors including seasonality, number of paying customers, the number of generated leads and so on. The startup has been collecting data from the very beginning of this business model but only starting from January 1st, 2016 has it been free from budget constraints on marketing and this date will be used as the starting point.

Dataset exploration and selection of the algorithm

I have been exploring available data trying to engineer sensible features. Some of the features will be generated by the website code, some will be added by the predictor before training.

After trying out various linear regressions including polynomial with varying regularizations (Lasso and Ridge) I have finally chosen SVR (Support Vector Regression) with the linear kernel. For the SVR I also tried to use polynomial preprocessing, but after tests, the polynomial degree was set to 1.

The idea of the predictor is to use the dataset generated on a given date to receive a revenue prediction for the entire month. At any time of the month (with the exception of the first day) I have the cumulated monthly revenue value available. This means, I only need to predict the total revenue of the remaining days and add it to the already known part of the monthly revenue.

Because of this, the SVR predictor actually predicts the monthly average daily revenue. The predicted value is then multiplied by the number of the remaining days in the month to get the unknown part of the monthly revenue.

So, the pipeline of the application will be as follows:

  1. Receive the dataset and convert it to the pandas DataFrame.
  2. Add additional features.
  3. Train the SVR predictor using the data from the past months.
  4. Generate the prediction for the daily average revenue for the days that have passed in the current month.
  5. Sum up the already known daily revenue from the days passed, add them to the predicted monthly average daily revenue to get the total revenue prediction for the month.
  6. Return the prediction and other relevant data to the requesting application.

Using this approach, I can generate revenue predictions also for the months in the past. I only need to pretend not to know the actual total revenue of the month and treat it as if it is not over yet (by picking any day in it). For the training, I then use only the months preceding the given one.

By doing so I was able to pinpoint the SVR as the best prediction algorithm for my task and fine-tune its parameters using the GridSearch.

Predictor application

The predictor application is implemented in two modules predictor.py and source.py located within the mtm_rev_pred module. The main script initializes classes Source and Pedictor, feeds the dataset file to the Source instance to be converted into the pandas DataFrame and then channels the output to the Predictor instance that trains the algorithm and returns the prediction:

The Source module

The source.py module contains a single class, Source. The purpose of this module is to provide the functionality required to:

  • Read the dataset file from the pre-set source folder
  • Convert the data into pandas DataFrame
  • Separate target values from the features
  • Generate additional features
  • Scale the data

In the main script, the Source class instance only reads the file and converts it to the pandas DataFrame. The actual preprocessing is called from the Predictor class’ method predict.

The original features provided by the website are:

  • revenue – revenue made on the given day
  • cum_month_rev – cumulated month revenue up to the given day
  • avg_daily – average daily revenue that is calculated by the diving the cumulated month revenue by the number of days passed in the month
  • lead_count – number of leads made on the given day
  • lead_num_14 – number of leads made in the last 14 days
  • lead_num_7 – number of leads made in the last 7 days
  • rev_14 – revenue cumulated in the last 14 days
  • rev_7 – revenue cumulated in the last 7 days
  • lead_num_14_ly – number of leads in 14 days last year (this date last year and 14 days back)
  • lead_num_7_ly – number of leads in 7 days last year
  • rev_14_ly – cumulated 14 days revenue last year
  • rev_7_ly – cumulated 7 days revenue last year
  • num_days – number of days in the current month
  • weekdays – number of weekdays in the current month
  • active_partners_14_d – number of partners that received leads in the last 14 days

The additional features that the Source class generates are:

  • Months – the source module translated this categorical feature into 12 separate features. That is the row will get value 1 if the day’s month matches the feature and 0 if not.
  • Days left in the month – calculated by subtracting the current day from the number of days in the month
  • Average daily revenue – the total monthly revenue divided by the number of days in the month. This is the target feature that the predictor will try to estimate. It is discarded from the data in the evaluated month as it is presumed unknown.

After the additional features are generated, a sample entry from the dataset looks like this (actual numbers are hidden):

Next, these data are scaled using the StandardScaler. This process removes the mean and scales the data to unit variance. The docstring of theStandardScaler reads:

The standard score of a sample x is calculated as:

z = (x – u) / s

where u is the mean of the training samples or zero if with_mean=False,
and s is the standard deviation of the training samples or one if
with_std=False.

Finally, the Sourceinstance returns a tuple of the scaled features, target values, and the original (non-scaled, target values together with features) dataset. The latter will be used to calculate the current month’s known cumulated revenue later.

Predicting the monthly revenue

The Predictor class receives the dataset as a Pandas DataFrame. Before using it to train the estimator and get the predictions, the method predict instantiates an instance of the Source class that performs the preprocessing. Then, the preprocessed data are fed to the method monthly_predictor that also receives the estimator object and the month for which the prediction must be made.

Method predict_period

To get a monthly revenue estimation, the system first generates the predicted average daily revenue for the month. This is done in the method predict_period. This method receives the estimator model, the dataset, and three dates. The start_date and end_date limit the period in which the value is predicted. The period between the init_date and start_date is used to train the estimator.

This method returns a DataFrame with predicted values in the column avg_daily:

Method predict_month

The method predict_month takes these predictions of the average daily earnings, multiplies it by the number of the remaining days in the month and adds to that the known cumulated revenue of the month. This produces the prediction of the total revenue for the month,

Method monthly_predictor

The monthly_predictor method takes the estimator model, the features and the target values to produce predictions of the monthly revenues for all months starting from the specified year and month. The predicted month, however, must be present in the features dataset.

This method will try to evaluate the prediction quality by comparing the predictions on each day of the month with the actual monthly revenue. This, however, only possible when the estimation is run for a month that has known real total revenue.

Processing the results

The dictionary returned by the predictor is converted into a JSON string that is sent to the standard output. The website application that runs the predictor using a command-line expression, receives and parses the output.

The prediction data containing the monthly revenue prediction and the data are then saved to the database of the website.

Tests

The application is covered by unit tests. The tests are found in the subpackage teststhat contains two modules:

  • helper.py – here I placed functions that generate mock data for the tests
  • tests.py actual unit tests covering the methods of both the source.py and predictor.py modules.

Deployment

The application is deployed using a fabric script. The installation path is at /var/python/mtm_predictor/leadgen_revenue/. The application can be installed as master, testing, or stage. The repository has to have these branches present. To indicate the type of deployment, provide one of these three values to the dep_type parameter of the installation command:

Under the target directory (e,g, /var/python/mtm_predictor/leadgen_revenue/master) the installation script will create a releasessub-directory. There, the installation will create sub-directories where it will clone the current state of the deployment branch to. At the end of the installation, a symlink will be created (/var/python/mtm_predictor/leadgen_revenue/master/app) that points to the latest release. The installation will keep a maximum of 3 releases sub-directories and delete older ones after each installation.

Besides the releases sub-directories, the installation script creates alocalsubdirectory outside of the releasesdirectory hierarchy. This directory will contain the dataset file that the website application updates each time it asks for new predictions. There is no need to destroy it between deployments and it is created one time.

The local sub-directory is available via a symlink from the mtm_rev_pred package sub-directory.

Thus, the deployment directory structure is like this:

The build is set up in Jenkins as a freestyle project using the Virtualenv builder: