The goal of this project is to develop software to predict revenue from the Lead Generation (LeadGen) business model at startup.

The LeadGen revenue is fluctuating because of various factors including seasonality, number of paying customers, the number of generated leads and so on. The startup has been collecting data from the very beginning of this business model but only starting from January 1st, 2016 has it been free from budget constraints on marketing and this date will be used as the starting point.

## Dataset exploration and selection of the algorithm

I have been exploring available data trying to engineer sensible features. Some of the features will be generated by the website code, some will be added by the predictor before training.

After trying out various linear regressions including polynomial with varying regularizations (Lasso and Ridge) I have finally chosen SVR (Support Vector Regression) with the linear kernel. For the SVR I also tried to use polynomial preprocessing, but after tests, the polynomial degree was set to 1.

The idea of the predictor is to use the dataset generated on a given date to receive a revenue prediction for the entire month. At any time of the month (with the exception of the first day) I have the cumulated monthly revenue value available. This means, I only need to predict the total revenue of the remaining days and add it to the already known part of the monthly revenue.

Because of this, the SVR predictor actually predicts the monthly average daily revenue. The predicted value is then multiplied by the number of the remaining days in the month to get the unknown part of the monthly revenue.

So, the pipeline of the application will be as follows:

1. Receive the dataset and convert it to the pandas DataFrame.
3. Train the SVR predictor using the data from the past months.
4. Generate the prediction for the daily average revenue for the days that have passed in the current month.
5. Sum up the already known daily revenue from the days passed, add them to the predicted monthly average daily revenue to get the total revenue prediction for the month.
6. Return the prediction and other relevant data to the requesting application.

Using this approach, I can generate revenue predictions also for the months in the past. I only need to pretend not to know the actual total revenue of the month and treat it as if it is not over yet (by picking any day in it). For the training, I then use only the months preceding the given one.

By doing so I was able to pinpoint the SVR as the best prediction algorithm for my task and fine-tune its parameters using the GridSearch.

## Predictor application

The predictor application is implemented in two modules `predictor.py` and `source.py` located within the `mtm_rev_pred` module. The main script initializes classes `Source` and `Pedictor`, feeds the dataset file to the Source instance to be converted into the pandas DataFrame and then channels the output to the Predictor instance that trains the algorithm and returns the prediction:

```import sys
import json
from mtm_rev_pred.source import Source
from mtm_rev_pred.predictor import Predictor

source = Source()
predictor = Predictor()

try:
fn = sys.argv
except IndexError:
print('File name is required')
exit(1)
try:
dataset = source.get_dataset(fn)
res = predictor.predict(dataset)
result = json.dumps(res)
print(result)

except FileNotFoundError as ex:
print('No train dataset file found under {:s}'.format(fn))
except ValueError as ex:
print(str(ex))
```

### The Source module

The `source.py` module contains a single class, `Source`. The purpose of this module is to provide the functionality required to:

• Read the dataset file from the pre-set source folder
• Convert the data into pandas DataFrame
• Separate target values from the features
• Scale the data

In the main script, the `Source` class instance only reads the file and converts it to the pandas DataFrame. The actual preprocessing is called from the `Predictor` class’ method `predict`.

The original features provided by the website are:

• `revenue` – revenue made on the given day
• `cum_month_rev` – cumulated month revenue up to the given day
• `avg_daily` – average daily revenue that is calculated by the diving the cumulated month revenue by the number of days passed in the month
• `lead_count` – number of leads made on the given day
• `lead_num_14` – number of leads made in the last 14 days
• `lead_num_7` – number of leads made in the last 7 days
• `rev_14` – revenue cumulated in the last 14 days
• `rev_7` – revenue cumulated in the last 7 days
• `lead_num_14_ly` – number of leads in 14 days last year (this date last year and 14 days back)
• `lead_num_7_ly` – number of leads in 7 days last year
• `rev_14_ly` – cumulated 14 days revenue last year
• `rev_7_ly` – cumulated 7 days revenue last year
• `num_days` – number of days in the current month
• `weekdays` – number of weekdays in the current month
• `active_partners_14_d` – number of partners that received leads in the last 14 days

The additional features that the Source class generates are:

• Months – the source module translated this categorical feature into 12 separate features. That is the row will get value 1 if the day’s month matches the feature and 0 if not.
• Days left in the month – calculated by subtracting the current day from the number of days in the month
• Average daily revenue – the total monthly revenue divided by the number of days in the month. This is the target feature that the predictor will try to estimate. It is discarded from the data in the evaluated month as it is presumed unknown.

After the additional features are generated, a sample entry from the dataset looks like this (actual numbers are hidden):

```lead_count                 ---.000000
revenue                  ---.000000
rev_14                  ---.0000
rev_7                   ---.0000
rev_14_ly               ---.000000
rev_7_ly                ---.000000
monthly_rev             ---.0000
num_days                   ---.000000
weekdays                   ---.000000
cum_month_rev           ---.0000
active_partners_14_d       ---.000000
avg_daily                ---.00000
Jan                         ---.000000
Feb                         ---.000000
Mar                         ---.000000
Apr                         ---.000000
May                         ---.000000
Jun                         ---.000000
Jul                         ---.000000
Aug                         ---.000000
Sep                         ---.000000
Oct                         ---.000000
Nov                         ---.000000
Dec                         ---.000000
month_days_left             ---.000000
```

Next, these data are scaled using the `StandardScaler`. This process removes the mean and scales the data to unit variance. The docstring of the`StandardScaler` reads:

The standard score of a sample `x` is calculated as:

z = (x – u) / s

where `u` is the mean of the training samples or zero if `with_mean=False`,
and `s` is the standard deviation of the training samples or one if
`with_std=False`.

Finally, the `Source`instance returns a tuple of the scaled features, target values, and the original (non-scaled, target values together with features) dataset. The latter will be used to calculate the current month’s known cumulated revenue later.

### Predicting the monthly revenue

The `Predictor` class receives the dataset as a Pandas DataFrame. Before using it to train the estimator and get the predictions, the method `predict` instantiates an instance of the `Source` class that performs the preprocessing. Then, the preprocessed data are fed to the method `monthly_predictor` that also receives the estimator object and the month for which the prediction must be made.

```    def predict(self, dataset, date=None):
"""Provides predictions for each date of the month of the given date

:param DataFrame dataset:
:param string date:
:return:
"""
if date is None:
date = dataset.index[-1].to_pydatetime()
else:
date = datetime.datetime.strptime(str(date), '%Y-%m-%d')

X, y, dataset = source.process_dataset(dataset)
self.dataset = dataset
res = self.monthly_predictor(
self.get_svr(degree=1, include_bias=False,
kernel='linear', c=1.9, epsilon=66
),
X, y.to_frame(), str(date.year), str(date.month), raw_dataset=dataset)
return res
```

#### Method `predict_period`

To get a monthly revenue estimation, the system first generates the predicted average daily revenue for the month. This is done in the method `predict_period`. This method receives the estimator model, the dataset, and three dates. The `start_date` and `end_date` limit the period in which the value is predicted. The period between the `init_date` and `start_date` is used to train the estimator.

This method returns a DataFrame with predicted values in the column `avg_daily`:

```    def predict_period(self, model, X, y, start_date, end_date, date_init='2016-01-01'):
""" Returns predictions for the specified date range using a model
trained in the specified training date range

:param model the predicting model
:param pandas.DataFrame X: the whole available dataset of features
:param pandas.DataFrame y: the whole available target values
:param string start_date: date to start predicting from
:param string end_date: date to predict until
:param string date_init: date to start training from

:return pandas.DataFrame: predictions
"""

X_train = source.get_data_from_date_range(X, date_init, start_date)
y_train = source.get_data_from_date_range(y, date_init, start_date)

X_val = source.get_data_from_date_range(X, start_date, end_date)
y_val = source.get_data_from_date_range(y, start_date, end_date)

if isinstance(model.named_steps['lin_reg'], SVR):
model.fit(X_train, y_train.values.ravel())
else:
model.fit(X_train, y_train)

pred = model.predict(X_val)

y_pred = y_val.copy()
y_pred['avg_daily'] = pred

return y_pred
```

#### Method `predict_month`

The method `predict_month` takes these predictions of the average daily earnings, multiplies it by the number of the remaining days in the month and adds to that the known cumulated revenue of the month. This produces the prediction of the total revenue for the month,

``` def predict_month(self, month_data, start_date, end_date, montlhy_cum, monthly_rev):
""" Predicts monthly revenue on each day of the given month

:param pandas.DataFrame month_data: a dataframe with predicted average daily revenue for the month
:param string start_date: starting date of the month
:param string end_date: end date of the month
:param pandas.DataFrame montlhy_cum: the monthly cumulative revenue values for the entire dataset
:param pandas.DataFrame monthly_rev: real monthly revenue

:return pandas.DataFrame: predicted monthly revenue on each day of the month
"""
X_month = source.get_data_from_date_range(montlhy_cum, start_date, end_date)
month_data['month_rev_pred'] = (
X_month['cum_month_rev'] + (month_data['avg_daily']
* self.dataset['month_days_left']).astype(float))
month_data['monthly_rev'] = monthly_rev

return month_data
```

#### Method `monthly_predictor`

The `monthly_predictor` method takes the estimator model, the features and the target values to produce predictions of the monthly revenues for all months starting from the specified year and month. The predicted month, however, must be present in the features dataset.

This method will try to evaluate the prediction quality by comparing the predictions on each day of the month with the actual monthly revenue. This, however, only possible when the estimation is run for a month that has known real total revenue.

```    def monthly_predictor(self, model, X, y, start_year, start_month, raw_dataset, date_init='2016-01-01'):
""" Generates monthly revenue predictions for all months from the specified year and month
using training data from the date_init. For each month the speciafied model is trained on
data from the init date up to the given month. The month's data is used to generate predictions
and calculate score

:param model: predicting model
:param pandas.DataFrame X: features dataset
:param pandas.DataFrame y: target values dataset
:param string start_year: year to start predicting from
:param string start_month: month to start predicting from
:param pandas.DataFrame raw_dataset: the initial dataset containint the monthly
cumulated and montly total revenues
:param string date_init: date to start training from

:return dict
"""
months = self.get_months(date_init, X.index)
date_str = str(start_year) + '-' + str(start_month) + '-01'
start_date_ = datetime.datetime.strptime(date_str, '%Y-%m-%d')
date_str = start_date_.strftime('%Y-%m-%d')

last_month_date = months[-1]['end']

monthly_rev = self.dataset['monthly_rev'].to_frame()
y_comp = source.get_data_from_date_range(monthly_rev, date_init, last_month_date)
y_pred = source.get_data_from_date_range(monthly_rev, date_init, date_str)

last_month_params = None

for month in months:
if month['start'] >= date_str:
y_month_avg_daily_pred = self.predict_period(model, X, y, month['start'], month['end'])

month_data = self.predict_month(y_month_avg_daily_pred, month['start'], month['end'],
raw_dataset['cum_month_rev'].to_frame(),
raw_dataset['monthly_rev'].to_frame()
)
y_month_rev_pred = month_data['month_rev_pred'].to_frame()
y_month_rev_pred['monthly_rev'] = y_month_rev_pred['month_rev_pred']
y_month_rev_pred = y_month_rev_pred.drop(columns=['month_rev_pred'])
y_pred = pd.concat([y_pred, y_month_rev_pred])

if isinstance(model.named_steps['lin_reg'], SVR):
coef = model.named_steps['lin_reg'].coef_
last_month_params = pd.Series(coef.round(0), index=X.columns.values).sort_values()

score = np.sqrt(mean_squared_error(y_comp, y_pred))

d_y_comp = self.process_ts_dict(y_comp.to_dict(orient='dict')['monthly_rev'])
d_y_pred = self.process_ts_dict(y_pred.to_dict(orient='dict')['monthly_rev'])
d_lmp = last_month_params.to_dict()

return {'y_comp': d_y_comp, 'y_pred': d_y_pred, 'score': score, 'last_month_params': d_lmp}
```

### Processing the results

The dictionary returned by the predictor is converted into a JSON string that is sent to the standard output. The website application that runs the predictor using a command-line expression, receives and parses the output.

The prediction data containing the monthly revenue prediction and the data are then saved to the database of the website.

## Tests

The application is covered by unit tests. The tests are found in the subpackage `tests`that contains two modules:

• `helper.py` – here I placed functions that generate mock data for the tests
• `tests.py` actual unit tests covering the methods of both the `source.py` and `predictor.py` modules.

## Deployment

The application is deployed using a fabric script. The installation path is at `/var/python/mtm_predictor/leadgen_revenue/`. The application can be installed as `master`, `testing`, or `stage`. The repository has to have these branches present. To indicate the type of deployment, provide one of these three values to the `dep_type` parameter of the installation command:

```fab deploy:host=USERNAME@HOSTNAME,dep_type=TYPE -i /PATH_TO_PRIVATE_KEY
```

Under the target directory (e,g, `/var/python/mtm_predictor/leadgen_revenue/master`) the installation script will create a `releases`sub-directory. There, the installation will create sub-directories where it will clone the current state of the deployment branch to. At the end of the installation, a symlink will be created (`/var/python/mtm_predictor/leadgen_revenue/master/app`) that points to the latest release. The installation will keep a maximum of 3 releases sub-directories and delete older ones after each installation.

Besides the releases sub-directories, the installation script creates a`local`subdirectory outside of the `releases`directory hierarchy. This directory will contain the dataset file that the website application updates each time it asks for new predictions. There is no need to destroy it between deployments and it is created one time.

The `local` sub-directory is available via a symlink from the `mtm_rev_pred` package sub-directory.

Thus, the deployment directory structure is like this:

```  app > /var/python/mtm_predictor/leadgen_revenue/master/releases/1555083334.4494643/
local/
releases/
1555083334.4494643/
mtm_rev_pred/