Monday, 3 September 2018

Setting up an environment for Monte Carlo Simulation in Docker

In this post I will walk you through installing JAGS inside a Docker container. You might be wondering why I chose Docker for this. The answer is simple: when I tried to install JAGS on my personal computer, the OS did not recognise it as trusted software, so I did not want to risk installing it there.

If you want to play with JAGS without installing it on your computer, Docker is a good option: you can experiment with the package/software in a container and then delete the container when you are done.

Now you know why I chose a Docker container for this. Let's proceed to set up an environment for Monte Carlo simulation. Make sure you have Docker installed, then follow the steps below:

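1. Open a command prompt with administrative privileges and issue the following command:
$ docker run --name mybox -t -p 8004:8004 opencpu/rstudio

The above command downloads the opencpu/rstudio image if it is not already available locally, then creates and starts a container named mybox with port 8004 published to the host.

2. If the container is not running (for example after you stop it or restart your computer), issue the command below to start it again:
$ docker container start mybox

3. Open a browser on your host computer, go to http://localhost:8004/rstudio/ and provide opencpu as both the username and the password, as shown below: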

4. Now connect to the container by issuing the command below in your command prompt, so that you can install JAGS (Just Another Gibbs Sampler), a tool that runs Gibbs sampling:

$ docker exec -i -t mybox /bin/bash

You will be taken to a terminal inside the container, as shown below:

5. Issue the commands below in the container's terminal:
$ sudo apt-get update
$ sudo apt-get install jags

6. Now go back to the browser (opened in step 3) and install the "rjags" and "runjags" packages as shown below, and you are done. You can now use this environment to create simulations using Monte Carlo methods.
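
JAGS builds Gibbs samplers for you from a model description. To give a feel for what a Gibbs sampler does, here is a minimal sketch written in Python rather than JAGS or R. The target distribution (a standard bivariate normal), the function names and all parameter values are assumptions chosen purely for illustration.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Draw samples from a standard bivariate normal with correlation rho.

    Each step samples one variable from its conditional distribution given
    the current value of the other, which is the core idea of Gibbs sampling.
    """
    rng = random.Random(seed)
    conditional_sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    samples = []
    for step in range(burn_in + n_samples):
        x = rng.gauss(rho * y, conditional_sd)  # x | y ~ N(rho * y, 1 - rho^2)
        y = rng.gauss(rho * x, conditional_sd)  # y | x ~ N(rho * x, 1 - rho^2)
        if step >= burn_in:  # discard the warm-up draws
            samples.append((x, y))
    return samples


def sample_correlation(samples):
    """Plain Pearson correlation of the sampled (x, y) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples) / n
    var_y = sum((y - mean_y) ** 2 for _, y in samples) / n
    return cov / math.sqrt(var_x * var_y)


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
print(round(sample_correlation(samples), 2))
```

The printed correlation should land close to the rho that was requested, which is a quick sanity check that the chain is sampling from the intended distribution.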

That's it so far. Stay tuned.

Wednesday, 9 May 2018

Azure IoT and IoT Edge - Part 2 (Building a Machine Learning model using generated data)

This post is part 2 of the Azure IoT Edge series. Please read part 1 first if you have not already.

In this post I will cover how we can build a logistic regression model in R using the data captured in table storage via IoT Hub.

We could run the simulated devices (all three at once) and wait for data to be generated and saved to table storage. For simplicity, however, I have created an R script that generates the data, so that I can build the model and deploy it to IoT Edge. The Edge device can then apply the Machine Learning model to the data it receives from the downstream devices.

I am using exactly the same minimum temperature, pressure and humidity as our simulated device was using. Here are a few lines of the R script:

Let’s plot the data and see what it looks like. There are only 3 fields/features, so I will plot Temperature vs Pressure using ggplot2:

Output of above R commands:

We can see that as temperature and pressure increase, the device moves away from the good devices and is more likely to be bad. For simplicity, the simulation generates higher temperature and pressure values when a device is flagged as defective.

Now let’s build a simple logistic regression model to estimate the probability of a device being defective.
I am using the caret package to build the model. Here is the code to split the data into training and test sets:

The proportion of good vs bad in the original data is 66% (good) / 33% (bad), so we make sure the split keeps roughly the same proportion and does not skew the data.

Now we apply the glm function to the data using the R script shown below:

Here is the summary of the model:

We can see from the above output that pressure is not statistically significant. However, the aim of this post is only to produce a model that we can use in the IoT Edge device.

Let’s test this model on the test data set and find the best threshold to separate bad from good. I could have used cross-validation to find the best threshold: in general, use a cross-validation set to fine-tune parameters (e.g. the threshold, or lambda if ridge regression is used).

Below is the confusion matrix when I use a threshold of 0.5:

Let’s construct a data frame that contains the actual value, the predicted value and the calculated probability, using the code below:

And view the first and last 5 records:

The higher (or closer to 1) the probability, the more likely the device is good.

With a threshold of 0.65, the confusion matrix looks like this:

Comparing the two confusion matrices, the better threshold is 0.50: it misclassifies only 4 instances, whereas 0.65 misclassifies 5.
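
The threshold comparison above can be sketched in a few lines. This is a Python illustration, not the R code used in this post, and the labels and probabilities are made-up values, so the counts it prints are not the ones from my model. It assumes, as above, that a higher probability means the device is good.

```python
def confusion_counts(actual, prob_good, threshold):
    """Count outcomes when 'good' is predicted for probability >= threshold."""
    counts = {"good_as_good": 0, "good_as_bad": 0, "bad_as_good": 0, "bad_as_bad": 0}
    for label, p in zip(actual, prob_good):
        predicted = "good" if p >= threshold else "bad"
        counts[f"{label}_as_{predicted}"] += 1
    return counts


def misclassified(counts):
    """Number of devices whose predicted class differs from the actual class."""
    return counts["good_as_bad"] + counts["bad_as_good"]


# Hypothetical labels and probabilities, invented for illustration only.
actual = ["good", "good", "good", "good", "bad", "bad", "good", "bad"]
prob_good = [0.91, 0.72, 0.60, 0.55, 0.20, 0.58, 0.88, 0.35]

for threshold in (0.50, 0.65):
    counts = confusion_counts(actual, prob_good, threshold)
    print(threshold, counts, "misclassified:", misclassified(counts))
```

Running the same loop over a finer grid of thresholds is a simple way to pick the one with the fewest misclassifications on a validation set.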

The final model is given below:

So far I have built the model. I will use it in an IoT Edge module, which will make the Edge intelligent. I will post about that soon, so stay tuned and happy IoTing :)

Tuesday, 13 March 2018

Azure IoT and IoT Edge (Part 1)

In this blog post I will walk you through how an IoT device (for IoT Hub and an IoT Edge gateway) can be created.

The simulated device will generate telemetry data that will be used by an IoT Edge module (e.g. clustering) to find out which devices need to be replaced or restarted.

I will be posting a few more blogs to achieve the setup below:

We can see from the above diagram that the main components are:
  • IoT Hub
  • Configuration of IoT Edge device as gateway
  • IoT Edge Module
  • Downstream devices

I will develop a Machine Learning model (k-means clustering) in R and use it in the MachineLearningModel Edge module to find which devices need to be replaced or restarted.

For simplicity, the downstream device will generate the following telemetry data:
  • Temperature
  • Humidity
  • Pressure

Let’s develop a downstream device that generates random values for the above telemetry. First set up your IoT Hub. I have my IoT Hub set up now, so I am creating a .NET console app that acts as a device and generates some random data.
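
To make the idea concrete, here is a minimal sketch of such a telemetry generator. It is written in Python rather than .NET, it only builds the JSON message body and does not talk to IoT Hub, and the baseline values, ranges and field names are assumptions, not the ones used by my console app.

```python
import json
import random

# Hypothetical baselines; the real simulator's minimum values are not shown here.
MIN_TEMPERATURE = 20.0
MIN_HUMIDITY = 60.0
MIN_PRESSURE = 1.0


def make_telemetry(device_id, rng):
    """Build one reading with the three telemetry fields listed above."""
    return {
        "deviceId": device_id,
        "temperature": round(MIN_TEMPERATURE + rng.random() * 15, 2),
        "humidity": round(MIN_HUMIDITY + rng.random() * 20, 2),
        "pressure": round(MIN_PRESSURE + rng.random() * 4, 2),
    }


def make_message(device_id, rng):
    """Serialise a reading as JSON, the form a device would send as a message body."""
    return json.dumps(make_telemetry(device_id, rng))


rng = random.Random(42)  # fixed seed so repeated runs produce the same readings
for _ in range(3):
    print(make_message("device1", rng))
```

Passing the random generator in as a parameter keeps the function easy to test, because a seeded generator gives repeatable readings.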

Here are some code snippets:

Here is the Main method:

Here is an example batch file to run as device1:

Now you need to register 3 devices in the Azure Portal under your IoT Hub. Here are the steps:
  • Navigate to the Azure Portal, then to your IoT Hub
  • Navigate to IoT Devices
  • Click on the Add button and fill in the details for the device as shown below:

  • Now go to the device you just created and copy its primary connection string into the respective .bat file.
  • Repeat these steps until you have created 3 devices.

Once you have created the three devices, run device1.bat, device2.bat and device3.bat. They will start sending data to IoT Hub as shown below:

And your IoT Hub will show the number of messages received, as shown below:

So far we have simulated 3 devices that send temperature and other data to IoT Hub. These devices will later send their data to the IoT Edge gateway (by appending GatewayHostName=<your-gateway-host-name> to the device connection string), which I will explain in the next blog, so stay tuned.
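
As a small illustration of that last point, the sketch below appends the gateway host name to a device connection string. It is Python rather than the .NET used for the device, the helper name is hypothetical, and the connection string values are placeholders.

```python
def with_gateway(connection_string, gateway_host_name):
    """Return the connection string with a single GatewayHostName entry appended."""
    parts = [
        part
        for part in connection_string.split(";")
        if part and not part.startswith("GatewayHostName=")
    ]
    parts.append(f"GatewayHostName={gateway_host_name}")
    return ";".join(parts)


# Placeholder values only; use the primary connection string copied from the portal.
device = "HostName=<your-hub>.azure-devices.net;DeviceId=device1;SharedAccessKey=<key>"
print(with_gateway(device, "<your-gateway-host-name>"))
```

Dropping any existing GatewayHostName entry first means the helper can be applied twice without producing a duplicate setting.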

Thursday, 18 May 2017

Exploring SparkR using Databricks environment

In this exploration I will share what I have learnt so far about R with Spark. Spark, as you all know, is a distributed computing framework. It allows you to program in Scala/Python/Java, and now in R, to perform distributed computation.

I previously implemented gradient descent in Hadoop to understand how we can parallelize the gradient computation. Please read that post (Implementing Gradient Descent Algorithm in Hadoop for large scale data) to understand the mathematics behind it.

Now I am implementing the same gradient descent algorithm in SparkR using Databricks community edition. You might be wondering why I am implementing it again :)

I always start with the knowledge I already have and then use that knowledge to learn a new language. The Gradient Descent algorithm fits best here. We also learn a couple of things while implementing GD, such as:

  • How we break a big loop across a cluster of computers
  • How we transform data in parallel
  • How we share/send common variables/values to worker nodes
  • How we aggregate results from worker nodes
  • Finally, how we combine those results

If your algorithm has to iterate over millions of records or more, it is worth parallelizing. Almost any computation you do will involve the same sort of steps as I outlined above. I can use these high-level tasks to build a complex Machine Learning model, such as model ensembling or model stacking.

Please write in the comments if you have items other than the ones I have listed above :) so that I can learn from you as well.

Now I have talked too much, let's do some coding :)

You need to sign up first. Once you have done that you can follow along.
Now navigate to the Databricks community edition home page, as shown below:

First you need to create a cluster: click on Clusters > Create Cluster. Use Spark 2.1 (Auto-updating, Scala 2.10).

Next, upload your data to the cluster. To do this, click on Tables > Create Table and you will be presented with the screen below:

Click on the “Drop file or click here to upload” section and upload your file. Once you have uploaded the file it will show you the path. Note that path down somewhere.

Now create a notebook by navigating to Workspace, clicking on the dropdown and selecting Create > Notebook, as shown below:

And provide a name for the notebook. I called mine “SparkR-GradientDescent”.

Make sure you have selected R as the language. Click on the Create button to create the notebook. Now navigate to your newly created notebook and start writing R code :)

We now need to load the data. Remember that we are running R code in Spark so we need to use read.df (from SparkR package) to load data into a SparkDataFrame (not data.frame).

Note that all the above methods look similar to their base R counterparts, but they come from the SparkR package. All these methods understand the SparkDataFrame object. Let's run the code below to see the structure of the object:

Now run the code below:

You can see that the two are different objects.

Now I define a function that calculates the partial gradient, so that we can compute it on the worker nodes and get the result back to the driver program.

Here is the code:

Now we write code that gets the worker nodes to calculate partial gradients on each partition, collects the calculated data and updates our thetas, using the code below:

Here is the result of above code:

A few things to note in the above code:

  1. We cache the data in memory (using cache(data)) so that Spark does not need to load the data from storage in each iteration.
  2. We define a schema because dapply needs to transform an R data.frame object into a SparkDataFrame with the provided schema.
  3. We perform some calculations (the partial gradient) on each partition using dapply. In other words, we tell Spark to run the given function on each partition residing on the worker nodes.
  4. Each worker node gets a shared variable/object. In Spark with Scala we had to broadcast the variable.
  5. We collect data from the worker nodes as an R data.frame object using the collect method.
  6. We update theta, and the new value will be available to each worker in the next iteration.
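
The same loop can be sketched without Spark to show how the pieces fit together. This is a plain Python illustration, not SparkR: the list comprehension stands in for dapply running on the workers, the sums stand in for collect on the driver, and the data, learning rate and iteration count are assumptions for illustration.

```python
def partial_gradient(partition, theta):
    """Gradient sums for one partition: (sum0, sum1, number of rows)."""
    sum0 = sum1 = 0.0
    for x, y in partition:
        error = theta[0] + theta[1] * x - y
        sum0 += error
        sum1 += error * x
    return sum0, sum1, len(partition)


def gradient_descent(partitions, alpha=0.1, iterations=2000):
    """Fit y = theta0 + theta1 * x by batch gradient descent over partitions."""
    theta = [0.0, 0.0]
    for _ in range(iterations):
        # Worker side: one call per partition, each seeing the current theta.
        partials = [partial_gradient(partition, theta) for partition in partitions]
        # Driver side: aggregate the collected partial results.
        sum0 = sum(p[0] for p in partials)
        sum1 = sum(p[1] for p in partials)
        m = sum(p[2] for p in partials)
        # Updated theta is what the next iteration shares with the workers.
        theta = [theta[0] - alpha * sum0 / m, theta[1] - alpha * sum1 / m]
    return theta


# Hypothetical data on the line y = 1 + 2x, split into three partitions.
data = [(x / 10.0, 1.0 + 2.0 * x / 10.0) for x in range(30)]
partitions = [data[0:10], data[10:20], data[20:30]]
theta = gradient_descent(partitions)
print(theta)
```

Because the gradient is a sum over rows, splitting the rows into partitions and adding up the partial sums gives the same update as a single pass over all the data.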

You can view the available functions in the SparkR package documentation.
Finally, we can validate our estimated coefficients using the lm function in R (running locally on my machine):

We can validate our calculation on sample data so that it is easy to debug. We can see that the estimated coefficients are close to what the lm model gave me. If we increase the number of iterations, the thetas get even closer.

I hope this post helps you understand SparkR. Please provide your feedback if I missed anything.

That’s it for now. Enjoy coding :)

Saturday, 27 August 2016

Implementing Gradient Descent Algorithm in Hadoop for large scale data

In this post I will explore how we can use MapReduce to implement the Gradient Descent algorithm in Hadoop for large-scale data. As we know, Hadoop is capable of handling petabyte-scale data.

In this article I will be using the following concepts:
  • Gradient Descent algorithm
  • Hadoop Map Reduce Job
  • Writable
  • Reading from HDFS via Hadoop API
  • Hadoop Counter

Before starting, we first need to understand what Gradient Descent is and where we can use it. Below is an excerpt from Wikipedia:
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.
Gradient descent is also known as steepest descent, or the method of steepest descent. Gradient descent should not be confused with the method of steepest descent for approximating integrals.

If you look at the algorithm, it is an iterative optimisation algorithm. So if we are talking about millions of observations, we need to iterate over those millions of observations and adjust our parameter (theta) on every pass.

Some mathematical notations:
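
A minimal sketch of the usual notation, assuming univariate linear regression with parameters θ0 and θ1, learning rate α and m observations. This is an assumption on my part, chosen because it is consistent with the sum0 and sum1 values that the mapper emits later in this post:

```latex
h_\theta(x) = \theta_0 + \theta_1 x

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)

\theta_1 := \theta_1 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
```

Both update rules contain a sum over all observations. A sum over all observations equals the sum of the per-split sums, which is what makes the computation easy to distribute.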



Now the question is: how can we leverage Hadoop to distribute the workload, minimize the cost function and find the theta parameter?

The MapReduce programming model comprises two phases: 1. Map and 2. Reduce, as shown in the picture below. Hadoop lets the programmer focus only on the map and reduce phases; the rest of the workload is taken care of by Hadoop. Programmers do not need to think about how the data is going to be split, and so on.

Please read up on the MapReduce framework to learn more about it.

When a user uploads data to HDFS, the data is split and saved across various data nodes.
Now we know that Hadoop will provide a subset of the data to each mapper, so we can program our mapper to emit a serializable PartialGradientDescent object. For instance, if one split has 50 observations, that mapper will emit 50 partial gradient descent objects.

One more thing: there is only ONE reducer in this example, so the reducer will receive the whole lot of data. It would be better to introduce a combiner so that the reducer receives fewer PartialGradientDescent objects, or you can apply the in-memory combining design pattern for MapReduce, which I will cover in the next post.

Now let’s get into the Java MapReduce program. Before reading further, it would help to understand the Writable concept in Hadoop and some matrix algebra.

Mapper code:

We can see that the map task emits a PartialGradientDescent object carrying several pieces of information, such as sum0, sum1 and a count of 1. This information is required in the reducer to update theta.

Now let's have a look at reducer code:

We can see from the reducer code that we sum up all the given partial gradients. This can be improved by supplying a combiner that does a partial sum before the data reaches the reducer. For instance, if we have 50 mappers, the combiner will sum the output of each mapper before sending it on, in which case the reducer will receive only 50 partial gradient objects.
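
The effect of a combiner can be sketched outside Hadoop. This is a Python illustration, not the Java code of this post: PartialGradient is a stand-in for the custom Writable, and the data, starting theta and learning rate are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class PartialGradient:
    """Python stand-in for the custom Writable: running sums plus a count."""
    sum0: float
    sum1: float
    count: int

    def merge(self, other):
        return PartialGradient(
            self.sum0 + other.sum0, self.sum1 + other.sum1, self.count + other.count
        )


def map_split(split, theta):
    """Mapper: emit one PartialGradient per observation in the split."""
    for x, y in split:
        error = theta[0] + theta[1] * x - y
        yield PartialGradient(error, error * x, 1)


def combine(partials):
    """Combiner (and the core of the reducer): fold many partials into one."""
    total = PartialGradient(0.0, 0.0, 0)
    for partial in partials:
        total = total.merge(partial)
    return total


def reduce_and_update(partials, theta, alpha):
    """Reducer: sum whatever arrives, then take one gradient step."""
    total = combine(partials)
    return [
        theta[0] - alpha * total.sum0 / total.count,
        theta[1] - alpha * total.sum1 / total.count,
    ]


# Hypothetical data: 3 splits of 50 observations on the line y = 3 + 0.5x.
splits = [[(float(x), 3.0 + 0.5 * x) for x in range(s, s + 50)] for s in (0, 50, 100)]
theta, alpha = [0.0, 0.0], 0.0001

without_combiner = [p for split in splits for p in map_split(split, theta)]
with_combiner = [combine(map_split(split, theta)) for split in splits]

print(len(without_combiner), len(with_combiner))
print(reduce_and_update(without_combiner, theta, alpha))
print(reduce_and_update(with_combiner, theta, alpha))
```

The reducer logic does not change at all; only the number of objects it has to receive drops from one per observation to one per split.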

And here is the custom Writable (i.e. PartialGradientDescent):

The last piece of the puzzle is the driver program, which triggers the Hadoop job once for each iteration you need.

That's it for now. Stay tuned.

Tuesday, 15 March 2016

Timer job – from on premise to Cloud World (Azure) using WebJob

The SharePoint Timer service runs in the background to perform long-running tasks. It carries out some important SharePoint clean-up tasks, but it can also be used for useful functional tasks. For instance, you may want to send newsletters to your users on a regular basis, or keep your customers up to date with regular timed information.

This is part two of the series "From SharePoint On-Premise to Office365". Please read the first post of the series before this one.

For this demo I will be using a timer job to send an email to newly registered customers/users. The newly registered customers/users are stored in a SharePoint list, with a status field capturing whether an email has been sent or not.
There are several implementation choices for hosting such a timer job:
  1. Azure Web Job
  2. Azure Worker Role
  3. Windows Service (can be hosted on premise or on a VM in the cloud)
  4. Task Scheduler (hosted on premise)
I am choosing WebJob as it is free of cost and I can reuse my console application as a WebJob. Please check why to choose a WebJob.

An Azure web job does not live on its own; it sits under an Azure Web App. For this purpose I am going to create a dummy web app to host my Azure web job. I will be hosting all my CSOM code in this web job.

There are two types of web job:
  • Continuous: best fit for queuing applications, where the job keeps receiving messages from a queue.
  • On Demand: can be scheduled to run hourly, weekly, monthly, etc.
The web job hosts and executes CSOM code that gets information about the users/customers from SharePoint in order to send the email. The following code snippets show what the web job is doing:

Querying SharePoint using CSOM and a CAML query:

Sending the email using Office365 Web Exchange:

Composing the email using the Razor Engine templating engine:

And finally updating the SharePoint list item using CSOM:

You can download the full source code from Codeplex.
When writing a web job, consider the following points to make it diagnosable and reusable:
  1. Do not swallow exceptions. Handle the exception first, then rethrow it to let the web job know that something went wrong.
  2. Try to use interfaces so that dependencies can be mocked for unit testing.
  3. Always log major steps and errors using Console.WriteLine etc.
  4. Write your code so that it can also run as a console application, so that it can be used with Task Scheduler.
  5. Try to avoid hardcoding. Maximise the use of configuration; settings can be supplied from the Azure portal as well.
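
The points above can be sketched in a few lines. This is a Python illustration of the structure, not the C# code of the web job: the interface names, the in-memory fakes and the WELCOME_SENDER setting are all hypothetical, and no SharePoint or email service is contacted.

```python
import os
from typing import Protocol


class CustomerStore(Protocol):
    """What the job needs from the customer list (SharePoint in this post)."""
    def pending(self) -> list: ...
    def mark_sent(self, customer_id: int) -> None: ...


class Mailer(Protocol):
    def send(self, address: str, body: str) -> None: ...


def run_job(store: CustomerStore, mailer: Mailer, log=print) -> int:
    """Email every pending customer, log each step and rethrow on failure."""
    sender = os.environ.get("WELCOME_SENDER", "The Team")  # hypothetical setting
    sent = 0
    log("Job started")
    try:
        for customer in store.pending():
            mailer.send(customer["email"], f"Welcome {customer['name']}, from {sender}")
            store.mark_sent(customer["id"])
            sent += 1
            log(f"Email sent to customer {customer['id']}")
    except Exception as error:
        log(f"Job failed: {error}")
        raise  # rethrow so the host records the run as failed
    log(f"Job finished, {sent} email(s) sent")
    return sent


class InMemoryStore:
    """Fake store for tests; a real one would query the SharePoint list."""
    def __init__(self, customers):
        self.customers = customers

    def pending(self):
        return [c for c in self.customers if not c["email_sent"]]

    def mark_sent(self, customer_id):
        for customer in self.customers:
            if customer["id"] == customer_id:
                customer["email_sent"] = True


class RecordingMailer:
    """Fake mailer that records messages instead of sending them."""
    def __init__(self):
        self.outbox = []

    def send(self, address, body):
        self.outbox.append((address, body))


store = InMemoryStore([
    {"id": 1, "name": "Ann", "email": "ann@example.com", "email_sent": False},
    {"id": 2, "name": "Bob", "email": "bob@example.com", "email_sent": True},
    {"id": 3, "name": "Cy", "email": "cy@example.com", "email_sent": False},
])
mailer = RecordingMailer()
print(run_job(store, mailer))
```

Because the store and the mailer are passed in, the same function can run inside a web job, from a scheduled console application, or in a unit test with fakes like the ones above.
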
It is time to publish this web job. There are lots of articles out there on how to create a schedule for a web job. I will simply use Visual Studio to create the schedule before publishing. In Visual Studio, right-click the project and click “Publish as Azure Web Job…”, which launches a UI to specify your schedule, as shown below:
Schedule settings
That’s it. Happy SharePointing :)

Tuesday, 7 April 2015

From SharePoint On-Premise to Office365

In this post I will show you how you can convert your SharePoint farm solutions (.wsp) into solutions that work with Office365/cloud.
The following road map from Microsoft shows how Microsoft is transforming the Office product for every platform:

Picture copied from Microsoft Office365 Developer site.
It looks like we now have a wider audience (developers), or we could say that we have more options for developing solutions that target Microsoft Office products.
The following are the common tasks that we usually do when we develop SharePoint farm solutions:
1. Site Definition
2. Site creation
3. Item Receiver code
4. Feature (i.e. to create site columns, content types, lists or document libraries etc)
5. Site Columns and Content Types
6. List or document library creation
7. Workflow
8. File upload using Module
9. Timer job
10. Querying sites, lists and other SharePoint objects
11. Item creation in lists
12. Branding (customising master pages, page layouts etc)
13. WebPart development
To make farm solution code cloud compatible, we have the following choices for hosting the code:
1. ASP.NET MVC application
2. ASP.NET Web-Form application
3. Console Application (for continuous integration environments)
4. Windows Phone/Tablet client
5. PHP application
6. Android / iOS application
I will start with a console application that queries SharePoint objects from an on-premise SharePoint farm as well as from Office365. This way we can prepare ourselves for the cloud.
Let’s start writing some code that works in both environments.
using System;
using System.Net;
using System.Security;
using Microsoft.SharePoint.Client;

bool forCloud = true;

Console.WriteLine("Connecting to Office365 at");

// Open connection to the Office365 tenant; pass your site URL here
ClientContext clientContext = new ClientContext("");
clientContext.AuthenticationMode = ClientAuthenticationMode.Default;

if (forCloud)
{
   // Build a SecureString from the plain-text password
   SecureString password = new SecureString();
   foreach (char c in Office365Password)
   {
      password.AppendChar(c);
   }

   clientContext.Credentials = new SharePointOnlineCredentials(Office365UserId, password);
}
else
{
   // Comment this line if you want to use your default network credential
   clientContext.Credentials = new NetworkCredential("UserName", "Password", "Domain");
}

Console.WriteLine("Executing query...");

// Request the web and its lists, then send the query to the server
Web web = clientContext.Web;
ListCollection lists = web.Lists;
clientContext.Load(web);
clientContext.Load(lists);
clientContext.ExecuteQuery();

// Display the title of the web
Console.WriteLine("Web title: {0}", web.Title);

Console.WriteLine("Loading lists...");
foreach (List list in lists)
{
   Console.WriteLine("List title: {0}", list.Title);
}


For on-premise, just set forCloud to false; the rest of the code is the same for both environments. After executing, it will show the web site name and all lists within the site.
Using this simple technique we can develop an app part, which I will blog about soon, to show aggregated data from various lists in Office365.
In the next post, I will show you how we can leverage an ASP.NET MVC application to host SharePoint CSOM code that will be used to create a site definition and other things.

That's it for now. Please leave your valuable feedback.