Thursday, 18 May 2017

Exploring SparkR using Databricks environment

In this exploration I will share what I have learnt so far R with Spark. Spark, as you all know, is a distributed computing framework. It allows you to program in Scala/python/Java and now in R for performing distributed computation.

I implemented gradient descent in Hadoop to understand how we are going to parallelize gradient computation. Please have a read about it at for understanding mathematics behind it.

Now I am implementing same gradient descent algorithm in SparkR using Databricks community edition. You might be wondering why I am implementing it again J

I always start with the knowledge I have right now then I use those knowledge to learn new language. For this instance Gradient Descent algorithm bets fit here. Also we learn couple of things while implementing GD like:

  •      How we break big loop into cluster of computers
  •      How we are transforming data in parallel
  •      How we share/send common variables/values to worker nodes
  •      How we are aggregating results from worker nodes.
  •      Finally combining those results

If your algorithm has to iterate over millions or more records then it is worth parallelizing it. Any computation you do, you will almost be doing same sort of things as I outlined above. I can use above high-level tasks mentioned above to build a complex Machine Learning model like ensembling models or model stacking etc.

Please write in comments if you have other items than I have listed above
J to learn from you as well.

Now I have talked too much, let's do some coding J

You need to sign-up at first. Once you have done it you can follow it.
Now, navigate to databricks community edition home page like shown below:

First you need to create a cluster first, click on Clusters > Create Cluster to create a cluster. Use Spark 2.1 (Auto-updating, Scala 2.10)

Next, upload your data to cluster. To do this, click on Tables > Create Table you will be presented like below screen:

Click on “Drop file or click here to upload” section and upload your file. Once you have uploaded the file it will show you the path. Note that path to somewhere.

Now, create a notebook by navigating to Workspace and click on dropdown and select Create > Notebook like shown below:

And provide the name for the notebook. I called “SparkR-GradientDescent”

Make sure you have selected R as language. Click on Create button to create the notebook. Now navigate to your newly created notebook and start writing R code J

We now need to load the data. Remember that we are running R code in Spark so we need to use read.df (from SparkR package) to load data into a SparkDataFrame (not data.frame).

Note that all above methods are similar but they are from SparkR package. All these methods understand SparkDataFrame object. Let's run below code to see the structure of the object:

Now run below code:

You can see both are two different object.

Now, I define a method that calculates partial gradient so that we can compute it on worker nodes and get the result back to driver program.

Here is the code:

Now, we write code that initiate worker nodes to calculate partial gradients on each partition, collect those calculated data and update our thetas using below codes:

Here is the result of above code:

Few things to note in above code:

  1. We are caching (using cache(data)) data in memory so that in each iteration Spark does not need to load data from storage.
  2. We are defining schema because dapply needs to transform an r data.frame object to SparkDataFrame with provide schema
  3. We are performing some calculations (partial gradient) on each partition using dapply.  So we are telling spark to run given function on each partition residing on worker nodes.
  4. Each worker nodes are getting a shared variable/object. In Spark-scala we had to broadcast the variable.
  5. We are collecting data from worker nodes as r data.frame object using collect method.
  6. Updating theta and that will be available to each worker in next iteration.

You can view available functions in SparkR package at
Finally we can validate our estimated coefficient using lm package in R (running locally on my machine)

We can validate our calculation on sample data so that we can debug it easily. We can see that estimated coefficients are close to what lm model gave me. If we increase number of iterations we can get thetas close to it.

I hope that this post will help you understanding SparkR. Please provide your feedback if I missed anything.

That’s it for now. Enjoy coding :)

Saturday, 27 August 2016

Implementing Gradient Descent Algorithm in Hadoop for large scale data

In this post I will be exploring how can we use MapReduce to implement Gradient Descent algorithm in Hadoop for large scale data. As we know Hadoop is capable of handling peta-byte scale/size of the data.

In this article I will be using following concept:
  • Gradient Descent algorithm
  • Hadoop Map Reduce Job
  • Writable
  • Reading from HDFS via Hadoop API
  • Hadoop Counter

Before starting, first we need to understand what is Gradient Descent and where can we use it. Below is an excerpt from Wikipedia:
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.
Gradient descent is also known as steepest descent, or the method of steepest descent. Gradient descent should not be confused with the method of steepest descent for approximating integrals.

If you look at the algorithm, it is an iterative optimisation algorithm. So if we are talking about millions of observations, then we need to iterate those millions of observations and adjust out parameter (theta).

Some mathematical notations:



Now, the question is how can we leverage Hadoop to distribute the work load to minimize the cost function and find the theta parameter?

MapReduce programming model comprises two phases. 1 Map, 2. Reduce shown in below picture. Hadoop gives programmer to only focus on map and reduce phase and rest of the workload is taken care by Hadoop. Programmers do not need to think how I am going to split data etc.

Please visit to know about MapReduce framework.

When user uploads data to HDFS, the data are splited and saved in various data nodes.
Now we know Hadoop will provide subset of data to each Mapper. So we can program our mapper to emit PartialGradientDescent serializable object. For instance if one split has 50 observations, then that mapper will return 50 partial gradient descent objects.

One more thing, there is only ONE reducer in this example, so reducer will get whole lot of data, it would be better to introduce combiner so that reducer will get low number of PartialGradientDescent objects or you can apply in-memory combining design pattern for MapReduce which I will cover in next post.

Now let’s get into java map reduce program. Before reading further it would be better you understand the Writable concept in Hadoop and some matrix algebra.

Mapper code:

We can see that map task is emitting partialGradientDescent object with lot of information. Like sum0, sum1 and 1. These information will be required in reducer to update the theta.

Now let's have a look at reducer code:

We can see from Reducer code that we are summing up all given partial gradients. This can be improved if we supply combiner that does some partial sum before reaching to reducer. For instance if we have 50 mapper, then after each mapper the combiner will sum and send to reducer in that case reducer will get 50 partial gradient objects.

and custom writable (ie. PartialGradientDescent)

and the last piece of the puzzle is the Driver program that trigger the Hadoop job based on number of iterations you need.

That's it for now. Stay tuned.

Tuesday, 15 March 2016

Timer job – from on premise to Cloud World (Azure) using WebJob

The SharePoint Timer service runs in background to do long running tasks. The Timer service does some important SharePoint clean up tasks in the background but can also be used to provide useful functional tasks. For instance, there may be  a situation when you want to send newsletters to your users on regular basis or want to keep your customer up to date with  some regular timed information.

This is part two of the series. Please read first post at of series "From SharePoint On-Prem to Office365"

I will be using SharePoint Timer Service to send an email to newly registered customers/users for this demo. The newly registered customers/users are stored in SharePoint list with a status field capturing whether an email has been sent or not.
There are some implementation choices when developing a SharePoint Timer service:
  1. Azure Web Job
  2. Azure Worker Role
  3. Windows Service (can be hosted on premise or vm on Cloud)
  4. Task Scheduler (hosted on premise)
I am choosing WebJob as it is free of cost and I can leverage my Console application as WebJob. Please check why to choose Web Job.

Azure web job does not live it its own. It sits under Azure Web Apps. For this purpose I am going to create a dummy web app and host my Azure web job. I will be hosting all my CSOM code in this web job.

There are two types of web job:
  • Continuous best fit for queuing application where it keeps receiving messages from queue.
  • On Demand can be scheduled for hourly, weekly and monthly etc.
The Web Job is used to host and execute CSOM code to get information about the user/customers from SharePoint to send email. Following code snippets show what web job is doing:

Querying SharePoint using CSOM and CAML Query:

Sends Email using Office365 Web Exchange:

Composing email using Razor Engine templating engine:

And finally update SharePoint list item using CSOM:

You can download full source code from Codeplex:
When writing a Web Job, the following points should be considered to make your web job diagnosable and reusable:
  1. Do not absorbs exceptions. Handle it first throw it to let web job know something went wrong.
  2. Try to use interfaces so that it can be mocked for unit testings
  3. Always log major steps and errors using Console.WriteLine etc
  4. Make your code like it can be used as console application so that it can be used in Task scheduler
  5. Try to avoid hardcoding. Maximise the use of configuration. It can be plugged from Azure portal as well.
It is time to publish this web job. There are lots of article out there how to create schedule for the web job. I would simply be using Visual Studio to create the schedule before publish it. On Visual Studio, right click the project and click “Publish as Azure Web Job…” and it will launch a UI to specify your schedule as shown below:
Schedule settings
That’s it. Happy SharePointing :)

Tuesday, 7 April 2015

From SharePoint On-Premise to Office365

In this post I will show you how you can convert you SharePoint farm solutions (.wsp) to solution that works with Office365/cloud.
Following is the road map from Microsoft that shows how Microsoft is transforming Office product for every platform:

Picture copied from Microsoft Office365 Developer site.
It looks like we have now more audiences (Developers) or we can say that we have more options to develop solutions that target Microsoft Office product.
Following are the common tasks that we usually do when we develop SharePoint Farm solutions:
1.      Site Definition
2.      Site creation
3.      Item Receiver code
4.      Feature (I.e. To create site columns, content types, list or document library etc)
5.      Site Columns and Content Types
6.      List or document library creations
7.      Workflow
8.      File upload using Module
9.      Timer job
10.   Querying sites, lists and other SharePoint objects
11.   Item creations in list
12.   Branding (customising master page, page layouts etc)
13.   WebPart development
To convert Farm solutions code to cloud compatible we have got following choices to host code:
1.      ASP.NET MVC application
2.      ASP.NET Web-Form application
3.      Console Application (for continues integration environment)
4.      Windows Phone/Table client
5.      Php application
6.      Android / iOS application
I will start with Console application that query SharePoint objects from On-Premise SharePoint farm as well as Office365. By doing this way we can prepare ourselves for cloud.
Let’s start writing some code that works in both environment.
bool forCloud = true;

Console.WriteLine("Connecting to Office365 at");

// Open connection to Office365 tenant
ClientContext clientContext = new ClientCon-text("");
clientContext.AuthenticationMode = ClientAuthenticationMode.Default;

if (forCloud)
   //creating secure string
   SecureString password = new SecureString();
   foreach (char c in Office365Password)

   clientContext.Credentials = new SharePointOnlineCredentials(Office365UserId, password);
   //Comment this line if you want to use your default network credential
   clientContext.Credentials = new NetworkCredential("UserName", "Password", "Domain");

Console.WriteLine("Executing query...");

//load web
Web web = clientContext.Web;

//displaying title of the web.
Console.WriteLine("Web title: {0}", web.Title);

Console.WriteLine("Loading lists...");
ListCollection lists = web.Lists;

foreach (List list in lists)                
  Console.WriteLine("List title: {0}", list.Title);                


For on premise just set onCloud to false and rest of the code would be same for both environment. After executing it will show web site name and all lists within the site.
Using this simple technique we can develop app-part, which I will blog soon, to show aggregated data from various lists in Office365.
In next post, I will show you how we can leverage ASP.NET MVC application to host SharePoint CSOM code that will be used to create site definition and other stuff.

That's it for now. Please leave your valuable feedback.

Wednesday, 18 March 2015

Deploy Content Type with Lookup field

In this post I am going to show how we can deploy content type declaratively (through xml) with fields using known GUIDs.

One of the challenge with the look up field is that it binds to Lookup list via ListId (i.e. GUID). There are couple of ways you can achieve this. One is to use SharePoint Object Model to create fields and content types via Feature code.

First you need to create the lookup list via site scope feature. You might be thinking why I am creating a list in site scope feature? The reason is simple I just want to create the lookup list in root web of the site collection and that will be accessible to site columns.

Here is the project structure shown in image:

The LookupListProvisioner feature receiver will create a list that can be used as look up field. Here is the code:

Next, you need to create site column using SiteColumnsProvisioner feature receiver using below code:

As you can see in the code, we are using existing field GUID that can be used in Content Type. In ContentTypeProvisioner feature, I have used Elements.xml file to create the content type like shown below:

As you can see in above code, I have used pre-defined ContentTypeId and Field Guid. I have written a tool, that generates xml for fields and content types from SharePoint environment, can be found at
That’s it for now. Happy coding J

Monday, 24 November 2014

PowerView - Getting Started

In this post I am going to show you how you can use PowerView using basic data model. In next post I will demonstrate how you can use PowerView to visualise data from SQL Server, Analysis Service or other data sources.

Assume we have following data in excel called “CountryNStates” as shown below image:

Now add these tables into Data Model using option available from ribbon as shown below image:

Once all your tables are added to data model, you can design a Data Model with proper relationship. Below is shown for above two tables:

Now it’s time to create some PowerView report.

To add a PowerView report in Excel, go to “Insert” tab and click on PowerView icon. You will see a new sheet for PowerView report. You should also see all the tables you created in the data model.

Here is the snapshot of the report:

Without doing too much, PowerView provides, filters, auto measures (i.e. SUM) etc. In above report the numbers are not formatted so I will show you how to format numbers etc.

Go back to Data Model design page by clicking on Manage button under “POWERPIVOT” tab and use formatting tab as shown below image:

When you update it, Excel will prompt you in the report section that you have updated you Data Model. Now refresh the report you will see the numbers are formatted.

Now let’s add a measure that will show the percentage contribution to sales. I will call it “% Sale”. Here we will be using DAX (Data Analysis eXpression) to achieve this.

To create a measure, you need to go back to Data Modeling page. The formula for percentage would be like below:

% Sale of Queensland = ( Total Sales for Queensland / Total Sales in Australia ) * 100

Now let’s calculate it.

Below are measures in DAX:

Total Sales:=CALCULATE(SUM(Sales[Sales]), ALL(Sales))

% of Sales:=SUM(Sales[Sales]) / Sales[Total Sales]

Just make sure you need to display it like percentage. You can do this by right clicking on the measure and selecting formatting option then select “Percentage” option.

Once you have done this, you should see following measures in Sales table under measures area as shown below:

The beauty of % Sale measure is that it knows the context so we don’t need to say calculate % sale for Queensland etc.

Now add above % sales in the report. Here is the final report after adding % sales.

Please leave your comments if you like or dislike it. In next post I will show you how you can create BI Semantic Model using Visual Studio and deploy it to Analysis Service – TABULAR and consume it from PowerView in SharePoint environment.

That’s it for now. Happy SharePointing and PowerViewing J

Sunday, 3 August 2014

Reading Settings from a SharePoint List in Nintex Workflow

Recently one of my client asked me to create a notification email service to send email for content refresh after certain period (x number of days). But I was not allowed to write any server side otherwise I could have written SharePoint Service.

As I was not allowed to write server side code, so I thought to create a Nintex workflow as they are using it in their environment. If anyone has got any other technique for this task please comment it.

Here is my setting list:

and Edit form

Note: I am using KeyName as a choice so that user cannot modify it by mistake.

And below is a snapshot of my test workflow that only logs the setting value:


When you configure Querying Workflow Settings list, from there you can create a workflow variable that will hold the setting value. For each setting value, you need a workflow variable.

Here is the snapshot of Workflow Variables popup:

And below is the screenshot of the configuration of Query List Action:


In Filter area, you need to specify the exact match so that Query List Action can fetch the values for that setting.

For instance, I have got a key name in the list as “FrequencyInDays” as you can see in the screenshot of the settings list.

Once you have filtered the key, you need to select the field, which has got the value for that setting and click “Add” button. After clicking on add button, that field name (i.e. KeyValue) will appear at the bottom as I have highlighted in red and then you have to select the workflow variable to hold the value from that variable.

By using this technique, you can read all required settings value from the list once a workflow has initialised or can be queried before using it in other workflow actions.

That’s it.