
Monday, 22 October 2018

Use of Power BI PushDatasets


In Power BI, real-time streaming can be consumed in three different ways (see https://docs.microsoft.com/en-us/power-bi/service-real-time-streaming ):

  •  Consuming streaming data from Azure Stream Analytics
  •  Consuming streaming data from PubNub
  •  Consuming streaming data via Push datasets


If you have sensors that emit data and you want to visualize it in real time, you can use Power BI in conjunction with Azure Stream Analytics to build the dashboard. But if the data arrives less frequently and you just want a dashboard that auto-refreshes, you can use any one of the three methods. In this post I will show how Push datasets can be used to develop a dashboard.

Below is a simple architecture for this post:













We need the following components to build the complete end-to-end solution shown in the diagram above:

  1. Data Generator (to simulate data arriving in the database every x interval). This could be any source of yours that generates data infrequently.
  2. Console App (that pushes data to PowerBI PushDataset)
  3. PowerBI Dashboard


I have created sample code that generates some data and inserts it into the database. Please see the code snippet below:
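A minimal C# sketch of such a generator (the DeviceTelemetry table, its columns and the 30-second interval are illustrative assumptions, not the exact code):

using System;
using System.Data.SqlClient;
using System.Threading;

class DataGenerator
{
    static void Main()
    {
        var connectionString = "<your-sql-connection-string>";
        var random = new Random();

        while (true)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(
                "INSERT INTO dbo.DeviceTelemetry (DeviceId, Temperature, Humidity, CreatedAt) " +
                "VALUES (@deviceId, @temperature, @humidity, SYSUTCDATETIME())", connection))
            {
                command.Parameters.AddWithValue("@deviceId", "device" + random.Next(1, 4));
                command.Parameters.AddWithValue("@temperature", 20 + random.NextDouble() * 15);
                command.Parameters.AddWithValue("@humidity", 40 + random.NextDouble() * 30);

                connection.Open();
                command.ExecuteNonQuery();
            }

            // Simulate data arriving at the database every x interval.
            Thread.Sleep(TimeSpan.FromSeconds(30));
        }
    }
}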



The console application will use the Power BI REST API to push data. For this reason, the console app needs to authenticate and be authorized with Power BI, so you need to register the app with Azure AD in order to get an OAuth token from the Azure authorization server. Please follow https://docs.microsoft.com/en-us/power-bi/developer/walkthrough-push-data-register-app-with-azure-ad to register your app.

OAuth provides four different flows based on how you want to authenticate/authorize your application with the authorization server. Please visit https://auth0.com/docs/api-auth/which-oauth-flow-to-use to find out which flow is best suited to your scenario.

I will be using the Client Credentials flow (described at https://oauth.net/2/grant-types/client-credentials/ ) because the console app is treated as a trusted application and there is no human interaction, so it would not be able to deal with any authorization popup that appeared.

Below is the code to get an OAuth token using version 2.29.0 of the Microsoft.IdentityModel.Clients.ActiveDirectory NuGet package:
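A sketch of the token acquisition; the tenant ID, client ID and client secret come from the app registration above:

using Microsoft.IdentityModel.Clients.ActiveDirectory;

static string GetPowerBiAccessToken()
{
    // Azure AD tenant and the app registration created in the walkthrough above.
    var authority = "https://login.microsoftonline.com/<your-tenant-id>";
    var resource = "https://analysis.windows.net/powerbi/api";   // Power BI resource URI

    var authenticationContext = new AuthenticationContext(authority);
    var clientCredential = new ClientCredential("<client-id>", "<client-secret>");

    // Client credential flow: no user interaction required.
    AuthenticationResult result = authenticationContext.AcquireToken(resource, clientCredential);
    return result.AccessToken;
}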



I treat this scenario as syncing two systems (from a source to a target, where the target happens to be Power BI). In most syncing solutions we need to keep track of what has been synced so far, so that next time the system picks up only the delta.
For this purpose we are using the ROWVERSION data type, which is auto-generated by the database. Please visit https://www.mssqltips.com/sqlservertip/4545/synchronizing-sql-server-data-using-rowversion/ to see how rowversion can be used in a syncing scenario.

To maintain what has been synced, I have created a table that keeps track of the last row version the console application has sent to Power BI for each source table, as shown below:
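A possible shape for such a tracking table (the table and column names here are illustrative):

-- Sync-tracking table: one row per source table being pushed to Power BI.
CREATE TABLE dbo.SyncTracking
(
    TableName       SYSNAME    NOT NULL PRIMARY KEY,
    LastRowVersion  BINARY(8)  NOT NULL,
    LastSyncedAtUtc DATETIME2  NULL
);

-- Seed the tracked table with the minimum row version for the first run.
INSERT INTO dbo.SyncTracking (TableName, LastRowVersion)
VALUES ('dbo.DeviceTelemetry', 0x0000000000000000);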












For the first run, the last row version should be 0x000 (all zeros).

I also created a stored procedure that returns the delta using the last row version and the table name. Below is the stored procedure code:
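A sketch of what such a procedure can look like; for simplicity this version targets the DeviceTelemetry table directly instead of taking the table name as a parameter, and it assumes a ROWVERSION column called RowVer:

CREATE PROCEDURE dbo.GetDeviceTelemetryDelta
    @LastRowVersion BINARY(8)
AS
BEGIN
    SET NOCOUNT ON;

    -- Return only the rows changed after the last synced row version.
    SELECT DeviceId, Temperature, Humidity, CreatedAt, RowVer
    FROM dbo.DeviceTelemetry
    WHERE RowVer > @LastRowVersion
    ORDER BY RowVer;
END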



Now that we have the data (the delta), we need to send it to the Push dataset in Power BI. Every Push dataset has a unique ID, and the data needs to be sent to the correct one.

I have created a dataset called “DeviceTelemetry” using the REST API. To find the dataset ID, you call the Power BI REST API as shown below:
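A minimal sketch of that call (GET https://api.powerbi.com/v1.0/myorg/datasets with the OAuth token in the Authorization header):

using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

static async Task<string> GetDatasetsJsonAsync(string accessToken)
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", accessToken);

        // Returns all datasets in "My Workspace"; look for the one named DeviceTelemetry.
        var response = await client.GetAsync("https://api.powerbi.com/v1.0/myorg/datasets");
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}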










And the result looks like the following:
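The body is JSON along these lines (the id shown here is only a placeholder GUID):

{
  "value": [
    {
      "id": "00000000-0000-0000-0000-000000000000",
      "name": "DeviceTelemetry",
      "addRowsAPIEnabled": true
    }
  ]
}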








Now that we have the dataset ID (a GUID), we need to use it to send data to Power BI via the REST API. In your console app you can fetch all the datasets and grab the ID of the one you want to send data to; above I have simply shown how to achieve it for demonstration purposes.

Now the console app can use the dataset ID and keep pushing data to it. Again, you can leverage the Power BI REST API to send data in batches or one row at a time. Below is a snapshot of how I am sending data:









Here is the code that adds rows to the Power BI Push dataset by calling the API wrapper:
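A sketch of the calling side; delta, powerBiClient, datasetId and the table name are placeholders for the objects in your own console app:

// Shape each delta row to match the Push dataset's table columns...
var rows = delta.Select(r => new
{
    DeviceId = r.DeviceId,
    Temperature = r.Temperature,
    Humidity = r.Humidity,
    CreatedAt = r.CreatedAt
}).ToList();

// ...and hand them to the wrapper shown in the next snippet.
await powerBiClient.AddRowsAsync(datasetId, "<your-table-name>", rows);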


Here is the Power BI REST API call wrapped in a small method:
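A minimal sketch of such a wrapper, assuming Newtonsoft.Json for serialisation; it simply POSTs the rows to the datasets/{datasetId}/tables/{tableName}/rows endpoint:

using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json;

public class PowerBiClient
{
    private readonly string _accessToken;

    public PowerBiClient(string accessToken)
    {
        _accessToken = accessToken;
    }

    public async Task AddRowsAsync<T>(string datasetId, string tableName, IList<T> rows)
    {
        var url = $"https://api.powerbi.com/v1.0/myorg/datasets/{datasetId}/tables/{tableName}/rows";
        var payload = JsonConvert.SerializeObject(new { rows });

        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Bearer", _accessToken);

            var response = await client.PostAsync(
                url, new StringContent(payload, Encoding.UTF8, "application/json"));
            response.EnsureSuccessStatusCode();
        }
    }
}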


Once your console app starts sending data, you can go to PowerBI.com and start creating reports and dashboards as shown below:




Note that the dataset is listed as a Push dataset.
Click on the area boxed in red (the create-report link) to create a report; I created a few reports and composed them into the dashboard shown below:














That’s it so far.

Monday, 3 September 2018

Setting up an environment for Monte Carlo Simulation in Docker

In this blog I will walk you through installing JAGS inside a Docker container. You might be wondering why I have chosen Docker for this. The answer is very simple: when I was installing JAGS on my personal computer, the OS did not recognise it as trusted software, so I did not want to take the risk of installing it there.

If you want to play with JAGS but don't want to install it on your computer, then Docker is the best option: you can play with the package/software and then simply delete the container.

Now you know why I have chosen a Docker container for this. Let's proceed to set up an environment for Monte Carlo simulation. Make sure you have Docker installed, then follow the steps below:

1. Open a command prompt with administrative privileges and issue the following command:
$ docker run --name mybox -t -p 8004:8004 opencpu/rstudio

The above command will download the opencpu/rstudio image locally and create the container named mybox.

2. Issue the command below to start the container:
$ docker container start mybox

3. Open a browser on your host computer, point it to http://localhost:8004/rstudio/ and provide opencpu as both the username and the password, as shown below:






4. Now you need to connect to the container, by issuing the command below in your command prompt, to install JAGS - a Gibbs sampling program:

$ docker exec -i -t mybox /bin/bash

You will be taken to the container's terminal, as shown below:



5. Issue the commands below in the container's terminal:
$ sudo apt-get update
$ sudo apt-get install jags

6. Now go to the browser (opened in step 3) and install the "rjags" and "runjags" packages as shown below, and you are done. You can now use this environment to create a Monte Carlo simulation.
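For reference, the package installation boils down to the following R commands in that RStudio session:

# Install the JAGS interfaces inside the RStudio session from step 3.
install.packages(c("rjags", "runjags"))

# Quick check that the packages can find the JAGS library installed in step 5.
library(rjags)
library(runjags)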


That's it so far. Stay tuned.

Wednesday, 9 May 2018

Azure IoT and IoT Edge - Part 2 (Building a Machine Learning model using generated data)


This blog is part 2 of the Azure IoT Edge series. Please see http://blog.mmasood.com/2018/03/azure-iot-and-iot-edge-part-1.html if you have not read part 1.

In this blog I will cover how we can build a logistic regression model in R using the data captured in table storage via IoT Hub.

We could run the simulated devices (all three at once) and wait for data to be generated and saved to table storage. But for simplicity I have created an R script to generate the data so that I can build the model and deploy it to IoT Edge; we can then leverage the Edge device to apply the Machine Learning model to the data it receives from the downstream devices.

I am using exactly the same minimum temperature, pressure and humidity values as our simulated device was using (see http://blog.mmasood.com/2018/03/azure-iot-and-iot-edge-part-1.html). Here are a few lines of the R script:
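A sketch of the generation idea (the exact minimums, ranges and the Status column name are illustrative, not the original script):

set.seed(123)

n <- 1000
is_defective <- rbinom(n, 1, 1/3)   # roughly 66% good / 33% bad, as used later in the post

# Defective devices tend to report higher temperature and pressure.
temperature <- 20 + runif(n, 0, 10) + is_defective * runif(n, 8, 15)
pressure    <- 1000 + runif(n, 0, 20) + is_defective * runif(n, 15, 30)
humidity    <- 40 + runif(n, 0, 30)

deviceData <- data.frame(
  Temperature = temperature,
  Pressure    = pressure,
  Humidity    = humidity,
  Status      = factor(ifelse(is_defective == 1, "bad", "good"))
)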




Let’s plot the data and see what it looks like. There are only three fields/features, so I will plot Temperature vs Pressure using ggplot2:
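A plot along these lines (assuming the deviceData frame and Status column from the sketch above):

library(ggplot2)

# Temperature vs Pressure, coloured by device status.
ggplot(deviceData, aes(x = Temperature, y = Pressure, colour = Status)) +
  geom_point(alpha = 0.6) +
  labs(title = "Temperature vs Pressure by device status")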







Output of the above R commands:

















We can see that as the temperature and pressure increase, the device becomes bad, i.e. it moves away from the cluster of good devices. For simplicity, the simulation generates higher temperature and pressure values when a device is flagged as defective.

Now let’s build a simple logistic regression model to find out the probability of a device being defective.
I am using the caret package for building the model. Here is the code to split the data into training and test sets:
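A sketch of the split, assuming the deviceData frame from above:

library(caret)

set.seed(123)
trainIndex <- createDataPartition(deviceData$Status, p = 0.7, list = FALSE)

trainData <- deviceData[trainIndex, ]
testData  <- deviceData[-trainIndex, ]

# Check that the good/bad proportions are preserved in both sets.
prop.table(table(trainData$Status))
prop.table(table(testData$Status))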








The proportion of good vs bad devices in the original data is about 66% (good) / 33% (bad), so we make sure we don’t introduce skewness when splitting the data.







Now we apply the glm function to the data using the R script shown below:
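A sketch of the model fit (the formula assumes the Status, Temperature, Pressure and Humidity columns used above):

# Logistic regression; with levels c("bad", "good"), glm models the probability
# of the second level, i.e. the probability that a device is good.
model <- glm(Status ~ Temperature + Pressure + Humidity,
             data = trainData,
             family = binomial(link = "logit"))

summary(model)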







Here is the summary of the model:

















We can see from the output above that pressure is not statistically significant. That is fine; the idea of this post is simply to have a model that we can use on the IoT Edge device.

Let’s test this model on the test data set and find the best threshold to separate the bad devices from the good ones. I could have used a cross-validation set to find the best threshold; in general, use a cross-validation set to fine-tune parameters such as the threshold (or lambda if ridge regression is used).

Below is the confusion matrix when I use a threshold of 0.5:







Let’s construct a data frame that contains the actual label, the predicted label and the calculated probability, using the code below:
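A sketch of that data frame, using predict() on the test set:

# Predicted probability of being "good" on the test set, then apply a 0.5 threshold.
probGood <- predict(model, newdata = testData, type = "response")

results <- data.frame(
  actual      = testData$Status,
  probability = probGood,
  predicted   = factor(ifelse(probGood >= 0.5, "good", "bad"),
                       levels = levels(testData$Status))
)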



And view the first and last 5 records:

















The higher the probability (i.e. the closer to 1), the more likely the device is good.

With a threshold of 0.65, the confusion matrix looks like this:








So we can see from the two confusion matrices above that the better threshold is 0.50, as it misclassifies only 4 instances, whereas 0.65 misclassifies 5 instances.

The final model is given below:
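In general form the fitted model is the logistic function of the three telemetry fields; the actual coefficient values are the estimates shown in the glm summary above:

P(good) = 1 / (1 + exp(-(b0 + b1 * Temperature + b2 * Pressure + b3 * Humidity)))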










So far I have got the model built. I will use this model in an IoT Edge module, which will make the Edge intelligent. I will post about that soon, so stay tuned and happy IoT-ing :)

Tuesday, 13 March 2018

Azure IoT and IoT Edge (Part 1)

In this blog post I will walk you through how an IoT device (for IoT Hub and an IoT Edge gateway) can be created.

The simulated device will generate telemetry data that will be used by an IoT Edge module (e.g. clustering) to find out which devices need to be replaced or restarted, etc.

I will be posting a few more blogs to achieve the following:




We can see from the diagram above that the main components are:
  •  IoT Hub
  •  Configuration of the IoT Edge device as a gateway
  •  IoT Edge module
  •  Downstream devices


I will develop a Machine Learning model (k-means clustering) in R and leverage it in the MachineLearningModel Edge module to find which devices need to be replaced or restarted, etc.

For simplicity, the downstream device will generate the following telemetry data:
  • Temperature
  • Humidity
  • Pressure


Let’s develop a downstream device that generates the random data above. Follow https://docs.microsoft.com/en-us/azure/iot-hub/iot-hub-csharp-csharp-getstarted to set up your IoT Hub. Now that I have my IoT Hub set up, I am creating a .NET console app that acts as a device and generates some random data.

Here are some code snippets:

Here is the Main method:
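A simplified sketch of such a device; it assumes the Microsoft.Azure.Devices.Client and Newtonsoft.Json NuGet packages and reads the connection string and device name from the command line, which is what the batch file below supplies:

using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.Devices.Client;
using Newtonsoft.Json;

class Program
{
    static async Task Main(string[] args)
    {
        var connectionString = args[0];   // device connection string from the batch file
        var deviceId = args[1];

        var deviceClient = DeviceClient.CreateFromConnectionString(connectionString, TransportType.Mqtt);
        var random = new Random();

        while (true)
        {
            // Random telemetry in plausible ranges.
            var telemetry = new
            {
                deviceId = deviceId,
                temperature = 20 + random.NextDouble() * 15,
                humidity = 40 + random.NextDouble() * 30,
                pressure = 1000 + random.NextDouble() * 30
            };

            var message = new Message(Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(telemetry)));
            await deviceClient.SendEventAsync(message);

            Console.WriteLine($"{DateTime.UtcNow:o} - sent telemetry for {deviceId}");
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }
}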
Here is an example batch file to run as device1:
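Something like the following, where SimulatedDevice.exe is a placeholder for your console app's executable:

REM device1.bat (illustrative) - paste the device's primary connection string from the portal
SimulatedDevice.exe "HostName=<your-iot-hub>.azure-devices.net;DeviceId=device1;SharedAccessKey=<key>" device1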


Now you need to register 3 devices with your IoT Hub in the Azure Portal. Here are the steps:
  •  Navigate to the Azure Portal, then to your IoT Hub
  •  Navigate to IoT Devices
  •  Click the Add button and fill in the details for the device, as shown below:





  •  Now go to the device you just created and copy the primary connection string into the respective .bat file.
  •  Repeat this three times to create three devices.


Once you have created the three devices, start running device1.bat, device2.bat and device3.bat. They will start sending data to IoT Hub as shown below:



And your IoT Hub will show the number of messages received, as shown below:




So far we have created/simulated 3 devices that send temperature and other data to IoT Hub. These devices will later send their data through an IoT Edge gateway (by appending GatewayHostName=<your-gateway-host-name> to the device connection string), which I will explain in the next blog, so stay tuned.

Thursday, 18 May 2017

Exploring SparkR using Databricks environment

In this exploration I will share what I have learnt so far about R with Spark. Spark, as you all know, is a distributed computing framework. It allows you to program in Scala, Python, Java and now R to perform distributed computation.

I previously implemented gradient descent in Hadoop to understand how the gradient computation can be parallelized. Please have a read at http://blog.mmasood.com/2016/08/implementing-gradient-decent-algorithm.html to understand the mathematics behind it.

Now I am implementing the same gradient descent algorithm in SparkR using the Databricks community edition. You might be wondering why I am implementing it again :)

I always start with the knowledge I have right now and then use that knowledge to learn a new language. In this instance the gradient descent algorithm fits best. We also learn a couple of things while implementing GD, such as:

  •  How we break a big loop across a cluster of computers
  •  How we transform data in parallel
  •  How we share/send common variables/values to the worker nodes
  •  How we aggregate results from the worker nodes
  •  Finally, how we combine those results


If your algorithm has to iterate over millions of records or more, then it is worth parallelizing it. Whatever computation you do, you will almost always be doing the same sort of things outlined above. These high-level tasks can also be used to build more complex Machine Learning solutions such as model ensembling or model stacking.

Please write in the comments if you have other items beyond the ones I have listed above :) so I can learn from you as well.


Now that I have talked enough, let's do some coding :)

You need to sign up at https://databricks.com/ first. Once you have done that, you can follow along.
Now navigate to the Databricks community edition home page, as shown below:





First you need to create a cluster: click on Clusters > Create Cluster. Use Spark 2.1 (Auto-updating, Scala 2.10).

Next, upload your data to the cluster. To do this, click on Tables > Create Table and you will be presented with a screen like the one below:



Click on the “Drop file or click here to upload” section and upload your file. Once you have uploaded the file it will show you the path. Note down that path somewhere.



Now create a notebook by navigating to Workspace, clicking on the dropdown and selecting Create > Notebook, as shown below:




And provide a name for the notebook. I called it “SparkR-GradientDescent”.



Make sure you have selected R as the language. Click the Create button to create the notebook. Now navigate to your newly created notebook and start writing R code :)

We now need to load the data. Remember that we are running R code in Spark, so we need to use read.df (from the SparkR package) to load the data into a SparkDataFrame (not a data.frame).
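A sketch of loading the uploaded CSV (replace the path with the one noted in the upload step):

library(SparkR)   # already attached in a Databricks R notebook, but harmless to repeat

dataPath <- "/FileStore/tables/<your-uploaded-file>.csv"   # placeholder path

data <- read.df(dataPath, source = "csv", header = "true", inferSchema = "true")
head(data)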



Note that the methods above look like the familiar R ones, but they come from the SparkR package and understand SparkDataFrame objects. Let's run the code below to see the structure of the object:
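For example:

printSchema(data)   # the Spark schema of the SparkDataFrame
str(data)           # compact structure, similar to str() on a local data.frame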





Now run the code below:
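For example, comparing the SparkDataFrame with a locally collected data.frame:

class(data)            # "SparkDataFrame"
class(collect(data))   # "data.frame", i.e. a local R data frame on the driver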



You can see that the two are different types of object.

Now I define a function that calculates the partial gradient, so that we can compute it on the worker nodes and get the results back to the driver program.

Here is the code:
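A sketch of such a function, assuming a simple hypothesis h(x) = theta0 + theta1 * x and columns named x and y (the column names are illustrative):

partialGradient <- function(partition, theta0, theta1) {
  error <- (theta0 + theta1 * partition$x) - partition$y

  data.frame(
    sum0 = sum(error),                    # gradient component for theta0
    sum1 = sum(error * partition$x),      # gradient component for theta1
    n    = as.numeric(nrow(partition))    # observations in this partition
  )
}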



Now we write the code that gets the worker nodes to calculate the partial gradients on each partition, collects those results and updates our thetas, using the code below:
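A sketch of the driver loop (the learning rate, iteration count and column names are illustrative):

# Cache the SparkDataFrame so each iteration does not re-read the file from storage.
cache(data)

# dapply needs the schema of the data.frame returned by partialGradient.
gradSchema <- structType(
  structField("sum0", "double"),
  structField("sum1", "double"),
  structField("n", "double")
)

theta0 <- 0
theta1 <- 0
alpha  <- 0.01     # learning rate
iterations <- 100

for (i in 1:iterations) {
  # Run partialGradient on every partition on the worker nodes...
  partials <- dapply(data,
                     function(partition) partialGradient(partition, theta0, theta1),
                     gradSchema)

  # ...and collect the small per-partition summaries back on the driver.
  localPartials <- collect(partials)

  m <- sum(localPartials$n)
  theta0 <- theta0 - alpha * sum(localPartials$sum0) / m
  theta1 <- theta1 - alpha * sum(localPartials$sum1) / m
}

c(theta0 = theta0, theta1 = theta1)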



Here is the result of the above code:



A few things to note in the above code:

  1. We are caching the data in memory (using cache(data)) so that Spark does not need to reload it from storage on each iteration.
  2. We are defining a schema because dapply needs it to transform the R data.frame returned by our function back into a SparkDataFrame.
  3. We are performing the partial gradient calculation on each partition using dapply, i.e. we are telling Spark to run the given function on each partition residing on the worker nodes.
  4. Each worker node gets a copy of the shared variables/objects. In Spark with Scala we would have had to broadcast the variable explicitly.
  5. We are collecting the data from the worker nodes as an R data.frame using the collect method.
  6. We then update the thetas, and the new values are shipped to each worker in the next iteration.

You can view the available functions in the SparkR package at https://docs.databricks.com/spark/latest/sparkr/index.html
Finally, we can validate our estimated coefficients using the lm function in R (running locally on my machine):
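For example, on the same (small) data collected locally, assuming the same illustrative x and y columns:

localData <- collect(data)
lm(y ~ x, data = localData)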




We validate the calculation on sample data so that it is easy to debug. We can see that the estimated coefficients are close to what the lm model gives; if we increase the number of iterations, our thetas get even closer.

I hope this post helps you understand SparkR. Please provide your feedback if I missed anything.

That’s it for now. Enjoy coding :)

Saturday, 27 August 2016

Implementing Gradient Descent Algorithm in Hadoop for large scale data

In this post I will explore how we can use MapReduce to implement the gradient descent algorithm in Hadoop for large-scale data. As we know, Hadoop is capable of handling petabyte-scale data.

In this article I will be using the following concepts:
  • Gradient Descent algorithm
  • Hadoop Map Reduce Job
  • Writable
  • Reading from HDFS via Hadoop API
  • Hadoop Counter

Before starting, we first need to understand what gradient descent is and where we can use it. Below is an excerpt from Wikipedia:
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.
Gradient descent is also known as steepest descent, or the method of steepest descent. Gradient descent should not be confused with the method of steepest descent for approximating integrals.

If you look at the algorithm, it is an iterative optimisation algorithm. So if we are talking about millions of observations, then we need to iterate over those millions of observations and adjust our parameters (theta).

Some mathematical notation (the standard gradient descent formulation for linear regression with a single feature):

Hypothesis:     h(x) = theta0 + theta1 * x

Cost function:  J(theta0, theta1) = (1 / 2m) * sum over i = 1..m of (h(x_i) - y_i)^2

Update rule:    theta_j := theta_j - alpha * (1/m) * sum over i = 1..m of (h(x_i) - y_i) * x_ij   (for j = 0, 1, with x_i0 = 1 and x_i1 = x_i)

where m is the number of observations, alpha is the learning rate, and all thetas are updated simultaneously on each iteration.

And these sums over all m observations are exactly what we will distribute across the cluster.
Now the question is: how can we leverage Hadoop to distribute the workload, minimize the cost function and find the theta parameters?

The MapReduce programming model comprises two phases, 1. Map and 2. Reduce, shown in the picture below. Hadoop lets the programmer focus only on the map and reduce phases; the rest of the workload is taken care of by Hadoop, and programmers do not need to think about how the data will be split, etc.

Please visit https://en.wikipedia.org/wiki/MapReduce to learn more about the MapReduce framework.
















When a user uploads data to HDFS, the data is split and saved across various data nodes.
Hadoop then provides a subset of the data to each mapper, so we can program our mapper to emit PartialGradientDescent serializable objects. For instance, if one split has 50 observations, then that mapper will emit 50 partial gradient descent objects.

One more thing: there is only ONE reducer in this example, so the reducer will receive the whole lot of data. It would be better to introduce a combiner so that the reducer receives fewer PartialGradientDescent objects, or you could apply the in-memory combining design pattern for MapReduce, which I will cover in the next post.

Now let’s get into the Java MapReduce program. Before reading further, it would be better if you understand the Writable concept in Hadoop and some matrix algebra.

Mapper code:
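A simplified sketch of such a mapper, assuming each input line is a comma-separated x,y observation and that the driver passes the current thetas in through the job Configuration:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GradientDescentMapper
        extends Mapper<LongWritable, Text, IntWritable, PartialGradientDescent> {

    private static final IntWritable SINGLE_KEY = new IntWritable(0);

    private double theta0;
    private double theta1;

    @Override
    protected void setup(Context context) {
        // Current thetas are passed in by the driver for this iteration.
        theta0 = context.getConfiguration().getDouble("gd.theta0", 0.0);
        theta1 = context.getConfiguration().getDouble("gd.theta1", 0.0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line is assumed to be a comma-separated "x,y" observation.
        String[] parts = value.toString().split(",");
        double x = Double.parseDouble(parts[0]);
        double y = Double.parseDouble(parts[1]);

        double error = (theta0 + theta1 * x) - y;

        // Emit the partial gradient contribution of this single observation.
        context.write(SINGLE_KEY, new PartialGradientDescent(error, error * x, 1L));
    }
}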

We can see that the map task emits a PartialGradientDescent object for each observation, carrying sum0, sum1 and a count of 1. This information will be required in the reducer to update the thetas.





Now let's have a look at the reducer code:
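A matching sketch of the reducer, which simply sums the partial gradients (the same class could also be registered as a combiner, since the summation is associative):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class GradientDescentReducer
        extends Reducer<IntWritable, PartialGradientDescent, IntWritable, PartialGradientDescent> {

    @Override
    protected void reduce(IntWritable key, Iterable<PartialGradientDescent> values, Context context)
            throws IOException, InterruptedException {
        double sum0 = 0.0;
        double sum1 = 0.0;
        long count = 0;

        // Sum up all the partial gradients coming from the mappers (or the combiner).
        for (PartialGradientDescent partial : values) {
            sum0 += partial.getSum0();
            sum1 += partial.getSum1();
            count += partial.getCount();
        }

        context.write(key, new PartialGradientDescent(sum0, sum1, count));
    }
}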


We can see from the reducer code that we are summing up all the given partial gradients. This can be improved by supplying a combiner that does a partial sum before the data reaches the reducer. For instance, if we have 50 mappers, the combiner runs after each mapper, so the reducer will receive only 50 partial gradient objects.

And here is the custom writable (i.e. PartialGradientDescent):
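A sketch of the writable, carrying sum0, sum1 and the observation count:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class PartialGradientDescent implements Writable {

    private double sum0;   // sum of errors (gradient component for theta0)
    private double sum1;   // sum of error * x (gradient component for theta1)
    private long count;    // number of observations contributing to the sums

    public PartialGradientDescent() {
        // Required by Hadoop for deserialisation.
    }

    public PartialGradientDescent(double sum0, double sum1, long count) {
        this.sum0 = sum0;
        this.sum1 = sum1;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeDouble(sum0);
        out.writeDouble(sum1);
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sum0 = in.readDouble();
        sum1 = in.readDouble();
        count = in.readLong();
    }

    public double getSum0() { return sum0; }

    public double getSum1() { return sum1; }

    public long getCount() { return count; }

    @Override
    public String toString() {
        return sum0 + "\t" + sum1 + "\t" + count;
    }
}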


And the last piece of the puzzle is the driver program, which triggers one Hadoop job per iteration for as many iterations as you need.
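A sketch of the driver, which runs one MapReduce job per iteration and passes the current thetas through the Configuration (reading the summed gradients back from HDFS is left as a comment):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GradientDescentDriver {

    public static void main(String[] args) throws Exception {
        int iterations = 10;     // illustrative
        double alpha = 0.01;     // learning rate, used when updating the thetas below
        double theta0 = 0.0;
        double theta1 = 0.0;

        for (int i = 0; i < iterations; i++) {
            Configuration conf = new Configuration();
            conf.setDouble("gd.theta0", theta0);
            conf.setDouble("gd.theta1", theta1);

            Job job = Job.getInstance(conf, "gradient-descent-iteration-" + i);
            job.setJarByClass(GradientDescentDriver.class);
            job.setMapperClass(GradientDescentMapper.class);
            job.setReducerClass(GradientDescentReducer.class);
            job.setNumReduceTasks(1);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(PartialGradientDescent.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1] + "/iteration-" + i));

            if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("Iteration " + i + " failed");
            }

            // Read the summed sum0/sum1/count back from the reducer output on HDFS
            // (via the Hadoop FileSystem API) and update the thetas:
            //   theta0 = theta0 - alpha * sum0 / count;
            //   theta1 = theta1 - alpha * sum1 / count;
        }
    }
}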

That's it for now. Stay tuned.