It's all about Data, IoT and AI

Wednesday 29 November 2023

How to use LLM with custom data sitting in a private network

In this blog post, I am going to show how we can use LLM using RAG (Retrieval-augmented generation) technique to answer the questions from our documents/files/web pages/data sitting in a private network.

With RAG architecture, your LLM is not required to train on your data but your provided data as additional context to LLM by asking to answer it.

Here is the RAG architecture:

Here is a simple method to interact with the OpenAI LLM model, you need to define the OpenAI key:

Let's start with a simple prompt without context:

As we can see the LLM can’t answer if you ask any specific question.

Now pass some context with a prompt to the LLM and see the response:

We can see LLM answered it correctly.

Let's see how we can introduce some sort of database that stores all your private/intranet data which LLM does not know about or LLM did not use during the training.

To find the right context information for a given question, we will be using Vector DB which stores the data as embedding vectors. Please read more at Word embedding - Wikipedia

Let’s use a simple example to understand how we can find the relevant context. I am using OpenAI embedding here. Assume we have got two different context information and one question as shown below:

The above code snippet shows the sentences are converted to the numeric vectors which is an embedding vector. We need to find which context is similar to the question. We can do this using cosine similarity using the below code:

We can see question and context 1 has the highest value for similarity than to context 2.

I have a simple chatbot using LangChain, Chroma (an open-source vector db) OpenAI embedding and gpt-3.5-turbo LLM to ask any question related to my website (https://www.musofttech.com.au)

Here is the output of a few questions I have asked:

Now I am asking a question where the website does not have information or context and see how the LLM is responding to it

One last thing, you can use LLM to ask about your intranet/private data but we have to pass the context to LLM so make sure you are not passing any information which are not supposed to go beyond your network. The better option is to have your LLM deployed in your private network so it will be safe to use.

Thanks for reading the blog and let me know if you want to achieve a similar thing for you or your customers.

Thursday 12 October 2023

Getting started with DBT with Databricks

Let's start with the question what is dbt?

“dbt is a transformation workflow that helps you get more work done while producing higher quality results. You can use dbt to modularize and centralize your analytics code, while also providing your data team with guardrails typically found in software engineering workflows.” You can read more at Whatis dbt? | dbt Developer Hub (getdbt.com)

There are couple of options to get started, one is pip with virtual environment but I will use pip inside a docker container.

Let’s get started, first thing first start the docker container with below command:

$ docker run -p 8888:8888 -p 8000:8080 -v D:/projects/rnd2/:/home/jovyan/work jupyter/minimal-notebook

In above command, I am exposing two ports 1) 8888 and 2) 8080. The first port will be used to connect to jupyter notebook/terminal and the other port will be used to expose the dbt documentation.

Access your juypter notebook server with http://localhost:8888/lab/tree/work like shown below:

Click on the terminal and execute the pip statement:

Note: For other target, you need to install other package. For instance dbt-snowflake for Snowflake.

Once installed, execute the below command to create a dbt project

Copy the /home/jovyan/.dbt/profiles.yml to your project folder like below:

Now everything is done, let’s open the Visual Studio code

You can explore the generated codes to understand the structure and codes etc.

I have generated the databricks token and supplied to profiles.yml file. Go to the terminal in your jupyter notbook server and execute dbt commands.

The first command is the dbt debug, which validates the configuration.

To apply your transformation, run dbt run command.

Points to remember:

Everytime you execute dbt run, a transformation is applied. With this concept you integrate this with orchestration tool like Apache Airflow or even with Databrick workflow or even with Docker container which will run your dbt command and destroy itself.

You can go in databricks to check the bronze schema which will have 1 table loaded.

Lets run dbt test

As we know that first model contains 1 null record.

Lets look at documentation part, run dbt docs generate and then serve. By default it exposes the documentation at port 8080 because we are running this doc server inside a container and we already mapped our 8000 port to 8080. So simply access localhost:8000 will take you to dbt generated documentation

Navigate to http://localhost:8000

View lineage as well

That’s for now, next I will share how we can use dbt to build data vault layer.

Sunday 24 July 2022

Real-time video analytics with OpenVINO on Azure IoT Edge

In this blog post, I will go through how we can build real-time video analytics on IoT Edge devices with OpenVINO.

Before I start, let me explain what is OpenVINO and what is good for?

OpenVINO stands for Open Visual Inference and Neural network Optimization.

As its name suggests, it is used to optimize the models and allows you to host the models based on your process architecture.

There are two main steps given below:

1. Convert your favorite model to IR format as shown below:

2. Host your IR (intermediate representation) to the OpenVINO model server. The runtime is process architecture dependent.

With this approach, you can develop your own model with your favorite framework or download the prebuilt models from Model Zoo and host them in OpenVINO.

You can read more at https://docs.openvino.ai/latest/index.html

Now we know the OpenVINO, let’s see how we can use it for IoT Edge. We can host the OpenVINO model in a Docker container which makes it perfect for IoT Edge devices. The below diagram shows the general architecture of how OpenVINO can be used:

For more information on how OpenVINO can be used with Azure IoT Edge please visit https://docs.microsoft.com/en-us/azure/azure-video-analyzer/video-analyzer-docs/edge/use-intel-openvino-tutorial

Now let’s see how we can host the model and how we can consume it to make inference for given video frame. For the example, I have chosen a prebuilt model called “Vehicle Detection” from Model Zoo.

Run below command to host model in a Docker container:

$ docker run --rm -d -v "path/to/your/models":/models:ro -p 9000:9000 -p 9001:9001 openvino/model_server:latest --config_path models/config.json --port 9000 --rest_port 9001

You can see OpenVINO documentation for more information: https://docs.openvino.ai/latest/ovms_docs_docker_container.html

Once the model is hosted you can query the metadata of the model by navigating to the URL

http://iotedgedevice-ip:9001/v1/models/vehicle-detection-0202/versions/1/metadata and response will be as shown below:

Which shows the shape of input and output. It means our vehicle detection model is hosted.

Now let’s consume this model from the python script (which later will be converted to the IoT Edge module to process the frame).

The first step is to load the frame in the OpenCV object as shown below:

The endpoint for model prediction is at http://iotedgedevice-ip:9001/v1/models/vehicle-detection-0202/versions/1:predict

We need to submit this image as JSON data and get the result as JSON which contains a NumPy array as per the specification mentioned at https://docs.openvino.ai/latest/omz_models_model_vehicle_detection_0202.html

The below method is used to convert an image to JSON input for the model:

The code flow is shown below:

Here is the model that predicted the vehicle bounding boxes which are then drawn to the mage as shown below:

We can see the model predicted all those vehicles shown in the above image.

That’s it for now and thanks for reading my blog.

Thursday 2 June 2022

Exploring Databricks SQL endpoint and Databricks Dashboard

In this blog post, I will be exploring how we can use SQL Endpoint to expose Delta tables to the outer world and build a Dashboard in the Databricks environment which uses delta tables from the Silver and/or the Gold layer.

What is SQL Endpoint?

"A SQL endpoint is a computation resource that lets you run SQL commands on data objects within Databricks SQL." From Microsoft.

As you might know that now Databricks support three modes given below:

Data Science and Engineering
Machine Learning
SQL

To create an SQL endpoint and dashboard you have to use SQL workspace.

Follow https://docs.microsoft.com/en-us/azure/databricks/sql/admin/sql-endpoints to create an SQL endpoint.

Scenario 1: I have got sales data in the source system and want to view the summary of the sales or return in Databricks Dashboard.

Scenario 2: I have got aggregated data in Delta format and wanted to expose it to the custom application (for instance C# desktop or web application).

To cover both scenarios, I have got below is reference architecture from Databricks:

The above architecture is slightly changed as I am not using Azure Synapse Analytics as my serving layer instead I am exposing my Delta tables through SQL Endpoint and also I am not using a streaming dataset.

I am loading data from source to all layers (raw lake to delta lake (bronze, silver, and gold). The Gold layer contains the aggregated results (or summary).

If you want to create the below-aggregated data in the Gold layer, simply run the below code to a python notebook in the “Data Science and Engineering” workspace:

What are the components of the Dashboard?

The Dashboard in Databricks is composed of the following two major components:

. Visualizations
. Text

The Visualization component is based on the Query. So a query can have multiple visualizations which can be used in the dashboard.

The text component is where you can compose text using markdown. For instance, adding graphics and information.

The query has a schedule component that allows it to refresh the data shown in the dashboard:

To me, the Query and associated visualizations follow a publisher-subscriber model where the query is the publisher, and all visualization is a subscriber. Any changes to query results will reflect those changes to visualizations as well.

For instance, if you have a streaming source that is constantly updating your aggregated results then the query should have a scheduled component to refresh it automatically as per your schedule.

There are lots of visualizations like Bar chart, pie chart, counter etc. When you add a visualization widget to the dashboard it asks you to select your query first and then select your visualization associated with that query.

Now we understood the Dashboard, let’s create a query called “Sales-summary” as shown below:

As you can see I have used a very simple query (just select * …) the reason is that the Gold layer is populated with complex queries (multiple joins and/or group by etc) when we were loading data from Bronze to Silver and Silver to Gold.

You can also see there are four visualizations (tabs in the above screenshot) associated with the “Sales-Summary” query mentioned below:

Table (which is showing the results as a table)
Comparing Sales vs Return (It is a Bar chart)
Return Transaction (It is a counter showing the returned amount with formatting)
Sales Transaction (It is a counter)

I also created two more queries to show items sold and items returned as counter visualization.

Now I have got all my queries and associated visualizations ready, I can simply create a Dashboard in the Dashboard collections named “Northwind Sales” and arranged my visualizations as shown below:

I also used the Text component to place a MusoftTech logo and text with the help of markdown.

You may be thinking I have got more returns than sales and this is because my random number generated more returns than the sale.

One of the cool things is that each component of the Dashboard shows how fresh your data is.

The Dashboard can be refreshed based on configured schedule or it can manually be refreshed as required and also it can be shared via email by subscribing the email address to the Dashboard.

For scenario 2, the Databricks SQL endpoint is used to expose the Delta tables. The Delta tables can be consumed directly from PowerBI with SQL endpoint connection details.

I am going to consume the Delta table from the C# console application which shows that it can be consumed by any .net application. To connect your C# application to the Databricks SQL endpoint you need to create System DSN or User DSN and you have to install the ODBC driver.

Below is the screenshot of my DSN configuration screen:

Where

Host(s) is your Databricks SQL endpoint host URL.
User name is your azure ad account
Password is your personal access token.

You also need to set “HTTP Options” with the HTTP path for your endpoint and “SSL Option” to tick on “SSL”.

Once you have provided all settings you can test the connectivity by clicking on the “Test” button and the result will be shown below:

After testing the connection I have used this DSN in my C# console app, like in the code is given below:

and running the app would return the below result:

With this approach, we can stream the result to the PowerBI streaming dataset or Azure Stream analytics.

That’s it for now.

Monday 22 October 2018

Use of Power BI PushDatasets

In PowerBI, real-time streaming can be consumed in three (please visit https://docs.microsoft.com/en-us/power-bi/service-real-time-streaming ) different ways listed below:

Consuming streaming data from Azure Stream Analytic
Consuming streaming data from PubNub
Consuming streaming data vis PushDataset

If you have got sensors which are emitting data and you want to visualize it in real-time, you can use PowerBI in conjunction with Azure Stream Analytics to build the dashboard. But if data is less frequent and you want to have a dashboards that auto-refreshes then you can use any one of three methods. In this post I will show how Push Datasets can be used to develop a dashboard.

Below is a simple architecture for this post:

We need following components to build a complete end-to-end simple solution shown in above diagram:

Data Generator (to simulate data is coming to db at every x interval). This could your source which generates data less frequently.
Console App (that pushes data to PowerBI PushDataset)
PowerBI Dashboard

I have created a sample code that generates some data and inserts into database. Please see below code snippet:

The console application will be leveraging PowerBI REST API programming interface for pushing data. For this reason, console app needs to authenticate/authorize with PowerBI. So you need to register your console app with Azure AD in order to get OAuth token from Azure Authorization server. Please follow https://docs.microsoft.com/en-us/power-bi/developer/walkthrough-push-data-register-app-with-azure-ad to register your app.

OAuth provides four different OAuth flows based how you want to authenticate/authorize your application with Authorization server. Please visit https://auth0.com/docs/api-auth/which-oauth-flow-to-use to know which flow is best suited for your scenario.

I will be using Client Credential Flow (an OAuth flow which can be read at https://oauth.net/2/grant-types/client-credentials/ ) as console app will be treated as trusted application and also there would not be any human interaction if any authorization popup appears it would not be able to deal with.

Below is the code to get oauth token using Microsoft.IdentityModel.Clients.ActiveDirectory version 2.29.0 of nuget package:

I treat this scenario as syncing two systems (from source to target but target is PowerBI). Most of the syncing solution, we need to maintain what we have synced so far so that next time system should pick delta of data.

For this purpose, we are using ROWVERSION datatype which is auto generated by database. Please visit https://www.mssqltips.com/sqlservertip/4545/synchronizing-sql-server-data-using-rowversion/ for how to use rowversion for syncing scenario.

To maintain what has been synced, I have created a table to keep track the last row version, console application has sent to PowerBI, against a table like shown below:

For the first time, last row version should 0x000.

I also created a stored procedure that returns delta with the help of last row version and table name. Below is the stored procedure code:

Now, we got the data (delta amount), we need to send it to PushDataset in PowerBI. Every PushDataset has a unique id, and data needs to be sent to correct id.

I have created a dataset called “DeviceTelemetry” using REST API. To find the dataset id, you need to call the Power BI REST API like shown below:

And result is shown like below:

Now we got the Dataset Id as GUID, we need to use it to send data to Power BI. We will use PowerBI REST API to do this. You can do it in your console app to fetch all the datasets and grab the id for which you want to send to. For demonstration purpose I have shown you how you achieve it.

Now, the console app can you use dataset id and keep pushing data to it. Again you can leverage Power BI REST API to send data into batches or one by one. Below is a snapshot how I am sending data:

Here is the code that wraps to add rows to PowerBI PushDatasets leveraging api wrapper:

Here is the code for PowerBI Rest Api wrapped around a nice method:

Once your console app start sending data, you can go to PowerBI.com and start creating reports and dashboard like shown below:

Note that the dataset is listed as Push dataset.

Click on red boxed area (create report link) to create report as I created reports and composed them into one dashboard shown below:

That’s it so far.

Monday 3 September 2018

Setting up an environment for Monte Carlo Simulation in Docker

In this blog I will walk you through to install JAGS inside a docker container. You might be thinking why I have chosen docker for this. The answer is very simple, when I was install JAGS on my personal computer, the OS did not recognise as a trusted software so I did not take a risk of installing on my personal computer.

If you want to play with JAGS and you don't want to install it in your computer, then Docker is the best option as I can play with the package/software and then I can delete the container.

Now you got the idea why I have chosen Docker container for this. Let's proceed to setup an environment for Monte Carlo simulation. Make sure you have got Docker installed. Follow below steps to setup the environment:

1. Open Command prompt with administrative privilege and issue follow command:
$ docker run --name mybox -t -p 8004:8004 opencpu/rstudio

Above command will download the opencpu/rstudio image locally.

2. Issue below command to start/run the container:
$ docker container start mybox

3. Open browser in your host computer and point http://localhost:8004/rstudio/ and provide opencpu as username and password like shown below:

4. Now, you need to connect to container, by issuing below command in your command prompt, to install JAGS - a tool that generate Gibbs Sampling:

$ docker exec -I -t mybox /bin/bash

You will be taken to terminal of container like shown below:

5. Issue below commands to terminal of container:
$ sudo apt-get update
$ sudo apt-get install jags

5. Now go the browser (you opened in step 3) and install "rjags" and "runjags" packages like shown below and you are done. Now you use this environment to create a simulation using Monte Carlo.

That's it so far. Stay tuned.

Wednesday 9 May 2018

Azure IoT and IoT Edge - Part 2 (Building a Machine Learning model using generated data)

This blog is part 2 of Azure IoT Edge series. Please see http://blog.mmasood.com/2018/03/azure-iot-and-iot-edge-part-1.html if you have not read part 1.

In this blog I will cover the how we can build a logistic regression model in R using the data the captured in tables storage via IoT Hub.

We can run the simulated devices (all three at once) and wait for data to be generated and save it to table storage. But for the simplicity I have created an R script to generate the data so that I can build the model and deploy it to IoT Edge and hence we can leverage the this Edge device to apply Machine Learning model on the data it is receiving from the downstream devices.

I am using exactly the same minimum temperatures, pressure and humidity as our simulated device was using. Please see http://blog.mmasood.com/2018/03/azure-iot-and-iot-edge-part-1.html here are few lines of R script.

You can download the whole script from https://gist.github.com/mmasooddatascience/2e9807421f147f8abd6b00988eeba37f

Let’s plot the data and see how it looks like. There are only 3 fields/feature so I will plot Temperature vs Pressure using ggplot2:

Output of above R commands:

We can see as the temperature and pressure increases the device is becoming bad or getting away from the good devices. For the simplicity the simulation generates higher number for temperature and pressure if device is flagged as defective.

Now let’s build a simple logistic regression model to find out the probability of device being defective.

I am using caret package for building model. Here is the code to split the training and test data:

The proportion of good vs bad for original data is: 66% (good)/33% (bad). So we make sure we don’t have skewness in the data.

Now applying glm function to data using R script shown below:

Here is the summary of the model:

We can see from above output, the pressure is not statistically significant. The idea of this post is to have a model that we will be using in IoT Edge device.

Let’s test this model on test data set and find out the best threshold to separate the bad from good. I could have used cross-validation to find the best threshold. Use cross validation set to fine tune the parameters (eg. threshold or lambda if ridge regression is used etc).

Below is the confusion matrix when I use threshold 0.5:

Let’s construct a data frame which contains actual, predicted and calculated probability using below code:

And view first and last 5 records:

The higher (or closer to 1) the probability the device is good.

With threshold 0.65, the confusion matrix look like below:

So we can see from above two confusion matrix, the best threshold should be 0.50 as it miss-classifies only 4 instances but when 0.65 is used it miss-classifies 5 instances.

The final model is given below:

So far I have got the model built. I will use this model in IoT Edge module which will make Edge intelligent, which I will post soon so stay tuned and happy IoTing J

Navigation Bar