Tuesday, October 18, 2022

RangerMSP KnowledgeGraph Installation

For an overview of prerequisites, see the RangerMSP Knowledge Graph requirements post (below).

You must SHARE your RangerMSP (CommitCRM) folder.  You will need to grant both file (NTFS) and share permissions to whichever account you run the data ingestion from.

Install Apache HOP

Download the Advantage JDBC driver, then copy the adsjdbc.jar file from C:\Program Files (x86)\Advantage 11.10\JDBC into the /lib folder of your HOP installation.

Launch HOP

Tools → Options

Select the Projects tab

Select the field "GUI:  The standard project folder proposed when creating projects"

Choose a top-level folder for your projects. I recommend this NOT be inside the folder where you installed HOP - keeping them separate makes it less hassle when you upgrade to a newer version of HOP in the future.

I also recommend a folder that is backed up (or better yet, version-control your HOP projects with the built-in GIT integration!)


Clone the GITHUB repository:

Launch your GIT shell
Navigate to the folder you designated for your PROJECTS folder.

git clone https://github.com/pdrangeid/RangerMSP-neo4j-ingestion

This will create a folder named 'RangerMSP-neo4j-ingestion' inside your projects folder.



[Screenshot: git clone of the repository]

There are 3 configuration files needed for this project. I have provided samples that you will need to modify for your environment:

• Environment Configuration
• Two datasource configs (one for the Neo4j database, one for your RangerMSP Advantage SQL database)

Download the sample environment configuration file and save it in your environments folder for this new project.
I store my environment configurations in their own folder structure, SEPARATE from my projects (this segments configurations from the actual pipeline/workflow code to allow for easier upgrades in the future).
e.g.: /documents/hop-environments/rangermsp-neo4j/

Here are sample metadata config files:

Neo4j - save this within the /metadata/neo4j-connection folder:

• Neo4j cloud database connection (example download link)
or
• Neo4j local database connection (example download link)

RangerMSP - save this within the /metadata/rdbms folder:

• RangerMSP server connection (example download link)

Launch HOP

Under Projects →
Click the p+ icon to ADD a project

Give it a name: RangerMSP-neo4j
For the HOME folder, browse to the GIT repository you previously cloned:

[Screenshot: Hop open directory dialog]

After OPENING the project - Allow HOP to Add a default run configuration

[Screenshot: add a local run configuration]

Also allow it to create a local workflow run configuration



[Screenshot: create a local workflow run configuration]


Also allow it to Create a lifecycle environment.  You will need to provide some parameters (explained below)


[Screenshot: create a project lifecycle environment]

Provide a name and purpose for the environment



Click Select, then choose the environment config file you downloaded previously

Adjust the configuration parameters: click "Edit" and make adjustments to any of the import parameters for your environment.
This is particularly important if you are using the free cloud-hosted version of Neo4j AuraDB, as it is limited to 200k nodes and 400k relationships.
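If you are on the free AuraDB tier, a quick way to see where you stand against those limits is a pair of plain Cypher counts (run them one at a time in Neo4j Browser):

// Count all nodes, then all relationships, and compare against the
// AuraDB free-tier limits (200k nodes / 400k relationships).
MATCH (n) RETURN count(n) AS totalNodes;
MATCH ()-[r]->() RETURN count(r) AS totalRelationships;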

Next you will need to modify the datasource properties for both Neo4j and RangerMSP:

On the left side of HOP click the "Metadata" explorer.
Select auradb-hopdev1 by double-clicking it (this example configuration is for Neo4j cloud-hosted AuraDB).
Edit the server hostname, and change the user/password with credentials you were provided when signing up for your Neo4j AuraDB cloud instance.



[Screenshot: configure the Neo4j connection]

Click TEST and ensure you can connect to the neo4j database.
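If you want a second confirmation outside of Hop, you can run this standard Neo4j procedure from Neo4j Browser to verify the database is reachable and see which version/edition you are connected to:

// dbms.components() ships with Neo4j; it returns the product name, version and edition.
CALL dbms.components() YIELD name, versions, edition
RETURN name, versions, edition;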

Now do the same for your RangerMSP Advantage SQL connection.
Again from the "Metadata" explorer selectRangerMSP by double-clicking.  

My sample assumes the RangerMSP database is running on localhost (127.0.0.1), but you can change this to your server's IP.
Also change the share name if it is different from what I've used (RangerMSP$).



[Screenshot: configure the RangerMSP connection]

Click Test to verify connectivity.

Now it's time to ingest Ranger data!

From File Explorer open the HOP Workflow "RangerMSP-Refresh-Data.hwf"


[Screenshot: opening RangerMSP-Refresh-Data.hwf]

Now RUN the workflow to import data.  Any errors will be shown below:

[Screenshot: running RangerMSP-Refresh-Data.hwf]
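Once the workflow finishes without errors, a quick sanity check from Neo4j Browser is to count nodes per label and confirm the RangerMSP data actually landed:

// Spot-check the ingest: how many nodes of each label made it into the graph?
MATCH (n)
RETURN labels(n) AS label, count(*) AS nodes
ORDER BY nodes DESC;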

Ok - this is where I need some help & feedback.  I've used these software packages, and tried to document the process step-by-step, but there are enough moving pieces that I'm sure there are some gotchas somewhere in the process.

What are some errors you encountered?

Missing instructions, or common errors you ran into?


RangerMSP Knowledge Graph requirements

This post covers the prerequisite packages you will need to build your Knowledge Graph for RangerMSP.  For some background, see the previous post discussing MSP Knowledge Graphs.

Before we begin - you SHOULD be aware that some commercial offerings have recently emerged that can provide functionality like this.  They have full web user interfaces, much prettier reporting features, and MANY more integrations than I've built.  An example is https://www.liongard.com/.  When I started this project these products were just getting their start and were NOT very mature.  Also, the platform I've developed is built entirely using open-source tools, and can be completely customized and self-hosted, which lets you avoid the pay-per-endpoint-per-month service model.  So there is that.  Ok - enough with the disclaimer - let's get to it!

Let's talk about the tools you will need:

• RangerMSP SQL edition

This system was built using the FULL Advantage SQL Server version of RangerMSP.  The other tools used will not work without it, because the JDBC data loader is not available for the non-SQL local database.

• You must SHARE your RangerMSP (CommitCRM) folder.  You will need to grant both file (NTFS) and share permissions to whichever account you run the data ingestion (HOP) workflows from.

The REAL magic of this system comes from the next two tools: the Neo4j graph database and the Apache Hop orchestration platform.


Neo4j is the graph database that unleashes the power of the relationships between your data.  This will house the knowledge graph database.

• For the Neo4j database you have multiple options:

Cloud hosted (AuraDB): the free edition is limited to 200k nodes and 400k relationships.

Self-hosted: download here (Community Edition is free).

You could also use the Desktop edition if you want full Enterprise functionality (note this cannot act as a shared multi-user server though - use Aura or Community Edition for that).

If you are installing a self-hosted Neo4j database, I have instructions here.



Apache HOP is the modern evolution of the popular ETL (Extract, Transform and Load) tool Pentaho Data Integration (Kettle).  This allows us to build workflows and perform API calls that import, massage, and deliver the data and relationships that will then be stored within Neo4j.


Download this and install it on a PC or server (requires Java).  You can install it on the same computer as your Neo4j database if you like.

Additional Requirements:

• Neo4j requires the APOC libraries (automatically included in AuraDB) - a quick way to confirm APOC is loaded is shown after this list

• Advantage Database JDBC Driver (this gets installed in HOP)
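To confirm APOC is actually available on your Neo4j instance, you can run this from Neo4j Browser (apoc.version() is a standard APOC function):

// Returns the installed APOC version; this errors out if the APOC plugin is not loaded.
RETURN apoc.version() AS apocVersion;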

Getting your data into this system is only the first step.  Making the data usable, and using it to drive ACTION, is the real key!  If there's a problem, you want a ticket created with the proper details.  If there is interest I will follow up with further parts to this series:

• REST API for RangerMSP to allow automation of ticket/charge/asset/contact/opportunity creation

• An easy-to-use DASHBOARD report viewer that works with Neo4j to present visual and tabular reporting

• Additional data integration workflows for HOP 

• A ticket-queue building workflow based on outcomes of above integrations

...Next → Step-by-Step Installation & Setup

A knowledge graph for an MSP…

So I determined that I needed a knowledge graph for my MSP.  Here's the previous article explaining how and why I came to that conclusion.

You may be thinking, doesn't my PSA/CRM tool already do that for me?  Yes it does.

Well, sort of... but actually, no, not really.  It does have the data, and MUCH of it may even have foreign-key relationships (pretty much ALL PSA/CRM tools are backed by a traditional relational database system like SQL), but the data isn't presented in an easy-to-view relationship model, and it is very difficult to construct deeply complicated, multi-level-deep queries.

However, a proper knowledge graph database (like Neo4j in this case) allows you to write MUCH more deeply complex relationship queries than you can easily execute using traditional reporting.  It also permits you to ask questions about relationships to systems that your PSA/CRM tool has no integration with (once you start ingesting MORE data sources).

Here's the schema visualization of the RangerMSP data within our graph:

[Screenshot: RangerMSP Knowledge Graph schema]

How does this serve us in the real world?  Here's an example:

Because a new user's first impressions of any support organization are so crucial to getting buy-in and active participation, we want to ensure that a new user's problem is promptly being moved towards resolution.

We could enable that by making sure a new user doesn't have a support ticket languishing in help-desk purgatory.

We want to check whether any newly onboarded users at our clients are experiencing delays in getting progress on a ticket they have opened.  My question may be something like this:

Show me all clients with an active Managed user plan contract that has a user added in the last 60 days that has an open ticket (NOT set to waiting vendor/client/approval) that has NOT had a charge or change to the ticket in the last 48 hours.  Return the Account, client, open ticket, ticket manager, and the assets attached to the ticket.

To answer that with a traditional relational database would involve a very complex query with several joins, your DBA would hate you, and the performance of this query could be quite terrible (if your PSA tool even had the capability of answering this complex a query).   This is a walk-in-the-park for a knowledge graph!

Here's how that query looks in Cypher (Cypher Query Language):


with timestamp()-172800000 as twodays, timestamp()-5184000000 as sixtydays
MATCH (a:Company)<-[:CONTRACT_FOR]-(rc:Rangercontract)-[:CONTRACT_CATEGORY]->(ccat:Rangercontractcat {name:'User Plan'}) WHERE (rc)-[:CONTRACT_STATUS]->(:Rangercontractstatus {value:'A'})
MATCH (a)<-[:WORKS_FOR]-(c:Contact)<-[:TICKET_FOR]-(t:Ticket)-[]-(ts:Ticketstatus) WHERE ts.statcode < 800 and NOT ts.name IN ['Waiting Approval','Waiting Client','Waiting Vendor/Materials'] and t.updatedon<twodays and t.createdon<twodays AND c.createdon > sixtydays
OPTIONAL MATCH (t)-[:PART_OF_TICKET]-(ch:Rangercharge) WHERE (ch.createdon>twodays or ch.updatedon>twodays)
WITH * WHERE ch.recid IS NULL
MATCH (t)-[:MANAGED_BY]->(e:Rangeremployee)
OPTIONAL MATCH (ra:Rangerasset)-[:TICKET_ASSET]-(t)
RETURN a,c,t,ts,e,ra


The query results can be displayed visually so you can explore your data:
[Screenshot: exploring RangerMSP data visually]

Or it can be displayed in a more traditional tabular report by changing the RETURN method:

RETURN a.name as Company, c.name as Contact, t.number as Ticket, ts.name as Status, e.name as Manager, ra.name as Asset, t.problem as Description


[Screenshot: tabular query results]
While this alone provides some interesting query capabilities, where this gets TRULY game-changing is when you start to ingest information from your other various service providers.  Then you can ask VERY complex questions about service states, over/under-provisioned users, license consumption, consistency, etc.
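As a purely illustrative sketch of that kind of cross-source question: suppose an Office 365 ingest added :O365License nodes attached to contacts via a :HAS_LICENSE relationship (those two are my hypothetical labels, not part of the RangerMSP ingest).  An under-provisioning check might then look like this:

// HYPOTHETICAL example: :O365License and :HAS_LICENSE stand in for whatever an
// Office 365 ingest would create. Find contacts at companies with an active
// 'User Plan' contract who have no Office 365 license attached.
MATCH (a:Company)<-[:CONTRACT_FOR]-(rc:Rangercontract)-[:CONTRACT_CATEGORY]->(:Rangercontractcat {name:'User Plan'})
WHERE (rc)-[:CONTRACT_STATUS]->(:Rangercontractstatus {value:'A'})
MATCH (a)<-[:WORKS_FOR]-(c:Contact)
WHERE NOT (c)-[:HAS_LICENSE]->(:O365License)
RETURN a.name AS company, c.name AS contact;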


So what does the schema look like when you start ingesting data from many of those other sources?  Something like this:


[Screenshot: expanded knowledge graph schema example]

As you may imagine, you can answer complicated, deep-data-relationship questions with this knowledge graph schema.

Here's an example of what the connected-data knowledge graph looks like for a single user:


[Screenshot: real-world graph query for a single user]
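A view like that is easy to reproduce yourself.  Using the Contact label and name property from the schema above, something like this pulls one user's immediate neighborhood (substitute a real contact name from your own data):

// Show everything directly connected to a single contact (one hop out).
// 'Jane Doe' is a placeholder; replace it with an actual contact name.
MATCH (c:Contact {name:'Jane Doe'})-[r]-(n)
RETURN c, r, n;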


In the next section I'll review the prerequisites to start building a platform so you can design your own workflows and integrations.


Why do we need a knowledge graph?

This article began as an attempt to document some of the many development projects behind a knowledge graph for running the MSP I was a partner in.  I have since exited the partnership, and wanted to share some of the results of my knowledge graph project with others.

In 2011 I merged my IT consultancy business with a relatively new MSP startup.

I was responsible for the service delivery side of our business.  We sold several (too many, IMHO) customizable managed user service plans, as well as hardware, software, consulting and integration services.  Add to that private cloud hosting, and by 2016 we were starting to add public cloud (Office365 and later Azure).

This presented several challenges around tracking and accuracy:

• Do all the right users have all the right services provisioned?
• Are all the endpoints and servers added to RMM and AV-protected (with OS patching)?
• Are we billing for the right number of users/services/plans?
• Have we de-provisioned and offboarded services for terminated users?
• Are all mailboxes, endpoints and servers backed up?  Are they even IN the backup system?
• Did all critical infrastructure make it into our monitoring and NOC?

I knew that to execute a truly comprehensive solution to this problem you ultimately need to be able to easily answer 3 pretty basic questions:

Given a client, show me ALL THE USERS/DEVICES under plan, and a list (or better yet - create a ticket) for any services that are UNDERPROVISIONED or OVERPROVISIONED so that we can correct them.

Have any NEW users/devices shown up, and are they ALSO supposed to be on-plan?

Has the client TERMINATED any users/devices, and should they be REMOVED from service?

Sounds simple enough right?  The problem was very quickly apparent.  

We didn't have a single system that could answer those 3 questions.

To build a shared understanding, you would need a list of WHAT YOU ARE DELIVERING and WHAT IS EXEMPT so there could be collaboration between your service director, account manager, and client.

Even if you had a fully integrated PSA/RMM tool (which we didn't), would that system ALSO check to see that 3rd-party services were properly provisioned (Office, spam filter, backup systems, etc.)?  Seems unlikely.  Would that system even know if a particular device/user was subject to service add-ons or exemptions?

Our systems knew the client had a contract, but they didn't know WHICH users/devices were subscribed to the support plan, and which were exempt.

We knew how our service bundles were built (a user received an Office365 license, patch management, antivirus, email backup, infrastructure monitoring, and service desk support), but there was no easy way to ask: ARE all those services ACTUALLY provisioned?  It basically meant a lot of manual reporting, and by the time you completed an audit of ALL THOSE SERVICES for one client, it was already out-of-date, and the client had likely added/removed users or changed a-la-carte upgrades/downgrades.

Up to this point we had primarily relied on storing data within our clients' Active Directory to track users that were part of a plan.  We marked as exempt those accounts that weren't plan users.  Users' group memberships also helped us track licensing/subscriptions (we were doing some SPLA/multi-tenant hosting as well).

A script ran monthly that queried the client's directory, assembled an Excel billing/usage report, and uploaded it to us.

This helped us improve tracking and billing accuracy.  We could also compare usage against what we were billed by our vendors/distribution.  This helped, but it didn't provide the ability to verify service implementation.  For that we still relied on our implementation SOPs.  If something got missed, a manual audit would be required to discover the problem.  If an audit wasn't performed, we had to hope someone noticed (hopefully our employee, but often enough it could be our client).  Not ideal!

An interim solution was exporting lists of users/devices from our RMM discovery tool, and then importing that back into our asset management/PSA/CRM systems.  That DID help make for a more accurate auditing process, but it didn't dramatically improve the dynamic service trackability we were after.  It also did nothing to dynamically (or automatically) verify proper implementation.

At the end of 2017 we merged with another organization, and one of the principals joining us had extensive SQL Server development experience.  I had SQL experience myself, but it was primarily from the support/infrastructure/best-practices side, and I really didn't have any experience with database development or table architecture/design.

I thought this would be the perfect solution!  We could just build a data warehouse in SQL Server to answer all these questions!

At the time we were using RangerMSP, fka CommitCRM (PSA/ITAM/CRM), Managed Workplace for RMM (patching/AV/automation), and CheckMK (a Nagios-based NOC monitoring system).

So away we went: we ingested data from all 3 systems into a SQL database, and then began writing queries that would JOIN views together so we could discover discrepancies.

I almost immediately saw this was not going to work.  Just querying those 3 platforms required complex JOINs.  I knew a little T-SQL, but I was heavily reliant on our SQL expert to design the database and write the stored procedures and views.

I knew the relationships between users/devices/services/tickets very well.  My database expert was relatively new to managed services, so was looking to me to provide the layout of the data.  Unfortunately the joins were so complicated that I knew I was going to be unable to assist in any of the database design.  This was going to make the implementation very inefficient.

But hold on…  WAIT!  We only had 3 systems ingested so far.  We would also need to add Office 365, Azure, 3 backup providers, Active Directory, spam filtering, security training, another RMM vendor, QuoteWerks and VMware data into the warehouse!

Also, JOINs are slow - they are done EVERY time you query - and the MORE data there is, the more JOINs, and the slower the performance.

The complexity of adding all those to create a comprehensive view was staggering to even contemplate.

I was disillusioned.  At about the same time, a colleague of mine had been telling me about graph databases.  

I had read a bit about them and watched a few videos.  So when he suggested we attend a conference to learn more, I was all-in and attended the Neo4j GraphConnect conference in New York City in 2018.

Hearing about real-life use cases and seeing Neo4j graph databases in action convinced me.  I started writing Cypher code on my laptop on the plane ride home.  The knowledge graph project really started in earnest after this.

First - what is a knowledge graph?  If you aren't familiar, and for our purposes here, it is a representation of the network of devices/events/concepts and users that also shows the relationships between those items.  It is stored in a schema-less node/edge graph database.

If you have a need to integrate multiple structured, unstructured, and semi-structured sources of data from several sources, you need a knowledge graph.

Knowledge graphs are used as a solution to the problems presented by our data being siloed in the various systems we use to manage/support our clients in an MSP.

This allows you to ask complex relationship questions about your clients, their employees, tickets, charges, contracts and opportunities.  From there I use it to facilitate automation, validation workflows and other processes.
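For a small taste of what those questions look like in Cypher, here is a simple one using the labels and relationships from the RangerMSP schema shown in the follow-up post (open tickets per company, using the same statcode < 800 convention for "open"):

// Example question: how many open tickets does each company currently have?
MATCH (a:Company)<-[:WORKS_FOR]-(c:Contact)<-[:TICKET_FOR]-(t:Ticket)-[]-(ts:Ticketstatus)
WHERE ts.statcode < 800
RETURN a.name AS company, count(t) AS openTickets
ORDER BY openTickets DESC;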

…Continued in "A knowledge graph for an MSP"




Sunday, June 26, 2022

Bare bones Configure Apache Hop



Ok - there are in-depth technical overviews of Apache Hop.  This is not one of them.

The intention of this is to be a stupidly fast list of the minimum MUST-dos to get someone from zero to functioning with Apache Hop.

My primary goal was to get Apache Hop working well enough to follow some of my other tutorials.  It's JUST good enough to get it up and running.

It's also probably good enough if you already have decent ETL experience - especially if you have Pentaho/Kettle/PDI experience!

If you want some more in-depth background - with best practices and feature reviews please check out some of the following videos:

Pre-requisites
  1. SET your HOP_CONFIG_FOLDER

    Because you want no-muss, no-fuss Hop UPGRADES when new versions come out, and you don't want the binaries mixed in with your config/projects, you MUST set some Hop environment variables:

    DO NOT SKIP THIS STEP! - PLEASE AT LEAST set the HOP_CONFIG_FOLDER!

    No! this really isn't optional - set this somewhere fairly permanent, hopefully somewhere you are backing up.

    If you have a home directory (or network share) defined, you could use something like this:

     CMD SHELL 
    REM Create the config folder in your user profile and persist the environment variables
    MD "%homedrive%%homepath%\hop-config"
    SETX HOP_CONFIG_FOLDER "%homedrive%%homepath%\hop-config"
    SETX HOP_JAVA_HOME "%java_home%"
    
    

    I wanted my Hop config & projects to move with me between my laptop and my work/home PCs, so I used my Microsoft Office365 OneDrive (which also backs it up to a cloud drive).  This path also (unfortunately) had spaces in the name, so you HAVE to quote it out like this:

    CMD SHELL 
    REM Create a hop-config folder inside OneDrive and persist the environment variables
    cd /d %onedrive%
    MD hop-config
    SETX HOP_CONFIG_FOLDER "%ONEDRIVE%\hop-config"
    SETX HOP_JAVA_HOME "%java_home%"

  2. Edit Hop launch scripts (if required)

    If you had to quote your hop-config path (or any other Hop environment variables that you configured), you will ALSO need to add the quoting within the script/batch file you use to launch Hop:
Simply EDIT those scripts and add quotes around any of the environment variables that have spaces in their paths.  This is why it is BEST to use paths without spaces; quoting will work, but it will REQUIRE you to edit these scripts EVERY time you upgrade the Apache Hop binaries.

  3. HOP_OPTIONS Additional Configuration

    HOP_OPTIONS lets you configure things like memory.  These can be set in the launch scripts above (hop-gui, hop-run, hop-server etc).

    Most commonly you may need to increase the JVM Heap size to accommodate larger data sets.

    For example: 

    HOP_OPTIONS=-Xmx512m: start Hop with maximum 512MB of memory
    HOP_OPTIONS=-Xmx2048m: start Hop with maximum 2048MB (or 2GB) of memory
    HOP_OPTIONS=-Xmx4g: start Hop with maximum 4GB of memory

  4. Set your AES Encryption Key

    Security 101 - NO CLEAR-TEXT PASSWORDS in your workflows/scripts/projects, right?
    Apache Hop has built-in AES encryption capabilities for securing your data source passwords.

    Set your AES key environment variable before launching any of your Hop tools.
    Here are a couple of examples of setting your AES variable on a Windows computer:

     POWERSHELL 
    # Use the AES password encoder and set the (example) AES key for the current session
    $env:HOP_PASSWORD_ENCODER_PLUGIN="AES"
    $env:HOP_AES_ENCODER_KEY="ddsfsdfsfsdf"

    or

    CMD SHELL 
    REM Persist the AES encoder plugin and (example) AES key for future sessions
    SETX HOP_PASSWORD_ENCODER_PLUGIN AES
    SETX HOP_AES_ENCODER_KEY ddsfsdfsfsdf

     In the examples above, I'm just manually setting the AES key.  A better (and more secure) way would be to retrieve your key programmatically (via an API call) from a secrets VAULT, password manager, access-controlled file/folder, or a secure-string encrypted registry key (you get the idea).

  5. Add any custom JDBC drivers.

    Hop already comes with drivers for many of the most common data sources, so you may not need to do this.  I am working on a project for a CRM/ERP tool that uses the somewhat less common "Advantage SQL Database" (a Sybase/SAP product), so I had to download the Advantage JDBC driver, extract the contents, and place the adsjdbc.jar file into the /lib folder of my Apache Hop installation.

There you have it.  Now you should be ready to get Apache Hop up and running for your first project!

For additional help - check out: