GraphCommit

Wednesday, October 8, 2025

Docker running Neo4j with traefik routing browser (:7474) and Bolt (:7687) using cloudflare ddns and letsencrypt (cloudflare and letsencrypt configs are not shown)

Here is my traefik / dynamic / neo4j-mydomain-bolt.yaml file:

tcp:
  routers:
    mydomain-neo4j-bolt-router:
      entryPoints:
        - bolt
      rule: "HostSNI(`bolt.mydomain.com`)"
      service: mydomain-neo4j-bolt-service
      tls:
        certResolver: cloudflare

  services:
    mydomain-neo4j-bolt-service:
      loadBalancer:
        servers:
          - address: "172.16.5.74:7687"

here is my traefik/dynamic/neo4j-browser-mydomain.yaml file:

http:
  routers:
    sandbox-neo4j-browser-ui-router:
      entryPoints:
        - websecure
      rule: "Host(`sandbox.mydomain.com`)"
      service: sandbox-neo4j-browser-service
      tls:
        certResolver: cloudflare

  services:
    sandbox-neo4j-browser-service:
      loadBalancer:
        servers:
          - url: "http://172.16.5.74:7474"

my /traefik.yml file:

entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"
    http:
      tls:
        options: tls-options
  bolt:
    address: ":7687"

tls:
  options:
    tls-options:
      minVersion: VersionTLS12

api:
  dashboard: true
  insecure: false

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"
    exposedByDefault: false
  file:
    directory: /etc/traefik/dynamic
    watch: true

log:
  level: DEBUG

certificatesResolvers:
  cloudflare:
    acme:
      email: me@mydomain.com
      storage: /acme.json
      dnsChallenge:
        provider: cloudflare
        delayBeforeCheck: 0
        resolvers:
          - "1.1.1.1:53"
          - "8.8.8.8:53"

And Finally my docker-compose.yaml for the neo4j server:

x-neo4j-common: &neo4j-common
  image: neo4j:5.26.6
  restart: unless-stopped
  environment:
    NEO4J_AUTH: "neo4j/mysecretpassword"
    NEO4J_server_memory_heap_initial__size: "512m"
    NEO4J_server_memory_heap_max__size: "1g"
    NEO4J_PLUGINS: "[\"apoc\"]"
    NEO4J_dbms_security_procedures_unrestricted: "apoc.*"
    NEO4J_dbms_security_procedures_allowlist: "apoc.*"
    NEO4J_apoc_import_file_enabled: true
    NEO4J_apoc_export_file_enabled: true
    NEO4J_apoc_import_file_use__neo4j__config: true
  networks:
    - internal-net

services:
  sandbox:
    <<: *neo4j-common
    container_name: neo4j-sandbox
    environment:
      NEO4J_server_memory_heap_initial__size: "4g"
      NEO4J_server_memory_heap_max__size: "8g"
      NEO4J_server_memory_pagecache_size: "16g"
      NEO4J_dbms_security_procedures_unrestricted: "apoc.*"
      NEO4J_dbms_security_procedures_allowlist: "apoc.*"
      NEO4J_apoc_import_file_enabled: true
      NEO4J_apoc_export_file_enabled: true
      NEO4J_apoc_import_file_use__neo4j__config: true
    ports:
      - "172.16.5.74:7474:7474"   # HTTP
      - "172.16.5.74:7687:7687"   # Bolt
    volumes:
      - /volume1/docker/neo4j-stack/neo4j/sandbox/data:/data
      - /volume1/docker/neo4j-stack/neo4j/sandbox/import:/import
      - /volume1/docker/neo4j-stack/neo4j/sandbox/plugins:/plugins
      - /volume1/docker/neo4j-stack/neo4j/sandbox/backups:/backups
      - /volume1/docker/neo4j-stack/neo4j/sandbox/conf:/var/lib/neo4j/conf

networks:
  internal-net:

Wednesday, February 15, 2023

NeoDash on Azure Web App for Containers

I was a little late to the #NeoDash game. I'm not quite sure how this one got past me, but I discovered NeoDash at GraphConnect 2022 on the last session on the last day, and it was by far the best gem I mined from the conference! If you haven't used it yet, it is an extremely easy low-code method for delivering dashboards and visualizations for #Neo4j.

At my previous job, I linked the dashboard site within MS Teams channels for various project/reporting reviews, and it put the power of our Neo4j knowledge graph right at my colleagues' fingertips!

Getting started has been made very easy. There is even a free cloud hosted version that you can use at neodash.graphapp.io This is a great way to create your dashboard configurations with virtually NO installation - by far the easiest method if you are already using #AuraDB cloud hosted Neo4j. NeoDash supports both an on-premise Neo4j server and/or connecting directly to an AuraDB cloud instance. If you are self-hosting there would be more complex firewall configurations to use the graphapp.io hosted NeoDash - so in those cases (your private/public cloud tenant, or hosting on a local Docker container), using your own NeoDash installation may be more practical.

If you are hosting this in production you likely want to host your own dedicated standalone version of the dashboard (which turns off the editor and lets you publish the dashboard configurations for read-only access.) I wanted to use the docker container and host it on a public cloud provider for a non-profit client that already had a Microsoft #Azure sponsorship subscription. I had not yet deployed anything using Azure Web App for Containers before this project, but it appears to be an easy and cost-effective Azure hosting method for this type of application.

The configuration is pretty straight-forward. The biggest challenge I came across was determining how and which docker parameters needed to be configured and how best to assign them. I prefered to accomplish this using the Azure CLI powershell commands.

NeoDash has excellent guides and instructions, and I found everything I needed in the developer guide for running the docker instance.

I put this guide together to help get you up and running quickly.

With that - let's get started!

Pre-requisites

Microsoft Azure Subscription
Neo4j Database
Azure CLI PowerShell module (optional)
Some relevant Cypher queries for your reports
a preconfigured NeoDash dashboard
These instructions are to publish a standalone NeoDash dashboard - and assumes you have already CREATED the dashboard config
How to create a dashboard with NeoDash

Navigate to:
Azure Console → Create a resource → Search for "Web app for containers"

Click "Create"
Create Web App - 'Basics' tab options:
Subscription: <choose your Azure subscription>
Resource group <choose existing or Create new>
Name: <give your Web App a userful name>
Publish: Docker Container
Operating System: Linux
Region: <choose your region>
Linux Plan: <choose your plan>
Pricing Plan:

Huzaah! There's even a Free plan! - this is CPU usage limited, so based on your needs you may need to upgrade this, but a great way to test this out.

Skip the Deployment tab
Select the "Docker" Tab

Image and tag: neo4jlabs/neodash
Image Source: Docker Hub
(unless you have created your own image in a repo)

Access Type <Public or Private>
Startup Command: <leave blank>

Warning: If you pull neo4jlabs/neodash from the Docker Hub image source you will always pull the latest version. If you wish to TEST a new version of the dashboard before deploying into production, you may want to reference a specific TAGGED VERSION or create your own custom image BASED ON a particular version, and then store it on your own private Azure Container Registry. Those procedures are beyond the scope of this article.
Skip the remaining tabs

Move to "Review + create" to review your selections

If you are satisfied with the selections click "Create"

Deployment may take a few minutes to complete.
Once your deployment is complete click the "Go to resource" button.

Before you launch the container, you first need to configure Web App for Containers to use the proper ports, and also pass along parameters that NeoDash needs to identify your Neo4j database, and your dashboard parameters.

For more details see the Configuration section of the Developer Guide.

The settings we will be applying are listed here:
WEBSITES_PORT=5005
standalone=true
standaloneProtocol="neo4j+s"
standaloneHost="myhostname.databases.neo4j.io"
standalonePort="7687"
standaloneDatabase="neo4j"
standaloneUsername="neo4j"
standalonePassword="mysupersecretpassword"
standaloneDashboardName="My Custom Dashboard"
standaloneDashboardDatabase="neo4j"
ssoEnabled=false

You will notice I have ssoEnabled set to 'false' - this is because in this demo I am utilizing the "free AuraDB", and with that comes the limitation of a single user (no DB permissions capabilities). This IS NOT RECOMMENDED for production dashboards! Security practices would demand you have a LIMITED access and READ ONLY at minimum, and SSO / specific permissions would be best case.

This can be accomplished in 2 ways. Personally I prefer the CLI, but either works:

Web UI steps:
In the UI navigate to Configuration → Application Settings →
+ New application setting.
You can pass along your config by adding entries and their values here:

Azure CLI steps:

If you prefer the Azure CLI

From PowerShell:
If you haven't installed Azure CLI yet:
POWERSHELL

Install-Module -Name Az -Scope CurrentUser -Repository PSGallery -Force

Now apply your settings:
POWERSHELL

az webapp config appsettings set --resource-group "myresourcegroupname" --name "My-web-app-name" --settings WEBSITES_PORT=5005 standalone=true standaloneProtocol="neo4j+s" standaloneHost="myhostname.databases.neo4j.io" standalonePort="7687" standaloneDatabase="neo4j" standaloneUsername="neo4j"  standalonePassword="mysupersecretpassword" standaloneDashboardName="My Custom Dashboard" standaloneDashboardDatabase="neo4j" ssoEnabled=false

Navigate to the "Overview" screen.
This is where you can Start/Stop your web app container. It will also show the URL to use to access the page.
Troubleshooting:

There are a few places to check if the container isn't launching correctly. The first is to Navigate to Deployment Center → Logs
This will show the logs of downloading and launching the container.

Now you should be ready to share your NeoDash reports from your Azure Cloud environment!

Just one other item to review:

Authorization (for both the webapp and NeoDash):

Ideally you will have the licensed version of Neo4j, and then you can use a limited rights (or at least read-only) db credential for NeoDash to limit access. If not, keep in mind that NeoDash would otherwise be accessing your database with FULL read/write access.

Also - you can leverage Microsoft authentication to limit WHO can open this webapp (regardless of your Neo4j db permissions). The organization I configured this for uses Office365, so I required authentication via Office365, and additionally limited the webapp to specific users.

Thanks for reading, and I welcome your comments and feedback!

For additional help - check out these resources:

Tuesday, October 18, 2022

RangerMSP KnowledgeGraph Installation

Click here for an Overview of Prerequisites

You must SHARE your RangerMSP (CommitCRM) folder. You will need to grant both FILE and SHARE permissions to whichever account you run the data ingestion from.

Install Apache HOP:

Download the Advantage JDBC driver, then copy the JDBC Advantage adsjdbc.jar file from C:\Program Files (x86)\Advantage 11.10\JDBC into in the /lib folder of your HOP installation.

Launch HOP

Tools → Options

Select the Projects tab

Select the field "GUI: The standard project folder proposed when creating projects"

Choose a top level folder for your projects - recommend this NOT be in the folder where you installed HOP - This makes it less hassle when you upgrade to a newer version of HOP in the future

Also I recommend a folder that is backed up (or even better yet, you can version control your HOP projects with the built-in GIT integration!)

Clone the GITHUB repository:

Launch your GIT shell
Navigate to the folder you designated for your PROJECTS folder.

git clone https://github.com/pdrangeid/RangerMSP-neo4j-ingestion

This will create a folder in hop-projects named 'RangerMSP-neo4j-ingestion

There are 3 configuration files needed for this project I have provided samples that you will need to mofify for your environment:

• Environment Configuration

• (2) Datasource configs (1 for the Neo4j database, 1 for your RangerMSP Advantage SQL Database)

Download the sample environment configuration file and save it in your environments folder for this new project.

I store my environment configurations in their own folder structure SEPARATE from my projects (this segments configurations from the actual pipeline/workflow code to allow for easier upgrades in the future) .

ie: /documents/hop-environments/rangermsp-neo4j/

Here are sample metadata config files:

Neo4j - Save this within the /metadata/neo4j-connection folder

Neo4j cloud database connection example download link

Neo4j local database connection example download link

and

Download RangerMSP server connection example download link

Save this within the /metadata/rdbms folder

Launch HOP

Under Projects →

Click the p+ icon to ADD a project

Give it a name: RangerMSP-neo4j

for HOME folder browse to the GIT repository you previously cloned:

After OPENING the project - Allow HOP to Add a default run configuration

Also allow it to create a local workflow run configuration

Also allow it to Create a lifecycle environment. You will need to provide some parameters (explained below)

Provide a name and purpose for the environment

Click Select, then choose the environment config file you downloaded previously

Adjust Configuration Parameters. Click "Edit". Make adjustments to any of the import parameters for your environment.

This is particularly important if you are using the free cloud hosted version of the neo4j AuraDB, as it is limited to 200k nodes, and 400k relationships.

Next you will need to modify the datasource properties for both Neo4j and RangerMSP:

On the left side of HOP click the "Metadata" explorer.

Select auradb-hopdev1 by doubleclicking (this example configuration is for Neo4j Cloud hosted auradb)

Edit the server hostname, and change the user/password with credentials you were provided when signing up for your Neo4j AuraDB cloud instance.

Click TEST and ensure you can connect to the neo4j database.

Now do the same for your RangerMSP Advantage SQL connection.

Again from the "Metadata" explorer selectRangerMSP by double-clicking.

In my sample it is assuming the rangerMSP database is running on the localhost (127.0.0.1) but you can change this to your server's IP.

Also change the sharename if it is different from what I've used (RangerMSP$)

Click Test to verify connectivity.

Now it's time to ingest Ranger data!

From File Explorer open the HOP Workflow "RangerMSP-Refresh-Data.hwf"

Now RUN the workflow to import data. Any errors will be shown below:

Ok - this is where I need some help & feedback. I've used these software packages, and tried to document the process step-by-step, but there are enough moving pieces that I'm sure there are some gotchas somewhere in the process.

What are some errors you encountered?

Missing instructions, or common errors you ran into?

RangerMSP Knowledge Graph requirements

This post covers the pre-requisite packages you will need to build your Knowledge Graph for RangerMSP. For some background see the previous post discussing MSP Knowledge Graphs

Before we begin - you SHOULD be aware recently there have begun to develop some commercial offerings that can provide functionality like this. They have full web user interfaces, and much prettier reporting features, and MANY more integrations than I've built. An example of this is https://www.liongard.com/ When I started this project these products were just getting their start and were NOT very mature. Also the platform I've developed is built entirely using open source tools, can be completely customized and self-hosted which lets you avoid the pay-per-endpoint-per-month service model. So there is that. Ok - enough with the disclaimer - let's get to it!

Let's talk about the tools you will need:

• RangerMSP SQL edition

This system was built using the FULL Advantage SQL Server version of RangerMSP. The other tools used will not work without this because the JDBC data loader is not available for the non-SQL local database.

• You must SHARE your RangerMSP (CommitCRM) folder. You will need to grant both FILE and SHARE permissions to whichever account you run the data ingestion (HOP) workflows from.

The REAL magic of this system is the next 2 tools. Neo4j Graph database and HOP orchestration Platform:

Neo4j is the graph database that unleashes the power of the relationships between your data. This will house the knowledge graph database.

• For the Neo4j Database - You have a multiple options:

Cloud hosted (AuraDB): Free edition is limited to 200k nodes, 400k relationships:

Download here Self hosted: (community Edition is free)

You could also use the desktop edition if you want full enterprise functionality (note this cannot act as a shared multi-user server though - use Aura or community edition for that)

If you are installing a self-hosted Neo4j database I have instructions here.

Apache HOP is the modern evolution of the popular ETL (Extract, Transform and Load) tool Pentaho Data Integration (Kettle). This allows us to build workflows and perform API calls that import, massage, and deliver the data and relationships that will then be stored within Neo4j.

Apache Hop
Installation instruction

Download this and install it on a pc or server (requires JAVA). You can install it on the same computer as your neo4j database if you like.

Additional Requirements:

• Neo4j requires the APOC libraries (automatically included in auraDB)

• Advantage Database JDBC Driver (this gets installed in HOP)

Getting your data into this system is only the first step. Making the data usable, and to elicit ACTION is the real key! If there's a problem, you want a ticket created with the proper details. If there is interest I will follow-up with further parts to this series:

• REST API for RangerMSP to allow automation of ticket/charge/asset/contact/opportunity creation

• An easy-to-use DASHBOARD report viewer that works with Neo4j to present visual and tabular reporting

• Additional data integration workflows for HOP

• A ticket-queue building workflow based on outcomes of above integrations

...Next → Step-by-Step Installation & Setup

A knowledge graph for an MSP…

So I determined that I needed a knowledge base for my MSP. Here's the previous article explaining how and why I came to that conclusion.

You may be thinking, doesn't my PSA/CRM tool already do that for me? Yes it does.

Well, sort-of.... But actually, No not really. It does have the data, and MUCH of it may even have foreign key relationships (Pretty much ALL PSA/CRM tools are backed by a traditional relational database system like SQL), but not in an easy to view relationship model, and very difficult to construct deeply complicated, multi-level-deep queries.

However a proper knowledge graph database (like Neo4j in this case) allows you to write MUCH more deeply complex relationship queries than you can easily execute using traditional reporting. It also permits you to ask questions about relationships to systems that your PSA/CRM tool has no integration with. (once you start ingesting MORE data sources)

Here's the schema visualization of the RangerMSP data within our graph:

How does this serve us in the real world? Here's an example:

Because first impressions with a new user with any support organization are so crucial to getting buy-in and active participation we want to ensure that a new user is promptly getting their problem moved towards resolution.

We could enable that by making sure a new user doesn't have a support ticket languishing in help-desk purgatory.

We want to check to see if any newly onboarded users for our clients are having delays in getting progress for a ticket they have opened. My question may be something like this:

Show me all clients with an active Managed user plan contract that has a user added in the last 60 days that has an open ticket (NOT set to waiting vendor/client/approval) that has NOT had a charge or change to the ticket in the last 48 hours. Return the Account, client, open ticket, ticket manager, and the assets attached to the ticket.

To answer that with a traditional relational database would involve a very complex query with several joins, your DBA would hate you, and the performance of this query could be quite terrible (if your PSA tool even had capability of answering this complex query). This is a walk-in-the-park for a knowledge graph!

Here's how that query looks in cql (Cypher Query Language):

with timestamp()-172800000 as twodays,timestamp()-5184000000 as sixtydays

MATCH (a:Company)←[:CONTRACT_FOR]-(rc:Rangercontract)-[:CONTRACT_CATEGORY]→(ccat:Rangercontractcat {name:'User Plan'}) WHERE (rc)-[:CONTRACT_STATUS]→(:Rangercontractstatus {value:'A'})

MATCH (a)←[:WORKS_FOR]-(c:Contact)←[:TICKET_FOR]-(t:Ticket)-[]-(ts:Ticketstatus) WHERE ts.statcode <800 and NOT ts.name IN ['Waiting Approval','Waiting Client','Waiting Vendor/Materials'] and t.updatedon<twodays and t.createdon<twodays AND c.createdon > sixtydays

OPTIONAL MATCH (t)-[:PART_OF_TICKET]-(ch:Rangercharge) WHERE (ch.createdon>twodays or ch.updatedon>twodays)

WITH * WHERE ch.recid IS NULL

MATCH (t)-[:MANAGED_BY]→(e:Rangeremployee)

OPTIONAL MATCH (ra:Rangerasset)-[:TICKET_ASSET]-(t)

RETURN a,c,t,ts,e,ra

The query results can be displayed visually so you can explore your data:

Or it can be displayed in a more traditional tabular report by changing the RETURN method:

RETURN a.name as Company, c.name as Contact, t.number as Ticket ,ts.name as Status, e.name as Manager, ra.name as Asset, t.problem as Description

While this can provide for some interesting query capabilities, where this gets TRULY game-changing is when you start to ingest OTHER information from your other various service providers. Then you can ask VERY complex questions about service states, over/under provisioned users, license consumption, consistency, etc.

So what does the Schema look like when you start ingesting much of the data above? Something like this:

As you may imagine, you can answer complicated deep-data-relationship questions with this knowledge graph schema

Here's an example of what connected data knowledge graph looks like for a single user:

In the next section I'll review the pre-requisites to start to build a platform so you can design your own workflows and integrations.

Why do we need a knowledge graph?

This article began as an attempt to document some of the many development projects of a knowledge graph for running the MSP I was partnered at. I have since exited the parnership - and wanted to share some of the results of my knowledge graph project with others.

In 2011 I merged my IT consultancy business with a relatively new MSP startup.

I was responsible for the service delivery side of our business. We sold several (too many IMHO) customizable managed user service plans, as well as hardware, software, consulting and integration services. Add to that private cloud hosting, and by 2016 we were starting to add public cloud (Office365 and later Azure)

This presented several challenges around tracking and accuracy:

• Do all the right users had all the right services provisioned
• Are all the endpoints and servers added to RMM, and AV protected (with OS patching)
• Are we billing for the right number of users/services/plans
• Have we de-provisioned and offboarded services for terminated users
• Are all mailboxes, endpoints and servers backed up? Are they even IN the backup system?
• Did all critical infrastructure make it into our monitoring and NOC?

I knew that to execute a truly comprehensive solution to this problem you ultimately need to be able to easily answer 3 pretty basic questions:

Given a client, show me ALL THE USERS/DEVICES under plan, and a list (or better yet - create a ticket) for any services that are UNDERPROVISIONED or OVERPROVISIONED so that we can correct them.

Have any NEW users/devices shown up, and are they ALSO supposed to be on-plan?

Has the client TERMINATED any users/devices, and should they be REMOVED from service?

Sounds simple enough right? The problem was very quickly apparent.

We don't have a single system that can answer those 3 questions.

To build a shared understanding, you would need a list of WHAT YOU ARE DELIVERING and WHAT IS EXEMPT so there could be collaboration between your service director, account manager, and client.

Even if you had a fully integrated PSA/RMM tool (which we didn't) would that system ALSO check to see that 3rd party services were properly provisioned? (Office, spam filter, backup systems, etc.) - seems unlikely. Would that system even know if a particular device/user was subject to service add-ons, or exemptions?

Our systems knew the client had a contract, but it doesn't know WHICH users/devices were subscribed to the support plan, and which were exempt.

We knew how our service bundles were built (User received Office365 license, patch management, antivirus, email backup, infrastructure monitoring, and service desk support) but there was no easy way to ask - ARE all those services ACTUALLY provisioned. It basically meant a lot of manual reporting, and by the time you completed an audit for one client of ALL THOSE SERVICES, it was already out-of-date, and the client likely has added/removed users or changed a-la-carte upgrades/downgrades.

Up to this point we had primarily relied on storing data within our clients' active directory to track users that were part of a plan. We marked exempt those accounts that weren't plan users. Also users group membership helped us track licensing/subscription (we were doing some SPLA/multi-tenant hosting as well).

A script ran monthly that queried the clients directory, assembled an Excel billing/usage report, and uploaded it to us.

This helped us improve tracking and billing accuracy. We could also compare usage vs what we were billed from our vendors/distribution. This helped, but it didn't provide the ability to verify service implementation. For that we still relied on our implementation SOPs. If something got missed a manual audit would be required to discovered the problem. If an audit wasn't performed, we had to hope someone noticed (hopefully our employee, but often enough that could be our client). Not ideal!

An interim solution was exporting lists of users/devices from our RMM discovery tool, and then using that to import back into our asset management/PSA/CRM systems. That DID help make for an more accurate auditing process. But didn't dramatically improve the dynamic service trackability we were after. It also did nothing to dynamically (or automatically) verify proper implementation.

At the end of 2017 we merged with another organization, and one of the principals joining us had extensive SQL Server development experience. I had SQL experience myself, but it was primarily from the support/infrastructure/best practices side, And I really didn't have any experience with database development or table architecture/design.

I thought this would be the perfect solution! We could just build a data warehouse in SQL server to answer all these questions!

At the time we were using RangerMSP fka CommitCRM (PSA/ITAM/CRM), Managed Workplace for RMM (patching/AV/automation), and CheckMK (a Nagios based NOC monitoring system).

So away we went - ingested data from all 3 systems into a SQL database, and then began writing some queries that would JOIN views together so we could discover discrepancies.

I almost immediately saw this was not going to work. Just to query those 3 platforms required complex JOINs. I knew a little TSQL, but I was heavily reliant on our SQL expert to design the database, and write the stored procedures and views.

I knew the relationships between users/devices/services/tickets very well. My database expert was relatively new to managed services, so was looking to be to provide the layout of the data. Unfortunately the joins were so complicated that I knew I was going to be unable to assist in any of the database design. This was going to make the implementation very inefficient.

But hold-on… WAIT! - We only had 3 systems ingested so far. We would also need to add in Office 365, Azure, 3 backup providers, active directory, spam filtering, security training, another RMM vendor, QuoteWerks and Vmware data into the warehouse!

Also - JOINS are slow - they are done EVERY time you query - and the MORE data, the More JOINs, and the slower the performance.

The complexity of adding all those to create a comprehensive view was staggering to even contemplate.

I was disillusioned. At about the same time, a colleague of mine had been telling me about graph databases.

I had read a bit about them and watched a few videos about it. So when he suggested we attend a conference to learn more, I was all-in and attended the Neo4j GraphConnect conference in New York City in 2018.

Hearing about real-life use cases and seeing Neo4j graph databases in action convinced me. I started writing Cypher code on my laptop on the plane ride home. The knowledge graph project really started in earnest after this.

First - What is a knowledge graph? If you aren't familiar, and for our purposes here, it is a representation of the network of devices/events/concepts and users that also shows the relationship between these items. It is stored in a schema-less node/edge graph database.

If you have a need to integrate multiple structured, unstructured, and semi-structured sources of data from several sources, you need a knowledge graph.

Knowledge graphs are used as a solution to the problems presented by our data being siloed in the various systems we use to manage/support our clients in an MSP.

This allows you to ask complex relationship questions about your clients, their employees, tickets, charges, contracts and opportunities. From there I use this to facilitate automation, validation workflows and other processes.

…Continued in "A knowledge graph for an MSP"

Sunday, June 26, 2022

Bare bones Configure Apache Hop

Ok - there are in-depth technical overviews of Apache Hop. This is not one of them.

The intention of this is a stupidly fast, minimum MUST-dos to get someone from zero-to-functioning with Apache Hop.

My primary goal of this was to get Apache Hop working well enough to follow some of my other tutorials. It's JUST good-enough to get it up and running.

It's also probably good enough if you already have decent ETL experience - especially if you have Pentaho/Kettle/PDI experience!

If you want some more in-depth background - with best practices and feature reviews please check out some of the following videos:

Pre-requisites

System Requirements
Apache Hop requires JAVA Runtime 11. (JDK will work as well)
Latest Apache Hop version
Extract it to a folder somewhere.

SET your HOP_CONFIG_FOLDER

Because you want to have no-muss, no-fuss Hop UPGRADES when new version come out, and not worry about the binaries mixed in with your config/projects you MUST set some Hop environment variables:

DO NOT SKIP THIS STEP! - PLEASE AT LEAST set the HOP_CONFIG_FOLDER!

No! this really isn't optional - set this somewhere fairly permanent, hopefully somewhere you are backing up.

Perhaps a home directory (or network share) defined you could use something like this:

CMD SHELL
```
MD %homedrive%%homepath%hop-config
SETX HOP_CONFIG_FOLDER %homedrive%%homepath%hop-config
SETX HOP_JAVA_HOME "%java_home%"
```
I wanted my Hop config & Projects to move with me between my laptop, work/home PC, so I used my Microsoft Office365 OneDrive. (which also then backs it up to a cloud drive) This path also (unfortunately) had spaces in the name, so you HAVE to quote it out like this:

CMD SHELL
```
cd %onedrive%
MD hop-config
SETX HOP_CONFIG_FOLDER "%ONEDRIVE%\hop-config"
SETX HOP_JAVA_HOME "%java_home%"
```
Edit Hop launch scripts (if required)

If you had to quote your hop-config (or any other Hop environment variables that you configured). You will also need to ALSO add the quoting within the script/batch file you use to launch Hop:

Simply EDIT those scripts, and add quotes around any of the Environment variables that have spaces in the paths. This is why it is BEST to use paths without spaces, but this will work but will ALSO REQUIRE you to edit these scripts EVERY time you upgrade the Apache Hop binaries.

HOP_OPTIONS Additional Configuration

HOP_OPTIONS let you configure things like memory. These can be set in the launch scripts above (hop-gui, hop-run, hop-server etc).

Most commonly you may need to increase the JVM Heap size to accommodate larger data sets.

For example:

HOP_OPTIONS=-Xmx512m: start Hop with maximum 512MB of memory
HOP_OPTIONS=-Xmx2048m: start Hop with maximum 2048MB (or 2GB) of memory
HOP_OPTIONS=-Xmx4g: start Hop with maximum 4GB of memory
Set your AES Encryption Key

Security 101 - NO CLEAR_TEXT PASSWORDS in your workflows/script/projects right?
Apache Hop has built-in AES encryption capabilities for securing your data source passwords.

Set your AES key environment variable before launching any of your Hop tools.
Here's a couple examples to set your AES variable on a Windows computer:

POWERSHELL
```
$env:HOP_PASSWORD_ENCODER_PLUGIN="AES"
$env:HOP_AES_ENCODER_KEY=ddsfsdfsfsdf
```
or

CMD SHELL
```
SETX HOP_PASSWORD_ENCODER_PLUGIN AES
SETX HOP_AES_ENCODER_KEY ddsfsdfsfsdf
```
In the examples above, I'm just manually setting the AES key. A better (and more secure way) would be to retrieve your key programmatically (via API call) from a secrets VAULT, password manager, access controlled file/folder or a secure-string encrypted registry key (you get the idea).
Add any custom JDBC drivers.

Hop already comes with drivers for many of your most common data sources, so you may not need to do this. I am working on a project for a CRM/ERP tool that uses a little less common "Advantage SQL Database" (a Sybase/SAP product). So I had to go download the Advantage JDBC jar file, extract the contents, and place the adsjdbc.jar file into the /lib folder of my Apache Hop installation.

There you have it. Now you should be ready to get Apache Hop up and running for your first
project!

For additional help - check out: