Confluence OnPrem Connector (COPC)



Solution Overview

The COPC is a standalone, Dockerized Node.js application that runs on a configurable schedule and synchronizes the Confluence OnPrem data with the Unique FinanceGPT service.

The COPC uses the Confluence REST API to fetch the data and the Unique Ingestion API to ingest the data into the FinanceGPT chat.

Confluence users can use the label functionality of Confluence to determine which pages should get ingested.

There are two labels to choose from that indicate if a page should be synced with FinanceGPT:

  • ai-ingest:
    This label will sync the labeled page.

  • ai-ingest-all:
    This label will sync the labeled page and all its sub-pages (recursively).

Pages that had their label removed will be deleted from the chat with the next sync.

The label names (ai-ingest and ai-ingest-all) can be changed via the env file.
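
The effect of the two labels can be sketched as follows. This is a minimal illustration, not the COPC's actual code: the Page shape, the pagesById lookup, and the function name collectPagesToSync are hypothetical and only serve to show how ai-ingest selects a single page while ai-ingest-all selects a page plus all of its descendants.

```typescript
// Hypothetical page shape; the real COPC data model is not shown in this document.
interface Page {
  id: string;
  labels: string[];
  childIds: string[];
}

// Default label names (configurable via INGEST_SINGLE_LABEL and INGEST_ALL_LABEL).
const INGEST_SINGLE_LABEL = "ai-ingest";
const INGEST_ALL_LABEL = "ai-ingest-all";

// Returns the sorted IDs of all pages that the two labels select for syncing.
function collectPagesToSync(pagesById: Record<string, Page>): string[] {
  const selected: Record<string, boolean> = Object.create(null);
  const expanded: Record<string, boolean> = Object.create(null); // subtrees already walked

  Object.keys(pagesById).forEach((pageId) => {
    const page = pagesById[pageId];
    if (page.labels.indexOf(INGEST_SINGLE_LABEL) !== -1) {
      selected[page.id] = true; // only the labeled page itself
    }
    if (page.labels.indexOf(INGEST_ALL_LABEL) !== -1) {
      // Walk the labeled page and all of its descendants iteratively.
      const stack: string[] = [page.id];
      while (stack.length > 0) {
        const id = stack.pop() as string;
        if (expanded[id]) continue; // already handled; also guards against cycles
        expanded[id] = true;
        selected[id] = true;
        const current = pagesById[id];
        if (current) stack.push(...current.childIds);
      }
    }
  });

  return Object.keys(selected).sort();
}
```

A page that carries neither label and has no ai-ingest-all ancestor is not part of the result, which matches the rule that pages without a label are removed with the next sync.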


The COPC uses a Confluence service user to make API requests. It is recommended to create this service user specifically for the COPC and to grant it the appropriate access rights to pages and spaces.

The COPC uses the following CQL (Confluence Query Language) query to get the pages that should be synced:
cql=(label="ai-ingest") OR (label="ai-ingest-all")&expand=metadata.labels,version&os_authType=basic&limit=${limit}&start=${start}
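
As a rough sketch, the query string above could be assembled like this. The function name buildSearchUrl and its parameters are hypothetical, and the path /rest/api/content/search is an assumption based on the standard Confluence REST API for CQL content searches; this document does not state which path the COPC calls.

```typescript
// Builds the Confluence content search URL for one page of results.
// Assumption: the CQL search is served at /rest/api/content/search.
function buildSearchUrl(
  confluenceUrl: string,
  singleLabel: string,
  allLabel: string,
  limit: number,
  start: number
): string {
  const cql = `(label="${singleLabel}") OR (label="${allLabel}")`;
  const query = [
    `cql=${encodeURIComponent(cql)}`, // quotes, "=" and spaces must be URL-encoded
    "expand=metadata.labels,version",
    "os_authType=basic",
    `limit=${limit}`,
    `start=${start}`,
  ].join("&");
  // Strip trailing slashes so the path separator is not doubled.
  const base = confluenceUrl.replace(/\/+$/, "");
  return `${base}/rest/api/content/search?${query}`;
}
```

To page through all results, a caller would typically repeat the request with start increased by limit until a response returns fewer than limit entries.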

The COPC runs through all the labeled pages twice: once to find the IDs of all pages that should be ingested, and once to ingest them.

  • First Query Run:
    The COPC collects the IDs of all pages that should be ingested and sends them to the file-diff endpoint of Unique to determine which files are new or updated (to ingest), which files were deleted, and which files were moved.

  • Second Query Run:
    The COPC goes through all pages (and their sub-pages) that need to be ingested one by one and ingests them via the Ingestion API.
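
Conceptually, the comparison done in the first run resembles the sketch below. It is only an illustration of the idea: the real classification happens in Unique's file-diff endpoint, whose request and response format is not described here. The PageVersion and DiffResult types and the diffPages function are hypothetical, and detection of moved files is left out.

```typescript
// Hypothetical record: a page key plus the timestamp of its last change.
interface PageVersion {
  key: string;
  updatedAt: string; // ISO 8601 timestamp
}

interface DiffResult {
  newOrUpdated: string[];
  deleted: string[];
}

// Compares the pages currently labeled in Confluence with the pages
// that were ingested earlier and classifies them.
function diffPages(current: PageVersion[], ingested: PageVersion[]): DiffResult {
  const ingestedByKey: Record<string, string> = Object.create(null);
  ingested.forEach((p) => {
    ingestedByKey[p.key] = p.updatedAt;
  });

  const currentKeys: Record<string, boolean> = Object.create(null);
  const newOrUpdated: string[] = [];
  current.forEach((p) => {
    currentKeys[p.key] = true;
    // New if never ingested, updated if the timestamp differs.
    if (ingestedByKey[p.key] !== p.updatedAt) newOrUpdated.push(p.key);
  });

  // Ingested earlier but no longer labeled: to be deleted with this sync.
  const deleted = ingested.filter((p) => !currentKeys[p.key]).map((p) => p.key);
  return { newOrUpdated, deleted };
}
```

Only the keys in newOrUpdated would then need to be fetched and ingested in the second run, which keeps the number of Confluence requests low for unchanged pages.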

Docker Image

The COPC is publicly available as a docker image:

docker pull ghcr.io/unique-ag/confluence-connector:latest

If you are using custom certs, don't forget to mount them when using docker run:

docker run --env-file .env --rm -it -p 8083:8083 -v $(pwd)/my_custom_ca.cert:/node/my_custom_ca.cert:z confluence-connector

General Recommendations

  • Use a PAT (Personal Access Token) with the necessary access rights to authenticate the Confluence service user.

  • Set TEST_MODE=true when running the COPC for the first time to observe the performance, duration, etc.

  • Use the COPC's GET endpoint /sync to manually trigger a synchronization. Example:

    localhost:8083/sync

  • Set the CRON_SCHEDULE only after the initial real ingestion is finished. Once a night should suffice in most cases (0 1 * * *).

  • Be conservative with the CONFLUENCE_TOKENS_PER_MINUTE rate limiter setting so that you do not overload your OnPrem Confluence server.
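
To get a feeling for what a per-minute token budget means in practice, here is a minimal token bucket sketch. It is an assumption about how such a limiter could work, not a description of the COPC's internal limiter; the class name TokenBucket and its methods are hypothetical. Time is passed in explicitly so the behavior is easy to reason about.

```typescript
// Token bucket: 1 request = 1 token, refilled continuously up to the per-minute budget.
class TokenBucket {
  private readonly tokensPerMinute: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(tokensPerMinute: number, nowMs: number) {
    this.tokensPerMinute = tokensPerMinute;
    this.tokens = tokensPerMinute; // start with a full bucket
    this.lastRefillMs = nowMs;
  }

  // Returns true and consumes one token if a request may be sent at nowMs.
  tryAcquire(nowMs: number): boolean {
    const elapsedMs = nowMs - this.lastRefillMs;
    const refill = (elapsedMs / 60000) * this.tokensPerMinute;
    this.tokens = Math.min(this.tokensPerMinute, this.tokens + refill);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

With the default of 250 tokens per minute, such a bucket would allow a burst of up to 250 requests and then roughly one request every 240 ms. A lower value trades sync duration for less load on the Confluence server.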

Requirements

  • The connector must be able to reach the Confluence OnPrem installation and the Unique FinanceGPT.

  • The connector must have a user that has read access to all spaces and pages that should be synced. This user can authenticate either with basic auth (username + password) or with a PAT (Personal Access Token) generated in Confluence.

  • The connector must use the user that was provided by Zitadel to authenticate against the Unique Ingestion API.

  • The Confluence OnPrem server version must be 6.13.23 or higher.

ENV Variables for the COPC:

To configure the COPC, the following env variables are available:

APP_PORT (required)
The port of the COPC. Default: 8083

CLIENT_ID (required)
The Zitadel service user that has permission to ingest data into FinanceGPT

CLIENT_SECRET (required)
The Zitadel service user's access token

CONFLUENCE_TOKENS_PER_MINUTE
Rate limiter for the API requests to Confluence. 1 request = 1 token. Default: 250

CONFLUENCE_URL (required)
The URL of your Confluence server. On localhost this is http://localhost:1990/confluence
Important: Include the http / https prefix.

CONFLUENCE_PAT (required unless username/password are used)
Personal Access Token of the Confluence service user. The COPC will make the Confluence API requests with this user.

CONFLUENCE_USERNAME
CONFLUENCE_PASSWORD
For testing purposes. On localhost, these are both "admin".

CRON_SCHEDULE
Defines how often the COPC should sync the Confluence data with FinanceGPT using the cron format: "* * * * *"

INGESTION_URL (required)
The ingestion endpoint of FinanceGPT. Example: https://gateway.<baseUrl>/ingestion/v1/content
Important: Include the http / https prefix.

INGEST_ALL_LABEL (required)
The Confluence label that marks a page for ingestion together with all its sub-pages (recursively). Default: "ai-ingest-all"

INGEST_SINGLE_LABEL (required)
The Confluence label that marks a single page for ingestion. Default: "ai-ingest"

TEST_MODE
When test mode is set to true, the COPC will run the process without ingesting. Default: false

OAUTH_TOKEN_URL
The Zitadel endpoint that generates a valid token for ingestion. Example: https://id.<baseUrl>/oauth/v2/token
Important: Include the http / https prefix.

PROJECT_ID (required)
The FinanceGPT Project ID from Zitadel for which the service user will generate a token

SCOPE_ID
The Knowledge Base scope in FinanceGPT into which the data will be ingested. If no scope id is given, the connector will auto-create a scope for each space and ingest the documents into the respective scope.

DEBUG_MODE
When debug mode is set to true, all outputs are written into the log file. Default: false
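
Before starting the container, it can help to check an env file against the rules listed above. The following sketch is a hypothetical helper, not part of the COPC: the function validateEnv and the Env type are assumptions, and the checks simply mirror the "(required)" markers, the PAT or username/password rule, and the http / https prefix notes from this list.

```typescript
type Env = Record<string, string | undefined>;

// Variables marked "(required)" in the list above.
const REQUIRED_VARIABLES = [
  "APP_PORT", "CLIENT_ID", "CLIENT_SECRET", "CONFLUENCE_URL",
  "INGESTION_URL", "INGEST_ALL_LABEL", "INGEST_SINGLE_LABEL", "PROJECT_ID",
];

// Returns a list of human-readable problems; an empty list means the env looks usable.
function validateEnv(env: Env): string[] {
  const problems: string[] = [];
  REQUIRED_VARIABLES.forEach((name) => {
    if (!env[name]) problems.push(`${name} is required`);
  });

  // Either a PAT or a username/password pair must be present.
  const hasPat = Boolean(env["CONFLUENCE_PAT"]);
  const hasBasicAuth =
    Boolean(env["CONFLUENCE_USERNAME"]) && Boolean(env["CONFLUENCE_PASSWORD"]);
  if (!hasPat && !hasBasicAuth) {
    problems.push("CONFLUENCE_PAT or CONFLUENCE_USERNAME/CONFLUENCE_PASSWORD is required");
  }

  // URLs must include the http / https prefix.
  ["CONFLUENCE_URL", "INGESTION_URL", "OAUTH_TOKEN_URL"].forEach((name) => {
    const value = env[name];
    if (value && !/^https?:\/\//.test(value)) {
      problems.push(`${name} must start with http:// or https://`);
    }
  });
  return problems;
}
```

In a Node.js script, the function could be called with process.env and the process stopped when the returned list is not empty.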

Using proxies and custom certs

If you use proxies or custom certs, you have to define the relevant env variables. Example:

NODE_EXTRA_CA_CERTS="/node/my_custom_ca.cert"
HTTPS_PROXY="https://myproxy:8080"
NO_PROXY="localhost,*.mydomain"

When you run your container via "docker run", you have to mount the cert volume, as shown in the docker run example in the Docker Image section.

Using helmfiles, you then need to mount the volumes. Example:

Delete and reset ingested files manually

If /sync does not automatically delete ingested files, the cause might be a wrong configuration during testing that associated the files with the wrong space, Confluence instance, project, etc.

You can use the DELETE endpoint /reset of the COPC to manually trigger a reset, which deletes all ingested Confluence pages for a given scope id so you can start again with a clean slate.

It is possible that /reset needs specific parameters to identify the files correctly. For this you can provide it with a partialKey in the body. This might be your Confluence URL (the same value as CONFLUENCE_URL in the env file) or the space prefix (spaceId_spaceKey).

Examples:
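
As a hedged sketch, a reset request with a partialKey could be put together as shown below. The helper names spacePartialKey and buildResetRequest are hypothetical, and sending the partialKey as a JSON body with a Content-Type header is an assumption; how the scope id is passed to /reset is not covered by this sketch.

```typescript
// Builds the space prefix form of the partialKey (spaceId_spaceKey).
function spacePartialKey(spaceId: string, spaceKey: string): string {
  return `${spaceId}_${spaceKey}`;
}

// Hypothetical request description that any HTTP client could send.
interface ResetRequest {
  method: "DELETE";
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Describes a DELETE request against the COPC's /reset endpoint.
// Assumption: the partialKey is sent as a JSON body.
function buildResetRequest(copcBaseUrl: string, partialKey: string): ResetRequest {
  return {
    method: "DELETE",
    url: `${copcBaseUrl.replace(/\/+$/, "")}/reset`,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ partialKey }),
  };
}
```

For a locally running COPC, the base URL would be http://localhost:8083, and the partialKey could either be the result of spacePartialKey or the Confluence URL from the env file.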

 

Example Helmfiles

Example helmfiles can be found in the release repo: confluence-connector.yaml

Local Setup

Set up the Atlassian Plugin SDK to run a local Confluence instance:

Follow this guide up to, but not including, the "create a plugin" step: https://developer.atlassian.com/server/framework/atlassian-sdk/set-up-the-atlassian-plugin-sdk-and-build-a-project/

Run Atlassian instance:

The `atlassian-confluence/server` folder contains a tutorial on how to make a macro. The macro itself is not needed, only the working server, so that it can be accessed locally.

From the `atlassian-confluence/server` folder, run the command `atlas-run`.

This will take some time on the first run. When done, you should be able to reach your local Confluence instance at `localhost:1990/confluence`

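Credentials for local login: username "admin" and password "admin" (the same values as CONFLUENCE_USERNAME and CONFLUENCE_PASSWORD on localhost).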


Here is an example REST API URL that gets all pages with the `ai-ingest` and `ai-ingest-all` labels and expands the labels (so that they are included in the JSON response):

 

Read more about cql here: https://developer.atlassian.com/server/confluence/advanced-searching-using-cql/

Properties that appear as empty strings under _expandable can be expanded and may then contain data. For example, _expandable.body is generally empty. However, if you add the query parameter &expand=body.value, you will see the body.

Run the confluence scanner:

From add-ins/atlassian-confluence run

 


Author

see Parent

 

© 2024 Unique AG. All rights reserved.