
Chatbot trained on Notion pages

This is a two-part tutorial. First, we show you how to build a chatbot trained on input text provided through an API endpoint. Then we show you how to extend that functionality and train it on a set of Notion pages, turning your Notion knowledge repository into a ChatGPT-style API endpoint.

This tutorial assumes you have a basic understanding of Seaplane, Langchain, vector stores, and in-context learning (ICL). To deploy your application, you need access to a Seaplane account. You can sign up here.

tip

In-context learning is the process of feeding large language models new information during the inference step to answer questions. This allows them to reason about content not provided during model training. We wrote a blog about it here if you want to learn more.
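
To make the idea concrete, here is a minimal sketch of ICL as plain prompt assembly; the function name and prompt layout are illustrative and not part of the Seaplane SDK:

# minimal ICL sketch: paste retrieved snippets into the prompt at inference
# time so the model can reason about content it was never trained on
def build_icl_prompt(question: str, snippets: list[str]) -> str:
    context = "\n\n".join(snippets)
    return (
        "Answer the question using only the context below.\n\n"
        f"### Context\n{context}\n\n"
        f"### Question\n{question}"
    )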

The applications

The chatbot consists of two Seaplane applications.

  • Processor application - This application extracts the content used for ICL, embeds the text into vectors, and stores them in the vector store.
  • Chatbot application - This application provides the API interface for Q&A; it uses the content stored in the vector store to answer user questions.

The Processor Application

The processor application extracts the content from the selected information repository, turns the text into vector representations, and stores them in the vector store. Run the following command to create the default application structure:

seaplane init chat-processor

Remove the boilerplate hello world code and rename your task ID, application ID, and application path to reflect the purpose of this application. This leaves you with the following basic application:

from seaplane import app, task, start

# the processing task
@task(type="compute", id='chat-processor')
def process_data(data):
    # processing logic goes here!
    pass

# the HTTP enabled application
@app(id='processors', path='/process-chat-data', method=['POST', 'GET'])
def chatbot_processor_application(data):
    return process_data(data)

start()

Now let's extend your task component with the code required to process new data. In this example, assume new data is fed to the application endpoint as a body parameter (text) in the POST request.
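
For reference, here is a sketch of what the task then receives; it assumes, as in the cURL request shown later in this tutorial, that each element of the request's input list is handed to the task as a plain dictionary:

# sketch of the payload the task receives for a request body such as
# {"input": [{"text": "..."}]}: each input element arrives as a dictionary
data = {"text": "The quick brown fox jumps over the lazy dog"}
print(data["text"])  # -> The quick brown fox jumps over the lazy dog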

tip

You can find more specific examples in the section below and in the Seaplane demo repository on GitHub.

First, create a new index in the vector store and call it chat-documents. In this example, you use MPT-30B and the Seaplane embeddings function, which has a dimension of 768.

from seaplane import app, task, start
from seaplane.vector import vector_store

# the processing task
@task(type="compute", id='chat-processor')
def process_data(data):
    # create vector store if it does not yet exist, 768 dimensions for seaplane embeddings
    vector_store.create_index("chat-documents", 768)

# the HTTP enabled application
@app(id='processors', path='/process-chat-data', method=['POST', 'GET'])
def chatbot_processor_application(data):
    return process_data(data)

start()

Next, split the input text into chunks using the RecursiveCharacterTextSplitter functionality from Langchain. Create documents from the chunks and embed them using the Seaplane embeddings.

from seaplane import app, task, start
from seaplane.vector import vector_store
from langchain.text_splitter import RecursiveCharacterTextSplitter
from seaplane.integrations.langchain import seaplane_embeddings

# the processing task
@task(type="compute", id='chat-processor')
def process_data(data):
    # create vector store if it does not yet exist, 768 dimensions for seaplane embeddings
    vector_store.create_index("chat-documents", 768)

    # create text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )

    # create documents from our input string
    texts = text_splitter.create_documents([data['text']])

    # embed documents
    vectors = seaplane_embeddings.embed_documents([page.page_content for page in texts])

# the HTTP enabled application
@app(id='processors', path='/process-chat-data', method=['POST', 'GET'])
def chatbot_processor_application(data):
    return process_data(data)

start()

The create_documents function expects a list as input, hence the [] around data['text']. The embed_documents function expects a list of text chunks as input; you create it inside the function call with a list comprehension that extracts the page_content from each previously created document.
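
If it helps to see the intermediate shapes, here is a small standalone Langchain example (outside the Seaplane task, with illustrative input):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = splitter.create_documents(["The quick brown fox jumps over the lazy dog"])
# texts is a list of Document objects, each with .page_content and .metadata
chunks = [page.page_content for page in texts]
# chunks is a plain list of strings, the shape embed_documents expects
print(chunks)  # -> ['The quick brown fox jumps over the lazy dog']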

Finally, transform your vectors into a format the vector store understands. This includes the vector itself, an ID (UUID4), the metadata, and a data component containing the text representation of the vector. This is important, as you use the data component later in the chat application to feed the relevant text snippets to the LLM.

from seaplane import app, task, start
from seaplane.vector import vector_store
from langchain.text_splitter import RecursiveCharacterTextSplitter
from seaplane.integrations.langchain import seaplane_embeddings
from seaplane.model import Vector
import uuid

# the processing task
@task(type="compute", id='chat-processor')
def process_data(data):
    # create vector store if it does not yet exist, 768 dimensions for seaplane embeddings
    vector_store.create_index("chat-documents", 768)

    # create text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )

    # create documents from our input string
    texts = text_splitter.create_documents([data['text']])

    # embed documents
    vectors = seaplane_embeddings.embed_documents([page.page_content for page in texts])

    # create the vector representation the vector store understands
    vectors = [
        Vector(
            id=str(uuid.uuid4()),
            vector=vector,
            metadata={
                "page_content": texts[idx].page_content,
                "metadata": texts[idx].metadata,
            },
        )
        for idx, vector in enumerate(vectors)
    ]

    # insert vectors in vector store
    return vector_store.insert("chat-documents", vectors)

# the HTTP enabled application
@app(id='processors', path='/process-chat-data', method=['POST', 'GET'])
def chatbot_processor_application(data):
    return process_data(data)

start()

Deploying The Processing Application

Open the .env file in the project and add your Seaplane API key. Add Langchain as a required package to the pyproject.toml file:

pyproject.toml
[tool.poetry]
name = "chat-processor"
version = "0.0.1"
description = ""
authors = []

[tool.seaplane]
main = "main.py"

[tool.poetry.dependencies]
python = ">=3.10,<3.12"
seaplane = "^0.3.89"
langchain = "0.0.195"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Run the following two commands inside the project directory to install the required packages and deploy the application:

poetry install
seaplane deploy

Your application is now available at https://carrier.cplane.cloud/apps/processors/latest/process-chat-data and you can add new data using the following cURL commands. Replace <YOUR-SEAPLANE-KEY> with your Seaplane API key.

TOKEN=$(curl -X POST 'https://flightdeck.cplane.cloud/identity/token' -H "Authorization: Bearer <YOUR-SEAPLANE-KEY>")
curl -X POST 'https://carrier.cplane.cloud/apps/processors/latest/process-chat-data' \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"input": [{
"text": "The quick brown fox jumps over the lazy dog"
}]
}'

Chat Application

The second application you build is the chat application itself. This is an HTTP-enabled application that users can query to perform Q&A on the text processed by your processing application.

To get started, run the following command to create the default application structure:

seaplane init chatbot

Remove the boilerplate hello world code and update the task ID, application ID, and path to better reflect the purpose of this app. The final basic application should look something like this:

from seaplane import app, task, start

# the chat task that performs the document search and feeds them to the LLM
@task(type="inference", id='chat-task')
def chat_task(data):
    # chat task logic here
    pass

# HTTP enabled chat app
@app(id='chat-app', path='/chat', method=['POST', 'GET'])
def chat_app(data):
    return chat_task(data)

start()

For the Q&A part of the application, you use Langchain, or more specifically ConversationalRetrievalChain, powered by the Langchain Seaplane integration, with an instance of Seaplane-hosted MPT-30B as the LLM of choice.

from seaplane import app, task, start
from langchain.chains import ConversationalRetrievalChain
from seaplane.integrations.langchain import SeaplaneLLM, langchain_vectorstore

# the chat task that performs the document search and feeds them to the LLM
@task(type="inference", id='chat-task')
def chat_task(data):
    # create vector store instance with langchain integration
    vectorstore = langchain_vectorstore(index_name="chat-documents")

    # create the chain
    pdf_qa_hf = ConversationalRetrievalChain.from_llm(
        llm=SeaplaneLLM(),
        retriever=vectorstore.as_retriever(),
        return_source_documents=True,
    )

# HTTP enabled chat app
@app(id='chat-app', path='/chat', method=['POST', 'GET'])
def chat_app(data):
    return chat_task(data)

start()

Make sure you select your previously created index (chat-documents) as the index_name, and notice that SeaplaneLLM() is passed as the llm parameter of the ConversationalRetrievalChain. In short, this ensures you use the Seaplane vector store and Seaplane-hosted MPT-30B as your LLM.

Finally, you can query the LLM with the user-provided question.

from seaplane import app, task, start
from langchain.chains import ConversationalRetrievalChain
from seaplane.integrations.langchain import SeaplaneLLM, langchain_vectorstore

# the chat task that performs the document search and feeds them to the LLM
@task(type="inference", id='chat-task')
def chat_task(data):
    # create vector store instance with langchain integration
    vectorstore = langchain_vectorstore(index_name="chat-documents")

    # create the chain
    pdf_qa_hf = ConversationalRetrievalChain.from_llm(
        llm=SeaplaneLLM(),
        retriever=vectorstore.as_retriever(),
        return_source_documents=True,
    )

    # answer the question using MPT-30B
    result = pdf_qa_hf({"question": data["query"], "chat_history": data['chat_history']})

    # return only the answer to the user
    return result["answer"].split("\n\n### Response\n")[1]

# HTTP enabled chat app
@app(id='chat-app', path='/chat', method=['POST', 'GET'])
def chat_app(data):
    return chat_task(data)

start()

The above assumes that the user adds the question and chat history to the POST request as query and chat_history respectively.
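
Langchain's ConversationalRetrievalChain expects the chat history as a list of (question, answer) pairs; an empty value works for the first turn. Purely as an illustration, a follow-up request could arrive in the task looking like this:

# illustrative only: what data might look like for a follow-up question,
# with chat_history as (question, answer) pairs from earlier turns
data = {
    "query": "What color was the fox?",
    "chat_history": [
        ("who jumped over the lazy dog?", "The quick brown fox."),
    ],
}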

Deploying The Chat Application

Open the .env file in the project and add your Seaplane API key. Add Langchain to the required packages in the pyproject.toml file.

[tool.poetry]
name = "chatbot"
version = "0.0.1"
description = ""
authors = []

[tool.seaplane]
main = "main.py"

[tool.poetry.dependencies]
python = ">=3.10,<3.12"
seaplane = "^0.3.89"
langchain = "0.0.195"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Run the following two commands inside the project directory to install the required packages and deploy the application:

poetry install
seaplane deploy

Your application is now available at https://carrier.cplane.cloud/apps/chat-app/latest/chat and you can submit new questions using the following cURL command.

TOKEN=$(curl -X POST 'https://flightdeck.cplane.cloud/identity/token' -H "Authorization: Bearer <YOUR-SEAPLANE-KEY>")
curl -X POST 'https://carrier.cplane.cloud/apps/chat-app/latest/chat' \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"input": [{
"query": "who jumped over the lazy dog?",
"chat_history" : ""
}]
}'

Then run the following cURL request to retrieve the answer. Replace <REQUEST_ID> with the request ID returned by the POST request above.

curl -X GET "https://carrier.cplane.cloud/apps/chat-app/latest/chat/request/<REQUEST_ID>" \
-H "Authorization: Bearer $TOKEN"

Notion Processors

Now that you know the basics, let's extend the chat app and train it on a Notion repository instead.

The Notion processor is an extension of the default processing task you created in this tutorial. It uses the Notion API to scrape all content from a Notion workspace.

To enable the Notion processor, create a new file called notion_processor.py, copy and paste the Notion processing task from our GitHub repository, and import it in main.py. Call the task inside your application's DAG as follows.

The Notion processor relies on a couple of helper functions that talk to the Notion API. To make things easy, we included them for you; a sketch of one follows the code block below.

from notion_processor import process_notion

# the HTTP enabled application
@app(id='processors', path='/process-chat-data', method=['POST', 'GET'])
def chatbot_processor_application(data):
    # default processor created in this tutorial
    process_data(data)  # you can remove this if you only want to process Notion instead

    # notion processor
    process_notion(data)

start()
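
The helpers themselves ship with the notion_processor.py file you copied, but to give you a feel for what they do, here is a hedged sketch of one: fetching all child blocks of a page through the Notion API. The endpoint, headers, and pagination fields follow Notion's public API documentation; the function name is illustrative.

import os
import requests

NOTION_HEADERS = {
    "Authorization": f"Bearer {os.environ['NOTION_KEY']}",
    "Notion-Version": "2022-06-28",
}

# illustrative helper: fetch all child blocks of a Notion page or block,
# following the API's cursor-based pagination
def get_child_blocks(block_id: str) -> list[dict]:
    results, cursor = [], None
    while True:
        params = {"start_cursor": cursor} if cursor else {}
        resp = requests.get(
            f"https://api.notion.com/v1/blocks/{block_id}/children",
            headers=NOTION_HEADERS,
            params=params,
        )
        resp.raise_for_status()
        body = resp.json()
        results.extend(body["results"])
        if not body.get("has_more"):
            return results
        cursor = body["next_cursor"]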

Head over to the Notion site and create a new integration. Copy the Internal Integration Secret and add it to the .env file of your application as follows.

SEAPLANE_API_KEY=<YOUR-SEAPLANE-API-KEY>
NOTION_KEY=<YOUR-INTERNAL-INTEGRATION-SECRET>

Add the integration to all pages you want to process. By default, all nested pages are processed, so the easiest way to analyze all pages is to create a primary page that contains all other pages and add the integration to that page only.

To add an integration to a page, click the three dots in the top right corner of the page, scroll all the way down to the bottom of the menu, click Add connections, and select the integration you just created.

Trigger a run of your processor by calling the API endpoint. Processing the documents can take anywhere from a few minutes to several hours, depending on the size of your Notion repository. But once the first document is processed, you can start asking questions.

Your Notion pages now have a ChatGPT-style API endpoint to ask questions!

tip

Notion allows any user to create pages, which in many cases means there are a ton of less relevant pages. For example, at Seaplane everyone has personal pages full of braindumps and to-do lists. To improve the performance of your chatbot, consider excluding those pages, i.e., removing the integration from them.