PDF Summary Tutorial
In this tutorial, you build a PDF summary application deployed on Seaplane. You can find the finished project in our demo repository on GitHub.

The demo application creates summaries of any PDF, powered by GPT-3.5. This is a simple example, but it should help you understand the various components of the Seaplane platform. The application consists of a linear flow with four components.
- An API entry point available at `/demo-input`
- A pre-processing step to download the PDF, extract its text, and create the prompt for GPT-3.5
- An inference step to create the summary using GPT-3.5
- A database step to store the result in Seaplane's managed global SQL database
The diagram below shows the flow of the data. To create this pipeline in Seaplane, you need to set up an API-enabled application (`@app`) and three tasks (`@task`). You can learn more about tasks and apps in our documentation.
This tutorial assumes you have access to:

- A Seaplane account. You can sign up for our beta here.
- An OpenAI API key. You can request one here.
- The Seaplane SDK installed on your machine. You can install it with `pip3 install seaplane`.
- A SQL database provisioned on the Flightdeck with the following table created:

```sql
CREATE TABLE pdf_summaries (
    url VARCHAR,
    prompt VARCHAR,
    summary VARCHAR
);
```
Create a new project using the Seaplane CLI: `seaplane init pdf-summary`. This creates the default project structure and a `main.py` file containing a demo application. Remove all code from `main.py`. You are going to replace it with the PDF summary code in the steps below.
The demo project has the following file structure.

```
pdf-summary/
├── pdf-summary/
│   ├── pre_processing.py
│   ├── inference.py
│   ├── database.py
│   └── main.py
├── .env
└── pyproject.toml
```
Authentication

This demo app uses Seaplane and OpenAI. To authenticate with both platforms, add the API keys for both Seaplane and OpenAI to the `.env` file in your project's root directory. You can get your Seaplane API key from the Flightdeck.

```
SEAPLANE_API_KEY=sp-your-api-key
OPENAI_API_KEY=
```
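The Seaplane SDK picks these values up at deploy time. If you want to sanity-check what a `.env` file amounts to, the stdlib-only sketch below parses `KEY=VALUE` lines into the environment; the `load_env` helper is illustrative only and not part of the Seaplane SDK.

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=VALUE lines into os.environ (illustrative helper)."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # skip blank lines, comments, and lines without an '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()

# write a throwaway example file and load it
with open(".env.example", "w") as f:
    f.write("SEAPLANE_API_KEY=sp-your-api-key\nOPENAI_API_KEY=\n")

load_env(".env.example")
print(os.environ["SEAPLANE_API_KEY"])  # sp-your-api-key
```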
Pre-Processing

In the first step of your application, you download the PDF from a URL provided by the user through the API endpoint. You do this inside the first `@task`. Define a new task and mark it as `type='compute'`. Give it a recognizable ID, for example `id='pre-processing'`.

The code below downloads the PDF, extracts the text using the PyPDF2 package, and constructs the summary prompt for OpenAI.

The task returns the variables for the next step in your pipeline: the inference step. You wire this and the other tasks together later in the `@app` step.
```python
from seaplane import task
import requests
from PyPDF2 import PdfReader
import os
import json

@task(type='compute', id='pre-processing')
def pre_processing(data):
    # get the URL from the request
    url = data['url']

    # download the PDF
    response = requests.get(url, allow_redirects=True)
    if response.status_code == 200:
        with open('my_pdf.pdf', 'wb') as file:
            file.write(response.content)
    else:
        print('Failed to download PDF.')

    # placeholder for the extracted PDF text
    pdf_text = ""

    # extract text, limited to the first 3 pages
    # due to the context length of GPT-3.5
    count = 0
    reader = PdfReader('my_pdf.pdf')
    for page_number in range(len(reader.pages)):
        if count == 3:
            break
        page = reader.pages[page_number]
        pdf_text += page.extract_text()
        count += 1

    # delete the PDF
    os.remove('my_pdf.pdf')

    # construct the prompt
    prompt = ("write a summary of the following text. Make sure to maintain "
              "the scientific tone in the paper: " + pdf_text)

    # pass the required information to the next step
    return json.dumps({
        'url': str(url),
        'prompt': prompt
    })
```
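If you want to check the page-limit and prompt-construction logic without Seaplane, PyPDF2, or a real PDF, the stdlib-only sketch below stands in a plain list of page strings for `reader.pages`; the `build_prompt` helper and the example URL are illustrative only.

```python
import json

def build_prompt(pages, max_pages=3):
    """Mimic the pre-processing step: join at most max_pages of text
    and prepend the summary instruction (illustrative helper)."""
    pdf_text = "".join(pages[:max_pages])
    prompt = ("write a summary of the following text. Make sure to maintain "
              "the scientific tone in the paper: " + pdf_text)
    return json.dumps({'url': 'https://example.com/paper.pdf', 'prompt': prompt})

# five fake "pages"; only the first three should end up in the prompt
pages = [f"page {i} text. " for i in range(5)]
message = json.loads(build_prompt(pages))
print("page 2 text" in message['prompt'])   # True
print("page 3 text" in message['prompt'])   # False
```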
Inference

The next step in your pipeline is the inference step. Just like the previous step, you define it in a `@task`. It takes the output of the last step - specifically the prompt you constructed - and runs it through OpenAI to create a summary of the PDF.

Because you wire your tasks together in the `@app` component, the output of the previous step is available as a JSON object in the `data` variable. Extract the prompt from the JSON object and run it through OpenAI.

The Seaplane SDK has built-in support for OpenAI models. You access them by adding the model to the `@task` decorator, specifically `model="gpt-3.5"`, and passing the model object in the function definition.

Construct the model parameters and pass them to the model.
```python
from seaplane import task
import json

@task(type='inference', id='pdf-inferencer', model="gpt-3.5")
def inferencing(data, model):
    # convert the input data to JSON
    data = json.loads(data)

    # get the URL and prompt from the input message
    prompt = data['prompt']
    url = data['url']

    # construct the model parameters, including the prompt from the previous step
    params = {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

    # run the inference request
    result = model(params)

    # return the inference result plus the input and parameters
    result['url'] = url
    result['prompt'] = prompt
    return json.dumps(result)
```
The `result` object contains all the information returned by OpenAI. Add your input and parameters to it and return it as the last step in your task. Because the tasks are wired together, the output of the inference step is passed as a JSON object to the next step in the directed acyclic graph (DAG).
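The shape of `result` follows OpenAI's chat completion response. The sketch below pulls the summary out of a hand-built sample shaped like that response (no live API call); it is the same lookup the database task performs on its input in the next step.

```python
import json

# hand-built sample shaped like an OpenAI chat completion response
result = {
    "choices": [{"message": {"role": "assistant",
                             "content": "A short summary of the paper."}}],
    "url": "https://example.com/paper.pdf",
    "prompt": "write a summary of ...",
}

# round-trip through JSON, as happens between tasks in the DAG
payload = json.loads(json.dumps(result))
summary = payload["choices"][0]["message"]["content"]
print(summary)  # A short summary of the paper.
```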
Database

The final task in your pipeline pushes the results into Seaplane Managed Global SQL. Define a new task and give it an ID. This example uses a persistent database that operates separately from the pipeline. As a result, you need to authenticate yourself. Create a `sql_access` object to authenticate with Seaplane global SQL. The username and password are loaded from environment variables, which are set during the deployment phase and loaded from the Seaplane secrets store.

```python
sql_access = {
    "username": "SEAPLANE_SQL_USERNAME",
    "password": "SEAPLANE_SQL_PASSWORD",
    "database": "<YOUR-DB-NAME>",
    "port": 5432
}
```
You can retrieve your username, password, and database name from the Seaplane Flightdeck.

Add a task below the `sql_access` object, give it an ID, and add the `sql_access` object to the `@task` decorator. Make sure to pass the database object along to the task in the function definition.
```python
from seaplane import task
import json

# you can get this information from the Flightdeck
sql_access = {
    "username": "<YOUR-USER-NAME>",
    "password": "<YOUR-PASSWORD>",
    "database": "<YOUR-DB-NAME>",
    "port": 5432
}

@task(type='sql', id='pdf-summary-db', sql=sql_access)
def database(data, sql):
    # convert the response to JSON
    data = json.loads(data)

    # get the summary from the response
    summary = data["choices"][0]["message"]["content"]

    # insert into the SQL database
    sql.insert('''INSERT INTO pdf_summaries
                  (url, prompt, summary)
                  VALUES
                  (%s, %s, %s)''',
               [data["url"], data["prompt"], summary])
```
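Seaplane's managed database speaks Postgres, which is why the task above uses `%s` placeholders. To try the insert logic locally without a Seaplane database, the sketch below uses Python's built-in `sqlite3` as a stand-in; note that sqlite uses `?` placeholders instead of `%s`, and the payload is a hand-built sample.

```python
import json
import sqlite3

# in-memory stand-in for the managed Postgres database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pdf_summaries (url TEXT, prompt TEXT, summary TEXT)")

# sample payload shaped like the inference step's output
data = json.loads(json.dumps({
    "choices": [{"message": {"content": "A short summary."}}],
    "url": "https://example.com/paper.pdf",
    "prompt": "write a summary of ...",
}))

summary = data["choices"][0]["message"]["content"]
conn.execute("INSERT INTO pdf_summaries (url, prompt, summary) VALUES (?, ?, ?)",
             (data["url"], data["prompt"], summary))

row = conn.execute("SELECT url, summary FROM pdf_summaries").fetchone()
print(row)  # ('https://example.com/paper.pdf', 'A short summary.')
```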
Application

In the application step, you wire together all previously created tasks by defining a directed acyclic graph (DAG). Extend the `main.py` file by creating a new application using the `@app` decorator and give it a recognizable ID.

To make this application accessible as an API endpoint, add the path and the method. In this example, that is `path='/demo-input'` and `method='POST'`. When deployed, Seaplane automatically creates the endpoint and makes it accessible using your API key and a POST request.

Call `start()` below the application definition to start the application.
```python
from seaplane import config, app, start

api_keys = {
    "SEAPLANE_API_KEY": "<SEAPLANE_API_KEY>",
    "OPENAI_API_KEY": "<OPENAI_API_KEY>",
}

config.set_api_keys(api_keys)

@app(path='/demo-input', method='POST', id='pdf-summary')
def my_smartpipe(body):
    # wire the tasks together in a DAG
    prompt = pre_processing(body)
    summary = inferencing(prompt)
    database(summary)

start()
```
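To see how data flows through this wiring without the Seaplane runtime, the sketch below replaces the three tasks with plain functions that pass JSON strings down the same linear chain; the stubs are illustrative stand-ins, not the real tasks.

```python
import json

# plain-function stand-ins for the three tasks
def pre_processing(body):
    return json.dumps({'url': body['url'], 'prompt': 'summarize: ...'})

def inferencing(message):
    data = json.loads(message)
    data['choices'] = [{'message': {'content': 'A short summary.'}}]
    return json.dumps(data)

def database(message):
    data = json.loads(message)
    # the real task would run an INSERT here
    return data['choices'][0]['message']['content']

# the same linear DAG as my_smartpipe
def pipeline(body):
    prompt = pre_processing(body)
    summary = inferencing(prompt)
    return database(summary)

print(pipeline({'url': 'https://example.com/paper.pdf'}))  # A short summary.
```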
Deployment

You are almost ready to deploy your application on Seaplane. Before you deploy, make sure to add the following packages to the dependency section of your `pyproject.toml` file.

```toml
[tool.poetry.dependencies]
python = "^3.10"
seaplane = "^0.3.69"
PyPDF2 = "*"
requests = "2.31.0"
```

To deploy your application, run `poetry install` followed by `seaplane deploy` in the root directory of your project. Once executed, Seaplane sets up all the required infrastructure, including three individually scalable containers, a SQL database, and an API gateway with your API endpoint.
The new API is available through `<tenant-id>.on.cplane.cloud/pdf-summary/latest/demo-input`. To query the API, run the following Python code.

```python
import requests
import json

# exchange your Seaplane API key for a short-lived access token
fd_key = '<your-seaplane-api-key>'
fd_url = "https://flightdeck.cplane.cloud/identity/token"
fd_headers = {"Authorization": "Bearer {}".format(fd_key)}
fd_token = requests.post(fd_url, headers=fd_headers).text

# send the request
headers = {
    'Authorization': 'Bearer ' + fd_token,
    'Content-Type': 'application/json'
}
data = {'url': 'https://www.diochnos.com/about/McCarthyWhatisAI.pdf'}
response = requests.post(
    "https://<tenant-id>.on.cplane.cloud/pdf-summary/latest/demo-input",
    headers=headers,
    data=json.dumps(data)
)
print(response)
```
Replace `<tenant-id>` with your tenant ID (available through the Flightdeck).

The summary shows up in your database once the PDF makes it through the pipeline. To confirm it is working, you can query the database directly from your terminal as follows.

```shell
echo "SELECT summary FROM pdf_summaries" | psql postgres://<YOUR-USERNAME>:@sql.cplane.cloud/<YOUR-DB-NAME>
```

Replace `<YOUR-USERNAME>` and `<YOUR-DB-NAME>` with your username and your database name.
Video Demo

The video below shows a working example of the demo application. However, it was made with an older version of the Seaplane SDK, so some parts of the implementation and deployment may differ slightly.