In this article, we will learn to use LangChain to build applications with OpenAI's GPT-3.5 model. We will explore character text splitting while creating document embeddings, look at the concept of agents, and learn how to attach metadata to chunks produced by a recursive text splitter.
Begin by installing the dependencies:
!pip install tiktoken openai langchain chromadb pypdf
Set your OpenAI API key.
import os os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
Now let's load our PDF, create embeddings, and ask questions.
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
doc = PyPDFLoader("/content/INDAS1.pdf").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(doc)
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=docsearch.as_retriever(), return_source_documents=True)
query = "What is the document about"
result = qa({"query": query})
print(result["result"])
print(result["source_documents"])
Pretty simple, right? We can also see the source documents and the page numbers our answer came from. Nice!
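For example, here is a small sketch (assuming the default PyPDFLoader metadata, which stores a "page" key on each chunk) that prints the page each source chunk came from:

# Inspect which pages the retriever pulled the answer from.
for src in result["source_documents"]:
    print(src.metadata.get("page"), src.page_content[:100])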
Recursive Text Splitter Technique
In the code above, we used CharacterTextSplitter, which splits your text on a separator you specify (the default is "\n\n"). However, if the paragraphs in our PDF are very long, the resulting chunks may be too long to fit in the LLM's context window. Below is a function that splits your documents with a recursive text splitter and also attaches metadata to each chunk.
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_documents_into_chunks(docs, chunk_size, chunk_overlap):
    chunked_docs = []
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    for doc in docs:
        page = doc.metadata.get("page", 1)
        chunks = text_splitter.split_text(doc.page_content)
        for i, chunk in enumerate(chunks):
            # Keep the original page number and record the chunk index.
            chunked_docs.append(
                Document(
                    page_content=chunk,
                    metadata={
                        "page": page,
                        "chunk": i + 1,
                        "source": f"{page}-{i + 1}",
                    },
                )
            )
    return chunked_docs
from langchain.document_loaders import Docx2txtLoader

docs = Docx2txtLoader("/Users/muskankhandelwal/auditGPT/accounting_standard_pdfs/Basel III Simplified version.docx").load()
texts = split_documents_into_chunks(docs, chunk_size=1000, chunk_overlap=0)
I used the docx loader in the above example to show another loader type; you can use the PDF loader or any loader of your choice. The function can handle multiple documents as well, as sketched below.
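Here is a minimal sketch (the second PDF path is hypothetical) that loads several PDFs and chunks them all with the helper above, then indexes the chunks in Chroma:

from langchain.document_loaders import PyPDFLoader

# Hypothetical list of PDFs to index; replace with your own paths.
paths = ["/content/INDAS1.pdf", "/content/INDAS2.pdf"]
docs = []
for path in paths:
    docs.extend(PyPDFLoader(path).load())

texts = split_documents_into_chunks(docs, chunk_size=1000, chunk_overlap=0)
docsearch = Chroma.from_documents(texts, OpenAIEmbeddings())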
Google Search Agent in Langchain
Okay, this is an important step for your applications. Let's use the agents and tools that LangChain provides, starting with the search tool (Google search via SerpAPI). Set your SerpAPI key.
import os
os.environ["SERPAPI_API_KEY"] = "your-serp-api-key"
Now let's define tools.
from langchain import SerpAPIWrapper
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
search = SerpAPIWrapper()
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="Useful when you need to answer questions that are not found in QA system. You should ask targeted questions.",
    ),
]
Let's initialize our agent and run it to check its results.
mrkl = initialize_agent( tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True )
mrkl.run("What is the INDAS7 and INDAS109?")
Now you can add more tools to your agent chain and see how the agent decides which one to use.
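For instance, here is a minimal sketch (the tool name and description are my own) that exposes the RetrievalQA chain we built earlier as a second tool alongside Search:

tools = [
    Tool(
        name="Search",
        func=search.run,
        description="Useful when you need to answer questions that are not found in QA system. You should ask targeted questions.",
    ),
    Tool(
        # Hypothetical tool wrapping the RetrievalQA chain built earlier.
        name="Document QA",
        func=lambda q: qa({"query": q})["result"],
        description="Useful for answering questions about the loaded accounting standard documents.",
    ),
]

mrkl = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)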
That is nice :) If you have any trouble creating agents, understanding LangChain, or the LLM space in general, feel free to reach out to us :)