Recursive Text Split in Langchain

In this article, we will learn to use LangChain to build applications with OpenAI's GPT-3.5 model. We will explore methods of character text splitting while creating document embeddings, look at the concept of agents, and learn how to attach metadata when using a recursive text splitter.

Begin by installing the dependencies:

!pip install tiktoken openai langchain chromadb pypdf

Set your OpenAI API key.

import os
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

Now let's load our PDF, create embeddings, and ask questions.

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

doc = PyPDFLoader("/content/INDAS1.pdf").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(doc)
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)
query = "What is the document about?"
result = qa({"query": query})
print(result["result"])
print(result["source_documents"])

Pretty simple, right? We can also see the source documents and the page numbers our answer came from. Nice!
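Since we passed return_source_documents=True, the result dict also carries the Document objects the answer was built from, each with its metadata. As a sketch (using a tiny stand-in Document class so the example runs on its own), here is how you could pull out the page numbers:

```python
# Stand-in for LangChain's Document, just to keep the sketch self-contained.
class Document:
    def __init__(self, page_content, metadata):
        self.page_content = page_content
        self.metadata = metadata

def source_pages(result):
    """Return the page number of every source document behind the answer."""
    return [d.metadata.get("page") for d in result["source_documents"]]

# Example shaped like a RetrievalQA result with return_source_documents=True:
result = {
    "result": "The document describes Ind AS 1.",
    "source_documents": [
        Document("…", {"page": 0, "source": "/content/INDAS1.pdf"}),
        Document("…", {"page": 3, "source": "/content/INDAS1.pdf"}),
    ],
}
print(source_pages(result))  # → [0, 3]
```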


Recursive Text Splitter Technique


In the code above we used the Character Text Splitter, which splits your text on a separator you specify (the default is "\n\n"). However, if the paragraphs in our PDF are very long, the resulting chunks may be too long to fit in our LLM's context window. The function below splits your documents with the Recursive text splitter instead, and also attaches metadata to each chunk.
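To see why the recursive splitter copes better with long paragraphs, here is a toy reimplementation of the idea (not LangChain's actual code): try the largest separator first, and whenever a piece is still over the chunk size, fall back to the next, smaller separator.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Toy recursive splitting: split on the first separator, then
    re-split any piece that is still too long using the next one."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks

# One long paragraph with no "\n\n" or "\n" in it: the character splitter's
# default separator would never fire, but the toy recursion falls back to " ".
long_paragraph = "one two three four five six seven eight nine ten"
print(recursive_split(long_paragraph, chunk_size=10))
```

LangChain's real RecursiveCharacterTextSplitter additionally merges small pieces back up toward the chunk size and handles overlap, but the fallback-through-separators idea is the same.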

from langchain.docstore.document import Document
from langchain.document_loaders import Docx2txtLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_documents_into_chunks(docs, chunk_size, chunk_overlap):
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    chunked_docs = []
    for doc in docs:
        for i, chunk in enumerate(text_splitter.split_text(doc.page_content)):
            page = doc.metadata.get("page", 1)
            metadata = {"page": page, "chunk": i + 1, "source": f"{page}-{i + 1}"}
            chunked_docs.append(Document(page_content=chunk, metadata=metadata))
    return chunked_docs

docs = Docx2txtLoader("/Users/muskankhandelwal/auditGPT/accounting_standard_pdfs/Basel III Simplified version.docx").load()
texts = split_documents_into_chunks(docs, chunk_size=1000, chunk_overlap=0)

I used the docx loader in the example above to cover that case as well; you can use the PDF loader or any loader of your choice. The function handles multiple documents too.
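If you want a single entry point for mixed file types, a small dispatch on the file extension works. The mapping below from extension to LangChain loader class name is just an illustration; register whichever loaders you actually use:

```python
from pathlib import Path

# Illustrative mapping from file extension to a LangChain loader class name.
LOADER_BY_EXT = {".pdf": "PyPDFLoader", ".docx": "Docx2txtLoader", ".txt": "TextLoader"}

def pick_loader(path):
    """Return the name of the loader class to use for this file."""
    ext = Path(path).suffix.lower()
    if ext not in LOADER_BY_EXT:
        raise ValueError(f"no loader registered for '{ext}' files")
    return LOADER_BY_EXT[ext]

print(pick_loader("Basel III Simplified version.docx"))  # → Docx2txtLoader
```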

Google Search Agent in Langchain


Okay, this is an important capability to add to your applications. Let us now use the agents and tools that LangChain provides, starting with the search tool (Google search abilities). Set your SerpAPI key.

import os
os.environ["SERPAPI_API_KEY"] = "your-serp-api-key"

Now let's define tools.

from langchain import SerpAPIWrapper 
from langchain.agents import initialize_agent, Tool 
from langchain.agents import AgentType 
from langchain.chat_models import ChatOpenAI 
import langchain 
  
# Uncomment the next line to see exactly what's going on under the hood
# langchain.debug = True
  
# Initialize the OpenAI language model 
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613") 
  
# Initialize the SerpAPIWrapper for search functionality 
search = SerpAPIWrapper() 
  
# Define a list of tools offered by the agent 
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="Useful when you need to answer questions that are not found in the QA system. You should ask targeted questions.",
    ),
]

Let's initialize our agent and run it to check its results.

mrkl = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)
mrkl.run("What are INDAS7 and INDAS109?")

Now you can add more tools to your agent chain and watch how the agent decides which tool to use for each query.
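The real agent chooses by showing the LLM every tool's name and description and letting it reason about which one fits the query. Purely as a mental model (not LangChain's actual mechanism), here is a toy router over two hypothetical tools that picks by a crude keyword check:

```python
# Toy mental model of tool selection. The real agent lets the LLM choose a
# tool by reading its description; this keyword router is only an illustration.
tools = {
    "Search": {
        "description": "answer questions about facts and current events",
        "func": lambda q: f"searched: {q}",
    },
    "Calculator": {
        "description": "do arithmetic",
        "func": lambda q: str(eval(q)),  # toy only: never eval untrusted input
    },
}

def route(query):
    """Pick a tool with a crude heuristic, then run it."""
    name = "Calculator" if any(op in query for op in "+-*/") else "Search"
    return name, tools[name]["func"](query)

print(route("2 + 2"))  # → ('Calculator', '4')
```

The point of the descriptions in the Tool objects above is exactly this routing decision, except that the LLM, not a keyword check, makes the call.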

That's nice :) If you have any trouble creating agents, understanding LangChain, or navigating the LLM space in general, feel free to reach out to us :)