In this blog, you will learn to use Azure OpenAI chat and embeddings models to chat with your documents using LangChain. We will use Chroma DB, a local but persistent vector store. You will also learn several document-loading techniques (docx, websites, and more) to take your applications to the next level.
We might not always wish to send our data to OpenAI directly through its embeddings and chat models. That is where the Microsoft Azure OpenAI APIs come into the picture: Microsoft provides access to the same OpenAI models through its own platform.
Setting Up Azure Account
Before you open a Colab notebook and start reading the LangChain documentation, you might wish to create your own deployments of an OpenAI embedding model and a chat model on Azure. This is the article I found really helpful for doing that.
Now that you are all set up, you will need to gather quite a few variables from Azure before you can run your model.
Setting Up Azure LLM in Langchain
import os

os.environ["OPENAI_API_TYPE"] = "azure"  # tell the openai client to use Azure endpoints
os.environ["OPENAI_API_BASE"] = "your-azure-endpoint"  # e.g. https://<resource-name>.openai.azure.com/
os.environ["OPENAI_API_KEY"] = "your-azure-api-key"
os.environ["OPENAI_API_VERSION"] = "llm-model-api-version"  # e.g. "2023-05-15"; the code below reads this name
os.environ["OPENAI_MODEL_NAME"] = "llm-model-name"
os.environ["OPENAI_DEPLOYMENT_NAME"] = "llm-model-deployment-name"
os.environ["OPENAI_EMBEDDING_DEPLOYMENT_VERSION"] = "embedding-model-deployed-version"
os.environ["OPENAI_EMBEDDING_MODEL_NAME"] = "your-deployed-embedding-model-name"
os.environ["OPENAI_EMBEDDING_DEPLOYMENT_NAME"] = "your-deployed-embedding-deployment-name"
Now let's try calling the Azure OpenAI embedding and LLM models to check that everything is working fine. Make sure you have installed the dependencies: langchain, openai, chromadb, and tiktoken.
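If any of these are missing, a quick install in a Colab cell should cover it (a sketch; pin versions in real projects):

!pip install langchain openai chromadb tiktoken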
import os

from dotenv import load_dotenv
from langchain.chat_models import AzureChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import HumanMessage

load_dotenv()  # optional: pull the variables from a .env file

llm = AzureChatOpenAI(
    openai_api_type="azure",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    openai_api_base=os.getenv("OPENAI_API_BASE"),
    deployment_name=os.getenv("OPENAI_DEPLOYMENT_NAME"),
    openai_api_version=os.getenv("OPENAI_API_VERSION"),
    temperature=0.7,
)

res = llm([HumanMessage(content="Tell me a joke about a penguin sitting on a fridge.")])
print(res.content)

embeddings = OpenAIEmbeddings(
    openai_api_type="azure",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    openai_api_base=os.getenv("OPENAI_API_BASE"),
    deployment=os.getenv("OPENAI_EMBEDDING_DEPLOYMENT_NAME"),
    model=os.getenv("OPENAI_EMBEDDING_MODEL_NAME"),
    chunk_size=1,  # Azure embedding deployments can limit batch size per request
)
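To sanity-check the embedding deployment as well, you can embed a test string with embed_query, a standard LangChain embeddings method:

# Should print the dimensionality of the returned vector (e.g. 1536 for text-embedding-ada-002)
vector = embeddings.embed_query("Hello, Azure!")
print(len(vector))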
Okay great, hopefully you got an answer to your joke. Now let's load our document and initialize our vector store with the text embeddings.
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
# Load the PDF and split it into ~1000-character chunks
loader = PyPDFLoader("/content/INDAS7.pdf")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Embed the chunks and store them in a local, persistent Chroma DB
docsearch = Chroma.from_documents(texts, embeddings, persist_directory="./chroma_db")
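Because we passed persist_directory, the index is written to disk. In a later session you can reload it instead of re-embedding everything; a sketch, assuming the same embeddings object:

# Reload the persisted Chroma index rather than rebuilding it
docsearch = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)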
Now it is time to talk with our loaded PDFs. Remember, we are using a PDF in this example, but with LangChain you can load PPTs, Word documents, and many other sources.
import langchain
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

langchain.debug = False  # flip to True to inspect the chain's intermediate calls

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
qa = ConversationalRetrievalChain.from_llm(llm, docsearch.as_retriever(), memory=memory)
query = "What is Cost of inventories"
result = qa({"question": query})
print(result["answer"])
query = "How are they calculated"
result = qa({"question": query})
print(result['answer'])
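To chat back and forth instead of hard-coding queries, a minimal interactive loop over the same chain could look like this:

# Simple REPL: type a question, get an answer; type "exit" to stop
while True:
    query = input("You: ")
    if query.strip().lower() in {"exit", "quit"}:
        break
    result = qa({"question": query})
    print("Bot:", result["answer"])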
Looks great!! Now, if you wish to load a Word document instead of a PDF, you can use the following loader. You might have to install an extra dependency for this: docx2txt.
Word Document Loader
from langchain.document_loaders import Docx2txtLoader
loader = Docx2txtLoader("/content/Basel III Simplified version.docx")
documents = loader.load()
Directory Loader
Now, what if you wish to load multiple PDFs from a directory?
from langchain.document_loaders import DirectoryLoader, PyPDFLoader

folder_path = "/content/pdf_resources"
pdf_loader = DirectoryLoader(folder_path, glob="*.pdf", loader_cls=PyPDFLoader)
pdf_documents = pdf_loader.load()
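If your PDFs are nested inside subfolders, the glob pattern can be widened to search recursively:

# "**/*.pdf" matches PDFs in the folder and all of its subdirectories
pdf_loader = DirectoryLoader(folder_path, glob="**/*.pdf", loader_cls=PyPDFLoader)
pdf_documents = pdf_loader.load()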
What if you wish to load multiple docx files from a directory? Here is how you can do that (you might have to install the unstructured package for this loader):
from langchain.document_loaders import DirectoryLoader, UnstructuredWordDocumentLoader

folder_path = "/content/text_resources"
docx_loader = DirectoryLoader(folder_path, glob="*.docx", loader_cls=UnstructuredWordDocumentLoader)
docx_documents = docx_loader.load()
PDF, Docx, Txt Loader
What if you wish to load a mix of PDF, docx, and txt files and index them together? Here is how you can do that:
import os

from langchain.document_loaders import Docx2txtLoader, PyPDFLoader, TextLoader

# Walk the docs folder and pick a loader based on each file's extension
documents = []
for file in os.listdir("docs"):
    if file.endswith(".pdf"):
        pdf_path = "./docs/" + file
        loader = PyPDFLoader(pdf_path)
        documents.extend(loader.load())
    elif file.endswith(".docx") or file.endswith(".doc"):
        doc_path = "./docs/" + file
        loader = Docx2txtLoader(doc_path)
        documents.extend(loader.load())
    elif file.endswith(".txt"):
        text_path = "./docs/" + file
        loader = TextLoader(text_path)
        documents.extend(loader.load())
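From here, the combined documents go through exactly the same pipeline as the single PDF did: split, embed, and store. A sketch reusing the text_splitter and embeddings defined earlier (the mixed_db name is just illustrative):

# Chunk and index the mixed collection into one searchable store
chunked_documents = text_splitter.split_documents(documents)
mixed_db = Chroma.from_documents(chunked_documents, embeddings, persist_directory="./chroma_db_mixed")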
Webpage Loader
If you have a website, blog, or article of your choice that you wish to obtain answers from, here is how you can do that too.
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

loader = WebBaseLoader("your-website-blog-url")
docs = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
ruff_texts = text_splitter.split_documents(docs)
ruff_db = Chroma.from_documents(ruff_texts, embeddings, collection_name="ruff")

ruff = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" puts all retrieved chunks into a single prompt
    retriever=ruff_db.as_retriever(),
)
To call the retrieval chain you just defined:

result = ruff("Why is terraform used here? Explain it to me in a story, like I am a noob.")
print(result["result"])
Super nice. Now you are all set to create amazing applications for yourself and your clients. If you need any help, feel free to reach out to us; we will be more than happy to help you :)