Privacy with Microsoft Azure OpenAI embeddings and chat model

In this blog, you will learn to use Azure OpenAI chat and embedding models to chat with your documents using LangChain. We will use Chroma here, a local but persistent vector database. You will also learn several document-loading techniques (docx files, websites, and more) to take your applications to the next level.

We might not wish to send our data to OpenAI directly through its embedding and chat APIs. That is where the Microsoft Azure OpenAI APIs come into the picture: Microsoft provides access to OpenAI models via its own platform.

Setting Up an Azure Account


Now, before you open a Colab notebook and start reading the LangChain documentation, you might wish to create your own deployments of an OpenAI embedding model and a chat model on Azure. I found a walkthrough article really helpful for doing the same.

So now that you are all set, you will need to collect a handful of values from Azure before being able to run your model.

Setting Up the Azure LLM in LangChain


import os

# Set this to `azure`
os.environ["OPENAI_API_TYPE"] = "azure"

# The API version you want to use, e.g. "2023-05-15"
os.environ["OPENAI_API_VERSION"] = "x"

# The base URL for your Azure OpenAI resource. You can find this in the
# Azure portal under your Azure OpenAI resource.
os.environ["OPENAI_API_BASE"] = "your-azure-endpoint"

# The API key for your Azure OpenAI resource. You can find this in the
# Azure portal under your Azure OpenAI resource.
os.environ["OPENAI_API_KEY"] = "x"

# Names and versions of your deployed chat and embedding models
os.environ["OPENAI_DEPLOYMENT_VERSION"] = "llm-model-version"
os.environ["OPENAI_MODEL_NAME"] = "llm-model-name"
os.environ["OPENAI_DEPLOYMENT_NAME"] = "llm-model-deployment-name"
os.environ["OPENAI_EMBEDDING_DEPLOYMENT_VERSION"] = "embedding-model-deployed-version"
os.environ["OPENAI_EMBEDDING_MODEL_NAME"] = "your-deployed-embedding-model-name"
os.environ["OPENAI_EMBEDDING_DEPLOYMENT_NAME"] = "your-deployed-embedding-deployment-name"



Now let's try calling the Azure OpenAI embedding and LLM models to check that everything is working fine. Make sure you have installed the dependencies: langchain, openai, chromadb, and tiktoken.
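
If you are working in a Colab notebook, an install cell might look like the following. The exact package list is an assumption based on the loaders used later in this post; note that this post uses the legacy langchain.* import paths and pre-1.0 openai environment variables, so if you hit import errors you may need to pin older versions of langchain and openai.

# Core dependencies plus the extras used by the loaders later in this post
!pip install langchain openai chromadb tiktoken python-dotenv pypdf docx2txt unstructured beautifulsoup4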

import os
from dotenv import load_dotenv
from langchain.chat_models import AzureChatOpenAI
from langchain.schema import HumanMessage
from langchain.embeddings import OpenAIEmbeddings

# Load environment variables from the .env file
load_dotenv()

# Create an instance of the AzureChatOpenAI class using Azure OpenAI
llm = AzureChatOpenAI(
    openai_api_type="azure",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    openai_api_base=os.getenv("OPENAI_API_BASE"),
    deployment_name=os.getenv("OPENAI_DEPLOYMENT_NAME"),
    temperature=0.7,
    openai_api_version=os.getenv("OPENAI_API_VERSION"),
)

# Testing the chat LLM
res = llm([HumanMessage(content="Tell me a joke about a penguin sitting on a fridge.")])
print(res)

# Create the embeddings client against the Azure embedding deployment
embeddings = OpenAIEmbeddings(
    openai_api_type="azure",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    openai_api_base=os.getenv("OPENAI_API_BASE"),
    deployment=os.getenv("OPENAI_EMBEDDING_DEPLOYMENT_NAME"),
    model=os.getenv("OPENAI_EMBEDDING_MODEL_NAME"),
    chunk_size=1,
)
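
The joke call exercises the chat model; you can sanity-check the embedding deployment the same way. A quick sketch, just embedding a throwaway string and checking the vector length:

# Embed a test string and confirm we get a vector back
vec = embeddings.embed_query("hello world")
print(len(vec))  # e.g. 1536 for text-embedding-ada-002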

Okay great, hopefully you got an answer to your joke. Now let's load our document and initialize our vector store with text embeddings.

# Load and preprocess the PDF document
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

loader = PyPDFLoader('/content/INDAS7.pdf')
documents = loader.load()

# Split the documents into smaller chunks for processing
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Embed the chunks and persist them to a local Chroma database
docsearch = Chroma.from_documents(texts, embeddings, persist_directory="./chroma_db")
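
Because we passed persist_directory, the embeddings are written to disk. On a later run you can reload the store without re-embedding anything; a minimal sketch, assuming the same embeddings object from earlier:

# Reload the persisted Chroma database instead of rebuilding it
docsearch = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)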

Now it is time for us to talk with our loaded PDFs. Remember, we are using a PDF for our example; however, with LangChain you can load PPTs, Word documents, and many more sources.

# Chat with memory
import langchain
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# Set to True to see the full chain inputs/outputs while debugging
langchain.debug = False

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
qa = ConversationalRetrievalChain.from_llm(llm, docsearch.as_retriever(), memory=memory)

query = "What is Cost of inventories"
result = qa({"question": query})
print(result["answer"])

# The memory lets follow-up questions refer back to earlier ones
query = "How are they calculated"
result = qa({"question": query})
print(result["answer"])
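
If you want to see what the chain remembers between turns, or reset the conversation, the memory object exposes both. A quick sketch using the same memory instance:

# Inspect the stored conversation so far
print(memory.load_memory_variables({})["chat_history"])

# Wipe the history to start a fresh conversation
memory.clear()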

Looks great!! Now, in case you wish to load a Word document instead, you can use the following method. You might have to install an extra dependency for document loading: docx2txt.


Word Document Loader


from langchain.document_loaders import Docx2txtLoader 
loader = Docx2txtLoader("/content/Basel III Simplified version.docx") 
documents = loader.load()


Directory Loader


Now, what if you wish to load multiple PDFs from a directory?

from langchain.document_loaders import DirectoryLoader 
from langchain.document_loaders import PyPDFLoader 
folder_path="/content/pdf_resources" 
pdf_loader = DirectoryLoader(folder_path, glob="*.pdf", loader_cls=PyPDFLoader) 
pdf_documents = pdf_loader.load()

What if you wish to load multiple docx files from a directory? Here is how you can do the same (you might have to install some extra dependencies for this, such as the unstructured package):

from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import UnstructuredWordDocumentLoader

folder_path = "/content/text_resources"
docx_loader = DirectoryLoader(folder_path, glob="*.docx", loader_cls=UnstructuredWordDocumentLoader)
docx_documents = docx_loader.load()
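
Once the per-type loaders have run, the resulting document lists are plain Python lists, so you can concatenate them before splitting and embedding. A small sketch, assuming the pdf_documents and docx_documents variables from above:

# Document lists from different loaders can simply be concatenated
all_documents = pdf_documents + docx_documents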


PDF, Docx, Txt Loader


What if you wish to load a mix of file types (PDFs, Word documents, and text files) from the same folder and index them together? Here is how you can do the same:

documents = []

for file in os.listdir("docs"):
  if file.endswith(".pdf"):
    pdf_path = "./docs/" + file
    loader = PyPDFLoader(pdf_path)
    documents.extend(loader.load())
  elif file.endswith(".docx") or file.endswith(".doc"):
    doc_path = "./docs/" + file
    loader = Docx2txtLoader(doc_path)
    documents.extend(loader.load())
  elif file.endswith(".txt"):
    text_path = "./docs/" + file
    loader = TextLoader(text_path)
    documents.extend(loader.load())
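
From here the combined list flows into the same split-embed-store pipeline we used for the single PDF. A sketch reusing the text_splitter and embeddings objects from earlier:

# Split and index the mixed-type documents into one Chroma store
texts = text_splitter.split_documents(documents)
docsearch = Chroma.from_documents(texts, embeddings, persist_directory="./chroma_db")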


Webpage Loader


If there is a website, blog, or article of your choice that you wish to obtain answers from, here is how you can do that too.

from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

loader = WebBaseLoader("your-website-blog-url")
docs = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
ruff_texts = text_splitter.split_documents(docs)
ruff_db = Chroma.from_documents(ruff_texts, embeddings, collection_name="ruff")

ruff = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=ruff_db.as_retriever(),
)

To call the retrieval chain you just defined:

result = ruff("Why is terraform used here. Explain to me in a story like I am a noob")
print(result["result"])

Super nice! Now you are all set to create amazing applications for yourself and your clients. If you need any help, feel free to reach out to us; we will be more than happy to help you :)