### `MacM3restart.sh` (new file)

```bash
#!/bin/bash

# Restart Redis server
brew services restart redis

# Restart PostgreSQL server
brew services restart postgresql

# Run the main program
python main.py
```

### `MacM3setup.sh` (new file)

```bash
#!/bin/bash

# Check that the 'uploads' directory exists
if [ ! -d "uploads" ]; then
  echo "Error: The 'uploads' directory does not exist. Please create it and add your documents."
  exit 1
fi

# Create a virtual environment
python3 -m venv offlineqa-env || { echo "Error creating virtual environment. Exiting."; exit 1; }

# Activate the virtual environment
source offlineqa-env/bin/activate || { echo "Error activating virtual environment. Exiting."; exit 1; }

# Install dependencies
pip install -r requirements.txt || { echo "Error installing dependencies. Exiting."; exit 1; }

# Download and organize models
mkdir -p models/translation models/llm models/embedding

# Download translation models
python <<'PY' || { echo "Error downloading translation models. Exiting."; exit 1; }
from transformers import MarianMTModel, MarianTokenizer

language_pairs = {
    'es': ('Helsinki-NLP/opus-mt-es-en', 'Helsinki-NLP/opus-mt-en-es'),
    'fr': ('Helsinki-NLP/opus-mt-fr-en', 'Helsinki-NLP/opus-mt-en-fr'),
    'de': ('Helsinki-NLP/opus-mt-de-en', 'Helsinki-NLP/opus-mt-en-de'),
    'th': ('Helsinki-NLP/opus-mt-th-en', 'Helsinki-NLP/opus-mt-en-th'),
    'ru': ('Helsinki-NLP/opus-mt-ru-en', 'Helsinki-NLP/opus-mt-en-ru'),
    'ar': ('Helsinki-NLP/opus-mt-ar-en', 'Helsinki-NLP/opus-mt-en-ar'),
    'pt': ('Helsinki-NLP/opus-mt-pt-en', 'Helsinki-NLP/opus-mt-en-pt'),
    'zh': ('Helsinki-NLP/opus-mt-zh-en', 'Helsinki-NLP/opus-mt-en-zh'),
}

for _, (to_en_model_name, from_en_model_name) in language_pairs.items():
    MarianTokenizer.from_pretrained(to_en_model_name, cache_dir='./models/translation')
    MarianMTModel.from_pretrained(to_en_model_name, cache_dir='./models/translation')
    MarianTokenizer.from_pretrained(from_en_model_name, cache_dir='./models/translation')
    MarianMTModel.from_pretrained(from_en_model_name, cache_dir='./models/translation')
PY

# Download the LLM (Vicuna-7B)
python <<'PY' || { echo "Error downloading Vicuna-7B. Exiting."; exit 1; }
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('TheBloke/Vicuna-7B-v1.5-GGUF', cache_dir='./models/llm')
model = AutoModelForCausalLM.from_pretrained('TheBloke/Vicuna-7B-v1.5-GGUF', cache_dir='./models/llm')
PY

# Download the SentencePiece tokenizer
python <<'PY' || { echo "Error downloading SentencePiece tokenizer. Exiting."; exit 1; }
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base', cache_dir='./models/embedding')
PY

# Generate embeddings for documents
python <<'PY' || { echo "Error generating embeddings. Exiting."; exit 1; }
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

# Load and preprocess documents from the 'uploads' folder
loader = DirectoryLoader('./uploads', glob='**/*.txt')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Generate embeddings (all-MiniLM-L6-v2 is recommended for Vicuna)
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
db = FAISS.from_documents(texts, embeddings)
db.save_local('data/faiss_index')
PY

# Download the spaCy language model and the coreference resolution data
python -m spacy download en_core_web_sm || { echo "Error downloading spaCy model. Exiting."; exit 1; }
python -m spacy_experimental.coref.download en || { echo "Error downloading spaCy coreference data. Exiting."; exit 1; }

# Install and start Redis
brew install redis || { echo "Error installing Redis. Exiting."; exit 1; }
brew services start redis || { echo "Error starting Redis. Exiting."; exit 1; }

# Install and start PostgreSQL
brew install postgresql || { echo "Error installing PostgreSQL. Exiting."; exit 1; }
brew services start postgresql || { echo "Error starting PostgreSQL. Exiting."; exit 1; }

# Create a database and user in PostgreSQL
psql postgres -c "CREATE DATABASE my_qna_db;" || { echo "Error creating database. Exiting."; exit 1; }
psql postgres -c "CREATE USER my_qna_user WITH ENCRYPTED PASSWORD 'your_password';" || { echo "Error creating user. Exiting."; exit 1; }
psql postgres -c "GRANT ALL PRIVILEGES ON DATABASE my_qna_db TO my_qna_user;" || { echo "Error granting privileges. Exiting."; exit 1; }

echo "Setup complete! You can now run the system using 'python main.py'"
```

### `README.md`

* Gets user input, generates UIDs for new chat sessions, and retrieves context for follow-up questions
* Adds chat requests to the queue for processing

### M3 Mac Quick Setup

1. **Clone the repository:**

   ```bash
   git clone https://code.ornl.gov/6cq/offline-multilingual-question-answering-system
   ```

2. **Navigate to the project directory:**

   ```bash
   cd offline-multilingual-question-answering-system
   ```

3. **Create a document upload folder:**
   * Create a folder named `uploads` at the root level of the project.

4. **Add your documents:**
   * Place your documents (e.g., plain text files) in the `uploads` folder.

5. **Run the setup script:**

   ```bash
   ./MacM3setup.sh
   ```

   This script will perform the following actions:

   * **Create and activate a virtual environment:**
     * It creates a virtual environment named `offlineqa-env` using `python3 -m venv`.
     * It activates the virtual environment using `source offlineqa-env/bin/activate`.
   * **Install dependencies:**
     * It installs the required Python packages listed in `requirements.txt` using `pip install -r requirements.txt`.
   * **Download and organize models:**
     * It creates the necessary directories for storing the models (`models/translation`, `models/llm`, `models/embedding`).
     * It downloads the translation models (MarianMT) for the supported language pairs using the `transformers` library.
     * It downloads the Vicuna-7B LLM and its tokenizer.
     * It downloads the SentencePiece tokenizer.
   * **Prepare your document collection:**
     * It loads and preprocesses documents from the `uploads` folder using `DirectoryLoader` and `RecursiveCharacterTextSplitter`.
     * It generates embeddings for the documents using the `all-MiniLM-L6-v2` embedding model (recommended for Vicuna) and stores them in a FAISS index.
     * It saves the FAISS index to disk (`data/faiss_index`).
   * **Download the spaCy language model and coreference data:**
     * It downloads the `en_core_web_sm` language model for spaCy using `python -m spacy download en_core_web_sm`.
     * It downloads the coreference resolution data for spaCy using `python -m spacy_experimental.coref.download en`.
   * **Set up and start Redis and PostgreSQL:**
     * It installs Redis and PostgreSQL using Homebrew.
     * It starts the Redis and PostgreSQL servers.
     * It creates a database and user in PostgreSQL.

6. **Run the main program:**

   ```bash
   python main.py
   ```

   Note that the virtual environment is active only inside the setup script's own shell; activate it in your current shell first (`source offlineqa-env/bin/activate`) so that `python main.py` picks up the installed dependencies.

   The system will start running, presenting the terminal-based chat interface for user interaction.

### Running the System

1. **Initial Setup:**
   * If you haven't already, follow the installation instructions to set up the system and its dependencies.
2. **Starting the System:**
   * Execute the `MacM3restart.sh` script to start the Redis and PostgreSQL servers and run the main program:

     ```bash
     ./MacM3restart.sh
     ```

   * The system will start running, presenting the terminal-based chat interface for user interaction.
3. **Restarting the System:**
   * If you need to restart the system (e.g., after changing the code or configuration), run the same `MacM3restart.sh` script:

     ```bash
     ./MacM3restart.sh
     ```

   * This script restarts the Redis and PostgreSQL servers and then re-runs the main program.

## Getting Started

### Prerequisites
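Back in `MacM3setup.sh`, the embedding step splits documents into 1000-character chunks with `chunk_overlap=0` before indexing them. As a rough sketch of what that windowing does — the real `RecursiveCharacterTextSplitter` additionally prefers to break at paragraph and sentence separators, whereas this toy version cuts at fixed character offsets:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Split text into fixed-size windows; each window repeats the last
    chunk_overlap characters of the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# With chunk_overlap=0 the windows simply tile the document:
print(chunk_text("abcdefgh", chunk_size=4))                    # ['abcd', 'efgh']
# A non-zero overlap carries trailing context into the next chunk:
print(chunk_text("abcdefgh", chunk_size=4, chunk_overlap=2))   # ['abcd', 'cdef', 'efgh', 'gh']
```

A non-zero overlap helps a retrieved chunk stand on its own when a sentence straddles a boundary; the script's choice of `0` keeps the index smaller.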
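In the setup script, `db.save_local('data/faiss_index')` writes the index as a directory of files, and the main program cannot answer questions until that directory exists. A small preflight check along these lines could fail fast with a clearer message than a stack trace; `index_ready` is an illustrative helper, not part of the repository:

```python
import tempfile
from pathlib import Path

def index_ready(index_dir: str = "data/faiss_index") -> bool:
    """True if the FAISS index directory exists and contains at least one file."""
    p = Path(index_dir)
    return p.is_dir() and any(p.iterdir())

# Demonstrate with a throwaway directory standing in for data/faiss_index:
with tempfile.TemporaryDirectory() as tmp:
    print(index_ready(tmp))                 # False: directory exists but is empty
    (Path(tmp) / "index.faiss").touch()
    print(index_ready(tmp))                 # True: saved index files are present
```

In `main.py` one might call this at startup and exit with "run ./MacM3setup.sh first" when it returns `False`.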