Introduction: XSight is a lightweight conversational agent powered by a LLaMA 3 8B model fine-tuned with LoRA. Built as part of a technical challenge, it demonstrates real-world deployment of a custom LLM on serverless infrastructure.
Challenge: Build a responsive, low-latency chat interface served from a serverless endpoint. The system accepts a user prompt plus conversation history and returns a reply generated by the instruction-tuned model.
Tech Stack: Python backend (FastAPI, Hugging Face Transformers, PEFT, PyTorch), LoRA fine-tuning, deployed on RunPod serverless. Frontend built with plain HTML/CSS/JavaScript.
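To make the LoRA piece of the stack concrete, here is a minimal sketch of configuring an adapter with PEFT. The hyperparameters (`r`, `lora_alpha`, target modules, dropout) are illustrative assumptions, not the values actually used to train XSight:

```python
# Illustrative LoRA setup with PEFT; hyperparameters are assumptions,
# not the values used to train XSight.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the adapter output
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 8B weights train
```

Because only the adapter matrices are trainable, fine-tuning fits on far less GPU memory than a full-parameter update of the 8B model.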
How It Works: Prompts are formatted with the tokenizer’s chat template, passed to `generate()`, and streamed back to the client token by token. The model is instruction-tuned for 5 epochs on curated chat data.
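A minimal sketch of that inference path, assuming a standard Transformers setup; the model ID, message contents, and generation parameters are placeholders, not XSight’s exact configuration:

```python
# Sketch of the prompt -> generate() -> stream path; names and generation
# parameters are illustrative, not XSight's exact configuration.
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto"
)

# Conversation history plus the new user turn, in chat-template form.
messages = [
    {"role": "user", "content": "What is LoRA fine-tuning?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Stream tokens as they are decoded instead of waiting for the full reply.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
thread = Thread(
    target=model.generate,
    kwargs=dict(inputs=inputs, streamer=streamer, max_new_tokens=256),
)
thread.start()
for chunk in streamer:
    print(chunk, end="", flush=True)
```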
Usage: Type your question and click “Send.” The model generates a contextual response using its fine-tuned weights and the conversation history sent along with each request.
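For programmatic use, a request can also be sent straight to the serverless endpoint. The sketch below uses RunPod’s synchronous run API; the endpoint ID, API key, and the field names inside `input` are placeholders and assumptions about this project’s payload shape:

```python
# Hypothetical call to the RunPod serverless endpoint; the endpoint ID,
# API key, and the field names inside "input" are assumptions.
import requests

resp = requests.post(
    "https://api.runpod.ai/v2/<endpoint-id>/runsync",
    headers={"Authorization": "Bearer <api-key>"},
    json={"input": {"prompt": "Summarize what LoRA does.", "history": []}},
    timeout=60,
)
print(resp.json())
```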
Limitations: No external tools, browsing, or code execution. The model may occasionally hallucinate or produce overly long answers. Cold starts can occur on serverless infrastructure.
Recommendations: Improve response reliability through better chunk handling, context-window management, and prompt design. Future versions could add a system role prompt or tool-use integration.
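Two of these suggestions are easy to prototype together: prepending a system role prompt and trimming old turns so the context stays within budget. The system text and the turn limit below are illustrative assumptions; a token-based budget would be more precise than counting turns:

```python
# Sketch of two suggested improvements: a system role prompt and naive
# context windowing. The system text and MAX_TURNS are illustrative.
MAX_TURNS = 8  # assumed budget; count tokens instead for tighter control

def build_messages(history, user_prompt):
    """Prepend a system prompt and keep only the most recent turns."""
    recent = history[-MAX_TURNS:]
    return (
        [{"role": "system", "content": "You are XSight, a concise technical assistant."}]
        + recent
        + [{"role": "user", "content": user_prompt}]
    )
```

The resulting list drops straight into `tokenizer.apply_chat_template()` in the inference sketch above.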