CodeGenBot - RAG Code Assistant

Retrieval-augmented Python code generation using the HumanEval dataset and DeepSeek-R1

Week 3
Duration: 1 week
RAG, HumanEval, DeepSeek-R1, Semantic Search

Project Overview

Understanding the challenge and solution approach

Problem Statement

Build a retrieval-augmented code generation assistant that can provide context-aware Python code suggestions grounded in real coding examples from the HumanEval dataset, enhancing developer productivity and code quality.

Solution Approach

Implemented a RAG system that combines Sentence Transformers semantic search over the HumanEval dataset with DeepSeek-R1-Distill-Qwen-1.5B, delivering context-aware Python code generation through a conversational Streamlit interface.

Expected Outcome

Create a production-ready code generation assistant that can understand coding context, retrieve relevant examples, and generate accurate Python code solutions, significantly improving developer productivity and code quality.

Dataset & Foundation

HumanEval dataset and semantic search foundation

HumanEval Dataset

The system leverages the HumanEval dataset - a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, reference solution, and unit tests. This dataset provides the foundation for semantic search and context-aware code generation.

  • Dataset Source: OpenAI HumanEval
  • Content Type: Python Problems & Solutions
  • Format: Parquet Files
  • Use Case: Semantic Search Context
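
As a minimal sketch of how the Parquet data could be prepared for retrieval (not necessarily how this project does it), the snippet below reads a file with pandas and builds one searchable record per problem. The field names task_id, prompt, and canonical_solution follow the published HumanEval schema; the function names and any file path passed in are hypothetical.

```python
from typing import Iterable


def load_humaneval_rows(parquet_path: str) -> list[dict]:
    """Read a HumanEval Parquet file into a list of row dicts.

    pandas (plus a Parquet engine such as pyarrow) is imported lazily so the
    rest of this sketch can be used without it.
    """
    import pandas as pd

    frame = pd.read_parquet(parquet_path)
    return frame.to_dict(orient="records")


def to_documents(rows: Iterable[dict]) -> list[dict]:
    """Build one retrieval record per problem.

    `text` is what gets embedded (signature + docstring); `solution` is the
    full reference function shown to the LLM as context.
    """
    documents = []
    for row in rows:
        documents.append(
            {
                "task_id": row["task_id"],
                "text": row["prompt"].strip(),
                "solution": row["prompt"] + row["canonical_solution"],
            }
        )
    return documents
```

Embedding only the prompt (signature and docstring) keeps the search focused on what a problem asks for, while the stored solution supplies the code that is later passed to the model.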

RAG Architecture & Pipeline

How the Retrieval-Augmented Generation system works

RAG Flow Process

  1. User Query: Python coding prompt input
  2. Semantic Embedding: all-MiniLM-L6-v2 encoding
  3. Vector Search: Retrieve similar problems
  4. Context Fusion: Combine query + examples
  5. Code Generation: DeepSeek-R1 synthesis
  6. Chat Output: Formatted code response
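
The embedding and vector search steps of this flow can be sketched as a cosine-similarity ranking. This is an illustrative version under assumptions: the source does not say which vector store or similarity measure is used, and the helper names (cosine_similarity, top_k, make_minilm_encoder) are hypothetical. The ranking itself is plain Python; the optional encoder relies on the sentence-transformers package.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vector: Sequence[float],
    document_vectors: Sequence[Sequence[float]],
    k: int = 3,
) -> list[tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [
        (index, cosine_similarity(query_vector, vector))
        for index, vector in enumerate(document_vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def make_minilm_encoder() -> Callable[[Sequence[str]], list[list[float]]]:
    """Return an encoder backed by all-MiniLM-L6-v2.

    Requires the sentence-transformers package; imported lazily so the
    ranking helpers above stay dependency-free.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return lambda texts: model.encode(list(texts)).tolist()
```

A brute-force scan like this is adequate for a corpus the size of HumanEval; a dedicated vector index only becomes worthwhile for much larger collections.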

Modular Pipeline Components

  • Embedding Module: Sentence Transformers for semantic encoding
  • Retrieval Module: Vector search and context selection
  • Generation Module: DeepSeek-R1 code synthesis
  • UI Module: Streamlit conversational interface
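
One way to keep these modules separable is to wire them together as plain callables, so that each can be swapped or stubbed in tests. The sketch below is an assumption about structure, not the project's actual code; RagPipeline and its field names are hypothetical, and the UI module is left out because it only calls run.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RagPipeline:
    """Composes the embedding, retrieval, prompt-building, and generation steps."""

    embed: Callable[[str], Sequence[float]]
    retrieve: Callable[[Sequence[float], int], list[str]]
    build_prompt: Callable[[str, list[str]], str]
    generate: Callable[[str], str]
    top_k: int = 3

    def run(self, query: str) -> str:
        # 1. Encode the user query into a vector.
        query_vector = self.embed(query)
        # 2. Fetch the most similar examples for that vector.
        examples = self.retrieve(query_vector, self.top_k)
        # 3. Fuse the query and examples into a single prompt.
        prompt = self.build_prompt(query, examples)
        # 4. Ask the language model for code.
        return self.generate(prompt)
```

Because every dependency is injected, the pipeline can be exercised with simple lambdas before a real encoder or model is attached.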

Methodology & Implementation

Step-by-step approach to building the RAG code generator

  1. HumanEval Dataset Integration: Integrated the OpenAI HumanEval dataset of hand-written Python programming problems and reference solutions as the corpus for semantic search and context retrieval.
  2. Semantic Search Implementation: Implemented semantic search using Sentence Transformers (all-MiniLM-L6-v2) to embed and retrieve similar coding problems from the HumanEval dataset for context-aware code generation.
  3. DeepSeek-R1 Model Integration: Integrated DeepSeek-R1-Distill-Qwen-1.5B via the Hugging Face Inference API for high-quality Python code generation, leveraging retrieved context for improved accuracy.
  4. Modular Pipeline Development: Built a clean, modular pipeline with separate components for embedding, retrieval, and generation logic, ensuring easy extension and maintenance.
  5. Context-Aware Generation: Developed algorithms to effectively combine user queries with retrieved coding examples, enabling the LLM to generate more accurate and relevant code solutions.
  6. Streamlit Chat Interface: Created a conversational Streamlit chatbot with code formatting, chat history, error handling, and real-time code generation capabilities.
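
Steps 3 and 5 above can be illustrated with a small prompt builder and a generation call. This is a sketch under assumptions: the prompt layout, the helper names, and the choice of InferenceClient.text_generation from huggingface_hub are illustrative, and whether the model is available on a given inference endpoint is not confirmed by the source. DeepSeek-R1 distilled models emit their reasoning inside <think> tags, so the sketch strips that block before the answer is shown.

```python
import re


def build_prompt(query: str, examples: list[dict], max_examples: int = 3) -> str:
    """Combine the user query with retrieved reference solutions."""
    parts = [
        "You are a Python coding assistant. "
        "Use the reference examples when they are relevant."
    ]
    for number, example in enumerate(examples[:max_examples], start=1):
        parts.append(f"### Example {number}\n{example['solution'].rstrip()}")
    parts.append(f"### Task\n{query.strip()}")
    parts.append("### Solution")
    return "\n\n".join(parts)


_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(completion: str) -> str:
    """Drop the <think>...</think> reasoning block from a completion."""
    return _THINK_BLOCK.sub("", completion).strip()


def generate_with_hf(prompt: str, token: str, max_new_tokens: int = 512) -> str:
    """Call the model through huggingface_hub (imported lazily).

    Assumes the model is served by the endpoint the token has access to.
    """
    from huggingface_hub import InferenceClient

    client = InferenceClient(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", token=token
    )
    completion = client.text_generation(prompt, max_new_tokens=max_new_tokens)
    return strip_reasoning(completion)
```

Capping the number of examples keeps the prompt within the context budget of a 1.5B-parameter model.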

Results & Performance

Key metrics and achievements from the RAG code generator

  • HumanEval: Benchmark Dataset
  • DeepSeek-R1: State-of-the-art LLM
  • Semantic Search: Context-aware Retrieval
  • Streamlit Chat: Conversational UI

Key Achievements

  • Successfully implemented a production-ready RAG system for Python code generation
  • Integrated HumanEval dataset with semantic search for context-aware code suggestions
  • Built a modular, extensible pipeline with clean separation of concerns
  • Created an intuitive conversational interface for code generation assistance
  • Demonstrated that retrieval-augmented generation can ground a compact LLM (DeepSeek-R1-Distill-Qwen-1.5B) in relevant code examples

Technical Stack

Technologies and frameworks used in the project

DeepSeek-R1, HumanEval, RAG, Sentence Transformers, all-MiniLM-L6-v2, Python, Streamlit, Hugging Face, Vector Search, Semantic Search, Parquet, Chatbot

Challenges & Solutions

Key obstacles encountered and how they were overcome

  • Semantic Search Accuracy: Retrieved examples must be relevant to the user query; filtering and ranking on top of Sentence Transformers embeddings keep only the closest matches.
  • Context Integration: The user query and the retrieved coding examples are combined into a single prompt so the LLM can ground its output in them.
  • Real-time Performance: Optimized the RAG pipeline and vector search to keep generation times reasonable without sacrificing code quality.
  • Modular Architecture: Kept embedding, retrieval, and generation as separate components so each can be extended or replaced independently.
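
The accuracy and latency points above are often addressed with a similarity threshold and an embedding cache. The sketch below shows that general approach and is not taken from the project; filter_hits, CachedEncoder, and the default threshold of 0.3 are hypothetical and would need tuning against real queries.

```python
from typing import Callable, Sequence


def filter_hits(
    hits: Sequence[tuple[int, float]],
    min_score: float = 0.3,
    max_hits: int = 3,
) -> list[tuple[int, float]]:
    """Keep only hits at or above min_score, best first, capped at max_hits."""
    kept = [(index, score) for index, score in hits if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_hits]


class CachedEncoder:
    """Wraps an encode function so repeated texts are embedded only once."""

    def __init__(self, encode: Callable[[str], Sequence[float]]) -> None:
        self._encode = encode
        self._cache: dict[str, tuple[float, ...]] = {}
        self.misses = 0  # number of calls that reached the real encoder

    def __call__(self, text: str) -> tuple[float, ...]:
        if text not in self._cache:
            self.misses += 1
            self._cache[text] = tuple(self._encode(text))
        return self._cache[text]
```

Dropping low-scoring hits avoids padding the prompt with unrelated examples, and caching helps most when the corpus embeddings are computed once at startup and reused for every query.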