Sony Internship: LLM/AI


ReAct Framework - Technologies and Components

1. RAG (Retrieval-Augmented Generation)

📌 Purpose:

  • Retrieve relevant skills from the LangChain Chatchat vector skill library.
  • Combine the retrieved skills with the LLM to generate the final task planning and execution strategies.
  • Improve the accuracy of information during robot interactions.

📌 Details:

  • The framework uses vector databases to store skills and assist the LLM in generating optimal strategies by querying the most relevant skills.
  • In complex task handling, the LLM refines its strategies using historical interaction data, in line with RAG’s core idea of combining retrieval with generation (see the sketch below).
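
A minimal sketch of this retrieval step, using FAISS directly. The embed() stub and the SKILLS entries are illustrative stand-ins for a real embedding model and the actual skill library:

```python
import numpy as np
import faiss

# Illustrative skill descriptions; the real library holds many more entries.
SKILLS = [
    "top-down grasp for small rigid objects",
    "side grasp for cylindrical objects such as cups",
    "two-stage pick-and-place with an intermediate pose",
]

def embed(texts):
    """Stand-in for a real sentence-embedding model; deterministic
    random vectors keep the sketch runnable without any model."""
    rng = np.random.default_rng(0)
    return rng.random((len(texts), 384), dtype=np.float32)

# Build the vector index over the skill library.
index = faiss.IndexFlatL2(384)
index.add(embed(SKILLS))

# Retrieve the skills most relevant to the current task...
query = "pick up the red cup on the table"
_, ids = index.search(embed([query]), 2)
retrieved = [SKILLS[i] for i in ids[0]]

# ...and splice them into the LLM prompt (the "augmented generation" step).
prompt = (
    "You control a robot arm. Relevant skills:\n- "
    + "\n- ".join(retrieved)
    + f"\nTask: {query}\nProduce a step-by-step plan."
)
print(prompt)
```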

2. ReAct (Reasoning + Acting)

📌 Purpose:

  • Allow the LLM not only to answer questions but also to perform reasoning, planning, and execution.
  • The robot can think about its current state, query external information, and then act accordingly.

📌 Details:

  • “ReAct” refers both to the framework’s name and an interaction mode for the LLM.
  • The LLM first performs Reasoning, then Acts.
  • Example:
    • The robot reasons about the target position (using 3D vision data).
    • It queries the LangChain vector skill library to find an appropriate grasping method.
    • It executes the grasp.

📌 ReAct Paper (Princeton University & Google Research):

  • ReAct combines Chain-of-Thought (CoT) reasoning with action-based execution, making it well suited to task planning and autonomous decision-making; a minimal loop sketch follows below.
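
A minimal sketch of one Reasoning + Acting loop. The LLM is scripted, and the tools (locate_object, grasp) are hypothetical stand-ins for the robot’s real skill interface:

```python
# Scripted stand-in for the real LLM: emits one Thought + Action per turn.
SCRIPT = iter([
    "Thought: I need the cup's position first.\nAction: locate_object[red cup]",
    "Thought: Position known; now pick it up.\nAction: grasp[red cup]",
])

def llm(history: str) -> str:
    return next(SCRIPT)

# Hypothetical tool interface standing in for the robot's skill library.
TOOLS = {
    "locate_object": lambda arg: f"{arg} at (0.42, 0.10, 0.05)",
    "grasp": lambda arg: f"{arg} grasped successfully",
}

def react_step(history: str) -> str:
    """One Reasoning + Acting iteration: think, act, observe."""
    reply = llm(history)
    tool, arg = reply.split("Action:")[-1].strip().split("[", 1)
    observation = TOOLS[tool.strip()](arg.rstrip("]"))
    return history + reply + f"\nObservation: {observation}\n"

history = "Task: pick up the red cup on the table.\n"
for _ in range(2):
    history = react_step(history)
print(history)
```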

3. 3D Vision + Grounded Segmentation

📌 Purpose:

  • The robot uses a depth camera and 3D reconstruction to help the LLM understand environmental information.
  • It combines Grounding DINO with SAM (Segment Anything Model) for text-grounded object segmentation.

📌 Details:

  • The vision pipeline provides the target object’s position, which the LLM then uses to plan a path.
  • Example:
    • The robot may ask, “Where is the red object on the table?”
    • The vision module segments the object and provides the data to the LLM.
    • The LLM combines the 3D coordinates with task planning and issues operation instructions (the back-projection geometry is sketched below).
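
The "segment, then localize" step can be sketched with the pinhole camera model: given a segmentation mask and an aligned depth image, the object’s 3D centroid follows from the camera intrinsics. The intrinsics below are made-up values; real ones come from calibration:

```python
import numpy as np

# Assumed pinhole intrinsics (fx, fy, cx, cy); real values come from calibration.
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0

def mask_to_3d_centroid(mask: np.ndarray, depth_m: np.ndarray) -> np.ndarray:
    """Back-project masked depth pixels into camera-frame 3D points
    and return their centroid (the object's approximate position)."""
    v, u = np.nonzero(mask)          # pixel rows/cols covered by the object
    z = depth_m[v, u]                # metric depth at those pixels
    valid = z > 0                    # drop missing depth readings
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx            # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1).mean(axis=0)

# Toy example: a 10x10 object patch at ~0.8 m depth.
mask = np.zeros((480, 640), dtype=bool)
mask[200:210, 300:310] = True
depth = np.full((480, 640), 0.8)
print(mask_to_3d_centroid(mask, depth))  # -> [x, y, z] in metres
```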

4. LangChain + Vector Database

📌 Purpose:

  • LangChain serves as the Prompt Management and Memory Storage component for the LLM.
  • A vector database (FAISS / ChromaDB / Milvus) stores robot skills and user interaction history.

📌 Details:

  • LangChain enables the LLM to remember tasks previously performed by the robot, avoiding redundant calculations.
  • The robot automatically manages the skill library (a vector-store sketch follows this list), supporting:
    • Task Screening
    • Skill Summarization
    • Self-verification
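
A sketch of the skill library as a LangChain vector store. FakeEmbeddings keeps the snippet self-contained (so retrieval results are arbitrary); a real deployment would plug in an actual embedding model, and import paths may differ across LangChain versions:

```python
from langchain_community.embeddings import FakeEmbeddings
from langchain_community.vectorstores import FAISS

# Skill entries with metadata the robot can use for screening/verification.
skills = [
    "side grasp for cylindrical objects",
    "top-down grasp for flat objects",
    "place an object gently on a surface",
]
store = FAISS.from_texts(
    skills,
    FakeEmbeddings(size=128),                      # random vectors: demo only
    metadatas=[{"verified": True}] * len(skills),  # self-verification flag
)

# Retrieve remembered skills instead of recomputing a plan from scratch.
for doc in store.similarity_search("how to pick up a mug", k=2):
    print(doc.page_content, doc.metadata)
```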

5. Task Decomposition

📌 Purpose:

  • The robot can break down complex tasks into smaller subtasks and complete them step by step.
  • Example:
      1. Retrieve object coordinates → 2. Calculate the grasp angle → 3. Perform the grasp → 4. Place the object.

📌 Details:

  • The LLM, working with LangChain, manages the task queue to ensure subtasks execute in a logical order.
  • When faced with a new task, the robot can retrieve similar tasks from the skill library and generate an appropriate strategy (see the sketch below).
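
A sketch of the decomposition step, with the LLM call stubbed to return a fixed JSON plan; in the real framework the plan would come from the model together with retrieved skills:

```python
import json

def llm_plan(task: str) -> str:
    """Stub: a real call would ask the LLM to emit subtasks as JSON."""
    return json.dumps([
        {"step": "retrieve_object_coordinates", "target": "red cup"},
        {"step": "calculate_grasp_angle"},
        {"step": "perform_grasp"},
        {"step": "place_object", "location": "tray"},
    ])

def execute(subtask: dict) -> bool:
    """Placeholder executor: dispatch to the robot's skill library here."""
    print("executing:", subtask)
    return True

plan = json.loads(llm_plan("move the red cup to the tray"))
for subtask in plan:          # run subtasks in order; stop on failure
    if not execute(subtask):
        break
```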

6. Self-Learning & Feedback Loop

📌 Purpose:

  • After task execution, the robot automatically summarizes its experience to improve future decision-making.
  • With LangChain Memory, the robot can remember past mistakes and optimize strategies.

📌 Details:

  • Task feedback (success or failure) → update the skill library → optimize LLM prompts.
  • Over long-term interaction, the robot performs increasingly well in recurring scenarios thanks to its accumulated memory (a sketch of this loop follows below).
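
A sketch of this loop as a per-skill success-rate record; the 0.5 threshold and the prompt-hint wording are illustrative assumptions, not part of the framework:

```python
from collections import defaultdict

# Per-skill outcome counters: [successes, attempts].
stats = defaultdict(lambda: [0, 0])

def record_feedback(skill: str, success: bool) -> None:
    """Update the skill library with the outcome of one execution."""
    stats[skill][1] += 1
    stats[skill][0] += int(success)

def prompt_hint(skill: str) -> str:
    """Turn accumulated feedback into a hint injected into future prompts."""
    wins, tries = stats[skill]
    if tries and wins / tries < 0.5:  # illustrative threshold
        return f"Note: '{skill}' has often failed; prefer an alternative."
    return ""

record_feedback("side grasp", success=False)
record_feedback("side grasp", success=False)
record_feedback("side grasp", success=True)
print(prompt_hint("side grasp"))
```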

Summary: Technologies in the ReAct Framework

| Technology | Role in ReAct Framework |
| --- | --- |
| RAG (Retrieval-Augmented Generation) | Retrieves relevant skills from the vector skill library to improve task-execution accuracy. |
| ReAct (Reasoning + Acting) | Enables the robot to reason and execute tasks, facilitating autonomous decision-making. |
| 3D Vision + Grounded Segmentation | Helps the LLM understand the robot’s environment and plan tasks accordingly. |
| LangChain + Vector Database | Stores task information and historical interactions, and manages skills. |
| Task Decomposition | Breaks complex tasks into smaller steps to improve execution efficiency. |
| Self-Learning (Feedback Loop) | Optimizes robot skills based on task performance and feedback, improving future success rates. |