🔍 Addressing Pain Points
- Flipping through hundreds of pages of a ship's operating manual is time-consuming and tedious, often causing key details to be missed.
- When new maritime policies, regulations, or equipment are introduced, information is scattered and updates lag, making it hard to know where to find what you need.
- Existing large models have many "blind spots," are difficult to deploy locally, lack industry-specific knowledge, and cannot meet enterprise-level security and privacy requirements.
OceanGPT·Cangyuan is trained on Chinese and English maritime-domain corpora. Paired with your own documents, it supports local private deployment and efficient custom fine-tuning, so you can quickly turn internal materials into a private Q&A engine:
🌟 For Example
1️⃣ Small Fishing Vessel Energy Saving Manual Q&A Assistant
- Background: Energy-saving technologies for fishing vessels are updated frequently, making it difficult for boat owners to grasp the key points in real-time.
- Implementation: OceanGPT·Cangyuan reads the "Small Fishing Vessel Energy Saving Manual" and is fine-tuned into a Q&A system that supports professional answers to questions like "How to adjust the propeller blade angle?" and "What is needed for regular maintenance of a fishing boat engine?".
2️⃣ Zhejiang Provincial Department of Ocean and Fisheries Document QA Assistant
- Background: Government department documents are numerous and lengthy.
- Implementation: OceanGPT is fine-tuned with documents from the "Zhejiang Provincial Department of Ocean and Fisheries" to provide instant answers to questions like "What is the 226 formation rule?".
This tutorial builds on the open-source OceanGPT·Cangyuan large model together with the open-source EasyDataset and LLaMA Factory tools, and covers the following key steps:
1. Model Acquisition
   - Download the pre-trained OceanGPT model from HuggingFace/Git/ModelScope
   - Supports local deployment of the 8B-parameter base version
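The download step can be scripted. The repo id below is an assumption — verify the exact 8B checkpoint name on the zjunlp organization pages (HuggingFace/ModelScope) before running; the three routes are shown commented so you can pick one:

```shell
# Pick one of three download routes. The repo id is an assumption --
# confirm the exact 8B checkpoint name on the zjunlp model cards first.
MODEL_ID="zjunlp/OceanGPT-basic-8B"

# Route 1: Hugging Face CLI
#   pip install -U "huggingface_hub[cli]"
#   huggingface-cli download "$MODEL_ID" --local-dir ./OceanGPT

# Route 2: ModelScope (often faster inside mainland China)
#   pip install modelscope
#   modelscope download --model "$MODEL_ID" --local_dir ./OceanGPT

# Route 3: plain git (requires git-lfs for the weight files)
#   git lfs install
#   git clone "https://huggingface.co/$MODEL_ID" ./OceanGPT

echo "target model: $MODEL_ID"
```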
2. EasyDataset Data Engineering
   - Detailed explanation of the EasyDataset toolchain
   - Automated generation of Q&A datasets from maritime literature PDFs
   - Full-process configuration for text chunking, question generation, and answer construction
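To make the data-engineering steps concrete, here is a minimal pure-Python sketch of what the pipeline produces: raw manual text is cut into overlapping chunks, and each generated Q&A pair is wrapped in an Alpaca-style record that LLaMA Factory can train on. The chunk sizes and field layout are illustrative, not EasyDataset's internals:

```python
# Sketch of the pipeline's two data shapes: text chunks and Alpaca-style
# supervised fine-tuning records. Sizes and fields are illustrative.
import json

def chunk_text(text, max_chars=500, overlap=50):
    """Split text into overlapping character windows."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

def to_alpaca_record(question, answer):
    """One supervised fine-tuning example in Alpaca format."""
    return {"instruction": question, "input": "", "output": answer}

manual = "Propeller blade angle should match engine load. " * 40
chunks = chunk_text(manual)
record = to_alpaca_record(
    "How to adjust the propeller blade angle?",            # question generated per chunk
    "Match the blade pitch to the engine's rated load...", # constructed answer
)
print(len(chunks), json.dumps(record, ensure_ascii=False)[:60])
```

In practice EasyDataset handles PDF parsing, chunking, and LLM-driven question/answer generation through its UI; the JSON records above are what you export for the fine-tuning step.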
3. Domain Fine-tuning with LLaMA Factory
   - Using the LLaMA Factory visual training platform
   - Configuring key training parameters
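The key parameters map onto a LLaMA Factory training config. The key names below follow LLaMA Factory's published example configs; the values, paths, dataset name, and chat template are illustrative assumptions that must be adapted to your checkpoint and dataset registration:

```yaml
# Sketch of a LoRA SFT config (e.g. oceangpt_lora.yaml); values illustrative.
model_name_or_path: ./OceanGPT          # local model dir from the download step
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
dataset: oceangpt_manual_qa             # assumed name registered in data/dataset_info.json
template: qwen                          # must match the base model's chat template
cutoff_len: 1024
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 3.0
bf16: true
output_dir: saves/oceangpt-8b-lora
```

Launch training with `llamafactory-cli train oceangpt_lora.yaml`, or set the same fields interactively in the `llamafactory-cli webui` visual interface.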
4. Building a Web Application
   - Combining LangChain + Streamlit
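A minimal sketch of the glue logic such an app needs, assuming the fine-tuned model is served behind a local endpoint. `build_prompt` is a hypothetical helper, and the Streamlit/LangChain wiring is indicated in comments because it requires the running model:

```python
# Sketch of the chat glue logic: fold conversation history and retrieved
# manual excerpts into one prompt for the fine-tuned model.
def build_prompt(history, context, question):
    """Assemble a prompt from chat history, retrieved context, and the new question."""
    turns = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    return (
        "Answer using the maritime manual excerpts below.\n"
        f"Excerpts:\n{context}\n\n{turns}\nUser: {question}\nAssistant:"
    )

# Streamlit side (run with `streamlit run app.py`):
#   import streamlit as st
#   st.title("OceanGPT Manual Q&A")
#   question = st.chat_input("Ask about the manual")
#   prompt = build_prompt(st.session_state.get("history", []), context, question)
#   ... send the prompt to the locally served OceanGPT (e.g. via LangChain
#   against an OpenAI-compatible endpoint) and display the reply ...

demo = build_prompt([("Hi", "Hello!")], "Blade pitch: 20 degrees.", "What pitch?")
print(demo)
```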
5. Usage and Effect Validation
   - Before-and-after comparison tests on typical cases
This guide provides a practical engineering solution to help you quickly build a professional Q&A system for the maritime domain. Fine-tuning requires only 22GB of VRAM, so a single NVIDIA GeForce RTX 3090 (24GB) is enough. For deployment, int4 quantization reduces VRAM usage to about 8GB.
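The VRAM figures quoted above can be sanity-checked with back-of-envelope arithmetic on the 8B parameter count; this is a rough lower bound, since real usage adds KV cache, activations, and framework overhead:

```python
# Back-of-envelope VRAM estimate for an 8B-parameter model. Rough lower
# bounds only: KV cache, activations, and framework overhead come on top.
PARAMS = 8e9

def weight_gb(bytes_per_param):
    """Memory for the weights alone, in GiB."""
    return PARAMS * bytes_per_param / 1024**3

fp16 = weight_gb(2)    # 16-bit weights
int4 = weight_gb(0.5)  # 4-bit quantized weights
print(f"fp16 weights: ~{fp16:.1f} GB")  # ~14.9 GB; + LoRA optimizer state and activations -> ~22 GB to fine-tune
print(f"int4 weights: ~{int4:.1f} GB")  # ~3.7 GB; + cache and overhead -> ~8 GB to serve
```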
Quick Start