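What is InstructGPT? Differences between ChatGPT and InstructGPT

The predecessor of ChatGPT: InstructGPT

- Training language models to follow instructions with human feedback, OpenAI, March 2022

- The GPT-3 paper was released in May 2020; the InstructGPT paper was released in March 2022

- Making a language model larger does not inherently make it better at following the user's intent. For example, it can generate outputs that are untruthful, toxic, or otherwise unhelpful to the user

- This paper shows a way to align language models with user intent across a wide range of tasks by fine-tuning with human feedback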

InstructGPT

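- Collect a dataset of labeler demonstrations, built from prompts submitted through the OpenAI API and labeler-written prompts, and use it to fine-tune GPT-3 with supervised learning

- Then collect a dataset of rankings of model outputs and use it to further fine-tune the supervised model with reinforcement learning from human feedback

- Concretely, GPT-3 is fine-tuned using reinforcement learning from human feedback (RLHF; Christiano et al., 2017; Stiennon et al., 2020)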

Step 1

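- For prompts submitted through the OpenAI API and labeler-written prompts, collect a dataset of human-written demonstrations of the desired output behavior, and train on it with supervised learning

e.g. prompt: Explain the moon landing to a 6 year old -> desired output: Some people went to the moon

As above, a person writes the desired output behavior for a given prompt; the model trained on these demonstrations with supervised learning is the baseline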
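The supervised step above can be sketched as a token-level negative log-likelihood on the labeler's demonstration. This is only a toy illustration with made-up probabilities, not the actual training code: the function name `sft_loss`, the tiny 3-token vocabulary, and the choice to exclude prompt positions from the loss are all assumptions of this sketch.

```python
import math

def sft_loss(token_probs, target_ids, prompt_len):
    """Mean negative log-likelihood of the demonstration tokens.

    token_probs[t] is a (hypothetical) model probability distribution over
    the vocabulary at position t; target_ids[t] is the token actually found
    at that position. Positions before prompt_len belong to the prompt and
    are left out of the loss (an assumption of this sketch).
    """
    losses = [
        -math.log(token_probs[t][target_ids[t]])
        for t in range(prompt_len, len(target_ids))
    ]
    return sum(losses) / len(losses)

# Toy sequence: 1 prompt token followed by 2 labeler-written tokens.
probs = [
    [0.2, 0.5, 0.3],    # prompt position, ignored by the loss
    [0.1, 0.8, 0.1],    # model puts 0.8 on the labeler's token (id 1)
    [0.25, 0.25, 0.5],  # model puts 0.5 on the labeler's token (id 2)
]
targets = [0, 1, 2]
loss = sft_loss(probs, targets, prompt_len=1)
print(f"SFT loss: {loss:.4f}")
```

Minimizing this loss pushes the model to assign higher probability to the output the labeler wrote for the prompt.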

Step 2

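- Collect a dataset of human-labeled comparisons between model outputs

- Train a reward model (RM) on this dataset to predict which model output labelers prefer

- In other words, several model outputs are sampled for a given prompt, labelers rank them from best to worst, and the RM learns from these rankings which output is relatively better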
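One way to make the ranking step concrete is the pairwise loss used for reward models: every pair of outputs in a ranking becomes a (preferred, rejected) comparison, and the RM is penalized by -log sigmoid(r_preferred - r_rejected). The sketch below uses plain Python with hand-picked reward values; the helper names `ranking_to_pairs` and `pairwise_rm_loss` and the example scores are hypothetical, and a real RM would be a neural network producing the scores.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations keeps input order, so the first item of each pair
    # is always the one the labeler ranked higher.
    return list(combinations(ranked_outputs, 2))

def pairwise_rm_loss(rewards, ranked_outputs):
    """Average of -log(sigmoid(r_preferred - r_rejected)) over all pairs."""
    pairs = ranking_to_pairs(ranked_outputs)
    total = 0.0
    for preferred, rejected in pairs:
        diff = rewards[preferred] - rewards[rejected]
        total += -math.log(1.0 / (1.0 + math.exp(-diff)))
    return total / len(pairs)

# A labeler ranked three sampled outputs for one prompt: A > B > C.
ranking = ["A", "B", "C"]
agrees = {"A": 2.0, "B": 1.0, "C": 0.0}     # RM scores match the ranking
disagrees = {"A": 0.0, "B": 1.0, "C": 2.0}  # RM scores invert the ranking

print(pairwise_rm_loss(agrees, ranking))
print(pairwise_rm_loss(disagrees, ranking))
```

The loss is lower when the RM's scores agree with the labeler's ranking, which is what lets the RM learn relative preferences rather than absolute scores.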

Step 3

- Use the RM as the reward function and fine-tune the supervised learning baseline with the PPO algorithm to maximize the reward
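As a rough illustration of what PPO optimizes, the sketch below computes the clipped surrogate objective from the PPO paper for a handful of sampled actions. It is a minimal sketch, not the InstructGPT training loop: the function name `ppo_clip_objective`, the value `clip_eps=0.2`, and the hand-written advantages (which in RLHF would be derived from the RM's reward) are assumptions for illustration.

```python
import math

def ppo_clip_objective(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped surrogate objective: mean of min(r * A, clip(r) * A).

    r is the probability ratio between the updated policy and the policy
    that generated the samples. Clipping r to [1 - eps, 1 + eps] removes
    the incentive to move the policy too far in a single update.
    """
    terms = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

# Unchanged policy (ratio = 1): the objective is just the advantage.
print(ppo_clip_objective([0.0], [0.0], [1.5]))

# The new policy doubles the probability of a good action (ratio = 2),
# but the objective only credits it up to ratio = 1 + clip_eps.
print(ppo_clip_objective([math.log(2.0)], [0.0], [1.0]))
```

The policy is updated to increase this objective, so outputs that the RM scores highly become more likely while each update stays close to the previous policy.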

What is PPO?

https://seokhee0516.tistory.com/112

Proximal Policy Optimization Algorithms: https://arxiv.org/abs/1707.06347

Differences between ChatGPT and InstructGPT

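1. The ChatGPT diagram states explicitly in Step 3 that the PPO model is initialized from the supervised policy. The InstructGPT diagram does not spell this out, but the InstructGPT paper also fine-tunes the supervised (SFT) model with PPO, so the policy is not randomly initialized in either case.

2. The data collection for fine-tuning is slightly different

- Labelers (AI trainers) provide both sides of the conversation, the user and the AI assistant

- To create comparison data, conversations that AI trainers had with the chatbot are collected, a model-written message is selected, several alternative completions are sampled, and AI trainers rank them

ChatGPT process (the Step 3 diagram adds the note that the PPO model is initialized from the supervised policy)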

Aligning language models to follow instructions: https://openai.com/research/instruction-following

Introducing ChatGPT: https://openai.com/blog/chatgpt


Seokhee Jeong
