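[20230723] Weekly AI ArXiv Talk Season 2 - Episode 23

Jung-Woo Ha has come down with COVID :worried::worried::worried:, so I will be leading today's session.

Weekly ArXiv Talk is a casual weekly discussion of the latest news and research trends, so not everything shared here is guaranteed to be accurate.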

# News

## LLaMA v2

![image](https://github.com/jungwoo-ha/WeeklyArxivTalk/assets/33523965/f60404e2-ceb7-40e2-ba97-ce6281529a22)

Paper: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models
Blog: https://ai.meta.com/llama/
GitHub: https://github.com/facebookresearch/llama/tree/main
License Decision: https://blog.opensource.org/metas-llama-2-license-is-not-open-source

![image](https://github.com/jungwoo-ha/WeeklyArxivTalk/assets/33523965/e0286cb3-a6df-47e2-858a-85e9fc13fa1c)

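Meta has released the successor to its LLaMA model. The model architecture and pre-training method are largely the same as in the original LLaMA, but the data composition and procedures such as RLHF have been substantially strengthened, and it is currently SOTA among open source models (its performance is still below GPT-4). In my view the most important points are that RLHF was carried out in five rounds and that separate models were used for safety and helpfulness.

The original LLaMA drew a lot of criticism because of the heavy restrictions on using its model checkpoints, whereas LLaMA v2 can also be used for commercial purposes. However, companies with more than 700 million monthly active users as of the model's release date cannot use it without a separate license from Meta, and it cannot be used to improve the performance of any language model other than LLaMA v2 or models derived from it, so it is not fully open source.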

## FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

PDF: https://tridao.me/publications/flash2/flash2.pdf
Blog: https://crfm.stanford.edu/2023/07/17/flash2.html
GitHub: https://github.com/Dao-AILab/flash-attention
YouTube: https://www.youtube.com/watch?v=IoMSGuiwV3g&ab_channel=AleksaGordi%C4%87-TheAIEpiphany

![image](https://github.com/jungwoo-ha/WeeklyArxivTalk/assets/33523965/b623d358-5fe9-4381-b4b3-19b12d8c7237)

![image](https://github.com/jungwoo-ha/WeeklyArxivTalk/assets/33523965/3e478639-843d-4e50-8da9-ec9cb84f1e60)

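A new update to Flash attention, which optimizes the QKV matrix and softmax operations of self attention for GPUs, has been released. It applies optimized operations built on the CUTLASS library, and the biggest difference from the previous version is that the Q, K, V chunking scheme was changed to improve memory efficiency. According to the author it is about 2x faster, and it appears to be in the process of being integrated into xFormers and PyTorch 2.1.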
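To make the chunking idea concrete, here is a minimal pure-Python sketch of block-wise attention with a running ("online") softmax, which is the general trick that lets Flash attention avoid materializing the full score matrix. It is only an illustration under simplifying assumptions: it is not the CUDA/CUTLASS kernel, it tiles only K and V for each query row, and it does not reproduce the specific work partitioning of FlashAttention-2. The function names and `block_size` are hypothetical.

```python
import math

def naive_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v, computing every full score row."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out

def blockwise_attention(q, k, v, block_size=2):
    """Same result, but K/V are visited one block at a time with a running softmax."""
    d = len(q[0])
    dv = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf   # largest score seen so far
        running_sum = 0.0         # softmax denominator, relative to running_max
        acc = [0.0] * dv          # unnormalized weighted sum of V rows
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                      for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale what was accumulated under the old maximum
            # (exp(-inf) is 0.0, so the first block starts from zero).
            scale = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * scale + sum(weights)
            acc = [a * scale + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                   for c, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out

if __name__ == "__main__":
    q = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
    k = [[0.2, 0.1], [-0.3, 0.4], [0.5, 0.5], [0.1, -0.2], [0.0, 0.3]]
    v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 1.0]]
    print(naive_attention(q, k, v))
    print(blockwise_attention(q, k, v, block_size=2))
```

Both functions should agree up to floating-point error. The block-wise version only ever holds one block of scores at a time, which is why the real kernels can keep their working set in fast on-chip memory.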

## How is ChatGPT's Behavior Changing Over Time?

ArXiv: https://arxiv.org/abs/2307.09009

![image](https://github.com/jungwoo-ha/WeeklyArxivTalk/assets/33523965/f504f755-b035-4eb1-8df3-756327fc4e03)

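It was already known that as language models such as ChatGPT are continually updated, their performance on particular tasks can get better or worse, and now a more systematic comparison has been published. According to the discussion on Twitter, some argue that the earlier performance can be retained if the prompts used in the comparison are adjusted appropriately. Still, from the standpoint of running a service, this confirms once again that whenever a model is updated to address new problems, performance on existing tasks needs to be re-validated regularly.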

# Research

## Retentive Network: A Successor to Transformer for Large Language Models

ArXiv: https://arxiv.org/abs/2307.08621

![image](https://github.com/jungwoo-ha/WeeklyArxivTalk/assets/33523965/d4c5da44-5c62-416b-b724-faa8d6a8a284)

![image](https://github.com/jungwoo-ha/WeeklyArxivTalk/assets/33523965/6f3d063a-ba42-4e75-8d32-b1e0a8f1168c)

![image](https://github.com/jungwoo-ha/WeeklyArxivTalk/assets/33523965/21d7a26f-1d52-411b-9c87-b7822ad1a91a)

![image](https://github.com/jungwoo-ha/WeeklyArxivTalk/assets/33523965/9ad7b396-fcf5-44e2-a857-1a1e1b475d08)

A new model architecture has been announced that aims to overcome the slow inference of the existing Transformer.

Retentive network replaces the existing self-attention with a mechanism called retention, which, like the xPos positional encoding, applies exponential decay according to relative distance. In addition, it can be written as a state representation, so a parallel representation and a recurrent representation coexist, and this property makes efficient inference possible.

One drawback is that the study has so far only been carried out on relatively small models, but I expect a lot of follow-up research soon. It is also notable that training was done on AMD MI200 GPUs: AMD GPUs still lag behind NVIDIA GPUs in performance, but we can see that they are being used for real training runs.
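The coexistence of the parallel and recurrent forms can be checked on a toy example. The sketch below is a simplified single-head retention in pure Python, assuming a scalar decay `gamma` and leaving out the xPos-style rotation, normalization, and gating used in the paper. The function names are hypothetical. The parallel form weights each query-key score by `gamma ** (n - m)` under a causal mask, while the recurrent form carries a fixed-size state from token to token.

```python
def retention_parallel(q, k, v, gamma):
    """o_n = sum over m <= n of gamma**(n - m) * (q_n . k_m) * v_m."""
    dv = len(v[0])
    out = []
    for n in range(len(q)):
        row = [0.0] * dv
        for m in range(n + 1):  # causal: only current and earlier tokens
            score = sum(a * b for a, b in zip(q[n], k[m])) * gamma ** (n - m)
            for c in range(dv):
                row[c] += score * v[m][c]
        out.append(row)
    return out

def retention_recurrent(q, k, v, gamma):
    """S_n = gamma * S_{n-1} + k_n^T v_n, then o_n = q_n S_n, one token at a time."""
    dk, dv = len(k[0]), len(v[0])
    state = [[0.0] * dv for _ in range(dk)]  # fixed size, independent of sequence length
    out = []
    for qn, kn, vn in zip(q, k, v):
        state = [[gamma * state[i][c] + kn[i] * vn[c] for c in range(dv)]
                 for i in range(dk)]
        out.append([sum(qn[i] * state[i][c] for i in range(dk)) for c in range(dv)])
    return out

if __name__ == "__main__":
    q = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.0, 1.0]]
    k = [[0.3, 0.1], [-0.2, 0.6], [0.7, 0.0], [0.2, -0.5]]
    v = [[1.0, 2.0, 0.0], [0.0, -1.0, 1.0], [0.5, 0.5, 0.5], [2.0, 0.0, -1.0]]
    print(retention_parallel(q, k, v, gamma=0.9))
    print(retention_recurrent(q, k, v, gamma=0.9))
```

The two outputs should match up to floating-point error. The recurrent form needs only the `dk x dv` state per step, which is the property that makes inference cheap compared with attending over a growing cache of past keys and values.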

## Teaching Arithmetic to Small Transformers

ArXiv: https://arxiv.org/abs/2307.03381

GitHub: https://github.com/lee-ny/teaching_arithmetic

![image](https://github.com/jungwoo-ha/WeeklyArxivTalk/assets/33523965/4eeb683d-42c5-454d-b841-b8b3b7001065)

![image](https://github.com/jungwoo-ha/WeeklyArxivTalk/assets/33523965/0996e82b-f084-49f3-984b-e0e0bba26402)

![image](https://github.com/jungwoo-ha/WeeklyArxivTalk/assets/33523965/4d61dd1f-ef44-4423-b2ef-7bc5181aee9e)

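This study shows that when teaching three-digit addition and subtraction to a (relatively) small Transformer model, performance improves dramatically if numbers are represented with the least significant digit first instead of the usual most-significant-digit-first order. The math ability of large language models has been used as an important benchmark so far, and this work shows that, at least for arithmetic problems, the notation and quality of the data can still have a very large impact, which once again confirms the importance of data in deep learning.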
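As a rough illustration of the two notations, the sketch below formats an addition sample with the answer written either in the usual order or with the least significant digit first. The helper name `format_addition` and the plain `a+b=c` layout are assumptions made for this example and may not match the exact delimiters used in the paper's data. The intuition is that with the low digit first, a left-to-right generator can emit each digit as soon as the carry into it is known.

```python
def format_addition(a, b, reverse_output=False):
    """Build one training string for a + b; optionally reverse the answer digits."""
    total = str(a + b)
    if reverse_output:
        total = total[::-1]  # least significant digit first
    return f"{a}+{b}={total}"

def read_reversed_answer(sample):
    """Recover the integer answer from a sample written with reversed digits."""
    return int(sample.split("=")[1][::-1])

if __name__ == "__main__":
    print(format_addition(128, 367))                       # 128+367=495
    print(format_addition(128, 367, reverse_output=True))  # 128+367=594
```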
