Document Classification (CNN)

frances._.sb 2022. 3. 26. 15:40

For example code related to document classification, please see here.

[2014] Yoon Kim

Input : n by k matrix representation of sentence

- n : the number of words in a sentence (parameter)

▷ Short sentences are zero-padded and long sentences are trimmed.

- k : word embedding dimensions

▷ Uses pre-trained word embedding vectors.

▷ The vectors are either static (not updated during training) or non-static (fine-tuned).

- Multi-channel input is possible. → The channels are stacked into a single tensor.
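As a rough illustration of the input step, the sketch below builds the n by k matrix for one sentence by trimming or zero-padding it to n words. The function name `sentence_to_matrix`, the toy embedding table, and the choice to map unknown words to a zero vector are assumptions made for this example only.

```python
def sentence_to_matrix(tokens, embeddings, n, k):
    """Return an n-by-k matrix (list of rows) for one tokenized sentence."""
    zero = [0.0] * k
    # Trim to at most n words; unknown words become zero vectors (assumption).
    rows = [list(embeddings.get(tok, zero)) for tok in tokens[:n]]
    # Zero-pad short sentences up to n rows.
    rows += [list(zero) for _ in range(n - len(rows))]
    return rows


# Toy 3-dimensional "pre-trained" vectors, made up for illustration.
toy_embeddings = {"good": [0.9, 0.1, 0.0], "movie": [0.2, 0.7, 0.1]}
matrix = sentence_to_matrix(["good", "movie"], toy_embeddings, n=4, k=3)
```

Here `matrix` has 4 rows: two word vectors followed by two rows of zeros.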

Convolution

- different size of convolutions can be used

▷ Square convolutions are common in image processing, but rectangular convolutions whose width equals k (the embedding dimension) are commonly used in text processing.

▷ A convolution stride of 1 is commonly used.

▷ The larger the convolution, the more words it can consider at once.
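To make the rectangular convolution concrete, here is a minimal sketch that slides an h by k kernel over an n by k sentence matrix with stride 1, so that each output value looks at h consecutive words. The function name `conv_text` is hypothetical, and the activation function is left out to keep the example short.

```python
def conv_text(matrix, kernel, bias=0.0):
    """Slide an h-by-k kernel over an n-by-k matrix with stride 1.

    Returns a feature map with n - h + 1 values.
    """
    n, h = len(matrix), len(kernel)
    features = []
    for i in range(n - h + 1):
        total = bias
        # The kernel spans the full embedding width k, so it only moves
        # along the word axis.
        for row, kernel_row in zip(matrix[i:i + h], kernel):
            total += sum(x * w for x, w in zip(row, kernel_row))
        features.append(total)
    return features


sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # n = 4, k = 2
kernel = [[1.0, 0.0], [0.0, 1.0]]                            # h = 2
feature_map = conv_text(sentence, kernel)  # [2.0, 1.0, 1.0]
```

A kernel with h = 2 yields 4 - 2 + 1 = 3 values; a larger h would cover more words per value and yield a shorter feature map.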

max pooling

- Converts a vector into a single scalar value.

- After the convolution is applied to a sentence, picking out only the single most salient part of the document (or sentence) is more effective for positive / negative classification than average pooling.
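The difference between the two pooling choices can be sketched with a toy feature map. The helper names and the numbers are made up for illustration.

```python
def max_over_time(feature_map):
    """Keep only the strongest response in the feature map."""
    return max(feature_map)


def average_over_time(feature_map):
    """Average all responses, which dilutes a single strong one."""
    return sum(feature_map) / len(feature_map)


# One strong response (2.5) among weak ones.
feature_map = [0.25, 0.0, 2.5, 0.25]
pooled_max = max_over_time(feature_map)      # 2.5
pooled_avg = average_over_time(feature_map)  # 0.75

# With several feature maps, pooling gives one scalar per map.
pooled_vector = [max_over_time(fm) for fm in [feature_map, [1.0, 0.5]]]
```

Max pooling keeps the 2.5 response intact, while averaging shrinks it to 0.75 because the weak positions pull it down.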

fully-connected operation

- connected to two output nodes (pos / neg)

- Dropout was used only on the connection from the max-pooling output to the fully-connected layer.

▷ filter windows = 3, 4, 5 with 100 feature maps each

▷ dropout (rate = 0.5)

▷ $L_2$-norm constraint on the weight vectors (s = 3)

▷ mini-batch size = 50
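As a sketch of what the $L_2$ setting of 3 means in practice: a weight vector whose $L_2$ norm exceeds the threshold is rescaled back to it. The function name `clip_l2_norm` is hypothetical, and applying the rescaling after each gradient step is an assumption of this example.

```python
import math


def clip_l2_norm(weights, s=3.0):
    """Rescale weights so that their L2 norm is at most s."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm > s:
        scale = s / norm
        return [w * scale for w in weights]
    return list(weights)


# Three window sizes with 100 feature maps each give 300 pooled features.
num_features = len([3, 4, 5]) * 100

clipped = clip_l2_norm([3.0, 4.0])    # norm 5 is rescaled to norm 3
unchanged = clip_l2_norm([1.0, 2.0])  # norm below 3, left as-is
```

The direction of the weight vector is preserved; only its length is capped.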

conclusion

- Even a simple model does well with only the static method; if the pre-trained vectors are good enough, they can act as universal feature extractors.

- Fine-tuning gives a small additional improvement.

- Multi-channel does not perform as well as expected.

- Dropout consistently adds about 2-4% relative performance.

[2015] character-level CNN

- Uses 70 characters (26 English letters + 10 digits + 33 other characters + the newline character)
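A minimal sketch of how text can be quantized with such a character set: each character is looked up in the alphabet and becomes a one-hot column. The alphabet below is an illustrative 36-symbol subset (lowercase letters and digits), not the full 70-character set, and the names `ALPHABET`, `CHAR_TO_ROW`, and `quantize` are hypothetical.

```python
import string

# Illustrative subset only: 26 lowercase letters + 10 digits.
ALPHABET = string.ascii_lowercase + string.digits
CHAR_TO_ROW = {ch: i for i, ch in enumerate(ALPHABET)}


def quantize(text, length):
    """Return a len(ALPHABET)-by-length one-hot matrix (list of rows)."""
    matrix = [[0] * length for _ in ALPHABET]
    for col, ch in enumerate(text.lower()[:length]):
        row = CHAR_TO_ROW.get(ch)
        if row is not None:  # characters outside the alphabet stay all-zero
            matrix[row][col] = 1
    return matrix


encoded = quantize("ab1", length=5)
```

Each used column contains exactly one 1, and the two unused columns at the end stay all zeros, which acts as padding.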

Network structure

- input : 70 by 1014 matrix (70 characters by 1014 character positions); the large model uses 1024 feature maps per convolutional layer and the small model uses 256

▷ Each column is a one-hot vector with a 1 in the row matching the character (not a distributed representation).

- convolution

▷ first : 70 x 7

▷ second : 1024 x 7

▷ third : 1024 x 3

- max pooling

▷ (window size) 1 by 3 with stride = 3

- fully-connected layer

▷ The number of output nodes depends on the problem.
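To see how the kernel widths and the pooling window shrink the sequence, the sketch below tracks the length along the character axis. It assumes an input length of 1014, stride-1 convolutions without padding, and a pooling step after each of the first two convolutions; these are assumptions of the example rather than details stated in the list above, and the helper names are hypothetical.

```python
def conv_out_length(length, kernel):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel + 1


def pool_out_length(length, window=3, stride=3):
    """Output length of max pooling with the given window and stride."""
    return (length - window) // stride + 1


length = 1014                                         # assumed input length
length = pool_out_length(conv_out_length(length, 7))  # first conv (width 7) + pool
length = pool_out_length(conv_out_length(length, 7))  # second conv (width 7) + pool
length = conv_out_length(length, 3)                   # third conv (width 3)
```

Because the pooling window and stride are both 3, the windows do not overlap and each pooling step cuts the length to a third.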

- data augmentation using a thesaurus

▷ When a word has synonyms or near-synonyms, replace it with a synonym.
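Thesaurus-based augmentation can be sketched as follows. The toy thesaurus, the function name `augment`, and the fixed per-word replacement probability are simplifying assumptions for this example, not the exact sampling scheme of the original work.

```python
import random


def augment(tokens, thesaurus, replace_prob=0.5, rng=None):
    """Return a copy of tokens with some words swapped for synonyms."""
    if rng is None:
        rng = random.Random()
    augmented_tokens = []
    for tok in tokens:
        synonyms = thesaurus.get(tok)
        # Replace only words that have at least one synonym listed.
        if synonyms and rng.random() < replace_prob:
            augmented_tokens.append(rng.choice(synonyms))
        else:
            augmented_tokens.append(tok)
    return augmented_tokens


# Made-up thesaurus for illustration.
toy_thesaurus = {"good": ["great", "fine"], "film": ["movie"]}
augmented = augment(["a", "good", "film"], toy_thesaurus,
                    replace_prob=1.0, rng=random.Random(0))
```

With `replace_prob=1.0` every word that has an entry in the thesaurus is replaced, and words without an entry (here "a") are kept unchanged.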
