[Text Mining - Metric Computation] Computing TF-IDF and Total Document Counts

Import packages

import pandas as pd
import numpy as np
import re
import sys
from tqdm.notebook import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer

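Load the data

# "저장경로/파일명.txt" is a placeholder ("save path/file name"); point it at your own tab-separated file
df = pd.read_csv("저장경로/파일명.txt", sep='\t', encoding='UTF-8')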
# print(df.shape, df.columns)
corpus = df['morphs'].to_list()
print(len(corpus), type(corpus[0]))
df.head()
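
TfidfVectorizer expects every document to be a string, so an empty `morphs` cell (which pandas reads as a missing value) would typically raise an error at fit time. Below is a minimal sketch of a defensive cleanup, assuming `morphs` holds space-separated morphemes. The toy frame `sample_df` and the name `clean_corpus` are hypothetical and only stand in for the real file.

```python
import pandas as pd

# Hypothetical stand-in for the real file; 'morphs' holds space-separated morphemes.
sample_df = pd.DataFrame({"morphs": ["산림 보호 정책", None, "산림 휴양 시설"]})

# Replace missing values with empty strings and force every entry to str.
clean_corpus = sample_df["morphs"].fillna("").astype(str).to_list()

print(len(clean_corpus), type(clean_corpus[0]))
```

An empty string simply yields an all-zero row in the TF-IDF matrix, so the row order still matches the original DataFrame.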

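Load the stopword data

# Each file holds one word list; the file names below are placeholders
대명사 = pd.read_csv("저장경로/한국어 대명사 목록.txt", encoding='UTF-8')  # Korean pronouns
부사 = pd.read_csv("저장경로/한국어 부사 목록.txt", encoding='UTF-8')  # Korean adverbs
조사 = pd.read_csv("저장경로/한국어 조사 목록.txt", encoding='UTF-8')  # Korean particles
삭제단어 = pd.read_csv("저장경로/삭제 단어.txt", sep='\t', encoding='UTF-8')  # words to delete
stopwords = pd.read_csv("저장경로/불용어 목록.txt", encoding='UTF-8')  # general stopword list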

print(대명사.shape, 부사.shape, 조사.shape, 삭제단어.shape, stopwords.shape)
print(대명사.columns, 부사.columns, 조사.columns, 삭제단어.columns, stopwords.columns)

Combine the lists into a single object

stop_list = list(대명사['대명사']) + list(부사['부사']) + list(조사['조사']) + list(삭제단어['삭제단어']) + list(stopwords['불용어'])
print(len(stop_list))
Stopwords = sorted(list(set(stop_list)), reverse=False)
print(len(Stopwords), Stopwords[:10])

Compute TF-IDF

%%time
tfidf = TfidfVectorizer(stop_words = Stopwords, token_pattern=r'\w{1,}')
tdm = tfidf.fit_transform(corpus)

Convert tdm to a NumPy array and then to a DataFrame

tfidf_array = tdm.toarray()
tfidf_DF = pd.DataFrame(tfidf_array)
tfidf_DF

# Extract the list of TF-IDF terms
featurenames = tfidf.get_feature_names_out()
tfidf_DF.columns = featurenames

# Check the TF-IDF result
tfidf_DF

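#### Check where the Korean terms start ####
# 숫자 ("number") is a placeholder: replace it with the index of the first Korean feature name
print(featurenames[숫자])
tfidf_DF2 = tfidf_DF.iloc[:,숫자:]
tfidf_DF2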
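
Instead of finding the starting index by eye, it can be looked up programmatically. The sketch below assumes the feature names are sorted (digits and Latin letters sort before Hangul) and that "Korean" means a name starting with a Hangul syllable or jamo. The list `sample_features` and the helper `first_hangul_index` are hypothetical.

```python
import re

# Hypothetical, already-sorted feature names standing in for tfidf.get_feature_names_out()
sample_features = ["10", "2023", "ai", "gis", "가구", "산림", "휴양"]

# Matches names whose first character is a Hangul jamo or syllable.
hangul_start = re.compile(r'^[ㄱ-ㅎㅏ-ㅣ가-힣]')

def first_hangul_index(names):
    """Return the index of the first name that starts with Hangul, or None."""
    for i, name in enumerate(names):
        if hangul_start.match(name):
            return i
    return None

start = first_hangul_index(sample_features)
print(start, sample_features[start])
```

With the real data, the returned index would take the place of the placeholder in `tfidf_DF.iloc[:, start:]`.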

# Sum of TF-IDF values for each word ('단어' = word, 'TF-IDF 값 합계' = sum of TF-IDF values)
word_count = pd.DataFrame({'단어': tfidf.get_feature_names_out(), 'TF-IDF 값 합계': np.asarray(tdm.sum(axis=0)).ravel()})

# Check the top 10 (sort by the column created above)
word_count.sort_values('TF-IDF 값 합계', ascending=False, inplace=True)
word_count.head(10)

# Save the word / TF-IDF sums to files
word_count.to_csv("저장경로/TF-IDF합.txt", index=False, sep='\t', encoding='UTF-8')
# to_excel no longer accepts an encoding argument in recent pandas versions
word_count.to_excel("저장경로/TF-IDF합.xlsx", index=False)

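#### Count the rows (documents) behind each word's TF-IDF sum ####
# Note: from here on, tfidf refers to this DataFrame, not to the vectorizer fitted earlier
tfidf = pd.read_csv("저장경로/파일명.txt",sep='\t')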
tfidf_word = list(tfidf['단어'])
print(tfidf.shape, len(tfidf_word))

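# User-defined function that counts, for each word, the documents with a nonzero TF-IDF value
def count_tfidf(df, wordlist):
    # Relies on the global tfidf_DF; words that are not among its columns
    # (for example, values read_csv parsed as NaN) get '-'
    tfidf_count = [(tfidf_DF[word] > 0).sum() if word in tfidf_DF.columns else '-' for word in tqdm(wordlist)]
    df['count'] = tfidf_count
    return df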

# Run the function
count_tfidf(tfidf, tfidf_word)

Save the files

# Save tfidf (the DataFrame that received the 'count' column), not the original df
tfidf.to_csv("저장경로/TF-IDF합계_개수.txt", index=False, sep='\t', encoding='UTF-8')
tfidf.to_excel("저장경로/TF-IDF합계_개수.xlsx", index=False)
