๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

728x90

pandas

1/11 ์ˆ˜ 1. tqdm in list comprehension / in pandas import pandas as pd import numpy as np from tqdm import tqdm # from tqdm.notebook import tqdm # Ver. 1 For Jupyter Notebook # from tqdm.auto import tqdm # Ver. 2 For Jupyter Notebook def process(token): return token['text'] l1 = [{'text': k} for k in range(5000)] l2 = [process(token) for token in tqdm(l1)] # tqdm in list comprehension #------------------.. ๋”๋ณด๊ธฐ
4/6 ์ˆ˜ ์ˆ˜์š”์ผ! ์˜ค๋Š˜์€ Multinomial Classification์„ ๋ฐฐ์šด๋‹ค. Linear Regression(์—ฐ์†์ ์ธ ์ˆซ์ž ๊ฐ’ ์˜ˆ์ธก)์ด ๋ฐœ์ „ํ•œ ๊ฒƒ์ด Logistic Regression → Classification(๋ถ„๋ฅ˜๋ฅผ ํŒ๋‹จํ•˜๋Š” ์˜ˆ์ธก) - Binary Classification(์ดํ•ญ๋ถ„๋ฅ˜) - Multinomial Classification(๋‹คํ•ญ๋ถ„๋ฅ˜) Logistic Regression์€ ์ด์ง„ ๋ถ„๋ฅ˜์— ํŠนํ™”๋จ SKlearn์ด ์ œ๊ณตํ•˜๋Š” ๋ถ„๋ฅ˜๊ธฐ์ธ Gradient Descent(๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•)๊ฐ€ ๋ฐœ์ „ํ•œ ํ˜•ํƒœ์ธ SGD Classifier(Stochastic Gradient Descent, ํ™•๋ฅ ์  ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•) 1. Binary Classification - ์œ„์Šค์ฝ˜์‹  ์œ ๋ฐฉ์•” ๋ฐ์ดํ„ฐ by Gradient Descent Cl.. ๋”๋ณด๊ธฐ
3/30 ์ˆ˜ ์ˆ˜์š”์ผ! ์–ด์ œ ์‚ฌ์šฉํ•œ Ozone data๋ฅผ Python๊ณผ Sklearn์œผ๋กœ Simple Linear Regression(๋‹จ์ˆœ ์„ ํ˜• ํšŒ๊ท€)์„ ๊ตฌํ˜„ํ–ˆ์„ ๋•Œ, ์™œ ๋ชจ์–‘์ด ๋‹ค๋ฅธ์ง€ ์•Œ์•„๋ณด์ž~ ์ด์œ  1. Missing Value(๊ฒฐ์น˜๊ฐ’) ์ฒ˜๋ฆฌ - ์‚ญ์ œ : ์ „์ฒด ๋ฐ์ดํ„ฐ๊ฐ€ 100๋งŒ ๊ฑด ์ด์ƒ์ด๋ฉฐ ๊ฒฐ์น˜๊ฐ’์ด 5% ์ด๋‚ด์ผ ๋•Œ - ๋Œ€์ฒด : ๋Œ€ํ‘œ๊ฐ’์œผ๋กœ ๋Œ€์ฒด(ํ‰๊ท , ์ค‘์œ„, ์ตœ๋Œ€, ์ตœ์†Œ, ์ตœ๋นˆ) ํ˜น์€ ๋จธ์‹ ๋Ÿฌ๋‹ ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉ(๋” ์ข‹์€ ๋ฐฉ์‹! ๊ฒฐ์น˜๊ฐ’์ด ์ข…์†๋ณ€์ˆ˜์ผ ๋•Œ) ์ด์œ  2. ์ด์ƒ์น˜ ์ฒ˜๋ฆฌ ์ด์ƒ์น˜๋Š” ๊ฐ’์ด ์ผ๋ฐ˜์ ์ธ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋น„ํ•ด ํŽธ์ฐจ๊ฐ€ ํฐ ๋ฐ์ดํ„ฐ์ด๊ธฐ ๋•Œ๋ฌธ์— ํ‰๊ท , ๋ถ„์‚ฐ์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์นจ → ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ๋‹นํžˆ ๋ถˆ์•ˆํ•˜๊ฒŒ ๋งŒ๋“œ๋Š” ์š”์†Œ - ์ง€๋Œ€๊ฐ’ : ๋…๋ฆฝ๋ณ€์ˆ˜(์›์ธ)์— ์žˆ๋Š” ์ด์ƒ์น˜ - Outlier : ์ข…์†๋ณ€์ˆ˜(๊ฒฐ๊ณผ)์— ์žˆ๋Š” ์ด์ƒ์น˜ 1. ์ด์ƒ์น˜.. ๋”๋ณด๊ธฐ
3/29 ํ™” ํ™”์š”์ผ! ์˜ค๋Š˜์€ ์–ด์ œ ๋ฐฐ์šด Simple Linear Regression(๋‹จ์ˆœ ์„ ํ˜• ํšŒ๊ท€)์„ ์ฝ”๋“œ๋กœ ๊ตฌํ˜„ํ•œ๋‹ค. 1. Training Data Set ์ค€๋น„ : Data pre-processing(๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ). ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ํ˜•ํƒœ๋กœ ์ค€๋น„ 2. Linear Regression Model์„ ์ •์˜ : y = Wx+b(์˜ˆ์ธก ๋ชจ๋ธ). hypothesis(๊ฐ€์„ค) 3. ์ตœ์ ์˜ W(weight, ๊ฐ€์ค‘์น˜), b(bias, ํŽธ์ฐจ)๋ฅผ ๊ตฌํ•˜๋ ค๋ฉด loss function(์†์‹คํ•จ์ˆ˜)/cost function(๋น„์šฉํ•จ์ˆ˜) → MSE 4. Gradient Descent Algorithm(๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•) : loss function์„ ํŽธ๋ฏธ๋ถ„(W, b) × learning rate 5. ๋ฐ˜๋ณตํ•™์Šต ์ง„ํ–‰ 1. Training Dat.. ๋”๋ณด๊ธฐ
3/28 ์›” ์›”์š”์ผ! ๊ธˆ์š”์ผ์— ์ด์–ด ๋จธ์‹ ๋Ÿฌ๋‹ ๋“ค์–ด๊ฐ„๋‹ค~ Weak AI์˜ ๋จธ์‹ ๋Ÿฌ๋‹ ๊ธฐ๋ฒ•๋“ค : ์ง€๋„ ํ•™์Šต, ๋น„์ง€๋„ ํ•™์Šต, ๊ฐ•ํ™” ํ•™์Šต 1. Regression(ํšŒ๊ท€) : ๋ฐ์ดํ„ฐ์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ์กฐ๊ฑด๋“ค์˜ ์˜ํ–ฅ๋ ฅ์„ ๊ณ ๋ คํ•ด์„œ, ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์กฐ๊ฑด๋ถ€ ํ‰๊ท ์„ ๊ตฌํ•˜๋Š” ๊ธฐ๋ฒ• * ํ‰๊ท ์„ ๊ตฌํ•  ๋•Œ ์ฃผ์˜ํ•ด์•ผ ํ•  ์  : ํ‰๊ท ์„ ๊ตฌํ•˜๋Š” ๋ฐ์ดํ„ฐ์— ์ด์ƒ์น˜๊ฐ€ ์žˆ์„ ๊ฒฝ์šฐ ๋Œ€ํ‘œ๊ฐ’์œผ๋กœ ์‚ฌ์šฉํ•˜๊ธฐ ์–ด๋ ค์›€. ์ •๊ทœ๋ถ„ํฌ์—ฌ์•ผ ํ•จ! ๊ณ ์ „์  ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ(Classical Linear Regression Model) ๋‹จ์ˆœ ์„ ํ˜• ํšŒ๊ท€(Simple Linear Regression) import numpy as np import pandas as pd import matplotlib.pyplot as plt df = pd.DataFrame({'๊ณต๋ถ€์‹œ๊ฐ„(x)': [1,2,3.. ๋”๋ณด๊ธฐ
3/23 ์ˆ˜ ์ˆ˜์š”์ผ! ์˜ค๋Š˜์€ ๊ธฐ์ˆ ํ†ต๊ณ„๋ฅผ ๋ฐฐ์šด๋‹ค. 1์ฐจ์› ๋ฐ์ดํ„ฐ์˜ ํŠน์ง• ํŒŒ์•… - ์ˆ˜์น˜์ง€ํ‘œ → ๋Œ€ํ‘œ๊ฐ’ : ํ‰๊ท , ์ค‘์œ„๊ฐ’, ์ตœ๋Œ€/์ตœ์†Œ๊ฐ’, ํŽธ์ฐจ, ๋ถ„์‚ฐ, ํ‘œ์ค€ํŽธ์ฐจ... - ์‹œ์ž‘์  ํ‘œํ˜„ → ๋„์ˆ˜๋ถ„ํฌํ‘œ, Histogram, Box plot * ์ตœ๋Œ€/์ตœ์†Œ๊ฐ’์€ ๋Œ€ํ‘œ๊ฐ’์œผ๋กœ ์‚ฌ์šฉํ•˜๊ธฐ์— ๋ฌด๋ฆฌ๊ฐ€ ์žˆ์Œ 2์ฐจ์› ๋ฐ์ดํ„ฐ์˜ ํŠน์ง• ํŒŒ์•… - ์ˆ˜์น˜์ง€ํ‘œ → ๊ณต๋ถ„์‚ฐ, ์ƒ๊ด€๊ณ„์ˆ˜ - ์‹œ์ž‘์  ํ‘œํ˜„ → Scatter ์‚ฐํฌ๋„(dispersion) : ๋ฐ์ดํ„ฐ๊ฐ€ ์–ผ๋งˆ๋‚˜, ์–ด๋–ป๊ฒŒ ํผ์ ธ ์žˆ๋‚˜๊ฐ€ ๊ด€์  ๋ฐ์ดํ„ฐ๊ฐ€ ํฉ์–ด์ง„ ์ •๋„(๋ณ€์‚ฐ์„ฑ)๋ฅผ ์ˆ˜์น˜๋กœ ํ‘œํ˜„ํ•˜๊ณ  ์‹ถ๋‹ค๋ฉด → ๋ฒ”์œ„(range), ์‚ฌ๋ถ„์œ„ ๋ฒ”์œ„(IQR, Interquatile range), ํŽธ์ฐจ(deviation), ๋ถ„์‚ฐ(variance), ํ‘œ์ค€ํŽธ์ฐจ(standard deviation) 1์ฐจ์› ๋ฐ์ดํ„ฐ์˜ ์ˆ˜์น˜์ง€ํ‘œ → ํ‰๊ท , ์ค‘์œ„.. ๋”๋ณด๊ธฐ
3/22 ํ™” ํ™”์š”์ผ! ๋ฐ์ดํ„ฐ๋ฅผ ์‹œ๊ฐํ™”ํ•˜๋Š” ๋Œ€ํ‘œ์ ์ธ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์ธ Matplotlib์— ๋Œ€ํ•ด ๋ฐฐ์šด๋‹ค! Matplotlib ์•ˆ์— Pyplot์ด๋ผ๋Š” sub package๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. Line plot(์„  ๊ทธ๋ž˜ํ”„), Histogram(๋„์ˆ˜ํ‘œ), Scatter(์‚ฐ์ ๋„), Box plot, ๊ทธ ์™ธ Area plot, Bar chart(๋ง‰๋Œ€ ๊ทธ๋ž˜ํ”„) 1. Line plot(์„  ๊ทธ๋ž˜ํ”„) : ์—ฐ์†์ ์ธ ๊ฐ’์˜ ๊ฒฝํ–ฅ์„ ํŒŒ์•…ํ•  ๋•Œ ์ฃผ๋กœ ์‚ฌ์šฉ(์‹œ๊ณ„์—ด) import pandas as pd import matplotlib.pyplot as plt # 1. Line plot(์„  ๊ทธ๋ž˜ํ”„) plt.title('Line Plot') # plot์˜ ์ œ๋ชฉ์„ ์„ค์ • plt.plot([1, 5, 12, 25]) # x์ถ•์˜ ์ž๋ฃŒ ์œ„์น˜(x ์ถ• ๋ˆˆ๊ธˆ) -> tick์€.. ๋”๋ณด๊ธฐ
3/21 ์›” ์›”์š”์ผ! ์˜ค๋Š˜์€ Pandas์˜ DataFrame(DataFrame ์—ฐ๊ฒฐ · ๊ฒฐํ•ฉ, Mapping, Grouping)์„ ๋งˆ๋ฌด๋ฆฌ ์ง“๊ณ , ๋‚ด์ผ๋ถ€ํ„ฐ ๋ฐ์ดํ„ฐ์˜ ์‹œ๊ฐํ™”์— ๋Œ€ํ•ด ๋ฐฐ์šด๋‹ค. 1. DataFrame ์—ฐ๊ฒฐ : pd.concat(). default๋Š” ํ–‰ ๋ฐฉํ–ฅ์œผ๋กœ ์—ฐ๊ฒฐ. ์ปฌ๋Ÿผ ๋ช…์ด ๊ฐ™์€ ๊ฒƒ๋“ค์ด ์„œ๋กœ ๊ฒฐํ•ฉ๋จ import numpy as np import pandas as pd df1 = pd.DataFrame({'a':['a0', 'a1', 'a2', 'a3'], 'b':[1, 2, 3, 4], 'c':['c0', 'c1', 'c2', 'c3']}, index=[0, 1, 2, 3]) display(df1) df2 = pd.DataFrame({'b':[5, 6, 7, 8], 'c':['c0', 'c1', 'c2'.. ๋”๋ณด๊ธฐ
