10 minutes to pandas 3

Reference: Data Girls

String Methods

import pandas as pd
import numpy as np
s = pd.Series(['A','B','C','Adba','Baca',np.nan,'CABA','dog','cat'])
s.str.lower()
0       a
1       b
2       c
3    adba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
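
The output above shows that the NaN at position 5 passes through str.lower() untouched. The same holds for the other vectorized string methods. A small sketch, reusing the Series from above; the choice of upper(), len() and contains() is only for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'Adba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# Vectorized string methods skip missing values instead of raising.
upper = s.str.upper()
lengths = s.str.len()

# str.contains propagates the missing value unless `na` is given;
# na=False turns the result into a clean boolean mask.
has_a = s.str.contains('a', case=False, na=False)
matches = s[has_a]

print(upper.tolist())
print(matches.tolist())
```

Passing na=False matters when the mask is used for filtering, because a mask that still contains missing values cannot be used for boolean indexing.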

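6. Merge

Concat

pandas provides various facilities for combining Series and DataFrame objects. (Panel, which older tutorials also list here, has since been removed from pandas.)
Use concat() to concatenate pandas objects along an axis.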

df = pd.DataFrame(np.random.randn(10,4))
df
          0         1         2         3
0 -0.824875 -0.824654 -0.695564 -0.670425
1 -1.429506  0.560008  0.388066  0.480411
2 -0.748083 -0.516179  0.617229 -0.378735
3  1.037984  0.062974 -1.082505  1.830984
4  1.564869 -0.468556  2.032555 -1.341256
5  1.668148  0.474790 -1.230549 -0.616061
6 -0.747773 -0.762073 -0.261499  0.579439
7  1.044983 -1.062747  0.331895  0.860687
8  0.050739 -0.711202 -0.960243  0.053964
9  0.229273  1.506168  0.569877 -2.598289
# Split df into several pieces
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
          0         1         2         3
0 -0.824875 -0.824654 -0.695564 -0.670425
1 -1.429506  0.560008  0.388066  0.480411
2 -0.748083 -0.516179  0.617229 -0.378735
3  1.037984  0.062974 -1.082505  1.830984
4  1.564869 -0.468556  2.032555 -1.341256
5  1.668148  0.474790 -1.230549 -0.616061
6 -0.747773 -0.762073 -0.261499  0.579439
7  1.044983 -1.062747  0.331895  0.860687
8  0.050739 -0.711202 -0.960243  0.053964
9  0.229273  1.506168  0.569877 -2.598289
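
Concatenating the pieces in their original order should give back the original frame. The sketch below checks that and also shows two optional arguments of concat(), keys and ignore_index. The fixed seed and the labels 'head', 'middle' and 'tail' are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable
df = pd.DataFrame(rng.standard_normal((10, 4)))

pieces = [df[:3], df[3:7], df[7:]]

# Row-wise concatenation of the slices rebuilds the original frame.
restored = pd.concat(pieces)
print(restored.equals(df))

# keys= adds an outer index level that records which piece a row came from.
labeled = pd.concat(pieces, keys=['head', 'middle', 'tail'])
print(labeled.loc['middle'].shape)

# ignore_index=True drops the old row labels and renumbers from 0.
reordered = pd.concat(pieces[::-1], ignore_index=True)
print(reordered.index.tolist())
```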

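Join

merge() combines two DataFrames in the SQL style. When a key value is duplicated on both sides, every matching pair of rows appears in the result.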

left = pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})

right = pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})
left
   key  lval
0  foo     1
1  foo     2
right
   key  rval
0  foo     4
1  foo     5
pd.merge(left, right, on='key')
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
left
   key  lval
0  foo     1
1  bar     2
right
   key  rval
0  foo     4
1  bar     5
pd.merge(left, right, on='key')
   key  lval  rval
0  foo     1     4
1  bar     2     5
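
Both merges above use the default inner join. As a rough sketch of the other SQL-style options, the example below uses the how, indicator and validate parameters of pd.merge(). The extra keys 'baz' and 'qux' are made up for this example so that the two sides do not match completely.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'bar', 'qux'], 'rval': [4, 5, 6]})

# Inner join (the default): only keys present on both sides.
inner = pd.merge(left, right, on='key')

# Left join: keep every row of `left`; unmatched rows get NaN in rval.
left_join = pd.merge(left, right, on='key', how='left')

# Outer join: keep keys from both sides; indicator=True adds a
# `_merge` column that says where each row came from.
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

# validate= guards against the row multiplication seen with duplicate keys.
dup_left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
dup_right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
try:
    pd.merge(dup_left, dup_right, on='key', validate='one_to_one')
    rejected = False
except pd.errors.MergeError:
    rejected = True

print(len(inner), len(left_join), len(outer), rejected)
```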

Append

df = pd.DataFrame(np.random.randn(8,4),columns=['A','B','C','D'])
df
          A         B         C         D
0 -1.340598  0.191867 -0.613625  0.570954
1 -1.340356  0.751859  0.263701 -1.099900
2  0.097820  1.977037 -1.494453 -0.803257
3  0.719918 -0.186450  0.753288 -0.796492
4  0.850745  0.853605 -0.876059 -0.122839
5 -0.184993 -0.213349 -2.743835  2.213326
6 -1.348992 -1.538484 -1.057468  0.773608
7  1.962987 -0.021108 -0.091024  0.704555
s = df.iloc[3]
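# DataFrame.append() was removed in pandas 2.0, so use pd.concat() instead.
# The result shows that the row has been appended.
pd.concat([df, s.to_frame().T], ignore_index=True)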
          A         B         C         D
0 -1.340598  0.191867 -0.613625  0.570954
1 -1.340356  0.751859  0.263701 -1.099900
2  0.097820  1.977037 -1.494453 -0.803257
3  0.719918 -0.186450  0.753288 -0.796492
4  0.850745  0.853605 -0.876059 -0.122839
5 -0.184993 -0.213349 -2.743835  2.213326
6 -1.348992 -1.538484 -1.057468  0.773608
7  1.962987 -0.021108 -0.091024  0.704555
8  0.719918 -0.186450  0.753288 -0.796492
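
Adding a row produces a new DataFrame each time, so adding rows one by one inside a loop copies the data over and over. One common pattern, sketched here with made-up values, is to collect the new rows in a plain list and concatenate once at the end:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})

# Collect the new rows as dicts (hypothetical values for illustration).
new_rows = []
for i in range(3):
    new_rows.append({'A': 10.0 + i, 'B': 20.0 + i})

# Build one frame from the collected rows and concatenate a single time.
result = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
print(result)
```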

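7. Grouping

"Group by" refers to a process involving one or more of the following three steps: splitting the data into groups based on some criteria, applying a function to each group independently, and combining the results into a data structure.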

df = pd.DataFrame(
    {
        'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
        'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
        'C' : np.random.randn(8),
        'D' : np.random.randn(8)
    })
df
     A      B         C         D
0  foo    one  0.232057 -0.752738
1  bar    one -0.407037  0.194269
2  foo    two -1.057581  0.310914
3  bar  three -0.305416  0.396293
4  foo    two -0.206986  2.310486
5  bar    two -0.489458  1.015415
6  foo    one -0.121633  1.055399
7  foo  three  0.707281  0.559738
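# Group by column A, then sum the numeric columns within each group.
# Selecting C and D explicitly keeps the string column B out of the sum.
df.groupby('A')[['C', 'D']].sum()
            C         D
A                      
bar -1.201911  1.605977
foo -0.446863  3.483799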
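
To make the split, apply and combine steps visible, the sketch below does them by hand and compares the result with the built-in call. The integer values in C and D replace the random numbers above so that the sums are easy to check; they are an assumption of this example.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [1, 2, 3, 4, 5, 6, 7, 8],          # fixed values for illustration
    'D': [10, 20, 30, 40, 50, 60, 70, 80],
})

# Split: iterating over a groupby yields (group key, sub-frame) pairs.
# Apply: sum the numeric columns of each sub-frame.
manual = {}
for name, group in df.groupby('A'):
    manual[name] = group[['C', 'D']].sum()

# Combine: put the per-group results back into one frame.
combined = pd.DataFrame(manual).T

# The built-in call performs the same three steps in one go.
built_in = df.groupby('A')[['C', 'D']].sum()

# Grouping by several columns produces a MultiIndex on the result.
by_two = df.groupby(['A', 'B'])[['C', 'D']].sum()

print(combined)
print(by_two)
```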

8. Reshaping

Stack

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names = ['first', 'second'])
df = pd.DataFrame(np.random.randn(8,2),index=index,columns=['A','B'])
df2 = df[:4]
# We can confirm that df2 has a MultiIndex.
df2.index
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two')],
           names=['first', 'second'])
df2
                     A         B
first second                    
bar   one     0.524658 -0.697061
      two     0.150886  0.522951
baz   one     0.166711  1.799605
      two    -1.407891  0.718423

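The stack() method "compresses" a level of the DataFrame's columns: the column labels A and B move into a new innermost level of the row index.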

stacked = df2.stack()
stacked
first  second   
bar    one     A    0.524658
               B   -0.697061
       two     A    0.150886
               B    0.522951
baz    one     A    0.166711
               B    1.799605
       two     A   -1.407891
               B    0.718423
dtype: float64

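unstack() is the inverse operation of stack(); by default it unstacks the last level. Passing a level number, as in unstack(1) or unstack(0), unstacks that level instead.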

stacked.unstack()
                     A         B
first second                    
bar   one     0.524658 -0.697061
      two     0.150886  0.522951
baz   one     0.166711  1.799605
      two    -1.407891  0.718423
stacked.unstack(1)
second        one       two
first                      
bar   A  0.524658  0.150886
      B -0.697061  0.522951
baz   A  0.166711 -1.407891
      B  1.799605  0.718423
stacked.unstack(0)
first          bar       baz
second                      
one    A  0.524658  0.166711
       B -0.697061  1.799605
two    A  0.150886 -1.407891
       B  0.522951  0.718423
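
The relationship between stack() and unstack() can be checked with a round trip. This sketch uses small fixed numbers instead of random ones (an assumption of the example) and refers to the index levels by name, which is equivalent to the positional unstack(1) and unstack(0) calls above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=['first', 'second'])
df2 = pd.DataFrame(np.arange(8.0).reshape(4, 2),
                   index=index, columns=['A', 'B'])

# stack() moves the column labels into a third, innermost index level.
stacked = df2.stack()
print(stacked.index.nlevels)

# unstack() with no argument moves that innermost level back to columns.
round_trip = stacked.unstack()
print(round_trip)

# A level can be chosen by name as well as by position.
by_second = stacked.unstack('second')   # same level as unstack(1)
by_first = stacked.unstack('first')     # same level as unstack(0)
print(by_second)
print(by_first)
```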

Pivot Tables

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df
        A  B    C         D         E
0     one  A  foo -0.859551  0.034023
1     one  B  foo -0.347332  0.839146
2     two  C  foo  0.583359  0.019192
3   three  A  bar -0.622852 -1.135847
4     one  B  bar  0.110046 -0.868158
5     one  C  bar  0.732343  0.135214
6     two  A  foo  0.499511 -0.133067
7   three  B  foo  0.425500  0.612787
8     one  C  foo  0.802145  2.167107
9     one  A  bar -1.544572 -0.512272
10    two  B  bar -0.310567 -0.802878
11  three  C  bar  0.601209 -0.514944

The pd.pivot_table() function makes it easy to build a pivot table. Combinations of A, B and C that do not occur in the data show up as NaN.

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C             bar       foo
A     B                    
one   A -1.544572 -0.859551
      B  0.110046 -0.347332
      C  0.732343  0.802145
three A -0.622852       NaN
      B       NaN  0.425500
      C  0.601209       NaN
two   A       NaN  0.499511
      B -0.310567       NaN
      C       NaN  0.583359
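
A pivot table can be read as a groupby followed by an unstack. The sketch below compares the two and also shows the aggfunc and fill_value parameters. Column D holds the numbers 0 to 11 instead of random values so the cells are easy to trace back to their rows; that substitution is an assumption of this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['one', 'one', 'two', 'three'] * 3,
    'B': ['A', 'B', 'C'] * 4,
    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
    'D': np.arange(12.0),   # fixed values instead of random ones
})

# Default aggregation is the mean of D for each (A, B, C) combination.
table = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

# The same table built by hand: group, aggregate, then unstack C.
same = df.groupby(['A', 'B', 'C'])['D'].mean().unstack('C')

# aggfunc chooses the aggregation; fill_value replaces the missing cells.
filled = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'],
                        aggfunc='sum', fill_value=0)

print(table)
print(filled)
```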