10 minutes to pandas 4

Reference: 10 Minutes to Pandas

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

9. Time Series

rng = pd.date_range('1/1/2012',periods=100,freq='S')
ts = pd.Series(np.random.randint(0,500,len(rng)), index=rng)
ts.resample('5Min').sum()
2012-01-01    24106
Freq: 5T, dtype: int32
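The resample call above collapses 100 one-second samples into a single 5-minute bin, which is why only one row comes back. A minimal sketch of the same idea with deterministic data, so the bin totals can be checked by hand. The names `idx`, `ones` and `binned` are illustrative, and the frequency is passed as a `pd.Timedelta` only to sidestep alias spelling differences between pandas versions.

```python
import numpy as np
import pandas as pd

# 600 one-second samples: 00:00:00 through 00:09:59 on 2012-01-01.
idx = pd.date_range("2012-01-01", periods=600, freq=pd.Timedelta(seconds=1))
ones = pd.Series(np.ones(600, dtype="int64"), index=idx)

# Downsample to 5-minute bins; each bin is labeled by its left edge.
binned = ones.resample("5min").sum()
print(binned)
```

Each bin covers 300 seconds, so both rows should sum to 300.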
rng = pd.date_range('3/6/2012 00:00',periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)),rng)
ts
2012-03-06   -0.228536
2012-03-07    1.182960
2012-03-08   -0.189565
2012-03-09   -0.968019
2012-03-10   -0.550340
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-06 00:00:00+00:00   -0.228536
2012-03-07 00:00:00+00:00    1.182960
2012-03-08 00:00:00+00:00   -0.189565
2012-03-09 00:00:00+00:00   -0.968019
2012-03-10 00:00:00+00:00   -0.550340
Freq: D, dtype: float64

Convert to another time zone.

ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00   -0.228536
2012-03-06 19:00:00-05:00    1.182960
2012-03-07 19:00:00-05:00   -0.189565
2012-03-08 19:00:00-05:00   -0.968019
2012-03-09 19:00:00-05:00   -0.550340
Freq: D, dtype: float64
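The difference between the two calls is easy to miss: `tz_localize` attaches a zone to naive timestamps without moving them, while `tz_convert` relabels the same instants in another zone. A small sketch, assuming the time zone database used by pandas is available; the variable names are illustrative.

```python
import pandas as pd

idx = pd.date_range("2012-03-06", periods=3, freq="D")
naive = pd.Series([1.0, 2.0, 3.0], index=idx)

# Attach UTC: the wall-clock labels stay the same.
utc = naive.tz_localize("UTC")

# Convert: the instants stay the same, only the labels change.
eastern = utc.tz_convert("US/Eastern")

print(utc.index[0])      # 2012-03-06 00:00:00+00:00
print(eastern.index[0])  # 2012-03-05 19:00:00-05:00
print(utc.index[0] == eastern.index[0])  # True, same moment in time
```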

Convert between timestamp and period representations.

rng = pd.date_range('1/1/2012',periods=5,freq='M')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2012-01-31    0.031629
2012-02-29    0.875231
2012-03-31    0.005173
2012-04-30   -0.383027
2012-05-31    0.054017
Freq: M, dtype: float64
ps= ts.to_period()
ps
2012-01    0.031629
2012-02    0.875231
2012-03    0.005173
2012-04   -0.383027
2012-05    0.054017
Freq: M, dtype: float64
ps.to_timestamp()
2012-01-01    0.031629
2012-02-01    0.875231
2012-03-01    0.005173
2012-04-01   -0.383027
2012-05-01    0.054017
Freq: MS, dtype: float64
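Note that the round trip above does not return the original month-end dates: `to_timestamp()` defaults to the start of each period. A short sketch of both directions, using hand-written dates instead of a frequency alias; the names are illustrative.

```python
import pandas as pd

stamps = pd.to_datetime(["2012-01-31", "2012-02-29", "2012-03-31"])
ts = pd.Series([1, 2, 3], index=stamps)

ps = ts.to_period("M")            # timestamps -> monthly periods
start = ps.to_timestamp()         # periods -> first instant of each month
end = ps.to_timestamp(how="end")  # periods -> last instant of each month

print(ps.index[0])     # 2012-01
print(start.index[0])  # 2012-01-01 00:00:00
print(end.index[0].normalize())  # 2012-01-31 00:00:00
```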
prng = pd.period_range('1990Q1','2000Q4',freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)),prng)
ts.index = (prng.asfreq('M','e')+1).asfreq('H','s')+9
ts.head()
1990-03-01 09:00   -1.165074
1990-06-01 09:00    0.790822
1990-09-01 09:00    2.920755
1990-12-01 09:00    0.491993
1991-03-01 09:00   -0.491173
Freq: H, dtype: float64
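The one-line index expression above packs several period operations together. A step-by-step sketch on a single period may make it easier to follow. It uses the lowercase hourly alias `"h"` because recent pandas releases deprecate the uppercase spelling; that choice is an assumption about the installed version.

```python
import pandas as pd

# Fiscal quarter in a year that ends in November: 1990Q1 = Dec 1989 .. Feb 1990.
q = pd.Period("1990Q1", freq="Q-NOV")

last_month = q.asfreq("M", how="end")             # last month of the quarter
next_month = last_month + 1                       # month after quarter end
first_hour = next_month.asfreq("h", how="start")  # first hour of that month
nine_am = first_hour + 9                          # shift forward 9 hours

print(last_month)  # 1990-02
print(nine_am)     # 1990-03-01 09:00
```

The result is 9 a.m. on the first day of the month following the quarter end, which matches the first row of the output above.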

10. Categoricals

pandas can hold categorical data in a DataFrame.

df = pd.DataFrame({"id":[1,2,3,4,5,6],"raw_grade":['a','b','b','a','a','e']})
df["grade"] = df["raw_grade"].astype('category')
df["grade"]
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
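A categorical column stores each distinct value once and keeps a small integer code per row. A minimal sketch of what `astype('category')` produces, using the same grades as above; `raw` and `grade` are illustrative names.

```python
import pandas as pd

raw = pd.Series(["a", "b", "b", "a", "a", "e"])
grade = raw.astype("category")

# The distinct values, in sorted order.
print(list(grade.cat.categories))  # ['a', 'b', 'e']

# One integer code per row, pointing into the categories.
print(list(grade.cat.codes))       # [0, 1, 1, 0, 0, 2]
```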

Categories can be renamed to more meaningful labels.

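# Assigning to .cat.categories no longer works in recent pandas; rename_categories returns a renamed copy.
df["grade"] = df["grade"].cat.rename_categories(["very good","good","very bad"])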

Reorder the categories and add the missing categories at the same time.

df["grade"] = df["grade"].cat.set_categories(["very bad","bad","medium","good","very good"])
df["grade"]
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
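Renaming and resetting categories behave differently: `rename_categories` relabels existing values, while `set_categories` defines the full ordered list, and any value missing from that list becomes NaN. A short sketch under the same grade data; the `narrowed` example is an addition for illustration.

```python
import pandas as pd

grade = pd.Series(["a", "b", "b", "a", "a", "e"], dtype="category")

# Relabel the three existing categories in place of a, b, e.
grade = grade.cat.rename_categories(["very good", "good", "very bad"])

# Define the full list: new order plus two categories with no rows yet.
grade = grade.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)
print(list(grade))

# Dropping a category that is in use turns those rows into NaN.
narrowed = grade.cat.set_categories(["good", "very good"])
print(narrowed.isna().sum())  # 1 (the former "very bad" row)
```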
df.sort_values(by="grade")
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
df.groupby("grade").size()
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
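The zero counts for "bad" and "medium" appear because grouping on a categorical column can report every declared category, not only the ones present in the data. The `observed` argument controls this. A minimal sketch with a smaller, made-up frame; passing `observed` explicitly avoids depending on the default, which differs between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "grade": pd.Categorical(
        ["good", "very good", "good"],
        categories=["bad", "good", "very good"],
    ),
})

# Every declared category, including the empty one.
all_cats = df.groupby("grade", observed=False).size()

# Only categories that actually occur in the data.
seen = df.groupby("grade", observed=True).size()

print(all_cats)
print(seen)
```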

11. Plotting

ts = pd.Series(np.random.randn(1000), index = pd.date_range('1/1/2000',periods=1000))
ts = ts.cumsum()
ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1d24340db08>

[Figure: line plot produced by ts.plot()]

On a DataFrame, the plot() method is a convenient way to plot all labeled columns at once.

df = pd.DataFrame(np.random.randn(1000,4),index=ts.index,
                 columns=['A','B','C','D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')
<matplotlib.legend.Legend at 0x1d2440e30c8>

<Figure size 432x288 with 0 Axes>

[Figure: line plot produced by df.plot(), one line per column A, B, C, D]
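The "Figure ... with 0 Axes" line in the output suggests that `plt.figure()` opened an empty figure while `df.plot()` drew on a figure of its own. A sketch of an alternative that works with the Axes object returned by `df.plot()` instead. The `Agg` backend and the file name `walk.png` are assumptions chosen so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-01", periods=1000)
df = pd.DataFrame(
    rng.standard_normal((1000, 4)), index=idx, columns=list("ABCD")
).cumsum()

# df.plot() returns the matplotlib Axes it drew on, one line per column.
ax = df.plot()
ax.legend(loc="best")

# Save through the figure that owns the Axes (hypothetical file name).
ax.figure.savefig("walk.png")
```

Holding on to `ax` makes it clear which figure the legend and the saved file belong to.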

12. Getting Data In / Out

CSV

df.to_csv('foo.csv')
pd.read_csv('foo.csv',index_col=0)
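One detail of the CSV round trip: a date index is written as text, and `index_col=0` alone reads it back as plain labels. A small sketch showing the difference that `parse_dates=True` makes. It writes to a temporary directory, and the frame is made up for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6]},
    index=pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"]),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.csv")
    df.to_csv(path)

    # Index comes back as text labels.
    as_text = pd.read_csv(path, index_col=0)

    # Index comes back as a DatetimeIndex.
    as_dates = pd.read_csv(path, index_col=0, parse_dates=True)

print(type(as_text.index).__name__)
print(type(as_dates.index).__name__)
```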

HDF5

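# Pass the key by keyword; recent pandas versions deprecate the positional form in to_hdf.
df.to_hdf('foo.h5', key='df')
pd.read_hdf('foo.h5', key='df')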