Posts in the 'Python/pandas' category

1-3. Pandas.Series - Pandas.Series.isnull 2022.03.12
1-2. Pandas.Series - Working with data using the index 2022.03.12
1-1. Pandas.Series - Parameter description 2022.03.10
0. Introduction to Pandas 2022.03.09

1-3. Pandas.Series - Pandas.Series.isnull

2022. 3. 12. 19:59

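The previous posts gave a general overview of Pandas.

From here on, I will go through the Pandas.Series methods that come up most often when working with data.

The first Series method is isnull().

This method checks whether each value in the Series is NA and returns a Series of bool values: True where the value is NA and False otherwise. Here NA means values such as None and numpy.nan.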

import pandas as pd
import numpy as np

a = pd.Series([1,np.nan,3,4,None],
              index = ['a','b','c','d','e'])

a.isnull()

# Result
a    False
b     True
c    False
d    False
e     True
dtype: bool
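
As a minimal sketch of how this boolean result is typically used (the names mask, n_missing and present are illustrative, not from the original post): since True counts as 1, summing the mask gives the number of missing values, and the inverted mask keeps only the values that are present.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, np.nan, 3, 4, None],
              index=['a', 'b', 'c', 'd', 'e'])

mask = a.isnull()              # boolean Series, True where the value is NA

n_missing = int(mask.sum())    # True counts as 1, so the sum is the NA count
present = a[~mask]             # keep only the entries that are not NA

print(n_missing)
print(present)
```

Selecting with the inverted mask gives the same entries as a.dropna().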

One caveat: numpy.inf is not treated as an NA value by isnull().

a = pd.Series([1,np.nan,3,4,np.inf],
              index = ['a','b','c','d','e'])

a.isnull()

# Result
a    False
b     True
c    False
d    False
e    False
dtype: bool

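So if inf should count as NA, set pandas.options.mode.use_inf_as_na to True as follows. (This option was deprecated in pandas 2.1, so it applies to older versions such as pandas 1.4.)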

pd.options.mode.use_inf_as_na = True

a = pd.Series([1,np.nan,3,4,np.inf],
              index = ['a','b','c','d','e'])

a.isnull()

# Result
a    False
b     True
c    False
d    False
e     True
dtype: bool
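
An alternative that avoids changing a global option is to convert the infinities to NaN on the Series itself before calling isnull(). This is only a sketch of one possible approach; the name cleaned is illustrative.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, np.nan, 3, 4, np.inf],
              index=['a', 'b', 'c', 'd', 'e'])

# Replace +inf and -inf with NaN on a new Series; a itself is left unchanged.
cleaned = a.replace([np.inf, -np.inf], np.nan)

print(cleaned.isnull())
```

Because replace returns a new Series, the original a still holds inf at label 'e'.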


1-2. Pandas.Series - Working with data using the index

2022. 3. 12. 16:14

Post 1-1 explained the parameters of pd.Series.

If you are not familiar with how to declare a pandas.Series and its parameters, it is worth reading the following post first.

https://2vs5.tistory.com/6

This time, let's look at how to work with data through the index when using pd.Series.

1. Retrieving data through the index

As seen earlier, there are several ways to retrieve values from a pd.Series through its index.

Just as with a key in a Python dictionary, passing an index label returns the corresponding value.

a = pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e'])

a['a']

# Result
1

Passing array-like data as the indexer returns a Series made up of only the matching entries.

b = np.array(['c', 'a', 'b'])
a[b]

# Result
c    3
a    1
b    2
dtype: int64

The index can also be sliced much like Python slicing, except that the end label is included in the result.

a['a':'c']

# Result
a    1
b    2
c    3
dtype: int64
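
The inclusive end is easy to miss, so here is a small side-by-side sketch using .loc and .iloc, the explicit label-based and position-based indexers. Label slices include the end label, while positional slices follow the usual Python rule and exclude the end.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = a.loc['a':'c']     # label slice: 'c' is included
by_position = a.iloc[0:2]     # positional slice: position 2 is excluded

print(by_label)
print(by_position)
```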

2. Arithmetic operations using the index

Python's operators, including comparison operators, can be applied to a pd.Series, as the following shows.

data = [1,2,3,4,5]

a = pd.Series(data, index = ['a', 'b', 'c', 'd', 'e'])

a > 1

# Result
a    False
b     True
c     True
d     True
e     True
dtype: bool

Applying a > 1 to the Series a returns a Series of bool values that shares a's index, with True where the value is greater than 1 and False otherwise. Because a Series is itself array-like, it can be used as an indexer.

Using this boolean Series as the indexer selects only the entries where it is True. In other words, a condition written with operators extracts just the values we want from a Series.

The following example gets the values in a that are greater than 1.

data = [1,2,3,4,5]

a = pd.Series(data, index = ['a', 'b', 'c', 'd', 'e'])

a[a > 1]

# Result
b    2
c    3
d    4
e    5
dtype: int64
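
Conditions can also be combined. A minimal sketch: with a Series, use & (and), | (or) and ~ (not) rather than Python's and/or/not, and wrap each comparison in parentheses because these operators bind more tightly than comparisons. The variable names are illustrative.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

middle = a[(a > 1) & (a < 5)]     # values strictly between 1 and 5
edges = a[(a == 1) | (a == 5)]    # values equal to 1 or 5
not_big = a[~(a > 3)]             # values that are not greater than 3

print(middle)
print(edges)
print(not_big)
```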

With the methods above, data can be extracted in whatever way is needed.


1-1. Pandas.Series - Parameter description

2022. 3. 10. 17:15

A Series is a one-dimensional array-like data structure that can hold a sequence of objects. It also has an index, an array of labels associated with the data.

The parameters of the pandas Series are as follows.

class pandas.Series(data = None, index = None, dtype = None, name = None, copy = False, fastpath = False)

Parameters

data : array-like, iterable, dict, scalar value
index : array-like or Index (1d)
dtype : str, numpy.dtype, ExtensionDtype, optional
name : str, optional
copy : bool, default False

As the definition of data above shows, it accepts array-like, iterable, dict, and scalar inputs, which means a Series can be built flexibly from many kinds of data.

array-like is explained in the following post.

https://2vs5.tistory.com/5

Data

Let's create a Series object from a list, the simplest kind of array-like.

import pandas as pd

a = pd.Series([1,2,3,4,5,6,7,8,9,10])

# Result
0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64

The result is displayed as shown above.

Each value in the list becomes a row; the first column of each row is the index and the second is the value.

Next, a Series is created from an iterable.

iterable = (x*x for x in range(5))

iterable

# Result
<generator object <genexpr> at 0x7f5688c02ad0>

b = pd.Series(iterable)

# Result
0     0
1     1
2     4
3     9
4    16
dtype: int64

Unlike np.array, which would store the generator object itself, pd.Series stores the values that the generator produces.
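
The difference can be checked directly. In this sketch, np.array wraps the generator object in a zero-dimensional object array, while pd.Series iterates over it and stores the values. A generator can be consumed only once, so a fresh one is created for each call.

```python
import numpy as np
import pandas as pd

arr = np.array(x * x for x in range(5))     # the generator itself is stored
ser = pd.Series(x * x for x in range(5))    # the generated values are stored

print(arr.shape, arr.dtype)   # () object
print(ser.tolist())           # [0, 1, 4, 9, 16]
```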

If a Python dictionary is passed as data, its keys become the index and its values become the data.

dic = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

a = pd.Series(dic)

# Result
a    1
b    2
c    3
d    4
e    5
dtype: int64

To choose the labels and their order yourself instead of taking them from the dictionary's keys, pass index explicitly as follows.

dic = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
index = ['b', 'e', 'a', 'd', 'c']

a = pd.Series(dic, index = index)

# Result
b    2
e    5
a    1
d    4
c    3
dtype: int64

If the index contains a label that is not a key in the dictionary, that entry is filled with NaN (not a number), as shown below. Pandas treats NaN as a missing value, or NA.

dic = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
index = ['b', 'e', 'a', 'd', 'c', 'f']

a = pd.Series(dic, index = index)

# Result
b    2.0
e    5.0
a    1.0
d    4.0
c    3.0
f    NaN
dtype: float64
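
The NaN is also why the dtype changed from int64 to float64: NaN is a float, so the integers are converted to hold it. If a default value is preferable to NaN, one option (a sketch, not the only approach) is to build the Series from the dictionary first and then call reindex with fill_value. The names with_nan and with_zero are illustrative.

```python
import pandas as pd

dic = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
index = ['b', 'e', 'a', 'd', 'c', 'f']

with_nan = pd.Series(dic, index=index)                    # 'f' becomes NaN
with_zero = pd.Series(dic).reindex(index, fill_value=0)   # 'f' becomes 0

print(with_nan.dtype)    # float64
print(with_zero.dtype)   # int64
```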

A Series can also be created from a scalar value, or with no data at all.

a = pd.Series()
b = pd.Series(42)

a
# Result (pandas 1.x; since pandas 2.0 an empty Series defaults to dtype object)
Series([], dtype: float64)

b
# Result
0    42
dtype: int64
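
When a scalar is combined with an index, the value is repeated for every label. A short sketch (the labels and the name c are illustrative):

```python
import pandas as pd

# The scalar 42 is broadcast to each of the three labels.
c = pd.Series(42, index=['a', 'b', 'c'])

print(c)
```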

Index

The index not only lets us label the data for easy identification but also provides a variety of features.

With a pd.Series, the index can be used to select a single value or several values.

As seen when creating a pd.Series, the index is the label of each data value; by default the labels are integers starting from 0. The index can be specified when declaring a pd.Series. It can also be extracted on its own, and assigning array-like data to it changes the index of the Series.

a = pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e'])

# Result
a    1
b    2
c    3
d    4
e    5
dtype: int64

a.index
# a.keys() returns the same result
# Result
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

a.index = ['b', 'a', 'c', 'd', 'e']

# Result
b    1
a    2
c    3
d    4
e    5
dtype: int64

The index we declared can be used to work with the Series data.

a = pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e'])

a['a']

# Result
1

a[['c', 'a', 'b']]

# Result
c    3
a    1
b    2
dtype: int64

a['a':'c']

# Result
a    1
b    2
c    3
dtype: int64

The following are more things the index makes possible.

Because a pd.Series maps index labels to data values, it resembles a Python dictionary.

a.keys()

# Result
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

list(a.items())

# Result
[('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]

'b' in a

# Checks whether the label exists in a's index.
# Result
True
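
The dictionary analogy extends a little further. As a sketch: get returns a default instead of raising KeyError for a missing label, and to_dict converts the Series back into a plain dictionary. The variable names are illustrative.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

found = a.get('b')                 # 2
missing = a.get('z', default=0)    # 'z' is not a label, so the default is returned
as_dict = a.to_dict()              # {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

print(found, missing)
print(as_dict)
```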

Dtype

Specifies the data type of the pd.Series. It is optional, so it does not have to be given. If it is omitted, the data type is inferred from the data passed in.

For details on the available data types, see the following link.

https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes

In the following, the data contains only integers, so the dtype would normally be inferred as int64; because dtype is set to float64, the data is converted to float64.

a = pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e'], dtype = 'float64')

# Result
a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64
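
The dtype can also be changed after construction. A minimal sketch with astype, which returns a new Series and leaves the original untouched:

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
b = a.astype('float64')   # new Series with the values converted to float64

print(a.dtype)   # int64 (inferred from the data)
print(b.dtype)   # float64
```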

Name

The name given to the pd.Series.

In the following example, the name appears in the printed output. The name attribute can also be used to read the name or to change it.

a = pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e'], name = 'a')

# Output
a    1
b    2
c    3
d    4
e    5
Name: a, dtype: int64

a.name

# Output
'a'

a.name = 'b'

# Output
a    1
b    2
c    3
d    4
e    5
Name: b, dtype: int64
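
One place the name matters: when a Series is turned into a DataFrame, the name becomes the column label. A sketch, where 'price' and 'cost' are just illustrative names:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='price')

df = a.to_frame()             # one-column DataFrame; the column is called 'price'
renamed = a.rename('cost')    # returns a renamed Series; a itself keeps 'price'

print(list(df.columns))   # ['price']
print(renamed.name)       # cost
```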

Copy

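Copies the input data. Only Series and 1d ndarray inputs are affected by this parameter.

More precisely, when a Series or 1d ndarray is used as the input data and copy is not set to True, the new Series is linked to the input's data, so modifying the Series also modifies the original input. (This is the pandas 1.x behavior shown in this post; in newer pandas versions with Copy-on-Write enabled, the input is no longer modified this way.)

The following example makes this easier to see.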

data = np.array([1,2,3,4,5])
a = pd.Series(data, index = ['a', 'b', 'c', 'd', 'e'])

# data result
array([1, 2, 3, 4, 5])

# a result

a    1
b    2
c    3
d    4
e    5
dtype: int64

a['a'] = 777

# a result
a    777
b      2
c      3
d      4
e      5
dtype: int64

# data result
array([777,   2,   3,   4,   5])

To prevent this, pass copy = True. The Series then uses a copy of the input data rather than the input itself, so the original input data does not change.

data = np.array([1,2,3,4,5])
a = pd.Series(data, index = ['a', 'b', 'c', 'd', 'e'], copy = True)

# data result
array([1, 2, 3, 4, 5])

# a result

a    1
b    2
c    3
d    4
e    5
dtype: int64

a['a'] = 777

# a result
a    777
b      2
c      3
d      4
e      5
dtype: int64

# data result
array([1, 2, 3, 4, 5])
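
Whether a Series still points at the input's memory can be checked with np.shares_memory. This sketch covers only the copy = True case, where the Series should never share memory with the input; the name shared is illustrative.

```python
import numpy as np
import pandas as pd

data = np.array([1, 2, 3, 4, 5])
a = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'], copy=True)

# With copy=True the Series owns its own buffer, separate from data.
shared = np.shares_memory(data, a.to_numpy())
print(shared)   # False

a['a'] = 777    # modifies only the Series
print(data)     # [1 2 3 4 5]
```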

If the input data is something other than a Series or 1d ndarray, it is unaffected even when copy is False.

data = [1,2,3,4,5]
a = pd.Series(data, index = ['a', 'b', 'c', 'd', 'e'])

# data result
[1, 2, 3, 4, 5]

# a result

a    1
b    2
c    3
d    4
e    5
dtype: int64

a['a'] = 777

# a result
a    777
b      2
c      3
d      4
e      5
dtype: int64

# data result
[1, 2, 3, 4, 5]


0. Introduction to Pandas

2022. 3. 9. 12:18

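Pandas is a package frequently used for working with data. It is usually used together with NumPy, SciPy, scikit-learn, and matplotlib. Pandas borrows much of its style from NumPy, but whereas NumPy specializes in homogeneous numerical array data, Pandas was designed with a focus on handling data in a variety of forms, such as tables and time series.

By convention, Pandas is imported as follows.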

import pandas as pd
