I am writing a program that discretizes a set of attributes using entropy-based discretization. The goal is to turn the dataset
A,Class
5,1
12.5,1
11.5,2
8.6,2
7,1
6,1
5.9,2
1.5,2
9,2
7.8,1
2.1,1
13.5,2
12.45,2
into
A,Class
1,1
3,1
3,2
2,2
2,1
2,1
1,2
1,2
3,2
2,1
1,1
3,2
3,2
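For illustration, the mapping from the first table to the second can be reproduced with fixed cut points. This is only a sketch of the target format, not the entropy-based method itself: the cut at 6 matches the threshold hard-coded in the question's code, while the cut at 9 is an assumption (any value above 8.6 and up to 9 gives the same result for this data).

```python
import io

import numpy as np
import pandas as pd

# The sample data from the question, inlined so the snippet runs on its own.
csv_text = """A,Class
5,1
12.5,1
11.5,2
8.6,2
7,1
6,1
5.9,2
1.5,2
9,2
7.8,1
2.1,1
13.5,2
12.45,2
"""
df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical cut points: 6 is the threshold from the question's code,
# 9 is assumed. right=False makes each bin closed on the left: [low, high).
bins = [-np.inf, 6, 9, np.inf]
discretized = df.copy()
discretized["A"] = pd.cut(df["A"], bins=bins, labels=[1, 2, 3], right=False).astype(int)

print(discretized.to_csv(index=False))
```

In the actual program the cut points would come out of the entropy-based search rather than being written by hand.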
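The specific problem I'm running into is determining the number of classes in the dataset. It happens at the line
numberOfClasses = s['Class'].value_counts()
I want to use a pandas method that returns the number of distinct classes. In this example there are only two. But I get
Number of classes: 2 5
1 4
from the print statement.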
import pandas as pd
import numpy as np
import entropy_based_binning as ebb
from math import log2

def main():
    df = pd.read_csv('S1.csv')
    s = df
    s = entropy_discretization(s)

# This method discretizes attribute A of s.
# If the information gain is 0, i.e. the number of
# distinct classes is 1, or
# if min f / max f < 0.5 and the number of distinct values is floor(n/2),
# then that partition stops splitting.
def entropy_discretization(s):
    informationGain = {}
    # while(uniqueValue(s)):
    # Step 1: pick a threshold
    threshold = 6
    # Step 2: partition the data set into two partitions
    s1 = s[s['A'] < threshold]
    print("s1 after splitting")
    print(s1)
    print("******************")
    s2 = s[s['A'] >= threshold]
    print("s2 after splitting")
    print(s2)
    print("******************")
    # Step 3: calculate the information gain.
    informationGain = information_gain(s1, s2, s)
    print(informationGain)
    # # Step 5: calculate the max information gain
    # minInformationGain = min(informationGain)
    # # Step 6: keep the partitions of S based on the value of threshold_i
    # s = bestPartition(minInformationGain, s)

def uniqueValue(s):
    # if every record in s has the same value of A, return False
    if s.nunique()['A'] == 1:
        return False
    # otherwise return True
    else:
        return True

def bestPartition(maxInformationGain):
    # determine the best threshold_i
    threshold_i = 6
    return

def information_gain(s1, s2, s):
    # calculate cardinality for s1
    cardinalityS1 = len(pd.Index(s1['A']).value_counts())
    print(f'The Cardinality of s1 is: {cardinalityS1}')
    # calculate cardinality for s2
    cardinalityS2 = len(pd.Index(s2['A']).value_counts())
    print(f'The Cardinality of s2 is: {cardinalityS2}')
    # calculate cardinality of s
    cardinalityS = len(pd.Index(s['A']).value_counts())
    print(f'The Cardinality of s is: {cardinalityS}')
    # calculate informationGain
    informationGain = (cardinalityS1/cardinalityS) * entropy(s1) + (cardinalityS2/cardinalityS) * entropy(s2)
    print(f'The total informationGain is: {informationGain}')
    return informationGain

def entropy(s):
    # calculate the number of classes in s
    numberOfClasses = s['Class'].value_counts()
    print(f'Number of classes: {numberOfClasses}')
    # TODO calculate pi for each class.
    # calculate the frequency of class_i in S1
    # (p1 and p2 are hard-coded placeholders until the TODO is done)
    p1 = 2/4
    p2 = 3/4
    ent = -(p1*log2(p1)) - (p2*log2(p2))
    return ent

main()
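The steps outlined in the comments (pick a threshold, split, score the split, keep the best one) could be sketched as follows. This is a minimal sketch under assumptions, not the finished program: the helper names `class_entropy`, `split_gain`, and `best_threshold` are hypothetical, candidate thresholds are taken as midpoints between consecutive distinct values of A, and partitions are weighted by row count, as in the usual textbook definition of information gain.

```python
from math import log2

import pandas as pd


def class_entropy(s):
    """Shannon entropy (in bits) of the 'Class' column of s."""
    # value_counts(normalize=True) gives the relative frequency of each class.
    probabilities = s["Class"].value_counts(normalize=True)
    return -sum(p * log2(p) for p in probabilities if p > 0)


def split_gain(s, threshold):
    """Information gain of splitting s into A < threshold and A >= threshold."""
    s1 = s[s["A"] < threshold]
    s2 = s[s["A"] >= threshold]
    weighted = (len(s1) / len(s)) * class_entropy(s1) + (len(s2) / len(s)) * class_entropy(s2)
    return class_entropy(s) - weighted


def best_threshold(s):
    """Return (threshold, gain) for the candidate with the highest gain.

    Assumes s contains at least two distinct values of A.
    """
    values = sorted(s["A"].unique())
    # Candidates: midpoints between consecutive distinct values of A.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(((t, split_gain(s, t)) for t in candidates), key=lambda pair: pair[1])


# Tiny made-up example: the classes separate perfectly between 2 and 3.
toy = pd.DataFrame({"A": [1, 2, 3, 4], "Class": [1, 1, 2, 2]})
threshold, gain = best_threshold(toy)
print(threshold, gain)
```

Applying `best_threshold` recursively to each partition, with a stopping rule such as the one described in the comment above `entropy_discretization`, would be one way to complete the loop.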
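Ideally, I would like to print Number of classes: 2. That way I can loop over the classes and compute the frequencies for attribute A from the dataset. I have looked through the pandas documentation, but I am stuck on value_counts(). What can I try next?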
1 Answer
You could try:
This returns the number of unique classes in the Class column. Or, shorter:
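A minimal sketch of calls that fit this description, assuming (this is not confirmed by the answer text) that the longer form is `len(Series.unique())` and the shorter one is `Series.nunique()`. It also shows why the question's print statement produced a table: `value_counts()` returns a Series of per-class row counts, not a single number, and that Series can drive the loop over classes.

```python
import pandas as pd

# The Class column from the question's sample data.
s = pd.DataFrame({"Class": [1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2]})

# Number of distinct classes, two equivalent spellings.
number_of_classes = len(s["Class"].unique())
number_of_classes_short = s["Class"].nunique()
print(f"Number of classes: {number_of_classes_short}")

# value_counts() maps each class label to its row count, so it can drive
# the loop over classes and their relative frequencies.
for label, count in s["Class"].value_counts().items():
    print(f"class {label}: {count} rows, p = {count / len(s):.3f}")
```

Passing `normalize=True` to `value_counts()` returns the relative frequencies directly, which may be more convenient when computing entropy.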