How to compute class frequencies in a pandas dataset

j9per5c4  posted on 2023-02-14  in Other

I am writing a program that discretizes a set of attributes using entropy-based discretization. The goal is to turn the dataset

A,Class
5,1
12.5,1
11.5,2
8.6,2
7,1
6,1
5.9,2
1.5,2
9,2
7.8,1
2.1,1
13.5,2
12.45,2

into

A,Class
1,1
3,1
3,2
2,2
2,1
2,1
1,2
1,2
3,2
2,1
1,1
3,2
3,2
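
The two tables above are consistent with cut points at 6 and 9: values below 6 map to 1, values from 6 up to (but not including) 9 map to 2, and everything else maps to 3. Those cut points are inferred from the sample rows and are not stated in the post, so the following is only a sketch of the final binning step with `pd.cut`, not the entropy-based search for the cut points itself:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [5, 12.5, 11.5, 8.6, 7, 6, 5.9, 1.5, 9, 7.8, 2.1, 13.5, 12.45],
    "Class": [1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2],
})

# Assumed cut points, inferred from the example tables (not given in the post).
cut_points = [-np.inf, 6, 9, np.inf]

# right=False makes each bin closed on the left: [-inf, 6), [6, 9), [9, inf).
binned = pd.cut(df["A"], bins=cut_points, right=False, labels=[1, 2, 3])

# Replace the raw values of A with their integer bin labels.
discretized = df.assign(A=binned.astype(int))
print(discretized.to_csv(index=False))
```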

The specific problem my program runs into is determining the frequencies of 1 and 2 in the Class column.

df = s['Class']
df['freq'] = df.groupby('Class')['Class'].transform('count')
print("*****************")
print(df['freq'])

I would like to use a pandas method that returns the frequencies of 1 and 2 so that I can compute the probabilities p1 and p2.

import pandas as pd
import numpy as np
import entropy_based_binning as ebb
from math import log2

def main():
    df = pd.read_csv('S1.csv')
    s = df
    s = entropy_discretization(s)

# This method discretizes attribute A1 of s.
# A partition stops splitting if the information gain is 0,
# i.e. the number of distinct classes is 1, or
# if min f / max f < 0.5 and the number of distinct values is floor(n/2).
def entropy_discretization(s):

    informationGain = {}
    # while(uniqueValue(s)):
    # Step 1: pick a threshold
    threshold = 6

    # Step 2: Partition the data set into two partitions
    s1 = s[s['A'] < threshold]
    print("s1 after splitting")
    print(s1)
    print("******************")
    s2 = s[s['A'] >= threshold]
    print("s2 after splitting")
    print(s2)
    print("******************")
        
    # Step 3: calculate the information gain.
    informationGain = information_gain(s1,s2,s)

    print(informationGain)

    # # Step 5: calculate the max information gain
    # minInformationGain = min(informationGain)

    # # Step 6: keep the partitions of S based on the value of threshold_i
    # s = bestPartition(minInformationGain, s)

def uniqueValue(s):
    # if every record in s has the same value of A, there is nothing left to split: return False
    if s.nunique()['A'] == 1:
        return False
    # otherwise return True
    else:
        return True

def bestPartition(maxInformationGain):
    # determine the best threshold_i
    threshold_i = 6

    return 

def information_gain(s1, s2, s):
    # calculate cardinality for s1
    cardinalityS1 = len(pd.Index(s1['A']).value_counts())
    print(f'The Cardinality of s1 is: {cardinalityS1}')
    # calculate cardinality for s2
    cardinalityS2 = len(pd.Index(s2['A']).value_counts())
    print(f'The Cardinality of s2 is: {cardinalityS2}')
    # calculate cardinality of s
    cardinalityS = len(pd.Index(s['A']).value_counts())
    print(f'The Cardinality of s is: {cardinalityS}')
    # calculate informationGain
    informationGain = (cardinalityS1/cardinalityS) * entropy(s1) + (cardinalityS2/cardinalityS) * entropy(s2)
    print(f'The total informationGain is: {informationGain}')
    return informationGain


def entropy(s):
    # calculate the number of classes in s
    numberOfClasses = s['Class'].nunique()
    print(f'Number of classes: {numberOfClasses}')
    # TODO calculate pi for each class.
    # calculate the frequency of class_i in S1
    value_counts = s['Class'].value_counts()
    print(f'value_counts : {value_counts}')
    df = s['Class']
    df['freq'] = df.groupby('Class')['Class'].transform('count')
    print("*****************")
    print(df['freq'])
    # p1 = s.groupby('Class').count()
    # p2 = s.groupby('Class').count()
    # print(f'p1: {p1}')
    # print(f'p2: {p2}')
    p1 = 2/4
    p2 = 3/4
    ent = -(p1*log2(p1)) - (p2*log2(p2))

    return ent

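Ideally, I would like it to print Number of classes: 2, so that I can loop over the classes and compute the frequency of each value of the Class attribute in the dataset. I have looked through the pandas documentation, but I am having trouble computing the frequencies of 1 and 2 from the Class column.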

niwlg2el  #1

Use value_counts:

>>> df.value_counts('Class')
Class
2    7
1    6
dtype: int64
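
Since the stated goal is the probabilities p1 and p2 rather than the raw counts, `value_counts` can also return relative frequencies directly through its `normalize=True` argument. A minimal sketch, with the question's Class column built inline instead of being read from S1.csv:

```python
import pandas as pd

df = pd.DataFrame({"Class": [1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2]})

# normalize=True divides each count by the total number of rows.
probs = df["Class"].value_counts(normalize=True)

p1 = probs[1]  # share of rows with Class == 1 (6 of 13)
p2 = probs[2]  # share of rows with Class == 2 (7 of 13)
print(p1, p2)
```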

Update

How do I get the individual frequencies returned by the value_counts method?

counts = df.value_counts('Class')

print(counts[1])  # Freq of 1
6

print(counts[2])  # Freq of 2
7
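
One caveat when using this inside entropy(): after a split, a partition may contain only one class, in which case `counts[2]` raises a KeyError. The sketch below avoids indexing by a fixed label and instead loops over whichever classes are present. The helper names `class_entropy` and `split_gain` are hypothetical and not part of the original program, and the weights use row counts per partition (the usual definition of information gain) rather than the number of distinct values of A used in the question's code:

```python
from math import log2

import pandas as pd


def class_entropy(s):
    """Shannon entropy, in bits, of the Class column of s (hypothetical helper)."""
    probs = s["Class"].value_counts(normalize=True)
    # Only classes that actually occur appear in probs, so log2(0) never happens.
    return -sum(p * log2(p) for p in probs)


def split_gain(s, threshold):
    """Information gain of splitting s at A < threshold (hypothetical helper)."""
    s1 = s[s["A"] < threshold]
    s2 = s[s["A"] >= threshold]
    # Weight each partition's entropy by its share of the rows.
    weighted = (len(s1) / len(s)) * class_entropy(s1) \
             + (len(s2) / len(s)) * class_entropy(s2)
    return class_entropy(s) - weighted


df = pd.DataFrame({
    "A": [5, 12.5, 11.5, 8.6, 7, 6, 5.9, 1.5, 9, 7.8, 2.1, 13.5, 12.45],
    "Class": [1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2],
})

print(class_entropy(df))
print(split_gain(df, threshold=6))
```

If a single frequency is still needed, `counts.get(2, 0)` returns 0 instead of raising when class 2 is absent from the partition.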
