ENH: Describe : add shortest, longest, avg/max/min length #59897

Open
1 of 3 tasks
simonaubertbd opened this issue Sep 26, 2024 · 5 comments
Labels
Enhancement · Needs Discussion (Requires discussion from core team before further action) · Strings (String extension data type and string data)

Comments

@simonaubertbd

simonaubertbd commented Sep 26, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Hello,

As of now, describe is mainly oriented toward numerical analysis. It is less useful when you have text (string) values.
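For reference, a minimal sketch of what describe reports today, assuming a reasonably recent pandas version. The sample `df` and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Marseille"],
    "population_k": [2100, 520, 2100, 870],
})

# With mixed columns, only the numeric ones are summarized by default
# (count, mean, std, min, quartiles, max).
numeric_summary = df.describe()

# For a string column, describe reports count, unique, top and freq.
# Nothing in the output relates to string length.
string_summary = df[["city"]].describe()
print(string_summary)
```

The string summary says how many values there are and which one is most frequent, but gives no hint about how long the values are.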

Feature Description

Adding five statistics dedicated to string analysis for each concerned column:
-avg length
-max length
-min length
-shortest: one of the strings with the minimum length
-longest: one of the strings with the maximum length

Alternative Solutions

Writing something like the following, but that means more work (sorry for the formatting):

import pandas as pd

# Sample DataFrame for illustration
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['New York', 'Los Angeles', 'San Francisco', 'Chicago'],
    'country': ['USA', 'USA', 'USA', 'USA']
}

df = pd.DataFrame(data)

# Function to get string statistics
def string_column_statistics(df):
    stats = {}

    for col in df.select_dtypes(include='object').columns:
        # Length of each string in the column
        string_lengths = df[col].str.len()

        avg_length = string_lengths.mean()
        max_length = string_lengths.max()
        min_length = string_lengths.min()

        # idxmax/idxmin return index labels, so look them up with .loc
        max_length_string = df[col].loc[string_lengths.idxmax()]
        min_length_string = df[col].loc[string_lengths.idxmin()]

        stats[col] = {
            'average_length': avg_length,
            'max_length': max_length,
            'min_length': min_length,
            'example_max_length': max_length_string,
            'example_min_length': min_length_string
        }

    # One column per input column, one row per statistic
    return pd.DataFrame(stats)

# Call the function
string_stats_df = string_column_statistics(df)
print(string_stats_df)

Additional Context

Best regards,

Simon

@simonaubertbd simonaubertbd added the Enhancement and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Sep 26, 2024
@rhshadrach
Member

Thanks for the request. I'm curious about the use cases of wanting to know the min/max/average length of strings. In the examples you give, I view these as labels for which the length of the strings is not particularly important (e.g. What's in a name?).

cc @WillAyd

@rhshadrach rhshadrach added the Needs Discussion (Requires discussion from core team before further action) and Strings (String extension data type and string data) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Sep 30, 2024
@simonaubertbd
Author

simonaubertbd commented Sep 30, 2024

@rhshadrach Yeah, the example wasn't exactly a use case example, you're pretty right about that.

Now let's look at a few use cases:
-Financial accounts (I will take French accounting, I don't know about other countries): they must all have the same length, so min and max length have to be equal.
-French department numbers can be either 2 or 3 characters. I want to be sure none has 1 character or more than 3.
-Also, if I have a string field with 10 values, with min 0 and max 9, I can suppose it is worth a look to see whether I can convert it to integer.
-Also, about names, as in my previous example: there are studies about name length distribution, like https://www.researchgate.net/figure/First-names-and-last-names-lengths-distributions_fig1_328894441, and if my average length is really different (like 4 or 10), I may have some issues.
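The first three use cases above can be sketched with existing pandas calls. This is only an illustration: the data and column names are hypothetical, and the digits-only test is one possible reading of the integer-conversion case.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "account": ["411000", "401000", "512000", "60610"],  # last value is one character short
    "department": ["75", "971", "2A", "1"],              # "1" is suspicious
    "rating": ["0", "9", "4", "7"],                      # digits stored as text
})

# Length of every string, column by column.
lengths = df.apply(lambda col: col.str.len())

# Per-column min, max and mean length.
length_summary = lengths.agg(["min", "max", "mean"])

# Accounts should all have the same length, i.e. min == max.
accounts_uniform = (
    length_summary.loc["min", "account"] == length_summary.loc["max", "account"]
)

# Department codes should be 2 or 3 characters long.
bad_departments = df.loc[~lengths["department"].between(2, 3), "department"]

# A column containing only digits is a candidate for conversion to integer.
rating_is_numeric = df["rating"].str.isdigit().all()

print(length_summary)
```

Here the min and max rows of `length_summary` already reveal the short account number and the one-character department code at a glance.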

To add some personal context: I'm an old Alteryx user and this is a feature of their data investigation tools, very common and very useful, so I was surprised that describe doesn't cover it. Plus, there is this very nice project, Amphi, that aims to be a visual data preparation/ETL tool and that relies on Python, and I would like it to incorporate a data investigation tool. Having it all in describe would definitely help a lot.

Best regards and thanks for your prompt answer to my issue

Simon

@rhshadrach
Member

In the first two of your examples, it seems to me you wish to validate data. I think describe is meant to output summary statistics, the idea being that a user can get a sense of what data the DataFrame contains at a glance. I do not think we should expand on the API of this function for the purposes of data validation.

-Also, if I have a string field with 10 values, with min 0 and max 9, I can suppose it is worth a look to see whether I can convert it to integer.
-Also, about names, as in my previous example: there are studies about name length distribution, like https://www.researchgate.net/figure/First-names-and-last-names-lengths-distributions_fig1_328894441, and if my average length is really different (like 4 or 10), I may have some issues.

These seem quite uncommon uses to me.

I am negative on expanding the API here.

@simonaubertbd
Author

Hello @rhshadrach "In the first two of your examples, it seems to me you wish to validate data. I think describe is meant to output summary statistics, the idea being that a user can get a sense of what data the DataFrame contains at a glance"

Validating data would be another thing, like what happens if the field X doesn't follow the rule Y. Here, it is more in the spirit of: do I have surprises with this dataset, or is the data quality good? But it can also help for other purposes, like finding the max length of a string column in order to choose the right type when sending it to a database (a varchar(10) is not the same as a varchar(32)).
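As a rough sketch of the database sizing case, assuming the target type is written as VARCHAR(n): the data, the column names and the `varchar_types` mapping are hypothetical. Note that `str.len()` counts characters, not bytes, which matters if the database measures column size in bytes.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "code": ["A1", "B22", "C333"],
    "label": ["short", "a somewhat longer label", None],
})

# Longest string per column; missing values are skipped by max().
max_lengths = {col: int(df[col].str.len().max()) for col in df.columns}

# Hypothetical suggestion of a column type per column.
varchar_types = {col: f"VARCHAR({n})" for col, n in max_lengths.items()}
print(varchar_types)
```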

Moreover, the stated goal of the pandas describe function is:

Generate descriptive statistics.

And when I ask ChatGPT about it, here is the answer:

Why Generate Descriptive Statistics?

Understanding Data Distribution: Helps you understand the general shape and behavior of your data.
Detecting Outliers: Standard deviation, range, and IQR can help identify extreme values that may require special attention.
Summarizing Large Datasets: Allows you to condense complex datasets into understandable summaries, aiding in decision-making and analysis.
Data Cleaning: Helps detect potential issues like missing values, anomalies, or inconsistent data patterns.

So the 4th point (data cleaning) is not out of scope, as you can see.

Best regards,

Simon

@simonaubertbd
Author

Hello @rhshadrach, for your information, it was added to skimpy today: aeturrell/skimpy#840 (comment)
