ENH: Describe : add shortest, longest, avg/max/min length #59897

simonaubertbd · 2024-09-26T05:17:29Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

Hello,

As of now, describe is mainly oriented toward numerical analysis. It is less useful when the columns hold text (string) values.
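For context, here is a minimal sketch of what describe reports for a string column today. The sample data is illustrative only, and the point is simply that none of the four statistics relates to string length.

```python
import pandas as pd

# Illustrative sample data.
names = pd.Series(["Alice", "Bob", "Charlie", "David"], name="name")

# For a string column, describe() reports count, unique, top and freq;
# none of these says anything about string lengths.
summary = names.describe()
print(list(summary.index))
```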

Feature Description

Add five statistics dedicated to string analysis for each relevant column:
-avg length
-max length
-min length
-shortest: one of the strings with the minimum length
-longest: one of the strings with the maximum length

Alternative Solutions

Writing something like the following works, but it means more work for the user:

import pandas as pd

# Sample DataFrame for illustration
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['New York', 'Los Angeles', 'San Francisco', 'Chicago'],
    'country': ['USA', 'USA', 'USA', 'USA']
}

df = pd.DataFrame(data)

# Function to get string statistics
def string_column_statistics(df):
    stats = {}

    for col in df.select_dtypes(include='object').columns:
        # Length of each string in the column
        string_lengths = df[col].str.len()

        avg_length = string_lengths.mean()
        max_length = string_lengths.max()
        min_length = string_lengths.min()

        # One example string of maximum and of minimum length
        max_length_string = df[col].loc[string_lengths.idxmax()]
        min_length_string = df[col].loc[string_lengths.idxmin()]

        stats[col] = {
            'average_length': avg_length,
            'max_length': max_length,
            'min_length': min_length,
            'example_max_length': max_length_string,
            'example_min_length': min_length_string
        }

    return pd.DataFrame(stats)

# Call the function
string_stats_df = string_column_statistics(df)
print(string_stats_df)

Additional Context

Best regards,

Simon


rhshadrach · 2024-09-30T20:47:37Z

Thanks for the request. I'm curious about the use cases of wanting to know the min/max/average length of strings. In the examples you give, I view these as labels for which the length of the strings is not particularly important (e.g. What's in a name?).

cc @WillAyd

simonaubertbd · 2024-09-30T21:17:34Z

@rhshadrach Yeah, the example wasn't really a use case; you're right about that.

Now, here are a few use cases:
-Financial accounts (I will take French accounting; I don't know about other countries). Account numbers must all have the same length, so the min and max length have to be equal.
-French department numbers can be either 2 or 3 characters long. I want to be sure that none has 1 character or more than 3.
-Also, if I have a string field with 10 values, with min 0 and max 9, I can suppose it is worth a look to see whether I can convert it to integer.
-Also, about names, as in my previous example: there are studies of name length distributions, like https://www.researchgate.net/figure/First-names-and-last-names-lengths-distributions_fig1_328894441, and if my average length is really different (like 4 or 10), I may have some issues.
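The first two use cases can be sketched with length statistics computed by hand. This is only an illustration under assumptions: the account and department values are invented sample data, and the checks are one possible reading of the rules described above, not an existing describe feature.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "account": ["401100", "401200", "60610", "512000"],
    "department": ["75", "971", "2A", "1"],
})

# Length of every string, column by column.
lengths = df.apply(lambda col: col.str.len())

# Min, max and mean length per column.
summary = lengths.agg(["min", "max", "mean"])
print(summary)

# Accounts should all share one length: min != max flags a problem.
accounts_consistent = summary.loc["min", "account"] == summary.loc["max", "account"]

# Department codes should be 2 or 3 characters long.
bad_departments = df.loc[~lengths["department"].between(2, 3), "department"]
print(accounts_consistent, bad_departments.tolist())
```

With min and max length shown next to the other summary rows, both anomalies in this sample (one short account number, one 1-character department code) would be visible at a glance.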

To add some personal context: I'm a longtime Alteryx user, and this is a feature of their data investigation tools, very common and very useful, so I was surprised that describe doesn't cover it. Plus, there is this very nice project, Amphi, that aims to be a visual data preparation/ETL tool and relies on Python, and I would like it to incorporate a data investigation tool. Having it all in describe would definitely help a lot.

Best regards and thanks for your prompt answer to my issue

Simon

rhshadrach · 2024-09-30T21:37:33Z

In the first two of your examples, it seems to me you wish to validate data. I think describe is meant to output summary statistics, the idea being that a user can get a sense of what data the DataFrame contains at a glance. I do not think we should expand on the API of this function for the purposes of data validation.

"-Also, if I have a string field with 10 values, with min 0 and max 9, I can suppose it is worth a look to see whether I can convert it to integer.
-Also, about names, as in my previous example: there are studies of name length distributions, like https://www.researchgate.net/figure/First-names-and-last-names-lengths-distributions_fig1_328894441, and if my average length is really different (like 4 or 10), I may have some issues."

These seem quite uncommon uses to me.

I am negative on expanding the API here.

simonaubertbd · 2024-10-01T04:56:25Z

Hello @rhshadrach "In the first two of your examples, it seems to me you wish to validate data. I think describe is meant to output summary statistics, the idea being that a user can get a sense of what data the DataFrame contains at a glance"

Validating data would be another thing, like what happens if field X doesn't follow rule Y. Here, it is more in the spirit of: do I have surprises with this dataset, or is the data quality good? But it can also help for other purposes, like finding the maximum string length in order to choose the right type when sending the data to a database (a varchar(10) is not the same as a varchar(32)).
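As a rough sketch of the database use case, the maximum string length per column can be turned into a suggested column type. The helper name suggest_varchar_sizes is hypothetical, and the sketch assumes every listed column contains at least one non-missing string.

```python
import pandas as pd

# Illustrative sample data.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "city": ["New York", "Los Angeles", "San Francisco", "Chicago"],
})

def suggest_varchar_sizes(frame, columns):
    """Hypothetical helper: map each column to a VARCHAR(n) sized to its longest string."""
    sizes = {}
    for col in columns:
        # Assumes the column has at least one non-missing string;
        # otherwise max() is NaN and int() would fail.
        max_len = frame[col].str.len().max()
        sizes[col] = f"VARCHAR({int(max_len)})"
    return sizes

result = suggest_varchar_sizes(df, ["name", "city"])
print(result)
```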

Moreover, the stated goal of the pandas describe function is:

Generate descriptive statistics.

And when I ask ChatGPT about it, here is the answer:

Why Generate Descriptive Statistics?

Understanding Data Distribution: Helps you understand the general shape and behavior of your data.
Detecting Outliers: Standard deviation, range, and IQR can help identify extreme values that may require special attention.
Summarizing Large Datasets: Allows you to condense complex datasets into understandable summaries, aiding in decision-making and analysis.
Data Cleaning: Helps detect potential issues like missing values, anomalies, or inconsistent data patterns.

So the 4th point, data cleaning, is not out of scope, as you can see.

Best regards,

Simon

simonaubertbd · 2024-12-18T21:54:15Z

Hello @rhshadrach, for your information, it was added to skimpy today: aeturrell/skimpy#840 (comment)

simonaubertbd added the Enhancement and Needs Triage (issue that has not been reviewed by a pandas team member) labels on Sep 26, 2024

rhshadrach added the Needs Discussion (requires discussion from core team before further action) and Strings (string extension data type and string data) labels and removed the Needs Triage label on Sep 30, 2024
