pyspark.pandas.groupby.GroupBy.median
GroupBy.median(numeric_only: bool = True, accuracy: int = 10000) → FrameLike
Compute median of groups, excluding missing values.
When grouping by multiple keys, the result index is a MultiIndex.
Note
Unlike pandas, pandas-on-Spark returns an approximate median based on approximate percentile computation, because computing an exact median across a large dataset is extremely expensive.
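To give a feel for what the approximation can mean in practice, the sketch below estimates how far the returned value may sit from the true median in rank terms. It assumes, following Spark's description of approximate percentiles, that 1.0 / accuracy is the relative error of the approximation. The helper names `relative_error` and `max_rank_error` are hypothetical and are not part of pandas-on-Spark.

```python
def relative_error(accuracy: int = 10000) -> float:
    """Relative error implied by an accuracy setting (assumed to be 1.0 / accuracy)."""
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return 1.0 / accuracy


def max_rank_error(num_rows: int, accuracy: int = 10000) -> float:
    """Rough bound on how many sorted positions the result may be away from the true median."""
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return num_rows / accuracy


print(relative_error())                         # 0.0001 with the default accuracy
print(max_rank_error(1_000_000))                # 100.0
print(max_rank_error(1_000_000, accuracy=100))  # 10000.0
```

Under this assumption, a larger `accuracy` tightens the bound, typically at the cost of more memory during aggregation.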
- Parameters
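- numeric_only : bool, default True
Include only float, int, and boolean columns. False is not supported. This parameter exists mainly for pandas compatibility.
- accuracy : int, default 10000
Accuracy of the approximation. A larger value gives better accuracy; the relative error is 1.0 / accuracy.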
- Returns
- Series or DataFrame
Median of values within each group.
Examples
>>> import pyspark.pandas as ps
>>> psdf = ps.DataFrame({'a': [1., 1., 1., 1., 2., 2., 2., 3., 3., 3.],
...                      'b': [2., 3., 1., 4., 6., 9., 8., 10., 7., 5.],
...                      'c': [3., 5., 2., 5., 1., 2., 6., 4., 3., 6.]},
...                     columns=['a', 'b', 'c'],
...                     index=[7, 2, 4, 1, 3, 4, 9, 10, 5, 6])
>>> psdf
      a     b    c
7   1.0   2.0  3.0
2   1.0   3.0  5.0
4   1.0   1.0  2.0
1   1.0   4.0  5.0
3   2.0   6.0  1.0
4   2.0   9.0  2.0
9   2.0   8.0  6.0
10  3.0  10.0  4.0
5   3.0   7.0  3.0
6   3.0   5.0  6.0
DataFrameGroupBy
>>> psdf.groupby('a').median().sort_index()
       b    c
a
1.0  2.0  3.0
2.0  8.0  2.0
3.0  7.0  4.0
SeriesGroupBy
>>> psdf.groupby('a')['b'].median().sort_index()
a
1.0    2.0
2.0    8.0
3.0    7.0
Name: b, dtype: float64
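In the output above, group 1.0 of column b reports 2.0, whereas an exact median of the four values 2, 3, 1, 4 would be 2.5. The plain-Python sketch below reproduces both numbers with the standard library only. It is an illustration, not Spark's algorithm: it assumes that, for this small dataset, the approximate result coincides with the lower of the two middle values, which matches the output shown. The names `groups`, `exact`, and `lower_middle` are made up for this example.

```python
from statistics import median, median_low

# Values of column 'b' per group 'a', copied from the example DataFrame.
groups = {
    1.0: [2.0, 3.0, 1.0, 4.0],
    2.0: [6.0, 9.0, 8.0],
    3.0: [10.0, 7.0, 5.0],
}

# Exact median: averages the two middle values when the group size is even.
exact = {key: median(values) for key, values in groups.items()}

# Lower middle value: always an element that actually occurs in the group.
lower_middle = {key: median_low(values) for key, values in groups.items()}

for key in sorted(groups):
    print(key, exact[key], lower_middle[key])
```

The two results differ only for group 1.0, the one group with an even number of rows; for odd-sized groups the middle element is the median either way.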