These are the changes in pandas 3.0.0. See Release notes for a full changelog including other versions of pandas.
Enhancements#
Dedicated string data type by default#
Historically, pandas represented string columns with NumPy object data type. This representation has numerous problems: it is not specific to strings (any Python object can be stored in an object-dtype array, not just strings) and it is often not very efficient (both performance wise and for memory usage).
Starting with pandas 3.0, a dedicated string data type is enabled by default (backed by PyArrow under the hood, if installed, otherwise falling back to being backed by NumPy object-dtype). This means that pandas will start inferring columns containing string data as the new str data type when creating pandas objects, such as in constructors or IO functions.
Old behavior:
>>> ser = pd.Series(["a", "b"])
>>> ser
0 a
1 b
dtype: object
New behavior:
>>> ser = pd.Series(["a", "b"])
>>> ser
0 a
1 b
dtype: str
The string data type that is used in these scenarios will mostly behave as NumPy object would, including missing value semantics and general operations on these columns.
The main characteristics of the new string data type:
Inferred by default for string data (instead of object dtype)
The str dtype can only hold strings (or missing values), in contrast to object dtype; setting a non-string value raises an error.
The missing value sentinel is always NaN (np.nan) and follows the same missing value semantics as the other default dtypes.
Those intentional changes can have breaking consequences, for example when checking for the .dtype being object dtype or checking the exact missing value sentinel. See the Migration guide for the new string data type (pandas 3.0) for more details on the behaviour changes and how to adapt your code to the new default.
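For example, code that detected string columns by checking for object dtype may need updating. A minimal sketch, assuming pandas 3.0 with PyArrow installed (the exact exception raised on non-string setitem is not guaranteed here):
import pandas as pd

ser = pd.Series(["a", "b", None])
print(ser.dtype)  # "str" in pandas 3.0, "object" in earlier versions

# Check for string data without comparing against object dtype directly
print(pd.api.types.is_string_dtype(ser))

# The str dtype holds only strings or missing values, so non-string setitem fails
try:
    ser.iloc[0] = 123
except (TypeError, ValueError):
    print("non-string values are rejected")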
Copy-on-Write#
The new “copy-on-write” behaviour in pandas 3.0 brings changes in behavior in how pandas operates with respect to copies and views. A summary of the changes:
The result of any indexing operation (subsetting a DataFrame or Series in any way, i.e. including accessing a DataFrame column as a Series) or any method returning a new DataFrame or Series, always behaves as if it were a copy in terms of user API.
As a consequence, if you want to modify an object (DataFrame or Series), the only way to do this is to directly modify that object itself.
The main goal of this change is to make the user API more consistent and predictable. There is now a clear rule: any subset or returned series/dataframe always behaves as a copy of the original, and thus never modifies the original (before pandas 3.0, whether a derived object would be a copy or a view depended on the exact operation performed, which was often confusing).
Because every single indexing step now behaves as a copy, this also means that “chained assignment” (updating a DataFrame with multiple setitem steps) will stop working. Because this now consistently never works, the SettingWithCopyWarning is removed.
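A minimal sketch of the change, assuming pandas 3.0:
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained assignment: the first indexing step behaves as a copy, so the
# assignment never reaches df (pandas may emit a chained-assignment warning).
df[df["a"] > 1]["b"] = 10

# Modify the object itself in a single step instead:
df.loc[df["a"] > 1, "b"] = 10
print(df)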
The new behavioral semantics are explained in more detail in the user guide about Copy-on-Write.
A secondary goal is to improve performance by avoiding unnecessary copies. As mentioned above, every new DataFrame or Series returned from an indexing operation or method behaves as a copy, but under the hood pandas will use views as much as possible, and only copy when needed to guarantee the “behaves as a copy” behaviour (this is the actual “copy-on-write” mechanism used as an implementation detail).
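A minimal sketch of the "behaves as a copy" rule (assuming pandas 3.0): modifying a derived object never propagates back to its parent, even when a view is used internally.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

col = df["a"]      # may be a view under the hood, but behaves as a copy
col.iloc[0] = 100  # triggers a copy on write; df is left unchanged

print(col.iloc[0])     # 100
print(df.loc[0, "a"])  # still 1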
Some of the behaviour changes described above are breaking changes in pandas 3.0. When upgrading to pandas 3.0, it is recommended to first upgrade to pandas 2.3 to get deprecation warnings for a subset of those changes. The migration guide explains the upgrade process in more detail.
Setting the option mode.copy_on_write no longer has any impact. The option is deprecated and will be removed in pandas 4.0.
pd.col syntax can now be used in DataFrame.assign() and DataFrame.loc()#
You can now use pd.col to create callables for use in dataframe methods which accept them. For example, if you have a dataframe
In [1]: df = pd.DataFrame({'a': [1, 1, 2], 'b': [4, 5, 6]})
and you want to create a new column 'c' by summing 'a' and 'b', then instead of
In [2]: df.assign(c = lambda df: df['a'] + df['b'])
Out[2]:
a b c
0 1 4 5
1 1 5 6
2 2 6 8
you can now write:
In [3]: df.assign(c = pd.col('a') + pd.col('b'))
Out[3]:
a b c
0 1 4 5
1 1 5 6
2 2 6 8
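The heading above also mentions DataFrame.loc(). Because pd.col expressions evaluate to callables, the same syntax can be used for row selection; a short sketch, under the assumption that comparison expressions built from pd.col are accepted wherever a callable is:
df.loc[pd.col("a") == 1]
# equivalent to df.loc[lambda df: df["a"] == 1], selecting rows 0 and 1 above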
New Deprecation Policy#
pandas 3.0.0 introduces a new 3-stage deprecation policy: using DeprecationWarning initially, then switching to FutureWarning for broader visibility in the last minor version before the next major release, and then removal of the deprecated functionality in the major release. This was done to give downstream packages more time to adjust to pandas deprecations, which should reduce the amount of warnings that a user gets from code that isn’t theirs. See PDEP 17 for more details.
All warnings for upcoming changes in pandas will have the base class pandas.errors.PandasChangeWarning. Users may also use the following subclasses to control warnings (a short example follows the list).
pandas.errors.Pandas4Warning: Warnings which will be enforced in pandas 4.0.
pandas.errors.Pandas5Warning: Warnings which will be enforced in pandas 5.0.
pandas.errors.PandasPendingDeprecationWarning: Base class of all warnings which emit a PendingDeprecationWarning, independent of the version they will be enforced.
pandas.errors.PandasDeprecationWarning: Base class of all warnings which emit a DeprecationWarning, independent of the version they will be enforced.
pandas.errors.PandasFutureWarning: Base class of all warnings which emit a FutureWarning, independent of the version they will be enforced.
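For example, a standard warnings filter can target these classes; a minimal sketch:
import warnings
import pandas as pd

# Silence only the changes that will be enforced in pandas 4.0
warnings.filterwarnings("ignore", category=pd.errors.Pandas4Warning)

# Or, in a test suite, escalate every upcoming-change warning to an error
warnings.filterwarnings("error", category=pd.errors.PandasChangeWarning)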
Other enhancements#
pandas.NamedAgg now supports passing *args and **kwargs to calls of aggfunc (GH 58283)
pandas.merge() propagates the attrs attribute to the result if all inputs have identical attrs, as was already the case for pandas.concat().
pandas.api.typing.FrozenList is available for typing the outputs of MultiIndex.names, MultiIndex.codes and MultiIndex.levels (GH 58237)
pandas.api.typing.SASReader is available for typing the output of read_sas() (GH 55689)
Added Styler.to_typst() to write Styler objects to file, buffer or string in Typst format (GH 57617)
Added missing pandas.Series.info() to API reference (GH 60926)
pandas.api.typing.NoDefault is available for typing no_default
DataFrame.to_excel() now raises a UserWarning when the character count in a cell exceeds Excel’s limitation of 32767 characters (GH 56954)
pandas.merge() now validates the how parameter input (merge type) (GH 59435)
pandas.merge(), DataFrame.merge() and DataFrame.join() now support anti joins (left_anti and right_anti) in the how parameter (GH 42916); a sketch follows at the end of this list
read_spss() now supports kwargs to be passed to pyreadstat (GH 56356)
read_stata() now returns datetime64 resolutions better matching those natively stored in the stata format (GH 55642)
DataFrame.agg() called with axis=1 and a func which relabels the result index now raises a NotImplementedError (GH 58807).
Index.get_loc() now accepts also subclasses of tuple as keys (GH 57922)
Styler.set_tooltips() provides an alternative method for storing tooltips by using the title attribute of td elements (GH 56981)
Added missing parameter weights in DataFrame.plot.kde() for the estimation of the PDF (GH 59337)
Allow dictionaries to be passed to pandas.Series.str.replace() via pat parameter (GH 51748)
Support passing a Series input to json_normalize() that retains the Series Index (GH 51452)
Support reading value labels from Stata 108-format (Stata 6) and earlier files (GH 58154)
Users can globally disable any PerformanceWarning by setting the option mode.performance_warnings to False (GH 56920)
Styler.format_index_names() can now be used to format the index and column names (GH 48936 and GH 47489)
errors.DtypeWarning improved to include column names when mixed data types are detected (GH 58174)
Rolling and Expanding now support pipe method (GH 57076)
Series now supports the Arrow PyCapsule Interface for export (GH 59518)
DataFrame.to_excel() argument merge_cells now accepts a value of "columns" to only merge MultiIndex column header cells (GH 35384)
set_option() now accepts a dictionary of options, simplifying configuration of multiple settings at once (GH 61093)
DataFrame.corrwith() now accepts min_periods as an optional argument, as in DataFrame.corr() and Series.corr() (GH 9490)
DataFrame.cummin(), DataFrame.cummax(), DataFrame.cumprod() and DataFrame.cumsum() methods now have a numeric_only parameter (GH 53072)
DataFrame.ewm() now allows adjust=False when times is provided (GH 54328)
DataFrame.fillna() and Series.fillna() can now accept value=None; for non-object dtype the corresponding NA value will be used (GH 57723)
DataFrame.pivot_table() and pivot_table() now allow the passing of keyword arguments to aggfunc through **kwargs (GH 57884)
DataFrame.to_json() now encodes Decimal as strings instead of floats (GH 60698)
Series.cummin() and Series.cummax() now support CategoricalDtype (GH 52335)
Series.plot() now correctly handles the ylabel parameter for pie charts, allowing for explicit control over the y-axis label (GH 58239)
DataFrame.plot.scatter() argument c now accepts a column of strings, where rows with the same string are colored identically (GH 16827 and GH 16485)
Series.nlargest() uses a ‘stable’ sort internally and will preserve the original ordering.
ArrowDtype now supports pyarrow.JsonType (GH 60958)
DataFrameGroupBy and SeriesGroupBy methods sum, mean, median, prod, min, max, std, var and sem now accept skipna parameter (GH 15675)
Easter has gained a new constructor argument method which specifies the method used to calculate Easter — for example, Orthodox Easter (GH 61665)
Holiday constructor argument days_of_week will raise a ValueError when type is something other than None or tuple (GH 61658)
Holiday has gained the constructor argument and field exclude_dates to exclude specific datetimes from a custom holiday calendar (GH 54382)
Rolling and Expanding now support nunique (GH 26958)
Rolling and Expanding now support aggregations first and last (GH 33155)
DataFrame.to_excel() has a new autofilter parameter to add automatic filters to all columns (GH 61194)
read_parquet() accepts to_pandas_kwargs which are forwarded to pyarrow.Table.to_pandas() which enables passing additional keywords to customize the conversion to pandas, such as maps_as_pydicts to read the Parquet map data type as python dictionaries (GH 56842)
to_numeric() on big integers converts to object dtype with Python integers when not coercing (GH 51295)
DataFrameGroupBy.transform(), SeriesGroupBy.transform(), DataFrameGroupBy.agg(), SeriesGroupBy.agg(), SeriesGroupBy.apply(), DataFrameGroupBy.apply() now support kurt (GH 40139)
DataFrame.apply() supports using third-party execution engines like the Bodo.ai JIT compiler (GH 60668)
DataFrame.iloc() and Series.iloc() now support boolean masks in __getitem__ for more consistent indexing behavior (GH 60994)
DataFrame.to_csv() and Series.to_csv() now support Python’s new-style format strings (e.g., "{:.6f}") for the float_format parameter, in addition to old-style % format strings and callables. This allows for more flexible and modern formatting of floating point numbers when exporting to CSV. (GH 49580)
DataFrameGroupBy.transform(), SeriesGroupBy.transform(), DataFrameGroupBy.agg(), SeriesGroupBy.agg(), RollingGroupby.apply(), ExpandingGroupby.apply(), Rolling.apply(), Expanding.apply(), DataFrame.apply() with engine="numba" now supports positional arguments passed as kwargs (GH 58995)
Rolling.agg(), Expanding.agg() and ExponentialMovingWindow.agg() now accept NamedAgg aggregations through **kwargs (GH 28333)
Series.map() can now accept kwargs to pass on to func (GH 59814)
Series.map() now accepts an engine parameter to allow execution with a third-party execution engine (GH 61125)
Series.rank() and DataFrame.rank() with numpy-nullable dtypes preserve NA values and return UInt64 dtype where appropriate instead of casting NA to NaN with float64 dtype (GH 62043)
Series.str.get_dummies() now accepts a dtype parameter to specify the dtype of the resulting DataFrame (GH 47872)
pandas.concat() will raise a ValueError when ignore_index=True and keys is not None (GH 59274)
frozenset elements in pandas objects are now natively printed (GH 60690)
Added a "delete_rows" option to the if_exists argument in DataFrame.to_sql(), which deletes all records of the table before inserting data (GH 37210)
Added half-year offset classes HalfYearBegin, HalfYearEnd, BHalfYearBegin and BHalfYearEnd (GH 60928)
Added support for axis=1 with dict or Series arguments into DataFrame.fillna() (GH 4514)
Added support to read and write from and to Apache Iceberg tables with the new read_iceberg() and DataFrame.to_iceberg() functions (GH 61383)
Errors occurring during SQL I/O will now throw a generic DatabaseError instead of the raw Exception type from the underlying driver manager library (GH 60748)
Implemented Series.str.isascii() and Index.str.isascii() (GH 59091)
Improve error reporting by outputting the first few duplicates when merge() validation fails (GH 62742)
Improve the resulting dtypes in DataFrame.where() and DataFrame.mask() with ExtensionDtype other (GH 62038)
Improved deprecation message for offset aliases (GH 60820)
Many type aliases are now exposed in the new submodule pandas.api.typing.aliases (GH 55231)
Multiplying two DateOffset objects will now raise a TypeError instead of a RecursionError (GH 59442)
Restore support for reading Stata 104-format and enable reading 103-format dta files (GH 58554)
Support passing an Iterable[Hashable] input to DataFrame.drop_duplicates() (GH 59237)
Support reading Stata 102-format (Stata 1) dta files (GH 58978)
Support reading Stata 110-format (Stata 7) dta files (GH 47176)
Switched wheel upload to PyPI Trusted Publishing (OIDC) for release-tag pushes in wheels.yml. (GH 61718)
Added a new DataFrame.from_arrow() method to import any Arrow-compatible tabular data object into a pandas DataFrame through the Arrow PyCapsule Protocol (GH 59631)
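To illustrate one of the entries above, the new anti-join options keep the rows of one frame whose keys do not appear in the other; a minimal sketch with hypothetical data:
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [4, 5, 6]})

# Keep only the rows of `left` whose key is absent from `right`
only_left = left.merge(right, on="key", how="left_anti")
print(only_left)  # the "a" row only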
Notable bug fixes#
These are bug fixes that might have notable behavior changes.
Improved behavior in groupby for observed=False#
A number of bugs have been fixed due to improved handling of unobserved groups (GH 55738). All remarks in this section equally impact SeriesGroupBy.
In previous versions of pandas, a single grouping with DataFrameGroupBy.apply() or DataFrameGroupBy.agg() would pass the unobserved groups to the provided function, resulting in 0 below.
In [4]: df = pd.DataFrame(
...: {
...: "key1": pd.Categorical(list("aabb"), categories=list("abc")),
...: "key2": [1, 1, 1, 2],
...: "values": [1, 2, 3, 4],
...: }
...: )
...:
In [5]: df
Out[5]:
key1 key2 values
0 a 1 1
1 a 1 2
2 b 1 3
3 b 2 4
In [6]: gb = df.groupby("key1", observed=False)
In [7]: gb[["values"]].apply(lambda x: x.sum())
Out[7]:
values
key1
a 3
b 7
c 0
However this was not the case when using multiple groupings, resulting in NaN below.
In [1]: gb = df.groupby(["key1", "key2"], observed=False)
In [2]: gb[["values"]].apply(lambda x: x.sum())
Out[2]:
values
key1 key2
a 1 3.0
2 NaN
b 1 3.0
2 4.0
c 1 NaN
2 NaN
Now using multiple groupings will also pass the unobserved groups to the provided function.
In [8]: gb = df.groupby(["key1", "key2"], observed=False)
In [9]: gb[["values"]].apply(lambda x: x.sum())
Out[9]:
values
key1 key2
a 1 3
2 0
b 1 3
2 4
c 1 0
2 0
Similarly:
In previous versions of pandas, the method DataFrameGroupBy.sum() would result in 0 for unobserved groups, but DataFrameGroupBy.prod(), DataFrameGroupBy.all(), and DataFrameGroupBy.any() would all result in NA values. Now these methods result in 1, True, and False respectively (see the sketch below).
DataFrameGroupBy.groups() did not include unobserved groups and now does.
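For example, continuing with the df defined above (a sketch; the unobserved category "c" now receives each reduction's identity value):
gb = df.groupby("key1", observed=False)
gb["values"].sum()   # "c" -> 0
gb["values"].prod()  # "c" -> 1 (previously NaN)
gb["values"].any()   # "c" -> False (previously NaN)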
These improvements also fixed certain bugs in groupby:
DataFrameGroupBy.agg() would fail when there are multiple groupings, unobserved groups, and as_index=False (GH 36698)
DataFrameGroupBy.groups() with sort=False would sort groups; they now occur in the order they are observed (GH 56966)
DataFrameGroupBy.nunique() would fail when there are multiple groupings, unobserved groups, and as_index=False (GH 52848)
DataFrameGroupBy.sum() would have incorrect values when there are multiple groupings, unobserved groups, and non-numeric data (GH 43891)
DataFrameGroupBy.value_counts() would produce incorrect results when used with some categorical and some non-categorical groupings and observed=False (GH 56016)
Backwards incompatible API changes#
Datetime resolution inference#
Converting a sequence of strings, datetime objects, or np.datetime64 objects to a datetime64 dtype now performs inference on the appropriate resolution (AKA unit) for the output dtype. This affects Series, DataFrame, Index, DatetimeIndex, and to_datetime().
Previously, these would always give nanosecond resolution:
In [1]: dt = pd.Timestamp("2024-03-22 11:36").to_pydatetime()
In [2]: pd.to_datetime([dt]).dtype
Out[2]: dtype('<M8[ns]')
In [3]: pd.Index([dt]).dtype
Out[3]: dtype('<M8[ns]')
In [4]: pd.DatetimeIndex([dt]).dtype
Out[4]: dtype('<M8[ns]')
In [5]: pd.Series([dt]).dtype
Out[5]: dtype('<M8[ns]')
This now infers the microsecond unit “us” from the pydatetime object, matching the scalar Timestamp behavior.
In [10]: dt = pd.Timestamp("2024-03-22 11:36").to_pydatetime()
In [11]: pd.to_datetime([dt]).dtype
Out[11]: dtype('<M8[us]')
In [12]: pd.Index([dt]).dtype
Out[12]: dtype('<M8[us]')
In [13]: pd.DatetimeIndex([dt]).dtype
Out[13]: dtype('<M8[us]')
In [14]: pd.Series([dt]).dtype
Out[14]: dtype('<M8[us]')
Similarly, when passing a sequence of np.datetime64 objects, the resolution of the passed objects will be retained (or for lower-than-second resolutions, second resolution will be used).
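A short sketch of this rule (assuming pandas 3.0):
import numpy as np
import pandas as pd

pd.to_datetime([np.datetime64("2024-03-22T11:43:01.002", "ms")]).dtype
# dtype('<M8[ms]') -- the millisecond resolution of the input is retained

pd.to_datetime([np.datetime64("2024-03-22", "D")]).dtype
# dtype('<M8[s]') -- day resolution is below seconds, so second resolution is used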
When passing strings, the resolution will depend on the precision of the string, again matching the Timestamp behavior. Previously:
In [2]: pd.to_datetime(["2024-03-22 11:43:01"]).dtype
Out[2]: dtype('<M8[ns]')
In [3]: pd.to_datetime(["2024-03-22 11:43:01.002"]).dtype
Out[3]: dtype('<M8[ns]')
In [4]: pd.to_datetime(["2024-03-22 11:43:01.002003"]).dtype
Out[4]: dtype('<M8[ns]')
In [5]: pd.to_datetime(["2024-03-22 11:43:01.002003004"]).dtype
Out[5]: dtype('<M8[ns]')
For nanosecond-precision strings the inferred resolution now matches the input; otherwise it defaults to microseconds:
In [15]: pd.to_datetime(["2024-03-22 11:43:01"]).dtype
Out[15]: dtype('<M8[us]')
In [16]: pd.to_datetime(["2024-03-22 11:43:01.002"]).dtype
Out[16]: dtype('<M8[us]')
In [17]: pd.to_datetime(["2024-03-22 11:43:01.002003"]).dtype
Out[17]: dtype('<M8[us]')
In [18]: pd.to_datetime(["2024-03-22 11:43:01.002003004"]).dtype
Out[18]: dtype('<M8[ns]')
This is also a change for the Timestamp constructor with a string input, which in pandas 2.x could give second or millisecond unit, a behavior that users generally disliked (GH 52653).
In cases with mixed-resolution inputs, the highest resolution is used:
In [2]: pd.to_datetime([pd.Timestamp("2024-03-22 11:43:01"), "2024-03-22 11:43:01.002"]).dtype
Out[2]: dtype('<M8[ns]')
Warning
Many users will now get “M8[us]” dtype data in cases where they used to get “M8[ns]”. For most use cases they should not notice a difference. One big exception is converting to integers, which will give integers 1000x smaller.
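For example, a sketch of the integer-conversion difference (assuming pandas 3.0):
import pandas as pd

idx = pd.to_datetime(["2024-03-22 11:43:01"])  # now datetime64[us]

idx.astype("int64")                            # microseconds since the epoch, 1000x smaller than before
idx.astype("datetime64[ns]").astype("int64")   # cast first to recover nanosecond-scale integers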
Similarly, the Timedelta constructor and to_timedelta() with a string input now defaults to a microsecond unit, using nanosecond unit only in cases that actually have nanosecond precision.
concat() no longer ignores sort when all objects have a DatetimeIndex#
When all objects passed to concat() have a DatetimeIndex, passing sort=False will now result in the non-concatenation axis not being sorted. Previously, the result would always be sorted along the non-concatenation axis even when sort=False was passed (GH 57335).
If you do not specify the sort argument, pandas will continue to return a sorted result but this behavior is deprecated and you will receive a warning. In order to make this less noisy for users, pandas checks if not sorting would impact the result and only warns when it would. This check can be expensive, and users can skip the check by explicitly specifying sort=True or sort=False.
This deprecation can also impact pandas’ internal usage of concat(). Here cases where concat() was sorting a DatetimeIndex but not other indexes are considered bugs and have been fixed as noted below. However it is possible some have been missed. In order to be cautious here, pandas has not added sort=False to any internal calls where we believe behavior should not change. If we have missed something, users will not experience a behavior change but they will receive a warning about concat() even though they are not directly calling this function. If this does occur, we ask users to open an issue so that we may address any potential behavior changes.
In [19]: idx1 = pd.date_range("2025-01-02", periods=3, freq="h")
In [20]: df1 = pd.DataFrame({"a": [1, 2, 3]}, index=idx1)
In [21]: df1
Out[21]:
a
2025-01-02 00:00:00 1
2025-01-02 01:00:00 2
2025-01-02 02:00:00 3
In [22]: idx2 = pd.date_range("2025-01-01", periods=3, freq="h")
In [23]: df2 = pd.DataFrame({"b": [1, 2, 3]}, index=idx2)
In [24]: df2
Out[24]:
b
2025-01-01 00:00:00 1
2025-01-01 01:00:00 2
2025-01-01 02:00:00 3
Old behavior
In [3]: pd.concat([df1, df2], axis=1, sort=False)
Out[3]:
a b
2025-01-01 00:00:00 NaN 1.0
2025-01-01 01:00:00 NaN 2.0
2025-01-01 02:00:00 NaN 3.0
2025-01-02 00:00:00 1.0 NaN
2025-01-02 01:00:00 2.0 NaN
2025-01-02 02:00:00 3.0 NaN
New behavior
In [25]: pd.concat([df1, df2], axis=1, sort=False)
Out[25]:
a b
2025-01-02 00:00:00 1.0 NaN
2025-01-02 01:00:00 2.0 NaN
2025-01-02 02:00:00 3.0 NaN
2025-01-01 00:00:00 NaN 1.0
2025-01-01 01:00:00 NaN 2.0
2025-01-01 02:00:00 NaN 3.0
Cases where pandas’ internal usage of concat() resulted in inconsistent sorting that are now fixed in this release are as follows.
Series.apply() and DataFrame.apply() with a list-like or dict-like func argument.
Series.shift(), DataFrame.shift(), SeriesGroupBy.shift(), DataFrameGroupBy.shift() with the periods argument a list of length greater than 1.
DataFrame.join() with other a list of one or more Series or DataFrames and how="inner", how="left", or how="right".
Series.str.cat() with others a Series or DataFrame.
Changed behavior in DataFrame.value_counts() and DataFrameGroupBy.value_counts() when sort=False#
In previous versions of pandas, DataFrame.value_counts() with sort=False would sort the result by row labels (as was documented). This was nonintuitive and inconsistent with Series.value_counts() which would maintain the order of the input. Now DataFrame.value_counts() will maintain the order of the input.
In [26]: df = pd.DataFrame(
....: {
....: "a": [2, 2, 2, 2, 1, 1, 1, 1],
....: "b": [2, 1, 3, 1, 2, 3, 1, 1],
....: }
....: )
....:
In [27]: df
Out[27]:
a b
0 2 2
1 2 1
2 2 3
3 2 1
4 1 2
5 1 3
6 1 1
7 1 1
Old behavior
In [3]: df.value_counts(sort=False)
Out[3]:
a b
1 1 2
2 1
3 1
2 1 2
2 1
3 1
Name: count, dtype: int64
New behavior
In [28]: df.value_counts(sort=False)
Out[28]:
a b
2 2 1
1 2
3 1
1 2 1
3 1
1 2
Name: count, dtype: int64
This change also applies to DataFrameGroupBy.value_counts(). Here, there are two options for sorting: one sort passed to DataFrame.groupby() and one passed directly to DataFrameGroupBy.value_counts().