These are the changes in pandas 3.0.0. See Release notes for a full changelog including other versions of pandas.
Note
The pandas 3.0 release removed a lot of functionality that was deprecated in previous releases (see below for an overview). It is recommended to first upgrade to pandas 2.3 and to ensure your code is working without warnings, before upgrading to pandas 3.0.
Enhancements#
Dedicated string data type by default#
Historically, pandas represented string columns with NumPy object data type. This representation has numerous problems: it is not specific to strings (any Python object can be stored in an object-dtype array, not just strings) and it is often not very efficient (both performance wise and for memory usage).
Starting with pandas 3.0, a dedicated string data type is enabled by default (backed by PyArrow under the hood, if installed, otherwise falling back to being backed by NumPy object-dtype). This means that pandas will start inferring columns containing string data as the new str data type when creating pandas objects, such as in constructors or IO functions.
Old behavior:
>>> ser = pd.Series(["a", "b"])
>>> ser
0    a
1    b
dtype: object
New behavior:
>>> ser = pd.Series(["a", "b"])
>>> ser
0    a
1    b
dtype: str
The string data type that is used in these scenarios will mostly behave as NumPy object would, including missing value semantics and general operations on these columns.
The main characteristics of the new string data type:
Inferred by default for string data (instead of object dtype)
The str dtype can only hold strings (or missing values), in contrast to object dtype; setting a non-string value into a str column raises an error.
The missing value sentinel is always NaN (np.nan) and follows the same missing value semantics as the other default dtypes.
Those intentional changes can have breaking consequences, for example when checking for the .dtype being object dtype or checking the exact missing value sentinel. See the Migration guide for the new string data type (pandas 3.0) for more details on the behaviour changes and how to adapt your code to the new default.
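For example, code that checked for object dtype to detect string columns will behave differently. A minimal sketch of the kind of check affected, assuming the new default string dtype is enabled:

ser = pd.Series(["a", "b"])

ser.dtype == object                        # False in pandas 3.0 (previously True)
ser.dtype == "str"                         # True with the new default
pd.api.types.is_string_dtype(ser.dtype)    # True for the new str dtype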
Consistent copy/view behaviour with Copy-on-Write#
The new “copy-on-write” behaviour in pandas 3.0 changes how pandas operates with respect to copies and views. A summary of the changes:
The result of any indexing operation (subsetting a DataFrame or Series in any way, i.e. including accessing a DataFrame column as a Series) or any method returning a new DataFrame or Series, always behaves as if it were a copy in terms of the user API.
As a consequence, if you want to modify an object (DataFrame or Series), the only way to do this is to directly modify that object itself.
The main goal of this change is to make the user API more consistent and predictable. There is now a clear rule: any subset or returned series/dataframe always behaves as a copy of the original, and thus never modifies the original (before pandas 3.0, whether a derived object would be a copy or a view depended on the exact operation performed, which was often confusing).
Because every single indexing step now behaves as a copy, this also means that “chained assignment” (updating a DataFrame with multiple setitem steps) will stop working. Because this now consistently never works, the SettingWithCopyWarning is removed, and defensive .copy() calls to silence the warning are no longer needed.
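As an illustration, a minimal sketch of the pattern that stops working and its replacement (the column names are illustrative):

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained assignment: the first getitem behaves as a copy, so the second
# setitem no longer reaches df itself.
df["a"][df["b"] > 4] = 0

# Instead, modify the object directly with a single setitem:
df.loc[df["b"] > 4, "a"] = 0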
The new behavioral semantics are explained in more detail in the user guide about Copy-on-Write.
A secondary goal is to improve performance by avoiding unnecessary copies. As mentioned above, every new DataFrame or Series returned from an indexing operation or method behaves as a copy, but under the hood pandas will use views as much as possible, and only copy when needed to guarantee the “behaves as a copy” behaviour (this is the actual “copy-on-write” mechanism used as an implementation detail).
Some of the behaviour changes described above are breaking changes in pandas 3.0. When upgrading to pandas 3.0, it is recommended to first upgrade to pandas 2.3 to get deprecation warnings for a subset of those changes. The migration guide explains the upgrade process in more detail.
Setting the option mode.copy_on_write no longer has any impact. The option is deprecated and will be removed in pandas 4.0.
Initial support for pd.col() syntax to create expressions#
This release introduces col() to refer to a DataFrame column by name and build up expressions.
This can be used as a simplified syntax to create callables for use in methods such as DataFrame.assign(). In practice, where you would have to use a lambda function before, you can now use pd.col() instead.
For example, if you have a dataframe
In [1]: df = pd.DataFrame({'a': [1, 1, 2], 'b': [4, 5, 6]})
and you want to create a new column 'c' by summing 'a' and 'b', then instead of
In [2]: df.assign(c = lambda df: df['a'] + df['b'])
Out[2]:
a b c
0 1 4 5
1 1 5 6
2 2 6 8
you can now write:
In [3]: df.assign(c = pd.col('a') + pd.col('b'))
Out[3]:
a b c
0 1 4 5
1 1 5 6
2 2 6 8
The expression object returned by col() supports all standard operators (like +, -, *, /, etc.) and all Series methods and namespaces (like pd.col("name").sum(), pd.col("name").str.upper(), etc.).
Currently, the pd.col() syntax can be used in any place which accepts a callable that takes the calling DataFrame as first argument and returns a Series, like lambda df: df[col_name]. This includes DataFrame.assign(), DataFrame.loc(), and getitem/setitem.
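For instance, a minimal sketch reusing the df from above, where each line replaces an equivalent lambda:

# Boolean row selection, instead of df[lambda d: d["a"] == 1]:
df[pd.col("a") == 1]

# Expressions inside .loc:
df.loc[pd.col("b") > 4, "a"]

# Series methods are available on the expression, e.g. a broadcast sum:
df.assign(total=pd.col("b").sum())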
It is expected that the support for pd.col() will be expanded to more methods in future releases.
Support for the Arrow PyCapsule Interface#
The Arrow C data interface allows moving data between different DataFrame libraries through the Arrow format, and is designed to be zero-copy where possible. In Python, this interface is exposed through the Arrow PyCapsule Protocol.
DataFrame and Series now support the Arrow PyCapsule Interface for both export and import of data (GH 56587, GH 63208, GH 59518, GH 59631).
The dedicated DataFrame.from_arrow() and Series.from_arrow() methods have been added to import any Arrow-compatible data object into a pandas object through this interface.
For export, DataFrame and Series implement the Arrow C stream interface through the __arrow_c_stream__ method.
These methods currently rely on pyarrow to convert the tabular object between the Arrow format and pandas.
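For example, a minimal sketch of a round trip through the interface, using a pyarrow.Table as the Arrow-compatible object (pyarrow must be installed):

import pandas as pd
import pyarrow as pa

# Import: any object exposing the Arrow PyCapsule interface can be consumed.
tbl = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df = pd.DataFrame.from_arrow(tbl)

# Export: pandas objects expose __arrow_c_stream__, so Arrow-aware libraries
# can consume them again, for example converting back with pyarrow:
roundtripped = pa.table(df)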
Updated deprecation policy#
pandas 3.0.0 updates the deprecation policy to clarify which deprecation warnings will be issued, using a new 3-stage policy: using DeprecationWarning initially, then switching to FutureWarning for broader visibility in the last minor version before the next major release, and then removal of the deprecated functionality in the major release. This was done to give downstream packages more time to adjust to pandas deprecations, which should reduce the amount of warnings that a user gets from code that isn’t theirs. See PDEP 17 for more details.
All warnings for upcoming changes in pandas will have the base class pandas.errors.PandasChangeWarning. Users may also use the following subclasses to control warnings.
pandas.errors.Pandas4Warning: Warnings which will be enforced in pandas 4.0.
pandas.errors.Pandas5Warning: Warnings which will be enforced in pandas 5.0.
pandas.errors.PandasPendingDeprecationWarning: Base class of all warnings which emit a PendingDeprecationWarning, independent of the version in which they will be enforced.
pandas.errors.PandasDeprecationWarning: Base class of all warnings which emit a DeprecationWarning, independent of the version in which they will be enforced.
pandas.errors.PandasFutureWarning: Base class of all warnings which emit a FutureWarning, independent of the version in which they will be enforced.
Deprecations added in 3.x using pandas.errors.Pandas4Warning will initially inherit from pandas.errors.PandasDeprecationWarning. In the last minor release of the 3.x series, these deprecations will switch to inherit from pandas.errors.PandasFutureWarning for broader visibility.
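For instance, a minimal sketch of how these classes can be combined with the standard warnings module (the specific filters are illustrative):

import warnings

import pandas as pd

# Silence only the deprecations that will be enforced in pandas 4.0:
warnings.filterwarnings("ignore", category=pd.errors.Pandas4Warning)

# Or turn every upcoming-change warning into an error while running a test suite:
warnings.filterwarnings("error", category=pd.errors.PandasChangeWarning)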
Other enhancements#
I/O:
errors.DtypeWarning improved to include column names when mixed data types are detected (GH 58174)
DataFrame.to_excel() argument merge_cells now accepts a value of "columns" to only merge MultiIndex column header cells (GH 35384)
DataFrame.to_excel() has a new autofilter parameter to add automatic filters to all columns (GH 61194)
DataFrame.to_excel() now raises a UserWarning when the character count in a cell exceeds Excel’s limitation of 32767 characters (GH 56954)
read_parquet() accepts to_pandas_kwargs, which are forwarded to pyarrow.Table.to_pandas(), enabling additional keywords to customize the conversion to pandas, such as maps_as_pydicts to read the Parquet map data type as Python dictionaries (GH 56842)
read_spss() now supports kwargs to be passed to pyreadstat (GH 56356)
read_stata() now returns datetime64 resolutions better matching those natively stored in the stata format (GH 55642)
DataFrame.to_csv() and Series.to_csv() now support f-strings (e.g., "{:.6f}") for the float_format parameter, in addition to the % format strings and callables (GH 49580)
DataFrame.to_json() now encodes Decimal as strings instead of floats (GH 60698)
Added "delete_rows" option to if_exists argument in DataFrame.to_sql() deleting all records of the table before inserting data (GH 37210).
Added support to read and write from and to Apache Iceberg tables with the new read_iceberg() and DataFrame.to_iceberg() functions (GH 61383)
Errors occurring during SQL I/O will now throw a generic DatabaseError instead of the raw Exception type from the underlying driver manager library (GH 60748)
Restore support for reading Stata 104-format and enable reading 103-format dta files (GH 58554)
Support reading Stata 102-format (Stata 1) dta files (GH 58978)
Support reading Stata 110-format (Stata 7) dta files (GH 47176)
Support reading value labels from Stata 108-format (Stata 6) and earlier files (GH 58154)
Groupby/resample/rolling:
pandas.NamedAgg now supports passing *args and **kwargs to calls of aggfunc (GH 58283)
DataFrameGroupBy and SeriesGroupBy methods sum, mean, median, prod, min, max, std, var and sem now accept skipna parameter (GH 15675)
DataFrameGroupBy.transform(), SeriesGroupBy.transform(), DataFrameGroupBy.agg(), SeriesGroupBy.agg(), RollingGroupby.apply(), ExpandingGroupby.apply(), Rolling.apply(), Expanding.apply(), DataFrame.apply() with engine="numba" now support positional arguments passed as kwargs (GH 58995)
DataFrameGroupBy.transform(), SeriesGroupBy.transform(), DataFrameGroupBy.agg(), SeriesGroupBy.agg(), SeriesGroupBy.apply(), DataFrameGroupBy.apply() now support kurt (GH 40139)
Rolling.aggregate(), Expanding.aggregate() and ExponentialMovingWindow.aggregate() now accept NamedAgg aggregations through **kwargs (GH 28333)
Added Rolling.first(), Rolling.last(), Expanding.first(), and Expanding.last() (GH 33155)
Added Rolling.nunique() and Expanding.nunique() (GH 26958)
Added Rolling.pipe() and Expanding.pipe() (GH 57076)
Reshaping:
pandas.merge() propagates the attrs attribute to the result if all inputs have identical attrs, as was already the case for pandas.concat().
pandas.merge() now validates the how parameter input (merge type) (GH 59435)
pandas.merge(), DataFrame.merge() and DataFrame.join() now support anti joins (left_anti and right_anti) in the how parameter (GH 42916)
DataFrame.pivot_table() and pivot_table() now allow the passing of keyword arguments to aggfunc through **kwargs (GH 57884)
pandas.concat() will raise a ValueError when ignore_index=True and keys is not None (GH 59274)
Improved error reporting by outputting the first few duplicates when merge() validation fails (GH 62742)
Missing:
DataFrame.fillna() and Series.fillna() can now accept value=None; for non-object dtype the corresponding NA value will be used (GH 57723)
Added support for axis=1 with dict or Series arguments in DataFrame.fillna() (GH 4514)
Numeric:
DataFrame.agg() called with axis=1 and a func which relabels the result index now raises a NotImplementedError (GH 58807).
DataFrame.corrwith() now accepts min_periods as an optional argument, as in DataFrame.corr() and Series.corr() (GH 9490)
DataFrame.cummin(), DataFrame.cummax(), DataFrame.cumprod() and DataFrame.cumsum() methods now have a numeric_only parameter (GH 53072)
DataFrame.ewm() now allows adjust=False when times is provided (GH 54328)
Series.cummin() and Series.cummax() now support CategoricalDtype (GH 52335)
Series.map() can now accept kwargs to pass on to func (GH 59814)
Series.nlargest() uses stable sort internally and will preserve original ordering in the case of equality (GH 55767)
Series.round() now supports object dtypes when the underlying Python objects implement __round__ (GH 63444)
Support passing an Iterable[Hashable] input to DataFrame.drop_duplicates() (GH 59237)
Strings:
Series.str.get_dummies() now accepts a dtype parameter to specify the dtype of the resulting DataFrame (GH 47872)
Added Series.str.isascii() (GH 59091)
Allow dictionaries to be passed to Series.str.replace() via pat parameter (GH 51748)
Datetimelike:
Easter has gained a new constructor argument method which specifies the method used to calculate Easter, for example Orthodox Easter (GH 61665)
Holiday constructor argument days_of_week will raise a ValueError when type is something other than None or tuple (GH 61658)
Holiday has gained the constructor argument and field exclude_dates to exclude specific datetimes from a custom holiday calendar (GH 54382)
Added half-year offset classes HalfYearBegin, HalfYearEnd, BHalfYearBegin and BHalfYearEnd (GH 60928)
Improved deprecation message for offset aliases (GH 60820)
Multiplying two DateOffset objects will now raise a TypeError instead of a RecursionError (GH 59442)
Indexing:
DataFrame.iloc() and Series.iloc() now support boolean masks in __getitem__ for more consistent indexing behavior (GH 60994)
Index.get_loc() now accepts also subclasses of tuple as keys (GH 57922)
Styler / output formatting:
Styler.set_tooltips() provides an alternative method of storing tooltips, using the title attribute of td elements (GH 56981)
Added Styler.to_typst() to write Styler objects to file, buffer or string in Typst format (GH 57617)
Styler.format_index_names() can now be used to format the index and column names (GH 48936 and GH 47489)
frozenset elements in pandas objects are now natively printed (GH 60690)
Typing:
pandas.api.typing.FrozenList is available for typing the outputs of MultiIndex.names, MultiIndex.codes and MultiIndex.levels (GH 58237)
pandas.api.typing.NoDefault is available for typing no_default (GH 60696)
pandas.api.typing.SASReader is available for typing the output of read_sas() (GH 55689)
Many type aliases are now exposed in the new submodule pandas.api.typing.aliases (GH 55231)
Plotting:
Series.plot() now correctly handles the ylabel parameter for pie charts, allowing for explicit control over the y-axis label (GH 58239)
Added missing parameter weights in DataFrame.plot.kde() for the estimation of the PDF (GH 59337)
DataFrame.plot.scatter() argument c now accepts a column of strings, where rows with the same string are colored identically (GH 16827 and GH 16485)
ExtensionArray:
ArrowDtype now supports pyarrow.JsonType (GH 60958)
Series.rank() and DataFrame.rank() with numpy-nullable dtypes preserve NA values and return UInt64 dtype where appropriate instead of casting NA to NaN with float64 dtype (GH 62043)
Improve the resulting dtypes in DataFrame.where() and DataFrame.mask() with ExtensionDtype other (GH 62038)
Other:
set_option() now accepts a dictionary of options, simplifying configuration of multiple settings at once (GH 61093); see the sketch after this list
DataFrame.apply() supports using third-party execution engines like the Bodo.ai JIT compiler (GH 60668)
Series.map() now accepts an engine parameter to allow execution with a third-party execution engine (GH 61125)
Support passing a Series input to json_normalize() that retains the Index (GH 51452)
Users can globally disable any PerformanceWarning by setting the option mode.performance_warnings to False (GH 56920)
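As a hedged sketch of the dictionary form of set_option() mentioned above, assuming the dictionary is passed as the single positional argument (the option names are standard display options):

import pandas as pd

# Configure several options in one call instead of separate set_option() calls:
pd.set_option({"display.max_rows": 100, "display.max_columns": 20})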
Packaging:
Switched wheel upload to PyPI Trusted Publishing (OIDC) for release-tag pushes in wheels.yml. (GH 61718)
Wheels are now available for Windows ARM64 architecture (GH 61462)
Wheels are now available for free-threading Python builds on Windows (in addition to the other platforms) (GH 61463)
Notable bug fixes#
These are bug fixes that might have notable behavior changes.
Improved behavior in groupby for observed=False#
A number of bugs have been fixed due to improved handling of unobserved groups. All remarks in this section equally impact SeriesGroupBy. (GH 55738)
In previous versions of pandas, a single grouping with DataFrameGroupBy.apply() or DataFrameGroupBy.agg() would pass the unobserved groups to the provided function, resulting correctly in 0 below.
In [4]: df = pd.DataFrame(
...: {
...: "key1": pd.Categorical(list("aabb"), categories=list("abc")),
...: "key2": [1, 1, 1, 2],
...: "values": [1, 2, 3, 4],
...: }
...: )
...:
In [5]: df
Out[5]:
key1 key2 values
0 a 1 1
1 a 1 2
2 b 1 3
3 b 2 4
In [6]: gb = df.groupby("key1", observed=False)
In [7]: gb[["values"]].apply(lambda x: x.sum())
Out[7]:
values
key1
a 3
b 7
c 0
However this was not the case when using multiple groupings, resulting in NaN below.
In [1]: gb = df.groupby(["key1", "key2"], observed=False)
In [2]: gb[["values"]].apply(lambda x: x.sum())
Out[2]:
values
key1 key2
a 1 3.0
2 NaN
b 1 3.0
2 4.0
c 1 NaN
2 NaN
Now using multiple groupings will also pass the unobserved groups to the provided function.
In [8]: gb = df.groupby(["key1", "key2"], observed=False)
In [9]: gb[["values"]].apply(lambda x: x.sum())
Out[9]:
values
key1 key2
a 1 3
2 0
b 1 3
2 4
c 1 0
2 0
Similarly:
In previous versions of pandas the method DataFrameGroupBy.sum() would result in 0 for unobserved groups, but DataFrameGroupBy.prod(), DataFrameGroupBy.all(), and DataFrameGroupBy.any() would all result in NA values. Now these methods result in 1, True, and False respectively (see the sketch after this list).
DataFrameGroupBy.groups() did not include unobserved groups and now does.
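A minimal sketch of these reductions, reusing the df defined above; the unobserved category "c" now gets the identity element of each reduction:

gb = df.groupby("key1", observed=False)

gb["values"].prod()   # "c" -> 1     (previously NA)
gb["values"].all()    # "c" -> True  (previously NA)
gb["values"].any()    # "c" -> False (previously NA)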
These improvements also fixed certain bugs in groupby:
DataFrameGroupBy.agg() would fail when there are multiple groupings, unobserved groups, and as_index=False (GH 36698)
DataFrameGroupBy.groups() with sort=False would sort groups; they now occur in the order they are observed (GH 56966)
DataFrameGroupBy.nunique() would fail when there are multiple groupings, unobserved groups, and as_index=False (GH 52848)
DataFrameGroupBy.sum() would have incorrect values when there are multiple groupings, unobserved groups, and non-numeric data (GH 43891)
DataFrameGroupBy.value_counts() would produce incorrect results when used with some categorical and some non-categorical groupings and observed=False (GH 56016)
Backwards incompatible API changes#
Datetime/timedelta resolution inference#
Prior to pandas 3.0, converting a sequence of strings, stdlib datetime objects, np.datetime64 objects, or integers to a datetime64 / timedelta64 dtype would always result in nanosecond resolution (or raise an out-of-bounds error). Now pandas infers the appropriate resolution (a.k.a. unit) for the output dtype. This affects both the generic constructors (Series, DataFrame, Index, DatetimeIndex) and the specific conversion or creation functions (to_datetime(), to_timedelta(), date_range(), timedelta_range(), Timestamp, Timedelta).
The general rules for the various types of input:
The default resolution when parsing strings is now microseconds; nanosecond resolution is used instead when the precision of the string requires it.
The resolution of the input is preserved for stdlib datetime objects (i.e. microseconds) or np.datetime64/np.timedelta64 objects (i.e. the unit, capped to the supported range of seconds to nanoseconds).
For integer input, the resolution in which the integer values are interpreted (e.g. the unit keyword of to_datetime()) is used as the resulting resolution (capped to the supported range of seconds to nanoseconds).
For example, the following would always have given nanosecond resolution previously, but now the resolution is inferred:
# parsing strings
In [10]: print(pd.to_datetime(["2024-03-22 11:36"]).dtype)
datetime64[us]
# converting integers
In [11]: print(pd.to_datetime([0], unit="s").dtype)
datetime64[s]
# converting stdlib datetime object
In [12]: dt = pd.Timestamp("2024-03-22 11:36").to_pydatetime()
In [13]: print(pd.to_datetime([dt]).dtype)
datetime64[us]
# the same when inferring a datetime dtype in the generic constructors
In [14]: print(pd.Series([dt]).dtype)
datetime64[us]
# converting numpy objects
In [15]: print(pd.Series([np.datetime64("2024-03-22", "ms")]).dtype)
datetime64[ms]
Similarly, when passed a sequence of np.datetime64 objects, the resolution of the passed objects will be retained (or, for lower-than-second resolutions, second resolution will be used).
When parsing strings, the default is now microseconds (which also impacts I/O methods reading from text files, such as read_csv() and read_json()), except when the string has nanosecond precision, in which case nanosecond resolution is used:
In [16]: print(pd.to_datetime(["2024-03-22 11:43:01.123"]).dtype)
datetime64[us]
In [17]: print(pd.to_datetime(["2024-03-22 11:43:01.123456"]).dtype)
datetime64[us]
In [18]: print(pd.to_datetime(["2024-03-22 11:43:01.123456789"]).dtype)
datetime64[ns]
This is also a change for the Timestamp constructor with a string input, which in version 2.x.y could give second or millisecond unit (GH 52653).
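A minimal sketch of the new behaviour for the Timestamp constructor (the unit attribute reports the stored resolution):

pd.Timestamp("2024-03-22 11:36").unit                # 'us'
pd.Timestamp("2024-03-22 11:43:01.123456789").unit   # 'ns'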
Warning
Many users will now get “datetime64[us]” dtype data in cases when they used to get “datetime64[ns]”. For most use cases they should not notice a difference. One big exception is converting to integers, which will give integers 1000x smaller.
When converting datetime-like da