pyspark.sql.functions.count_distinct

pyspark.sql.functions.count_distinct(col, *cols)

Returns a new Column containing the number of distinct values in col, or the number of distinct combinations of values across col and cols.

New in version 3.2.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
col : Column or str

first column to compute on.

*cols : Column or str

other columns to compute on.

Returns
Column

the number of distinct values (or distinct combinations of values) in the given column(s).

Examples

Example 1: Counting distinct values of a single column

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1,), (1,), (3,)], ["value"])
>>> df.select(sf.count_distinct(df.value)).show()
+---------------------+
|count(DISTINCT value)|
+---------------------+
|                    2|
+---------------------+

Example 2: Counting distinct values of multiple columns

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1, 1), (1, 2)], ["value1", "value2"])
>>> df.select(sf.count_distinct(df.value1, df.value2)).show()
+------------------------------+
|count(DISTINCT value1, value2)|
+------------------------------+
|                             2|
+------------------------------+

Example 3: Counting distinct values with column names as strings

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1, 1), (1, 2)], ["value1", "value2"])
>>> df.select(sf.count_distinct("value1", "value2")).show()
+------------------------------+
|count(DISTINCT value1, value2)|
+------------------------------+
|                             2|
+------------------------------+