Pandas: The Essential Reference
This table summarizes all core Pandas concepts, methods, attributes, and best practices for quick learning and revision.
Category | Topic / Method / Attribute | Syntax / Formula | Use Case / Explanation | Key Tip / Interview Point |
---|---|---|---|---|
Core Concepts | What is Pandas? | `import pandas as pd` | Python library for data manipulation and analysis, built on NumPy. Provides high-performance, easy-to-use data structures. | Mention: Fast (NumPy/C), Flexible (I/O), Powerful (features). |
Core Concepts | Series (1D) | `pd.Series([data], index=[labels])` | A one-dimensional labeled array. Think of it as a single column from a spreadsheet. | Has both data and an index (labels). |
Core Concepts | DataFrame (2D) | `pd.DataFrame({'col': [data]})` | A two-dimensional labeled table. Collection of Series, sharing a common index. Your primary Pandas object. | A DataFrame is essentially a dictionary of Series. |
Data Structures: Attributes | `.index` | `series.index` `df.index` | Returns the row labels (index) of a Series or DataFrame. | Crucial for label-based selection (`.loc`). |
Data Structures: Attributes | `.values` | `series.values` `df.values` | Returns the data as a NumPy array. | Underlying data structure for performance. |
Data Structures: Attributes | `.dtype` / `.dtypes` | `series.dtype` `df.dtypes` | Returns the data type of a Series / a Series of data types for DataFrame columns. | Essential for data cleaning and debugging type issues. |
Data Structures: Attributes | `.shape` | `df.shape` | Returns a tuple `(rows, columns)` representing DataFrame dimensions. | Quickly understand dataset size. |
Data Structures: Attributes | `.columns` | `df.columns` | Returns the column labels of a DataFrame. | Useful for iterating or checking column names. |
I/O | `pd.read_csv()` | `pd.read_csv('file.csv', sep=',')` | Reads a CSV file into a DataFrame. Handles various delimiters. | Set `sep` (e.g., `sep='\t'`) to read other delimited text files. |
I/O | `df.to_csv()` | `df.to_csv('output.csv', index=False)` | Writes a DataFrame to a CSV file. | Always use `index=False` unless you want the DataFrame index as a column. |
I/O | Other I/O | `pd.read_excel()` `df.to_excel()` | Read/write Excel, JSON, SQL, etc. | Pandas supports many file formats. |
Inspection | `.head()` / `.tail()` | `df.head(n)` `df.tail(n)` | View the first/last `n` rows (default 5). Quick visual check. | First step after loading data. |
Inspection | `.info()` | `df.info()` | Provides a concise summary: non-null counts, data types, memory usage. | Crucial for quickly identifying missing data and incorrect dtypes. |
Inspection | `.describe()` | `df.describe()` | Generates descriptive statistics (mean, std, min, max, quartiles) for numeric columns. | Great for understanding data distribution. |
Selection | Select Column(s) | `df['col_name']` (Series) `df[['col1', 'col2']]` (DataFrame) | Access one or more columns. | Single brackets for Series, double for DataFrame. |
Selection | `.loc` (Label-based) | `df.loc['row_label', 'col_label']` `df.loc['start':'end', ['col1', 'col2']]` | Select by row/column labels. Slicing is inclusive. | Preferred for explicit, readable selection; avoids `SettingWithCopyWarning`. |
Selection | `.iloc` (Position-based) | `df.iloc[row_pos, col_pos]` `df.iloc[0:2, 1]` | Select by integer position. Slicing is exclusive (like Python lists). | Useful for programmatic selection (e.g., first N rows). |
Filtering / Masking | Boolean Indexing | `df[df['col'] > value]` | Select rows where a condition is `True`. | The most common way to filter data. |
Filtering / Masking | Combining Conditions | `df[(cond1) & (cond2)]` `df[(cond1) \| (cond2)]` | Use `&` for AND, `\|` for OR. Each condition must be in parentheses. | Do NOT use Python's `and`/`or` keywords. |
Filtering / Masking | `.isin()` | `df[df['col'].isin(['val1', 'val2'])]` | Efficiently filter rows where a column's value is in a list of values. | Much cleaner than multiple `|` conditions. |
Cleaning | Missing Data Count | `df.isnull().sum()` | Counts the number of `NaN` (missing) values per column. | First step in handling missing data. |
Cleaning | `.dropna()` | `df.dropna(axis=0/1, how='any/all')` | Removes rows (`axis=0`, default) or columns (`axis=1`) with `NaN` values. `how='any'` (default) drops if any NaN, `how='all'` drops if all are NaN. | Use when NaNs are a small percentage; be careful about data loss. |
Cleaning | `.fillna()` / `.ffill()` | `df['col'].fillna(value)` `df['col'].ffill()` | Replaces `NaN` values. Can use a constant, mean/median, or `.ffill()` (forward-fill) / `.bfill()` (backward-fill); the older `fillna(method=...)` form is deprecated in pandas 2.x. | Generally preferred over `dropna` to retain data. |
Cleaning | `.astype()` | `df['col'].astype(int/float/bool)` | Converts a column to a specified data type. | Crucial for enabling correct operations and memory efficiency. |
Cleaning | `pd.to_numeric()` / `pd.to_datetime()` | `pd.to_numeric(df['col'], errors='coerce')` `pd.to_datetime(df['col'])` | Robust conversion functions. `errors='coerce'` turns unparseable values into `NaN`. | Use these for safer conversions, especially with mixed data. |
Cleaning | `.rename()` | `df.rename(columns={'old': 'new'})` | Renames columns or index labels. | Standardize names (e.g., `snake_case`) for clarity. |
Reshaping & Combining | `groupby()` | `df.groupby('col')['val'].sum()` `df.groupby(['col1', 'col2'])['val'].agg(['mean', 'max'])` | Splits data into groups, applies an aggregation, and combines the results (Split-Apply-Combine). | Fundamental for summarizing data. |
Reshaping & Combining | `groupby().transform()` | `df.groupby('col')['val'].transform('mean')` | Performs group-wise operation but returns a result of the same size as the original DataFrame. | Useful for normalization (e.g., score vs. class average). |
Reshaping & Combining | `groupby().apply()` | `df.groupby('col').apply(custom_func)` | Most flexible. Applies an arbitrary function to each group. Can return results of any size. | Use as a last resort if `agg()` or `transform()` don't fit. |
Reshaping & Combining | `pd.merge()` | `pd.merge(df1, df2, on='key', how='inner/left/right/outer')` | Combines DataFrames based on common column(s) or index. (SQL JOIN equivalent). | Know all `how` types (`inner` is default). |
Reshaping & Combining | `pd.concat()` | `pd.concat([df1, df2], axis=0, ignore_index=True)` | Stacks DataFrames vertically (`axis=0`, default) or horizontally (`axis=1`). | Use `ignore_index=True` to reset index after stacking. |
Reshaping & Combining | `pd.pivot_table()` | `pd.pivot_table(df, index='rows', columns='cols', values='vals', aggfunc='sum')` | Reshapes data from "long" to "wide" format, creating a summary table. | Excellent for creating human-readable reports. |
Reshaping & Combining | `pd.melt()` | `pd.melt(df, id_vars=['id_cols'], var_name='new_col', value_name='new_val')` | Reshapes data from "wide" to "long" format. Inverse of pivoting. | Useful for preparing data for certain ML models or visualizations. |
Advanced | Time-Series: `pd.to_datetime()` | `df['Date'] = pd.to_datetime(df['Date'])` | Converts string/object columns to proper datetime objects. | Essential first step for any time-series analysis. |
Advanced | Time-Series: `DatetimeIndex` | `df.set_index('Date', inplace=True)` | Setting a datetime column as the DataFrame's index enables powerful time-based slicing. | Allows `df.loc['YYYY-MM']` for easy date range selection. |
Advanced | Time-Series: `.resample()` | `df_indexed.resample('W').sum()` | Changes the frequency of time-series data (e.g., daily to weekly totals). | Common codes: `D`, `W`, `M`, `Q`, `A` (pandas 2.2+ prefers `ME`, `QE`, `YE` for month/quarter/year end). |
Advanced | Text Data: `.str` accessor | `df['col'].str.strip()` `df['col'].str.lower()` `df['col'].str.replace('old', 'new')` `df['col'].str.contains('pattern')` | Applies Python string methods to an entire Series of text data. | Vectorized string operations are much faster than loops. |
Advanced | MultiIndex | `df.groupby(['col1', 'col2'])` `df.loc[('Level1', 'Level2')]` | Hierarchical indexing, allowing multiple levels of row/column labels. Often a result of a multi-key `groupby()`. | Organizes complex, grouped data elegantly. |
Best Practices | Vectorized Operations | `df['col'] * 2` (vs. `for x in list: x*2`) | Always prefer built-in Pandas/NumPy operations over Python loops for performance. | This is the #1 performance tip in Pandas. |
Best Practices | Proper Data Types | `df['col'].astype('category')` | Use the most memory-efficient data types (e.g., `int8`, `float32`, `category`). | Saves memory, especially for large datasets. |
Best Practices | Method Chaining | `df.method1().method2().method3()` | Link multiple operations together in a single, fluid expression. | Improves readability and avoids cluttering the namespace with intermediate variables. |
Best Practices | Explicit Indexing | `df.loc[row, col] = value` | Always use `.loc` or `.iloc` for selection and assignment to prevent `SettingWithCopyWarning`. | Avoid chained indexing like `df['col'][row]` for assignment. |
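
The selection rows above can be sketched with a tiny made-up DataFrame (the names `name`, `score`, and the `r1`/`r2`/`r3` index are illustrative, not from the table):

```python
import pandas as pd

# Hypothetical toy data to illustrate the selection idioms.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "score": [85, 92, 78]},
    index=["r1", "r2", "r3"],
)

col = df["score"]                      # single brackets -> Series
sub = df[["name", "score"]]            # double brackets -> DataFrame
by_label = df.loc["r1":"r2", "score"]  # .loc slicing is inclusive (2 rows)
by_pos = df.iloc[0:2, 1]               # .iloc slicing is exclusive (2 rows)
high = df[df["score"] > 80]            # boolean mask
picked = df[df["name"].isin(["Ann", "Cy"])]
```

Note that `by_label` and `by_pos` return the same two rows: `.loc` includes the `'r2'` endpoint, while `.iloc` stops before position 2.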
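
The cleaning rows compose naturally into a short pipeline. A minimal sketch, assuming a hypothetical messy `amount` column:

```python
import pandas as pd

# Hypothetical messy column: mixed strings and a missing value.
df = pd.DataFrame({"amount": ["10", "20", "oops", None]})

# Safe conversion: unparseable values become NaN instead of raising.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
missing = df["amount"].isnull().sum()        # count NaNs (here: 2)

# Fill with the column mean, then cast to a memory-efficient dtype.
df["amount"] = df["amount"].fillna(df["amount"].mean())
df["amount"] = df["amount"].astype("float32")
```

This mirrors the recommended order: convert first (so `NaN` flags the bad values), count, then fill.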
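
A hedged sketch of the Split-Apply-Combine and merge rows, using made-up sales data (`region`, `amount`, `manager` are illustrative names):

```python
import pandas as pd

# Hypothetical sales data.
sales = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "amount": [10, 30, 5, 15],
})

totals = sales.groupby("region")["amount"].sum()         # one row per group
sales["region_mean"] = (
    sales.groupby("region")["amount"].transform("mean")  # same length as input
)

regions = pd.DataFrame({"region": ["N", "S"], "manager": ["Ana", "Raj"]})
merged = pd.merge(sales, regions, on="region", how="left")  # SQL-style join
```

Note the key difference from the table: `sum()` after `groupby` collapses to one row per group, while `transform("mean")` broadcasts the group statistic back to every original row.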
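
The time-series rows chain together as follows; this is a sketch with synthetic daily data (the `hits` column is invented for illustration):

```python
import pandas as pd

# Two full Mon-Sun weeks of synthetic daily counts.
idx = pd.date_range("2024-01-01", periods=14, freq="D")
df = pd.DataFrame({"hits": [1] * 14}, index=idx)

weekly = df.resample("W").sum()   # daily -> weekly totals (2 rows of 7)
jan = df.loc["2024-01"]           # partial-string slicing on a DatetimeIndex
```

Partial-string slicing (`df.loc['2024-01']`) only works once the index is a `DatetimeIndex`, which is why `pd.to_datetime` plus `set_index` is listed as the essential first step.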
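
Finally, the `.str` accessor row in action, on a small invented text Series:

```python
import pandas as pd

# Hypothetical messy text data.
s = pd.Series(["  Alice ", "BOB", "carol"])

clean = s.str.strip().str.lower()   # vectorized; also an example of chaining
has_o = clean.str.contains("o")     # boolean Series, usable as a filter mask
```

Each `.str` call operates on the whole Series at once, which is the vectorization best practice from the table applied to text.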