Pandas: The Essential Reference
This table summarizes all core Pandas concepts, methods, attributes, and best practices for quick learning and revision.
Category | Topic / Method / Attribute | Syntax / Formula | Use Case / Explanation | Key Tip / Interview Point |
---|---|---|---|---|
Core Concepts | What is Pandas? | `import pandas as pd` | Python library for data manipulation and analysis, built on NumPy. Provides high-performance, easy-to-use data structures. | Mention: Fast (NumPy/C), Flexible (I/O), Powerful (features). |
Core Concepts | Series (1D) | `pd.Series([data], index=[labels])` | A one-dimensional labeled array. Think of it as a single column from a spreadsheet. | Has both data and an index (labels). |
Core Concepts | DataFrame (2D) | `pd.DataFrame({'col': [data]})` | A two-dimensional labeled table. Collection of Series, sharing a common index. Your primary Pandas object. | A DataFrame is essentially a dictionary of Series. |
Data Structures: Attributes | `.index` | `series.index` `df.index` | Returns the row labels (index) of a Series or DataFrame. | Crucial for label-based selection (`.loc`). |
Data Structures: Attributes | `.values` | `series.values` `df.values` | Returns the data as a NumPy array. | Underlying data structure for performance. |
Data Structures: Attributes | `.dtype` / `.dtypes` | `series.dtype` `df.dtypes` | Returns the data type of a Series / a Series of data types for DataFrame columns. | Essential for data cleaning and debugging type issues. |
Data Structures: Attributes | `.shape` | `df.shape` | Returns a tuple `(rows, columns)` representing DataFrame dimensions. | Quickly understand dataset size. |
Data Structures: Attributes | `.columns` | `df.columns` | Returns the column labels of a DataFrame. | Useful for iterating or checking column names. |
I/O | `pd.read_csv()` | `pd.read_csv('file.csv', sep=',')` | Reads a CSV file into a DataFrame. Handles various delimiters. | Set `sep` (e.g., `sep='\t'`) to read other delimited text files. |
I/O | `df.to_csv()` | `df.to_csv('output.csv', index=False)` | Writes a DataFrame to a CSV file. | Always use `index=False` unless you want the DataFrame index as a column. |
I/O | Other I/O | `pd.read_excel()` `df.to_excel()` | Read/write Excel, JSON, SQL, etc. | Pandas supports many file formats. |
Inspection | `.head()` / `.tail()` | `df.head(n)` `df.tail(n)` | View the first/last `n` rows (default 5). Quick visual check. | First step after loading data. |
Inspection | `.info()` | `df.info()` | Provides a concise summary: non-null counts, data types, memory usage. | Crucial for quickly identifying missing data and incorrect dtypes. |
Inspection | `.describe()` | `df.describe()` | Generates descriptive statistics (mean, std, min, max, quartiles) for numeric columns. | Great for understanding data distribution. |
Selection | Select Column(s) | `df['col_name']` (Series) `df[['col1', 'col2']]` (DataFrame) | Access one or more columns. | Single brackets for Series, double for DataFrame. |
Selection | `.loc` (Label-based) | `df.loc['row_label', 'col_label']` `df.loc['start':'end', ['col1', 'col2']]` | Select by row/column labels. Slicing is inclusive. | Preferred for explicit, readable selection; avoids `SettingWithCopyWarning`. |
Selection | `.iloc` (Position-based) | `df.iloc[row_pos, col_pos]` `df.iloc[0:2, 1]` | Select by integer position. Slicing is exclusive (like Python lists). | Useful for programmatic selection (e.g., first N rows). |
Filtering / Masking | Boolean Indexing | `df[df['col'] > value]` | Select rows where a condition is `True`. | The most common way to filter data. |
Filtering / Masking | Combining Conditions | `df[(cond1) & (cond2)]` `df[(cond1) \| (cond2)]` | Use `&` for AND, `\|` for OR. Each condition must be in parentheses. | Do NOT use Python's `and`/`or` keywords. |
Filtering / Masking | `.isin()` | `df[df['col'].isin(['val1', 'val2'])]` | Efficiently filter rows where a column's value is in a list of values. | Much cleaner than multiple `|` conditions. |
Cleaning | Missing Data Count | `df.isnull().sum()` | Counts the number of `NaN` (missing) values per column. | First step in handling missing data. |
Cleaning | `.dropna()` | `df.dropna(axis=0/1, how='any/all')` | Removes rows (`axis=0`, default) or columns (`axis=1`) with `NaN` values. `how='any'` (default) drops if any NaN, `how='all'` drops if all are NaN. | Use when NaNs are a small percentage; be careful about data loss. |
Cleaning | `.fillna()` / `.ffill()` | `df['col'].fillna(value)` `df['col'].ffill()` | Replaces `NaN` values. Can use a constant, mean/median, or `.ffill()` (forward-fill) / `.bfill()` (backward-fill); the older `fillna(method=...)` form is deprecated in pandas 2.x. | Generally preferred over `dropna` to retain data. |
Cleaning | `.astype()` | `df['col'].astype(int/float/bool)` | Converts a column to a specified data type. | Crucial for enabling correct operations and memory efficiency. |
Cleaning | `pd.to_numeric()` / `pd.to_datetime()` | `pd.to_numeric(df['col'], errors='coerce')` `pd.to_datetime(df['col'])` | Robust conversion functions. `errors='coerce'` turns unparseable values into `NaN`. | Use these for safer conversions, especially with mixed data. |
Cleaning | `.rename()` | `df.rename(columns={'old': 'new'})` | Renames columns or index labels. | Standardize names (e.g., `snake_case`) for clarity. |
Reshaping & Combining | `groupby()` | `df.groupby('col')['val'].sum()` `df.groupby(['col1', 'col2'])['val'].agg(['mean', 'max'])` | Splits data into groups, applies an aggregation, and combines the results (Split-Apply-Combine). | Fundamental for summarizing data. |
Reshaping & Combining | `groupby().transform()` | `df.groupby('col')['val'].transform('mean')` | Performs group-wise operation but returns a result of the same size as the original DataFrame. | Useful for normalization (e.g., score vs. class average). |
Reshaping & Combining | `groupby().apply()` | `df.groupby('col').apply(custom_func)` | Most flexible. Applies an arbitrary function to each group. Can return results of any size. | Use as a last resort if `agg()` or `transform()` don't fit. |
Reshaping & Combining | `pd.merge()` | `pd.merge(df1, df2, on='key', how='inner/left/right/outer')` | Combines DataFrames based on common column(s) or index. (SQL JOIN equivalent). | Know all `how` types (`inner` is default). |
Reshaping & Combining | `pd.concat()` | `pd.concat([df1, df2], axis=0, ignore_index=True)` | Stacks DataFrames vertically (`axis=0`, default) or horizontally (`axis=1`). | Use `ignore_index=True` to reset index after stacking. |
Reshaping & Combining | `pd.pivot_table()` | `pd.pivot_table(df, index='rows', columns='cols', values='vals', aggfunc='sum')` | Reshapes data from "long" to "wide" format, creating a summary table. | Excellent for creating human-readable reports. |
Reshaping & Combining | `pd.melt()` | `pd.melt(df, id_vars=['id_cols'], var_name='new_col', value_name='new_val')` | Reshapes data from "wide" to "long" format. Inverse of pivoting. | Useful for preparing data for certain ML models or visualizations. |
Advanced | Time-Series: `pd.to_datetime()` | `df['Date'] = pd.to_datetime(df['Date'])` | Converts string/object columns to proper datetime objects. | Essential first step for any time-series analysis. |
Advanced | Time-Series: `DatetimeIndex` | `df.set_index('Date', inplace=True)` | Setting a datetime column as the DataFrame's index enables powerful time-based slicing. | Allows `df.loc['YYYY-MM']` for easy date range selection. |
Advanced | Time-Series: `.resample()` | `df_indexed.resample('W').sum()` | Changes the frequency of time-series data (e.g., daily to weekly totals). | Common codes: `D`, `W`, `M`, `Q`, `A` (pandas 2.2+ prefers `ME`, `QE`, `YE` for month/quarter/year end). |
Advanced | Text Data: `.str` accessor | `df['col'].str.strip()` `df['col'].str.lower()` `df['col'].str.replace('old', 'new')` `df['col'].str.contains('pattern')` | Applies Python string methods to an entire Series of text data. | Vectorized string operations are much faster than loops. |
Advanced | MultiIndex | `df.groupby(['col1', 'col2'])` `df.loc[('Level1', 'Level2')]` | Hierarchical indexing, allowing multiple levels of row/column labels. Often a result of a multi-key `groupby()`. | Organizes complex, grouped data elegantly. |
Best Practices | Vectorized Operations | `df['col'] * 2` (vs. `for x in list: x*2`) | Always prefer built-in Pandas/NumPy operations over Python loops for performance. | This is the #1 performance tip in Pandas. |
Best Practices | Proper Data Types | `df['col'].astype('category')` | Use the most memory-efficient data types (e.g., `int8`, `float32`, `category`). | Saves memory, especially for large datasets. |
Best Practices | Method Chaining | `df.method1().method2().method3()` | Link multiple operations together in a single, fluid expression. | Improves readability and avoids cluttering the namespace with intermediate variables. |
Best Practices | Explicit Indexing | `df.loc[row, col] = value` | Always use `.loc` or `.iloc` for selection and assignment to prevent `SettingWithCopyWarning`. | Avoid chained indexing like `df['col'][row]` for assignment. |
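
The selection rows above can be sketched with a tiny made-up DataFrame (the names `name`, `score`, and the `r1`/`r2`/`r3` index are illustrative, not from the table):

```python
import pandas as pd

# Hypothetical toy data to illustrate the selection idioms.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "score": [85, 92, 78]},
    index=["r1", "r2", "r3"],
)

col = df["score"]                      # single brackets -> Series
sub = df[["name", "score"]]            # double brackets -> DataFrame
by_label = df.loc["r1":"r2", "score"]  # .loc slicing is inclusive (2 rows)
by_pos = df.iloc[0:2, 1]               # .iloc slicing is exclusive (2 rows)
high = df[df["score"] > 80]            # boolean mask
picked = df[df["name"].isin(["Ann", "Cy"])]
```

Note that `by_label` and `by_pos` return the same two rows: `.loc` includes the `'r2'` endpoint, while `.iloc` stops before position 2.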
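
The cleaning rows compose naturally into a short pipeline. A minimal sketch, assuming a hypothetical messy `amount` column:

```python
import pandas as pd

# Hypothetical messy column: mixed strings and a missing value.
df = pd.DataFrame({"amount": ["10", "20", "oops", None]})

# Safe conversion: unparseable values become NaN instead of raising.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
missing = df["amount"].isnull().sum()        # count NaNs (here: 2)

# Fill with the column mean, then cast to a memory-efficient dtype.
df["amount"] = df["amount"].fillna(df["amount"].mean())
df["amount"] = df["amount"].astype("float32")
```

This mirrors the recommended order: convert first (so `NaN` flags the bad values), count, then fill.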
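
A hedged sketch of the Split-Apply-Combine and merge rows, using made-up sales data (`region`, `amount`, `manager` are illustrative names):

```python
import pandas as pd

# Hypothetical sales data.
sales = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "amount": [10, 30, 5, 15],
})

totals = sales.groupby("region")["amount"].sum()         # one row per group
sales["region_mean"] = (
    sales.groupby("region")["amount"].transform("mean")  # same length as input
)

regions = pd.DataFrame({"region": ["N", "S"], "manager": ["Ana", "Raj"]})
merged = pd.merge(sales, regions, on="region", how="left")  # SQL-style join
```

Note the key difference from the table: `sum()` after `groupby` collapses to one row per group, while `transform("mean")` broadcasts the group statistic back to every original row.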
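
The time-series rows chain together as follows; this is a sketch with synthetic daily data (the `hits` column is invented for illustration):

```python
import pandas as pd

# Two full Mon-Sun weeks of synthetic daily counts.
idx = pd.date_range("2024-01-01", periods=14, freq="D")
df = pd.DataFrame({"hits": [1] * 14}, index=idx)

weekly = df.resample("W").sum()   # daily -> weekly totals (2 rows of 7)
jan = df.loc["2024-01"]           # partial-string slicing on a DatetimeIndex
```

Partial-string slicing (`df.loc['2024-01']`) only works once the index is a `DatetimeIndex`, which is why `pd.to_datetime` plus `set_index` is listed as the essential first step.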
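
Finally, the `.str` accessor row in action, on a small invented text Series:

```python
import pandas as pd

# Hypothetical messy text data.
s = pd.Series(["  Alice ", "BOB", "carol"])

clean = s.str.strip().str.lower()   # vectorized; also an example of chaining
has_o = clean.str.contains("o")     # boolean Series, usable as a filter mask
```

Each `.str` call operates on the whole Series at once, which is the vectorization best practice from the table applied to text.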