# Pandas Mastery: Comprehensive Tabular Infographic

## Pandas: The Essential Reference

This reference summarizes the core Pandas concepts, methods, attributes, and best practices for quick learning and revision.

### Core Concepts

| Topic | Syntax | Use Case / Explanation | Key Tip / Interview Point |
| --- | --- | --- | --- |
| What is Pandas? | `import pandas as pd` | Python library for data manipulation and analysis, built on NumPy. Provides high-performance, easy-to-use data structures. | Mention: fast (NumPy/C core), flexible (rich I/O), powerful (feature set). |
| Series (1D) | `pd.Series([data], index=[labels])` | A one-dimensional labeled array. Think of it as a single column from a spreadsheet. | Has both data and an index (labels). |
| DataFrame (2D) | `pd.DataFrame({'col': [data]})` | A two-dimensional labeled table: a collection of Series sharing a common index. Your primary Pandas object. | A DataFrame is essentially a dictionary of Series. |
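A minimal sketch of both structures (the data and labels are illustrative):

```python
import pandas as pd

# A Series: a 1-D labeled array (one spreadsheet column)
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame: conceptually a dict of Series sharing one index
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [30, 25]})

# Selecting a single column of a DataFrame yields a Series
ages = df["age"]
```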
### Data Structures: Attributes

| Attribute | Syntax | Use Case / Explanation | Key Tip / Interview Point |
| --- | --- | --- | --- |
| `.index` | `series.index`, `df.index` | Returns the row labels (index) of a Series or DataFrame. | Crucial for label-based selection (`.loc`). |
| `.values` | `series.values`, `df.values` | Returns the data as a NumPy array. | Underlying data structure for performance; `.to_numpy()` is the recommended modern equivalent. |
| `.dtype` / `.dtypes` | `series.dtype`, `df.dtypes` | Returns the data type of a Series, or a Series of data types for DataFrame columns. | Essential for data cleaning and debugging type issues. |
| `.shape` | `df.shape` | Returns a tuple `(rows, columns)` representing DataFrame dimensions. | Quickly understand dataset size. |
| `.columns` | `df.columns` | Returns the column labels of a DataFrame. | Useful for iterating or checking column names. |
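A quick sketch of these attributes on a toy DataFrame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"price": [1.5, 2.0], "qty": [3, 4]})

rows, cols = df.shape       # tuple of (rows, columns)
labels = list(df.columns)   # column labels
types = df.dtypes           # one dtype per column
arr = df.to_numpy()         # underlying data as a NumPy array
```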
### I/O

| Method | Syntax | Use Case / Explanation | Key Tip / Interview Point |
| --- | --- | --- | --- |
| `pd.read_csv()` | `pd.read_csv('file.csv', sep=',')` | Reads a delimited text file into a DataFrame. | Handles any delimiter via `sep` (e.g. `sep='\t'` for tab-separated files). |
| `df.to_csv()` | `df.to_csv('output.csv', index=False)` | Writes a DataFrame to a CSV file. | Use `index=False` unless you want the DataFrame index written as a column. |
| Other I/O | `pd.read_excel()`, `df.to_excel()` | Read/write Excel, JSON, SQL, and more. | Pandas supports many file formats. |
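A minimal round-trip sketch, writing to a temporary directory so it runs anywhere:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Write without the index, then read it back
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
df.to_csv(path, index=False)
df2 = pd.read_csv(path)
```

Without `index=False`, the read-back DataFrame would gain an extra `Unnamed: 0` column holding the old index.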
### Inspection

| Method | Syntax | Use Case / Explanation | Key Tip / Interview Point |
| --- | --- | --- | --- |
| `.head()` / `.tail()` | `df.head(n)`, `df.tail(n)` | View the first/last `n` rows (default 5). | Quick visual check; first step after loading data. |
| `.info()` | `df.info()` | Concise summary: non-null counts, data types, memory usage. | Crucial for quickly identifying missing data and incorrect dtypes. |
| `.describe()` | `df.describe()` | Generates descriptive statistics (mean, std, min, max, quartiles) for numeric columns. | Great for understanding data distribution. |
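A small sketch of the inspection workflow (the data is made up; note `.info()` prints its summary rather than returning it):

```python
import pandas as pd

df = pd.DataFrame({"score": [80, 90, None, 70], "grade": ["B", "A", None, "C"]})

first_two = df.head(2)   # first n rows (default 5)
df.info()                # prints non-null counts, dtypes, memory usage
stats = df.describe()    # numeric summary: count, mean, std, quartiles...
```

Here `.describe()` skips the non-numeric `grade` column, and its `count` row (3, not 4) already hints at the missing value.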
### Selection

| Method | Syntax | Use Case / Explanation | Key Tip / Interview Point |
| --- | --- | --- | --- |
| Select column(s) | `df['col_name']` (Series), `df[['col1', 'col2']]` (DataFrame) | Access one or more columns. | Single brackets return a Series; double brackets return a DataFrame. |
| `.loc` (label-based) | `df.loc['row_label', 'col_label']`, `df.loc['start':'end', ['col1', 'col2']]` | Select by row/column labels. Slicing is inclusive of the end label. | Preferred for explicit, readable selection; avoids `SettingWithCopyWarning`. |
| `.iloc` (position-based) | `df.iloc[row_pos, col_pos]`, `df.iloc[0:2, 1]` | Select by integer position. Slicing is exclusive (like Python lists). | Useful for programmatic selection (e.g. first N rows). |
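A sketch contrasting the two slicing rules on illustrative data:

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "pop": [0.7, 10.0, 3.1]},
    index=["a", "b", "c"],
)

by_label = df.loc["a":"b", "city"]  # label slicing is INCLUSIVE: rows a AND b
by_pos = df.iloc[0:2, 0]            # position slicing is EXCLUSIVE: rows 0 and 1
```

Both selections happen to return the same two rows here, but only because the inclusive label slice `"a":"b"` covers the same span as the exclusive position slice `0:2`.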
### Filtering / Masking

| Method | Syntax | Use Case / Explanation | Key Tip / Interview Point |
| --- | --- | --- | --- |
| Boolean indexing | `df[df['col'] > value]` | Select rows where a condition is `True`. | The most common way to filter data. |
| Combining conditions | `df[(cond1) & (cond2)]`, `df[(cond1) \| (cond2)]` | Use `&` for AND, `\|` for OR. Each condition must be in parentheses. | Do NOT use Python's `and`/`or` keywords. |
| `.isin()` | `df[df['col'].isin(['val1', 'val2'])]` | Efficiently filter rows where a column's value is in a list of values. | Much cleaner than chaining multiple `\|` conditions. |
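A minimal sketch of all three filters on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"dept": ["IT", "HR", "IT", "Ops"], "salary": [90, 60, 120, 70]})

high = df[df["salary"] > 80]                               # boolean mask
it_high = df[(df["dept"] == "IT") & (df["salary"] > 100)]  # parentheses required
subset = df[df["dept"].isin(["IT", "HR"])]                 # cleaner than chained |
```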
### Cleaning

| Method | Syntax | Use Case / Explanation | Key Tip / Interview Point |
| --- | --- | --- | --- |
| Missing-data count | `df.isnull().sum()` | Counts the number of `NaN` (missing) values per column. | First step in handling missing data. |
| `.dropna()` | `df.dropna(axis=0/1, how='any'/'all')` | Removes rows (`axis=0`, default) or columns (`axis=1`) with `NaN` values. `how='any'` (default) drops if any value is NaN; `how='all'` drops only if all are. | Use when NaNs are a small percentage; be careful about data loss. |
| `.fillna()` | `df['col'].fillna(value)`, `df['col'].ffill()` | Replaces `NaN` values with a constant, a mean/median, or by forward-fill (`.ffill()`) / backward-fill (`.bfill()`). | Generally preferred over `dropna` to retain data. Note that `fillna(method='ffill')` is deprecated in recent pandas; call `.ffill()`/`.bfill()` directly. |
| `.astype()` | `df['col'].astype(int/float/bool)` | Converts a column to a specified data type. | Crucial for enabling correct operations and memory efficiency. |
| `pd.to_numeric()` / `pd.to_datetime()` | `pd.to_numeric(df['col'], errors='coerce')`, `pd.to_datetime(df['col'])` | Robust conversion functions; `errors='coerce'` turns unparseable values into `NaN`. | Use these for safer conversions, especially with mixed data. |
| `.rename()` | `df.rename(columns={'old': 'new'})` | Renames columns or index labels. | Standardize names (e.g. `snake_case`) for clarity. |
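A small cleaning pipeline sketch combining these steps on made-up messy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"qty": ["1", "2", "oops", "4"],
                   "price": [10.0, np.nan, 30.0, 40.0]})

df["qty"] = pd.to_numeric(df["qty"], errors="coerce")    # 'oops' becomes NaN
n_missing = df.isnull().sum()                            # count NaNs per column
df["price"] = df["price"].fillna(df["price"].median())   # impute, don't drop
df = df.rename(columns={"qty": "quantity"})              # standardize the name
```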
### Reshaping & Combining

| Method | Syntax | Use Case / Explanation | Key Tip / Interview Point |
| --- | --- | --- | --- |
| `groupby()` | `df.groupby('col')['val'].sum()`, `df.groupby(['col1', 'col2'])['val'].agg(['mean', 'max'])` | Splits data into groups, applies an aggregation, and combines results (split-apply-combine). | Fundamental for summarizing data. |
| `groupby().transform()` | `df.groupby('col')['val'].transform('mean')` | Group-wise operation that returns a result the same size as the original DataFrame. | Useful for normalization (e.g. score vs. class average). |
| `groupby().apply()` | `df.groupby('col').apply(custom_func)` | Most flexible: applies an arbitrary function to each group; can return results of any size. | Use as a last resort when `agg()` or `transform()` don't fit. |
| `pd.merge()` | `pd.merge(df1, df2, on='key', how='inner'/'left'/'right'/'outer')` | Combines DataFrames based on common column(s) or index (SQL JOIN equivalent). | Know all `how` types (`inner` is the default). |
| `pd.concat()` | `pd.concat([df1, df2], axis=0, ignore_index=True)` | Stacks DataFrames vertically (`axis=0`, default) or horizontally (`axis=1`). | Use `ignore_index=True` to reset the index after stacking. |
| `pd.pivot_table()` | `pd.pivot_table(df, index='rows', columns='cols', values='vals', aggfunc='sum')` | Reshapes data from "long" to "wide" format, creating a summary table. | Excellent for creating human-readable reports. |
| `pd.melt()` | `pd.melt(df, id_vars=['id_cols'], var_name='new_col', value_name='new_val')` | Reshapes data from "wide" to "long" format; the inverse of pivoting. | Useful for preparing data for certain ML models or visualizations. |
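A sketch tying these together on an invented sales dataset: aggregate with `groupby`, broadcast with `transform`, join a lookup table with `merge`, then pivot wide and melt back to long:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "year":   [2023, 2024, 2023, 2024],
    "amount": [100, 150, 200, 250],
})
regions = pd.DataFrame({"region": ["N", "S"], "manager": ["Ada", "Lin"]})

totals = sales.groupby("region")["amount"].sum()            # split-apply-combine
share = sales["amount"] / sales.groupby("region")["amount"].transform("sum")
joined = sales.merge(regions, on="region", how="left")      # SQL-style left join
wide = sales.pivot_table(index="region", columns="year",
                         values="amount", aggfunc="sum")    # long -> wide
long_again = wide.reset_index().melt(id_vars="region",
                                     value_name="amount")   # wide -> long
```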
### Advanced: Time Series

| Method | Syntax | Use Case / Explanation | Key Tip / Interview Point |
| --- | --- | --- | --- |
| `pd.to_datetime()` | `df['Date'] = pd.to_datetime(df['Date'])` | Converts string/object columns to proper datetime objects. | Essential first step for any time-series analysis. |
| `DatetimeIndex` | `df.set_index('Date', inplace=True)` | Setting a datetime column as the index enables powerful time-based slicing. | Allows `df.loc['YYYY-MM']` for easy date-range selection. |
| `.resample()` | `df_indexed.resample('W').sum()` | Changes the frequency of time-series data (e.g. daily to weekly totals). | Common codes: `D`, `W`, `M`, `Q`, `A` (recent pandas renames month-end and year-end to `ME`/`YE`). |
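A sketch of the three steps on invented daily sales (`MS` is the month-start frequency alias):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-08", "2024-02-01"],
    "sales": [10, 20, 30, 40],
})
df["Date"] = pd.to_datetime(df["Date"])  # step 1: real datetimes
df = df.set_index("Date")                # step 2: DatetimeIndex

jan = df.loc["2024-01"]                  # partial-string slicing by month
monthly = df.resample("MS").sum()        # step 3: daily -> monthly totals
```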
### Advanced: Text Data

| Method | Syntax | Use Case / Explanation | Key Tip / Interview Point |
| --- | --- | --- | --- |
| `.str` accessor | `df['col'].str.strip()`, `.str.lower()`, `.str.replace('old', 'new')`, `.str.contains('pattern')` | Applies Python string methods to an entire Series of text data. | Vectorized string operations are much faster than loops. |
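A minimal sketch on made-up messy strings; `.str` methods chain naturally:

```python
import pandas as pd

s = pd.Series(["  Alice  ", "BOB", "carol"])

clean = s.str.strip().str.lower()          # normalize whitespace and case
has_o = clean.str.contains("o")            # boolean mask per element
swapped = clean.str.replace("bob", "rob")  # literal (non-regex) replacement
```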
### Advanced: MultiIndex

| Method | Syntax | Use Case / Explanation | Key Tip / Interview Point |
| --- | --- | --- | --- |
| MultiIndex | `df.groupby(['col1', 'col2'])`, `df.loc[('Level1', 'Level2')]` | Hierarchical indexing: multiple levels of row/column labels, often the result of `groupby()`. | Organizes complex, grouped data elegantly. |
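A sketch of how a two-key `groupby` produces a MultiIndex, and how to look it up or flatten it (data invented):

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "sex":   ["F", "M", "F", "M"],
    "score": [90, 80, 70, 60],
})

grouped = df.groupby(["class", "sex"])["score"].mean()  # MultiIndex Series
a_f = grouped.loc[("A", "F")]                           # tuple-based lookup
flat = grouped.reset_index()                            # back to plain columns
```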
### Best Practices

| Practice | Syntax | Use Case / Explanation | Key Tip / Interview Point |
| --- | --- | --- | --- |
| Vectorized operations | `df['col'] * 2` (vs. a Python loop) | Always prefer built-in Pandas/NumPy operations over Python loops for performance. | This is the #1 performance tip in Pandas. |
| Proper data types | `df['col'].astype('category')` | Use the most memory-efficient data types (e.g. `int8`, `float32`, `category`). | Saves memory, especially for large datasets. |
| Method chaining | `df.method1().method2().method3()` | Link multiple operations together in a single, fluid expression. | Improves readability and avoids intermediate variables. |
| Explicit indexing | `df.loc[row, col] = value` | Always use `.loc` or `.iloc` for selection and assignment to prevent `SettingWithCopyWarning`. | Avoid chained indexing like `df['col'][row]` for assignment. |
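A sketch combining three of these practices on invented data (`observed=True` simply silences the categorical-groupby warning in recent pandas):

```python
import pandas as pd

df = pd.DataFrame({"dept": ["IT", "HR", "IT", "HR"],
                   "salary": [100, 60, 120, 70]})

# Vectorized arithmetic: no Python loop
df["salary_k"] = df["salary"] * 1000

# Memory-efficient dtype for low-cardinality text
df["dept"] = df["dept"].astype("category")

# Method chaining: one fluid expression, no throwaway variables
summary = (
    df.groupby("dept", observed=True)["salary"]
      .mean()
      .rename("avg_salary")
      .reset_index()
)
```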
