This is the first part of a Pandas manual that I’ve composed as a quick reference. I’ve been using it for some time for personal needs, and after some updates and changes, I’m sharing it here.

The content is compiled from various sources: primarily the official Pandas documentation, as well as Daniel Chen’s video Pandas is for Everyone (PyCon 2019) and Kevin Markham’s video Data Science Best Practices with pandas (PyCon 2019). I also include some tips and tricks I’ve learned from my own projects.

Pandas can work with a variety of data sources, which can be textual, binary, or SQL-based. Popular data formats are:

• CSV (text)
• JSON (text)
• Microsoft Excel (binary)
• SQL (SQL)
• HTML (text)

Besides these, there are several other sources that Pandas can read directly (OpenDocument, Python pickle, Google BigQuery, and more).

For each of these types, Pandas offers read and write functions, named with the read_ and to_ prefixes. So the popular data formats can be read and written using these functions:

• CSV: read_csv / to_csv
• JSON: read_json / to_json
• Excel: read_excel / to_excel
• SQL: read_sql / to_sql
• HTML: read_html / to_html

Most frequently I deal with tabular data contained in text files, for example, a dataset of weekly product views that I can import with the read_csv function.
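A minimal sketch of such an import, using made-up weekly views data written to a temporary file (the original dataset is not shown here):

```python
import os
import tempfile

import pandas as pd

# Hypothetical weekly product-views data, written to a temporary
# file so the example is self-contained.
csv_text = "product,week,views\nsoap,1,103\nsoap,2,95\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write(csv_text)
    path = f.name

views = pd.read_csv(path)  # the only argument is the file path
os.unlink(path)
```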

The only parameter I pass in this example is the file path (filepath_or_buffer). This parameter can be a file system path or a URL (http, ftp, S3). It can also accept an open file object, an in-memory stream for text I/O (StringIO), or any other object that implements a read() method and returns text data.

You can load data from an in-memory stream. If, for some reason, the source argument is not in the first position, you can pass it explicitly by its name, filepath_or_buffer.
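A sketch with a small in-memory stream, passing the first parameter by its keyword name:

```python
from io import StringIO

import pandas as pd

buffer = StringIO("col1,col2\n1,2\n3,4\n")     # in-memory text stream
df = pd.read_csv(filepath_or_buffer=buffer)    # the first parameter, passed by name
```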

Pandas also gives you a way to load data from a remote location: pass a URL (http, ftp, S3) as the file path.

## Data Parsing

Below are several aspects of Pandas data IO that influence how the data is read and parsed.

### Separator (delimiter)

Parameters:

• sep, or
• delimiter

The default separator used by read_csv is the comma (,). If the data uses a different separator (e.g., a tab), you need to pass it as the value of the sep parameter. The alternative name for this parameter is delimiter.

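For example, reading tab-separated data (made-up values):

```python
from io import StringIO

import pandas as pd

tsv = "col1\tcol2\n1\t2\n3\t4\n"
df = pd.read_csv(StringIO(tsv), sep="\t")   # delimiter="\t" works the same way
```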

#### Detecting the separator

When the sep parameter is set to None, Pandas is sometimes able to detect the separator using csv.Sniffer from the standard library. This feature is available only with the Python parsing engine, so unless you specify that engine, you get a warning message.

You can detect the tab separator and suppress the warning message by specifying the parsing engine explicitly.
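A sketch of separator sniffing on sample tab-separated data:

```python
from io import StringIO

import pandas as pd

data = "col1\tcol2\n1\t2\n3\t4\n"
# sep=None asks Pandas to sniff the separator;
# engine="python" avoids the ParserWarning
df = pd.read_csv(StringIO(data), sep=None, engine="python")
```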

### Multicharacter (regex) separator

When the data is separated with multiple characters, you can pass a regular expression to the sep parameter.

A possible problem: when your data is quoted, regex separator parsing may not work.

Regex separators work only with the somewhat slower Python parser.
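As a sketch, with a made-up three-dash separator (any sep longer than one character is treated as a regular expression and requires the Python engine):

```python
from io import StringIO

import pandas as pd

data = "col1---col2\n1---2\n3---4\n"
# a multicharacter sep is treated as a regex (Python engine only)
df = pd.read_csv(StringIO(data), sep="---", engine="python")
```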

### Quoted and escaped data

Often data may contain separator characters in the column values; for example, text columns containing commas in a comma-separated file. In this case, column values are quoted so that the parser can ignore separator characters between the quotes.

Most often, the data is quoted with the double-quote character ("). Pandas recognizes such quoting automatically.

If the quoting character is different, you can specify it with the quotechar parameter.
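For example, with single-quoted values (made-up data):

```python
from io import StringIO

import pandas as pd

data = "col1,col2\n'a,b',2\n"
# the single quotes protect the comma inside the first value
df = pd.read_csv(StringIO(data), quotechar="'")
```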

Sometimes, column values containing separator characters are not quoted. Instead, the separators are escaped. In this case, you need to use the escapechar parameter.
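A sketch with a backslash-escaped comma:

```python
from io import StringIO

import pandas as pd

data = "col1,col2\na\\,b,2\n"                     # the comma in "a,b" is escaped
df = pd.read_csv(StringIO(data), escapechar="\\")
```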

If, in quoted data, the quote character appears inside a column value and is self-escaped (i.e., given twice: ""), you can tell Pandas to treat it as a single character that is part of the value. This is controlled by the doublequote parameter, which is True by default.

### Quoted data and regex separator

To illustrate the problem that the regex separator has with quoted data, look at the data string below. The first-row column values contain the character sequence --- that is also used to separate columns. These values are quoted to prevent their mixing with separators. To indicate that you are using quoted data, you can pass the quotechar parameter. Still, the regex separator cannot detect the quotes properly.

### Dialect

Separators, quoting, and escaping can be controlled collectively by a dialect. You can override all default values used to handle separators, quotes, and escape characters by specifying one of the csv dialects registered in Python’s csv module:

• excel (the behavior read_csv uses by default)
• excel-tab
• unix

When you specify a dialect, you tell Pandas to use its values for the following parameters:

• delimiter
• doublequote
• escapechar
• lineterminator
• quotechar
• quoting
• skipinitialspace
• strict

Below are the values supplied by the various dialects.
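The registered dialects and their values can be inspected directly with the csv module; a small sketch:

```python
import csv

# Print the parsing values each registered csv dialect supplies
for name in sorted(csv.list_dialects()):       # 'excel', 'excel-tab', 'unix', ...
    d = csv.get_dialect(name)
    print(name, repr(d.delimiter), repr(d.quotechar), d.doublequote,
          repr(d.escapechar), d.quoting, d.skipinitialspace)
```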

#### Custom dialects

For more flexibility, you can define your own dialect and set its properties to the values you need to load your data source.

You can also base a custom dialect on an existing one, modifying only the properties you need.

In the example below, one of the column values contains a quote character. The dialect used by read_csv by default is excel; it treats " as the quoting character and will fail if it finds an unclosed quote.

The following custom dialect is based on excel but is set up to ignore quoting.
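A sketch of such a dialect, registered under a made-up name and applied to made-up data with a stray quote:

```python
import csv
from io import StringIO

import pandas as pd

class IgnoreQuotes(csv.excel):       # start from the 'excel' dialect...
    quoting = csv.QUOTE_NONE         # ...but treat quote characters as plain data

csv.register_dialect("ignore_quotes", IgnoreQuotes)

data = 'col1,col2\n5" pipe,2\n'      # an unpaired quote inside a value
df = pd.read_csv(StringIO(data), dialect="ignore_quotes")
```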

### Comments

Some data sources may include comments. Comments can appear in two ways:

• Entire lines commented out, or
• A column value (or part of it) commented out.

Pandas allows specifying the comment character via the comment parameter.

Without the comment parameter, commented lines and values will be imported as data.

If you need to hide comments, you must specify the comment character. Only single characters are accepted, e.g., #.
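A sketch with made-up data containing both a full-line and a trailing comment:

```python
from io import StringIO

import pandas as pd

data = "col1,col2\n# a commented-out line\n1,2\n3,4#the rest of the line is dropped\n"
df = pd.read_csv(StringIO(data), comment="#")
```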

Note that comment suppression hides the entire content of the line past the comment character: the separators are ignored, and the values for columns col2 and col3 are not read.

### Empty lines

Empty lines are ignored by default. If empty lines need to be imported (e.g., because they actually mean missing data that you will have to impute later), you can set the skip_blank_lines parameter to False.
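A sketch where the blank line becomes a row of NaN values:

```python
from io import StringIO

import pandas as pd

data = "col1,col2\n1,2\n\n3,4\n"
df = pd.read_csv(StringIO(data), skip_blank_lines=False)  # blank line -> NaN row
```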

### Missing values

Pandas lets you control which values are interpreted as missing (NA) values. Some strings are recognized as missing by default. They are:

• -1.#IND
• 1.#QNAN
• 1.#IND
• -1.#QNAN
• #N/A
• N/A
• n/a
• NA
• #NA
• NULL
• null
• NaN
• -NaN
• nan
• -nan
• '' (empty string)

When necessary, you can extend the list with your custom entries. The custom values can be strings or numbers. For example, suppose you want the letter 'c' and the number 1 to be processed as missing values.
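A sketch of that case with made-up data:

```python
from io import StringIO

import pandas as pd

data = "col1,col2\nc,1\n2,3\n"
# 'c' and 1 are treated as missing, in addition to the default NA strings
df = pd.read_csv(StringIO(data), na_values=["c", 1])
```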

If you want to ignore the default missing values, you can tell Pandas by setting the keep_default_na parameter to False.
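With the defaults switched off, strings like 'NA' and 'null' survive as ordinary values (made-up data):

```python
from io import StringIO

import pandas as pd

data = "col1,col2\nNA,1\nnull,2\n"
df = pd.read_csv(StringIO(data), keep_default_na=False)  # 'NA'/'null' stay strings
```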

## Data Columns

When Pandas imports data, it assumes that the first row contains column names. If the first row contains something other than column names, you can specify which row to use as the header.
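For example, skipping a made-up preamble line and taking the header from the second row:

```python
from io import StringIO

import pandas as pd

data = "exported on 2019-09-26\ncol1,col2\n1,2\n"
df = pd.read_csv(StringIO(data), header=1)   # the real header is in row 1
```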

### Duplicate column names

If the data contains duplicate column names, Pandas will automatically append a numeric suffix (.1, .2, and so on) to each duplicate.
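A minimal sketch of the renaming:

```python
from io import StringIO

import pandas as pd

data = "col,col\n1,2\n"
df = pd.read_csv(StringIO(data))   # the second 'col' is renamed automatically
```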

### Custom column names

You can supply the names parameter with a list of column names to use instead of those in the data source. If you omit the header parameter (or set it to None), the row with the original header will be imported as a row of data.

To avoid importing the original header as data, you need to specify its row number in the header parameter.
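Both behaviors, sketched on made-up data:

```python
from io import StringIO

import pandas as pd

data = "a,b\n1,2\n"

# without header=0 the original header row becomes a row of data
df1 = pd.read_csv(StringIO(data), names=["col1", "col2"])
# header=0 replaces the original header instead
df2 = pd.read_csv(StringIO(data), names=["col1", "col2"], header=0)
```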

### Filtering by columns

If you need only a few of the columns in the data source, you can list them in the usecols parameter.
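For example, keeping two of three made-up columns:

```python
from io import StringIO

import pandas as pd

data = "col1,col2,col3\n1,2,3\n"
df = pd.read_csv(StringIO(data), usecols=["col1", "col3"])
```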

### Index columns

Sometimes, the data source contains fewer columns in the header row than there are columns in the following rows. In this case, Pandas will treat the first column as an index column.

You can also explicitly specify the location of the index column with the index_col parameter. This is handy when the number of header columns matches the number of value columns. The index gets its name from the corresponding header column.
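Both cases, sketched with made-up data:

```python
from io import StringIO

import pandas as pd

# one header column fewer than data columns: the first column becomes the index
implicit = pd.read_csv(StringIO("col1,col2\na,1,2\nb,3,4\n"))

# same width everywhere: name the index column explicitly
explicit = pd.read_csv(StringIO("id,col1,col2\na,1,2\nb,3,4\n"), index_col="id")
```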

### Trailing delimiters

Sometimes only some of the rows in the data source have an extra column. This can happen when such rows end in a trailing delimiter. In this case, automatic index-column detection will not work. To avoid the problem, set index_col to False.
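A sketch with made-up rows ending in trailing commas:

```python
from io import StringIO

import pandas as pd

data = "col1,col2\n1,2,\n3,4,\n"                  # note the trailing commas
df = pd.read_csv(StringIO(data), index_col=False)
```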

### MultiIndex

If the data has a multi-index structure and the header row does not include the index columns, Pandas can recognize the MultiIndex automatically.

When the MultiIndex columns are named in the header row, you can pass a list of the index column positions to the index_col parameter.
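For example, with two made-up index levels:

```python
from io import StringIO

import pandas as pd

data = "region,city,sales\nnorth,oslo,10\nnorth,bergen,20\n"
df = pd.read_csv(StringIO(data), index_col=[0, 1])   # two-level MultiIndex
```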

## Data Types

### Explicit column types

If you are sure about the data types in the data source, you can specify them explicitly, either for the entire DataFrame or for specific columns.

When you pass object to the dtype parameter, all the data is read as strings.

When you have mixed-type data, you can import some columns as strings and others as integers or floats.
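A sketch with a per-column dtype mapping (made-up data):

```python
from io import StringIO

import pandas as pd

data = "col1,col2\n1,2\n3,4\n"
df = pd.read_csv(StringIO(data), dtype={"col1": str, "col2": float})
```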

### Coercing a data type

When a numerical column contains strings, you can enforce numerical conversion after the import using the to_numeric function. Pass it the column and errors='coerce'.
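For example, with a made-up non-numeric value that becomes NaN:

```python
from io import StringIO

import pandas as pd

data = "col1\n1\noops\n3\n"
df = pd.read_csv(StringIO(data))                         # col1 comes in as strings
df["col1"] = pd.to_numeric(df["col1"], errors="coerce")  # 'oops' becomes NaN
```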

### Parsing numbers

Parsing integers is straightforward and happens automatically.

If the source data has numbers formatted as floats (i.e., with a decimal separator), they will also be parsed as the float data type automatically. A single float-formatted value in a column of numbers is enough for the whole column to be parsed as float.

If, during the import, you try to parse a column with float-formatted values as int, Pandas will raise an error. However, casting float columns to int is possible after the import.

To handle commas as the decimal separator, pass decimal=','. With comma decimals in a comma-separated data source, make sure the decimal values are quoted (or use a different column separator).

To parse numbers containing a thousands separator, pass it in the thousands parameter.
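Both parameters combined, sketched on made-up European-formatted numbers (a semicolon separator avoids having to quote the comma decimals):

```python
from io import StringIO

import pandas as pd

data = "price;views\n1,5;1.234.567\n"
df = pd.read_csv(StringIO(data), sep=";", decimal=",", thousands=".")
```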

### Parsing Boolean values

If a data source contains values like ‘True’, ‘true’, ‘TRUE’, ‘False’, ‘false’, ‘FALSE’, columns containing these values will automatically be parsed as booleans. Note that if even a single value in a column cannot be parsed as a boolean, the entire column will be parsed as strings (object).

You can add more values that should be parsed as boolean by passing them in lists to true_values and false_values. Passing the integers 0 and 1 will not work, but such values can be cast to boolean after the import.
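A sketch with made-up 'yes'/'no' values mapped to booleans:

```python
from io import StringIO

import pandas as pd

data = "col1\nyes\nno\nTrue\n"
df = pd.read_csv(StringIO(data), true_values=["yes"], false_values=["no"])
```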

### Parsing DateTime

By default, datetime values are parsed as strings. By passing True to the parse_dates parameter, you can have them parsed into datetime objects. Below, the resulting DataFrame gets a DatetimeIndex.
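A sketch combining parse_dates=True with an index column (made-up dates):

```python
from io import StringIO

import pandas as pd

data = "date,views\n2019-09-25,80\n2019-09-26,100\n"
df = pd.read_csv(StringIO(data), index_col=0, parse_dates=True)
```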

The parameter parse_dates can be used to indicate what columns should be parsed into dates.

By passing a list of column numbers (or names) to it, you can tell Pandas to try to parse each of those columns into datetime objects. Note that when encountering a time-only string, the parser assumes the current date; so the column tea is interpreted as a time on 2019-09-26, the day this example was run.
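A minimal sketch of parsing a named column:

```python
from io import StringIO

import pandas as pd

data = "date,views\n2019-09-26,100\n"
df = pd.read_csv(StringIO(data), parse_dates=["date"])   # or parse_dates=[0]
```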

If you need to combine data from several columns to build datetime values out of them, you can pass a nested list to the parse_dates parameter. So, you will get date and time combined in the columns date_lunch and date_tea.

By default, the original columns are discarded after the parsing. You can tell Pandas to keep them by passing True to the keep_date_col parameter.

Pandas can parse variously formatted date strings. Note that the data in the column date2 could not be parsed: its values put the day before the month and cannot be parsed into datetime objects without the dayfirst parameter.

The same data, now parsed with dayfirst=True. Note that the values in the first four columns are parsed with the day-before-month assumption; however, the last column is still parsed with the assumption that the month precedes the day.
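A minimal sketch of the dayfirst behavior on a made-up ambiguous date:

```python
from io import StringIO

import pandas as pd

data = "date\n03/04/2019\n"
# dayfirst=True reads this as 3 April, not 4 March
df = pd.read_csv(StringIO(data), parse_dates=["date"], dayfirst=True)
```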

Date strings without delimiters are parsed without regard to the dayfirst parameter, unless a value can only be parsed with the day-first assumption (e.g., ‘20191703’).

## Error Processing

Sometimes, the data source can contain lines that have more separators than other lines. In this case, columns cannot be recognized and the parser cannot generate a DataFrame.

You can choose to skip bad lines by passing error_bad_lines=False (deprecated since Pandas 1.3 in favor of on_bad_lines='skip').
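A sketch with a made-up malformed line, using the newer on_bad_lines parameter (Pandas 1.3 and later):

```python
from io import StringIO

import pandas as pd

data = "col1,col2\n1,2\n3,4,5\n6,7\n"                  # the third line is malformed
df = pd.read_csv(StringIO(data), on_bad_lines="skip")  # drop it instead of failing
```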

The parser will also output a warning that says which lines were skipped. To suppress this warning, you can also pass warn_bad_lines=False.