pandas+#

Overview#

Finplicity provides basic operations for cleaning and reformatting a table in Excel. However, sometimes you need even more power and flexibility.

pandas+ incorporates the power of the SQL without the complexity of the SQL language. Use pandas+ to write a Script to:

select rows from an existing Excel table based on flexible conditions
drop columns from a larger table
add new columns based on the values of existing columns, including using conditional logic where the values in the new column are based on other columns only under specified conditions
rename a column
construct advanced “groupby” and many other advanced operations that would normally require many lines of code in SQL
combine multiple pandas+ commands into a single orderly script to create a “data pipeline”
save this script as part of Workflow so that the pipeline automatically processes new data

At the top of the Script box, there is a script command line generator comprising three elements: the “insert a command” dropdown, the “insert a column” dropdown, and the “cell select” button. These tools are designed to save you time when typing commands.

Insert a command dropdown: provides six basic commands, which are discussed below under Basic Commands. When you select a command from this dropdown, it will automatically fill into the Script box at the current cursor location.
Insert a column dropdown: presents options that correspond to the headers of your original data table. Upon selecting a value from this dropdown, the chosen column name will be inserted into the Script box at the current cursor location.
Single cell select button: facilitates the insertion of an Excel cell location into the Script box. Simply select the desired cell location in the Excel worksheet and click the button. The selected Excel cell location will then be automatically filled into the Script box at the current cursor location.

Saved Scripts also allows for easy auditing of data manipulation, thereby removing a major shortcoming of Excel usage in the business world.

Examples#

Here is a step-by-step example of using the following data and pandas+ to select data for the years between the values in cells A4 and A7.

Prepare an Excel sheet with data to manipulate.
Open Finplicity, and then click Data Select > pandas+. The form of pandas+ will appear, as shown below.
Select the “Original table location” in the spreadsheet. In this example, select B1:J33.
In this example, firstly, use insert command dropdown to select “SELECT ROWS WHERE.” You will see the command autofilled in the Script box. And there is a command hit which you can refer to at the bottom of the Script box.
In Excel worksheet select A4. And click the cell select button to insert the single cell range into Script box.
In the Script box, type “ <= “ using your keyboard.
Use the insert a column dropdown, select “Year.” You will see the column name autofilled in the Script box. The insert column dropdown options are the original data table headers.
In the Script box, type “ <= “ using your keyboard.
In Excel worksheet select A7. And click the cell select button to insert the single cell range into Script box. Now you have added one command into the Script box. You can add multiple lines of command by separating each command using the <Enter> key on your keyboard.
Decide if you want the new table to overwrite the existing table or to be saved at a new “Output location.” In this example, choose “Output location” and select M1.
Now you have finished all the required fields in this form, and you can click “Run” to see the select data result or click “Save” to save this task. The selected data will automatically be shown in its output location. As you can see from the result table, the data only contain rows for the years between the values in cells A4 and A7. Even though the values in cell A4 and A7 were set as static numbers 2007 and 2012, you can dynamically change the values using Control Board.

About Commands#

Basic Commands#

We have provided six basic commands in the insert command dropdown for your convenience. They will be explained using the following table (“Original Table”) in Excel that tracks basic population statistics across many countries. The first 25 rows of the table are shown here:

Original Table: Population Statistics

This table is selected as the “Original table location” and becomes the initial _ Active Table _ residing in computer memory. In the Script Box, the following commands modify Active Table. Each command alters Active Table sequentially for subsequent commands, thereby creating an orderly process of data selection and manipulation.

SELECT ROWS WHERE <comparison>#

Description: Updates Active Table by selecting only the rows based on the conditional logic in <comparison>. All non-selected rows are, therefore, dropped from Active Table.

Example #1:

SELECT ROWS WHERE 1960 < year <= 1970 and pop count < 10000000

Explanation: Active Table is updated to include all table rows with a year greater than 1960 and less or equal to 1970 and with pop count less than 10000000.

Example #2:

SELECT ROWS WHERE (1960 < year <= 1970 and pop count < 10000000 and year != 1967) or continent == ‘Asia’ or country in [‘Afghanistan’,’Bahrain’]

Explanation: Active Table is updated to include all table rows for the countries Afghanistan and Bahrain as well as all countries in the continent of Asia. For other countries, only rows are selected with a year greater than 1960 and less or equal to 1970 but not including 1967, and with pop count less than 10000000.

Usage tips:
- No commas in numbers: Do not include commas in numbers, as it creates ambiguity with text comparisons.
- Comparison Operators: Comparison operators follow the convention of the Python programming language (https://www.w3schools.com/python/gloss_python_comparison_operators.asp). The comparison operator != stands for “not equal” while operator == stands for equality. The other comparison operators, including < and <= , take their usual meaning.
- In [list] operation. Many Python set and list operations are also permitted. Probably the most important operation is the in [list] sequence that allows for compact notation where country in [‘Afghanistan’,’Bahrain’] is a more compact way of writing (country == ‘Afghanistan’) or (country == ‘Bahrain’).
- Double equals: Double equal comparison notation, X == Y, is used to logically compare whether X is equal to Y. As discussed below, a single equal sign, X = Y, will instead attempt to assign Y’s value to X, which is never intended when used SELECT ROWS WHERE. pandas+ will typically catch the invalid use of = when == is intended.
- Quotes for referencing text values in table rows: When referring to a value in a data row that contains text, the text value in the <comparison> must be enclosed in single quotes (e.g., ‘Afghanistan’) or double quotes (e.g., “Afghanistan”). Row values can have spaces in them within quotes (e.g., “The United States of America”). Back ticks (discussed below) are never used for comparisons with row values.
- Back ticks for referencing column names that contain spaces between text: column names should never be placed in quotes (e.g., continent ). Column names are not text values but identifiers. If the column name does not contain a space, just enter the column name without additional notation continent. However, back ticks must be used for a column name that contains a space between text (e.g., pop count ).
- Ignore leading and trailing spaces: Leading and trailing spaces in column names in the Excel table are automatically stripped out to allow for easy comparisons. For example, if a column named ˽˽˽ continent ˽˽ has three leading spaces and two trailing spaces in the Excel table, those spaces are dropped so that the pandas+ Script can just refer to continent instead of worrying about keeping track of spaces. However, spaces between text words are never dropped (e.g., the column pop count or the row value “The United States of America”).
- Working with Excel Dates: In the current example, the date (year) is represented by a simple number that allows for an easy comparison against an integer value like 1960 that is not in quotes. If date information in the Excel table takes on more complex formats, comparisons can typically be made against a date specified in ‘YYYY-MM-DD’ format.

CALCULATE COLUMN <assignment>#

Description: Updates Active Table based on rules specified in <assignment> by modifying an existing column being assigned or by adding a new column if the assigned column does not already exist. You can add or modify a column using CALCULATE COLUMN command.

Example:

CALCULATE COLUMN GDP = gdpPercap * pop count

Explanation: Active Table is updated by adding column GDP if this column does not already exist. If column GDP already existed in Active Table, the row information in that column is modified according to the formula shown.

Usage tips:
- Single equals. Single equal (=) notation represents assignment. pandas+ might produce an error or unexpected results if double equal notation (==) is encountered, unless == is correctly used as part of a Conditional Assignment discussed below.
- Mathematical operators. A wide range of Python mathematical assignment operators can be used (https://www.w3schools.com/python/gloss_python_arithmetic_operators.asp). One major difference to Excel pertains to the use of the ^ sign, which represents exponent in Excel. pandas+ uses ** notation to represent exponent. In pandas+, the ^ sign refers to an advanced binary (bit) command and so its usage might not produce an error message even if ** were intended.
- Compound operations. Compound mathematical expressions can be used as well, e.g.,

CALCULATE COLUMN weird GDP = ((gdpPercap * pop count)**2) / 100

Explanation: a new column is added to Active Table called weird GDP (which contains a space surrounding by back ticks) where each row value is calculated by taking the square of the value in the same row from column GDP, divided by 100.

Conditional Assignments. Assignment (=) operations can be combined with comparison (==) operations to allow new row values to depend on various comparison conditions across different columns.

SELECT COLUMNS <columns>#

Description: Updates Active Table by selecting the columns with the names listed in <columns> separated by spaces; all other columns are, therefore, dropped from Active Table. Columns will be ordered as shown in <columns> regardless of the previous order. You can use SELECT COLUMNS command to select a single column.

Example:

SELECT COLUMNS year country gdpPercap pop count

Explanation: Active Table is updated to include only the columns shown, and in the column order shown. All other columns, therefore, are dropped.

Usage tips:
- Consider DROP COLUMNS: If the table has many columns to be selected, it might be easier to instead DROP COLUMNS the columns to be dropped.
- Reordering columns: In the last example, suppose Active Record contained the shown columns but in a different column order. SELECT COLUMNS will then reorder the columns to the desired order even if there are no columns to drop.

DROP COLUMNS <columns>#

Description: Updates Active Table by dropping the columns with the names listed in <columns> separated by spaces; all other columns, therefore, are kept. You can use DROP COLUMNS command to drop a single column.

Example:

DROP COLUMNS continent lifeExp

Explanation: Active Table is updated by dropping the shown columns. If this script were executed on the same Active Table in memory prior to the SELECT COLUMNS example provided above, the updated Active Table would contain the same information. However, the SELECT COLUMNS example also rearranged the order of the columns to the new order shown in that example.

RENAME COLUMNS <columns> TO <columns>#

Description: Updating Active Table by renaming the columns with new names. Using spaces as separator. You can use RENAME COLUMNS command to rename one column.

Example:

RENAME COLUMNS pop count TO population count

_Explanation: Active Table is updated by renaming the shown columns.

Add a comment#

Description: to improve readability, comments are important in especially long script. Whole-line and in-line Python-style comments that start with the # sign are allowed. All information after # is ignored until a new <Enter> is encountered. Comments can even be added within a command that is formatted using <Enter> across multiple lines.

Example:

# This selection criteria comes from my client, per email on 4/5/2021.

SELECT ROWS WHERE (1960 < year <= 1970 and pop count < 10000000 and year != 1967)<Enter>

or continent == ‘Asia’ # client specifically asked about Asia <Enter>

or country in [‘Afghanistan’,’Bahrain’] # and these countries <Enter>

Hint: since the pandas+ Script is easy to read, the best # comments explain why an operation is being done, not what the operation is doing. The “what” should already be obvious from the Script.

Advanced Commands: Pandas Methods#

The Basic Commands discussed above can be combined with literally hundreds of different Python Pandas methods that perform a wide range of functionality, including round the numbers in the table, sorting table rows based on a given column, dropping rows with empty values, interpolating missing data, along with many more methods. Moreover, unlike SQL where these operations typically require many lines of code (along with the creation of stored routines), many pandas+ operations can be performed in a single line of code that is easily readable from left to right.

One such option is groupby discussed here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby

Example:

groupby([‘country’]).mean()

Explanation: Active Table is updated where each country now has just a single row despite having many rows (one per year) before this command. This data collapse can occur because the new table reports the mean values for the remaining columns.

Usage tips:
- More commands: Many of the “Methods” for Python Pandas can be used, provided that the method operates on a single table (known as a dataframe in Python Pandas): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html Scroll down the page in this last link to look for the section called “Methods.”
- Quotes for all text: The Pandas library requires all text references, including column names without spaces, to be embedded within quotes (e.g., ‘country’ ). To maintain consistency, pandas+ will automatically convert back ticks (e.g., country ) to quotes for column names, thereby allowing you to use back ticks whenever referring to column names while using ‘quotes’ to refer to row values.
- Readability: Methods can usually be strung together for easy reading from left to right, like an English sentence. The following example also rounds the results in the previous example to two decimal places and sorts the table with countries with the largest population on top.

groupby([‘country’]).mean().round(2).sort_values(by=pop count,ascending=False)

Methods are separated by a dot ( . ) or, alternatively, using the <Enter> key:

groupby([‘country’]).mean().round(2) <Enter>

sort_values(by=pop count,ascending=False).dropna() <Enter>

Where the dropna (dropping rows with missing values) was added just for fun.

FAQ#

What is [select all] and […] (ellipsis) in “insert a column” dropdown?

“[Select all]” indicates that all column names from the original data table are included, serving as a shortcut to avoid manual typing of each column name. “[…]” denotes all other column names.

For instance, suppose your current data table contains columns y1, y2, y3, y4, and y5.

“SELECT COLUMNS y3 …” would translate to “SELECT COLUMNS y3 y1 y2 y4 y5.”

“SELECT COLUMNS y2 … y4” would translate to “SELECT COLUMNS y2 y1 y3 y5 y4.”

What is the difference in usage between back ticks and ‘quote’ symbols?

Back ticks are only used for column names (identifiers) and ‘quotes’ are only used for referring to text values contained in data rows. Use back ticks for column names that contain spaces between words (e.g., pop count); if a column name does not have space between words (e.g., popcount), back ticks are optional. But quotes must always be used to refer to row text values, whether the text value has a space (e.g., “The United States of America”) or not (e.g., “Argentina”).

The pandas+ script fails to write any data to the “Select output location.” What happened?

That might be that one or more pandas+ commands reduced the Active Table to no data (a null table). No error message is reported since a null table is sometimes a valid answer. Remember that each pandas+ command modifies the current in-memory Active Table sequentially to support a logical sequence across pandas+ commands. Hence, after a pandas+ command is executed, the Active Table is typically different from the original table. This new Active Table becomes the input for the next pandas+ command.

CALCULATE COLUMN produced only zeros or empty values. What is wrong?

Two possibilities. First, one or more columns in the mathematical assignment within CALCULATE COLUMN command contain zeros, thereby potentially zeroing out the entire value under multiplication. Second, the assignment operator includes comparison logic that all evaluated to False.

What are other advanced commands besides groupby can be used?

Most self-contained methods in Python Pandas that operate on a single table (also called a Pandas data frame) will work. See Advanced Commands: Pandas Methods. Also, check out the Finplicity website for video examples that we will continue to create over time.