Featuretools Spark

Featuretools Spark is a prepackaged solution for scaling your feature engineering workloads with Apache Spark.

Requirements

  • A running Spark cluster

  • A Featuretools Enterprise License

  • Results from Featuretools running on a sample of your data (see the sketch below)
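
For the last requirement, the sample can be produced however you like; one approach is to sample whole customers, so that each sampled customer keeps a complete transaction history. A minimal sketch (the sample size of 1,000 customers and the file paths are illustrative):

import pandas as pd

df = pd.read_csv('./data/transactions.csv')

# Sample whole customers rather than individual rows, so every sampled
# customer keeps a complete transaction history (sample size illustrative)
customer_ids = df['customer_id'].drop_duplicates().sample(n=1000, random_state=0)
sample = df[df['customer_id'].isin(customer_ids)]
sample.to_csv('./data/transactions_sample.csv', index=False)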

How it works

Below is an example of using Featuretools without Spark to perform automated feature engineering.

import featuretools as ft
import pandas as pd

# Function to load a pandas DataFrame into a Featuretools EntitySet
def df_to_entityset(df):
    es = ft.EntitySet('transactions')
    es.entity_from_dataframe(entity_id='transactions',
                             dataframe=df,
                             index='transaction_id',
                             time_index='transaction_time')
    # Create a second entity by normalizing out the customer-level variables
    es.normalize_entity(base_entity_id="transactions",
                        new_entity_id="customers",
                        index="customer_id",
                        additional_variables=["device", "zip_code",
                                              "date_of_birth", "join_date"])
    return es

# Load data and build the EntitySet
df = pd.read_csv('./data/transactions.csv')
entityset = df_to_entityset(df)

# Generate feature definitions, then compute the feature matrix
features = ft.dfs(target_entity="customers",
                  entityset=entityset,
                  features_only=True)
fm = ft.calculate_feature_matrix(features, entityset=entityset)
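
Note that features_only=True returns unevaluated feature definitions rather than computed values, so the same list can be persisted and reused by a later job. A short sketch using Featuretools' save_features and load_features (the file name is illustrative):

# Persist the feature definitions produced on the sample
ft.save_features(features, 'features.json')

# Later, e.g. inside the Spark job, reload the same definitions
features = ft.load_features('features.json')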

To use Featuretools Spark, we only need to make a few changes:

import featuretools.spark as fts
import pyspark

# Configure Spark and load the data as a Spark DataFrame
sc = pyspark.SparkContext(appName='Featuretools Spark')
spark = pyspark.sql.SparkSession(sc)
# header=True keeps the column names that df_to_entityset refers to
spark_df = spark.read.csv('./data/transactions.csv',
                          header=True, inferSchema=True)

# Create a Spark EntitySet, partitioned by customer
entityset = fts.EntitySet(spark_df, df_to_entityset,
                          partition_by="customer_id")

# The calculate call stays the same, reusing the same feature definitions as above
fm = fts.calculate_feature_matrix(features, entityset=entityset)
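
The partition_by argument suggests that each worker receives all rows for one customer_id and builds a local EntitySet from them. For intuition, here is a rough sketch of that partition-and-apply pattern written directly against plain PySpark; this is not the fts implementation, it assumes Spark 3.0+, and calculate_partition and schema_string are hypothetical names:

import featuretools as ft

def calculate_partition(pdf):
    # pdf is a pandas DataFrame holding every row for one customer_id
    es = df_to_entityset(pdf)
    fm = ft.calculate_feature_matrix(features, entityset=es)
    return fm.reset_index()

# schema_string must enumerate the feature matrix columns, for example
# "customer_id string, `COUNT(transactions)` bigint, ..."
fm = (spark_df
      .groupBy("customer_id")
      .applyInPandas(calculate_partition, schema=schema_string))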

Additional Resources