Cleaning the data

Importing

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
from scipy.stats import gaussian_kde
import scipy
import scipy.sparse
import patsy
from statistics import median
import bz2
import math
def sort_cols(test):
    return(test.reindex(sorted(test.columns), axis=1))

# Assuming model is in the same directory as the python file

# update data from 2003 to 2010

frames = {}
for year in [2003,2004, 2005, 2006,2007,2008,2009,2010]:
    fil = "pickle_data/FACTOR_MODEL/" + "pandas-frames." + str(year) + ".pickle.bz2"
    frames.update(pd.read_pickle(fil))


for x in frames: 
    frames[x] = sort_cols(frames[x])

# Assuming model is in the same directory as the python file
# update data from 2003 to 2010
covariance = {}
for year in [2003,2004, 2005, 2006,2007,2008,2009,2010]:
    fil =  "pickle_data/FACTOR_MODEL/" + "covariance." + str(year) + ".pickle.bz2"
    covariance.update(pd.read_pickle(fil))

frames["20040102"]
1DREVRSL AERODEF AIRLINES ALUMSTEL APPAREL AUTO BANKS BETA BEVTOB BIOLIFE ... SPTYSTOR STREVRSL SpecRisk TELECOM TRADECO TRANSPRT TotalRisk VALUE WIRELESS Yield
0 -0.032 0.0 0.0 0.0 0.0 0.0 0.00 -2.177 0.0 0.0 ... 0.0 0.548 9.014505 0.054 0.0 0.0 13.959397 0.644 0.0 0.188679
1 0.684 0.0 0.0 0.0 0.0 0.0 0.00 -2.045 0.0 0.0 ... 0.0 0.986 19.304651 0.000 0.0 0.0 23.393503 -0.700 0.0 3.203463
2 0.235 0.0 0.0 0.0 0.0 0.0 0.00 -2.010 0.0 0.0 ... 0.0 -0.256 19.802556 0.000 0.0 0.0 22.135724 0.513 0.0 1.290796
3 0.759 0.0 0.0 0.0 0.0 0.0 0.00 -0.948 0.0 0.0 ... 0.0 1.841 31.274403 0.000 0.0 0.0 35.785120 0.609 0.0 NaN
4 0.674 0.0 0.0 0.0 0.0 0.0 0.00 -2.148 0.0 0.0 ... 0.0 -0.588 13.533480 0.000 0.0 0.0 18.928129 2.986 0.0 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12430 -0.953 0.0 0.0 0.0 0.0 0.0 0.97 -2.009 0.0 0.0 ... 0.0 0.430 9.095567 0.000 0.0 0.0 13.125982 0.928 0.0 2.770352
12431 -0.808 0.0 0.0 0.0 0.0 0.0 0.97 -2.066 0.0 0.0 ... 0.0 0.561 8.739695 0.000 0.0 0.0 12.888183 0.966 0.0 7.175297
12432 -0.754 0.0 0.0 0.0 0.0 0.0 0.97 -2.053 0.0 0.0 ... 0.0 0.734 9.565838 0.000 0.0 0.0 13.436052 0.818 0.0 2.770352
12433 -0.647 0.0 0.0 0.0 0.0 0.0 0.97 -2.022 0.0 0.0 ... 0.0 0.680 9.394557 0.000 0.0 0.0 13.256003 0.774 0.0 2.770352
12434 0.331 0.0 0.0 0.0 0.0 0.0 0.00 -1.985 0.0 0.0 ... 0.0 -0.519 9.591641 0.000 0.0 0.0 12.277461 0.316 0.0 4.148421

12435 rows × 93 columns

Data Cleaning and Winsorization

The distribution of many statistics can be heavily influenced by outliers. A simple approach to robustifying parameter estimation procedures is to set all outliers to a specified percentile of the data; for example, a 90% winsorization would see all data below the 5th percentile set to the 5th percentile, and data above the 95th percentile set to the 95th percentile. Winsorized estimators are usually more robust to outliers than their more standard forms.

def wins(x, a, b):
    return(np.where(x <= a, a, np.where(x >= b, b, x)))

def clean_nas(df):
    numeric_columns = df.select_dtypes(include=[np.number]).columns.tolist()

    for numeric_column in numeric_columns:
        df[numeric_column] = np.nan_to_num(df[numeric_column])

    return df

Density Polt

def density_plot(data, title):
    density = gaussian_kde(data)
    xs = np.linspace(np.min(data), np.max(data), 200)
    density.covariance_factor = lambda: .25
    density._compute_covariance()
    plt.plot(xs, density(xs))
    plt.title(title)
    plt.show()

test = frames['20040102']
density_plot(test['Ret'], 'Daily return pre-winsorization')
density_plot(wins(test['Ret'], -0.25, 0.25), 'Daily return winsorized')
density_plot(test['SpecRisk'] / (100 * math.sqrt(252)), 'SpecRisk')

Data Definitions

Many of the remaining columns are industry factors, of which a full list is given below.

industry_factors = ['AERODEF', 'AIRLINES', 'ALUMSTEL', 'APPAREL', 'AUTO', 'BANKS', 'BEVTOB', 
'BIOLIFE', 'BLDGPROD', 'CHEM', 'CNSTENG', 'CNSTMACH', 'CNSTMATL', 'COMMEQP', 'COMPELEC',
'COMSVCS', 'CONGLOM', 'CONTAINR', 'DISTRIB', 'DIVFIN', 'ELECEQP', 'ELECUTIL', 'FOODPROD', 
'FOODRET', 'GASUTIL', 'HLTHEQP', 'HLTHSVCS', 'HOMEBLDG', 'HOUSEDUR', 'INDMACH', 'INSURNCE',
'LEISPROD', 'LEISSVCS', 'LIFEINS', 'MEDIA', 'MGDHLTH', 'MULTUTIL', 'OILGSCON', 'OILGSDRL', 
'OILGSEQP', 'OILGSEXP', 'PAPER', 'PHARMA', 'PRECMTLS', 'PSNLPROD', 'REALEST', 'RESTAUR', 
'ROADRAIL', 'SEMICOND', 'SEMIEQP', 'SOFTWARE', 'SPLTYRET', 'SPTYCHEM', 'SPTYSTOR', 'TELECOM', 
'TRADECO', 'TRANSPRT', 'WIRELESS']

style_factors = ['BETA', 'SIZE', 'MOMENTUM', 'VALUE']

def get_formula(alpha):
    L = ["0", alpha]
    L.extend(style_factors)
    L.extend(industry_factors)
    return "Ret ~ " + " + ".join(L)

To solve Problem 1, we need to create a function named estimate_factor_returns that performs several key tasks:

Filter the Data: The function should first filter the input dataframe df to include only rows where IssuerMarketCap > 1e9. This step ensures that the analysis focuses on liquid stocks.

Winsorize the Return: The function should apply the winsorization process to the ‘Ret’ column of the dataframe. This step will mitigate the impact of outliers in the data.

Run OLS Regression: Using the formula generated by get_formula function with the provided alpha factor, the function should perform an Ordinary Least Squares (OLS) regression. The dependent variable in this regression will be the winsorized return.

Return Results: The function should return the slope coefficients from the regression, and it can also return additional information from the regression results if desired.

structural exposures - Security analysts coeff - optimisation

problem 1

Write a function called “estimate_factor_returns” which takes as arguments, an alpha factor name, and an input data frame “df”. The function must not modify the contents of the data frame. The function will first subset to IssuerMarketCap > 1e9 to obtain a liquid universe, and then run an OLS regression using the formula returned by the get_formula function defined above. The dependent variable in the regression must be winsorized return, using the winsorisation procedure discussed above in this notebook. The estimate_factor_returns function should return an object which, at very least, contains the slope coefficients from the regression, and may contain other information.

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

def estimate_factor_returns(alpha, df):
    # Filtering data for IssuerMarketCap > 1e9
    filtered_df = df[df['IssuerMarketCap'] > 1e9]

    # Winsorizing the 'Ret' column
    winsorized_ret = wins(filtered_df['Ret'], -0.25, 0.25) # ret is return

    # Running OLS regression
    formula = get_formula(alpha)
    model = smf.ols(formula, data=filtered_df.assign(Ret=winsorized_ret)).fit()

    # Returning the slope coefficients and optionally other information
    return model.params

# Example usage
alpha_factor_name = 'PROFIT' 
df = frames["20030103"] 
results = estimate_factor_returns(alpha_factor_name, df)
print(results)
PROFIT     -0.000699
BETA       -0.000370
SIZE       -0.001563
MOMENTUM   -0.001272
VALUE      -0.000369
              ...   
SPTYSTOR   -0.004445
TELECOM    -0.011490
TRADECO    -0.000577
TRANSPRT   -0.006902
WIRELESS    0.003033
Length: 63, dtype: float64
  • wins is the previously defined function for winsorization.
  • get_formula is the function provided in your notebook that generates the formula for the regression.
  • The dataframe df is expected to include all the necessary columns like ‘IssuerMarketCap’, ‘Ret’, and any columns relevant to the alpha factor and style/industry factors.

problem 2

Write a function which iteratively calls estimate_factor_returns on every data frame in the dictionary called “frames” above, and returns a univariate time series containing the coefficient on the alpha factor for each date. Since only the alpha factor return is saved, style and industry factor returns are discarded by this procedure.

def iteratively_call_estimate_factor_return(frames, alpha_factors):
    # DataFrame to store alpha factors and their corresponding time series
    all_alpha_factor_returns = pd.DataFrame(index=frames.keys())

    # Iterate through each date and DataFrame in the frames dictionary
    for date, df in frames.items():
        for alpha in alpha_factors:
            # Call the estimate_factor_returns function for each alpha factor
            factor_returns = estimate_factor_returns(alpha, df)
            # Extract the alpha factor return and store it with the corresponding date
            all_alpha_factor_returns.loc[date, alpha] = factor_returns.get(alpha, np.nan)

    return all_alpha_factor_returns

# Example usage
alpha_factors = ['STREVRSL', 'PROFIT', 'SENTMT']
alpha_time_series_df = iteratively_call_estimate_factor_return(frames, alpha_factors)

problem 3

For each of the potential alpha factors STREVRSL, PROFIT, SENTMT, run the function created in Problem 2 and plot the cumulative sum of the resulting time series. Do any of these qualify as potential alpha factors? Do the cumulative returns from the plot represent returns on any tradable assets or portfolios?

plt.plot(pd.Series(alpha_time_series_df["STREVRSL"]).cumsum(), label="STREVRSL")
plt.plot(pd.Series(alpha_time_series_df["PROFIT"]).cumsum(), label="PROFIT")
plt.plot(pd.Series(alpha_time_series_df["SENTMT"]).cumsum(), label="SENTMT")

plt.title("Cumulative Returns of Potential Alpha Factors")
plt.xlabel("Date")
plt.ylabel("Cumulative Returns")
plt.legend()
plt.xticks(rotation=45)
plt.xticks(np.arange(0, len(alpha_time_series_df.index), 50), alpha_time_series_df.index[np.arange(0, len(alpha_time_series_df.index), 50)])
plt.show()

An alpha factor is a variable or set of variables that a model uses to predict stock returns. To qualify as a potential alpha factor, a variable should ideally show a consistent ability to generate excess returns over a benchmark.

STREVRSL shows a steep upward trend, which suggests that this factor might have been predictive of positive returns over the period shown. If these returns are risk-adjusted and beat a benchmark, STREVRSL could be considered a strong alpha factor.

PROFIT and SENTMT exhibit a more modest upward trajectory. This suggests that while they may have some predictive power, their ability to generate excess returns isn’t as pronounced as STREVRSL within the time frame of the data.

Part 4: Do the cumulative returns from the plot represent returns on any tradable assets or portfolios? The cumulative returns plotted for these alpha factors represent theoretical returns based on a model’s predictions. They are not directly the returns of tradable assets but indicate how an investment strategy based on these factors might have performed. To translate these into actual tradable strategies:

The factors would need to be a part of a trading model that determines how much to invest in different assets.

The model would adjust its positions regularly based on updated factor scores and predictions.

Transaction costs, slippage, market impact, and other real-world trading constraints would affect the actual returns realized.

The model’s predictions would need to be converted into portfolio weights, and these weights would then be used to construct a portfolio of stocks or other assets.

The plotted cumulative returns suggest how a strategy solely based on these alpha factors might have accumulated gains over time, but translating these theoretical returns into actual tradable strategies requires a comprehensive investment model and consideration of trading costs and constraints. If the factors are robust and continue to perform well out-of-sample (i.e., in future unseen data), they might be incorporated into real trading strategies or investment portfolios.