The Rise of Automated Trading: Machines Trading the S&P 500

Published: 2022-03-11

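Nowadays, more than 60 percent of trading activity across different assets (such as stocks, index futures, and commodities) is no longer carried out by "human" traders; it relies on automated trading instead. Specialized programs based on particular algorithms automatically buy and sell assets across different markets, aiming for a positive return in the long run.

In this article, I'm going to show you how to predict, with reasonable accuracy, how the next trade should be placed to get a positive gain. For this example, as the underlying asset to trade, I selected the S&P 500 index, a capitalization-weighted index of 500 large US companies. A very simple strategy to implement is to buy the S&P 500 index when the Wall Street exchange starts trading at 9:30 a.m. and sell it at the closing session at 4:00 p.m. Eastern Time. If the closing price of the index is higher than the opening price, there is a positive gain, whereas there is a negative gain if the closing price is lower than the opening price. So the question is: how do we know whether the trading session will end with a closing price higher than the opening price? Machine learning is a powerful tool for such a complex task, and it can support us in making trading decisions.

Machine learning is the new frontier of many useful real-life applications. Financial trading is one of them, and machine learning is used very often in this sector. An important concept in machine learning is that we do not need to write code for every possible rule, such as pattern recognition. This is because every machine learning model learns from the data itself and can later be used to predict unseen new data.

Disclaimer: The purpose of this article is to show how to train machine learning methods, and not every function in the provided code examples is explained. The article is not intended to let you copy and paste all the code and run the same tests, as some details that are beyond its scope are missing. Basic knowledge of Python is also required. The main aim of the article is to show an example of how machine learning can be effective at predicting buys and sells in the financial sector. However, trading with real money requires many other skills, such as money management and risk management. This article is just a small piece of the "big picture."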

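Building Your First Automated Trading Program on Financial Data

So, you want to create your first program to analyze financial data and predict the right trade? Let me show you how. I will be using Python for the machine learning code, and we will be using historical data from the Yahoo Finance service. As mentioned before, historical data is needed to train the model before making any predictions.

To begin, we need to install:

Python, and in particular I suggest using IPython notebook.
The Yahoo Finance Python package (the exact name is yahoo-finance), via the terminal command: pip install yahoo-finance.
A free trial version of the machine learning package called GraphLab. Feel free to check the useful documentation of that library.

Note that only a part of GraphLab is open source, the SFrame, so to use the entire library we need a license. There is a 30-day free license, and a non-commercial license for students or those participating in Kaggle competitions. In my opinion, GraphLab Create is a very intuitive and easy-to-use library for analyzing data and training machine learning models.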

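Digging into the Python Code

Let's dig into some Python code to see how to download financial data from the Internet. I suggest using IPython notebook to test the following code, because IPython has many advantages over a traditional IDE, especially when we need to combine source code, execution code, table data, and charts in the same document. For a brief explanation of how to use IPython Notebook, please look at the Introduction to IPython Notebook article.

So, let's create a new IPython notebook and write some code to download the historical prices of the S&P 500 index. Note that, if you prefer to use other tools, you can start with a new Python project in your preferred IDE.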

    from __future__ import division  # __future__ imports must come before any other import

    from datetime import datetime

    import graphlab as gl
    from yahoo_finance import Share

    # download historical prices of the S&P 500 index
    today = datetime.strftime(datetime.today(), "%Y-%m-%d")
    stock = Share('^GSPC')  # ^GSPC is the Yahoo Finance symbol for the S&P 500 index
    # we gather historical quotes from 2001-01-01 up to today
    hist_quotes = stock.get_historical('2001-01-01', today)
    # here is what a row looks like
    hist_quotes[0]

    {'Adj_Close': '2091.580078',
     'Close': '2091.580078',
     'Date': '2016-04-22',
     'High': '2094.320068',
     'Low': '2081.199951',
     'Open': '2091.48999',
     'Symbol': '%5eGSPC',
     'Volume': '3790580000'}

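Here, hist_quotes is a list of dictionaries, and each dictionary object is a trading day with Open, High, Low, Close, Adj_Close, Volume, Symbol, and Date values. During each trading day, the price usually moves from the opening price Open to the closing price Close, hitting a maximum and a minimum value, High and Low. We need to read through the list and build a separate list for each of the most relevant fields. The quotes arrive with the most recent day first (the row above is dated 2016-04-22), so we reverse the list to get chronological order: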

    l_date = []
    l_open = []
    l_high = []
    l_low = []
    l_close = []
    l_volume = []

    # reverse the list so that the oldest quote comes first
    hist_quotes.reverse()
    for quotes in hist_quotes:
        l_date.append(quotes['Date'])
        l_open.append(float(quotes['Open']))
        l_high.append(float(quotes['High']))
        l_low.append(float(quotes['Low']))
        l_close.append(float(quotes['Close']))
        l_volume.append(int(quotes['Volume']))

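We can pack all the downloaded quotes into an SFrame object, which is a highly scalable, column-based data frame, and it is compressed. One of its advantages is that it can be larger than the amount of RAM because it is disk-backed. You can check the documentation to learn more about SFrame.

So, let's store and then check the historical data: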

    qq = gl.SFrame({'datetime': l_date,
                    'open': l_open,
                    'high': l_high,
                    'low': l_low,
                    'close': l_close,
                    'volume': l_volume})
    # datetime is a string, so convert it into a datetime object
    qq['datetime'] = qq['datetime'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))

    # just to check that the data is sorted in ascending order
    qq.head(3)

close	datetime	high	low	open	volume
1283.27	2001-01-02 00:00:00	1320.28	1276.05	1320.28	1129400000
1347.56	2001-01-03 00:00:00	1347.76	1274.62	1283.27	1880700000
1333.34	2001-01-04 00:00:00	1350.24	1329.14	1347.56	2131000000

Now we can save the data to disk with the SFrame method save, as follows:

    qq.save("SP500_daily.bin")
    # once data is saved, we can use the following instruction to retrieve it
    qq = gl.SFrame("SP500_daily.bin/")

Let's See What the S&P 500 Looks Like

To see what the loaded S&P 500 data looks like, we can use the following code:

    import matplotlib.pyplot as plt
    # the next line is only for those who are using IPython notebook
    %matplotlib inline
    plt.plot(qq['close'])

The output of the code is the following chart:

Figure: chart of the S&P 500 index, generally growing over time, as produced by the Python code above.

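Training Some Machine Learning Models

Adding the Outcome

As I stated in the introductory part of this article, the goal of each model is to predict whether the closing price will be higher than the opening price. In that case, we can achieve a positive return by buying the underlying asset. So, we need to add an outcome column to our data, which will be the target or predicted variable. Each row of this new column will be:

+1 for an Up day, with a Closing price higher than the Opening price.
-1 for a Down day, with a Closing price lower than the Opening price (the code below also assigns -1 when the two prices are equal).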

    # add the outcome variable: +1 if the trading session was positive (close > open), -1 otherwise
    qq['outcome'] = qq.apply(lambda x: 1 if x['close'] > x['open'] else -1)

    # we also need to add three new columns 'ho', 'lo' and 'gain'
    # they will be useful to backtest the model, later
    qq['ho'] = qq['high'] - qq['open']  # distance between Highest and Opening price
    qq['lo'] = qq['low'] - qq['open']  # distance between Lowest and Opening price
    qq['gain'] = qq['close'] - qq['open']
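To make the labeling rule concrete, here is a minimal sketch in plain Python (no GraphLab required) that applies it to the first three rows of the table shown earlier. The helper name label_bar is hypothetical and exists only for this illustration.

```python
# Plain-Python illustration of the outcome and gain columns.
# label_bar is a hypothetical helper, not part of GraphLab Create.

def label_bar(open_price, close_price):
    """Return (outcome, gain) for one trading session."""
    # Same rule as above: +1 only when close > open, -1 otherwise (ties included)
    outcome = 1 if close_price > open_price else -1
    gain = close_price - open_price
    return outcome, gain

# (date, open, close) taken from the first three rows of the downloaded data
bars = [
    ("2001-01-02", 1320.28, 1283.27),
    ("2001-01-03", 1283.27, 1347.56),
    ("2001-01-04", 1347.56, 1333.34),
]

labels = [label_bar(o, c) for _, o, c in bars]
for (date, _, _), (outcome, gain) in zip(bars, labels):
    print("%s outcome=%+d gain=%.2f" % (date, outcome, gain))
```

Only the second session closed above its open, so it is the only Up day of the three.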

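Since we need to evaluate some days before the last trading day, we need to lag the data by one or more days. For that kind of lagging operation, we need another object from the GraphLab package called TimeSeries. TimeSeries has a method shift that lags data by a certain number of rows.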

    ts = gl.TimeSeries(qq, index='datetime')
    # add the outcome variable: +1 if the bar was positive (close > open), -1 otherwise
    ts['outcome'] = ts.apply(lambda x: 1 if x['close'] > x['open'] else -1)

    # GENERATE SOME LAGGED TIMESERIES
    ts_1 = ts.shift(1)  # by 1 day
    ts_2 = ts.shift(2)  # by 2 days
    # ...etc....
    # how many days of lag are needed to create a good forecaster is an arbitrary decision,
    # so everyone can experiment on their own

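Adding Predictors

Predictors are a set of feature variables that must be chosen to train the model and predict our outcome. So, the choice of predictors is a crucial, if not the most important, component of the forecaster.

Just to name a few, a factor to consider may be whether today's close is higher than yesterday's close, and that might be extended to the close of two days earlier, and so on. A similar choice can be translated into the following code: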

    ts['feat1'] = ts['close'] > ts_1['close']
    ts['feat2'] = ts['close'] > ts_2['close']

As shown above, I have added two new feature columns, feat1 and feat2, to our data set (ts), containing 1 if the comparison is true and 0 otherwise.
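For readers without GraphLab at hand, the lag-and-compare idea can be sketched with plain lists. This is only an illustration: shift_forward is a hypothetical stand-in for a TimeSeries-style shift, it assumes that lagging by n rows leaves the first n positions empty (None), and the closing prices are made up.

```python
# Plain-Python sketch of lagging a series and building comparison features.
# shift_forward and greater_than_lag are hypothetical helpers for illustration.

def shift_forward(values, n):
    """Lag a list by n rows, padding the start with None."""
    if n <= 0:
        return list(values)
    pad = min(n, len(values))
    return [None] * pad + list(values[:len(values) - pad])

def greater_than_lag(values, n):
    """1 if the value is above the one n rows earlier, 0 if not, None when there is no earlier row."""
    lagged = shift_forward(values, n)
    return [None if prev is None else int(cur > prev)
            for cur, prev in zip(values, lagged)]

closes = [100.0, 102.0, 101.0, 103.0, 99.0]  # made-up closing prices
feat1 = greater_than_lag(closes, 1)  # today's close vs. yesterday's close
feat2 = greater_than_lag(closes, 2)  # today's close vs. the close two days ago
print(feat1)
print(feat2)
```

The first rows have no earlier value to compare against, which is why a real data set loses a few rows at the start once lagged features are added.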

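This article is intended to give an example of machine learning applied to the financial sector. I prefer to focus on how machine learning models can be used with financial data, and we won't go into detail about how to choose the right factors to train the models. Explaining why certain factors are used rather than others would take too long, because the complexity increases considerably. My research work is to study many hypotheses of factor selection in order to create a good predictor. So, for a start, I suggest you experiment with many different combinations of factors to see whether they increase the accuracy of the model.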

    # add_features is a helper function, which is out of the scope of this article,
    # and it returns a tuple with:
    # ts: a timeseries object with, in addition to the already included columns, also lagged columns
    #     as well as some features added to train the model, as shown above with feat1 and feat2 examples
    # l_features: a list with all features used to train Classifier models
    # l_lr_features: a list with all features used to train Linear Regression models
    ts, l_features, l_lr_features = add_features(ts)

    # add the gain column, for trading operations with LONG only positions.
    # The gain is the difference between Closing price - Opening price
    ts['gain'] = ts['close'] - ts['open']

    ratio = 0.8  # 80% of training set and 20% of testing set
    split = int(round(len(ts) * ratio))  # slice indices must be integers
    training = ts.to_sframe()[0:split]
    testing = ts.to_sframe()[split:]
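The split above is chronological rather than random: the first 80 percent of the rows train the model and the most recent 20 percent test it, so the model never sees data from the period it is evaluated on. A minimal sketch of the same idea on a plain list, where chronological_split is a hypothetical helper name:

```python
# Plain-Python sketch of a chronological train/test split.
# chronological_split is a hypothetical helper used only for illustration.

def chronological_split(rows, ratio=0.8):
    """Split rows, already sorted by date, into (training, testing) without shuffling."""
    split = int(round(len(rows) * ratio))
    return rows[:split], rows[split:]

days = list(range(10))  # stand-in for ten trading days in ascending date order
training_rows, testing_rows = chronological_split(days, ratio=0.8)
print(training_rows)
print(testing_rows)
```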

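Training a Decision Tree Model

GraphLab Create has a very clean interface for implementing machine learning models. Each model has a create method used to fit the model with a training data set. Typical parameters are:

training - a training set containing the feature columns and a target column.
target - the name of the column containing the target variable.
validation_set - a data set for monitoring the model's generalization performance. In our case, we have no validation_set.
features - a list of column names of the features used for training the model.
verbose - if true, progress information is printed during training.

Other parameters are specific to the model itself, such as:

max_depth - the maximum depth of a tree.

With the following code, we build our decision tree: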

    max_tree_depth = 6
    decision_tree = gl.decision_tree_classifier.create(training, validation_set=None,
                                                       target='outcome', features=l_features,
                                                       max_depth=max_tree_depth, verbose=False)

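Measuring the Performance of the Fitted Model

Accuracy is an important metric to evaluate the goodness of a forecaster. It is the number of correct predictions divided by the total number of data points. Since the model is fitted with training data, the accuracy evaluated with the training set is usually better than the one obtained with a testing set.

Precision is the fraction of positive predictions that are actually positive. We need precision to be a number close to 1 to achieve a "perfect" win rate. Our decision_tree, like every other classifier in the GraphLab Create package, has an evaluate method that returns many important metrics of the fitted model.

Recall quantifies the ability of a classifier to predict positive examples. Recall can be interpreted as the probability that a randomly selected positive example is correctly identified by the classifier. A recall close to 1 means that few Up days are missed.

The following code shows the accuracy of the fitted model with both the training set and the testing set: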

    decision_tree.evaluate(training)['accuracy'], decision_tree.evaluate(testing)['accuracy']

    (0.6077348066298343, 0.577373211963589)

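As shown above, the accuracy of the model with the test set is about 57 percent, which is somewhat better than tossing a coin (50 percent).

Predicting Data

GraphLab Create has the same interface to predict data from different fitted models. We will use the predict method, which needs a test set to predict the target variable, in our case outcome. Now, we can predict data from the testing set: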

    predictions = decision_tree.predict(testing)
    # and we add the predictions column to the testing set
    testing['predictions'] = predictions
    # let's see the first 10 predictions, compared to real values (outcome column)
    testing[['datetime', 'outcome', 'predictions']].head(10)

datetime	outcome	predictions
2013-04-05 00:00:00	-1	-1
2013-04-08 00:00:00	1	1
2013-04-09 00:00:00	1	1
2013-04-10 00:00:00	1	-1
2013-04-11 00:00:00	1	-1
2013-04-12 00:00:00	-1	-1
2013-04-15 00:00:00	-1	1
2013-04-16 00:00:00	1	1
2013-04-17 00:00:00	-1	-1
2013-04-18 00:00:00	-1	1

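False positives are cases where the model predicts a positive outcome whereas the real outcome from the testing set is negative. Vice versa, false negatives are cases where the model predicts a negative outcome whereas the real outcome from the testing set is positive.

Our trading strategy waits for a positively predicted outcome to buy the S&P 500 at the Opening price and sell it at the Closing price, so we want the lowest possible false positive rate in order to avoid losses. In other words, we expect our model to have the highest precision.

As we can see, there are two false negatives (at 2013-04-10 and 2013-04-11) and two false positives (at 2013-04-15 and 2013-04-18) within the first ten predicted values of the testing set.

With a simple calculation, considering this small set of ten predictions:

Accuracy = 6/10 = 0.6 or 60%
Precision = 3/5 = 0.6 or 60%
Recall = 3/5 = 0.6 or 60%

Note that these numbers usually differ from each other, but in this case they are the same.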
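The three figures can be double-checked by counting true and false positives and negatives over the ten rows shown above. This is a plain-Python sketch, not the GraphLab evaluation code; confusion_counts is a hypothetical helper name.

```python
# Recompute accuracy, precision and recall for the ten predictions listed above.
# confusion_counts is a hypothetical helper used only for this check.

def confusion_counts(pairs, positive=1):
    """Return (tp, fp, fn, tn) from a list of (outcome, prediction) pairs."""
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    return tp, fp, fn, tn

# (outcome, prediction) for 2013-04-05 through 2013-04-18, copied from the table
rows = [(-1, -1), (1, 1), (1, 1), (1, -1), (1, -1),
        (-1, -1), (-1, 1), (1, 1), (-1, -1), (-1, 1)]

tp, fp, fn, tn = confusion_counts(rows)
accuracy = (tp + tn) / float(len(rows))  # correct predictions over all predictions
precision = tp / float(tp + fp)          # real Up days among predicted Up days
recall = tp / float(tp + fn)             # predicted Up days among real Up days
print(accuracy, precision, recall)
```

For the strategy described here, fp is the count that costs money (a trade opened on a Down day), whereas fn only represents missed opportunities.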

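Backtesting the Model

We now simulate how the model would trade using its predicted values. If the predicted outcome is equal to +1, it means that we expect an Up day. With an Up day, we buy the index at the beginning of the session and sell it at the end of the session during the same day. Conversely, if the predicted outcome is equal to -1, we expect a Down day, so we will not trade during that day.

The profit and loss (pnl) of a complete daily trade, also called a round turn, is given in this example by:

pnl = Close - Open (for each trading day)

With the code shown below, I call the helper function plot_equity_chart to create a chart with the curve of cumulative gains (the equity curve). Without going too deep, it simply takes a series of profit and loss values and computes the series of cumulative sums to plot.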

    pnl = testing[testing['predictions'] == 1]['gain']  # the gain column contains (Close - Open) values
    # I have written a simple helper function to plot the result of all the trades applied to the
    # testing set and represent the total return expressed in index points
    # (not expressed in dollars $)
    plot_equity_chart(pnl, 'Decision tree model')

    Mean of PnL is 1.843504
    Sharpe is 1.972835
    Round turns 511

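Here, Sharpe is the annual Sharpe ratio, an important indicator of the goodness of a trading model.

The Sharpe ratio is the square root of 252 multiplied by the mean of the pnl and divided by the standard deviation of the pnl:

Sharpe = sqrt(252) * mean / sd

The trades are expressed on a daily basis, mean is the mean of the list of profit and loss values, and sd is their standard deviation. The factor sqrt(252) annualizes the daily figure, 252 being the conventional number of trading days in a year. For simplicity, in the formula above I have considered a risk-free return equal to 0.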
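The quantities printed by the helper (mean of PnL, Sharpe, round turns) and the equity curve it plots can be sketched in a few lines of plain Python. This is not the author's plot_equity_chart: the function names are hypothetical, the pnl values are made up, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the article does not say which variant is used.

```python
import math

# Hypothetical helpers illustrating the equity curve and the annualized Sharpe ratio.

def equity_curve(pnl):
    """Cumulative sum of the per-trade profit and loss values."""
    total, curve = 0.0, []
    for value in pnl:
        total += value
        curve.append(total)
    return curve

def annualized_sharpe(pnl, periods_per_year=252):
    """sqrt(252) * mean / sd, with a risk-free return of 0."""
    n = len(pnl)
    mean = sum(pnl) / float(n)
    # Assumption: sample variance (n - 1 in the denominator)
    variance = sum((x - mean) ** 2 for x in pnl) / float(n - 1)
    return math.sqrt(periods_per_year) * mean / math.sqrt(variance)

pnl = [2.0, -1.0, 3.0, 0.0, 1.0]  # made-up daily results, in index points
curve = equity_curve(pnl)
sharpe = annualized_sharpe(pnl)
print("Mean of PnL is %f" % (sum(pnl) / len(pnl)))
print("Sharpe is %f" % sharpe)
print("Round turns %d" % len(pnl))
```

With only a handful of trades the ratio is not meaningful; the point is only to show how the three printed figures relate to the pnl series.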

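Some Basics about Trading

Trading the index requires buying an asset that is directly derived from the index. Many brokers replicate the S&P 500 index with a derivative product called CFD (Contract for Difference), which is an agreement between two parties to exchange the difference between the opening price and the closing price of a contract.

Example: Buy 1 CFD S&P 500 at the Open (value is 2000) and sell it at the Close of the day (value is 2020). The difference, hence the gain, is 20 points. If each point is worth $25:

The gross gain is 20 points x $25 = $500.

Let's say the broker keeps a slippage of 0.6 points for its own revenue:

The net gain is (20 - 0.6) points x $25 = $485.

Another important aspect to consider is avoiding significant losses within a trade. They can happen whenever the predicted outcome is +1 but the real outcome is -1, which is a false positive. In that case, the session turns out to be a Down day with a closing price lower than the opening price, and we take a loss.

A stop loss order must be placed to cap the maximum loss we are willing to tolerate within a trade, and such an order is triggered whenever the price of the asset goes below a fixed value we have set beforehand.

If we look at the time series downloaded from Yahoo Finance at the beginning of this article, every day has a Low price, which is the lowest price reached during that day. If we set a stop level at -3 points from the Opening price and Low - Open = -5, the stop order will be triggered, and the open position will be closed with a loss of -3 points instead of -5. This is a simple method to reduce the risk. The following code is my helper function to simulate a trade with a stop level: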

    # This is a helper function to trade 1 bar (for example 1 day) with a Buy order at the opening session
    # and a Sell order at the closing session. To protect against adverse movements of the price, a STOP order
    # will limit the loss to the stop level (the stop parameter must be a negative number).
    # Each bar must contain the following attributes:
    # Open, High, Low, Close prices as well as gain = Close - Open and lo = Low - Open
    def trade_with_stop(bar, slippage=0, stop=None):
        """
        Given a bar, with a gain obtained by the closing price - opening price,
        it applies a stop limit order to limit a negative loss.
        If stop is equal to None, then it returns bar['gain'] net of slippage.
        """
        bar['gain'] = bar['gain'] - slippage
        if stop is not None:
            real_stop = stop - slippage
            if bar['lo'] <= stop:
                return real_stop
        # stop is None, or the stop level was not reached
        return bar['gain']
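A quick way to check the stop logic is to run it on a few hand-made bars. The sketch below restates the same rule as a small function that does not modify its input; apply_stop is a hypothetical name, and the bars are made up.

```python
# Non-mutating restatement of the stop rule, for illustration only.
# apply_stop is a hypothetical helper; gain = Close - Open and lo = Low - Open.

def apply_stop(gain, lo, slippage=0.0, stop=None):
    """Return the result of one round turn, in points, net of slippage."""
    if stop is not None and lo <= stop:
        # The low of the day reached the stop level: the trade closes there
        return stop - slippage
    return gain - slippage

stopped = apply_stop(gain=-4.0, lo=-5.0, stop=-3)       # the example from the text
not_stopped = apply_stop(gain=1.5, lo=-2.0, stop=-3)    # the low never reached -3
recovered = apply_stop(gain=2.0, lo=-4.0, stop=-3)      # dipped to -4, then closed up
with_costs = apply_stop(gain=-4.0, lo=-5.0, slippage=0.6, stop=-3)
print(stopped, not_stopped, recovered, with_costs)
```

The third case is worth noticing: with daily bars, a session that dips below the stop level and then recovers is still closed at the stop, so the stop limits losses but can also cut off some winning days.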

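Transaction Costs

Transaction costs are the expenses incurred when buying or selling securities. They include brokers' commissions and spreads (the difference between the price the dealer paid for a security and the price the buyer pays), and they need to be considered if we want to backtest our strategy in a way that resembles a real scenario. Slippage in the trading of stocks often occurs when there is a change in the spread. In this example and for the next simulations, trading costs are fixed as:

Slippage = 0.6 points
Commission = $1 for each trade (one round turn will cost $2)

Just to write some numbers, if our gross gain were 10 points, with 1 point = $25, that is $250 before trading costs, and our net gain would be (10 - 0.6) * $25 - $2 = $233.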
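The same arithmetic can be written as a small function, which makes it easy to try other cost assumptions. The function name net_gain_dollars is hypothetical; the default values mirror the costs fixed above.

```python
# Net gain of one round turn in dollars: (gross points - slippage) * multiplier - 2 * commission.
# net_gain_dollars is a hypothetical helper used only for illustration.

def net_gain_dollars(gross_points, mult=25, slippage=0.6, commission=1):
    """Convert a gross gain in index points into a net gain in dollars."""
    return (gross_points - slippage) * mult - 2 * commission

example = net_gain_dollars(10)                    # the 10-point example above
cfd_example = net_gain_dollars(20, commission=0)  # the earlier CFD example, slippage only
print("%.2f %.2f" % (example, cfd_example))
```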

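The following code shows a simulation of the previous trading strategy with a stop loss of -3 points. The blue curve is the curve of cumulative returns. The only cost accounted for is slippage (0.6 points), and the result is expressed in index points (the same base unit as the S&P 500 values downloaded from Yahoo Finance).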

    SLIPPAGE = 0.6
    STOP = -3
    trades = testing[testing['predictions'] == 1][('datetime', 'gain', 'ho', 'lo', 'open', 'close')]
    trades['pnl'] = trades.apply(lambda x: trade_with_stop(x, slippage=SLIPPAGE, stop=STOP))
    plot_equity_chart(trades['pnl'], 'Decision tree model')
    print("Slippage is %s, STOP level at %s" % (SLIPPAGE, STOP))

    Mean of PnL is 2.162171
    Sharpe is 3.502897
    Round turns 511
    Slippage is 0.6, STOP level at -3

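The following code is used to make predictions in a slightly different way. Please pay attention to the predict method, which is called with the additional parameter output_type = "probability". This parameter is used to return the probabilities of the predicted values instead of their class predictions (+1 for a positively predicted outcome, -1 for a negatively predicted outcome). A probability greater than or equal to 0.5 is associated with a predicted value of +1, and a probability lower than 0.5 is associated with a predicted value of -1. The higher the probability, the greater the chance that we are predicting a real Up day.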

    predictions_prob = decision_tree.predict(testing, output_type='probability')
    # predictions_prob will contain probabilities instead of the predicted class (-1 or +1)

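Now we backtest the model with a helper function called backtest_ml_model, which calculates the series of cumulative returns, including slippage and commissions, and plots their values. For brevity, without thoroughly explaining the backtest_ml_model function, the important detail to highlight is that instead of filtering the days with a predicted outcome = 1, as we did in the previous example, we now filter the predictions_prob values equal to or greater than threshold = 0.5, as follows: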

    trades = testing[predictions_prob >= 0.5][('datetime', 'gain', 'ho', 'lo', 'open', 'close')]
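The effect of the threshold is easy to see on a toy example. The sketch below uses plain lists instead of an SFrame; select_trades is a hypothetical helper, and both the gains and the probabilities are made up.

```python
# Keep only the days whose predicted probability of an Up day reaches the threshold.
# select_trades is a hypothetical helper; the numbers are illustrative only.

def select_trades(gains, probabilities, threshold=0.5):
    """Return the gains of the days on which a trade would be opened."""
    return [g for g, p in zip(gains, probabilities) if p >= threshold]

gains = [1.2, -0.8, 2.5, -1.1, 0.4]        # Close - Open, in points
probs = [0.62, 0.51, 0.58, 0.47, 0.55]     # predicted probability of an Up day

base = select_trades(gains, probs, threshold=0.5)
strict = select_trades(gains, probs, threshold=0.55)
print(len(base), base)
print(len(strict), strict)
```

Raising the threshold trades less often (fewer round turns) and keeps only the days on which the model is more confident.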

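Remember that the net gain of each trading day is: Net gain = (Gross gain - SLIPPAGE) * MULT - 2 * COMMISSION.

Another important metric used to evaluate the goodness of a trading strategy is the maximum drawdown. In general, it measures the largest single drop from peak to bottom in the value of an invested portfolio. In our case, it is the most significant drop from peak to bottom of the equity curve (we have only one asset in our portfolio, the S&P 500). So, given an SArray of profit and loss pnl, we calculate the drawdown as: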

    drawdown = pnl - pnl.cumulative_max()
    max_drawdown = min(drawdown)
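The same computation can be sketched in plain Python. One assumption is made explicit here: the function takes the per-trade results and accumulates them into an equity curve first, then measures the distance from the running peak. If the series passed to the snippet above is already cumulative, the accumulation step is redundant. The function name and the sample values are hypothetical.

```python
# Maximum drawdown of an equity curve built from per-trade results.
# max_drawdown_from_trades is a hypothetical helper; the values are made up.

def max_drawdown_from_trades(pnl):
    """Largest drop of the cumulative equity below its running peak (0 or negative)."""
    equity = 0.0
    peak = None
    worst = 0.0
    for value in pnl:
        equity += value
        # Running maximum of the equity curve, starting from its first point
        peak = equity if peak is None else max(peak, equity)
        worst = min(worst, equity - peak)
    return worst

trades_pnl = [100.0, 50.0, -120.0, -30.0, 200.0, -60.0]  # dollars per round turn
# equity curve: 100, 150, 30, 0, 200, 140 -> the worst drop is from 150 down to 0
worst_drop = max_drawdown_from_trades(trades_pnl)
print(worst_drop)
```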

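The helper function backtest_summary calculates:

Max drawdown (in dollars), as shown above.
Accuracy, with the Graphlab.evaluation method.
Precision, with the Graphlab.evaluation method.
Recall, with the Graphlab.evaluation method.

Putting it all together, the following example shows the equity curve representing the cumulative returns of the model strategy, with all values expressed in dollars.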

    model = decision_tree
    predictions_prob = model.predict(testing, output_type="probability")
    THRESHOLD = 0.5
    bt_1_1 = backtest_ml_model(testing, predictions_prob, target='outcome', threshold=THRESHOLD,
                               STOP=-3, MULT=25, SLIPPAGE=0.6, COMMISSION=1, plot_title='DecisionTree')
    backtest_summary(bt_1_1)

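Figure: DecisionTree chart. The Y axis is labeled "Dollars" and goes up to 30,000; the X axis is labeled "# of roundturns" and extends to 600, as produced by the Python code above. The curve itself is the same as in the previous rendering.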

    Mean of PnL is 54.054286
    Sharpe is 3.502897
    Round turns 511
    Name: DecisionTree
    Accuracy: 0.577373211964
    Precision: 0.587084148728
    Recall: 0.724637681159
    Max Drawdown: -1769.00025

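To increase the precision of the predicted values, instead of the standard probability of 0.5 (50 percent), we choose a higher threshold value so that we are more confident that the model predicts an Up day.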

    THRESHOLD = 0.55
    # it's the minimum threshold to predict an Up day, so hopefully a good day to trade
    bt_1_2 = backtest_ml_model(testing, predictions_prob, target='outcome', threshold=THRESHOLD,
                               STOP=-3, MULT=25, SLIPPAGE=0.6, COMMISSION=1, plot_title='DecisionTree')
    backtest_summary(bt_1_2)

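Figure: DecisionTree chart. The Y axis is labeled "Dollars" and goes up to 30,000; the X axis is labeled "# of roundturns" and this time extends only to 250, as produced by the Python code above. The curve is similar to the previous rendering but smoother.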

    Mean of PnL is 118.244689
    Sharpe is 6.523478
    Round turns 234
    Name: DecisionTree
    Accuracy: 0.560468140442
    Precision: 0.662393162393
    Recall: 0.374396135266
    Max Drawdown: -1769.00025

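As we can see from the chart above, the equity curve is much better than before (Sharpe is 6.5 instead of 3.5), even with fewer round turns.

From this point on, we will consider all the next models with a threshold higher than the standard value.

Training a Logistic Classifier

As we did before with the decision tree, we can apply our research to a Logistic Classifier model. GraphLab Create has the same interface for the Logistic Classifier object, and we will call the create method to build our model with the same list of parameters. Moreover, we prefer to predict the probability vector instead of the predicted class vector (composed of +1 for a positive outcome and -1 for a negative outcome), so we will use a threshold greater than 0.5 to achieve a better precision in our forecasting.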

    model = gl.logistic_classifier.create(training, target='outcome', features=l_features,
                                          validation_set=None, verbose=False)
    predictions_prob = model.predict(testing, 'probability')
    THRESHOLD = 0.6
    bt_2_2 = backtest_ml_model(testing, predictions_prob, target='outcome', threshold=THRESHOLD,
                               STOP=-3, plot_title=model.name())
    backtest_summary(bt_2_2)

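Figure: LogisticClassifier chart. The Y axis is labeled "Dollars" and this time goes up to 50,000; the X axis is labeled "# of roundturns" and now extends to 450, as produced by the Python code above. The overall trend of the curve is similar to the previous renderings.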

    Mean of PnL is 112.704215
    Sharpe is 6.447859
    Round turns 426
    Name: LogisticClassifier
    Accuracy: 0.638491547464
    Precision: 0.659624413146
    Recall: 0.678743961353
    Max Drawdown: -1769.00025

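In this case, the summary is very similar to that of the decision tree. After all, both models are classifiers; they only predict a class of binary outcomes (+1, -1).

Training a Linear Regression Model

The main difference of this model is that it deals with continuous values instead of binary classes, as mentioned before. We do not have to train the model with a target variable equal to +1 for Up days and -1 for Down days; our target must be a continuous variable. Since we want to predict a positive gain, or in other words a closing price higher than the opening price, the target must now be the gain column of our training set. Also, the list of features must be composed of continuous values, such as the previous Open, Close, and so on.

For brevity, I won't go into the details of how to select the right features, as this is beyond the scope of this article, which is more inclined to show how we should apply different machine learning models over a data set. The list of parameters passed to the create method is:

training - a training set containing the feature columns and a target column.
target - the name of the column containing the target variable.
validation_set - a data set for monitoring the model's generalization performance. In our case, we have no validation_set.
features - a list of column names of the features used for training the model; for this model we will use a different set from the one used for the classifier models.
verbose - if true, progress information is printed during training.
max_iterations - the maximum number of allowed passes through the data. More passes over the data can result in a more accurately trained model.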

    model = gl.linear_regression.create(training, target='gain', features=l_lr_features,
                                        validation_set=None, verbose=False, max_iterations=100)
    predictions = model.predict(testing)
    # a linear regression model predicts continuous values, so we need to estimate their
    # probabilities of success and normalize all values in order to have a vector of probabilities
    predictions_max, predictions_min = max(predictions), min(predictions)
    predictions_prob = (predictions - predictions_min) / (predictions_max - predictions_min)
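The normalization used above is a min-max rescaling: the smallest predicted gain maps to 0 and the largest to 1. A plain-Python sketch follows, with made-up predicted gains; min_max_normalize is a hypothetical helper, and returning 0.5 when all values are equal is an arbitrary choice made only to avoid dividing by zero.

```python
# Min-max rescaling of predicted gains into the [0, 1] range.
# min_max_normalize is a hypothetical helper; the predicted gains are made up.

def min_max_normalize(values):
    """Map the minimum to 0.0 and the maximum to 1.0, linearly in between."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Arbitrary choice for a constant series, to avoid a division by zero
        return [0.5 for _ in values]
    span = float(hi - lo)
    return [(v - lo) / span for v in values]

predicted_gains = [-4.0, -1.0, 0.0, 2.0, 6.0]  # in index points
scores = min_max_normalize(predicted_gains)
print(scores)
```

Note that the resulting scores are rescaled predictions rather than true probabilities: the score that corresponds to a predicted gain of zero depends on the minimum and maximum of the predictions (here it happens to be 0.4), so a threshold on these scores has to be chosen with that in mind.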

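At this point, predictions is an SArray of predicted gains, whereas predictions_prob is an SArray with the predictions values normalized. To get a good precision and a reasonable number of round turns, comparable with the previous models, I have chosen a threshold value of 0.4. For a predictions_prob lower than 0.4, the backtest_linear_model helper function will not open a trade, because a Down day is expected. Otherwise, a trade will be opened.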

    THRESHOLD = 0.4
    bt_3_2 = backtest_linear_model(testing, predictions_prob, target='gain', threshold=THRESHOLD,
                                   STOP=-3, plot_title=model.name())
    backtest_summary(bt_3_2)

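Figure: LinearRegression chart. The Y axis is labeled "Dollars" and goes up to 45,000; the X axis is labeled "# of roundturns" and extends to 350, as produced by the Python code above. The curve is again similar, but not identical, to the previous renderings.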

    Mean of PnL is 138.868280
    Sharpe is 7.650187
    Round turns 319
    Name: LinearRegression
    Accuracy: 0.631989596879
    Precision: 0.705329153605
    Recall: 0.54347826087
    Max Drawdown: -1769.00025

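Training a Boosted Tree

As we previously trained a decision tree, we are now going to train a boosted tree classifier with the same parameters used for the other classifier models. In addition, we set max_iterations = 12 in order to increase the maximum number of iterations for boosting. Each iteration results in the creation of an extra tree. We also set a threshold higher than 0.5 to increase precision.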

    model = gl.boosted_trees_classifier.create(training, target='outcome', features=l_features,
                                               validation_set=None, max_iterations=12, verbose=False)
    predictions_prob = model.predict(testing, 'probability')
    THRESHOLD = 0.7
    bt_4_2 = backtest_ml_model(testing, predictions_prob, target='outcome', threshold=THRESHOLD,
                               STOP=-3, plot_title=model.name())
    backtest_summary(bt_4_2)

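Figure: BoostedTreesClassifier chart. The Y axis is labeled "Dollars" and goes up to 25,000; the X axis is labeled "# of roundturns" and extends to 250, as produced by the Python code above. The curve is again similar to the previous renderings, with a steeper increase at around 175 on the X axis.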

    Mean of PnL is 112.002338
    Sharpe is 6.341981
    Round turns 214
    Name: BoostedTreesClassifier
    Accuracy: 0.563068920676
    Precision: 0.682242990654
    Recall: 0.352657004831
    Max Drawdown: -1769.00025

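Training a Random Forest

This is the last model we train, a random forest classifier, composed of an ensemble of decision trees. The maximum number of trees to use in the model is set to num_trees = 10, to avoid too much complexity and overfitting.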

    model = gl.random_forest_classifier.create(training, target='outcome', features=l_features,
                                               validation_set=None, verbose=False, num_trees=10)
    predictions_prob = model.predict(testing, 'probability')
    THRESHOLD = 0.6
    bt_5_2 = backtest_ml_model(testing, predictions_prob, target='outcome', threshold=THRESHOLD,
                               STOP=-3, plot_title=model.name())
    backtest_summary(bt_5_2)

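Figure: RandomForestClassifier chart. The Y axis is labeled "Dollars" and goes up to 40,000; the X axis is labeled "# of roundturns" and extends to 350, as produced by the Python code above. The curve is again similar to the previous renderings.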

    Mean of PnL is 114.786962
    Sharpe is 6.384243
    Round turns 311
    Name: RandomForestClassifier
    Accuracy: 0.598179453836
    Precision: 0.668810289389
    Recall: 0.502415458937
    Max Drawdown: -1769.00025

Collecting All the Models Together

Now we can put all the strategies together and see the overall result. It's interesting to see the summary of all machine learning models, sorted by their precision.

Name	Accuracy	Precision	Round turns	Sharpe
LinearRegression	0.63	0.71	319	7.65
BoostedTreesClassifier	0.56	0.68	214	6.34
RandomForestClassifier	0.60	0.67	311	6.38
DecisionTree	0.56	0.66	234	6.52
LogisticClassifier	0.64	0.66	426	6.45

If we collect all the profit and loss for each one of the previous models in the array pnl, the following chart depicts the equity curve obtained by the sum of each profit and loss, day by day.

    Mean of PnL is 119.446463
    Sharpe is 6.685744
    Round turns 1504
    First trading day 2013-04-09
    Last trading day 2016-04-22
    Total return 179647

Just to give some numbers, with about 3 years of trading, all models together have a total gain of about 180,000 dollars. The maximum exposure is 5 CFD contracts in the market, but to reduce the risk they are all closed at the end of each day, so overnight positions are not allowed.

Statistics of the Aggregation of All Models Together

Since each model can open a trade, but we added 5 concurrent models together, during the same day there could be from 1 contract up to 5 CFD contracts. If all models agree to open trades during the same day, there is a high chance to have an Up day predicted. Moreover, we can group by the number of models that open a trade at the same time during the opening session of the day. Then we evaluate precision as a function of the number of concurrent models.

As we can see from the chart depicted above, precision improves as the number of models that agree to open a trade increases. For instance, with 5 models triggered on the same day, the chance to predict an Up day is greater than 85%.
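Grouping days by the number of models that open a trade, and measuring precision within each group, can be sketched as follows. This is an illustration on made-up signals, not the data behind the chart; precision_by_agreement is a hypothetical helper name.

```python
from collections import defaultdict

# Precision as a function of how many models open a trade on the same day.
# precision_by_agreement is a hypothetical helper; signals and outcomes are made up.

def precision_by_agreement(signals_per_day, outcomes):
    """signals_per_day: one list of 0/1 flags per day (one flag per model).
    outcomes: the real outcome of each day, +1 (Up) or -1 (Down).
    Returns {number of agreeing models: fraction of those days that were Up}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for flags, outcome in zip(signals_per_day, outcomes):
        agreeing = sum(flags)
        if agreeing == 0:
            continue  # no model opens a trade, so the day is not counted
        totals[agreeing] += 1
        if outcome == 1:
            hits[agreeing] += 1
    return {k: hits[k] / float(totals[k]) for k in totals}

signals = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
]
outcomes = [1, -1, 1, 1, -1, 1, -1]

table = precision_by_agreement(signals, outcomes)
for agreeing in sorted(table):
    print("%d models: precision %.2f" % (agreeing, table[agreeing]))
```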

Conclusion

Even in the financial world, Machine Learning is welcomed as a powerful instrument to learn from data and give us great forecasting tools. Each model shows different values of accuracy and precision, but in general, all models can be aggregated to achieve a better result than each one of them taken singularly. GraphLab Create is a great library, easy to use, scalable and able to manage Big Data very quickly. It implements different scientific and forecasting models, and there is a free license for students and Kaggle competitions.


Additional disclosure: This article has been prepared solely for information purposes, and is not an offer to buy or sell or a solicitation of an offer to buy or sell any security or instrument or to participate in any particular trading strategy. Examples presented on these sites are for educational purposes only. Past results are not necessarily indicative of future results.