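Making Money with Machine Learning
You have $1,000 to play with. You decide to invest in the stock market, specifically in Tesla.
Let's see whether we can use machine learning to optimize our returns.
Download TSLA.csv from here: https://www.kaggle.com/timoboz/tesla-stock-data-from-2010-to-2020/data
Let's get started.
Let's begin with some exploratory data analysis (EDA) on the file we downloaded.
The file is a comma-separated values (CSV) file with 7 columns.
The columns are:
- Date
- Open (opening price)
- High (the day's highest price)
- Low (the day's lowest price)
- Close (closing price)
- Adj Close (adjusted closing price, which accounts for splits and similar events)
- Volume (trading volume)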
# Importing pandas. "pandas is a fast, powerful, flexible and easy to use open source data
# analysis and manipulation tool, built on top of the Python programming language."
import pandas as pd
pd.options.display.max_rows = 30
# Read in the CSV, save it to a pandas dataframe variable called 'tsla_data'.
tsla_data = pd.read_csv("TSLA.csv")
# .head() gives us the first 5 rows of the data frame.
# You can also pass .head() a parameter to return any number of rows. Like .head(10) for 10 rows.
tsla_data.head()
| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2010-06-29 | 19.000000 | 25.00 | 17.540001 | 23.889999 | 23.889999 | 18766300 |
| 1 | 2010-06-30 | 25.790001 | 30.42 | 23.299999 | 23.830000 | 23.830000 | 17187100 |
| 2 | 2010-07-01 | 25.000000 | 25.92 | 20.270000 | 21.959999 | 21.959999 | 8218800 |
| 3 | 2010-07-02 | 23.000000 | 23.10 | 18.709999 | 19.200001 | 19.200001 | 5139800 |
| 4 | 2010-07-06 | 20.000000 | 20.00 | 15.830000 | 16.110001 | 16.110001 | 6866900 |
# .shape tells us the number of rows, and the number of columns.
# This dataset has 2416 rows, and 7 columns.
# The NYSE and NASDAQ average about 253 trading days a year.
# This is from 365.25 (days on average per year) * 5/7 (proportion work days per week)
# - 6 (weekday holidays) - 3*5/7 (fixed date holidays) = 252.75 ≈ 253.
# 10 * 253 = 2530, this dataset is pretty close. Let's assume it's not missing any days.
tsla_data.shape
(2416, 7)
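The trading-day estimate in the comments above is plain arithmetic, so it can be checked directly. This is a minimal sketch that uses the same holiday counts the comments assume (6 weekday holidays and 3 fixed-date holidays); the variable names are only illustrative.

```python
# Rough estimate of trading days per year, following the comment above.
days_per_year = 365.25        # average calendar days per year, counting leap years
weekday_share = 5 / 7         # proportion of days that are weekdays
weekday_holidays = 6          # holidays that always fall on a weekday (assumed count)
fixed_date_holidays = 3       # fixed-date holidays, which hit a weekday 5/7 of the time

trading_days = (days_per_year * weekday_share
                - weekday_holidays
                - fixed_date_holidays * weekday_share)

# Ten full years of data would give roughly this many rows.
expected_rows = 10 * round(trading_days)
print(trading_days, expected_rows)
```

The file has 2416 rows rather than the full ten-year estimate partly because it starts on 2010-06-29, so 2010 is not a complete year.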
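This is 10 years of data, with stock information starting in 2010.
To save time, let's make a few assumptions; we're not hedge fund managers yet.
Assumptions
- We can place only one order per day (a buy or a sell), and each order is for everything we hold.
- If we place an order, we assume it fills at that price.
- We start with $1,000.
We'll track a few key pieces of information.
- Money in wallet
- Number of stocks held
Let's start with 2010 and see how much money we would make if we began with $1,000 on the first day in this file.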
# We're going to pull just the 2010 data. I like keeping the years in a variable, as a list,
# because we'll likely do this again, possibly for multiple years.
years_to_pull = [2010]
# Let's tell pandas to treat the 'Date' column as a date.
tsla_data['Date'] = pd.to_datetime(tsla_data['Date'])
# Let's make a function for re-use
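def pull_data_by_year(tsla_data, years_to_pull):
    # .copy() returns an independent data frame, so adding columns to it later
    # does not trigger pandas' SettingWithCopyWarning.
    tsla_data_by_year = tsla_data[tsla_data['Date'].dt.year.isin(years_to_pull)].copy()
    return tsla_data_by_year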
tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)
tsla_data_by_year.shape
(130, 7)
# Sort by date ASC
tsla_data_by_year = tsla_data_by_year.sort_values(by = 'Date')
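Let's add a few columns to help us work with the data. I want to see tomorrow's adjusted close and know whether it is higher than today's adjusted close.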
# .shift(-1) brings the next row into the equation, so that we can add a column that
# shows tomorrow's adjusted close.
tsla_data_by_year["Adj Close Tomorrow"] = tsla_data_by_year["Adj Close"].shift(-1)
# This adds another column as a bool to quickly show if the stock goes up or down tomorrow.
tsla_data_by_year["Stock Goes Up Tomorrow"] = tsla_data_by_year["Adj Close"] < tsla_data_by_year["Adj Close Tomorrow"]
# Let's look at the first 10 rows to see if this looks correct.
tsla_data_by_year.head(10)
| | Date | Open | High | Low | Close | Adj Close | Volume | Adj Close Tomorrow | Stock Goes Up Tomorrow |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-06-29 | 19.000000 | 25.000000 | 17.540001 | 23.889999 | 23.889999 | 18766300 | 23.830000 | False |
| 1 | 2010-06-30 | 25.790001 | 30.420000 | 23.299999 | 23.830000 | 23.830000 | 17187100 | 21.959999 | False |
| 2 | 2010-07-01 | 25.000000 | 25.920000 | 20.270000 | 21.959999 | 21.959999 | 8218800 | 19.200001 | False |
| 3 | 2010-07-02 | 23.000000 | 23.100000 | 18.709999 | 19.200001 | 19.200001 | 5139800 | 16.110001 | False |
| 4 | 2010-07-06 | 20.000000 | 20.000000 | 15.830000 | 16.110001 | 16.110001 | 6866900 | 15.800000 | False |
| 5 | 2010-07-07 | 16.400000 | 16.629999 | 14.980000 | 15.800000 | 15.800000 | 6921700 | 17.459999 | True |
| 6 | 2010-07-08 | 16.139999 | 17.520000 | 15.570000 | 17.459999 | 17.459999 | 7711400 | 17.400000 | False |
| 7 | 2010-07-09 | 17.580000 | 17.900000 | 16.549999 | 17.400000 | 17.400000 | 4050600 | 17.049999 | False |
| 8 | 2010-07-12 | 17.950001 | 18.070000 | 17.000000 | 17.049999 | 17.049999 | 2202500 | 18.139999 | True |
| 9 | 2010-07-13 | 17.389999 | 18.639999 | 16.900000 | 18.139999 | 18.139999 | 2680100 | 19.840000 | True |
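To see what `.shift(-1)` does in isolation, here is a small sketch on made-up prices (not taken from TSLA.csv). It also shows why the last row always ends up with a "does not go up" label: there is no next day to compare against.

```python
import pandas as pd

# Toy prices, invented for illustration.
toy = pd.DataFrame({"Adj Close": [10.0, 12.0, 11.0, 13.0]})

# shift(-1) moves every value up one row, so row i sees the value from row i + 1.
toy["Adj Close Tomorrow"] = toy["Adj Close"].shift(-1)

# The last row has no "tomorrow", so it holds NaN, and any comparison with NaN is False.
toy["Stock Goes Up Tomorrow"] = toy["Adj Close"] < toy["Adj Close Tomorrow"]

print(toy)
```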
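Following the buy-low, sell-high rule and looking at the historical data, we can work out the following.
Start on the first day whose next-day close is higher, and buy $1,000 worth of stock that day.
We will have 3 positions.
Buy
Sell
Hold
In code:
haveNoStock && !goesUpTomorrow = Hold
haveNoStock && goesUpTomorrow = Buy
haveStock && !goesUpTomorrow = Sell
haveStock && goesUpTomorrow = Hold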
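The four rules above can be written as one small function. This is a sketch only; the name `decide_position` and its arguments are hypothetical and do not appear in the notebook.

```python
def decide_position(have_stock: bool, goes_up_tomorrow: bool) -> str:
    """Return 'Buy', 'Sell', or 'Hold' for one day under the four rules above."""
    if not have_stock and goes_up_tomorrow:
        return "Buy"    # nothing held and the price rises tomorrow
    if have_stock and not goes_up_tomorrow:
        return "Sell"   # holding, and the price does not rise tomorrow
    return "Hold"       # the two remaining cases: stay out, or keep holding


# Print the full decision table.
for have_stock in (False, True):
    for goes_up in (False, True):
        print(have_stock, goes_up, decide_position(have_stock, goes_up))
```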
# Setting some default values of the new columns.
# Position can be Hold/Sell/Buy
tsla_data_by_year['Position'] = 'Hold'
tsla_data_by_year['Number Of Stocks Held'] = 0
tsla_data_by_year['Money In Wallet'] = 0
# .at says at row 0, column 'Money in Wallet', save $1000
tsla_data_by_year.at[0, 'Money In Wallet'] = 1000
tsla_data_by_year.head()
| | Date | Open | High | Low | Close | Adj Close | Volume | Adj Close Tomorrow | Stock Goes Up Tomorrow | Position | Number Of Stocks Held | Money In Wallet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-06-29 | 19.000000 | 25.00 | 17.540001 | 23.889999 | 23.889999 | 18766300 | 23.830000 | False | Hold | 0 | 1000 |
| 1 | 2010-06-30 | 25.790001 | 30.42 | 23.299999 | 23.830000 | 23.830000 | 17187100 | 21.959999 | False | Hold | 0 | 0 |
| 2 | 2010-07-01 | 25.000000 | 25.92 | 20.270000 | 21.959999 | 21.959999 | 8218800 | 19.200001 | False | Hold | 0 | 0 |
| 3 | 2010-07-02 | 23.000000 | 23.10 | 18.709999 | 19.200001 | 19.200001 | 5139800 | 16.110001 | False | Hold | 0 | 0 |
| 4 | 2010-07-06 | 20.000000 | 20.00 | 15.830000 | 16.110001 | 16.110001 | 6866900 | 15.800000 | False | Hold | 0 | 0 |
# Here's my code for determining if I should buy/sell/hold.
# We'll put this in a function down the line.
previousRow = ''
for index, row in tsla_data_by_year.iterrows():
    # Carry yesterday's wallet and holdings forward into today.
    if index > 0:
        row['Money In Wallet'] = previousRow['Money In Wallet']
        row['Number Of Stocks Held'] = previousRow['Number Of Stocks Held']
    if row['Number Of Stocks Held'] == 0 and not row['Stock Goes Up Tomorrow']:
        row['Position'] = 'Hold'
    elif row['Number Of Stocks Held'] == 0 and row['Stock Goes Up Tomorrow']:
        row['Position'] = 'Buy'
        row['Number Of Stocks Held'] = row['Money In Wallet'] / row['Adj Close']
        row['Money In Wallet'] -= row['Number Of Stocks Held'] * row['Adj Close']
    elif row['Number Of Stocks Held'] > 0 and not row['Stock Goes Up Tomorrow']:
        row['Position'] = 'Sell'
        row['Money In Wallet'] += row['Number Of Stocks Held'] * row['Adj Close']
        row['Number Of Stocks Held'] = 0
    elif row['Number Of Stocks Held'] > 0 and row['Stock Goes Up Tomorrow']:
        row['Position'] = 'Hold'
    previousRow = row
    # Write the updated row back into the data frame. .loc sets a whole row;
    # .at only accepts a single (row, column) cell.
    tsla_data_by_year.loc[index] = row
# Round each number to 2 decimal places.
tsla_data_by_year = tsla_data_by_year.round(2)
# Let's look at the last row to see how much money or stock we have at the end of the year.
tsla_data_by_year.tail(1)
| | Date | Open | High | Low | Close | Adj Close | Volume | Adj Close Tomorrow | Stock Goes Up Tomorrow | Position | Number Of Stocks Held | Money In Wallet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 129 | 2010-12-31 | 26.57 | 27.25 | 26.5 | 26.63 | 26.63 | 1417900 | NaN | False | Sell | 0.0 | 8645.73 |
If we could see the future, we would have $8,645 by the end of 2010.
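The perfect-foresight result can also be computed without tracking a wallet row by row. Under the same assumptions as the loop above (fractional shares, every order fills at the adjusted close, no fees), final wealth is the starting cash multiplied by the price ratio of every day-to-day rise. The sketch below uses made-up prices, and the function name is hypothetical.

```python
def perfect_foresight_wealth(prices, start_cash=1000.0):
    """Final wealth when we are invested on exactly the days before a rise."""
    wealth = start_cash
    for today, tomorrow in zip(prices, prices[1:]):
        if tomorrow > today:
            wealth *= tomorrow / today   # we held overnight, so we capture the gain
        # otherwise we sit in cash and wealth is unchanged
    return wealth


toy_prices = [10.0, 12.0, 11.0, 13.0]   # invented prices, not TSLA data
print(round(perfect_foresight_wealth(toy_prices), 2))
```

On the toy prices the strategy holds through the 10 to 12 rise, sits out the drop to 11, and holds again through the rise to 13.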
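Now let's tackle this with machine learning. The key thing to remember is that this is a science and this is an experiment, so we need to follow the scientific method.
Our hypothesis: we can predict whether the stock will go up or down tomorrow.
When we try to predict one of two possible outcomes, it is called binary classification.
Our goal is to predict, for a given row, the 'Stock Goes Up Tomorrow' column, but all we have is today's high, low, and prices. That doesn't give us much information; we need to analyze the trend in historical prices. For that, we use technical indicators.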
## Add bollinger bands
## To learn more about bollinger bands: https://www.investopedia.com/terms/b/bollingerbands.asp
import matplotlib.pyplot as plt
tsla_data_by_year['30 Day MA'] = tsla_data_by_year['Adj Close'].rolling(window=30).mean()
tsla_data_by_year['30 Day STD'] = tsla_data_by_year['Adj Close'].rolling(window=30).std()
tsla_data_by_year['Upper Band'] = tsla_data_by_year['30 Day MA'] + (tsla_data_by_year['30 Day STD'] * 2)
tsla_data_by_year['Lower Band'] = tsla_data_by_year['30 Day MA'] - (tsla_data_by_year['30 Day STD'] * 2)
# Simple 30 Day Bollinger Band for Tesla
tsla_data_by_year[['Adj Close', '30 Day MA', 'Upper Band', 'Lower Band']].plot(figsize=(12,6))
plt.title('30 Day Bollinger Band for Tesla')
plt.ylabel('Price (USD)')
plt.show()
# This plot will show us the adjusted close, the rolling average, and the upper
# and lower bands of the TSLA stock.
# Since we used a 30 day moving average, the first 29 rows do not have
# Bollinger Band information.
# We use dropna() to drop the rows with nulls. This also drops the last row,
# whose 'Adj Close Tomorrow' is NaN.
tsla_data_by_year = tsla_data_by_year.dropna()
tsla_data_by_year.head()
| | Date | Open | High | Low | Close | Adj Close | Volume | Adj Close Tomorrow | Stock Goes Up Tomorrow | Position | Number Of Stocks Held | Money In Wallet | 30 Day MA | 30 Day STD | Upper Band | Lower Band |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | 2010-08-10 | 19.65 | 19.65 | 18.82 | 19.03 | 19.03 | 1281300 | 17.90 | False | Hold | 0.00 | 1660.38 | 20.041333 | 1.937226 | 23.915786 | 16.166880 |
| 30 | 2010-08-11 | 18.69 | 18.88 | 17.85 | 17.90 | 17.90 | 797600 | 17.60 | False | Hold | 0.00 | 1660.38 | 19.841667 | 1.832744 | 23.507156 | 16.176178 |
| 31 | 2010-08-12 | 17.80 | 17.90 | 17.39 | 17.60 | 17.60 | 691000 | 18.32 | True | Buy | 94.34 | 0.00 | 19.634000 | 1.714383 | 23.062765 | 16.205235 |
| 32 | 2010-08-13 | 18.18 | 18.45 | 17.66 | 18.32 | 18.32 | 634000 | 18.78 | True | Hold | 94.34 | 0.00 | 19.512667 | 1.672380 | 22.857427 | 16.167907 |
| 33 | 2010-08-16 | 18.45 | 18.80 | 18.26 | 18.78 | 18.78 | 485800 | 19.15 | True | Hold | 94.34 | 0.00 | 19.498667 | 1.676840 | 22.852346 | 16.144987 |
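The band calculation is easier to follow on a tiny series. The sketch below uses a 3-row window on made-up prices instead of the 30-day window above, and the helper name `add_bands` is hypothetical. It shows why the first `window - 1` rows have no band values and disappear after `dropna()`.

```python
import pandas as pd


def add_bands(prices: pd.Series, window: int, num_std: float = 2.0) -> pd.DataFrame:
    """Return the moving average and the upper/lower bands for a price series."""
    ma = prices.rolling(window=window).mean()
    std = prices.rolling(window=window).std()   # pandas uses the sample std (ddof=1)
    return pd.DataFrame({
        "Price": prices,
        "MA": ma,
        "Upper Band": ma + num_std * std,
        "Lower Band": ma - num_std * std,
    })


toy = add_bands(pd.Series([10.0, 11.0, 12.0, 11.0, 13.0]), window=3)
print(toy)
print(toy.dropna().shape)   # rows without a full window are dropped
```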
# Some fantastical python.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
import numpy as np
from matplotlib import pyplot, dates
# Here we are saying we want to predict the column 'Stock Goes Up Tomorrow' by
# storing the column name in a variable.
predict = 'Stock Goes Up Tomorrow'
X = tsla_data_by_year
# Treat the date as a number (for example, 2010-09-30 becomes 20100930).
X['Date'] = X['Date'].dt.strftime('%Y%m%d').astype(int)
# For each non-numeric column, apply a LabelEncoder. The model needs numerical
# values, so categorical values have to be encoded as numbers.
# With columns like 'Position', we need to apply a LabelEncoding
# to set 1 = Hold, 2 = Buy, 3 = Sell
# (this is an example, the LabelEncoder will determine
# the numerical values of the categories at runtime.)
for column in X.columns:
    if column != 'Date':
        if X[column].dtype == object:
            le = LabelEncoder()
            X[column] = le.fit_transform(X[column])
# Set the y dataset to just the single column we want to predict.
y = tsla_data_by_year[predict]
# Set the X dataset (what we will use to predict), to all the columns mentioned.
X = tsla_data_by_year[['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', '30 Day MA', '30 Day STD', 'Upper Band', 'Lower Band']]
# This is used to stratify. Learn more here: https://en.wikipedia.org/wiki/Stratified_sampling
targets = tsla_data_by_year[predict]
# This splits the dataset into training and testing. 60% of the data will be
# used to train, 40% will be used to test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=101, stratify=targets)
X_test.head()
| | Date | Open | High | Low | Close | Adj Close | Volume | 30 Day MA | 30 Day STD | Upper Band | Lower Band |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 65 | 20100930 | 22.00 | 22.15 | 20.19 | 20.41 | 20.41 | 2195800 | 20.382333 | 0.786847 | 21.956028 | 18.808639 |
| 106 | 20101129 | 35.41 | 35.95 | 33.33 | 34.33 | 34.33 | 1145600 | 26.106333 | 5.340324 | 36.786982 | 15.425684 |
| 57 | 20100920 | 20.67 | 21.35 | 20.16 | 21.06 | 21.06 | 947500 | 19.866667 | 1.058318 | 21.983302 | 17.750031 |
| 99 | 20101117 | 30.20 | 30.75 | 28.61 | 29.49 | 29.49 | 750000 | 23.079667 | 3.586195 | 30.252057 | 15.907276 |
| 90 | 20101104 | 22.60 | 25.33 | 22.15 | 24.90 | 24.90 | 1874000 | 20.957667 | 0.907603 | 22.772872 | 19.142461 |
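To see what the `stratify` argument does, here is a small sketch on made-up labels (60 "down" days and 40 "up" days, chosen only for illustration). With stratification, the test set keeps the same 60/40 class mix as the full data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Invented labels: 60 days where the stock does not go up, 40 where it does.
labels = [False] * 60 + [True] * 40
features = [[i] for i in range(len(labels))]   # placeholder single-column features

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.40, random_state=101, stratify=labels)

print(Counter(y_train))   # class counts in the 60-row training set
print(Counter(y_test))    # class counts in the 40-row test set
```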
from sklearn import ensemble
# Hyperparameters for the GradientBoostingRegressor algorithm. More on this much later.
# 'squared_error' is the least-squares loss; scikit-learn versions before 1.0 call it 'ls'.
params = {'n_estimators': 100, 'max_depth': 7, 'min_samples_split': 3,
          'learning_rate': 0.01, 'loss': 'squared_error'}
clf = ensemble.GradientBoostingRegressor(**params)
# Fit the classifier with the training data.
clf.fit(X_train, y_train)
# Use the trained model to predict the testing dataset.
y_pred_original = clf.predict(X_test)
from sklearn.metrics import (confusion_matrix, precision_score, recall_score, f1_score, classification_report)
y_pred = y_pred_original > .5
y_pred = pd.DataFrame(y_pred)
y_pred = y_pred * 1
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# The confusion matrix will show us True/False Positives, True/False Negatives.
# This dataset is really small to get an accurate reading of the score,
# but so far it looks like we're close to 50% accurate.
[[12  7]
 [14  7]]
              precision    recall  f1-score   support

       False       0.46      0.63      0.53        19
        True       0.50      0.33      0.40        21

    accuracy                           0.48        40
   macro avg       0.48      0.48      0.47        40
weighted avg       0.48      0.47      0.46        40
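The per-class numbers in the report follow directly from the four counts in the confusion matrix. A short worked sketch, reading the matrix as scikit-learn lays it out (rows are actual classes, columns are predicted classes, in the order False, True):

```python
# Counts taken from the confusion matrix printed above.
tn, fp = 12, 7    # actual False: predicted False, predicted True
fn, tp = 14, 7    # actual True:  predicted False, predicted True

total = tn + fp + fn + tp
accuracy = (tn + tp) / total          # share of all predictions that were right

precision_true = tp / (tp + fp)       # of the days predicted "up", how many went up
recall_true = tp / (tp + fn)          # of the days that went up, how many we caught
precision_false = tn / (tn + fn)      # of the days predicted "not up", how many were right
recall_false = tn / (tn + fp)         # of the days that did not go up, how many we caught

print(total, accuracy)
print(precision_true, recall_true)
print(precision_false, recall_false)
```

These work out to roughly 0.50 and 0.33 for the True class and 0.46 and 0.63 for the False class, matching the report, with an overall accuracy of 19 correct out of 40.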
We did slightly worse than a coin flip! Let's see what's going on.
predictions = clf.predict(X)
# The values of predictions are stored as a value from 0.00 to 1.00, but we need them
# as a true/false to work with our algorithm to calculate $$. Here I compare to .5 (threshold)
# to determine if the prediction is true or false.
# You can manually adjust the threshold to get a better True Positive / True Negative rate,
# which can be beneficial when you're trying to reduce a particular metric.
predictions = predictions > .5
X['Stock Goes Up Tomorrow'] = predictions
# Same code as above, functionized. Use a dataset to determine
# how much money we'll have made with our trades.
def howMuchMoneyDidWeMake(X):
    # Only initialise the tracking columns if they are not there yet.
    if 'Money In Wallet' not in X:
        X['Position'] = 'Hold'
        X['Number Of Stocks Held'] = 0
        X['Money In Wallet'] = 0
        X.at[0, 'Money In Wallet'] = 1000
    previousRow = ''
    for index, row in X.iterrows():
        # Carry yesterday's wallet and holdings forward into today.
        if index > 0:
            row['Money In Wallet'] = previousRow['Money In Wallet']
            row['Number Of Stocks Held'] = previousRow['Number Of Stocks Held']
        if row['Number Of Stocks Held'] == 0 and not row['Stock Goes Up Tomorrow']:
            row['Position'] = 'Hold'
        elif row['Number Of Stocks Held'] == 0 and row['Stock Goes Up Tomorrow']:
            row['Position'] = 'Buy'
            row['Number Of Stocks Held'] = row['Money In Wallet'] / row['Adj Close']
            row['Money In Wallet'] -= row['Number Of Stocks Held'] * row['Adj Close']
        elif row['Number Of Stocks Held'] > 0 and not row['Stock Goes Up Tomorrow']:
            row['Position'] = 'Sell'
            row['Money In Wallet'] += row['Number Of Stocks Held'] * row['Adj Close']
            row['Number Of Stocks Held'] = 0
        elif row['Number Of Stocks Held'] > 0 and row['Stock Goes Up Tomorrow']:
            row['Position'] = 'Hold'
        previousRow = row
        # Write the updated row back. .loc sets a whole row;
        # .at only accepts a single (row, column) cell.
        X.loc[index] = row
    X = X.round(2)
    return X
X = X.reset_index()
X = howMuchMoneyDidWeMake(X)
X.tail(1)
| | index | Date | Open | High | Low | Close | Adj Close | Volume | 30 Day MA | 30 Day STD | Upper Band | Lower Band | Stock Goes Up Tomorrow | Position | Number Of Stocks Held | Money In Wallet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99 | 128 | 20101230 | 27.7 | 27.9 | 26.38 | 26.5 | 26.5 | 2041100 | 31.28 | 2.63 | 36.55 | 26.02 | False | Hold | 0.0 | 2917.96 |
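Not bad! We almost tripled our money and ended up with $2,917.
But there is a common beginner's mistake here. We measured success on the same data we used to train the algorithm. We need a different dataset to test on. Let's try again with the next year.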
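The effect of scoring a model on its own training data is easy to reproduce on pure noise. The sketch below swaps in a plain `DecisionTreeClassifier` rather than the gradient boosting model used above, purely to keep the example short; the data is random and unrelated to TSLA.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Pure noise: the features carry no information about the labels.
X_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 2, size=200)
X_new = rng.normal(size=(1000, 5))
y_new = rng.integers(0, 2, size=1000)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)   # scored on rows the model has seen
new_accuracy = model.score(X_new, y_new)         # scored on rows it has never seen
print(train_accuracy, new_accuracy)
```

The tree memorizes its training rows, so the first score looks perfect even though there is nothing to learn, while the second score stays near a coin flip.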
# Let's pull 2011 data.
years_to_pull = [2011]
tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)
def addBollingerBands(df):
    df['30 Day MA'] = df['Adj Close'].rolling(window=30).mean()
    df['30 Day STD'] = df['Adj Close'].rolling(window=30).std()
    df['Upper Band'] = df['30 Day MA'] + (df['30 Day STD'] * 2)
    df['Lower Band'] = df['30 Day MA'] - (df['30 Day STD'] * 2)
    df = df.dropna()
    return df
tsla_data_by_year = addBollingerBands(tsla_data_by_year)
tsla_data_by_year.head()
| | Date | Open | High | Low | Close | Adj Close | Volume | 30 Day MA | 30 Day STD | Upper Band | Lower Band |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 159 | 2011-02-14 | 23.639999 | 24.139999 | 23.049999 | 23.08 | 23.08 | 1283100 | 24.937333 | 1.720240 | 28.377814 | 21.496853 |
| 160 | 2011-02-15 | 23.010000 | 23.170000 | 22.559999 | 22.84 | 22.84 | 953700 | 24.811333 | 1.731142 | 28.273617 | 21.349049 |
| 161 | 2011-02-16 | 23.100000 | 24.969999 | 23.070000 | 24.73 | 24.73 | 4115100 | 24.746667 | 1.695178 | 28.137023 | 21.356310 |
| 162 | 2011-02-17 | 24.629999 | 25.490000 | 23.549999 | 23.60 | 23.60 | 2618400 | 24.639000 | 1.660516 | 27.960031 | 21.317969 |
| 163 | 2011-02-18 | 23.330000 | 23.490000 | 22.959999 | 23.18 | 23.18 | 2370700 | 24.482333 | 1.563047 | 27.608427 | 21.356240 |
# Function to add the predicted column to a dataset using a trained model.
def addPredictedColumn(df, clf):
    # Treat the date as a number (for example, 2011-02-14 becomes 20110214).
    df['Date'] = df['Date'].dt.strftime('%Y%m%d').astype(int)
    # df["Adj Close Tomorrow"] = df["Adj Close"].shift(-1)
    df = df.dropna()
    for column in df.columns:
        if column != 'Date':
            if df[column].dtype == object:
                le = LabelEncoder()
                df[column] = le.fit_transform(df[column])
    # The model returns a value between 0 and 1; compare it to the .5 threshold.
    predictions = clf.predict(df)
    predictions = predictions > .5
    df['Stock Goes Up Tomorrow'] = predictions
    return df
tsla_data_by_year = addPredictedColumn(tsla_data_by_year, clf)
tsla_data_by_year = tsla_data_by_year.reset_index()
tsla_data_by_year = howMuchMoneyDidWeMake(tsla_data_by_year)
tsla_data_by_year.tail(1)
| | index | Date | Open | High | Low | Close | Adj Close | Volume | 30 Day MA | 30 Day STD | Upper Band | Lower Band | Stock Goes Up Tomorrow | Position | Number Of Stocks Held | Money In Wallet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 222 | 381 | 20111230 | 28.49 | 28.98 | 28.25 | 28.56 | 28.56 | 339800 | 30.66 | 2.33 | 35.33 | 25.99 | False | Hold | 0.0 | 1652.3 |
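If we use the model trained on 2010 to predict the 2011 data, we end up with about $1,652. That's poor. I think the problem is that we never take the new 2011 TSLA data into account; we only use 2010 data to predict 2011. That simply doesn't work.
What if we retrained the model every 30 days during 2011? That way, every 30 days the classifier is "reset" with whatever new patterns it has found.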
# Note: existing_df starts out as the full 2011 frame from the previous step,
# so every retrained model below also sees later 2011 rows.
existing_df = tsla_data_by_year
years_to_pull = [2011]
tsla_data_by_year = pull_data_by_year(tsla_data, years_to_pull)
tsla_data_by_year = addBollingerBands(tsla_data_by_year)
tsla_data_by_year = tsla_data_by_year.reset_index()
# The first 30 rows are predicted with the model trained on 2010.
working_df = tsla_data_by_year[:30]
working_df = working_df.drop(columns=['index'])
working_df = addPredictedColumn(working_df, clf)
working_df = howMuchMoneyDidWeMake(working_df)
all_the_money = pd.concat([working_df], sort=True)
for i in range(1, 8):
    # Remember the last row so the wallet carries over into the next window.
    new_first_row = working_df[-1:]
    for column in working_df.columns:
        if column != 'Date':
            if working_df[column].dtype == object:
                le = LabelEncoder()
                working_df[column] = le.fit_transform(working_df[column])
    existing_df = pd.concat([existing_df, working_df], sort=True)
    # Next window of (up to) 30 rows.
    working_df = tsla_data_by_year[30*i:30*i+30]
    existing_df["Adj Close Tomorrow"] = existing_df["Adj Close"].shift(-1)
    existing_df = existing_df.dropna()
    existing_df["Stock Goes Up Tomorrow"] = existing_df["Adj Close"] < existing_df["Adj Close Tomorrow"]
    predict = 'Stock Goes Up Tomorrow'
    X = existing_df
    for column in X.columns:
        if column != 'Date':
            if X[column].dtype == object:
                le = LabelEncoder()
                X[column] = le.fit_transform(X[column])
    y = existing_df[predict]
    X = existing_df[['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
                     '30 Day MA', '30 Day STD', 'Upper Band', 'Lower Band']]
    targets = existing_df[predict]
    # 'squared_error' is the least-squares loss; scikit-learn versions before 1.0 call it 'ls'.
    params = {'n_estimators': 100, 'max_depth': 7, 'min_samples_split': 3,
              'learning_rate': 0.01, 'loss': 'squared_error'}
    # Retrain from scratch on everything collected so far.
    clf = ensemble.GradientBoostingRegressor(**params)
    clf.fit(X, y)
    working_df = working_df.drop(columns=['index'])
    working_df = addPredictedColumn(working_df, clf)
    working_df = pd.concat([new_first_row, working_df], sort=True)
    working_df = working_df.reset_index(drop=True)
    working_df = howMuchMoneyDidWeMake(working_df)
    all_the_money = pd.concat([all_the_money, working_df], sort=True)
all_the_money.tail(1)
| | 30 Day MA | 30 Day STD | Adj Close | Close | Date | High | Low | Lower Band | Money In Wallet | Number Of Stocks Held | Open | Position | Stock Goes Up Tomorrow | Upper Band | Volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 30.66 | 2.33 | 28.56 | 28.56 | 20111230 | 28.98 | 28.25 | 25.99 | 0.0 | 322.57 | 28.49 | Buy | True | 35.33 | 339800 |
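The loop above walks through the 2011 rows in windows of 30. The sketch below only lists those windows; the helper name is hypothetical, and the row count of 223 is read off the earlier 2011 output, whose last row index is 222.

```python
def walk_forward_windows(n_rows, window=30):
    """Yield (start, stop) row ranges, each at most `window` rows long."""
    for start in range(0, n_rows, window):
        yield start, min(start + window, n_rows)


# 223 rows of 2011 remain after the 30-day Bollinger Band warm-up.
windows = list(walk_forward_windows(223))
for number, (start, stop) in enumerate(windows):
    print(f"window {number}: rows {start} to {stop - 1} ({stop - start} rows)")
```

There are eight windows in total: the first is predicted with the 2010 model, and the seven loop iterations handle the rest. The final window has only 13 rows, which together with the carried-over row explains why the last index in the output is 13.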
Retraining the model every 30 days in 2011, we end up with 322 shares at $28.56 each, for a total of $9,196.
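The final figure is the share count multiplied by the last adjusted close. A short check using the values from the last output row; the variable names are illustrative:

```python
shares_held = 322.57      # 'Number Of Stocks Held' in the last row
adj_close = 28.56         # 'Adj Close' on the last day
money_in_wallet = 0.0     # 'Money In Wallet' in the last row

# Value of the 322 whole shares quoted in the text.
value_whole_shares = int(shares_held) * adj_close

# Value if the fractional 0.57 of a share is counted as well.
value_all_shares = money_in_wallet + shares_held * adj_close

print(round(value_whole_shares, 2), round(value_all_shares, 2))
```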
That's great!
Calling my broker right away.