Sunday, April 28, 2019

Training on large data

There are 3 ways to deal with large data files:


1. AWS: Use a large machine to process data stored on S3. Only H2O models can be saved to binary files and uploaded back to the S3 bucket; a minimal sketch follows below.
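As a rough illustration of this workflow (my own sketch, not from the post): train on data read from S3, save the model as a binary file, and push it back to S3. It assumes h2o and boto3 are installed and AWS credentials are configured; the bucket, paths, and column name are placeholders.

# Minimal sketch; bucket, paths, and "target" column are hypothetical.
import h2o
import boto3
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("s3://my-bucket/train.csv")  # read data directly from S3
model = H2OGradientBoostingEstimator(ntrees=50)
model.train(y="target", training_frame=train)

# h2o.save_model writes the model as a binary file and returns its path
local_path = h2o.save_model(model=model, path="/tmp/models", force=True)
boto3.client("s3").upload_file(local_path, "my-bucket", "models/gbm_model")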


2. Read chunks of data one at a time and train incrementally using the checkpoint feature of H2O. With H2O, a checkpointed GBM must add at least one extra tree every iteration, and Deep Learning must add extra epochs; see the sketch below.
Reading chunk by chunk: https://www.youtube.com/watch?v=Z5rMrI1e4kM
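A minimal sketch of that chunked workflow (file and column names are placeholders): each round reads the next chunk and grows the GBM by ten more trees, since a checkpointed model must be given a larger ntrees to have anything new to build. For Deep Learning the same pattern applies with a growing epochs value.

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
model_id = None
ntrees = 0
for path in ["chunk1.csv", "chunk2.csv", "chunk3.csv"]:
    chunk = h2o.import_file(path)
    ntrees += 10  # checkpointed GBMs need extra trees each round
    gbm = H2OGradientBoostingEstimator(ntrees=ntrees, checkpoint=model_id)
    gbm.train(y="target", training_frame=chunk)
    model_id = gbm.model_id  # continue from this model on the next chunk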


3. Vowpal Wabbit, which offers online learning.
Installation on an Ubuntu machine, as described on GitHub: https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial
Commands:
git clone git://github.com/JohnLangford/vowpal_wabbit.git
OR
git clone https://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit
sudo apt-get install libboost-program-options-dev libboost-python-dev
sudo apt-get install zlib1g-dev
sudo apt-get install libboost1.48-all-dev


make vw
make library_example
make test
VW requires input files in its own special format. To convert a CSV file to that format: https://www.youtube.com/watch?v=ee6T9ytzjyU&t=1s (a small converter sketch follows below).
https://www.auduno.com/2014/08/29/some-nice-ml-libraries/
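For reference, each line in the VW format looks like "label |namespace feature:value ...". Below is a minimal converter sketch (my own, not the video's script); it assumes the first CSV column holds a numeric label and the remaining columns hold numeric features.

import csv

with open("data.csv") as src, open("data.vw", "w") as dst:
    reader = csv.reader(src)
    header = next(reader)
    for row in reader:
        label, values = row[0], row[1:]
        feats = " ".join(f"{n}:{v}" for n, v in zip(header[1:], values))
        dst.write(f"{label} |f {feats}\n")  # e.g. "1 |f height:1.5 width:2.0"

The resulting file can then be trained online with something like: vw data.vw -f model.vw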


Useful links

http://www.zinkov.com/posts/2013-08-13-vowpal-tutorial/
