Алгоритм Swivel и его приложения

Константин Славнов, source{d}

Категориальные признаки

Любые признаки без порядка. Как с ними работать?

🗺Города, страны
📖Тексты
🕸Графы
•...

Кодируем!

One-hot-encoding
Хэширование
Кодирование вещественным признаком

One-hot-encoding

🍒 ⇒ $[\;1,\; 0,\; 0\;]$
🍑 ⇒ $[\;0,\; 1,\; 0\;]$
🍏 ⇒ $[\;0,\; 0,\; 1\;]$

One-hot-encoding

🍒 ⇒ $[\;1,\; 0,\; 0\;]$
🍑 ⇒ $[\;0,\; 1,\; 0\;]$
🍏 ⇒ $[\;0,\; 0,\; 1\;]$

$$ \begin{bmatrix} 🍒 & 2 & 0.5 \\ 🍑 & 1 & 0.75 \\ 🍑 & 3 & 0.9 \\ \vdots & \vdots & \vdots \\ 🍏 & 5 & 0.2 \\ \end{bmatrix} \quad \Longrightarrow \quad \begin{bmatrix} 1 & 0 & 0 & 2 & 0.5 \\ 0 & 1 & 0 & 1 & 0.75 \\ 0 & 1 & 0 & 3 & 0.9 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 1 & 5 & 0.2 \\ \end{bmatrix} $$

Хеширование

🍒 ⇒ [ 1, 1]
🍑 ⇒ [ 0,-1]
🍏 ⇒ [ 0, 1]
🍇 ⇒ [-1, 1]
🍓 ⇒ [ 2, 0]

aka the hashing trick

Хеширование в sklearn


            from sklearn.feature_extraction import FeatureHasher 

            fruits = ["cherry", "peach", "apple", "grapes", "strawberry"] 

            hasher = FeatureHasher(n_features=2, input_type="string") 

            hasher.transform(fruits)

Кодирование вещественным признаком

🍒 ⇒ 250
🍑 ⇒ 150
🍏 ⇒ 100

Кодирование вещественным признаком

🍒 ⇒ 250
🍑 ⇒ 150
🍏 ⇒ 100

$$ \begin{bmatrix} 🍒 & 2 & 0.5 \\ 🍑 & 1 & 0.75 \\ 🍑 & 3 & 0.9 \\ \vdots & \vdots & \vdots \\ 🍏 & 5 & 0.2 \\ \end{bmatrix} \; \begin{bmatrix} 250 \\ 100 \\ 200 \\ \vdots \\ 100 \\ \end{bmatrix} \quad \Longrightarrow \quad \begin{bmatrix} 250 & 2 & 0.5 \\ 150 & 1 & 0.75 \\ 150 & 3 & 0.9 \\ \vdots & \vdots & \vdots \\ 100 & 5 & 0.2 \\ \end{bmatrix} \; \begin{bmatrix} 250 \\ 100 \\ 200 \\ \vdots \\ 100 \\ \end{bmatrix} $$

Итог

☑️ Быстро
☑️ С любыми признаками
❌ Не конструктивно

Если у признака есть структура — ее можно использовать.

Векторные вложения 🎉

tensorflow fa-floppy-oat unaligned vector ttypes types const unaligned vec int unaligned vector ttypes types int const unaligned vec example statistics norm normalized squared regularizations regularizations status initialize construction kernel opkernel context error iferror return tfreturn context attr get symmetric error iferror return tfreturn context attr get symmetric shrinkage symmetric symmetric status shrink weight shrinked std max std abs weight shrinkage shrinked std copysign shrinked weight eigen tensor eigen major row eigen shrink eigen tensor eigen major row weights weights sign weights abs weights constant shrinkage cwise max weights constant shrinkage cwise max weights constant symmetric symmetric symmetric symmetric shrinkage and assign copy disallow tfdisallow regularizations model weights example example statistics and compute example norm weighted wxand num partitions model weights model weights regularizations regularization example label example label example weight example weight norm squared norm squared features sparse std ptr unique int unaligned vector indices std ptr unique float unaligned vector values std vector features sparse features sparse dense vector eigen map tensor eigen tensor eigen major row row eigen map tensor eigen tensor eigen major row data matrix data index row data matrix dimension data matrix dimension ttypes types const matrix data matrix int index row std vector std ptr unique dense vector dense vectors example label example weight norm squared examples model weights model weights model weights delta update weights eigen device pool thread device example example bounded delta dual normalized sparse weights size example features sparse features sparse example features sparse feature weights feature weights sparse weights int features sparse indices size feature value features sparse values features sparse values feature weights deltas features sparse indices feature value bounded delta dual normalized dense weights size example dense vector dense vector example dense vectors ttypes types vec deltas dense weights deltas deltas device device deltas dense vector row deltas constant bounded delta dual normalized status initialize context kernel opkernel context input list opinput inputs sparse weights error iferror return tfreturn context input list inputs sparse weights input list opinput dense inputs weights error iferror return tfreturn context input list dense inputs weights list opoutput output outputs sparse weights error iferror return tfreturn context list output outputs sparse weights list opoutput output dense outputs weights error iferror return tfreturn context list output dense outputs weights intialize weights input list opinput inputs weight list opoutput output outputs weight std vector feature weights feature weights inputs weight size tensor delta outputs weight allocate inputs weight shape delta deltas delta flat deltas set zero feature weights back emplace feature weights inputs weight flat deltas intialize weights inputs sparse weights outputs sparse weights sparse weights intialize weights dense inputs weights dense outputs weights dense weights status feature weights ttypes types vec nominals ttypes types vec deltas std vector feature weights sparse weights std vector feature weights dense weights example and assign copy disallow tfdisallow model weights example statistics example and compute example norm weighted wxand num partitions model weights model weights regularizations regularization example statistics result result norm normalized squared norm squared regularization symmetric features sparse size example features sparse features sparse features sparse model weights feature weights sparse weights model weights sparse weights int features sparse indices size int feature index features sparse indices feature value features sparse values features sparse values feature weight sparse weights nominals feature index sparse weights deltas feature index num partitions result feature value regularization shrink feature weight dense vectors size example dense vector dense vector dense vectors model weights feature weights dense weights model weights dense weights eigen tensor eigen major row feature weights dense weights nominals dense weights deltas dense weights deltas constant num partitions eigen tensor eigen major row prediction dense vector row regularization eigen shrink feature weights sum result prediction result examples examples example example example index examples example index examples num examples size features num features num status initialize context kernel opkernel context features num sparse features num sparse values with dense features num features num features num sparse dense features num input list opinput example indices inputs sparse error iferror return tfreturn context input list example indices inputs sparse input list opinput feature indices inputs sparse error iferror return tfreturn context input list feature indices inputs sparse input list opinput feature inputs sparse values features num sparse values with error iferror return tfreturn context input list feature inputs sparse values tensor example weights error iferror return tfreturn context input example weights example weights example weights flat examples num example weights size tensor example labels error iferror return tfreturn context input example labels example labels example labels flat input list opinput dense features inputs error iferror return tfreturn context input list dense features inputs examples clear examples

Вложение в векторное пространство

$ \begin{split} \texttt{"apple"} \Leftrightarrow & \; V_1 \\ \\ \texttt{"peach"} \Leftrightarrow & \; V_2 \\ \\ \texttt{"cat"} \Leftrightarrow & \; V_3 \\ \\ \mathrm{similarity}(V_1, V_2) < & \; \mathrm{similarity}(V_1, V_3) \end{split} $

Вложение в векторное пространство

$ \begin{split} \texttt{"apple"} \Leftrightarrow & \; V_1 \\ \\ \texttt{"peach"} \Leftrightarrow & \; V_2 \\ \\ \texttt{"cat"} \Leftrightarrow & \; V_3 \\ \\ \mathrm{similarity}(V_1, V_2) \sim & \; {V_1}^\top V_2 \end{split} $

Как оценить ${V_i}^\top V_j$ ?

Контекст — слово

Пример:
"Слова — это всего лишь слова."
Ширина контекста — 1.

Контекст — слово

Пример:
"Слова — это всего лишь слова."

Контекст — слово

Пример:
"Слова — это всего лишь слова."

Контекст — слово

Пример:
"Слова — это всего лишь слова."

Контекст — слово

Пример:
"Слова — это всего лишь слова."

Контекст — слово

Пример:
"Слова — это всего лишь слова."

Как оценить ${V_i}^\top V_j$ ?

Пример:
"Слова — это всего лишь слова."

$\qquad{V_i}^\top V_j = \mathrm{PMI}(A)_{ij}$

Как оценить ${V_i}^\top V_j$ ?

Pointwise mutual information:

$\mathrm{PMI} (x;y) \equiv \log {\dfrac {p(x,\,y)}{p(x)\,p(y)}}.$

Как оценить ${V_i}^\top V_j$ ?

Pointwise mutual information:

$ \mathrm{PMI} (x;y) \equiv \log {\dfrac {p(x,\,y)}{p(x)\,p(y)}}. \\[20pt] x = это \quad y = лишь \\ \dfrac {p(это,\, лишь)}{p(это)\,p(лишь)} = \dfrac {0}{1/4 \,\cdot\, 1/4} = 0. \\ $

Как оценить ${V_i}^\top V_j$ ?

Pointwise mutual information:

$ \mathrm{PMI} (x;y) \equiv \log {\dfrac {p(x,\,y)}{p(x)\,p(y)}}. \\[20pt] x = это \quad y = всего \\ \dfrac {p(это,\, всего)}{p(это)\,p(всего)} = \dfrac {1/8}{1/4 \,\cdot\, 1/4} = 2. \\ $

${V_i}^\top V_j \approx \log {\dfrac {p(x,\,y)}{p(x)\,p(y)}}$

Итого

Снижает размерность данных.
Похожий контекст — похожие вектора.
Логический вывод: $V(девушка) − V(парень) ≃ V(королева) − V(король)$.

Примеры моделей

Word2vec — Статья на ArXiV, fastText.
GloVe — Статья, реализация.
Swivel — Статья на ArXiV, реализация на tensorflow, наш форк.

Swivel

✅Работает с матрицей слово-контекст
✅Масштабируется с размером словаря
✅Горизонтально масштабируется (GPUs & nodes)

Swivel sharding

Быстро. Мощно.

Анализируем
код

Id2Vec

Строим векторное представление для идентификаторов в коде.
Как построить матрицу слово-контекст $A_{ij}$?

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None

Предобработка

            _tcp_socket_connect ⇒ [tcp, socket, connect]
            AuthenticationError ⇒ [authentication, error]
            authentication, authenticate ⇒ authenticate

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None

database, connect², user², password², host², port², tcp, socket², authenticate², error, close

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None

connect², user², password², host², port², tcp, socket², authenticate², error, close

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None

connect, user, password, host, port

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None

tcp, socket, connect, host, port

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None

authenticate², user, password, error, socket, close

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None

authenticate, user, password

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None

authenticate, error, socket, close

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None

authenticate, error

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None

socket, close

Id2Vec

Строим векторное представление для идентификаторов в коде.
Как построить матрицу слово-контекст $A_{ij}$?

$A_{ij} =$ количество раз когда $i$ и $j$ встретились вместе

Id2Vec

Строим векторное представление для идентификаторов в коде.
Как построить матрицу слово-контекст $A_{ij}$?

$A_{ij} =$ количество раз когда $i$ и $j$ встретились вместе

$$ {V_i}^\top V_j \approx PMI_{ij} = \log\dfrac{A_{ij} \sum A_{ij}}{\sum_{k = 1}^N A_{ik}\sum_{k = 1}^N A_{jk}} $$

Результаты

Ближайшие имена к “foo”

afoo
qux
myfoo
baz
mfoo
wibble
dofoo
quux
dfoo
testing
ifoo

Логический вывод

$V(\mathrm{bug})$ $-$ $V(\mathrm{test})$ $+$ $V(\mathrm{expect})$ $\approx$ $V(\mathrm{suppress})$

$V(\mathrm{database})$ $-$ $V(\mathrm{query})$ $+$ $V(\mathrm{tune})$ $\approx$ $V(\mathrm{settings})$

$V(\mathrm{send})$ $-$ $V(\mathrm{receive})$ $+$ $V(\mathrm{pop})$ $\approx$ $V(\mathrm{push})$

Очепятки

$V(\mathrm{recieve})$ $\approx$ $V(\mathrm{receive})$
$V(\mathrm{grey})$ $\approx$ $V(\mathrm{gray})$
$V(\mathrm{calback})$ $\approx$ $V(\mathrm{callbak})$ $\approx$ $V(\mathrm{callback})$

Id2vec

Реализация доступна на Github.

            $ python3 -m sourced.ml repo2coocc ...   
            $ python3 -m sourced.ml id2vec-preproc ... 
            $ python3 -m sourced.ml id2vec-train ...

Разбиение кода на сниппеты

                import numpy as np
                seed = 42
                np.random.seed(seed)
                print(np.random.randint(10))
                from sklearn.cluster import KMeans
                X = np.array([[1, 2], [1, 4]])
                kmeans = KMeans(2).fit(X)
                print(kmeans.labels_)

Разбиение кода на сниппеты

                import numpy as np
                seed = 42
                np.random.seed(seed)
                print(np.random.randint(10))
                from sklearn.cluster import KMeans
                X = np.array([[1, 2], [1, 4]])
                kmeans = KMeans(2).fit(X)
                print(kmeans.labels_)

Разбиение кода на сниппеты

                import numpy as np
                seed = 42
                np.random.seed(seed)
                print(np.random.randint(10))
                from sklearn.cluster import KMeans
                X = np.array([[1, 2], [1, 4]])
                kmeans = KMeans(2).fit(X)
                print(kmeans.labels_)

Разбиение кода на сниппеты

                import numpy as np
                seed = 42
                np.random.seed(seed)
                print(np.random.randint(10))
                from sklearn.cluster import KMeans
                X = np.array([[1, 2], [1, 4]])
                kmeans = KMeans(2).fit(X)
                print(kmeans.labels_)

Разбиение кода на сниппеты

                import numpy as np
                seed = 42
                np.random.seed(seed)
                print(np.random.randint(10))
                from sklearn.cluster import KMeans
                X = np.array([[1, 2], [1, 4]])
                kmeans = KMeans(2).fit(X)
                print(kmeans.labels_)

Разбиение кода на сниппеты

                import numpy as np
                seed = 42
                np.random.seed(seed)
                print(np.random.randint(10))
                from sklearn.cluster import KMeans
                X = np.array([[1, 2], [1, 4]])
                kmeans = KMeans(2).fit(X)
                print(kmeans.labels_)

Разбиение кода на сниппеты

                import numpy as np
                seed = 42
                np.random.seed(seed)
                print(np.random.randint(10))
                from sklearn.cluster import KMeans
                X = np.array([[1, 2], [1, 4]])
                kmeans = KMeans(2).fit(X)
                print(kmeans.labels_)

Пример

Python Data Science Handbook - Linear Regression.

            

            import matplotlib.pyplot as plt
            import seaborn as sns; sns.set()
            import numpy as np
            

            rng = np.random.RandomState(1)
            x = 10 * rng.rand(50)
            y = 2 * x - 5 + rng.randn(50)
            plt.scatter(x, y);
            

            from sklearn.linear_model import LinearRegression
            model = LinearRegression(fit_intercept=True)
            

            model.fit(x[:, np.newaxis], y)
            

            xfit = np.linspace(0, 10, 1000)
            yfit = model.predict(xfit[:, np.newaxis])
            

            plt.scatter(x, y)
            plt.plot(xfit, yfit);
            

            print("Model slope:    ", model.coef_[0])
            print("Model intercept:", model.intercept_)
            

            rng = np.random.RandomState(1)
            X = 10 * rng.rand(100, 3)
            y = 0.5 + np.dot(X, [1.5, -2., 1.])
            

            model.fit(X, y)
            print(model.intercept_)
            print(model.coef_)
            

            from sklearn.preprocessing import PolynomialFeatures
            x = np.array([2, 3, 4])
            poly = PolynomialFeatures(3, include_bias=False)
            poly.fit_transform(x[:, None])
            

            from sklearn.pipeline import make_pipeline
            poly_model = make_pipeline(PolynomialFeatures(7),
                                       LinearRegression())
            

            rng = np.random.RandomState(1)
            x = 10 * rng.rand(50)
            y = np.sin(x) + 0.1 * rng.randn(50)
            

            poly_model.fit(x[:, np.newaxis], y)
            yfit = poly_model.predict(xfit[:, np.newaxis])
            

            plt.scatter(x, y)
            plt.plot(xfit, yfit);
            

            from sklearn.base import BaseEstimator, TransformerMixin
            

            class GaussianFeatures(BaseEstimator, TransformerMixin):
                """Uniformly spaced Gaussian features for one-dimensional input"""
            

                def __init__(self, N, width_factor=2.0):
                    self.N = N
                    self.width_factor = width_factor
            

                @staticmethod
                def _gauss_basis(x, y, width, axis=None):
                    arg = (x - y) / width
                    return np.exp(-0.5 * np.sum(arg ** 2, axis))
            

                def fit(self, X, y=None):
            

                    self.centers_ = np.linspace(X.min(), X.max(), self.N)
                    self.width_ = self.width_factor * (self.centers_[1] - self.centers_[0])
                    return self
            

                def transform(self, X):
                    return self._gauss_basis(X[:, :, np.newaxis], self.centers_,
                                             self.width_, axis=1)
            

            gauss_model = make_pipeline(GaussianFeatures(20),
                                        LinearRegression())
            gauss_model.fit(x[:, np.newaxis], y)
            yfit = gauss_model.predict(xfit[:, np.newaxis])
            

            plt.scatter(x, y)
            plt.plot(xfit, yfit)
            plt.xlim(0, 10);
            model = make_pipeline(GaussianFeatures(30),
                                  LinearRegression())
            model.fit(x[:, np.newaxis], y)
            

            plt.scatter(x, y)
            plt.plot(xfit, model.predict(xfit[:, np.newaxis]))
            

            plt.xlim(0, 10)
            plt.ylim(-1.5, 1.5);
            

            def basis_plot(model, title=None):
                fig, ax = plt.subplots(2, sharex=True)
                model.fit(x[:, np.newaxis], y)
                ax[0].scatter(x, y)
                ax[0].plot(xfit, model.predict(xfit[:, np.newaxis]))
                ax[0].set(xlabel='x', ylabel='y', ylim=(-1.5, 1.5))
            

                if title:
                    ax[0].set_title(title)
            

                ax[1].plot(model.steps[0][1].centers_,
                           model.steps[1][1].coef_)
                ax[1].set(xlabel='basis location',
                          ylabel='coefficient',
                          xlim=(0, 10))
            

            model = make_pipeline(GaussianFeatures(30), LinearRegression())
            basis_plot(model)
            

            from sklearn.linear_model import Ridge
            model = make_pipeline(GaussianFeatures(30), Ridge(alpha=0.1))
            basis_plot(model, title='Ridge Regression')
            

            from sklearn.linear_model import Lasso
            model = make_pipeline(GaussianFeatures(30), Lasso(alpha=0.001))
            basis_plot(model, title='Lasso Regression')
            

            import pandas as pd
            counts = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
            weather = pd.read_csv('data/BicycleWeather.csv', index_col='DATE', parse_dates=True)
            

            daily = counts.resample('d').sum()
            daily['Total'] = daily.sum(axis=1)
            daily = daily[['Total']]
            

            days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
            for i in range(7):
                daily[days[i]] = (daily.index.dayofweek == i).astype(float)
            

            from pandas.tseries.holiday import USFederalHolidayCalendar
            cal = USFederalHolidayCalendar()
            holidays = cal.holidays('2012', '2016')
            daily = daily.join(pd.Series(1, index=holidays, name='holiday'))
            daily['holiday'].fillna(0, inplace=True)
            

            def hours_of_daylight(date, axis=23.44, latitude=47.61):
                """Compute the hours of daylight for the given date"""
                days = (date - pd.datetime(2000, 12, 21)).days
                m = (1. - np.tan(np.radians(latitude))
                     * np.tan(np.radians(axis) * np.cos(days * 2 * np.pi / 365.25)))
                return 24. * np.degrees(np.arccos(1 - np.clip(m, 0, 2))) / 180.
            

            daily['daylight_hrs'] = list(map(hours_of_daylight, daily.index))
            daily[['daylight_hrs']].plot()
            plt.ylim(8, 17)
            

            weather['TMIN'] /= 10
            weather['TMAX'] /= 10
            weather['Temp (C)'] = 0.5 * (weather['TMIN'] + weather['TMAX'])
            

            weather['PRCP'] /= 254
            weather['dry day'] = (weather['PRCP'] == 0).astype(int)
            

            daily = daily.join(weather[['PRCP', 'Temp (C)', 'dry day']])
            daily['annual'] = (daily.index - daily.index[0]).days / 365.
            daily.head()
            daily.dropna(axis=0, how='any', inplace=True)
            

            column_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'holiday',
                            'daylight_hrs', 'PRCP', 'dry day', 'Temp (C)', 'annual']
            X = daily[column_names]
            y = daily['Total']
            

            model = LinearRegression(fit_intercept=False)
            model.fit(X, y)
            daily['predicted'] = model.predict(X)
            

            daily[['Total', 'predicted']].plot(alpha=0.5);
            

            params = pd.Series(model.coef_, index=X.columns)
            params
            

            from sklearn.utils import resample
            np.random.seed(1)
            err = np.std([model.fit(*resample(X, y)).coef_
                          for i in range(1000)], 0)
            

            print(pd.DataFrame({'effect': params.round(0),
                                'error': err.round(0)}))

            

            import matplotlib.pyplot as plt
            import seaborn as sns; sns.set()
            import numpy as np
            

            rng = np.random.RandomState(1)
            x = 10 * rng.rand(50)
            y = 2 * x - 5 + rng.randn(50)
            plt.scatter(x, y);
            

            from sklearn.linear_model import LinearRegression
            model = LinearRegression(fit_intercept=True)
            

            model.fit(x[:, np.newaxis], y)
            

            xfit = np.linspace(0, 10, 1000)
            yfit = model.predict(xfit[:, np.newaxis])
            

            plt.scatter(x, y)
            plt.plot(xfit, yfit);
            

            print("Model slope:    ", model.coef_[0])
            print("Model intercept:", model.intercept_)
            

            rng = np.random.RandomState(1)
            X = 10 * rng.rand(100, 3)
            y = 0.5 + np.dot(X, [1.5, -2., 1.])
            

            model.fit(X, y)
            print(model.intercept_)
            print(model.coef_)
            

            from sklearn.preprocessing import PolynomialFeatures
            x = np.array([2, 3, 4])
            poly = PolynomialFeatures(3, include_bias=False)
            poly.fit_transform(x[:, None])
            

            from sklearn.pipeline import make_pipeline
            poly_model = make_pipeline(PolynomialFeatures(7),
                                       LinearRegression())
            

            rng = np.random.RandomState(1)
            x = 10 * rng.rand(50)
            y = np.sin(x) + 0.1 * rng.randn(50)
            

            poly_model.fit(x[:, np.newaxis], y)
            yfit = poly_model.predict(xfit[:, np.newaxis])
            

            plt.scatter(x, y)
            plt.plot(xfit, yfit);
            

            from sklearn.base import BaseEstimator, TransformerMixin
            

            class GaussianFeatures(BaseEstimator, TransformerMixin):
                """Uniformly spaced Gaussian features for one-dimensional input"""
            

                def __init__(self, N, width_factor=2.0):
                    self.N = N
                    self.width_factor = width_factor
            

                @staticmethod
                def _gauss_basis(x, y, width, axis=None):
                    arg = (x - y) / width
                    return np.exp(-0.5 * np.sum(arg ** 2, axis))
            

                def fit(self, X, y=None):
            

                    self.centers_ = np.linspace(X.min(), X.max(), self.N)
                    self.width_ = self.width_factor * (self.centers_[1] - self.centers_[0])
                    return self
            

                def transform(self, X):
                    return self._gauss_basis(X[:, :, np.newaxis], self.centers_,
                                             self.width_, axis=1)
            

            gauss_model = make_pipeline(GaussianFeatures(20),
                                        LinearRegression())
            gauss_model.fit(x[:, np.newaxis], y)
            yfit = gauss_model.predict(xfit[:, np.newaxis])
            

            plt.scatter(x, y)
            plt.plot(xfit, yfit)
            plt.xlim(0, 10);
            model = make_pipeline(GaussianFeatures(30),
                                  LinearRegression())
            model.fit(x[:, np.newaxis], y)
            

            plt.scatter(x, y)
            plt.plot(xfit, model.predict(xfit[:, np.newaxis]))
            

            plt.xlim(0, 10)
            plt.ylim(-1.5, 1.5);
            

            def basis_plot(model, title=None):
                fig, ax = plt.subplots(2, sharex=True)
                model.fit(x[:, np.newaxis], y)
                ax[0].scatter(x, y)
                ax[0].plot(xfit, model.predict(xfit[:, np.newaxis]))
                ax[0].set(xlabel='x', ylabel='y', ylim=(-1.5, 1.5))
            

                if title:
                    ax[0].set_title(title)
            

                ax[1].plot(model.steps[0][1].centers_,
                           model.steps[1][1].coef_)
                ax[1].set(xlabel='basis location',
                          ylabel='coefficient',
                          xlim=(0, 10))
            

            model = make_pipeline(GaussianFeatures(30), LinearRegression())
            basis_plot(model)
            

            from sklearn.linear_model import Ridge
            model = make_pipeline(GaussianFeatures(30), Ridge(alpha=0.1))
            basis_plot(model, title='Ridge Regression')
            

            from sklearn.linear_model import Lasso
            model = make_pipeline(GaussianFeatures(30), Lasso(alpha=0.001))
            basis_plot(model, title='Lasso Regression')
            

            import pandas as pd
            counts = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
            weather = pd.read_csv('data/BicycleWeather.csv', index_col='DATE', parse_dates=True)
            

            daily = counts.resample('d').sum()
            daily['Total'] = daily.sum(axis=1)
            daily = daily[['Total']]
            

            days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
            for i in range(7):
                daily[days[i]] = (daily.index.dayofweek == i).astype(float)
            

            from pandas.tseries.holiday import USFederalHolidayCalendar
            cal = USFederalHolidayCalendar()
            holidays = cal.holidays('2012', '2016')
            daily = daily.join(pd.Series(1, index=holidays, name='holiday'))
            daily['holiday'].fillna(0, inplace=True)
            

            def hours_of_daylight(date, axis=23.44, latitude=47.61):
                """Compute the hours of daylight for the given date"""
                days = (date - pd.datetime(2000, 12, 21)).days
                m = (1. - np.tan(np.radians(latitude))
                     * np.tan(np.radians(axis) * np.cos(days * 2 * np.pi / 365.25)))
                return 24. * np.degrees(np.arccos(1 - np.clip(m, 0, 2))) / 180.
            

            daily['daylight_hrs'] = list(map(hours_of_daylight, daily.index))
            daily[['daylight_hrs']].plot()
            plt.ylim(8, 17)
            

            weather['TMIN'] /= 10
            weather['TMAX'] /= 10
            weather['Temp (C)'] = 0.5 * (weather['TMIN'] + weather['TMAX'])
            

            weather['PRCP'] /= 254
            weather['dry day'] = (weather['PRCP'] == 0).astype(int)
            

            daily = daily.join(weather[['PRCP', 'Temp (C)', 'dry day']])
            daily['annual'] = (daily.index - daily.index[0]).days / 365.
            daily.head()
            daily.dropna(axis=0, how='any', inplace=True)
            

            column_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'holiday',
                            'daylight_hrs', 'PRCP', 'dry day', 'Temp (C)', 'annual']
            X = daily[column_names]
            y = daily['Total']
            

            model = LinearRegression(fit_intercept=False)
            model.fit(X, y)
            daily['predicted'] = model.predict(X)
            

            daily[['Total', 'predicted']].plot(alpha=0.5);
            

            params = pd.Series(model.coef_, index=X.columns)
            params
            

            from sklearn.utils import resample
            np.random.seed(1)
            err = np.std([model.fit(*resample(X, y)).coef_
                          for i in range(1000)], 0)
            

            print(pd.DataFrame({'effect': params.round(0),
                                'error': err.round(0)}))

TL; DR

Взять задачу
Построить матрицу слово-контекст
Запустить Swivel
???
PROFIT!

Конец

Презентация:
zurk.github.io/moscow-python-06-2018