[+] reformat with ruff

This commit is contained in:
Siarhei Siniak 2025-05-20 11:03:00 +03:00
parent cf9ede1dde
commit 64a898ce44
28 changed files with 8683 additions and 12600 deletions

@@ -1,694 +0,0 @@
# %% [markdown]
# # About this Notebook
#
# NLP is a very hot topic right now and, as many experts believe, "2020 is going to be NLP's year". With its ever-changing dynamics it is experiencing a boom, just as computer vision once did. Owing to its popularity, Kaggle recently launched two NLP competitions, and being a lover of this hot topic, I prepared myself to join my first Kaggle competition.<br><br>
# When I joined the competitions I was a complete beginner with deep learning techniques for NLP, and all my enthusiasm took a beating when I saw everyone using all kinds of BERT; everything just went over my head. I thought of quitting, but there is a special thing about Kaggle: it just hooks you. I thought, I have to learn this someday, so why not now? So I braced myself and got on the learning curve. I wrote a kernel on the Tweet Sentiment Extraction competition that has now earned a gold medal; it can be viewed here: https://www.kaggle.com/tanulsingh077/twitter-sentiment-extaction-analysis-eda-and-model <br><br>
# After 10 days of extensive learning (working through the latest NLP approaches), I am back here to share my learning by writing a kernel that starts from the very basic RNNs and builds all the way up to BERT. I invite you all to come and learn alongside me and take a step closer towards becoming an NLP expert.
# %% [markdown]
# # Contents
#
# In this notebook I will start with the very basics of RNNs and build all the way up to the latest deep learning architectures for solving NLP problems. It will cover the following:
# * Simple RNN's
# * Word Embeddings : Definition and How to get them
# * LSTM's
# * GRU's
# * BI-Directional RNN's
# * Encoder-Decoder Models (Seq2Seq Models)
# * Attention Models
# * Transformers - Attention is all you need
# * BERT
#
# I will divide every Topic into four subsections:
# * Basic Overview
# * In-Depth Understanding : In this I will attach links of articles and videos to learn about the topic in depth
# * Code-Implementation
# * Code Explanation
#
# This is a comprehensive kernel, and if you follow along till the end, I promise you will come away with a working understanding of all these techniques.
#
# Note that the aim of this notebook is not to achieve a high LB score but to present a beginner's guide to the deep learning techniques used for NLP. After discussing all of these ideas, I will also present a starter solution for this competition.
# %% [markdown]
# **<span style="color:Red">This kernel has been the work of more than 10 days. If you find it useful and my efforts appreciable, please upvote it; it motivates me to write more quality content.</span>**
# %% [code]
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.models import Sequential
# Import the layers from keras.layers: the older submodule paths
# (keras.layers.recurrent, .core, .embeddings, .normalization) are not
# available in newer Keras releases. The unused np_utils import is dropped.
from keras.layers import LSTM, GRU, SimpleRNN
from keras.layers import Dense, Activation, Dropout
from keras.layers import Embedding, BatchNormalization
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
# %% [markdown]
# # Configuring TPU's
#
# For this version of the notebook we will be using TPUs, as we have to build a BERT model.
# %% [code]
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)
# %% [code]
train = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
validation = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
# %% [markdown]
# We will drop the other label columns and treat this as a binary classification problem. We will also work on a small subset of the dataset (only 12,000 data points) to make the models quicker to train.
# %% [code]
train.drop(['severe_toxic','obscene','threat','insult','identity_hate'],axis=1,inplace=True)
# %% [code]
# .loc[:12000] is label-inclusive and would keep 12001 rows; .iloc keeps exactly 12000
train = train.iloc[:12000]
train.shape
# %% [markdown]
# We will check the maximum number of words present in a comment; this will help us with padding later.
# %% [code]
train['comment_text'].apply(lambda x:len(str(x).split())).max()
# %% [markdown]
# Writing a function to compute the AUC score for validation
# %% [code]
def roc_auc(predictions, target):
    '''
    Return the ROC AUC score for the given predictions and labels.
    '''
    fpr, tpr, thresholds = metrics.roc_curve(target, predictions)
    # return directly instead of shadowing the function name with a local variable
    return metrics.auc(fpr, tpr)
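# %% [markdown]
# As a sanity check on what `roc_auc` measures, here is a minimal sketch (not the scikit-learn implementation) that computes AUC directly as the fraction of (toxic, non-toxic) pairs in which the toxic example gets the higher score, with ties counted as half. The helper name `pairwise_auc` and the toy labels and scores are assumptions made for illustration. It is quadratic in the number of examples, so it only suits small inputs.
# %% [code]
def pairwise_auc(target, predictions):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [p for t, p in zip(target, predictions) if t == 1]
    neg = [p for t, p in zip(target, predictions) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ordered pair
    return wins / (len(pos) * len(neg))


toy_target = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_target, toy_scores))  # 3 of the 4 pairs are ordered correctly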
# %% [markdown]
# ### Data Preparation
# %% [code]
xtrain, xvalid, ytrain, yvalid = train_test_split(train.comment_text.values, train.toxic.values,
stratify=train.toxic.values,
random_state=42,
test_size=0.2, shuffle=True)
# %% [markdown]
# # Before We Begin
#
# Before we begin: if you are a complete beginner with NLP and have never worked with text data, I am attaching a few kernels that will serve as a starting point for your journey
# * https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial
# * https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle
#
# If you want a more basic dataset to practice with, here is another kernel that I wrote:
# * https://www.kaggle.com/tanulsingh077/what-s-cooking
#
# Below are some resources for getting started with basic neural networks; they will make the upcoming parts easier to understand
# * https://www.youtube.com/watch?v=aircAruvnKk&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv
# * https://www.youtube.com/watch?v=IHZwWFHWa-w&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv&index=2
# * https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv&index=3
# * https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv&index=4
#
# To learn how to visualize text data and which tools to use, see:
# * https://www.kaggle.com/tanulsingh077/twitter-sentiment-extaction-analysis-eda-and-model
# * https://www.kaggle.com/jagangupta/stop-the-s-toxic-comments-eda
# %% [markdown]
# # Simple RNN
#
# ## Basic Overview
#
# What is a RNN?
#
# A Recurrent Neural Network (RNN) is a type of neural network in which the output from the previous step is fed as input to the current step. In traditional neural networks, all inputs and outputs are independent of each other, but in tasks such as predicting the next word of a sentence, the previous words are needed, so the network has to remember them. RNNs address this with a hidden state that carries information from one step to the next.
#
# Why RNN's?
#
# https://www.quora.com/Why-do-we-use-an-RNN-instead-of-a-simple-neural-network
#
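# To make the idea of a hidden state concrete, here is a minimal sketch of a vanilla RNN written in plain Python, using the common update h_t = tanh(w_x * x_t + w_h * h_prev + b). It uses a single scalar hidden unit, and the weights and inputs are made-up numbers, so it illustrates the recurrence only and is not what Keras does internally.
# %% [code]
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit vanilla RNN over a sequence and return every hidden state."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # the new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# The same input value (1.0) yields different states at steps 1 and 3,
# because the state at step 3 still carries information from earlier steps.
print(rnn_forward([1.0, 0.0, 1.0]))
# %% [markdown]
#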
# ## In-Depth Understanding
#
# * https://medium.com/mindorks/understanding-the-recurrent-neural-network-44d593f112a2
# * https://www.youtube.com/watch?v=2E65LDnM2cA&list=PL1F3ABbhcqa3BBWo170U4Ev2wfsF7FN8l
# * https://www.d2l.ai/chapter_recurrent-neural-networks/rnn.html
#
# ## Code Implementation
#
# So first I will implement the model and then I will explain the code step by step
# %% [code]
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 1500
token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)
#zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)
word_index = token.word_index
# %% [code]
#%%time
with strategy.scope():
    # A simpleRNN without any pretrained embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        300,
                        input_length=max_len))
    model.add(SimpleRNN(100))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()
# %% [code]
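# `epochs` replaces the long-deprecated `nb_epoch`; the batch size is scaled by the replica count to run on TPUs
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync)
# %% [code]
scores = model.predict(xvalid_pad)
# AUC lies in [0, 1], so print it as a plain number rather than with a percent sign
print("AUC: %.4f" % roc_auc(scores, yvalid))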
# %% [code]
scores_model = []
scores_model.append({'Model': 'SimpleRNN','AUC_Score': roc_auc(scores,yvalid)})
# %% [markdown]
# ## Code Explanation
# * Tokenization<br><br>
# If you have watched the videos and followed the links, you will know that an RNN takes a sentence as input word by word. Conceptually, we represent every word as a one-hot vector of dimension (number of words in the vocabulary + 1). <br>
# The Keras Tokenizer takes all the unique words in the corpus and builds a dictionary with words as keys and their number of occurrences as values. It then sorts the dictionary in descending order of counts and assigns index 1 to the first word, 2 to the second, and so on. So if the word 'the' occurs most often in the corpus, it is assigned index 1, and the vector representing 'the' would be a one-hot vector with value 1 at position 1 and zeros elsewhere.<br>
# Try printing the first element of xtrain_seq and you will see that every word is now represented by an integer
# %% [code]
xtrain_seq[:1]
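# %% [markdown]
# The indexing scheme described above can be sketched in a few lines of plain Python. This is a simplified stand-in, not the Keras implementation: it only lowercases and splits on whitespace, and the names `build_word_index` and `texts_to_ids` are hypothetical. Words are ranked by frequency and numbered from 1, leaving 0 free for padding.
# %% [code]
from collections import Counter


def build_word_index(texts):
    """Map each word to its frequency rank, starting at 1 for the most common word."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # sorted() is stable, so equally frequent words keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_ids(texts, word_index):
    """Replace every known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]


toy_texts = ["the cat sat", "the dog sat on the mat"]
toy_index = build_word_index(toy_texts)
print(toy_index["the"], toy_index["sat"])  # the two most frequent words get the smallest indices
print(texts_to_ids(["the mat sat"], toy_index))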
# %% [markdown]
# <b>Now you might be wondering: what is padding, and why is it done?</b><br><br>
#
# Here is the answer :
# * https://www.quora.com/Which-effect-does-sequence-padding-have-on-the-training-of-a-neural-network
# * https://machinelearningmastery.com/data-preparation-variable-length-input-sequences-sequence-prediction/
# * https://www.coursera.org/lecture/natural-language-processing-tensorflow/padding-2Cyzs
#
# Sometimes people also use special tokens while tokenizing, such as EOS (end of string) and BOS (beginning of string). Here is why that is done:
# * https://stackoverflow.com/questions/44579161/why-do-we-do-padding-in-nlp-tasks
#
#
# The code `token.word_index` simply returns the vocabulary dictionary that Keras created for us
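#
# To see what padding does to the integer sequences, here is a minimal pure-Python sketch. It assumes the Keras `pad_sequences` defaults (zeros added at the front, and over-long sequences cut from the front); `pad_left` is a hypothetical helper, not a library function.
# %% [code]
def pad_left(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) every sequence to exactly `maxlen` items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


print(pad_left([[5, 8, 2], [7]], maxlen=2))  # every row now has length 2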
# %% [markdown]
# * Building the Neural Network
#
# To understand the dimensions of the input and output of an RNN in Keras, here is a beautiful article: https://medium.com/@shivajbd/understanding-input-and-output-shape-in-lstm-keras-c501ee95c65e
#
# The first line, `model = Sequential()`, tells Keras that we will build our network sequentially. We then add the Embedding layer.
# The Embedding layer is also a layer of neurons: conceptually it takes the n-dimensional one-hot vector of each word and converts it into a 300-dimensional vector, giving us word embeddings similar to word2vec. We could have used word2vec, but the Embedding layer learns its embeddings during training.
# Next we add a SimpleRNN layer with 100 units, without any dropout or regularization.
# Finally we add a single neuron with a sigmoid activation, which takes the output of the 100 recurrent units (note that these are 100 units, not 100 layers) and predicts the result. We then compile the model with the Adam optimizer.
#
# * Comments on the model<br><br>
# Our model reaches a training accuracy of 1, which is just insane; we are clearly overfitting. But this was the simplest model of all, and we can tune many hyperparameters such as the number of RNN units, or add batch normalization, dropout, etc. to get better results. The point is that we got an AUC score of 0.82 without much effort, and we have now learnt about RNNs. Deep learning is really revolutionary
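#
# The claim that the embedding layer turns a one-hot vector into a dense vector can be checked by hand: multiplying a one-hot vector by the weight matrix simply selects one row. The sketch below uses a made-up 4-word vocabulary with 3-dimensional embeddings instead of the 300 dimensions used above; the numbers are arbitrary.
# %% [code]
toy_embeddings = [
    [0.0, 0.0, 0.0],  # index 0 is reserved for padding
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]


def one_hot(index, size):
    """Return a list of zeros with a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def one_hot_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix using plain Python lists."""
    n_cols = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vector, matrix)) for c in range(n_cols)]


word_id = 2
via_matmul = one_hot_times_matrix(one_hot(word_id, len(toy_embeddings)), toy_embeddings)
via_lookup = toy_embeddings[word_id]
print(via_matmul == via_lookup)  # the product and the row lookup agree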
# %% [markdown]
# # Word Embeddings
#
# While building our simple RNN model we talked about using word embeddings. So what are word embeddings, and how do we get them?
# Here is the answer:
# * https://www.coursera.org/learn/nlp-sequence-models/lecture/6Oq70/word-representation
# * https://machinelearningmastery.com/what-are-word-embeddings/
# <br> <br>
# A common way to get word embeddings is to use pretrained GloVe or fastText vectors. Without going into too much detail, I will show how we can use these vectors to build a machine learning model on top of them. I am a fan of GloVe, word2vec and fastText alike; in this notebook I'll be using the GloVe vectors. You can download them from http://www-nlp.stanford.edu/data/glove.840B.300d.zip or search for GloVe in the datasets on Kaggle and add the file
# %% [code]
# load the GloVe vectors in a dictionary:
embeddings_index = {}
# the context manager closes the file even if parsing fails part-way
with open('/kaggle/input/glove840b300dtxt/glove.840B.300d.txt', 'r', encoding='utf-8') as f:
    for line in tqdm(f):
        values = line.split(' ')
        word = values[0]
        coefs = np.asarray([float(val) for val in values[1:]])
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
# %% [markdown]
# # LSTM's
#
# ## Basic Overview
#
# Simple RNNs were certainly better than classical ML algorithms on sequence tasks and gave state-of-the-art results at the time, but they fail to capture the long-term dependencies present in sentences. LSTMs, introduced in 1997, were designed to counter this drawback.
#
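# The difficulty with long-term dependencies can be illustrated numerically. For a one-unit RNN with h_t = tanh(w_h * h_prev + x_t), the derivative of the last state with respect to the first is a product of one factor w_h * (1 - h_t^2) per step. The sketch below (with arbitrary toy weights and inputs) shows that product shrinking as the sequence gets longer, which is the vanishing-gradient effect LSTMs were designed to reduce.
# %% [code]
import math


def gradient_through_time(n_steps, w_h=0.9, x=0.5):
    """Return d h_T / d h_1 for a one-unit tanh RNN fed a constant input."""
    h = 0.0
    grad = 1.0
    for step in range(n_steps):
        h = math.tanh(w_h * h + x)
        if step > 0:
            # chain rule: one factor per step after the first state
            grad *= w_h * (1.0 - h * h)
    return grad


for n in (2, 10, 50):
    print(n, gradient_through_time(n))
# %% [markdown]
#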
# ## In Depth Understanding
#
# Why LSTM's?
# * https://www.coursera.org/learn/nlp-sequence-models/lecture/PKMRR/vanishing-gradients-with-rnns
# * https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/
#
# What are LSTM's?
# * https://www.coursera.org/learn/nlp-sequence-models/lecture/KXoay/long-short-term-memory-lstm
# * https://distill.pub/2019/memorization-in-rnns/
# * https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
#
# # Code Implementation
#
# We have already tokenized and padded our text for input to the LSTM
# %% [code]
# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
# %% [code]
#%%time
with strategy.scope():
    # A simple LSTM with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        300,
                        weights=[embedding_matrix],
                        input_length=max_len,
                        trainable=False))
    model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()
# %% [code]
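# `epochs` replaces the long-deprecated `nb_epoch`
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync)
# %% [code]
scores = model.predict(xvalid_pad)
# AUC lies in [0, 1], so print it as a plain number rather than with a percent sign
print("AUC: %.4f" % roc_auc(scores, yvalid))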
# %% [code]
scores_model.append({'Model': 'LSTM','AUC_Score': roc_auc(scores,yvalid)})
# %% [markdown]
# ## Code Explanation
#
# As a first step we build the embedding matrix for our vocabulary from the pretrained GloVe vectors. Then, while building the Embedding layer, we pass this matrix as the layer's weights instead of training embeddings on our vocabulary, and therefore set trainable=False.
# The rest of the model is the same as before, except that we have replaced the SimpleRNN with LSTM units
#
# * Comments on the Model
#
# We now see that the model is not overfitting and achieves an AUC score of 0.96, which is quite commendable; the gap between accuracy and AUC has also narrowed.
# In this case we used dropout, which helped prevent overfitting
# %% [markdown]
# # GRU's
#
# ## Basic Overview
#
# Introduced by Cho et al. in 2014, the GRU (Gated Recurrent Unit) aims to solve the vanishing gradient problem that comes with a standard recurrent neural network. GRUs are a variation on the LSTM: both use gates to control the flow of information, but GRUs were designed to be simpler and faster. In many cases the two produce equally good results, so there is no clear winner.
#
# ## In Depth Explanation
#
# * https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be
# * https://www.coursera.org/learn/nlp-sequence-models/lecture/agZiL/gated-recurrent-unit-gru
# * https://www.geeksforgeeks.org/gated-recurrent-unit-networks/
#
# ## Code Implementation
# %% [code]
#%%time
with strategy.scope():
    # GRU with glove embeddings, spatial dropout and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        300,
                        weights=[embedding_matrix],
                        input_length=max_len,
                        trainable=False))
    model.add(SpatialDropout1D(0.3))
    model.add(GRU(300))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()
# %% [code]
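# `epochs` replaces the long-deprecated `nb_epoch`
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync)
# %% [code]
scores = model.predict(xvalid_pad)
# AUC lies in [0, 1], so print it as a plain number rather than with a percent sign
print("AUC: %.4f" % roc_auc(scores, yvalid))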
# %% [code]
scores_model.append({'Model': 'GRU','AUC_Score': roc_auc(scores,yvalid)})
# %% [code]
scores_model
# %% [markdown]
# # Bi-Directional RNN's
#
# ## In Depth Explanation
#
# * https://www.coursera.org/learn/nlp-sequence-models/lecture/fyXnn/bidirectional-rnn
# * https://towardsdatascience.com/understanding-bidirectional-rnn-in-pytorch-5bd25a5dd66
# * https://d2l.ai/chapter_recurrent-modern/bi-rnn.html
#
# ## Code Implementation
# %% [code]
#%%time
with strategy.scope():
    # A simple bidirectional LSTM with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        300,
                        weights=[embedding_matrix],
                        input_length=max_len,
                        trainable=False))
    model.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()
# %% [code]
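# `epochs` replaces the long-deprecated `nb_epoch`
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync)
# %% [code]
scores = model.predict(xvalid_pad)
# AUC lies in [0, 1], so print it as a plain number rather than with a percent sign
print("AUC: %.4f" % roc_auc(scores, yvalid))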
# %% [code]
scores_model.append({'Model': 'Bi-directional LSTM','AUC_Score': roc_auc(scores,yvalid)})
# %% [markdown]
# ## Code Explanation
#
# The code is the same as before, except that we have wrapped the LSTM layer in `Bidirectional`, which is self-explanatory. We achieve similar accuracy and AUC as before, and we have now covered all the typical RNN architectures
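#
# As a rough picture of what the `Bidirectional` wrapper adds, the sketch below runs a toy one-unit recurrence over a sequence twice, once left to right and once right to left, and pairs the two final states. The default merge mode of the Keras wrapper is concatenation; the recurrence, weights and inputs here are simplified assumptions (a real layer also learns separate weights for each direction).
# %% [code]
import math


def final_state(inputs, w_x=0.5, w_h=0.8):
    """Final hidden state of a one-unit tanh RNN."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h


def bidirectional_state(inputs):
    """Concatenate the final states of a forward and a backward pass."""
    forward = final_state(inputs)
    backward = final_state(list(reversed(inputs)))
    return [forward, backward]


print(bidirectional_state([1.0, 0.0, -1.0]))  # two numbers: forward summary, backward summary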
# %% [markdown]
# **We are now at the end of part 1 of this notebook, and things are about to go wild as we enter more complex, state-of-the-art models. If you have followed along from the start, read all the articles and understood everything, these complex models will be fairly easy to understand. I recommend finishing part 1 before continuing, as the upcoming techniques can be quite overwhelming.**
# %% [markdown]
# # Seq2Seq Model Architecture
#
# ## Overview
#
# RNNs come in many types, and different architectures are used for different purposes. Here is a nice video explaining the different types of model architectures: https://www.coursera.org/learn/nlp-sequence-models/lecture/BO8PS/different-types-of-rnns
# Seq2Seq is a many-to-many RNN architecture in which both the input and the output are sequences (the two sequences may or may not have the same length). This architecture is used in many applications such as machine translation, text summarization and question answering
#
# ## In Depth Understanding
#
# I will not write a code implementation for this; instead I will point to resources where the code has already been implemented and explained much better than I could.
#
# * https://www.coursera.org/learn/nlp-sequence-models/lecture/HyEui/basic-models ---> A basic idea of different Seq2Seq Models
#
# * https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html , https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/ ---> Basic Encoder-Decoder Model and its explanation respectively
#
# * https://towardsdatascience.com/how-to-implement-seq2seq-lstm-model-in-keras-shortcutnlp-6f355f3e5639 ---> A More advanced Seq2seq Model and its explanation
#
# * https://d2l.ai/chapter_recurrent-modern/machine-translation-and-dataset.html , https://d2l.ai/chapter_recurrent-modern/encoder-decoder.html ---> Implementation of Encoder-Decoder Model from scratch
#
# * https://www.youtube.com/watch?v=IfsjMg4fLWQ&list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9&index=8&t=0s ---> Introduction to Seq2seq By fast.ai
# %% [code]
# Visualization of Results obtained from various Deep learning models
results = pd.DataFrame(scores_model).sort_values(by='AUC_Score',ascending=False)
results.style.background_gradient(cmap='Blues')
# %% [code]
fig = go.Figure(go.Funnelarea(
text =results.Model,
values = results.AUC_Score,
title = {"position": "top center", "text": "Funnel chart of AUC score by model"}
))
fig.show()
# %% [markdown]
# # Attention Models
#
# This is the toughest and trickiest part. If you can understand the intuition and workings of the attention block, then understanding transformers and transformer-based architectures like BERT will be a piece of cake. This is the part I spent the most time on, and I suggest you do the same. Please read and watch the following resources in the order given to avoid getting confused, and at the end try to write out and draw an attention block in your own way:
#
# * https://www.coursera.org/learn/nlp-sequence-models/lecture/RDXpX/attention-model-intuition --> Only watch this video and not the next one
# * https://towardsdatascience.com/sequence-2-sequence-model-with-attention-mechanism-9e9ca2a613a
# * https://towardsdatascience.com/attention-and-its-different-forms-7fc3674d14dc
# * https://distill.pub/2016/augmented-rnns/
#
# ## Code Implementation
#
# * https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/ --> Basic Level
# * https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html ---> Implementation from Scratch in Pytorch
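#
# For orientation while reading the resources above, here is a minimal sketch of one attention step in plain Python. It uses dot-product scoring divided by the square root of the vector size, followed by a softmax (the scaled form used in Transformers; other scoring functions, such as additive scoring, also exist). All vectors are toy values chosen for illustration.
# %% [code]
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Return the attention weights and the weighted average of `values`."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


toy_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, context = attend([4.0, 0.0], toy_keys, toy_values)
print(weights)  # the first and third keys, which align with the query, get most of the weight
print(context)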
# %% [markdown]
# # Transformers : Attention is all you need
#
# So finally we have reached the end of the learning curve and are about to learn the technology that changed NLP completely and underlies today's state-of-the-art NLP techniques. Transformers were introduced in the paper "Attention Is All You Need" by researchers at Google. If you have understood attention models, this will be very easy. Here are transformers, fully explained:
#
# * http://jalammar.github.io/illustrated-transformer/
#
# ## Code Implementation
#
# * http://nlp.seas.harvard.edu/2018/04/03/attention.html ---> This presents the code implementation of the architecture presented in the paper by Google
# %% [markdown]
# # BERT and Its Implementation on this Competition
#
# As promised, I am back with resources. To understand the BERT architecture, please go through the following in the given order:
#
# * http://jalammar.github.io/illustrated-bert/ ---> In Depth Understanding of BERT
#
# After going through the post above, you should understand how the transformer architecture is used by the current SOTA models. These architectures can be used in two ways:<br><br>
# 1) We can use the model for prediction on our problem with the pretrained weights, without fine-tuning or training it for our specific task
# * EG: http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/ ---> Using Pre-trained BERT without Tuning
#
# 2) We can fine-tune these transformer models for our task by adjusting the pretrained weights while training on a much smaller dataset
# * EG: https://www.youtube.com/watch?v=hinZO--TEk4&t=2933s ---> Tuning BERT For your TASK
#
# We will use the first example as a base for our implementation of the BERT model with Hugging Face and Keras, but unlike that example we will also fine-tune the model for our task
#
# Acknowledgements : https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras
#
#
# Steps Involved :
# * Data Preparation : Tokenization and encoding of data
# * Configuring TPU's
# * Building a Function for Model Training and adding an output layer for classification
# * Train the model and get the results
# %% [code]
# Loading Dependencies
import os
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from kaggle_datasets import KaggleDatasets
import transformers
from tokenizers import BertWordPieceTokenizer
# %% [code]
# LOADING THE DATA
train1 = pd.read_csv("/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv")
valid = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
sub = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/sample_submission.csv')
# %% [markdown]
# Encoder for the data. To understand what `encode_batch` does, read the Hugging Face tokenizer documentation:
# https://huggingface.co/transformers/main_classes/tokenizer.html
# %% [code]
def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512):
    """
    Encode the texts into fixed-length sequences of integer token ids for BERT input
    """
    # Assumption: newer releases of the `tokenizers` library may expect
    # `length=` instead of `max_length=` in enable_padding; adjust if this call fails.
    tokenizer.enable_truncation(max_length=maxlen)
    tokenizer.enable_padding(max_length=maxlen)
    all_ids = []

    # encode in chunks so that each encode_batch call stays small
    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])

    return np.array(all_ids)
# %% [code]
#IMP DATA FOR CONFIG
AUTO = tf.data.experimental.AUTOTUNE
# Configuration
EPOCHS = 3
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
MAX_LEN = 192
# %% [markdown]
# ## Tokenization
#
# For understanding please refer to hugging face documentation again
# %% [code]
# First load the real tokenizer
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
# Save the loaded tokenizer locally
tokenizer.save_pretrained('.')
# Reload it with the huggingface tokenizers library
fast_tokenizer = BertWordPieceTokenizer('vocab.txt', lowercase=False)
fast_tokenizer
# %% [code]
x_train = fast_encode(train1.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_valid = fast_encode(valid.comment_text.astype(str), fast_tokenizer, maxlen=MAX_LEN)
x_test = fast_encode(test.content.astype(str), fast_tokenizer, maxlen=MAX_LEN)
y_train = train1.toxic.values
y_valid = valid.toxic.values
# %% [code]
train_dataset = (
tf.data.Dataset
.from_tensor_slices((x_train, y_train))
.repeat()
.shuffle(2048)
.batch(BATCH_SIZE)
.prefetch(AUTO)
)
valid_dataset = (
tf.data.Dataset
.from_tensor_slices((x_valid, y_valid))
.batch(BATCH_SIZE)
.cache()
.prefetch(AUTO)
)
test_dataset = (
tf.data.Dataset
.from_tensor_slices(x_test)
.batch(BATCH_SIZE)
)
# %% [code]
def build_model(transformer, max_len=512):
    """
    Build and compile a binary classifier on top of the given transformer
    """
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    # use the hidden state of the first ([CLS]) token as the sentence representation
    cls_token = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(cls_token)

    model = Model(inputs=input_word_ids, outputs=out)
    # `learning_rate` replaces the deprecated `lr` argument
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

    return model
# %% [markdown]
# ## Starting Training
#
# If you want to use another model, just replace the model class and name in transformers._____ and adjust accordingly
# %% [code]
#%%time
with strategy.scope():
    transformer_layer = (
        transformers.TFDistilBertModel
        .from_pretrained('distilbert-base-multilingual-cased')
    )
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()
# %% [code]
n_steps = x_train.shape[0] // BATCH_SIZE
train_history = model.fit(
train_dataset,
steps_per_epoch=n_steps,
validation_data=valid_dataset,
epochs=EPOCHS
)
# %% [code]
n_steps = x_valid.shape[0] // BATCH_SIZE
train_history_2 = model.fit(
valid_dataset.repeat(),
steps_per_epoch=n_steps,
epochs=EPOCHS*2
)
# %% [code]
sub['toxic'] = model.predict(test_dataset, verbose=1)
sub.to_csv('submission.csv', index=False)
# %% [markdown]
# # End Notes
#
# This was my effort to share my learning so that everyone can benefit from it. This community has been very kind to me and helped me learn all of this, and I want to pay that forward. I have shared all the resources I used to learn this material. Join me and make these NLP competitions your first, without being overwhelmed by the sheer number of techniques used. It took me 10 days to learn all of this; you can learn it at your own pace. Don't give up: at the end of all this you will be a different person, and it will all be worth it.
#
#
# ### I am attaching more resources if you want NLP end to end:
#
# 1) Books
#
# * https://d2l.ai/
# * Jason Brownlee's Books
#
# 2) Courses
#
# * https://www.coursera.org/learn/nlp-sequence-models/home/welcome
# * Fast.ai NLP Course
#
# 3) Blogs and websites
#
# * Machine Learning Mastery
# * https://distill.pub/
# * http://jalammar.github.io/
#
# **<span style="color:Red">This is a small effort to contribute to the community. If it helped you in any way, please show a token of love by upvoting.</span>**

@@ -1,757 +0,0 @@
# %% [markdown]
# <div>
# <h1 align="center">MLB Player Digital Engagement Forecasting</h1>
# <h1 align="center">LightGBM + CatBoost + ANN 2505f2</h1>
# </div>
# %% [markdown]
# <div class="alert alert-success">
# </div>
# %% [markdown]
# <div class="alert alert-success">
# <h1 align="center">If you find this work useful, please don't forget to upvote :)</h1>
# </div>
# %% [markdown]
# #### Thanks to: @lhagiimn https://www.kaggle.com/lhagiimn/lightgbm-catboost-ann-2505f2
#
# #### https://www.kaggle.com/columbia2131/mlb-lightgbm-starter-dataset-code-en-ja
#
# #### https://www.kaggle.com/mlconsult/1-3816-lb-lbgm-descriptive-stats-param-tune
#
# #### https://www.kaggle.com/batprem/lightgbm-ann-weight-with-love
#
# #### https://www.kaggle.com/ulrich07/mlb-ann-with-lags-tf-keras
#
# %% [markdown]
# <div class="alert alert-success">
# </div>
# %% [markdown]
# ## About Dataset
# %% [markdown]
# Each column of train.csv is stored as a separate CSV file, produced as follows.
# %% [code] {"execution":{"iopub.status.busy":"2021-06-26T07:16:47.242749Z","iopub.execute_input":"2021-06-26T07:16:47.243324Z","iopub.status.idle":"2021-06-26T07:16:48.030215Z","shell.execute_reply.started":"2021-06-26T07:16:47.243266Z","shell.execute_reply":"2021-06-26T07:16:48.029Z"}}
import os
assert os.system(r'''cp ../input/fork-of-1-35-lightgbm-ann-2505f2-c4e96a/* .''') == 0
# %% [code] {"execution":{"iopub.status.busy":"2021-06-26T07:16:48.031858Z","iopub.execute_input":"2021-06-26T07:16:48.032396Z","iopub.status.idle":"2021-06-26T07:16:48.799514Z","shell.execute_reply.started":"2021-06-26T07:16:48.032357Z","shell.execute_reply":"2021-06-26T07:16:48.798628Z"}}
assert os.system(r'''ls''') == 0
# %% [code] {"jupyter":{"outputs_hidden":false},"execution":{"iopub.status.busy":"2021-06-26T07:16:48.801992Z","iopub.execute_input":"2021-06-26T07:16:48.802645Z","iopub.status.idle":"2021-06-26T07:16:48.813801Z","shell.execute_reply.started":"2021-06-26T07:16:48.802592Z","shell.execute_reply":"2021-06-26T07:16:48.812863Z"}}
#%%capture
"""
!pip install pandarallel
import gc
import numpy as np
import pandas as pd
from pathlib import Path
from pandarallel import pandarallel
pandarallel.initialize()
BASE_DIR = Path('../input/mlb-player-digital-engagement-forecasting')
train = pd.read_csv(BASE_DIR / 'train.csv')
null = np.nan
true = True
false = False
for col in train.columns:
    if col == 'date': continue
    _index = train[col].notnull()
    train.loc[_index, col] = train.loc[_index, col].parallel_apply(lambda x: eval(x))
    outputs = []
    for index, date, record in train.loc[_index, ['date', col]].itertuples():
        _df = pd.DataFrame(record)
        _df['index'] = index
        _df['date'] = date
        outputs.append(_df)
    outputs = pd.concat(outputs).reset_index(drop=True)
    outputs.to_csv(f'{col}_train.csv', index=False)
    outputs.to_pickle(f'{col}_train.pkl')
    del outputs
    del train[col]
    gc.collect()
"""
# %% [markdown] {"execution":{"iopub.status.busy":"2021-06-16T09:14:33.869464Z","iopub.execute_input":"2021-06-16T09:14:33.869905Z","iopub.status.idle":"2021-06-16T09:14:33.874766Z","shell.execute_reply.started":"2021-06-16T09:14:33.869879Z","shell.execute_reply":"2021-06-16T09:14:33.873097Z"}}
# ## Training
# %% [code] {"jupyter":{"outputs_hidden":false},"execution":{"iopub.status.busy":"2021-06-26T07:16:48.81564Z","iopub.execute_input":"2021-06-26T07:16:48.816326Z","iopub.status.idle":"2021-06-26T07:16:50.081995Z","shell.execute_reply.started":"2021-06-26T07:16:48.816246Z","shell.execute_reply":"2021-06-26T07:16:50.080828Z"}}
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import mean_absolute_error
from datetime import timedelta
from functools import reduce
from tqdm import tqdm
import lightgbm as lgbm
import mlb
import os
# %% [code] {"jupyter":{"outputs_hidden":false},"execution":{"iopub.status.busy":"2021-06-26T07:16:50.083534Z","iopub.execute_input":"2021-06-26T07:16:50.083899Z","iopub.status.idle":"2021-06-26T07:16:50.088159Z","shell.execute_reply.started":"2021-06-26T07:16:50.083863Z","shell.execute_reply":"2021-06-26T07:16:50.087357Z"}}
BASE_DIR = Path('../input/mlb-player-digital-engagement-forecasting')
TRAIN_DIR = Path('../input/mlb-pdef-train-dataset')
# %% [code] {"jupyter":{"outputs_hidden":false},"execution":{"iopub.status.busy":"2021-06-26T07:16:50.08951Z","iopub.execute_input":"2021-06-26T07:16:50.090053Z","iopub.status.idle":"2021-06-26T07:16:54.221868Z","shell.execute_reply.started":"2021-06-26T07:16:50.090018Z","shell.execute_reply":"2021-06-26T07:16:54.220656Z"}}
players = pd.read_csv(BASE_DIR / 'players.csv')
rosters = pd.read_pickle(TRAIN_DIR / 'rosters_train.pkl')
targets = pd.read_pickle(TRAIN_DIR / 'nextDayPlayerEngagement_train.pkl')
scores = pd.read_pickle(TRAIN_DIR / 'playerBoxScores_train.pkl')
scores = scores.groupby(['playerId', 'date']).sum().reset_index()
# %% [code] {"jupyter":{"outputs_hidden":false},"execution":{"iopub.status.busy":"2021-06-26T07:16:54.223547Z","iopub.execute_input":"2021-06-26T07:16:54.224Z","iopub.status.idle":"2021-06-26T07:16:54.243132Z","shell.execute_reply.started":"2021-06-26T07:16:54.22395Z","shell.execute_reply":"2021-06-26T07:16:54.242076Z"}}
targets_cols = ['playerId', 'target1', 'target2', 'target3', 'target4', 'date']
players_cols = ['playerId', 'primaryPositionName']
rosters_cols = ['playerId', 'teamId', 'status', 'date']
scores_cols = ['playerId', 'battingOrder', 'gamesPlayedBatting', 'flyOuts',
'groundOuts', 'runsScored', 'doubles', 'triples', 'homeRuns',
'strikeOuts', 'baseOnBalls', 'intentionalWalks', 'hits', 'hitByPitch',
'atBats', 'caughtStealing', 'stolenBases', 'groundIntoDoublePlay',
'groundIntoTriplePlay', 'plateAppearances', 'totalBases', 'rbi',
'leftOnBase', 'sacBunts', 'sacFlies', 'catchersInterference',
'pickoffs', 'gamesPlayedPitching', 'gamesStartedPitching',
'completeGamesPitching', 'shutoutsPitching', 'winsPitching',
'lossesPitching', 'flyOutsPitching', 'airOutsPitching',
'groundOutsPitching', 'runsPitching', 'doublesPitching',
'triplesPitching', 'homeRunsPitching', 'strikeOutsPitching',
'baseOnBallsPitching', 'intentionalWalksPitching', 'hitsPitching',
'hitByPitchPitching', 'atBatsPitching', 'caughtStealingPitching',
'stolenBasesPitching', 'inningsPitched', 'saveOpportunities',
'earnedRuns', 'battersFaced', 'outsPitching', 'pitchesThrown', 'balls',
'strikes', 'hitBatsmen', 'balks', 'wildPitches', 'pickoffsPitching',
'rbiPitching', 'gamesFinishedPitching', 'inheritedRunners',
'inheritedRunnersScored', 'catchersInterferencePitching',
'sacBuntsPitching', 'sacFliesPitching', 'saves', 'holds', 'blownSaves',
'assists', 'putOuts', 'errors', 'chances', 'date']
feature_cols = ['label_playerId', 'label_primaryPositionName', 'label_teamId',
'label_status', 'battingOrder', 'gamesPlayedBatting', 'flyOuts',
'groundOuts', 'runsScored', 'doubles', 'triples', 'homeRuns',
'strikeOuts', 'baseOnBalls', 'intentionalWalks', 'hits', 'hitByPitch',
'atBats', 'caughtStealing', 'stolenBases', 'groundIntoDoublePlay',
'groundIntoTriplePlay', 'plateAppearances', 'totalBases', 'rbi',
'leftOnBase', 'sacBunts', 'sacFlies', 'catchersInterference',
'pickoffs', 'gamesPlayedPitching', 'gamesStartedPitching',
'completeGamesPitching', 'shutoutsPitching', 'winsPitching',
'lossesPitching', 'flyOutsPitching', 'airOutsPitching',
'groundOutsPitching', 'runsPitching', 'doublesPitching',
'triplesPitching', 'homeRunsPitching', 'strikeOutsPitching',
'baseOnBallsPitching', 'intentionalWalksPitching', 'hitsPitching',
'hitByPitchPitching', 'atBatsPitching', 'caughtStealingPitching',
'stolenBasesPitching', 'inningsPitched', 'saveOpportunities',
'earnedRuns', 'battersFaced', 'outsPitching', 'pitchesThrown', 'balls',
'strikes', 'hitBatsmen', 'balks', 'wildPitches', 'pickoffsPitching',
'rbiPitching', 'gamesFinishedPitching', 'inheritedRunners',
'inheritedRunnersScored', 'catchersInterferencePitching',
'sacBuntsPitching', 'sacFliesPitching', 'saves', 'holds', 'blownSaves',
'assists', 'putOuts', 'errors', 'chances','target1_mean',
'target1_median',
'target1_std',
'target1_min',
'target1_max',
'target1_prob',
'target2_mean',
'target2_median',
'target2_std',
'target2_min',
'target2_max',
'target2_prob',
'target3_mean',
'target3_median',
'target3_std',
'target3_min',
'target3_max',
'target3_prob',
'target4_mean',
'target4_median',
'target4_std',
'target4_min',
'target4_max',
'target4_prob']
feature_cols2 = ['label_playerId', 'label_primaryPositionName', 'label_teamId',
'label_status', 'battingOrder', 'gamesPlayedBatting', 'flyOuts',
'groundOuts', 'runsScored', 'doubles', 'triples', 'homeRuns',
'strikeOuts', 'baseOnBalls', 'intentionalWalks', 'hits', 'hitByPitch',
'atBats', 'caughtStealing', 'stolenBases', 'groundIntoDoublePlay',
'groundIntoTriplePlay', 'plateAppearances', 'totalBases', 'rbi',
'leftOnBase', 'sacBunts', 'sacFlies', 'catchersInterference',
'pickoffs', 'gamesPlayedPitching', 'gamesStartedPitching',
'completeGamesPitching', 'shutoutsPitching', 'winsPitching',
'lossesPitching', 'flyOutsPitching', 'airOutsPitching',
'groundOutsPitching', 'runsPitching', 'doublesPitching',
'triplesPitching', 'homeRunsPitching', 'strikeOutsPitching',
'baseOnBallsPitching', 'intentionalWalksPitching', 'hitsPitching',
'hitByPitchPitching', 'atBatsPitching', 'caughtStealingPitching',
'stolenBasesPitching', 'inningsPitched', 'saveOpportunities',
'earnedRuns', 'battersFaced', 'outsPitching', 'pitchesThrown', 'balls',
'strikes', 'hitBatsmen', 'balks', 'wildPitches', 'pickoffsPitching',
'rbiPitching', 'gamesFinishedPitching', 'inheritedRunners',
'inheritedRunnersScored', 'catchersInterferencePitching',
'sacBuntsPitching', 'sacFliesPitching', 'saves', 'holds', 'blownSaves',
'assists', 'putOuts', 'errors', 'chances','target1_mean',
'target1_median',
'target1_std',
'target1_min',
'target1_max',
'target1_prob',
'target2_mean',
'target2_median',
'target2_std',
'target2_min',
'target2_max',
'target2_prob',
'target3_mean',
'target3_median',
'target3_std',
'target3_min',
'target3_max',
'target3_prob',
'target4_mean',
'target4_median',
'target4_std',
'target4_min',
'target4_max',
'target4_prob',
'target1']
# %% [code] {"jupyter":{"outputs_hidden":false},"execution":{"iopub.status.busy":"2021-06-26T07:16:54.244866Z","iopub.execute_input":"2021-06-26T07:16:54.24532Z","iopub.status.idle":"2021-06-26T07:16:54.296844Z","shell.execute_reply.started":"2021-06-26T07:16:54.245257Z","shell.execute_reply":"2021-06-26T07:16:54.295689Z"}}
player_target_stats = pd.read_csv("../input/player-target-stats/player_target_stats.csv")
data_names=player_target_stats.columns.values.tolist()
data_names
# %% [code] {"jupyter":{"outputs_hidden":false},"execution":{"iopub.status.busy":"2021-06-26T07:16:54.300157Z","iopub.execute_input":"2021-06-26T07:16:54.300622Z","iopub.status.idle":"2021-06-26T07:17:02.252208Z","shell.execute_reply.started":"2021-06-26T07:16:54.300578Z","shell.execute_reply":"2021-06-26T07:17:02.250423Z"}}
# create dataset
train = targets[targets_cols].merge(players[players_cols], on=['playerId'], how='left')
train = train.merge(rosters[rosters_cols], on=['playerId', 'date'], how='left')
train = train.merge(scores[scores_cols], on=['playerId', 'date'], how='left')
train = train.merge(player_target_stats, how='inner', left_on=["playerId"],right_on=["playerId"])
# label encoding
player2num = {c: i for i, c in enumerate(train['playerId'].unique())}
position2num = {c: i for i, c in enumerate(train['primaryPositionName'].unique())}
teamid2num = {c: i for i, c in enumerate(train['teamId'].unique())}
status2num = {c: i for i, c in enumerate(train['status'].unique())}
train['label_playerId'] = train['playerId'].map(player2num)
train['label_primaryPositionName'] = train['primaryPositionName'].map(position2num)
train['label_teamId'] = train['teamId'].map(teamid2num)
train['label_status'] = train['status'].map(status2num)
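# %% [markdown]
# The label maps above are built from the training values only, so a value that first appears at inference time has no code, and `Series.map` returns NaN for it. A minimal sketch of that behaviour with plain dicts follows. The helper names and the status strings are hypothetical, and `None` stands in for the NaN that pandas would produce.

```python
def build_label_map(values):
    """Assign each distinct value an integer code in order of first appearance."""
    label_map = {}
    for value in values:
        if value not in label_map:
            label_map[value] = len(label_map)
    return label_map


def encode(values, label_map):
    # Unseen values map to None here; pandas' Series.map would give NaN instead.
    return [label_map.get(value) for value in values]


train_status = ["A", "D10", "A", "RM"]  # hypothetical roster statuses
status_map = build_label_map(train_status)
encoded_test = encode(["RM", "A", "NEW"], status_map)
```

# Tree models such as LightGBM can take the resulting missing value as input, but it carries no information about the unseen category.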
# %% [code] {"jupyter":{"outputs_hidden":false},"execution":{"iopub.status.busy":"2021-06-26T07:17:02.253453Z","iopub.status.idle":"2021-06-26T07:17:02.254076Z"}}
train_X = train[feature_cols]
train_y = train[['target1', 'target2', 'target3', 'target4']]
_index = (train['date'] < 20210401)
x_train1 = train_X.loc[_index].reset_index(drop=True)
y_train1 = train_y.loc[_index].reset_index(drop=True)
x_valid1 = train_X.loc[~_index].reset_index(drop=True)
y_valid1 = train_y.loc[~_index].reset_index(drop=True)
# %% [code] {"execution":{"iopub.status.busy":"2021-06-26T07:17:02.255068Z","iopub.status.idle":"2021-06-26T07:17:02.255685Z"}}
train_X = train[feature_cols2]
train_y = train[['target1', 'target2', 'target3', 'target4']]
_index = (train['date'] < 20210401)
x_train2 = train_X.loc[_index].reset_index(drop=True)
y_train2 = train_y.loc[_index].reset_index(drop=True)
x_valid2 = train_X.loc[~_index].reset_index(drop=True)
y_valid2 = train_y.loc[~_index].reset_index(drop=True)
# %% [code] {"execution":{"iopub.status.busy":"2021-06-26T07:17:02.256629Z","iopub.status.idle":"2021-06-26T07:17:02.257215Z"}}
train_X
# %% [code] {"jupyter":{"outputs_hidden":false},"execution":{"iopub.status.busy":"2021-06-26T07:17:02.258224Z","iopub.status.idle":"2021-06-26T07:17:02.258854Z"}}
def fit_lgbm(x_train, y_train, x_valid, y_valid, params: dict=None, verbose=100):
    oof_pred = np.zeros(len(y_valid), dtype=np.float32)
    model = lgbm.LGBMRegressor(**params)
    model.fit(x_train, y_train,
        eval_set=[(x_valid, y_valid)],
        early_stopping_rounds=verbose,
        verbose=verbose)
    oof_pred = model.predict(x_valid)
    score = mean_absolute_error(oof_pred, y_valid)
    print('mae:', score)
    return oof_pred, model, score
# training lightgbm
params1 = {'objective':'mae',
'reg_alpha': 0.14947461820098767,
'reg_lambda': 0.10185644384043743,
'n_estimators': 3633,
'learning_rate': 0.08046301304430488,
'num_leaves': 674,
'feature_fraction': 0.9101240539122566,
'bagging_fraction': 0.9884451442950513,
'bagging_freq': 8,
'min_child_samples': 51}
params2 = {
'objective':'mae',
'reg_alpha': 0.1,
'reg_lambda': 0.1,
'n_estimators': 80,
'learning_rate': 0.1,
'random_state': 42,
"num_leaves": 22
}
params4 = {'objective':'mae',
'reg_alpha': 0.016468100279441976,
'reg_lambda': 0.09128335764019105,
'n_estimators': 9868,
'learning_rate': 0.10528150510326864,
'num_leaves': 157,
'feature_fraction': 0.5419185713426886,
'bagging_fraction': 0.2637405128936662,
'bagging_freq': 19,
'min_child_samples': 71}
params = {
'objective':'mae',
'reg_alpha': 0.1,
'reg_lambda': 0.1,
'n_estimators': 10000,
'learning_rate': 0.1,
'random_state': 42,
"num_leaves": 100
}
# Slow from this point !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
oof1, model1, score1 = fit_lgbm(
x_train1, y_train1['target1'],
x_valid1, y_valid1['target1'],
params1
)
oof2, model2, score2 = fit_lgbm(
x_train2, y_train2['target2'],
x_valid2, y_valid2['target2'],
params2
)
oof3, model3, score3 = fit_lgbm(
x_train2, y_train2['target3'],
x_valid2, y_valid2['target3'],
params
)
oof4, model4, score4 = fit_lgbm(
x_train2, y_train2['target4'],
x_valid2, y_valid2['target4'],
params4
)
score = (score1+score2+score3+score4) / 4
print(f'score: {score}')
# %% [code]
import pickle
from catboost import CatBoostRegressor
def fit_lgbm(x_train, y_train, x_valid, y_valid, target, params: dict=None, verbose=100):
    oof_pred_lgb = np.zeros(len(y_valid), dtype=np.float32)
    oof_pred_cat = np.zeros(len(y_valid), dtype=np.float32)
    # Reuse a pretrained LightGBM model if one is available, otherwise train and save it
    if os.path.isfile(f'../input/mlb-lgbm-and-catboost-models/model_lgb_{target}.pkl'):
        with open(f'../input/mlb-lgbm-and-catboost-models/model_lgb_{target}.pkl', 'rb') as fin:
            model = pickle.load(fin)
    else:
        model = lgbm.LGBMRegressor(**params)
        model.fit(x_train, y_train,
            eval_set=[(x_valid, y_valid)],
            early_stopping_rounds=verbose,
            verbose=verbose)
        with open(f'model_lgb_{target}.pkl', 'wb') as handle:
            pickle.dump(model, handle, protocol=pickle.HIGHEST_PROTOCOL)
    oof_pred_lgb = model.predict(x_valid)
    score_lgb = mean_absolute_error(oof_pred_lgb, y_valid)
    print('mae:', score_lgb)
    # Same load-or-train logic for CatBoost
    if os.path.isfile(f'../input/mlb-lgbm-and-catboost-models/model_cb_{target}.pkl'):
        with open(f'../input/mlb-lgbm-and-catboost-models/model_cb_{target}.pkl', 'rb') as fin:
            model_cb = pickle.load(fin)
    else:
        model_cb = CatBoostRegressor(
            n_estimators=2000,
            learning_rate=0.05,
            loss_function='MAE',
            eval_metric='MAE',
            max_bin=50,
            subsample=0.9,
            colsample_bylevel=0.5,
            verbose=100)
        model_cb.fit(x_train, y_train, use_best_model=True,
            eval_set=(x_valid, y_valid),
            early_stopping_rounds=25)
        with open(f'model_cb_{target}.pkl', 'wb') as handle:
            pickle.dump(model_cb, handle, protocol=pickle.HIGHEST_PROTOCOL)
    oof_pred_cat = model_cb.predict(x_valid)
    score_cat = mean_absolute_error(oof_pred_cat, y_valid)
    print('mae:', score_cat)
    return oof_pred_lgb, model, oof_pred_cat, model_cb, score_lgb, score_cat
# training lightgbm
params = {
'boosting_type': 'gbdt',
'objective':'mae',
'subsample': 0.5,
'subsample_freq': 1,
'learning_rate': 0.03,
'num_leaves': 2**11-1,
'min_data_in_leaf': 2**12-1,
'feature_fraction': 0.5,
'max_bin': 100,
'n_estimators': 2500,
'boost_from_average': False,
"random_seed":42,
}
oof_pred_lgb2, model_lgb2, oof_pred_cat2, model_cb2, score_lgb2, score_cat2 = fit_lgbm(
x_train1, y_train1['target2'],
x_valid1, y_valid1['target2'],
2, params
)
oof_pred_lgb1, model_lgb1, oof_pred_cat1, model_cb1, score_lgb1, score_cat1 = fit_lgbm(
x_train1, y_train1['target1'],
x_valid1, y_valid1['target1'],
1, params
)
oof_pred_lgb3, model_lgb3, oof_pred_cat3, model_cb3, score_lgb3, score_cat3 = fit_lgbm(
x_train1, y_train1['target3'],
x_valid1, y_valid1['target3'],
3, params
)
oof_pred_lgb4, model_lgb4, oof_pred_cat4, model_cb4, score_lgb4, score_cat4= fit_lgbm(
x_train1, y_train1['target4'],
x_valid1, y_valid1['target4'],
4, params
)
score = (score_lgb1+score_lgb2+score_lgb3+score_lgb4) / 4
print(f'LightGBM score: {score}')
score = (score_cat1+score_cat2+score_cat3+score_cat4) / 4
print(f'Catboost score: {score}')
# %% [markdown]
# ## Inference
# %% [code] {"jupyter":{"outputs_hidden":false},"execution":{"iopub.status.busy":"2021-06-26T07:17:02.259872Z","iopub.status.idle":"2021-06-26T07:17:02.260506Z"}}
players_cols = ['playerId', 'primaryPositionName']
rosters_cols = ['playerId', 'teamId', 'status']
scores_cols = ['playerId', 'battingOrder', 'gamesPlayedBatting', 'flyOuts',
'groundOuts', 'runsScored', 'doubles', 'triples', 'homeRuns',
'strikeOuts', 'baseOnBalls', 'intentionalWalks', 'hits', 'hitByPitch',
'atBats', 'caughtStealing', 'stolenBases', 'groundIntoDoublePlay',
'groundIntoTriplePlay', 'plateAppearances', 'totalBases', 'rbi',
'leftOnBase', 'sacBunts', 'sacFlies', 'catchersInterference',
'pickoffs', 'gamesPlayedPitching', 'gamesStartedPitching',
'completeGamesPitching', 'shutoutsPitching', 'winsPitching',
'lossesPitching', 'flyOutsPitching', 'airOutsPitching',
'groundOutsPitching', 'runsPitching', 'doublesPitching',
'triplesPitching', 'homeRunsPitching', 'strikeOutsPitching',
'baseOnBallsPitching', 'intentionalWalksPitching', 'hitsPitching',
'hitByPitchPitching', 'atBatsPitching', 'caughtStealingPitching',
'stolenBasesPitching', 'inningsPitched', 'saveOpportunities',
'earnedRuns', 'battersFaced', 'outsPitching', 'pitchesThrown', 'balls',
'strikes', 'hitBatsmen', 'balks', 'wildPitches', 'pickoffsPitching',
'rbiPitching', 'gamesFinishedPitching', 'inheritedRunners',
'inheritedRunnersScored', 'catchersInterferencePitching',
'sacBuntsPitching', 'sacFliesPitching', 'saves', 'holds', 'blownSaves',
'assists', 'putOuts', 'errors', 'chances']
null = np.nan
true = True
false = False
# %% [code] {"execution":{"iopub.status.busy":"2021-06-26T07:17:02.26162Z","iopub.status.idle":"2021-06-26T07:17:02.262287Z"}}
import pandas as pd
import numpy as np
from datetime import timedelta
from tqdm import tqdm
import gc
from functools import reduce
from sklearn.model_selection import StratifiedKFold
ROOT_DIR = "../input/mlb-player-digital-engagement-forecasting"
#=======================#
def flatten(df, col):
    du = (df.pivot(index="playerId", columns="EvalDate",
               values=col).add_prefix(f"{col}_").
      rename_axis(None, axis=1).reset_index())
    return du
#============================#
def reducer(left, right):
    return left.merge(right, on="playerId")
#========================
TGTCOLS = ["target1","target2","target3","target4"]
def train_lag(df, lag=1):
    dp = df[["playerId","EvalDate"]+TGTCOLS].copy()
    dp["EvalDate"] =dp["EvalDate"] + timedelta(days=lag)
    df = df.merge(dp, on=["playerId", "EvalDate"], suffixes=["",f"_{lag}"], how="left")
    return df
#=================================
def test_lag(sub):
    sub["playerId"] = sub["date_playerId"].apply(lambda s: int( s.split("_")[1] ) )
    assert sub.date.nunique() == 1
    dte = sub["date"].unique()[0]
    eval_dt = pd.to_datetime(dte, format="%Y%m%d")
    dtes = [eval_dt + timedelta(days=-k) for k in LAGS]
    mp_dtes = {eval_dt + timedelta(days=-k):k for k in LAGS}
    sl = LAST.loc[LAST.EvalDate.between(dtes[-1], dtes[0]), ["EvalDate","playerId"]+TGTCOLS].copy()
    sl["EvalDate"] = sl["EvalDate"].map(mp_dtes)
    du = [flatten(sl, col) for col in TGTCOLS]
    du = reduce(reducer, du)
    return du, eval_dt
#
#===============
tr = pd.read_csv("../input/mlb-data/target.csv")
print(tr.shape)
gc.collect()
tr["EvalDate"] = pd.to_datetime(tr["EvalDate"])
tr["EvalDate"] = tr["EvalDate"] + timedelta(days=-1)
tr["EvalYear"] = tr["EvalDate"].dt.year
MED_DF = tr.groupby(["playerId","EvalYear"])[TGTCOLS].median().reset_index()
MEDCOLS = ["tgt1_med","tgt2_med", "tgt3_med", "tgt4_med"]
MED_DF.columns = ["playerId","EvalYear"] + MEDCOLS
LAGS = list(range(1,21))
FECOLS = [f"{col}_{lag}" for lag in reversed(LAGS) for col in TGTCOLS]
for lag in tqdm(LAGS):
    tr = train_lag(tr, lag=lag)
    gc.collect()
#===========
tr = tr.sort_values(by=["playerId", "EvalDate"])
print(tr.shape)
tr = tr.dropna()
print(tr.shape)
tr = tr.merge(MED_DF, on=["playerId","EvalYear"])
gc.collect()
X = tr[FECOLS+MEDCOLS].values
y = tr[TGTCOLS].values
cl = tr["playerId"].values
NFOLDS = 6
skf = StratifiedKFold(n_splits=NFOLDS)
folds = skf.split(X, cl)
folds = list(folds)
import tensorflow as tf
import tensorflow.keras.layers as L
import tensorflow.keras.models as M
from sklearn.metrics import mean_absolute_error, mean_squared_error
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
tf.random.set_seed(777)
def make_model(n_in):
    inp = L.Input(name="inputs", shape=(n_in,))
    x = L.Dense(50, activation="relu", name="d1")(inp)
    x = L.Dense(50, activation="relu", name="d2")(x)
    preds = L.Dense(4, activation="linear", name="preds")(x)
    model = M.Model(inp, preds, name="ANN")
    model.compile(loss="mean_absolute_error", optimizer="adam")
    return model
net = make_model(X.shape[1])
print(net.summary())
oof = np.zeros(y.shape)
nets = []
for idx in range(NFOLDS):
    print("FOLD:", idx)
    tr_idx, val_idx = folds[idx]
    ckpt = ModelCheckpoint(f"w{idx}.h5", monitor='val_loss', verbose=1, save_best_only=True,mode='min')
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,patience=3, min_lr=0.0005)
    es = EarlyStopping(monitor='val_loss', patience=6)
    reg = make_model(X.shape[1])
    # reg.fit(X[tr_idx], y[tr_idx], epochs=10, batch_size=35_000, validation_data=(X[val_idx], y[val_idx]),
    #         verbose=1, callbacks=[ckpt, reduce_lr, es])
    reg.load_weights(f"w{idx}.h5")
    oof[val_idx] = reg.predict(X[val_idx], batch_size=50_000, verbose=1)
    nets.append(reg)
    gc.collect()
    #
#
mae = mean_absolute_error(y, oof)
# squared=False makes mean_squared_error return the root mean squared error
rmse = mean_squared_error(y, oof, squared=False)
print("mae:", mae)
print("rmse:", rmse)
# Historical information to use in prediction time
bound_dt = pd.to_datetime("2021-01-01")
LAST = tr.loc[tr.EvalDate>bound_dt].copy()
LAST_MED_DF = MED_DF.loc[MED_DF.EvalYear==2021].copy()
LAST_MED_DF.drop("EvalYear", axis=1, inplace=True)
del tr
#"""
import mlb
FE = []; SUB = [];
# %% [markdown]
# <div class="alert alert-success">
# </div>
# %% [code] {"jupyter":{"outputs_hidden":false},"execution":{"iopub.status.busy":"2021-06-26T07:17:02.263332Z","iopub.status.idle":"2021-06-26T07:17:02.263974Z"}}
import copy
env = mlb.make_env() # initialize the environment
iter_test = env.iter_test() # iterator which loops over each date in test set
for (test_df, sample_prediction_df) in iter_test: # make predictions here
    sub = copy.deepcopy(sample_prediction_df.reset_index())
    sample_prediction_df = copy.deepcopy(sample_prediction_df.reset_index(drop=True))
    # LGBM submission
    # create dataset
    sample_prediction_df['playerId'] = sample_prediction_df['date_playerId']\
                                        .map(lambda x: int(x.split('_')[1]))
    # Dealing with missing values
    # NaN != NaN, so each self-comparison below is False when the field is missing
    if test_df['rosters'].iloc[0] == test_df['rosters'].iloc[0]:
        test_rosters = pd.DataFrame(eval(test_df['rosters'].iloc[0]))
    else:
        test_rosters = pd.DataFrame({'playerId': sample_prediction_df['playerId']})
        for col in rosters.columns:
            if col == 'playerId': continue
            test_rosters[col] = np.nan
    if test_df['playerBoxScores'].iloc[0] == test_df['playerBoxScores'].iloc[0]:
        test_scores = pd.DataFrame(eval(test_df['playerBoxScores'].iloc[0]))
    else:
        test_scores = pd.DataFrame({'playerId': sample_prediction_df['playerId']})
        for col in scores.columns:
            if col == 'playerId': continue
            test_scores[col] = np.nan
    test_scores = test_scores.groupby('playerId').sum().reset_index()
    test = sample_prediction_df[['playerId']].copy()
    test = test.merge(players[players_cols], on='playerId', how='left')
    test = test.merge(test_rosters[rosters_cols], on='playerId', how='left')
    test = test.merge(test_scores[scores_cols], on='playerId', how='left')
    test = test.merge(player_target_stats, how='inner', left_on=["playerId"],right_on=["playerId"])
    test['label_playerId'] = test['playerId'].map(player2num)
    test['label_primaryPositionName'] = test['primaryPositionName'].map(position2num)
    test['label_teamId'] = test['teamId'].map(teamid2num)
    test['label_status'] = test['status'].map(status2num)
    test_X = test[feature_cols]
    # predict
    pred1 = model1.predict(test_X)
    # predict
    pred_lgd1 = model_lgb1.predict(test_X)
    pred_lgd2 = model_lgb2.predict(test_X)
    pred_lgd3 = model_lgb3.predict(test_X)
    pred_lgd4 = model_lgb4.predict(test_X)
    pred_cat1 = model_cb1.predict(test_X)
    pred_cat2 = model_cb2.predict(test_X)
    pred_cat3 = model_cb3.predict(test_X)
    pred_cat4 = model_cb4.predict(test_X)
    # feature_cols2 adds the predicted target1 as an input for targets 2-4
    test['target1'] = np.clip(pred1,0,100)
    test_X = test[feature_cols2]
    pred2 = model2.predict(test_X)
    pred3 = model3.predict(test_X)
    pred4 = model4.predict(test_X)
    # merge submission
    sample_prediction_df['target1'] = 0.65*np.clip(pred1, 0, 100)+0.25*np.clip(pred_lgd1, 0, 100)+0.10*np.clip(pred_cat1, 0, 100)
    sample_prediction_df['target2'] = 0.65*np.clip(pred2, 0, 100)+0.25*np.clip(pred_lgd2, 0, 100)+0.10*np.clip(pred_cat2, 0, 100)
    sample_prediction_df['target3'] = 0.65*np.clip(pred3, 0, 100)+0.25*np.clip(pred_lgd3, 0, 100)+0.10*np.clip(pred_cat3, 0, 100)
    sample_prediction_df['target4'] = 0.65*np.clip(pred4, 0, 100)+0.25*np.clip(pred_lgd4, 0, 100)+0.10*np.clip(pred_cat4, 0, 100)
    sample_prediction_df = sample_prediction_df.fillna(0.)
    del sample_prediction_df['playerId']
    # TF submission
    # Features computation at Evaluation Date
    sub_fe, eval_dt = test_lag(sub)
    sub_fe = sub_fe.merge(LAST_MED_DF, on="playerId", how="left")
    sub_fe = sub_fe.fillna(0.)
    _preds = 0.
    for reg in nets:
        _preds += reg.predict(sub_fe[FECOLS + MEDCOLS]) / NFOLDS
    sub_fe[TGTCOLS] = np.clip(_preds, 0, 100)
    sub.drop(["date"]+TGTCOLS, axis=1, inplace=True)
    sub = sub.merge(sub_fe[["playerId"]+TGTCOLS], on="playerId", how="left")
    sub.drop("playerId", axis=1, inplace=True)
    sub = sub.fillna(0.)
    # Blending
    blend = pd.concat(
        [sub[['date_playerId']],
        (0.35*sub.drop('date_playerId', axis=1) + 0.65*sample_prediction_df.drop('date_playerId', axis=1))],
        axis=1
    )
    env.predict(blend)
    # Update Available information
    sub_fe["EvalDate"] = eval_dt
    #sub_fe.drop(MEDCOLS, axis=1, inplace=True)
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    LAST = pd.concat([LAST, sub_fe])
    LAST = LAST.drop_duplicates(subset=["EvalDate","playerId"], keep="last")
# %% [code] {"jupyter":{"outputs_hidden":false},"execution":{"iopub.status.busy":"2021-06-26T07:17:02.264951Z","iopub.status.idle":"2021-06-26T07:17:02.265581Z"}}
pd.concat(
[sub[['date_playerId']],
(sub.drop('date_playerId', axis=1) + sample_prediction_df.drop('date_playerId', axis=1)) / 2],
axis=1
)
# %% [code] {"jupyter":{"outputs_hidden":false},"execution":{"iopub.status.busy":"2021-06-26T07:17:02.26657Z","iopub.status.idle":"2021-06-26T07:17:02.267169Z"}}
sample_prediction_df
# %% [markdown]
# <div class="alert alert-success">
# </div>
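# %% [markdown]
# The inference loop blends in two stages: the three tree models are mixed with weights 0.65 / 0.25 / 0.10, and that mix is then combined with the ANN at 0.65 / 0.35. Below is a minimal sketch of the arithmetic for a single prediction, assuming scalar inputs. The names `blend_prediction`, `lgbm_a`, `lgbm_b` and `effective_weights` are hypothetical and do not appear in the notebook.

```python
def clip(value, low=0.0, high=100.0):
    """Scalar equivalent of np.clip(value, 0, 100)."""
    return max(low, min(high, value))


def blend_prediction(lgbm_a, lgbm_b, catboost, ann):
    # Stage 1: weighted mix of the tree models, each clipped to [0, 100]
    trees = 0.65 * clip(lgbm_a) + 0.25 * clip(lgbm_b) + 0.10 * clip(catboost)
    # Stage 2: blend the tree mix with the (clipped) ANN prediction
    return 0.35 * clip(ann) + 0.65 * trees


# Weight each model ends up with in the final prediction
effective_weights = {
    "lgbm_a": 0.65 * 0.65,
    "lgbm_b": 0.65 * 0.25,
    "catboost": 0.65 * 0.10,
    "ann": 0.35,
}
```

# Written out this way, the first LightGBM model carries about 42% of the final prediction and the ANN 35%, while CatBoost contributes under 7%.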

File diff suppressed because it is too large

@ -1,168 +0,0 @@
#!/usr/bin/env python
# coding: utf-8
# # Overview
# The kernel shows how to use the [tf_pose_estimation](https://github.com/ildoonet/tf-pose-estimation) package in Python on a series of running videos.
# ## Libraries we need
# Install tf_pose and pycocotools
# In[1]:
import os
def get_ipython():
    return os
get_ipython().system('pip install -qq https://www.github.com/ildoonet/tf-pose-estimation')
# In[2]:
get_ipython().system('pip install -qq pycocotools')
# In[3]:
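# The get_ipython() shim above returns the os module, which has no run_line_magic,
# so the notebook magics are left commented out.
#get_ipython().run_line_magic('load_ext', 'autoreload')
#get_ipython().run_line_magic('autoreload', '2')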
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["figure.dpi"] = 125
plt.rcParams["font.size"] = 14
plt.rcParams['font.family'] = ['sans-serif']
plt.rcParams['font.sans-serif'] = ['DejaVu Sans']
plt.style.use('ggplot')
sns.set_style("whitegrid", {'axes.grid': False})
# In[4]:
#get_ipython().run_line_magic('matplotlib', 'inline')
import tf_pose
import cv2
from glob import glob
from tqdm import tqdm_notebook
from PIL import Image
import numpy as np
import os
def video_gen(in_path):
    c_cap = cv2.VideoCapture(in_path)
    while c_cap.isOpened():
        ret, frame = c_cap.read()
        if not ret:
            break
        # yield the timestamp in milliseconds and the frame converted from BGR to RGB
        yield c_cap.get(cv2.CAP_PROP_POS_MSEC), frame[:, :, ::-1]
    c_cap.release()
# In[5]:
video_paths = glob('../input/*.mp4')
c_video = video_gen(video_paths[0])
for _ in range(300):
    c_ts, c_frame = next(c_video)
plt.imshow(c_frame)
# In[6]:
from tf_pose.estimator import TfPoseEstimator
from tf_pose.networks import get_graph_path, model_wh
tfpe = tf_pose.get_estimator()
# In[7]:
humans = tfpe.inference(npimg=c_frame, upsample_size=4.0)
print(humans)
# In[8]:
new_image = TfPoseEstimator.draw_humans(c_frame[:, :, ::-1], humans, imgcopy=False)
fig, ax1 = plt.subplots(1, 1, figsize=(10, 10))
ax1.imshow(new_image[:, :, ::-1])
# In[9]:
body_to_dict = lambda c_fig: {'bp_{}_{}'.format(k, vec_name): vec_val
for k, part_vec in c_fig.body_parts.items()
for vec_name, vec_val in zip(['x', 'y', 'score'],
(part_vec.x, 1-part_vec.y, part_vec.score))}
c_fig = humans[0]
body_to_dict(c_fig)
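# The `body_to_dict` lambda above flattens one detected figure into a single row: each body part contributes `bp_<id>_x`, `bp_<id>_y` and `bp_<id>_score`, with y flipped to `1 - y`. A minimal sketch of the same mapping written as a function follows. The `BodyPart` and `Human` namedtuples are hypothetical stand-ins for the tf_pose objects, modelling only the attributes the lambda reads.

```python
from collections import namedtuple

# Hypothetical stand-ins for the tf_pose result objects
BodyPart = namedtuple("BodyPart", ["x", "y", "score"])
Human = namedtuple("Human", ["body_parts"])


def body_to_row(human):
    """Flatten a figure's body parts into one flat dict, flipping the y axis."""
    row = {}
    for part_id, part in human.body_parts.items():
        row[f"bp_{part_id}_x"] = part.x
        row[f"bp_{part_id}_y"] = 1 - part.y  # flip so larger values are higher in the image
        row[f"bp_{part_id}_score"] = part.score
    return row


demo = Human(body_parts={0: BodyPart(x=0.5, y=0.25, score=0.9)})
demo_row = body_to_row(demo)
```

# One flat dict per figure is what lets the later cell build `body_pose_df` directly with `pd.DataFrame(body_pose_list)`.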
# In[10]:
MAX_FRAMES = 200
body_pose_list = []
for vid_path in tqdm_notebook(video_paths, desc='Files'):
    c_video = video_gen(vid_path)
    c_ts, c_frame = next(c_video)
    out_path = '{}_out.avi'.format(os.path.split(vid_path)[1])
    out = cv2.VideoWriter(out_path,
                          cv2.VideoWriter_fourcc('M','J','P','G'),
                          10,
                          (c_frame.shape[1], c_frame.shape[0]))
    for (c_ts, c_frame), _ in zip(c_video,
                                  tqdm_notebook(range(MAX_FRAMES), desc='Frames')):
        bgr_frame = c_frame[:,:,::-1]
        humans = tfpe.inference(npimg=bgr_frame, upsample_size=4.0)
        for c_body in humans:
            body_pose_list += [dict(video=out_path, time=c_ts, **body_to_dict(c_body))]
        new_image = TfPoseEstimator.draw_humans(bgr_frame, humans, imgcopy=False)
        out.write(new_image)
    out.release()
# In[11]:
import pandas as pd
body_pose_df = pd.DataFrame(body_pose_list)
body_pose_df.describe()
# In[12]:
fig, m_axs = plt.subplots(1, 2, figsize=(15, 5))
for c_ax, (c_name, c_rows) in zip(m_axs, body_pose_df.groupby('video')):
    for i in range(17):
        # the y coordinate is plotted, so the legend label says y
        c_ax.plot(c_rows['time'], c_rows['bp_{}_y'.format(i)], label='y {}'.format(i))
    c_ax.legend()
    c_ax.set_title(c_name)
# In[13]:
fig, m_axs = plt.subplots(1, 2, figsize=(15, 5))
for c_ax, (c_name, n_rows) in zip(m_axs, body_pose_df.groupby('video')):
    for i in range(17):
        c_rows = n_rows.query('bp_{}_score>0.6'.format(i)) # only keep confident results
        c_ax.plot(c_rows['bp_{}_x'.format(i)], c_rows['bp_{}_y'.format(i)], label='BP {}'.format(i))
    c_ax.legend()
    c_ax.set_title(c_name)
# In[14]:
body_pose_df.to_csv('body_pose.csv', index=False)
# In[15]:

@ -1,576 +0,0 @@
#!/usr/bin/env python
# coding: utf-8
#
#
# NOTE: Turn on Internet and GPU
# The code hidden below handles all the imports and function definitions (the heavy lifting). If you're a beginner, I'd advise you to skip this for now. Once you understand the rest of the code, come back here and work through each function to deepen your understanding.
# In[1]:
# !/usr/bin/env python3
# coding=utf-8
# author=dave.fang@outlook.com
# create=20171225
import os
import pprint
import cv2
import sys
import math
import time
import tempfile
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.optim as optim
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torch.autograd import Variable
from scipy.ndimage import gaussian_filter  # scipy.ndimage.filters is a deprecated namespace
#get_ipython().run_line_magic('matplotlib', 'inline')
#get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")
# find connection in the specified sequence, center 29 is in the position 15
limb_seq = [[2, 3], [2, 6], [3, 4], [4, 5], [6, 7], [7, 8], [2, 9], [9, 10],
[10, 11], [2, 12], [12, 13], [13, 14], [2, 1], [1, 15], [15, 17],
[1, 16], [16, 18], [3, 17], [6, 18]]
# the middle joints heatmap correspondence
map_ids = [[31, 32], [39, 40], [33, 34], [35, 36], [41, 42], [43, 44], [19, 20], [21, 22],
[23, 24], [25, 26], [27, 28], [29, 30], [47, 48], [49, 50], [53, 54], [51, 52],
[55, 56], [37, 38], [45, 46]]
# these are the colours for the 18 body points
colors = [[255, 0, 0], [255, 85, 0], [255, 170, 0], [255, 255, 0], [170, 255, 0], [85, 255, 0], [0, 255, 0],
[0, 255, 85], [0, 255, 170], [0, 255, 255], [0, 170, 255], [0, 85, 255], [0, 0, 255], [85, 0, 255],
[170, 0, 255], [255, 0, 255], [255, 0, 170], [255, 0, 85]]
class PoseEstimation(nn.Module):
def __init__(self, model_dict):
super(PoseEstimation, self).__init__()
self.model0 = model_dict['block_0']
self.model1_1 = model_dict['block1_1']
self.model2_1 = model_dict['block2_1']
self.model3_1 = model_dict['block3_1']
self.model4_1 = model_dict['block4_1']
self.model5_1 = model_dict['block5_1']
self.model6_1 = model_dict['block6_1']
self.model1_2 = model_dict['block1_2']
self.model2_2 = model_dict['block2_2']
self.model3_2 = model_dict['block3_2']
self.model4_2 = model_dict['block4_2']
self.model5_2 = model_dict['block5_2']
self.model6_2 = model_dict['block6_2']
def forward(self, x):
out1 = self.model0(x)
out1_1 = self.model1_1(out1)
out1_2 = self.model1_2(out1)
out2 = torch.cat([out1_1, out1_2, out1], 1)
out2_1 = self.model2_1(out2)
out2_2 = self.model2_2(out2)
out3 = torch.cat([out2_1, out2_2, out1], 1)
out3_1 = self.model3_1(out3)
out3_2 = self.model3_2(out3)
out4 = torch.cat([out3_1, out3_2, out1], 1)
out4_1 = self.model4_1(out4)
out4_2 = self.model4_2(out4)
out5 = torch.cat([out4_1, out4_2, out1], 1)
out5_1 = self.model5_1(out5)
out5_2 = self.model5_2(out5)
out6 = torch.cat([out5_1, out5_2, out1], 1)
out6_1 = self.model6_1(out6)
out6_2 = self.model6_2(out6)
return out6_1, out6_2
def make_layers(layer_dict):
layers = []
for i in range(len(layer_dict) - 1):
layer = layer_dict[i]
for k in layer:
v = layer[k]
if 'pool' in k:
layers += [nn.MaxPool2d(kernel_size=v[0], stride=v[1], padding=v[2])]
else:
conv2d = nn.Conv2d(in_channels=v[0], out_channels=v[1], kernel_size=v[2], stride=v[3], padding=v[4])
layers += [conv2d, nn.ReLU(inplace=True)]
layer = list(layer_dict[-1].keys())
k = layer[0]
v = layer_dict[-1][k]
conv2d = nn.Conv2d(in_channels=v[0], out_channels=v[1], kernel_size=v[2], stride=v[3], padding=v[4])
layers += [conv2d]
return nn.Sequential(*layers)
def get_pose_model():
blocks = {}
block_0 = [{'conv1_1': [3, 64, 3, 1, 1]}, {'conv1_2': [64, 64, 3, 1, 1]}, {'pool1_stage1': [2, 2, 0]},
{'conv2_1': [64, 128, 3, 1, 1]}, {'conv2_2': [128, 128, 3, 1, 1]}, {'pool2_stage1': [2, 2, 0]},
{'conv3_1': [128, 256, 3, 1, 1]}, {'conv3_2': [256, 256, 3, 1, 1]}, {'conv3_3': [256, 256, 3, 1, 1]},
{'conv3_4': [256, 256, 3, 1, 1]}, {'pool3_stage1': [2, 2, 0]}, {'conv4_1': [256, 512, 3, 1, 1]},
{'conv4_2': [512, 512, 3, 1, 1]}, {'conv4_3_CPM': [512, 256, 3, 1, 1]},
{'conv4_4_CPM': [256, 128, 3, 1, 1]}]
blocks['block1_1'] = [{'conv5_1_CPM_L1': [128, 128, 3, 1, 1]}, {'conv5_2_CPM_L1': [128, 128, 3, 1, 1]},
{'conv5_3_CPM_L1': [128, 128, 3, 1, 1]}, {'conv5_4_CPM_L1': [128, 512, 1, 1, 0]},
{'conv5_5_CPM_L1': [512, 38, 1, 1, 0]}]
blocks['block1_2'] = [{'conv5_1_CPM_L2': [128, 128, 3, 1, 1]}, {'conv5_2_CPM_L2': [128, 128, 3, 1, 1]},
{'conv5_3_CPM_L2': [128, 128, 3, 1, 1]}, {'conv5_4_CPM_L2': [128, 512, 1, 1, 0]},
{'conv5_5_CPM_L2': [512, 19, 1, 1, 0]}]
for i in range(2, 7):
blocks['block%d_1' % i] = [{'Mconv1_stage%d_L1' % i: [185, 128, 7, 1, 3]},
{'Mconv2_stage%d_L1' % i: [128, 128, 7, 1, 3]},
{'Mconv3_stage%d_L1' % i: [128, 128, 7, 1, 3]},
{'Mconv4_stage%d_L1' % i: [128, 128, 7, 1, 3]},
{'Mconv5_stage%d_L1' % i: [128, 128, 7, 1, 3]},
{'Mconv6_stage%d_L1' % i: [128, 128, 1, 1, 0]},
{'Mconv7_stage%d_L1' % i: [128, 38, 1, 1, 0]}]
blocks['block%d_2' % i] = [{'Mconv1_stage%d_L2' % i: [185, 128, 7, 1, 3]},
{'Mconv2_stage%d_L2' % i: [128, 128, 7, 1, 3]},
{'Mconv3_stage%d_L2' % i: [128, 128, 7, 1, 3]},
{'Mconv4_stage%d_L2' % i: [128, 128, 7, 1, 3]},
{'Mconv5_stage%d_L2' % i: [128, 128, 7, 1, 3]},
{'Mconv6_stage%d_L2' % i: [128, 128, 1, 1, 0]},
{'Mconv7_stage%d_L2' % i: [128, 19, 1, 1, 0]}]
layers = []
for block in block_0:
# print(block)
for key in block:
v = block[key]
if 'pool' in key:
layers += [nn.MaxPool2d(kernel_size=v[0], stride=v[1], padding=v[2])]
else:
conv2d = nn.Conv2d(in_channels=v[0], out_channels=v[1], kernel_size=v[2], stride=v[3], padding=v[4])
layers += [conv2d, nn.ReLU(inplace=True)]
models = {
'block_0': nn.Sequential(*layers)
}
for k in blocks:
v = blocks[k]
models[k] = make_layers(v)
return PoseEstimation(models)
def get_paf_and_heatmap(model, img_raw, scale_search, param_stride=8, box_size=368):
multiplier = [scale * box_size / img_raw.shape[0] for scale in scale_search]
heatmap_avg = torch.zeros((len(multiplier), 19, img_raw.shape[0], img_raw.shape[1])).cuda()
paf_avg = torch.zeros((len(multiplier), 38, img_raw.shape[0], img_raw.shape[1])).cuda()
for i, scale in enumerate(multiplier):
img_test = cv2.resize(img_raw, (0, 0), fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)
img_test_pad, pad = pad_right_down_corner(img_test, param_stride, param_stride)
img_test_pad = np.transpose(np.float32(img_test_pad[:, :, :, np.newaxis]), (3, 2, 0, 1)) / 256 - 0.5
feed = Variable(torch.from_numpy(img_test_pad)).cuda()
output1, output2 = model(feed)
print(output1.size())
print(output2.size())
heatmap = nn.UpsamplingBilinear2d((img_raw.shape[0], img_raw.shape[1])).cuda()(output2)
paf = nn.UpsamplingBilinear2d((img_raw.shape[0], img_raw.shape[1])).cuda()(output1)
heatmap_avg[i] = heatmap[0].data
paf_avg[i] = paf[0].data
heatmap_avg = torch.transpose(torch.transpose(torch.squeeze(torch.mean(heatmap_avg, 0)), 0, 1), 1, 2).cuda()
heatmap_avg = heatmap_avg.cpu().numpy()
paf_avg = torch.transpose(torch.transpose(torch.squeeze(torch.mean(paf_avg, 0)), 0, 1), 1, 2).cuda()
paf_avg = paf_avg.cpu().numpy()
return paf_avg, heatmap_avg
def extract_heatmap_info(heatmap_avg, param_thre1=0.1):
all_peaks = []
peak_counter = 0
for part in range(18):
map_ori = heatmap_avg[:, :, part]
map_gau = gaussian_filter(map_ori, sigma=3)
map_left = np.zeros(map_gau.shape)
map_left[1:, :] = map_gau[:-1, :]
map_right = np.zeros(map_gau.shape)
map_right[:-1, :] = map_gau[1:, :]
map_up = np.zeros(map_gau.shape)
map_up[:, 1:] = map_gau[:, :-1]
map_down = np.zeros(map_gau.shape)
map_down[:, :-1] = map_gau[:, 1:]
peaks_binary = np.logical_and.reduce(
(map_gau >= map_left, map_gau >= map_right, map_gau >= map_up,
map_gau >= map_down, map_gau > param_thre1))
peaks = zip(np.nonzero(peaks_binary)[1], np.nonzero(peaks_binary)[0]) # note reverse
peaks = list(peaks)
peaks_with_score = [x + (map_ori[x[1], x[0]],) for x in peaks]
ids = range(peak_counter, peak_counter + len(peaks))
peaks_with_score_and_id = [peaks_with_score[i] + (ids[i],) for i in range(len(ids))]
all_peaks.append(peaks_with_score_and_id)
peak_counter += len(peaks)
return all_peaks
def extract_paf_info(img_raw, paf_avg, all_peaks, param_thre2=0.05, param_thre3=0.5):
connection_all = []
special_k = []
mid_num = 10
for k in range(len(map_ids)):
score_mid = paf_avg[:, :, [x - 19 for x in map_ids[k]]]
candA = all_peaks[limb_seq[k][0] - 1]
candB = all_peaks[limb_seq[k][1] - 1]
nA = len(candA)
nB = len(candB)
if nA != 0 and nB != 0:
connection_candidate = []
for i in range(nA):
for j in range(nB):
vec = np.subtract(candB[j][:2], candA[i][:2])
norm = math.sqrt(vec[0] * vec[0] + vec[1] * vec[1])
vec = np.divide(vec, norm)
startend = zip(np.linspace(candA[i][0], candB[j][0], num=mid_num),
np.linspace(candA[i][1], candB[j][1], num=mid_num))
startend = list(startend)
vec_x = np.array([score_mid[int(round(startend[I][1])), int(round(startend[I][0])), 0]
for I in range(len(startend))])
vec_y = np.array([score_mid[int(round(startend[I][1])), int(round(startend[I][0])), 1]
for I in range(len(startend))])
score_midpts = np.multiply(vec_x, vec[0]) + np.multiply(vec_y, vec[1])
score_with_dist_prior = sum(score_midpts) / len(score_midpts)
score_with_dist_prior += min(0.5 * img_raw.shape[0] / norm - 1, 0)
criterion1 = len(np.nonzero(score_midpts > param_thre2)[0]) > 0.8 * len(score_midpts)
criterion2 = score_with_dist_prior > 0
if criterion1 and criterion2:
connection_candidate.append(
[i, j, score_with_dist_prior, score_with_dist_prior + candA[i][2] + candB[j][2]])
connection_candidate = sorted(connection_candidate, key=lambda x: x[2], reverse=True)
connection = np.zeros((0, 5))
for c in range(len(connection_candidate)):
i, j, s = connection_candidate[c][0:3]
if i not in connection[:, 3] and j not in connection[:, 4]:
connection = np.vstack([connection, [candA[i][3], candB[j][3], s, i, j]])
if len(connection) >= min(nA, nB):
break
connection_all.append(connection)
else:
special_k.append(k)
connection_all.append([])
return special_k, connection_all
def get_subsets(connection_all, special_k, all_peaks):
# last number in each row is the total parts number of that person
# the second last number in each row is the score of the overall configuration
subset = -1 * np.ones((0, 20))
candidate = np.array([item for sublist in all_peaks for item in sublist])
for k in range(len(map_ids)):
if k not in special_k:
partAs = connection_all[k][:, 0]
partBs = connection_all[k][:, 1]
indexA, indexB = np.array(limb_seq[k]) - 1
for i in range(len(connection_all[k])): # = 1:size(temp,1)
found = 0
subset_idx = [-1, -1]
for j in range(len(subset)): # 1:size(subset,1):
if subset[j][indexA] == partAs[i] or subset[j][indexB] == partBs[i]:
subset_idx[found] = j
found += 1
if found == 1:
j = subset_idx[0]
if (subset[j][indexB] != partBs[i]):
subset[j][indexB] = partBs[i]
subset[j][-1] += 1
subset[j][-2] += candidate[partBs[i].astype(int), 2] + connection_all[k][i][2]
elif found == 2: # if found 2 and disjoint, merge them
j1, j2 = subset_idx
print("found = 2")
membership = ((subset[j1] >= 0).astype(int) + (subset[j2] >= 0).astype(int))[:-2]
if len(np.nonzero(membership == 2)[0]) == 0: # merge
subset[j1][:-2] += (subset[j2][:-2] + 1)
subset[j1][-2:] += subset[j2][-2:]
subset[j1][-2] += connection_all[k][i][2]
subset = np.delete(subset, j2, 0)
else: # as like found == 1
subset[j1][indexB] = partBs[i]
subset[j1][-1] += 1
subset[j1][-2] += candidate[partBs[i].astype(int), 2] + connection_all[k][i][2]
# if find no partA in the subset, create a new subset
elif not found and k < 17:
row = -1 * np.ones(20)
row[indexA] = partAs[i]
row[indexB] = partBs[i]
row[-1] = 2
row[-2] = sum(candidate[connection_all[k][i, :2].astype(int), 2]) + connection_all[k][i][2]
subset = np.vstack([subset, row])
return subset, candidate
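# The comments at the top of get_subsets describe the row layout: 18 peak ids, then the overall
# score, then the number of parts found. The sketch below illustrates that layout together with
# the filter used later in draw_key_point. It is a simplified illustration, not the code above:
# make_person_row and keep_person are hypothetical helper names, and plain lists stand in for
# the NumPy rows.

```python
NUM_PARTS = 18  # body points tracked per person, as in the code above


def make_person_row(part_ids, score):
    """Build a 20-slot row: 18 peak ids (-1 = missing), total score, part count."""
    row = [-1.0] * (NUM_PARTS + 2)
    for part_index, peak_id in part_ids.items():
        row[part_index] = float(peak_id)
    row[-2] = float(score)          # second-to-last slot: score of the configuration
    row[-1] = float(len(part_ids))  # last slot: number of parts assigned to this person
    return row


def keep_person(row, min_parts=4, min_mean_score=0.4):
    """Keep a row only if it has enough parts and a high enough mean score."""
    count, score = row[-1], row[-2]
    return count >= min_parts and score / count >= min_mean_score


strong = make_person_row({0: 3, 1: 7, 2: 11, 5: 12}, score=3.2)
weak = make_person_row({0: 4, 1: 8}, score=1.9)
print(keep_person(strong), keep_person(weak))
```

# The thresholds 4 and 0.4 are the ones hard-coded in draw_key_point; everything else here is
# an assumption made for the sake of a small example.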
def draw_key_point(subset, all_peaks, img_raw):
del_ids = []
for i in range(len(subset)):
if subset[i][-1] < 4 or subset[i][-2] / subset[i][-1] < 0.4:
del_ids.append(i)
subset = np.delete(subset, del_ids, axis=0)
img_canvas = img_raw.copy() # B,G,R order
for i in range(18):
for j in range(len(all_peaks[i])):
cv2.circle(img_canvas, all_peaks[i][j][0:2], 4, colors[i], thickness=-1)
return subset, img_canvas
def link_key_point(img_canvas, candidate, subset, stickwidth=4):
for i in range(17):
for n in range(len(subset)):
index = subset[n][np.array(limb_seq[i]) - 1]
if -1 in index:
continue
cur_canvas = img_canvas.copy()
Y = candidate[index.astype(int), 0]
X = candidate[index.astype(int), 1]
mX = np.mean(X)
mY = np.mean(Y)
length = ((X[0] - X[1]) ** 2 + (Y[0] - Y[1]) ** 2) ** 0.5
angle = math.degrees(math.atan2(X[0] - X[1], Y[0] - Y[1]))
polygon = cv2.ellipse2Poly((int(mY), int(mX)), (int(length / 2), stickwidth), int(angle), 0, 360, 1)
cv2.fillConvexPoly(cur_canvas, polygon, colors[i])
img_canvas = cv2.addWeighted(img_canvas, 0.4, cur_canvas, 0.6, 0)
return img_canvas
def pad_right_down_corner(img, stride, pad_value):
h = img.shape[0]
w = img.shape[1]
pad = 4 * [None]
pad[0] = 0 # up
pad[1] = 0 # left
pad[2] = 0 if (h % stride == 0) else stride - (h % stride) # down
pad[3] = 0 if (w % stride == 0) else stride - (w % stride) # right
img_padded = img
pad_up = np.tile(img_padded[0:1, :, :] * 0 + pad_value, (pad[0], 1, 1))
img_padded = np.concatenate((pad_up, img_padded), axis=0)
pad_left = np.tile(img_padded[:, 0:1, :] * 0 + pad_value, (1, pad[1], 1))
img_padded = np.concatenate((pad_left, img_padded), axis=1)
pad_down = np.tile(img_padded[-2:-1, :, :] * 0 + pad_value, (pad[2], 1, 1))
img_padded = np.concatenate((img_padded, pad_down), axis=0)
pad_right = np.tile(img_padded[:, -2:-1, :] * 0 + pad_value, (1, pad[3], 1))
img_padded = np.concatenate((img_padded, pad_right), axis=1)
return img_padded, pad
if __name__ == '__main__':
print(get_pose_model())
# First let's download the pre-trained model.
# In[2]:
# Using gdown to download the model directly from Google Drive
#assert os.system(' conda install -y gdown') == 0
import gdown
# In[3]:
model = 'coco_pose_iter_440000.pth.tar'
if not os.path.exists(model):
url = 'https://drive.google.com/u/0/uc?export=download&confirm=f_Ix&id=0B1asvDK18cu_MmY1ZkpaOUhhRHM'
gdown.download(
url,
model,
quiet=False
)
# In[4]:
state_dict = torch.load('./coco_pose_iter_440000.pth.tar')['state_dict'] # getting the pre-trained model's parameters
# A state_dict is simply a Python dictionary object that maps each layer to its parameter tensor.
model_pose = get_pose_model() # building the model (see fn. defn. above). To see the architecture, see below cell.
model_pose.load_state_dict(state_dict) # Loading the parameters (weights, biases) into the model.
model_pose.float() # Casts the floating-point parameters and buffers to float32. This is effectively a no-op when the loaded weights are already float32.
# In[5]:
arch_image = '../input/indonesian-traditional-dance/tgagrakanyar/tga_0000.jpg'
img_ori = cv2.imread(arch_image)
plt.figure(figsize=(15, 8))
plt.imshow(img_ori[...,::-1])
# Notice that the first 10 layers are from VGG-19. Instead of downloading that model and loading the layers from it, we simply hardcoded them in get_pose_model().
# In[6]:
# Run this to view the model's architecture
#model_pose.eval()
# In[7]:
use_gpu = True
if use_gpu:
model_pose.cuda()
model_pose = torch.nn.DataParallel(model_pose, device_ids=range(torch.cuda.device_count()))
cudnn.benchmark = True
# In[8]:
def estimate_pose(img_ori, name=None):
if name is None:
name = tempfile.mktemp(
dir='/kaggle/working',
suffix='.png',
)
pprint.pprint(
['estimate_pose', dict(name=name)],
)
# People might be at different scales in the image, perform inference at multiple scales to boost results
scale_param = [0.5, 1.0, 1.5, 2.0]
# Predict Heatmaps for approximate joint position
# Use Part Affinity Fields (PAF's) as guidance to link joints to form skeleton
# PAF's are just unit vectors along the limb encoding the direction of the limb
# A dot product of possible joint connection will be high if actual limb else low
paf_info, heatmap_info = get_paf_and_heatmap(model_pose, img_ori, scale_param)
peaks = extract_heatmap_info(heatmap_info)
sp_k, con_all = extract_paf_info(img_ori, paf_info, peaks)
subsets, candidates = get_subsets(con_all, sp_k, peaks)
subsets, img_points = draw_key_point(subsets, peaks, img_ori)
# After predicting Heatmaps and PAF's, proceed to link joints correctly
img_canvas = link_key_point(img_points, candidates, subsets)
f = plt.figure(figsize=(15, 10))
plt.subplot(1, 2, 1)
plt.imshow(img_points[...,::-1])
plt.subplot(1, 2, 2)
plt.imshow(img_canvas[...,::-1])
f.savefig(name)
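# The comments inside estimate_pose say that a PAF holds unit vectors along each limb, so the dot
# product with a candidate connection is high only for a real limb. A minimal sketch of that
# scoring idea follows. It is an illustration under assumptions, not extract_paf_info itself:
# paf_limb_score is a hypothetical name, nested lists stand in for the PAF arrays, and the
# distance prior and thresholds used above are left out.

```python
import math


def paf_limb_score(paf_x, paf_y, joint_a, joint_b, num_samples=10):
    """Mean dot product between the PAF and the unit vector joint_a -> joint_b.

    paf_x, paf_y: 2D grids indexed as [row][column] (that is, [y][x]).
    joint_a, joint_b: (x, y) pixel coordinates. num_samples must be at least 2.
    """
    ax, ay = joint_a
    bx, by = joint_b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return 0.0  # coincident joints have no direction to compare against
    ux, uy = dx / norm, dy / norm
    total = 0.0
    for s in range(num_samples):
        t = s / (num_samples - 1)
        x = int(round(ax + t * dx))  # sample evenly along the segment
        y = int(round(ay + t * dy))
        total += paf_x[y][x] * ux + paf_y[y][x] * uy
    return total / num_samples


# Toy field: a horizontal "limb" along row 5 pointing in the +x direction.
paf_x = [[1.0 if row == 5 else 0.0 for _ in range(20)] for row in range(20)]
paf_y = [[0.0] * 20 for _ in range(20)]

along = paf_limb_score(paf_x, paf_y, (2, 5), (12, 5))    # follows the limb
against = paf_limb_score(paf_x, paf_y, (12, 5), (2, 5))  # same limb, reversed
across = paf_limb_score(paf_x, paf_y, (5, 2), (5, 12))   # perpendicular to it
print(along, against, across)
```

# A pair of joints that follows the field scores close to 1, a reversed pair scores close to -1,
# and a pair that cuts across the field scores close to 0.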
# In[9]:
test_image = '../input/indonesian-traditional-dance/tgagrakanyar/tga_0000.jpg'
img_ori = cv2.imread(test_image)
estimate_pose(img_ori)
# In[10]:
test_image = '../input/indonesian-traditional-dance/tgagrakanyar/tga_0010.jpg'
img_ori = cv2.imread(test_image)
estimate_pose(img_ori)
# In[11]:
test_image = '../input/indonesian-traditional-dance/tgagrakanyar/tga_0020.jpg'
img_ori = cv2.imread(test_image)
estimate_pose(img_ori)
# In[12]:
test_image = '../input/indonesian-traditional-dance/tgagrakanyar/tga_0030.jpg'
img_ori = cv2.imread(test_image)
estimate_pose(img_ori)
# In[13]:
test_image = '../input/indonesian-traditional-dance/tgagrakanyar/tga_0040.jpg'
img_ori = cv2.imread(test_image)
estimate_pose(img_ori)
# In[14]:
test_image = '../input/indonesian-traditional-dance/tgagrakanyar/tga_0050.jpg'
img_ori = cv2.imread(test_image)
estimate_pose(img_ori)
# In[ ]:

@ -1,56 +0,0 @@
import os
if os.system(r''' pip show alphapose''') != 0:
t1 = r'''
pip install pycocotools
rm -fr /kaggle/working/AlphaPose
pip install pyyaml==5.2
pip install scipy==1.1.0
git clone https://github.com/WildflowerSchools/AlphaPose
python -m pip install cython gdown
apt-get install libyaml-dev
cd /kaggle/working/AlphaPose && python setup.py build develop
'''
for o in t1.splitlines():
print(o)
assert os.system(o) == 0
import os
#!git clone https://github.com/MVIG-SJTU/AlphaPose.git
import torch
print(torch.__version__)
import yaml, scipy
print(yaml.__version__)
print(scipy.__version__)
import gdown
import os
for o1, o2 in [
(
'1D47msNOOiJKvPOXlnpyzdKA3k6E97NTC',
'/kaggle/working/AlphaPose/detector/yolo/data/yolov3-spp.weights',
),
(
'1nlnuYfGNuHWZztQHXwVZSL_FvfE551pA',
'/kaggle/working/AlphaPose/detector/tracker/data/JDE-1088x608-uncertainty',
),
(
'1kQhnMRURFiy7NsdS8EFL-8vtqEXOgECn',
'/kaggle/working/AlphaPose/pretrained_models/fast_res50_256x192.pth'
),
]:
os.makedirs(os.path.split(o2)[0], exist_ok=True)
if not os.path.exists(o2):
gdown.download(
'https://drive.google.com/u/0/uc?export=download&confirm=f_Ix&id=%s' % o1,
o2,
quiet=False
)
assert os.system(r'''
mkdir -p /kaggle/working/test-input && mkdir -p /kaggle/working/test-output && cp /kaggle/working/AlphaPose/examples/demo/*.jpg /kaggle/working/test-input
cd /kaggle/working/AlphaPose && python3 scripts/demo_inference.py --cfg configs/coco/resnet/256x192_res50_lr1e-3_1x.yaml --checkpoint pretrained_models/fast_res50_256x192.pth --indir /kaggle/working/test-input --outdir /kaggle/working/test-output --save_img
''') == 0

@ -1,172 +0,0 @@
# https://raw.githubusercontent.com/hafizas101/Real-time-human-pose-estimation-and-classification/master/main.py
# From Python
# It requires OpenCV installed for Python
import sys
import cv2
import os
from sys import platform
import argparse
from math import sqrt, acos, degrees, atan
import numpy as np
# ----------------------------------------- Arslan Part ----------------------------------------------------------------------------------
def get_angle(a,b):
#print(a)
#print(b)
del_y = a[1]-b[1]
del_x = b[0]-a[0]
if del_x == 0:
del_x = 0.1
#print("Del_X : "+str(del_x)+"-----Del_Y: "+str(del_y))
angle = 0
if del_x > 0 and del_y > 0:
angle = degrees(atan(del_y / del_x))
elif del_x < 0 and del_y > 0:
angle = degrees(atan(del_y / del_x)) + 180
return angle
# ------------------------------------------------------------------------------------------------------------------------------------------
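# get_angle above only assigns an angle when del_y > 0 and returns 0 otherwise. If a full-range
# angle is ever needed, one possible alternative is atan2, sketched below. This swaps in a
# different technique and is not the original authors' method; limb_angle is a hypothetical name.

```python
from math import atan2, degrees


def limb_angle(a, b):
    """Angle in degrees of the vector a -> b, with the image y axis flipped so up is positive."""
    del_y = a[1] - b[1]  # image rows grow downward, so flip the sign
    del_x = b[0] - a[0]
    return degrees(atan2(del_y, del_x))  # handles del_x == 0 and all four quadrants


print(limb_angle((0, 10), (10, 0)))
print(limb_angle((10, 10), (0, 0)))
```

# For del_y > 0 this agrees with get_angle in both of its branches. For del_y < 0 it returns a
# negative angle instead of 0, so the 30..150 degree checks further down would behave the same.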
# ----------------------------------------- Maksim Part ----------------------------------------------------------------------------------
def angle_gor(a,b,c,d):
ab=[a[0]-b[0],a[1]-b[1]]
ab1=[c[0]-d[0],c[1]-d[1]]
cos=abs(ab[0]*ab1[0]+ab[1]*ab1[1])/(sqrt(ab[0]**2+ab[1]**2)*sqrt(ab1[0]**2+ab1[1]**2))
ang = acos(cos)
return ang*180/np.pi
def sit_ang(a,b,c,d):
ang=angle_gor(a,b,c,d)
s1=0
if ang is not None:
#print("Angle",ang)
if ang < 120 and ang>40:
s1=1
return s1
def sit_rec(a,b,c,d):
ab = [a[0] - b[0], a[1] - b[1]]
ab1 = [c[0] - d[0], c[1] - d[1]]
l1=sqrt(ab[0]**2+ab[1]**2)
l2=sqrt(ab1[0]**2+ab1[1]**2)
s=0
if l1!=0 and l2!=0:
#print(l1,l2, "---------->>>")
if l2/l1>=1.5:
s=1
return s
# ------------------------------------------------------------------------------------------------------------------------------------------
# ----------------------------------------------------------- OpenPose Example Code ----------------------------------------------------------
# Import Openpose (Windows/Ubuntu/OSX)
dir_path = os.path.dirname(os.path.realpath(__file__))
try:
# Windows Import
if platform == "win32":
# Change these variables to point to the correct folder (Release/x64 etc.)
sys.path.append(dir_path + '/../../python/openpose/Release');
os.environ['PATH'] = os.environ['PATH'] + ';' + dir_path + '/../../x64/Release;' + dir_path + '/../../bin;'
import pyopenpose as op
else:
# Change these variables to point to the correct folder (Release/x64 etc.)
sys.path.append('../../python');
# If you run `make install` (default path is `/usr/local/python` for Ubuntu), you can also access the OpenPose/python module from there. This will install OpenPose and the python library at your desired installation path. Ensure that this is in your python path in order to use it.
# sys.path.append('/usr/local/python')
from openpose import pyopenpose as op
except ImportError as e:
print('Error: OpenPose library could not be found. Did you enable `BUILD_PYTHON` in CMake and have this Python script in the right folder?')
raise e
# Flags
parser = argparse.ArgumentParser()
parser.add_argument("--image_path", default="../../../examples/media/COCO_val2014_000000000192.jpg", help="Process an image. Read all standard formats (jpg, png, bmp, etc.).")
args = parser.parse_known_args()
# Custom Params (refer to include/openpose/flags.hpp for more parameters)
params = dict()
params["model_folder"] = "/home/nvidia/openpose/models/"
# Add others in path?
for i in range(0, len(args[1])):
curr_item = args[1][i]
if i != len(args[1])-1: next_item = args[1][i+1]
else: next_item = "1"
if "--" in curr_item and "--" in next_item:
key = curr_item.replace('-','')
if key not in params: params[key] = "1"
elif "--" in curr_item and "--" not in next_item:
key = curr_item.replace('-','')
if key not in params: params[key] = next_item
# Construct it from system arguments
# op.init_argv(args[1])
# oppython = op.OpenposePython()
c=0
# Starting OpenPose
opWrapper = op.WrapperPython()
opWrapper.configure(params)
opWrapper.start()
# ------------------------------------------------------- OUR CONTRIBUTIONS ----------------------------------------------------------------
cam = cv2.VideoCapture(1)
for i in range(1000):
# Process Image
datum = op.Datum()
s, im = cam.read() # captures image
#cv2.imshow("Test Picture", im) # displays captured image
#im=cv2.resize(im,(480,270), interpolation = cv2.INTER_AREA)
image1 = im
#imageToProcess = cv2.imread(args[0].image_path)
c+=1
if c==8:
c=0
datum.cvInputData = image1
opWrapper.emplaceAndPop([datum]) # OpenPose being applied to the frame image.
# Display Image
#print("Body keypoints: \n" + str(datum.poseKeypoints))
#print(datum.poseKeypoints.shape)
if len(datum.poseKeypoints.shape)>=2:
x1=0
x2=0
for j in range(len(datum.poseKeypoints)):
x1=0
x2=0
s=0
s1=0
ang1 = get_angle(datum.poseKeypoints[j][3], datum.poseKeypoints[j][4])
ang2 = get_angle(datum.poseKeypoints[j][6], datum.poseKeypoints[j][7])
if (30 < ang1 < 150):
x1 = 1
if (30 < ang2 < 150):
x2 = 1
x3 = x1+x2
if (x3 == 1):
print("The {} person says: HELLO !".format(j+1))
#cv2.putText(datum.cvOutputData,'OpenPose using Python-OpenCV',(20,30), cv2.FONT_HERSHEY_SIMPLEX, 1,(255,255,255),1,cv2.LINE_AA)
elif (x3 == 2):
print("The {} person says: STOP PLEASE !".format(j+1))
s += sit_rec(datum.poseKeypoints[j][9], datum.poseKeypoints[j][10],datum.poseKeypoints[j][10],datum.poseKeypoints[j][11])
s += sit_rec(datum.poseKeypoints[j][12], datum.poseKeypoints[j][13],datum.poseKeypoints[j][13],datum.poseKeypoints[j][14])
s1+=sit_ang(datum.poseKeypoints[j][9], datum.poseKeypoints[j][10],datum.poseKeypoints[j][10],datum.poseKeypoints[j][11])
s1+=sit_ang(datum.poseKeypoints[j][12], datum.poseKeypoints[j][13],datum.poseKeypoints[j][13],datum.poseKeypoints[j][14])
if s > 0 or s1>0:
print("The {} person is sitting".format(j+1))
if s == 0 and s1 == 0:
print("The {} person is standing".format(j+1))
print("___________________________")
print(" ")
im=cv2.resize(datum.cvOutputData,(960,540), interpolation = cv2.INTER_AREA)
cv2.imshow("OpenPose 1.4.0 - Tutorial Python API", im)
cv2.waitKey(1)
# ------------------------------------------------------------------------------------------------------------------------------------------

@ -1,5 +1,5 @@
#!/usr/bin/env python3
#vim: set filetype=python
# vim: set filetype=python
import logging
import json
@ -7,158 +7,184 @@ import enum
import pathlib
import sys
import argparse
#import optparse
# import optparse
import dataclasses
import subprocess
import os
from typing import (
Optional, Any, TypeAlias, Literal, cast, BinaryIO, Generator,
ClassVar, Self,
Optional,
Any,
TypeAlias,
Literal,
cast,
BinaryIO,
Generator,
ClassVar,
Self,
)
logger = logging.getLogger()
@dataclasses.dataclass
class Settings:
project_root : pathlib.Path = pathlib.Path.cwd()
project_root: pathlib.Path = pathlib.Path.cwd()
env_path : pathlib.Path = project_root / 'tmp' / 'env3'
env_path: pathlib.Path = project_root / 'tmp' / 'env3'
_settings : ClassVar[Optional['Settings']] = None
_settings: ClassVar[Optional['Settings']] = None
@classmethod
def settings(cls) -> Self:
if cls._settings is None:
cls._settings = cls()
@classmethod
def settings(cls) -> Self:
if cls._settings is None:
cls._settings = cls()
return cls._settings
return cls._settings
def js(argv: list[str]) -> int:
return subprocess.check_call([
'sudo',
'docker-compose',
'--project-directory',
Settings.settings().project_root,
'-f',
Settings.settings().project_root / 'docker' / 'js' / 'docker-compose.yml',
*argv,
])
return subprocess.check_call(
[
'sudo',
'docker-compose',
'--project-directory',
Settings.settings().project_root,
'-f',
Settings.settings().project_root / 'docker' / 'js' / 'docker-compose.yml',
*argv,
]
)
def env(
argv: Optional[list[str]] = None,
mode: Literal['exec', 'subprocess'] = 'subprocess',
**kwargs: Any,
argv: Optional[list[str]] = None,
mode: Literal['exec', 'subprocess'] = 'subprocess',
**kwargs: Any,
) -> Optional[subprocess.CompletedProcess[bytes]]:
env_path = Settings.settings().env_path
env_path = Settings.settings().env_path
if not env_path.exists():
subprocess.check_call([
sys.executable, '-m', 'venv',
'--system-site-packages',
str(env_path)
])
if not env_path.exists():
subprocess.check_call([sys.executable, '-m', 'venv', '--system-site-packages', str(env_path)])
subprocess.check_call([
env_path / 'bin' / 'python3',
'-m', 'pip',
'install', '-r', 'requirements.txt',
])
subprocess.check_call(
[
env_path / 'bin' / 'python3',
'-m',
'pip',
'install',
'-r',
'requirements.txt',
]
)
if not argv is None:
python_path = str(env_path / 'bin' / 'python3')
if not argv is None:
python_path = str(env_path / 'bin' / 'python3')
if mode == 'exec':
os.execv(
python_path,
[
python_path,
*argv,
],
)
return None
elif mode == 'subprocess':
return subprocess.run([
python_path,
*argv,
], **kwargs)
else:
raise NotImplementedError
if mode == 'exec':
os.execv(
python_path,
[
python_path,
*argv,
],
)
return None
elif mode == 'subprocess':
return subprocess.run(
[
python_path,
*argv,
],
**kwargs,
)
else:
raise NotImplementedError
return None
return None
def ruff(argv: list[str]) -> None:
parser = argparse.ArgumentParser()
parser.add_argument(
'-i',
dest='paths',
help='specify paths to check',
default=[],
action='append',
)
parser.add_argument(
'-e',
dest='exclude',
help='rules to ignore',
default=[],
action='append',
)
parser = argparse.ArgumentParser()
parser.add_argument(
'-i',
dest='paths',
help='specify paths to check',
default=[],
action='append',
)
parser.add_argument(
'-e',
dest='exclude',
help='rules to ignore',
default=[],
action='append',
)
options, args = parser.parse_known_args(argv)
options, args = parser.parse_known_args(argv)
if len(options.paths) == 0:
options.paths.extend([
'.',
'dotfiles/.local/bin/commands',
])
if len(options.paths) == 0:
options.paths.extend(
[
'.',
'dotfiles/.local/bin/commands',
]
)
if len(options.exclude) == 0:
options.exclude.extend([
'E731',
'E713',
'E714',
'E703',
])
if len(options.exclude) == 0:
options.exclude.extend(
[
'E731',
'E713',
'E714',
'E703',
]
)
res = env([
'-m',
'ruff',
'check',
*args,
'--output-format', 'json',
'--ignore', ','.join(options.exclude),
*options.paths,
], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
res = env(
[
'-m',
'ruff',
'check',
*args,
'--output-format',
'json',
'--ignore',
','.join(options.exclude),
*options.paths,
],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
)
assert not res is None
assert not res is None
errors = json.loads(res.stdout.decode('utf-8'))
errors = json.loads(res.stdout.decode('utf-8'))
g: dict[str, Any] = dict()
for o in errors:
if not o['filename'] in g:
g[o['filename']] = []
g[o['filename']].append(o)
g: dict[str, Any] = dict()
for o in errors:
if not o['filename'] in g:
g[o['filename']] = []
g[o['filename']].append(o)
h = {
k : len(v)
for k, v in g.items()
}
h = {k: len(v) for k, v in g.items()}
logger.info(json.dumps(errors, indent=4))
logger.info(json.dumps(h, indent=4))
logger.info(json.dumps(errors, indent=4))
logger.info(json.dumps(h, indent=4))
def inside_env() -> bool:
try:
import numpy
return True
except Exception:
return False
try:
import numpy
#class Commands(enum.StrEnum):
return True
except Exception:
return False
# class Commands(enum.StrEnum):
# js = 'js'
# mypy = 'mypy'
# env = 'env'
@ -172,83 +198,97 @@ def inside_env() -> bool:
# argv,
# )
def host_deps(argv: list[str]) -> None:
if sys.platform in ['linux']:
subprocess.check_call(r'''
if sys.platform in ['linux']:
subprocess.check_call(
r"""
exec yay -S $(cat requirements-archlinux.txt)
''', shell=True,)
else:
raise NotImplementedError
""",
shell=True,
)
else:
raise NotImplementedError
Command_args = ['js', 'mypy', 'env', 'ruff', 'm2', 'host_deps',]
Command : TypeAlias = Literal['js', 'mypy', 'env', 'ruff', 'm2', 'host_deps',]
Command_args = [
'js',
'mypy',
'env',
'ruff',
'm2',
'host_deps',
]
Command: TypeAlias = Literal[
'js',
'mypy',
'env',
'ruff',
'm2',
'host_deps',
]
def run(argv: Optional[list[str]] = None) -> None:
logging.basicConfig(
level=logging.INFO,
format=(
'%(levelname)s:%(name)s:%(message)s'
':%(process)d'
':%(asctime)s'
':%(pathname)s:%(funcName)s:%(lineno)s'
),
)
logging.basicConfig(
level=logging.INFO,
format=('%(levelname)s:%(name)s:%(message)s:%(process)d:%(asctime)s:%(pathname)s:%(funcName)s:%(lineno)s'),
)
if argv is None:
argv = sys.argv[:]
if argv is None:
argv = sys.argv[:]
parser = argparse.ArgumentParser()
parser.add_argument(
'command',
#'_command',
choices=[o for o in Command_args],
# required=True,
)
parser = argparse.ArgumentParser()
parser.add_argument(
'command',
#'_command',
choices=[
o
for o in Command_args
],
#required=True,
)
options, args = parser.parse_known_args(argv[1:])
options, args = parser.parse_known_args(argv[1:])
assert options.command in Command_args
assert options.command in Command_args
if len(args) > 0 and args[0] == '--':
del args[0]
if len(args) > 0 and args[0] == '--':
del args[0]
# options.command = Commands(options._command)
#options.command = Commands(options._command)
if options.command == 'js':
js(args)
elif options.command == 'host_deps':
host_deps(args)
elif options.command == 'env':
env(
args,
mode='exec',
)
# elif options.command == 'mypy':
# if not inside_env():
# env(
# [
# pathlib.Path(__file__).parent / 'm.py',
# *argv[1:],
# ],
# mode='exec'
# )
# else:
# mypy(args)
elif options.command == 'ruff':
ruff(args)
elif options.command == 'm2':
if not inside_env():
env(['--', '_m.py', 'm2', *args])
return
if options.command == 'js':
js(args)
elif options.command == 'host_deps':
host_deps(args)
elif options.command == 'env':
env(args, mode='exec',)
# elif options.command == 'mypy':
# if not inside_env():
# env(
# [
# pathlib.Path(__file__).parent / 'm.py',
# *argv[1:],
# ],
# mode='exec'
# )
# else:
# mypy(args)
elif options.command == 'ruff':
ruff(args)
elif options.command == 'm2':
if not inside_env():
env(['--', '_m.py', 'm2', *args])
return
import python.tasks.cython
python.tasks.cython.mypyc_build(pathlib.Path('_m.py'))
else:
raise NotImplementedError
import python.tasks.cython
python.tasks.cython.mypyc_build(
pathlib.Path('_m.py')
)
else:
raise NotImplementedError
if __name__ == '__main__':
run()
run()
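# run() relies on argparse.parse_known_args to pick out the command and pass everything else
# through to the sub-command, dropping a leading '--' separator if one is left over. A minimal
# standalone sketch of that dispatch step is below. split_command is a hypothetical helper and
# the choices list is shortened for illustration.

```python
import argparse


def split_command(argv):
    """Return (command, remaining_args) without failing on unknown flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument('command', choices=['js', 'env', 'ruff'])
    # parse_known_args returns unrecognized arguments instead of raising an error
    options, rest = parser.parse_known_args(argv)
    if len(rest) > 0 and rest[0] == '--':
        rest = rest[1:]  # same clean-up as in run()
    return options.command, rest


print(split_command(['ruff', '--fix']))
print(split_command(['env', '--', '-m', 'pip']))
```

# Whether argparse leaves the '--' in the leftover list can depend on the Python version, which
# is one reason to strip it defensively as run() does.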

@ -10,7 +10,10 @@ import enum
import argparse
import dataclasses
from typing import (Optional, override,)
from typing import (
Optional,
override,
)
from online.fxreader.pr34.commands_typed.logging import setup as logging_setup
@ -24,183 +27,176 @@ logger = logging.getLogger(__name__)
class Command(enum.StrEnum):
mypy = 'mypy'
pyright = 'pyright'
ruff = 'ruff'
deploy_wheel = 'deploy:wheel'
tests = 'tests'
meson_setup = 'meson:setup'
mypy = 'mypy'
pyright = 'pyright'
ruff = 'ruff'
deploy_wheel = 'deploy:wheel'
tests = 'tests'
meson_setup = 'meson:setup'
@dataclasses.dataclass
class Settings(
_cli.DistSettings,
_cli.DistSettings,
):
base_dir: pathlib.Path = pathlib.Path(__file__).parent.parent
build_dir: pathlib.Path = base_dir / 'tmp' / 'build'
wheel_dir: pathlib.Path = base_dir / 'deps' / 'dist'
env_path: pathlib.Path = cli_bootstrap.BootstrapSettings.get().env_path
python_path: pathlib.Path = pathlib.Path(sys.executable)
base_dir: pathlib.Path = pathlib.Path(__file__).parent.parent
build_dir: pathlib.Path = base_dir / 'tmp' / 'build'
wheel_dir: pathlib.Path = base_dir / 'deps' / 'dist'
env_path: pathlib.Path = cli_bootstrap.BootstrapSettings.get().env_path
python_path: pathlib.Path = pathlib.Path(sys.executable)
class CLI(_cli.CLI):
def __init__(self) -> None:
self.settings = Settings()
self._projects: dict[str, _cli.Project] = {
'online.fxreader.pr34': _cli.Project(
source_dir=self.settings.base_dir / 'python',
build_dir=self.settings.base_dir / 'tmp' / 'online' / 'fxreader' / 'pr34' / 'build',
dest_dir=self.settings.base_dir / 'tmp' / 'online' / 'fxreader' / 'pr34' / 'install',
meson_path=self.settings.base_dir / 'python' / 'meson.build',
)
}
def __init__(self) -> None:
self.settings = Settings()
self._projects: dict[str, _cli.Project] = {
'online.fxreader.pr34': _cli.Project(
source_dir=self.settings.base_dir / 'python',
build_dir=self.settings.base_dir / 'tmp' / 'online' / 'fxreader' / 'pr34' / 'build',
dest_dir=self.settings.base_dir / 'tmp' / 'online' / 'fxreader' / 'pr34' / 'install',
meson_path=self.settings.base_dir / 'python' / 'meson.build',
)
}
self._dependencies : dict[str, _cli.Dependency] = dict()
self._dependencies: dict[str, _cli.Dependency] = dict()
@override
@property
def dist_settings(self) -> _cli.DistSettings:
return self.settings
@override
@property
def dist_settings(self) -> _cli.DistSettings:
return self.settings
@override
@property
def projects(self) -> dict[str, _cli.Project]:
return self._projects
@override
@property
def projects(self) -> dict[str, _cli.Project]:
return self._projects
def mypy(
self,
argv: list[str],
) -> None:
import online.fxreader.pr34.commands_typed.mypy as _mypy
def mypy(
self,
argv: list[str],
) -> None:
import online.fxreader.pr34.commands_typed.mypy as _mypy
project = self._projects['online.fxreader.pr34']
project = self._projects['online.fxreader.pr34']
_mypy.run(
argv,
settings=_mypy.MypySettings(
paths=[
#Settings.settings().project_root / 'dotfiles/.local/bin/commands',
# project.source_dir / 'm.py',
project.source_dir / '_m.py',
project.source_dir / 'online',
project.source_dir / 'cli.py',
project.source_dir / 'm.py',
# Settings.settings().project_root / 'deps/com.github.aiortc.aiortc/src',
#Settings.settings().project_root / 'm.py',
],
max_errors={
'online/fxreader/pr34/commands_typed': 0,
# 'online/fxreader/pr34/commands': 0,
'cli.py': 0,
'm.py': 0,
'../deps/com.github.aiortc.aiortc/src/online_fxreader': 0,
'../deps/com.github.aiortc.aiortc/src/aiortc/contrib/signaling': 0
}
),
)
_mypy.run(
argv,
settings=_mypy.MypySettings(
paths=[
# Settings.settings().project_root / 'dotfiles/.local/bin/commands',
# project.source_dir / 'm.py',
project.source_dir / '_m.py',
project.source_dir / 'online',
project.source_dir / 'cli.py',
project.source_dir / 'm.py',
# Settings.settings().project_root / 'deps/com.github.aiortc.aiortc/src',
# Settings.settings().project_root / 'm.py',
],
max_errors={
'online/fxreader/pr34/commands_typed': 0,
# 'online/fxreader/pr34/commands': 0,
'cli.py': 0,
'm.py': 0,
'../deps/com.github.aiortc.aiortc/src/online_fxreader': 0,
'../deps/com.github.aiortc.aiortc/src/aiortc/contrib/signaling': 0,
},
),
)
@override
@property
def dependencies(self) -> dict[str, _cli.Dependency]:
return self._dependencies
@override
@property
def dependencies(self) -> dict[str, _cli.Dependency]:
return self._dependencies
def run(self, argv: Optional[list[str]] = None) -> None:
if argv is None:
argv = copy.deepcopy(sys.argv)
def run(self, argv: Optional[list[str]] = None) -> None:
if argv is None:
argv = copy.deepcopy(sys.argv)
parser = argparse.ArgumentParser()
parser.add_argument(
'command',
choices=[
o.value
for o in Command
]
)
parser.add_argument(
'-p', '--project',
choices=[
o
for o in self.projects
]
)
parser.add_argument(
'-o', '--output_dir',
default=None,
help='wheel output dir for deploy:wheel',
)
parser.add_argument(
'-f', '--force',
default=False,
action='store_true',
help='remove install dir, before installing, default = false',
)
parser = argparse.ArgumentParser()
parser.add_argument('command', choices=[o.value for o in Command])
parser.add_argument('-p', '--project', choices=[o for o in self.projects])
parser.add_argument(
'-o',
'--output_dir',
default=None,
help='wheel output dir for deploy:wheel',
)
parser.add_argument(
'-f',
'--force',
default=False,
action='store_true',
help='remove install dir, before installing, default = false',
)
options, args = parser.parse_known_args(argv[1:])
options, args = parser.parse_known_args(argv[1:])
default_project : Optional[str] = None
default_project: Optional[str] = None
for k, v in self.projects.items():
if (
cli_bootstrap.paths_equal(
v.source_dir.resolve(),
# pathlib.Path(__file__).parent.resolve(),
pathlib.Path.cwd(),
)
):
default_project = k
for k, v in self.projects.items():
if cli_bootstrap.paths_equal(
v.source_dir.resolve(),
# pathlib.Path(__file__).parent.resolve(),
pathlib.Path.cwd(),
):
default_project = k
if options.project is None:
if not default_project is None:
options.project = default_project
else:
logger.error(dict(msg='not provided project name'))
raise NotImplementedError
if options.project is None:
if not default_project is None:
options.project = default_project
else:
logger.error(dict(msg='not provided project name'))
raise NotImplementedError
options.command = Command(options.command)
options.command = Command(options.command)
if options.command is Command.deploy_wheel:
assert not options.project is None
if options.command is Command.deploy_wheel:
assert not options.project is None
self.deploy_wheel(
project_name=options.project,
argv=args,
output_dir=options.output_dir,
mypy=True,
ruff=True,
pyright=True,
)
elif options.command is Command.pyright:
self.pyright(
project_name=options.project,
argv=args,
)
elif options.command is Command.ruff:
self.ruff(
project_name=options.project,
argv=args,
)
elif options.command is Command.meson_setup:
assert not options.project is None
self.deploy_wheel(
project_name=options.project,
argv=args,
output_dir=options.output_dir,
mypy=True,
ruff=True,
pyright=True,
)
elif options.command is Command.pyright:
self.pyright(
project_name=options.project,
argv=args,
)
elif options.command is Command.ruff:
self.ruff(
project_name=options.project,
argv=args,
)
elif options.command is Command.meson_setup:
assert not options.project is None
self.meson_setup(
project_name=options.project,
argv=args,
force=options.force,
)
elif options.command is Command.mypy:
self.mypy(
argv=args,
)
elif options.command is Command.tests:
for k, v in self.projects.items():
subprocess.check_call(
[
sys.executable,
'-m',
'unittest',
'online.fxreader.pr34.tests.test_crypto',
*args,
],
cwd=str(v.source_dir),
)
else:
raise NotImplementedError
self.meson_setup(
project_name=options.project,
argv=args,
force=options.force,
)
elif options.command is Command.mypy:
self.mypy(
argv=args,
)
elif options.command is Command.tests:
for k, v in self.projects.items():
subprocess.check_call([
sys.executable,
'-m',
'unittest',
'online.fxreader.pr34.tests.test_crypto',
*args,
], cwd=str(v.source_dir))
else:
raise NotImplementedError
if __name__ == '__main__':
CLI().run()
CLI().run()

@ -10,329 +10,326 @@ import os
import logging
from typing import (Optional, Any,)
from typing import (
Optional,
Any,
)
from typing_extensions import (
Self, BinaryIO,
Self,
BinaryIO,
)
logger = logging.getLogger(__name__)
def toml_load(f: BinaryIO) -> Any:
try:
import tomllib
return tomllib.load(f)
except:
pass
try:
import tomllib
try:
import tomli
return tomli.load(f)
except:
pass
return tomllib.load(f)
except:
pass
try:
import tomli
return tomli.load(f)
except:
pass
raise NotImplementedError
raise NotImplementedError
@dataclasses.dataclass
class PyProject:
path: pathlib.Path
dependencies: dict[str, list[str]]
early_features: Optional[list[str]] = None
pip_find_links: Optional[list[pathlib.Path]] = None
runtime_libdirs: Optional[list[pathlib.Path]] = None
runtime_preload: Optional[list[pathlib.Path]] = None
requirements: dict[str, pathlib.Path] = dataclasses.field(default_factory=lambda : dict())
path: pathlib.Path
dependencies: dict[str, list[str]]
early_features: Optional[list[str]] = None
pip_find_links: Optional[list[pathlib.Path]] = None
runtime_libdirs: Optional[list[pathlib.Path]] = None
runtime_preload: Optional[list[pathlib.Path]] = None
requirements: dict[str, pathlib.Path] = dataclasses.field(default_factory=lambda: dict())
def pyproject_load(
d: pathlib.Path,
d: pathlib.Path,
) -> PyProject:
with io.open(d, 'rb') as f:
content = toml_load(f)
with io.open(d, 'rb') as f:
content = toml_load(f)
assert isinstance(content, dict)
assert isinstance(content, dict)
dependencies : dict[str, list[str]] = dict()
dependencies: dict[str, list[str]] = dict()
dependencies['default'] = content['project']['dependencies']
dependencies['default'] = content['project']['dependencies']
if (
'optional-dependencies' in content['project']
):
assert isinstance(
content['project']['optional-dependencies'],
dict
)
if 'optional-dependencies' in content['project']:
assert isinstance(content['project']['optional-dependencies'], dict)
for k, v in content['project']['optional-dependencies'].items():
assert isinstance(v, list)
assert isinstance(k, str)
for k, v in content['project']['optional-dependencies'].items():
assert isinstance(v, list)
assert isinstance(k, str)
dependencies[k] = v
dependencies[k] = v
res = PyProject(
path=d,
dependencies=dependencies,
)
res = PyProject(
path=d,
dependencies=dependencies,
)
tool_name = 'online.fxreader.pr34'.replace('.', '-')
tool_name = 'online.fxreader.pr34'.replace('.', '-')
if 'tool' in content and isinstance(content['tool'], dict) and tool_name in content['tool'] and isinstance(content['tool'][tool_name], dict):
if 'early_features' in content['tool'][tool_name]:
res.early_features = content['tool'][tool_name]['early_features']
if (
'tool' in content and
isinstance(
content['tool'], dict
) and
tool_name in content['tool'] and
isinstance(
content['tool'][tool_name],
dict
)
):
if 'early_features' in content['tool'][tool_name]:
res.early_features = content['tool'][tool_name]['early_features']
if 'pip_find_links' in content['tool'][tool_name]:
res.pip_find_links = [d.parent / pathlib.Path(o) for o in content['tool'][tool_name]['pip_find_links']]
if 'pip_find_links' in content['tool'][tool_name]:
res.pip_find_links = [
d.parent / pathlib.Path(o)
for o in content['tool'][tool_name]['pip_find_links']
]
if 'runtime_libdirs' in content['tool'][tool_name]:
res.runtime_libdirs = [
d.parent / pathlib.Path(o)
# pathlib.Path(o)
for o in content['tool'][tool_name]['runtime_libdirs']
]
if 'runtime_libdirs' in content['tool'][tool_name]:
res.runtime_libdirs = [
d.parent / pathlib.Path(o)
# pathlib.Path(o)
for o in content['tool'][tool_name]['runtime_libdirs']
]
if 'runtime_preload' in content['tool'][tool_name]:
res.runtime_preload = [
d.parent / pathlib.Path(o)
# pathlib.Path(o)
for o in content['tool'][tool_name]['runtime_preload']
]
if 'runtime_preload' in content['tool'][tool_name]:
res.runtime_preload = [
d.parent / pathlib.Path(o)
# pathlib.Path(o)
for o in content['tool'][tool_name]['runtime_preload']
]
if 'requirements' in content['tool'][tool_name]:
assert isinstance(content['tool'][tool_name]['requirements'], dict)
if 'requirements' in content['tool'][tool_name]:
assert isinstance(content['tool'][tool_name]['requirements'], dict)
res.requirements = {
k: d.parent / pathlib.Path(v)
# pathlib.Path(o)
for k, v in content['tool'][tool_name]['requirements'].items()
}
res.requirements = {
k : d.parent / pathlib.Path(v)
# pathlib.Path(o)
for k, v in content['tool'][tool_name]['requirements'].items()
}
return res
return res
@dataclasses.dataclass
class BootstrapSettings:
env_path: pathlib.Path
python_path: pathlib.Path
base_dir: pathlib.Path
python_version: Optional[str] = dataclasses.field(
default_factory=lambda : os.environ.get(
'PYTHON_VERSION',
'%d.%d' % (
sys.version_info.major,
sys.version_info.minor,
),
).strip()
)
uv_args: list[str] = dataclasses.field(
default_factory=lambda : os.environ.get(
'UV_ARGS',
'--offline',
).split(),
)
env_path: pathlib.Path
python_path: pathlib.Path
base_dir: pathlib.Path
python_version: Optional[str] = dataclasses.field(
default_factory=lambda: os.environ.get(
'PYTHON_VERSION',
'%d.%d'
% (
sys.version_info.major,
sys.version_info.minor,
),
).strip()
)
uv_args: list[str] = dataclasses.field(
default_factory=lambda: os.environ.get(
'UV_ARGS',
'--offline',
).split(),
)
@classmethod
def get(
cls,
base_dir: Optional[pathlib.Path] = None,
) -> Self:
if base_dir is None:
base_dir = pathlib.Path.cwd()
@classmethod
def get(
cls,
base_dir: Optional[pathlib.Path] = None,
) -> Self:
if base_dir is None:
base_dir = pathlib.Path.cwd()
env_path = base_dir / '.venv'
python_path = env_path / 'bin' / 'python3'
env_path = base_dir / '.venv'
python_path = env_path / 'bin' / 'python3'
return cls(
base_dir=base_dir,
env_path=env_path,
python_path=python_path,
)
return cls(
base_dir=base_dir,
env_path=env_path,
python_path=python_path,
)
def env_bootstrap(
bootstrap_settings: BootstrapSettings,
pyproject: PyProject,
bootstrap_settings: BootstrapSettings,
pyproject: PyProject,
) -> None:
pip_find_links : list[pathlib.Path] = []
pip_find_links: list[pathlib.Path] = []
if not pyproject.pip_find_links is None:
pip_find_links.extend(pyproject.pip_find_links)
if not pyproject.pip_find_links is None:
pip_find_links.extend(pyproject.pip_find_links)
pip_find_links_args = sum([
['-f', str(o),]
for o in pip_find_links
], [])
pip_find_links_args = sum(
[
[
'-f',
str(o),
]
for o in pip_find_links
],
[],
)
features : list[str] = []
features: list[str] = []
if pyproject.early_features:
features.extend(pyproject.early_features)
if pyproject.early_features:
features.extend(pyproject.early_features)
requirements_python_version: Optional[str] = None
if not bootstrap_settings.python_version is None:
requirements_python_version = bootstrap_settings.python_version.replace('.', '_')
requirements_python_version: Optional[str] = None
if not bootstrap_settings.python_version is None:
requirements_python_version = bootstrap_settings.python_version.replace('.', '_')
requirements_name = '_'.join(sorted(features))
if requirements_python_version:
requirements_name += '_' + requirements_python_version
requirements_path: Optional[pathlib.Path] = None
if requirements_name in pyproject.requirements:
requirements_path = pyproject.requirements[requirements_name]
else:
requirements_path = pyproject.path.parent / 'requirements.txt'
requirements_in: list[str] = []
requirements_in.extend(['uv', 'pip', 'build', 'setuptools', 'meson-python', 'pybind11'])
if pyproject.early_features:
early_dependencies = sum([pyproject.dependencies[o] for o in pyproject.early_features], [])
logger.info(
dict(
early_dependencies=early_dependencies,
)
)
requirements_in.extend(early_dependencies)
# if len(early_dependencies) > 0:
# subprocess.check_call([
# bootstrap_settings.python_path,
# '-m',
# 'uv', 'pip', 'install',
# *pip_find_links_args,
# # '-f', str(pathlib.Path(__file__).parent / 'deps' / 'dist'),
# *bootstrap_settings.uv_args,
# *early_dependencies,
# ])
if not requirements_path.exists():
with tempfile.NamedTemporaryFile(
mode='w',
prefix='requirements',
suffix='.in',
) as f:
f.write('\n'.join(requirements_in))
f.flush()
subprocess.check_call(
[
'uv',
'pip',
'compile',
'--generate-hashes',
*pip_find_links_args,
# '-p',
# bootstrap_settings.python_path,
*bootstrap_settings.uv_args,
'-o',
str(requirements_path),
f.name,
]
)
uv_python_version: list[str] = []
if not bootstrap_settings.python_version is None:
uv_python_version.extend(
[
'-p',
bootstrap_settings.python_version,
]
)
subprocess.check_call(
[
'uv',
'venv',
*uv_python_version,
*pip_find_links_args,
# '--seed',
*bootstrap_settings.uv_args,
str(bootstrap_settings.env_path),
]
)
subprocess.check_call(
[
'uv',
'pip',
'install',
*pip_find_links_args,
'-p',
bootstrap_settings.python_path,
'--require-hashes',
*bootstrap_settings.uv_args,
'-r',
str(requirements_path),
]
)
requirements_name = '_'.join(sorted(features))
def paths_equal(a: pathlib.Path | str, b: pathlib.Path | str) -> bool:
return os.path.abspath(str(a)) == os.path.abspath(str(b))
if requirements_python_version:
requirements_name += '_' + requirements_python_version
requirements_path : Optional[pathlib.Path] = None
if requirements_name in pyproject.requirements:
requirements_path = pyproject.requirements[requirements_name]
else:
requirements_path = pyproject.path.parent / 'requirements.txt'
requirements_in : list[str] = []
requirements_in.extend([
'uv', 'pip', 'build', 'setuptools', 'meson-python', 'pybind11'
])
if pyproject.early_features:
early_dependencies = sum([
pyproject.dependencies[o]
for o in pyproject.early_features
], [])
logger.info(dict(
early_dependencies=early_dependencies,
))
requirements_in.extend(early_dependencies)
# if len(early_dependencies) > 0:
# subprocess.check_call([
# bootstrap_settings.python_path,
# '-m',
# 'uv', 'pip', 'install',
# *pip_find_links_args,
# # '-f', str(pathlib.Path(__file__).parent / 'deps' / 'dist'),
# *bootstrap_settings.uv_args,
# *early_dependencies,
# ])
if not requirements_path.exists():
with tempfile.NamedTemporaryFile(
mode='w',
prefix='requirements',
suffix='.in',
) as f:
f.write(
'\n'.join(requirements_in)
)
f.flush()
subprocess.check_call([
'uv',
'pip',
'compile',
'--generate-hashes',
*pip_find_links_args,
# '-p',
# bootstrap_settings.python_path,
*bootstrap_settings.uv_args,
'-o', str(requirements_path),
f.name,
])
uv_python_version: list[str] = []
if not bootstrap_settings.python_version is None:
uv_python_version.extend([
'-p', bootstrap_settings.python_version,
])
subprocess.check_call([
'uv', 'venv',
*uv_python_version,
*pip_find_links_args,
# '--seed',
*bootstrap_settings.uv_args,
str(bootstrap_settings.env_path)
])
subprocess.check_call([
'uv',
'pip',
'install',
*pip_find_links_args,
'-p',
bootstrap_settings.python_path,
'--require-hashes',
*bootstrap_settings.uv_args,
'-r', str(requirements_path),
])
def paths_equal(
a: pathlib.Path | str,
b: pathlib.Path | str
) -> bool:
return (
os.path.abspath(str(a)) ==
os.path.abspath(str(b))
)
def run(
d: Optional[pathlib.Path] = None,
cli_path: Optional[pathlib.Path] = None,
d: Optional[pathlib.Path] = None,
cli_path: Optional[pathlib.Path] = None,
) -> None:
if cli_path is None:
cli_path = pathlib.Path(__file__).parent / 'cli.py'
if cli_path is None:
cli_path = pathlib.Path(__file__).parent / 'cli.py'
if d is None:
d = pathlib.Path(__file__).parent / 'pyproject.toml'
if d is None:
d = pathlib.Path(__file__).parent / 'pyproject.toml'
bootstrap_settings = BootstrapSettings.get()
bootstrap_settings = BootstrapSettings.get()
pyproject : PyProject = pyproject_load(
d
)
pyproject: PyProject = pyproject_load(d)
logging.basicConfig(level=logging.INFO)
logging.basicConfig(level=logging.INFO)
if not bootstrap_settings.env_path.exists():
env_bootstrap(
bootstrap_settings=bootstrap_settings,
pyproject=pyproject,
)
if not bootstrap_settings.env_path.exists():
env_bootstrap(
bootstrap_settings=bootstrap_settings,
pyproject=pyproject,
)
logger.info([sys.executable, sys.argv, bootstrap_settings.python_path])
logger.info([sys.executable, sys.argv, bootstrap_settings.python_path])
if not paths_equal(sys.executable, bootstrap_settings.python_path):
os.execv(
str(bootstrap_settings.python_path),
[
str(bootstrap_settings.python_path),
*sys.argv,
]
)
if not paths_equal(sys.executable, bootstrap_settings.python_path):
os.execv(
str(bootstrap_settings.python_path),
[
str(bootstrap_settings.python_path),
*sys.argv,
],
)
os.execv(
str(bootstrap_settings.python_path),
[
str(bootstrap_settings.python_path),
str(cli_path),
*sys.argv[1:],
],
)
os.execv(
str(bootstrap_settings.python_path),
[
str(bootstrap_settings.python_path),
str(
cli_path
),
*sys.argv[1:],
]
)
if __name__ == '__main__':
run(
d=pathlib.Path(__file__).parent / 'pyproject.toml',
cli_path=pathlib.Path(__file__).parent / 'cli.py',
)
run(
d=pathlib.Path(__file__).parent / 'pyproject.toml',
cli_path=pathlib.Path(__file__).parent / 'cli.py',
)
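
Review note: `env_bootstrap` looks up a pinned requirements file by a key built from the sorted feature names plus the Python version with dots replaced by underscores. A small sketch of that derivation; the function name `requirements_name` and the sample feature names are hypothetical, chosen only for illustration:

```python
from typing import Optional


def requirements_name(features: list[str], python_version: Optional[str]) -> str:
    # Sorting makes the key independent of the order features are listed in.
    name = '_'.join(sorted(features))
    # '3.12' becomes '3_12'; an empty or missing version adds no suffix.
    if python_version:
        name += '_' + python_version.replace('.', '_')
    return name


print(requirements_name(['lint', 'dev'], '3.12'))
print(requirements_name([], '3.12'))
```

Note that with no early features the key starts with an underscore (for example `_3_12`), so a matching entry in the `requirements` table of `pyproject.toml` would presumably need to use that exact spelling.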

File diff suppressed because it is too large

@ -1,27 +1,28 @@
__all__ = (
'parse_args',
)
__all__ = ('parse_args',)
import sys
import argparse
from typing import (Optional,)
from typing import (
Optional,
)
def parse_args(
parser: argparse.ArgumentParser,
args: Optional[list[str]] = None,
parser: argparse.ArgumentParser,
args: Optional[list[str]] = None,
) -> tuple[argparse.Namespace, list[str]]:
if args is None:
args = sys.argv[1:]
if args is None:
args = sys.argv[1:]
argv : list[str] = []
argv: list[str] = []
for i, o in enumerate(args):
if o == '--':
argv.extend(args[i + 1:])
for i, o in enumerate(args):
if o == '--':
argv.extend(args[i + 1 :])
del args[i:]
del args[i:]
break
break
return parser.parse_args(args), argv
return parser.parse_args(args), argv
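
Review note: `parse_args` splits the command line at the first `--`, parses the left side itself, and returns the right side untouched. The sketch below shows the same split with a hypothetical helper, `split_on_double_dash`, that does not mutate its input (the function in the diff deletes from `args` in place); the `--verbose` flag is also just an example:

```python
import argparse


def split_on_double_dash(args: list[str]) -> tuple[list[str], list[str]]:
    # Everything before the first '--' belongs to this parser,
    # everything after it is passed through unchanged.
    if '--' in args:
        i = args.index('--')
        return args[:i], args[i + 1 :]
    return list(args), []


parser = argparse.ArgumentParser()
parser.add_argument('--verbose', action='store_true')

own, passthrough = split_on_double_dash(['--verbose', '--', 'ruff', 'check', '--fix'])
options = parser.parse_args(own)
print(options.verbose, passthrough)
```

Because the split happens before `parser.parse_args`, flags such as `--fix` on the right-hand side never reach argparse and cannot trigger an "unrecognized arguments" error.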

@ -1,14 +1,23 @@
import logging
import asyncio
from typing import (Any,)
from typing import (
Any,
)
logger = logging.getLogger(__name__)
def handle_task_result(fut: asyncio.Future[Any]) -> None:
try:
fut.result()
logger.debug(dict(fut=fut, msg='done'), stacklevel=2,)
except:
logger.exception('', stacklevel=2,)
def handle_task_result(fut: asyncio.Future[Any]) -> None:
try:
fut.result()
logger.debug(
dict(fut=fut, msg='done'),
stacklevel=2,
)
except:
logger.exception(
'',
stacklevel=2,
)
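
Review note: `handle_task_result` has the shape of a done-callback for background tasks. A hedged usage sketch, assuming it is attached with `Task.add_done_callback`; the `failing` coroutine and `main` are hypothetical and exist only to exercise the callback:

```python
import asyncio
import logging
from typing import Any

logger = logging.getLogger(__name__)


def handle_task_result(fut: asyncio.Future[Any]) -> None:
    # Same body as in the diff; the bare except is kept to mirror it.
    try:
        fut.result()
        logger.debug(dict(fut=fut, msg='done'), stacklevel=2)
    except:
        logger.exception('', stacklevel=2)


async def failing() -> None:
    raise RuntimeError('boom')


async def main() -> bool:
    task = asyncio.create_task(failing())
    # The callback retrieves the exception, so it is logged here rather
    # than reported later as "Task exception was never retrieved".
    task.add_done_callback(handle_task_result)
    await asyncio.wait([task])
    return task.done()


done = asyncio.run(main())
print(done)
```

Calling `fut.result()` inside the callback is what surfaces the error; without it a fire-and-forget task can fail silently until the task object is garbage collected.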

File diff suppressed because it is too large

@ -11,532 +11,519 @@ import os
import logging
from typing import (Optional, Any, cast, Type, TypeVar,)
from typing import (
Optional,
Any,
cast,
Type,
TypeVar,
)
from typing_extensions import (
Self, BinaryIO, overload,
Self,
BinaryIO,
overload,
)
logger = logging.getLogger(__name__)
def toml_load(f: BinaryIO) -> Any:
try:
import tomllib
return tomllib.load(f)
except:
pass
try:
import tomllib
try:
import tomli
return tomli.load(f)
except:
pass
return tomllib.load(f)
except:
pass
try:
import tomli
return tomli.load(f)
except:
pass
raise NotImplementedError
raise NotImplementedError
@dataclasses.dataclass
class PyProject:
@dataclasses.dataclass
class Module:
name: str
meson: Optional[pathlib.Path] = None
tool: dict[str, Any] = dataclasses.field(default_factory=lambda : dict())
@dataclasses.dataclass
class Module:
name: str
meson: Optional[pathlib.Path] = None
tool: dict[str, Any] = dataclasses.field(default_factory=lambda: dict())
path: pathlib.Path
dependencies: dict[str, list[str]]
early_features: Optional[list[str]] = None
pip_find_links: Optional[list[pathlib.Path]] = None
runtime_libdirs: Optional[list[pathlib.Path]] = None
runtime_preload: Optional[list[pathlib.Path]] = None
requirements: dict[str, pathlib.Path] = dataclasses.field(default_factory=lambda: dict())
path: pathlib.Path
dependencies: dict[str, list[str]]
early_features: Optional[list[str]] = None
pip_find_links: Optional[list[pathlib.Path]] = None
runtime_libdirs: Optional[list[pathlib.Path]] = None
runtime_preload: Optional[list[pathlib.Path]] = None
requirements: dict[str, pathlib.Path] = dataclasses.field(default_factory=lambda : dict())
modules: list[Module] = dataclasses.field(
default_factory=lambda: [],
)
modules: list[Module] = dataclasses.field(
default_factory=lambda : [],
)
tool: dict[str, Any] = dataclasses.field(
default_factory=lambda: dict(),
)
tool: dict[str, Any] = dataclasses.field(
default_factory=lambda : dict(),
)
Key = TypeVar('Key')
Value = TypeVar('Value')
@overload
def check_dict(
value: Any,
KT: Type[Key],
VT: Type[Value],
value: Any,
KT: Type[Key],
VT: Type[Value],
) -> dict[Key, Value]: ...
@overload
def check_dict(
value: Any,
KT: Type[Key],
value: Any,
KT: Type[Key],
) -> dict[Key, Any]: ...
def check_dict(
value: Any,
KT: Type[Key],
VT: Optional[Type[Value]] = None,
value: Any,
KT: Type[Key],
VT: Optional[Type[Value]] = None,
) -> dict[Key, Value]:
assert isinstance(value, dict)
value2 = cast(dict[Any, Any], value)
assert isinstance(value, dict)
value2 = cast(dict[Any, Any], value)
assert all([
isinstance(k, KT) and (
VT is None or
isinstance(v, VT)
)
for k, v in value2.items()
])
assert all([isinstance(k, KT) and (VT is None or isinstance(v, VT)) for k, v in value2.items()])
if VT is None:
return cast(
dict[Key, Any],
value,
)
else:
return cast(
dict[Key, Value],
value,
)
if VT is None:
return cast(
dict[Key, Any],
value,
)
else:
return cast(
dict[Key, Value],
value,
)
@overload
def check_list(
value: Any,
VT: Type[Value],
value: Any,
VT: Type[Value],
) -> list[Value]: ...
@overload
def check_list(
value: Any,
value: Any,
) -> list[Any]: ...
def check_list(
value: Any,
VT: Optional[Type[Value]] = None,
value: Any,
VT: Optional[Type[Value]] = None,
) -> list[Value] | list[Any]:
assert isinstance(value, list)
value2 = cast(list[Any], value)
assert isinstance(value, list)
value2 = cast(list[Any], value)
assert all([
(
VT is None or
isinstance(o, VT)
)
for o in value2
])
assert all([(VT is None or isinstance(o, VT)) for o in value2])
if VT is None:
return cast(
list[Any],
value,
)
else:
return cast(
list[Value],
value,
)
if VT is None:
return cast(
list[Any],
value,
)
else:
return cast(
list[Value],
value,
)
def pyproject_load(
d: pathlib.Path,
d: pathlib.Path,
) -> PyProject:
with io.open(d, 'rb') as f:
content = toml_load(f)
with io.open(d, 'rb') as f:
content = toml_load(f)
assert isinstance(content, dict)
assert isinstance(content, dict)
dependencies : dict[str, list[str]] = dict()
dependencies: dict[str, list[str]] = dict()
dependencies['default'] = content['project']['dependencies']
dependencies['default'] = content['project']['dependencies']
if (
'optional-dependencies' in content['project']
):
assert isinstance(
content['project']['optional-dependencies'],
dict
)
if 'optional-dependencies' in content['project']:
assert isinstance(content['project']['optional-dependencies'], dict)
for k, v in check_dict(
check_dict(
check_dict(
content,
str,
# Any,
)['project'],
str,
# Any,
)['optional-dependencies'],
str,
list[Any],
).items():
# assert isinstance(v, list)
# assert isinstance(k, str)
for k, v in check_dict(
check_dict(
check_dict(
content,
str,
# Any,
)['project'],
str,
# Any,
)['optional-dependencies'],
str,
list[Any],
).items():
# assert isinstance(v, list)
# assert isinstance(k, str)
dependencies[k] = v
dependencies[k] = v
res = PyProject(
path=d,
dependencies=dependencies,
)
res = PyProject(
path=d,
dependencies=dependencies,
)
tool_name = 'online.fxreader.pr34'.replace('.', '-')
tool_name = 'online.fxreader.pr34'.replace('.', '-')
if 'tool' in content:
res.tool = check_dict(
content['tool'],
str,
)
if (
'tool' in content
):
res.tool = check_dict(
content['tool'],
str,
)
if 'tool' in content and isinstance(content['tool'], dict) and tool_name in content['tool'] and isinstance(content['tool'][tool_name], dict):
pr34_tool = check_dict(
check_dict(
content['tool'],
str,
)[tool_name],
str,
)
if (
'tool' in content and
isinstance(
content['tool'], dict
) and
tool_name in content['tool'] and
isinstance(
content['tool'][tool_name],
dict
)
):
pr34_tool = check_dict(
check_dict(
content['tool'],
str,
)[tool_name],
str
)
if 'early_features' in pr34_tool:
res.early_features = pr34_tool['early_features']
if 'early_features' in pr34_tool:
res.early_features = pr34_tool['early_features']
if 'pip_find_links' in pr34_tool:
res.pip_find_links = [d.parent / pathlib.Path(o) for o in pr34_tool['pip_find_links']]
if 'pip_find_links' in pr34_tool:
res.pip_find_links = [
d.parent / pathlib.Path(o)
for o in pr34_tool['pip_find_links']
]
if 'runtime_libdirs' in pr34_tool:
res.runtime_libdirs = [
d.parent / pathlib.Path(o)
# pathlib.Path(o)
for o in pr34_tool['runtime_libdirs']
]
if 'runtime_libdirs' in pr34_tool:
res.runtime_libdirs = [
d.parent / pathlib.Path(o)
# pathlib.Path(o)
for o in pr34_tool['runtime_libdirs']
]
if 'runtime_preload' in pr34_tool:
res.runtime_preload = [
d.parent / pathlib.Path(o)
# pathlib.Path(o)
for o in pr34_tool['runtime_preload']
]
if 'runtime_preload' in pr34_tool:
res.runtime_preload = [
d.parent / pathlib.Path(o)
# pathlib.Path(o)
for o in pr34_tool['runtime_preload']
]
if 'requirements' in pr34_tool:
res.requirements = {
k: d.parent / pathlib.Path(v)
# pathlib.Path(o)
for k, v in check_dict(pr34_tool['requirements'], str, str).items()
}
if 'requirements' in pr34_tool:
res.requirements = {
k : d.parent / pathlib.Path(v)
# pathlib.Path(o)
for k, v in check_dict(
pr34_tool['requirements'],
str,
str
).items()
}
if 'modules' in pr34_tool:
modules = check_list(pr34_tool['modules'])
# res.modules = []
if 'modules' in pr34_tool:
modules = check_list(
pr34_tool['modules']
)
# res.modules = []
for o in modules:
assert isinstance(o, dict)
assert 'name' in o and isinstance(o['name'], str)
for o in modules:
assert isinstance(o, dict)
assert 'name' in o and isinstance(o['name'], str)
module = PyProject.Module(
name=o['name'],
)
module = PyProject.Module(
name=o['name'],
)
if 'meson' in o:
assert 'meson' in o and isinstance(o['meson'], str)
if 'meson' in o:
assert 'meson' in o and isinstance(o['meson'], str)
module.meson = pathlib.Path(o['meson'])
module.meson = pathlib.Path(o['meson'])
if 'tool' in o:
module.tool.update(
check_dict(
o['tool'],
str,
)
)
if 'tool' in o:
module.tool.update(
check_dict(
o['tool'],
str,
)
)
res.modules.append(module)
res.modules.append(module)
return res
return res
@dataclasses.dataclass
class BootstrapSettings:
env_path: pathlib.Path
python_path: pathlib.Path
base_dir: pathlib.Path
python_version: Optional[str] = dataclasses.field(
default_factory=lambda : os.environ.get(
'PYTHON_VERSION',
'%d.%d' % (
sys.version_info.major,
sys.version_info.minor,
),
).strip()
)
pip_check_conflicts: Optional[bool] = dataclasses.field(
default_factory=lambda : os.environ.get(
'PIP_CHECK_CONFLICTS',
json.dumps(True)
) in [json.dumps(True)],
)
uv_args: list[str] = dataclasses.field(
default_factory=lambda : os.environ.get(
'UV_ARGS',
'--offline',
).split(),
)
env_path: pathlib.Path
python_path: pathlib.Path
base_dir: pathlib.Path
python_version: Optional[str] = dataclasses.field(
default_factory=lambda: os.environ.get(
'PYTHON_VERSION',
'%d.%d'
% (
sys.version_info.major,
sys.version_info.minor,
),
).strip()
)
pip_check_conflicts: Optional[bool] = dataclasses.field(
default_factory=lambda: os.environ.get('PIP_CHECK_CONFLICTS', json.dumps(True)) in [json.dumps(True)],
)
uv_args: list[str] = dataclasses.field(
default_factory=lambda: os.environ.get(
'UV_ARGS',
'--offline',
).split(),
)
@classmethod
def get(
cls,
base_dir: Optional[pathlib.Path] = None,
) -> Self:
if base_dir is None:
base_dir = pathlib.Path.cwd()
@classmethod
def get(
cls,
base_dir: Optional[pathlib.Path] = None,
) -> Self:
if base_dir is None:
base_dir = pathlib.Path.cwd()
env_path: Optional[pathlib.Path] = None
if 'ENV_PATH' in os.environ:
env_path = pathlib.Path(os.environ['ENV_PATH'])
else:
env_path = base_dir / '.venv'
env_path: Optional[pathlib.Path] = None
if 'ENV_PATH' in os.environ:
env_path = pathlib.Path(os.environ['ENV_PATH'])
else:
env_path = base_dir / '.venv'
python_path = env_path / 'bin' / 'python3'
python_path = env_path / 'bin' / 'python3'
return cls(
base_dir=base_dir,
env_path=env_path,
python_path=python_path,
)
return cls(
base_dir=base_dir,
env_path=env_path,
python_path=python_path,
)
class requirements_name_get_t:
@dataclasses.dataclass
class res_t:
not_compiled : pathlib.Path
compiled: pathlib.Path
name: str
@dataclasses.dataclass
class res_t:
not_compiled: pathlib.Path
compiled: pathlib.Path
name: str
def requirements_name_get(
source_dir: pathlib.Path,
python_version: Optional[str],
features: list[str],
requirements: dict[str, pathlib.Path],
source_dir: pathlib.Path,
python_version: Optional[str],
features: list[str],
requirements: dict[str, pathlib.Path],
) -> requirements_name_get_t.res_t:
requirements_python_version: Optional[str] = None
if not python_version is None:
requirements_python_version = \
python_version.replace('.', '_')
requirements_python_version: Optional[str] = None
if not python_version is None:
requirements_python_version = python_version.replace('.', '_')
requirements_name = '_'.join(sorted(features))
requirements_name = '_'.join(sorted(features))
if requirements_python_version:
requirements_name += '_' + requirements_python_version
if requirements_python_version:
requirements_name += '_' + requirements_python_version
requirements_path : Optional[pathlib.Path] = None
requirements_path: Optional[pathlib.Path] = None
if requirements_name in requirements:
requirements_path = requirements[requirements_name]
else:
requirements_path = source_dir / 'requirements.txt'
if requirements_name in requirements:
requirements_path = requirements[requirements_name]
else:
requirements_path = source_dir / 'requirements.txt'
requirements_path_in = requirements_path.parent / (
requirements_path.stem + '.in'
)
requirements_path_in = requirements_path.parent / (requirements_path.stem + '.in')
requirements_in : list[str] = []
requirements_in: list[str] = []
return requirements_name_get_t.res_t(
not_compiled=requirements_path_in,
compiled=requirements_path,
name=requirements_name,
)
return requirements_name_get_t.res_t(
not_compiled=requirements_path_in,
compiled=requirements_path,
name=requirements_name,
)
def env_bootstrap(
bootstrap_settings: BootstrapSettings,
pyproject: PyProject,
bootstrap_settings: BootstrapSettings,
pyproject: PyProject,
) -> None:
pip_find_links : list[pathlib.Path] = []
pip_find_links: list[pathlib.Path] = []
if not pyproject.pip_find_links is None:
pip_find_links.extend(pyproject.pip_find_links)
if not pyproject.pip_find_links is None:
pip_find_links.extend(pyproject.pip_find_links)
pip_find_links_args = sum([
['-f', str(o),]
for o in pip_find_links
], cast(list[str], []))
pip_find_links_args = sum(
[
[
'-f',
str(o),
]
for o in pip_find_links
],
cast(list[str], []),
)
features : list[str] = []
features: list[str] = []
if pyproject.early_features:
features.extend(pyproject.early_features)
if pyproject.early_features:
features.extend(pyproject.early_features)
requirements_name_get_res = requirements_name_get(
python_version=bootstrap_settings.python_version,
features=features,
requirements=pyproject.requirements,
source_dir=pyproject.path.parent,
)
requirements_path = requirements_name_get_res.compiled
requirements_name_get_res = requirements_name_get(
python_version=bootstrap_settings.python_version,
features=features,
requirements=pyproject.requirements,
source_dir=pyproject.path.parent,
)
requirements_path = requirements_name_get_res.compiled
requirements_in : list[str] = []
requirements_in: list[str] = []
requirements_in.extend([
'uv', 'pip', 'build', 'setuptools', 'meson-python', 'pybind11'
])
requirements_in.extend(['uv', 'pip', 'build', 'setuptools', 'meson-python', 'pybind11'])
if pyproject.early_features:
early_dependencies = sum([
pyproject.dependencies[o]
for o in pyproject.early_features
], cast(list[str], []))
if pyproject.early_features:
early_dependencies = sum([pyproject.dependencies[o] for o in pyproject.early_features], cast(list[str], []))
logger.info(dict(
requirements_name_get_res=requirements_name_get_res,
early_dependencies=early_dependencies,
))
logger.info(
dict(
requirements_name_get_res=requirements_name_get_res,
early_dependencies=early_dependencies,
)
)
requirements_in.extend(early_dependencies)
# if len(early_dependencies) > 0:
# subprocess.check_call([
# bootstrap_settings.python_path,
# '-m',
# 'uv', 'pip', 'install',
# *pip_find_links_args,
# # '-f', str(pathlib.Path(__file__).parent / 'deps' / 'dist'),
# *bootstrap_settings.uv_args,
# *early_dependencies,
# ])
requirements_in.extend(early_dependencies)
# if len(early_dependencies) > 0:
# subprocess.check_call([
# bootstrap_settings.python_path,
# '-m',
# 'uv', 'pip', 'install',
# *pip_find_links_args,
# # '-f', str(pathlib.Path(__file__).parent / 'deps' / 'dist'),
# *bootstrap_settings.uv_args,
# *early_dependencies,
# ])
if not requirements_path.exists():
with tempfile.NamedTemporaryFile(
mode='w',
prefix='requirements',
suffix='.in',
) as f:
f.write(
'\n'.join(requirements_in)
)
f.flush()
if not requirements_path.exists():
with tempfile.NamedTemporaryFile(
mode='w',
prefix='requirements',
suffix='.in',
) as f:
f.write('\n'.join(requirements_in))
f.flush()
subprocess.check_call([
'uv',
'pip',
'compile',
'--generate-hashes',
*pip_find_links_args,
# '-p',
# bootstrap_settings.python_path,
*bootstrap_settings.uv_args,
'-o', str(requirements_path),
f.name,
])
subprocess.check_call(
[
'uv',
'pip',
'compile',
'--generate-hashes',
*pip_find_links_args,
# '-p',
# bootstrap_settings.python_path,
*bootstrap_settings.uv_args,
'-o',
str(requirements_path),
f.name,
]
)
uv_python_version: list[str] = []
uv_python_version: list[str] = []
if not bootstrap_settings.python_version is None:
uv_python_version.extend([
'-p', bootstrap_settings.python_version,
])
if not bootstrap_settings.python_version is None:
uv_python_version.extend(
[
'-p',
bootstrap_settings.python_version,
]
)
subprocess.check_call([
'uv', 'venv',
*uv_python_version,
*pip_find_links_args,
# '--seed',
*bootstrap_settings.uv_args,
str(bootstrap_settings.env_path)
])
subprocess.check_call(
[
'uv',
'venv',
*uv_python_version,
*pip_find_links_args,
# '--seed',
*bootstrap_settings.uv_args,
str(bootstrap_settings.env_path),
]
)
subprocess.check_call([
'uv',
'pip',
'install',
*pip_find_links_args,
'-p',
bootstrap_settings.python_path,
'--require-hashes',
*bootstrap_settings.uv_args,
'-r', str(requirements_path),
])
subprocess.check_call(
[
'uv',
'pip',
'install',
*pip_find_links_args,
'-p',
bootstrap_settings.python_path,
'--require-hashes',
*bootstrap_settings.uv_args,
'-r',
str(requirements_path),
]
)
if bootstrap_settings.pip_check_conflicts:
subprocess.check_call([
bootstrap_settings.python_path,
'-m',
'online.fxreader.pr34.commands',
'pip_check_conflicts',
])
if bootstrap_settings.pip_check_conflicts:
subprocess.check_call(
[
bootstrap_settings.python_path,
'-m',
'online.fxreader.pr34.commands',
'pip_check_conflicts',
]
)
def paths_equal(a: pathlib.Path | str, b: pathlib.Path | str) -> bool:
return os.path.abspath(str(a)) == os.path.abspath(str(b))
def paths_equal(
a: pathlib.Path | str,
b: pathlib.Path | str
) -> bool:
return (
os.path.abspath(str(a)) ==
os.path.abspath(str(b))
)
def run(
d: Optional[pathlib.Path] = None,
cli_path: Optional[pathlib.Path] = None,
d: Optional[pathlib.Path] = None,
cli_path: Optional[pathlib.Path] = None,
) -> None:
if cli_path is None:
cli_path = pathlib.Path(__file__).parent / 'cli.py'
if cli_path is None:
cli_path = pathlib.Path(__file__).parent / 'cli.py'
if d is None:
d = pathlib.Path(__file__).parent / 'pyproject.toml'
if d is None:
d = pathlib.Path(__file__).parent / 'pyproject.toml'
bootstrap_settings = BootstrapSettings.get()
bootstrap_settings = BootstrapSettings.get()
pyproject : PyProject = pyproject_load(
d
)
pyproject: PyProject = pyproject_load(d)
logging.basicConfig(level=logging.INFO)
logging.basicConfig(level=logging.INFO)
if not bootstrap_settings.env_path.exists():
env_bootstrap(
bootstrap_settings=bootstrap_settings,
pyproject=pyproject,
)
if not bootstrap_settings.env_path.exists():
env_bootstrap(
bootstrap_settings=bootstrap_settings,
pyproject=pyproject,
)
logger.info([sys.executable, sys.argv, bootstrap_settings.python_path])
logger.info([sys.executable, sys.argv, bootstrap_settings.python_path])
if not paths_equal(sys.executable, bootstrap_settings.python_path):
os.execv(
str(bootstrap_settings.python_path),
[
str(bootstrap_settings.python_path),
*sys.argv,
]
)
if not paths_equal(sys.executable, bootstrap_settings.python_path):
os.execv(
str(bootstrap_settings.python_path),
[
str(bootstrap_settings.python_path),
*sys.argv,
],
)
os.execv(
str(bootstrap_settings.python_path),
[
str(bootstrap_settings.python_path),
str(cli_path),
*sys.argv[1:],
],
)
os.execv(
str(bootstrap_settings.python_path),
[
str(bootstrap_settings.python_path),
str(
cli_path
),
*sys.argv[1:],
]
)
if __name__ == '__main__':
run()
run()
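Note on the hunk above: the naming rule inside `requirements_name_get` can be sketched in isolation as follows. This is a minimal illustration, not the project's code; the helper name `requirements_name` is hypothetical, and only the string logic (sorted features joined by `_`, then the Python version with dots replaced by underscores) is taken from the diff.

```python
from typing import Optional


def requirements_name(features: list[str], python_version: Optional[str]) -> str:
    # Hypothetical helper: mirrors only the name-building steps shown in the diff.
    name = '_'.join(sorted(features))

    # The diff appends the version only when it is set and non-empty.
    if python_version:
        name += '_' + python_version.replace('.', '_')

    return name


if __name__ == '__main__':
    print(requirements_name(['lint', 'build'], '3.12'))
    print(requirements_name(['lint', 'build'], None))
```

The resulting name is then looked up in the `requirements` mapping; when it is absent, the diff falls back to `requirements.txt` in the source directory.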

@ -4,88 +4,95 @@ import os
import cryptography.hazmat.primitives.kdf.scrypt
import cryptography.exceptions
from typing import (Literal, overload, Optional,)
from typing import (
Literal,
overload,
Optional,
)
class PasswordUtils:
@overload
@classmethod
def secret_hash(
cls,
secret: str | bytes,
mode: Literal['base64'],
salt: Optional[bytes] = None,
) -> tuple[str, str]: ...
@overload
@classmethod
def secret_hash(
cls,
secret: str | bytes,
mode: Literal['base64'],
salt: Optional[bytes] = None,
) -> tuple[str, str]: ...
@overload
@classmethod
def secret_hash(
cls,
secret: str | bytes,
mode: Literal['bytes'],
salt: Optional[bytes] = None,
) -> tuple[bytes, bytes]: ...
@overload
@classmethod
def secret_hash(
cls,
secret: str | bytes,
mode: Literal['bytes'],
salt: Optional[bytes] = None,
) -> tuple[bytes, bytes]: ...
@classmethod
def secret_hash(
cls,
secret: str | bytes,
mode: Literal['bytes', 'base64'],
salt: Optional[bytes] = None,
) -> tuple[str, str] | tuple[bytes, bytes]:
if salt is None:
salt = os.urandom(16)
@classmethod
def secret_hash(
cls,
secret: str | bytes,
mode: Literal['bytes', 'base64'],
salt: Optional[bytes] = None,
) -> tuple[str, str] | tuple[bytes, bytes]:
if salt is None:
salt = os.urandom(16)
if isinstance(secret, str):
secret = secret.encode('utf-8')
# derive
kdf = cls._scrypt_init(salt=salt)
if isinstance(secret, str):
secret = secret.encode('utf-8')
# derive
kdf = cls._scrypt_init(salt=salt)
hashed_secret = kdf.derive(secret)
hashed_secret = kdf.derive(secret)
if mode == 'bytes':
return (salt, hashed_secret)
elif mode == 'base64':
res_tuple = tuple((
base64.b64encode(o).decode('utf-8')
for o in (salt, hashed_secret,)
))
return (res_tuple[0], res_tuple[1])
else:
raise NotImplementedError
if mode == 'bytes':
return (salt, hashed_secret)
elif mode == 'base64':
res_tuple = tuple(
(
base64.b64encode(o).decode('utf-8')
for o in (
salt,
hashed_secret,
)
)
)
return (res_tuple[0], res_tuple[1])
else:
raise NotImplementedError
@classmethod
def _scrypt_init(
cls,
salt: bytes
) -> cryptography.hazmat.primitives.kdf.scrypt.Scrypt:
return cryptography.hazmat.primitives.kdf.scrypt.Scrypt(
salt=salt,
length=32,
n=2**14,
r=8,
p=1,
)
@classmethod
def _scrypt_init(cls, salt: bytes) -> cryptography.hazmat.primitives.kdf.scrypt.Scrypt:
return cryptography.hazmat.primitives.kdf.scrypt.Scrypt(
salt=salt,
length=32,
n=2**14,
r=8,
p=1,
)
@classmethod
def secret_check(
cls,
secret: str | bytes,
salt: str | bytes,
hashed_secret: str | bytes,
) -> bool:
if isinstance(salt, str):
salt = base64.b64decode(salt)
@classmethod
def secret_check(
cls,
secret: str | bytes,
salt: str | bytes,
hashed_secret: str | bytes,
) -> bool:
if isinstance(salt, str):
salt = base64.b64decode(salt)
if isinstance(secret, str):
secret = secret.encode('utf-8')
if isinstance(secret, str):
secret = secret.encode('utf-8')
if isinstance(hashed_secret, str):
hashed_secret = base64.b64decode(hashed_secret)
if isinstance(hashed_secret, str):
hashed_secret = base64.b64decode(hashed_secret)
kdf = cls._scrypt_init(salt=salt)
kdf = cls._scrypt_init(salt=salt)
try:
kdf.verify(secret, hashed_secret)
return True
except cryptography.exceptions.InvalidKey:
return False
try:
kdf.verify(secret, hashed_secret)
return True
except cryptography.exceptions.InvalidKey:
return False
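Note on the hunk above: the hash-then-verify round trip of `PasswordUtils` can be sketched with the standard library alone. This is an assumption-laden illustration: it swaps the `cryptography` Scrypt class used in the diff for `hashlib.scrypt`, keeps the same parameters (`n=2**14`, `r=8`, `p=1`, 32-byte output, 16-byte random salt), and the function names below only imitate the originals.

```python
import hashlib
import hmac
import os
from typing import Optional


def secret_hash(secret: str, salt: Optional[bytes] = None) -> tuple[bytes, bytes]:
    # Sketch only: hashlib.scrypt stands in for cryptography's Scrypt KDF.
    if salt is None:
        salt = os.urandom(16)

    hashed = hashlib.scrypt(
        secret.encode('utf-8'),
        salt=salt,
        n=2**14,
        r=8,
        p=1,
        dklen=32,
    )
    return (salt, hashed)


def secret_check(secret: str, salt: bytes, hashed_secret: bytes) -> bool:
    # Re-derive with the stored salt and compare in constant time.
    _, candidate = secret_hash(secret, salt=salt)
    return hmac.compare_digest(candidate, hashed_secret)


if __name__ == '__main__':
    salt, hashed = secret_hash('hunter2')
    print(secret_check('hunter2', salt, hashed))
    print(secret_check('wrong', salt, hashed))
```

The salt must be stored next to the hash, which is why both versions return the pair rather than the hash alone.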

@ -1,35 +1,39 @@
import os
import logging
from typing import (Optional,)
from typing import (
Optional,
)
logger = logging.getLogger(__name__)
class DebugPy:
@classmethod
def set_trace(
cls,
host: Optional[str] = None,
port: Optional[int] = None,
wait: Optional[bool] = None,
) -> None:
if host is None:
host = '127.0.0.1'
if port is None:
port = 4444
if wait is None:
wait = True
@classmethod
def set_trace(
cls,
host: Optional[str] = None,
port: Optional[int] = None,
wait: Optional[bool] = None,
) -> None:
if host is None:
host = '127.0.0.1'
if port is None:
port = 4444
if wait is None:
wait = True
import debugpy
import debugpy
if os.environ.get('DEBUGPY_RUNNING') != 'true':
logger.info('debugpy init')
import debugpy
debugpy.listen((host, port))
os.environ['DEBUGPY_RUNNING'] = 'true'
if os.environ.get('DEBUGPY_RUNNING') != 'true':
logger.info('debugpy init')
import debugpy
if wait:
debugpy.wait_for_client()
debugpy.breakpoint()
debugpy.listen((host, port))
os.environ['DEBUGPY_RUNNING'] = 'true'
logger.info('debugpy done')
if wait:
debugpy.wait_for_client()
debugpy.breakpoint()
logger.info('debugpy done')

@ -1,16 +1,14 @@
import logging
from typing import (Optional,)
from typing import (
Optional,
)
def setup(level: Optional[int] = None) -> None:
if level is None:
level = logging.INFO
if level is None:
level = logging.INFO
logging.basicConfig(
level=level,
format=(
'%(levelname)s:%(name)s:%(message)s'
':%(process)d'
':%(asctime)s'
':%(pathname)s:%(funcName)s:%(lineno)s'
),
)
logging.basicConfig(
level=level,
format=('%(levelname)s:%(name)s:%(message)s:%(process)d:%(asctime)s:%(pathname)s:%(funcName)s:%(lineno)s'),
)

@ -9,208 +9,232 @@ import logging
import sys
import argparse
from pydantic import (Field,)
from pydantic import (
Field,
)
from typing import (ClassVar, Generator, Annotated, Optional, Any,)
from typing import (
ClassVar,
Generator,
Annotated,
Optional,
Any,
)
logger = logging.getLogger(__name__)
@pydantic.dataclasses.dataclass
class MypyFormatEntry:
name : str
value : str
name: str
value: str
def __eq__(self, other: object) -> bool:
if not isinstance(other, type(self)):
raise NotImplementedError
def __eq__(self, other: object) -> bool:
if not isinstance(other, type(self)):
raise NotImplementedError
return self.value == other.value
return self.value == other.value
class MypyFormat:
vscode : ClassVar[MypyFormatEntry] = MypyFormatEntry(name='vscode', value='vscode')
json : ClassVar[MypyFormatEntry] = MypyFormatEntry(name='json', value='json')
vscode: ClassVar[MypyFormatEntry] = MypyFormatEntry(name='vscode', value='vscode')
json: ClassVar[MypyFormatEntry] = MypyFormatEntry(name='json', value='json')
@classmethod
def from_value(cls, value: str) -> MypyFormatEntry:
for e in cls.entries():
if value == e.value:
return e
@classmethod
def from_value(cls, value: str) -> MypyFormatEntry:
for e in cls.entries():
if value == e.value:
return e
raise NotImplementedError
raise NotImplementedError
@classmethod
def entries(
cls,
) -> Generator[
MypyFormatEntry,
None,
None,
]:
for o in dir(cls):
e = getattr(cls, o)
if not isinstance(e, MypyFormatEntry):
continue
@classmethod
def entries(cls) -> Generator[MypyFormatEntry, None, None,]:
for o in dir(cls):
e = getattr(cls, o)
if not isinstance(e, MypyFormatEntry):
continue
yield e
yield e
class MypySettings(pydantic_settings.BaseSettings):
model_config = pydantic_settings.SettingsConfigDict(
env_prefix='online_fxreader_pr34_mypy_',
case_sensitive=False,
)
model_config = pydantic_settings.SettingsConfigDict(
env_prefix='online_fxreader_pr34_mypy_',
case_sensitive=False,
)
config_path: pathlib.Path = pathlib.Path.cwd() / '.mypy.ini'
max_errors: dict[str, int] = dict()
paths: Annotated[list[pathlib.Path], Field(default_factory=lambda: ['.'])]
config_path : pathlib.Path = pathlib.Path.cwd() / '.mypy.ini'
max_errors : dict[str, int] = dict()
paths : Annotated[list[pathlib.Path], Field(default_factory=lambda : ['.'])]
def run(
argv: Optional[list[str]] = None,
settings: Optional[MypySettings] = None,
argv: Optional[list[str]] = None,
settings: Optional[MypySettings] = None,
) -> None:
if argv is None:
argv = []
if argv is None:
argv = []
if settings is None:
settings = MypySettings.model_validate(dict())
if settings is None:
settings = MypySettings.model_validate(dict())
parser = argparse.ArgumentParser()
parser.add_argument(
'-q', '--quiet',
dest='quiet',
action='store_true',
help='do not print anything if the program is correct according to max_errors limits',
default=False,
)
parser.add_argument(
'-i',
dest='paths',
help='specify paths to check',
default=[],
action='append',
)
parser.add_argument(
'-f', '--format',
dest='_format',
help='output format of errors',
default=MypyFormat.json.value,
choices=[
o.value
for o in MypyFormat.entries()
],
)
options, args = parser.parse_known_args(argv)
parser = argparse.ArgumentParser()
parser.add_argument(
'-q',
'--quiet',
dest='quiet',
action='store_true',
help='do not print anything if the program is correct according to max_errors limits',
default=False,
)
parser.add_argument(
'-i',
dest='paths',
help='specify paths to check',
default=[],
action='append',
)
parser.add_argument(
'-f',
'--format',
dest='_format',
help='output format of errors',
default=MypyFormat.json.value,
choices=[o.value for o in MypyFormat.entries()],
)
options, args = parser.parse_known_args(argv)
if len(args) > 0 and args[0] == '--':
del args[0]
if len(args) > 0 and args[0] == '--':
del args[0]
options.format = MypyFormat.from_value(options._format)
options.format = MypyFormat.from_value(options._format)
if len(options.paths) == 0:
options.paths.extend(settings.paths)
if len(options.paths) == 0:
options.paths.extend(settings.paths)
started_at = datetime.datetime.now()
started_at = datetime.datetime.now()
mypy_cmd = [
sys.executable,
'-m',
'mypy',
'--config-file', str(settings.config_path),
'--strict',
'-O',
'json',
*args,
*options.paths,
]
mypy_cmd = [
sys.executable,
'-m',
'mypy',
'--config-file',
str(settings.config_path),
'--strict',
'-O',
'json',
*args,
*options.paths,
]
logger.info(dict(cmd=mypy_cmd))
logger.info(dict(cmd=mypy_cmd))
res = subprocess.run(
mypy_cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
)
res = subprocess.run(
mypy_cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
)
done_at = datetime.datetime.now()
done_at = datetime.datetime.now()
try:
assert not res.returncode is None
try:
assert not res.returncode is None
errors = sorted(
[json.loads(o) for o in res.stdout.decode('utf-8').splitlines() if not o.strip() == ''],
key=lambda x: (
x.get('file', ''),
x.get('line', 0),
),
)
errors = sorted([
json.loads(o)
for o in res.stdout.decode('utf-8').splitlines()
if not o.strip() == ''
], key=lambda x: (
x.get('file', ''),
x.get('line', 0),
))
if not options.quiet:
if (len(res.stderr)) > 0:
logger.error(res.stderr.decode('utf-8'))
except:
logger.exception('')
logger.error(res.stdout.decode('utf-8'))
logger.error(res.stderr.decode('utf-8'))
sys.exit(res.returncode)
if not options.quiet:
if (len(res.stderr)) > 0:
logger.error(res.stderr.decode('utf-8'))
except:
logger.exception('')
logger.error(res.stdout.decode('utf-8'))
logger.error(res.stderr.decode('utf-8'))
sys.exit(res.returncode)
g: dict[str, Any] = dict()
for o in errors:
if not o['file'] in g:
g[o['file']] = []
g[o['file']].append(o)
h = {
k: len(v)
for k, v in sorted(
list(g.items()),
key=lambda x: x[0],
)
}
g : dict[str, Any] = dict()
for o in errors:
if not o['file'] in g:
g[o['file']] = []
g[o['file']].append(o)
mentioned_paths = marisa_trie.Trie(list(h))
h = {
k : len(v)
for k, v in sorted(
list(g.items()),
key=lambda x: x[0],
)
}
violated_limits: dict[str, str] = dict()
mentioned_paths = marisa_trie.Trie(list(h))
for k, v in settings.max_errors.items():
matching_paths = mentioned_paths.keys(k)
total_errors = sum([h[o] for o in matching_paths], 0)
violated_limits : dict[str, str] = dict()
if total_errors > v:
violated_limits[k] = '%s - [%s]: has %d errors > %d' % (
k,
', '.join(matching_paths),
total_errors,
v,
)
for k, v in settings.max_errors.items():
matching_paths = mentioned_paths.keys(k)
total_errors = sum([
h[o]
for o in matching_paths
], 0)
if len(violated_limits) > 0 or not options.quiet:
if options.format == MypyFormat.vscode:
for o in errors:
sys.stdout.write(
'[%s] %s:%d,%d %s - %s - %s\n'
% (
o['severity'],
o['file'],
o['line'],
o['column'],
o['message'],
o['hint'],
o['code'],
)
)
sys.stdout.flush()
# logger.info(json.dumps(errors, indent=4))
else:
logger.info(json.dumps(errors, indent=4))
if total_errors > v:
violated_limits[k] = '%s - [%s]: has %d errors > %d' % (
k, ', '.join(matching_paths), total_errors, v,
)
# if len(violated_limits) > 0:
# logger.info(json.dumps(violated_limits, indent=4))
logger.info(
json.dumps(
dict(
max_errors=settings.max_errors,
violated_limits=violated_limits,
histogram=h,
elapsed=(done_at - started_at).total_seconds(),
),
indent=4,
)
)
if len(violated_limits) > 0 or not options.quiet:
if options.format == MypyFormat.vscode:
for o in errors:
sys.stdout.write('[%s] %s:%d,%d %s - %s - %s\n' % (
o['severity'],
o['file'],
o['line'],
o['column'],
o['message'],
o['hint'],
o['code'],
))
sys.stdout.flush()
#logger.info(json.dumps(errors, indent=4))
else:
logger.info(json.dumps(errors, indent=4))
if len(violated_limits) > 0:
sys.exit(1)
#if len(violated_limits) > 0:
# logger.info(json.dumps(violated_limits, indent=4))
logger.info(json.dumps(dict(
max_errors=settings.max_errors,
violated_limits=violated_limits,
histogram=h,
elapsed=(done_at - started_at).total_seconds(),
), indent=4))
if len(violated_limits) > 0:
sys.exit(1)
if __name__ == '__main__':
from . import logging as _logging
_logging.setup()
run(sys.argv[1:])
from . import logging as _logging
_logging.setup()
run(sys.argv[1:])
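Note on the hunk above: the per-file histogram and the `max_errors` check in `run` can be sketched without the subprocess call. This is a simplified illustration under stated assumptions: the function names are hypothetical, and a plain `str.startswith` scan replaces the `marisa_trie.Trie.keys(prefix)` lookup used in the diff, which should select the same paths for a given prefix.

```python
import json


def count_errors_per_file(mypy_stdout: str) -> dict[str, int]:
    # Each non-empty line of `mypy -O json` output is one JSON object.
    histogram: dict[str, int] = {}
    for line in mypy_stdout.splitlines():
        if line.strip() == '':
            continue
        entry = json.loads(line)
        histogram[entry['file']] = histogram.get(entry['file'], 0) + 1
    return dict(sorted(histogram.items()))


def find_violations(histogram: dict[str, int], max_errors: dict[str, int]) -> dict[str, str]:
    # A limit applies to every file whose path starts with the configured prefix.
    violated: dict[str, str] = {}
    for prefix, limit in max_errors.items():
        matching = [path for path in histogram if path.startswith(prefix)]
        total = sum(histogram[path] for path in matching)
        if total > limit:
            violated[prefix] = '%s - [%s]: has %d errors > %d' % (
                prefix,
                ', '.join(matching),
                total,
                limit,
            )
    return violated


if __name__ == '__main__':
    sample = '\n'.join([
        json.dumps(dict(file='pkg/a.py', line=1)),
        json.dumps(dict(file='pkg/a.py', line=7)),
        json.dumps(dict(file='pkg/b.py', line=3)),
    ])
    h = count_errors_per_file(sample)
    print(h)
    print(find_violations(h, {'pkg/': 2}))
```

In the original, a non-empty result is what triggers `sys.exit(1)`.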

@ -11,112 +11,115 @@ import dataclasses
logger = logging.getLogger(__name__)
from typing import (overload, Optional, Literal, Any, Annotated,)
from typing import (
overload,
Optional,
Literal,
Any,
Annotated,
)
from .cli_bootstrap import PyProject
@overload
def shutil_which(
name: str,
raise_on_failure: Literal[True],
name: str,
raise_on_failure: Literal[True],
) -> str: ...
@overload
def shutil_which(
name: str,
raise_on_failure: bool,
name: str,
raise_on_failure: bool,
) -> Optional[str]: ...
def shutil_which(
name: str,
raise_on_failure: bool,
name: str,
raise_on_failure: bool,
) -> Optional[str]:
res = shutil.which(name)
if res is None and raise_on_failure:
raise NotImplementedError
else:
return res
res = shutil.which(name)
if res is None and raise_on_failure:
raise NotImplementedError
else:
return res
def runtime_libdirs_init(
project: PyProject,
project: PyProject,
) -> None:
if sys.platform == 'linux':
ld_library_path : list[pathlib.Path] = [
o
for o in [
*[
o.absolute()
for o in (
project.runtime_libdirs
if project.runtime_libdirs
else []
)
],
*[
pathlib.Path(o)
for o in os.environ.get(
'LD_LIBRARY_PATH',
''
).split(os.path.pathsep)
if o != ''
]
]
]
if sys.platform == 'linux':
ld_library_path: list[pathlib.Path] = [
o
for o in [
*[o.absolute() for o in (project.runtime_libdirs if project.runtime_libdirs else [])],
*[pathlib.Path(o) for o in os.environ.get('LD_LIBRARY_PATH', '').split(os.path.pathsep) if o != ''],
]
]
ld_library_path_present : list[pathlib.Path] = []
ld_library_path_present: list[pathlib.Path] = []
for o in ld_library_path:
if not o.exists():
logger.warning(dict(
ld_library_path=o,
msg='not found',
))
for o in ld_library_path:
if not o.exists():
logger.warning(
dict(
ld_library_path=o,
msg='not found',
)
)
ld_library_path_present.append(o)
ld_library_path_present.append(o)
os.environ.update(
LD_LIBRARY_PATH=os.path.pathsep.join([
str(o) for o in ld_library_path_present
])
)
os.environ.update(LD_LIBRARY_PATH=os.path.pathsep.join([str(o) for o in ld_library_path_present]))
for preload_path in (project.runtime_preload or []):
for preload_found in glob.glob(str(
preload_path.parent / ('lib%s.so' % preload_path.name)
)):
logger.info(dict(
preload_path=preload_path, preload_found=preload_found,
# lib_path=o,
msg='load_library',
))
for preload_path in project.runtime_preload or []:
for preload_found in glob.glob(str(preload_path.parent / ('lib%s.so' % preload_path.name))):
logger.info(
dict(
preload_path=preload_path,
preload_found=preload_found,
# lib_path=o,
msg='load_library',
)
)
ctypes.cdll.LoadLibrary(preload_found)
else:
raise NotImplementedError
ctypes.cdll.LoadLibrary(preload_found)
else:
raise NotImplementedError
class interfaces_index_t:
@dataclasses.dataclass
class Interface:
@dataclasses.dataclass
class AddrInfo:
family: str
local: str
@dataclasses.dataclass
class Interface:
@dataclasses.dataclass
class AddrInfo:
family: str
local: str
name: Annotated[
str,
pydantic.Field(
alias='ifname',
),
]
addr_info: list[AddrInfo]
name: Annotated[
str,
pydantic.Field(
alias='ifname',
)
]
addr_info: list[AddrInfo]
def interfaces_index() -> list[interfaces_index_t.Interface]:
res = pydantic.RootModel[
list[interfaces_index_t.Interface]
].model_validate_json(
subprocess.check_output([
'ip', '-j', 'addr',
]).decode('utf-8')
).root
res = (
pydantic.RootModel[list[interfaces_index_t.Interface]]
.model_validate_json(
subprocess.check_output(
[
'ip',
'-j',
'addr',
]
).decode('utf-8')
)
.root
)
return res
return res

File diff suppressed because it is too large

@ -6,22 +6,23 @@ from typing import Any
from typing_extensions import Protocol
from abc import abstractmethod
C = typing.TypeVar("C", bound="Comparable")
C = typing.TypeVar('C', bound='Comparable')
class Comparable(Protocol):
@abstractmethod
def __eq__(self, other: Any) -> bool:
pass
@abstractmethod
def __eq__(self, other: Any) -> bool:
pass
@abstractmethod
def __lt__(self: C, other: C) -> bool:
pass
@abstractmethod
def __lt__(self: C, other: C) -> bool:
pass
def __gt__(self: C, other: C) -> bool:
return (not self < other) and self != other
def __gt__(self: C, other: C) -> bool:
return (not self < other) and self != other
def __le__(self: C, other: C) -> bool:
return self < other or self == other
def __le__(self: C, other: C) -> bool:
return self < other or self == other
def __ge__(self: C, other: C) -> bool:
return (not self < other)
def __ge__(self: C, other: C) -> bool:
return not self < other
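Note on the hunk above: a class that subclasses `Comparable` explicitly only has to supply `__eq__` and `__lt__`; the remaining comparisons come from the protocol's default methods. The sketch below restates the protocol using `typing.Protocol` from the standard library (the diff imports it from `typing_extensions`), and the `Version` class is a hypothetical example that does not appear in the repository.

```python
from abc import abstractmethod
from typing import Any, Protocol, TypeVar

C = TypeVar('C', bound='Comparable')


class Comparable(Protocol):
    @abstractmethod
    def __eq__(self, other: Any) -> bool:
        pass

    @abstractmethod
    def __lt__(self: C, other: C) -> bool:
        pass

    def __gt__(self: C, other: C) -> bool:
        return (not self < other) and self != other

    def __le__(self: C, other: C) -> bool:
        return self < other or self == other

    def __ge__(self: C, other: C) -> bool:
        return not self < other


class Version(Comparable):
    # Hypothetical example type: ordered by (major, minor).
    def __init__(self, major: int, minor: int) -> None:
        self.major = major
        self.minor = minor

    def __eq__(self, other: Any) -> bool:
        return isinstance(other, Version) and (self.major, self.minor) == (other.major, other.minor)

    def __lt__(self, other: 'Version') -> bool:
        return (self.major, self.minor) < (other.major, other.minor)


if __name__ == '__main__':
    print(Version(1, 3) > Version(1, 2))
    print(Version(1, 2) <= Version(1, 2))
```

The derived `__ge__` assumes a total order, since it is defined purely as the negation of `__lt__`.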

@ -5,121 +5,107 @@ import pprint
async def f1():
devices = await bleak.BleakScanner.discover()
return devices
devices = await bleak.BleakScanner.discover()
return devices
async def f2(device, timeout=None):
if timeout is None:
timeout = 1.0
if timeout is None:
timeout = 1.0
assert isinstance(timeout, float) and timeout >= 1e-8
assert isinstance(timeout, float) and timeout >= 1e-8
p = await bleak.BleakClient(
device,
timeout=timeout,
).__aenter__()
return p
p = await bleak.BleakClient(
device,
timeout=timeout,
).__aenter__()
return p
async def f3(client):
t1 = [
dict(
service=o.__dict__,
characteristics=[
o2.__dict__
for o2 in o.characteristics
]
)
for o in client.services
]
return t1
t1 = [dict(service=o.__dict__, characteristics=[o2.__dict__ for o2 in o.characteristics]) for o in client.services]
return t1
async def f5(
name_check=None,
name_check=None,
):
t2 = []
t2 = []
attempt = 0
attempt = 0
while True:
t1 = await f1()
pprint.pprint([o.__dict__ for o in t1])
while True:
t1 = await f1()
pprint.pprint([o.__dict__ for o in t1])
if not name_check is None:
assert inspect.isfunction(name_check)
if not name_check is None:
assert inspect.isfunction(name_check)
t5 = {
i : o.details[0].name()
for i, o in enumerate(t1)
}
t5 = {i: o.details[0].name() for i, o in enumerate(t1)}
t2.extend(
[
t1[k]
for k, v in t5.items()
if isinstance(v, str) and name_check(v)
]
)
else:
t2.extend(t1)
t2.extend([t1[k] for k, v in t5.items() if isinstance(v, str) and name_check(v)])
else:
t2.extend(t1)
if len(t2) > 0:
break
if len(t2) > 0:
break
attempt += 1
print('\rattempt #%d' % attempt, end='')
attempt += 1
print('\rattempt #%d' % attempt, end='')
return t2
return t2
async def f4(
timeout=None,
characteristics=None,
operations=None,
name_check=None,
timeout=None,
characteristics=None,
operations=None,
name_check=None,
):
if isinstance(name_check, str):
assert name_check in [
'watch fit',
]
name_check2 = lambda current_name: name_check.lower() in current_name.lower()
else:
name_check2 = name_check
if isinstance(name_check, str):
assert name_check in [
'watch fit',
]
name_check2 = lambda current_name: name_check.lower() in current_name.lower()
else:
name_check2 = name_check
assert not name_check2 is None
assert not name_check2 is None
if characteristics is None:
characteristics = [
'0000ffd1-0000-1000-8000-00805f9b34fb',
]
if characteristics is None:
characteristics = [
'0000ffd1-0000-1000-8000-00805f9b34fb',
]
t2 = await f5(
name_check=name_check2,
)
t2 = await f5(
name_check=name_check2,
)
if len(t2) == 0:
print('not found')
return
if len(t2) == 0:
print('not found')
return
t3 = None
try:
t3 = await f2(t2[0], timeout=timeout)
t4 = await f3(t3)
pprint.pprint(t4)
t3 = None
try:
t3 = await f2(t2[0], timeout=timeout)
t4 = await f3(t3)
pprint.pprint(t4)
if not operations is None and inspect.isfunction(operations):
await operations(
client=t3,
t4=t4,
)
else:
t6 = {}
for o in characteristics:
try:
t7 = await t3.read_gatt_char(o)
except Exception as exception:
print(traceback.format_exc())
t7 = None
t6[o] = t7
pprint.pprint(t6)
finally:
if not t3 is None:
await t3.disconnect()
if not operations is None and inspect.isfunction(operations):
await operations(
client=t3,
t4=t4,
)
else:
t6 = {}
for o in characteristics:
try:
t7 = await t3.read_gatt_char(o)
except Exception as exception:
print(traceback.format_exc())
t7 = None
t6[o] = t7
pprint.pprint(t6)
finally:
if not t3 is None:
await t3.disconnect()

@ -10,162 +10,149 @@ import threading
import cython
import datetime
from typing import (Any, Optional, TypeVar, Type, cast)
from typing import Any, Optional, TypeVar, Type, cast
# from scoping import scoping as s
def test(
_id: int,
T: float,
a: numpy.ndarray[Any, numpy.dtype[numpy.int32]],
) -> None:
with cython.nogil:
#if True:
started_at = datetime.datetime.now()
print('started')
def elapsed() -> float:
return (datetime.datetime.now() - started_at).total_seconds()
#a = 0
while elapsed() < T:
#a += 1
for k in range(1024 * 1024):
a[_id] += 1
print(['done', started_at, elapsed(), a[_id]])
def test(
_id: int,
T: float,
a: numpy.ndarray[Any, numpy.dtype[numpy.int32]],
) -> None:
with cython.nogil:
# if True:
started_at = datetime.datetime.now()
print('started')
def elapsed() -> float:
return (datetime.datetime.now() - started_at).total_seconds()
# a = 0
while elapsed() < T:
# a += 1
for k in range(1024 * 1024):
a[_id] += 1
print(['done', started_at, elapsed(), a[_id]])
M = TypeVar('M', bound=Type[Any])
def build(content: str, module: M) -> M:
import pathlib
import tempfile
import hashlib
import Cython.Build.Inline
import pathlib
import tempfile
import hashlib
import Cython.Build.Inline
sha256sum = hashlib.sha256(content.encode('utf-8')).digest().hex()
sha256sum = hashlib.sha256(content.encode('utf-8')).digest().hex()
output_dir = (pathlib.Path('.') / 'tmp' / 'cython' / sha256sum).absolute()
output_dir = (pathlib.Path('.') / 'tmp' / 'cython' / sha256sum).absolute()
if not output_dir.exists() or True:
os.makedirs(str(output_dir), exist_ok=True)
if not output_dir.exists() or True:
os.makedirs(str(output_dir), exist_ok=True)
source_path = output_dir / ('_%s.pyx' % sha256sum)
if not source_path.exists():
with io.open(str(source_path), 'w') as f:
f.write(content)
source_path = output_dir / ('_%s.pyx' % sha256sum)
if not source_path.exists():
with io.open(str(source_path), 'w') as f:
f.write(content)
t1 = Cython.Build.Inline._get_build_extension()
t1.extensions = Cython.Build.cythonize(str(source_path))
t1.build_temp = str(pathlib.Path('/'))
t1.build_lib = str(output_dir)
# t2 = Cython.Build.Inline.Extension(
# name=sha256sum,
# )
t1.run()
t1 = Cython.Build.Inline._get_build_extension()
t1.extensions = Cython.Build.cythonize(str(source_path))
t1.build_temp = str(pathlib.Path('/'))
t1.build_lib = str(output_dir)
#t2 = Cython.Build.Inline.Extension(
# name=sha256sum,
#)
t1.run()
return cast(M, Cython.Build.Inline.load_dynamic('_%s' % sha256sum, glob.glob(str(output_dir / ('_%s*.so' % sha256sum)))[0]))
return cast(
M,
Cython.Build.Inline.load_dynamic(
'_%s' % sha256sum,
glob.glob(
str(output_dir / ('_%s*.so' % sha256sum))
)[0]
)
)
raise NotImplementedError
raise NotImplementedError
def mypyc_build(file_path: pathlib.Path) -> Any:
import pathlib
import tempfile
import hashlib
import mypyc.build
import Cython.Build.Inline
import pathlib
import tempfile
import hashlib
import mypyc.build
import Cython.Build.Inline
assert isinstance(file_path, pathlib.Path)
assert isinstance(file_path, pathlib.Path)
#sha256sum = hashlib.sha256(content.encode('utf-8')).digest().hex()
# sha256sum = hashlib.sha256(content.encode('utf-8')).digest().hex()
#output_dir = (pathlib.Path('.') / 'tmp' / 'cython' / sha256sum).absolute()
output_dir = pathlib.Path('.') / 'tmp' / 'mypyc'
sha256sum = file_path.stem
lib_pattern = file_path.parent / ('%s.cpython*.so' % sha256sum)
lib_dir = pathlib.Path('.')
# output_dir = (pathlib.Path('.') / 'tmp' / 'cython' / sha256sum).absolute()
output_dir = pathlib.Path('.') / 'tmp' / 'mypyc'
sha256sum = file_path.stem
lib_pattern = file_path.parent / ('%s.cpython*.so' % sha256sum)
lib_dir = pathlib.Path('.')
def lib_path_glob(path: str | pathlib.Path) -> Optional[pathlib.Path]:
res: list[str] = glob.glob(str(path))
def lib_path_glob(path: str | pathlib.Path) -> Optional[pathlib.Path]:
res : list[str] = glob.glob(str(path))
if len(res) == 0:
return None
else:
return pathlib.Path(res[0])
if len(res) == 0:
return None
else:
return pathlib.Path(res[0])
need_build: bool = False
need_build : bool = False
lib_path: Optional[pathlib.Path] = None
lib_path : Optional[pathlib.Path] = None
lib_path = lib_path_glob(lib_pattern)
lib_path = lib_path_glob(lib_pattern)
if not lib_path is None:
t2 = file_path.stat()
t3 = lib_path.stat()
if t3.st_mtime < t2.st_mtime:
need_build = True
if not lib_path is None:
t2 = file_path.stat()
t3 = lib_path.stat()
if t3.st_mtime < t2.st_mtime:
need_build = True
del t2
del t3
else:
need_build = True
del t2
del t3
else:
need_build = True
if need_build:
for o in [
output_dir,
output_dir / 'build' / file_path.parent,
]:
os.makedirs(str(o), exist_ok=True)
# source_path = output_dir / ('_%s.py' % sha256sum)
source_path = file_path
# with io.open(str(source_path), 'w') as f:
# f.write(content)
t1 = Cython.Build.Inline._get_build_extension()
t1.extensions = mypyc.build.mypycify([str(source_path)], target_dir=str(output_dir / 'build'))
t1.build_temp = str(output_dir)
t1.build_lib = str(lib_dir)
# t2 = Cython.Build.Inline.Extension(
# name=sha256sum,
# )
t1.run()
if need_build:
for o in [
output_dir,
output_dir / 'build' / file_path.parent,
]:
os.makedirs(
str(o),
exist_ok=True
)
#source_path = output_dir / ('_%s.py' % sha256sum)
source_path = file_path
#with io.open(str(source_path), 'w') as f:
# f.write(content)
lib_path = lib_path_glob(lib_pattern)
t1 = Cython.Build.Inline._get_build_extension()
t1.extensions = mypyc.build.mypycify(
[str(source_path)],
target_dir=str(output_dir / 'build')
)
t1.build_temp = str(output_dir)
t1.build_lib = str(lib_dir)
#t2 = Cython.Build.Inline.Extension(
# name=sha256sum,
#)
t1.run()
return Cython.Build.Inline.load_dynamic(
#'_%s' % sha256sum,
# t1.extensions[0].name,
file_path.stem,
str(lib_path),
)
lib_path = lib_path_glob(lib_pattern)
raise NotImplementedError
return Cython.Build.Inline.load_dynamic(
#'_%s' % sha256sum,
#t1.extensions[0].name,
file_path.stem,
str(lib_path),
)
raise NotImplementedError
class Source:
@staticmethod
def test2(
_a : numpy.ndarray[Any, numpy.dtype[numpy.int64]],
_id : numpy.dtype[numpy.int32] | int,
T : float=16
) -> int:
raise NotImplementedError
@staticmethod
def test2(_a: numpy.ndarray[Any, numpy.dtype[numpy.int64]], _id: numpy.dtype[numpy.int32] | int, T: float = 16) -> int:
raise NotImplementedError
source = build(r'''
source = build(
r"""
cimport cython
@cython.boundscheck(False)
@ -226,52 +213,52 @@ def test2(long long [:] _a, int _id, double T=16) -> int:
return _a[_id]
''', Source)
""",
Source,
)
def test_cython(N: int=4, T:int=16) -> None:
#a = [0] * N
a = numpy.zeros((N,), dtype=numpy.int64)
t = [
threading.Thread(
target=functools.partial(
source.test2,
a,
k,
T,
)
)
for k in range(N)
]
def test_cython(N: int = 4, T: int = 16) -> None:
# a = [0] * N
a = numpy.zeros((N,), dtype=numpy.int64)
for o in t:
o.start()
for o in t:
o.join()
t = [
threading.Thread(
target=functools.partial(
source.test2,
a,
k,
T,
)
)
for k in range(N)
]
#cython_module['test2'](a, 0)
for o in t:
o.start()
for o in t:
o.join()
def test_mypyc(N: int=4, W:int=35) -> None:
cython2 = mypyc_build(
(pathlib.Path(__file__).parent / 'cython2.py').relative_to(
pathlib.Path.cwd()
)
)
# cython_module['test2'](a, 0)
# from .cython2 import fib
#a = [0] * N
t = [
threading.Thread(
target=functools.partial(
cython2.fib,
W,
)
)
for k in range(N)
]
def test_mypyc(N: int = 4, W: int = 35) -> None:
cython2 = mypyc_build((pathlib.Path(__file__).parent / 'cython2.py').relative_to(pathlib.Path.cwd()))
for o in t:
o.start()
for o in t:
o.join()
# from .cython2 import fib
# a = [0] * N
t = [
threading.Thread(
target=functools.partial(
cython2.fib,
W,
)
)
for k in range(N)
]
for o in t:
o.start()
for o in t:
o.join()

@ -1,10 +1,12 @@
import time
def fib(n: int) -> int:
if n <= 1:
return n
else:
return fib(n - 2) + fib(n - 1)
if n <= 1:
return n
else:
return fib(n - 2) + fib(n - 1)
t0 = time.time()
fib(32)

@ -5,378 +5,334 @@ import os
def kernel_1_sample_scrap(
max_articles=None,
max_articles=None,
):
if max_articles is None:
max_articles = 1
if max_articles is None:
max_articles = 1
with requests.get(
'https://dev.to',
) as p:
t10 = p.content.decode('utf-8')
t11 = pyquery.PyQuery(t10)
t13 = t11('.crayons-story__title > a')
t12 = [
pyquery.PyQuery(o).attr('href')
for o in t13
]
pprint.pprint(t12)
t14 = [
'https://dev.to/%s' % o
for o in t12
]
with requests.get(
'https://dev.to',
) as p:
t10 = p.content.decode('utf-8')
t11 = pyquery.PyQuery(t10)
t13 = t11('.crayons-story__title > a')
t12 = [pyquery.PyQuery(o).attr('href') for o in t13]
pprint.pprint(t12)
t14 = ['https://dev.to/%s' % o for o in t12]
t8 = []
for t7 in t14[:max_articles]:
with requests.get(
t7,
) as p:
t1 = p.content.decode('utf-8')
t2 = pyquery.PyQuery(t1)
t3 = t2('.comment__content')
t6 = []
for o in t3:
t4 = pyquery.PyQuery(o)
t5 = t4('.comment__header > a').attr['href']
t9 = t4('.comment__body').text()
t6.append(
dict(
author=t5,
text=t9,
)
)
t8 = []
for t7 in t14[:max_articles]:
with requests.get(
t7,
) as p:
t1 = p.content.decode('utf-8')
t2 = pyquery.PyQuery(t1)
t3 = t2('.comment__content')
t6 = []
for o in t3:
t4 = pyquery.PyQuery(o)
t5 = t4('.comment__header > a').attr['href']
t9 = t4('.comment__body').text()
t6.append(
dict(
author=t5,
text=t9,
)
)
#pprint.pprint(t3)
pprint.pprint(t6)
t8.append(
dict(
article=t7,
comments=t6,
)
)
# pprint.pprint(t3)
pprint.pprint(t6)
t8.append(
dict(
article=t7,
comments=t6,
)
)
pprint.pprint(t8)
pprint.pprint(t8)
return dict(
t1=t1,
t2=t2,
t3=t3,
t6=t6,
t8=t8,
t12=t12,
)
return dict(
t1=t1,
t2=t2,
t3=t3,
t6=t6,
t8=t8,
t12=t12,
)
def kernel_2():
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU,SimpleRNN
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU, SimpleRNN
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
# %matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
# %% [markdown]
# # Configuring TPU's
#
# For this version of the notebook we will be using TPUs, as we have to build a BERT model
# %% [markdown]
# # Configuring TPU's
#
# For this version of the notebook we will be using TPUs, as we have to build a BERT model
# %% [code]
# Detect hardware, return appropriate distribution strategy
try:
# TPU detection. No parameters necessary if TPU_NAME environment variable is
# set: this is always the case on Kaggle.
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
print('Running on TPU ', tpu.master())
except ValueError:
tpu = None
# %% [code]
# Detect hardware, return appropriate distribution strategy
try:
# TPU detection. No parameters necessary if TPU_NAME environment variable is
# set: this is always the case on Kaggle.
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
print('Running on TPU ', tpu.master())
except ValueError:
tpu = None
if tpu:
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
# Default distribution strategy in Tensorflow. Works on CPU and single GPU.
strategy = tf.distribute.get_strategy()
if tpu:
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
# Default distribution strategy in Tensorflow. Works on CPU and single GPU.
strategy = tf.distribute.get_strategy()
print("REPLICAS: ", strategy.num_replicas_in_sync)
print('REPLICAS: ', strategy.num_replicas_in_sync)
# %% [code]
train = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
validation = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
# %% [code]
train = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
validation = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')
# %% [markdown]
# We will drop the other columns and approach this as a binary classification problem. We will also work on a smaller subset of the dataset (only about 12,000 data points) to make the models easier to train
# %% [markdown]
# We will drop the other columns and approach this as a binary classification problem. We will also work on a smaller subset of the dataset (only about 12,000 data points) to make the models easier to train
# %% [code]
train.drop(['severe_toxic','obscene','threat','insult','identity_hate'],axis=1,inplace=True)
# %% [code]
train.drop(['severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'], axis=1, inplace=True)
# %% [code]
train = train.loc[:12000,:]
train.shape
# %% [code]
train = train.loc[:12000, :]
train.shape
# %% [markdown]
# We will check the maximum number of words in a comment; this will help us with padding later
# %% [markdown]
# We will check the maximum number of words in a comment; this will help us with padding later
# %% [code]
train['comment_text'].apply(lambda x:len(str(x).split())).max()
# %% [code]
train['comment_text'].apply(lambda x: len(str(x).split())).max()
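The maximum comment length matters because every sequence is later padded to a common `max_len`. As a minimal sketch of what zero padding does (the helper `pad_pre` is a hypothetical stand-in, not part of this commit; it mimics the default behaviour of Keras `pad_sequences`, which pads and truncates at the front, and it assumes `maxlen > 0`):

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length.

    Hypothetical helper for illustration only; assumes maxlen > 0.
    """
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


example = pad_pre([[5, 3], [7, 1, 4, 9, 2]], maxlen=4)
print(example)
```

Short sequences gain leading zeros and long ones lose their earliest tokens, so every row ends up with exactly `maxlen` entries.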
# %% [markdown]
# ### Data Preparation
# %% [markdown]
# ### Data Preparation
# %% [code]
xtrain, xvalid, ytrain, yvalid = train_test_split(
train.comment_text.values, train.toxic.values, stratify=train.toxic.values, random_state=42, test_size=0.2, shuffle=True
)
# %% [code]
xtrain, xvalid, ytrain, yvalid = train_test_split(train.comment_text.values, train.toxic.values,
stratify=train.toxic.values,
random_state=42,
test_size=0.2, shuffle=True)
# %% [markdown]
# # Before We Begin
#
# Before we begin: if you are a complete beginner with NLP and have never worked with text data, I am attaching a few kernels that will serve as a starting point for your journey
# * https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial
# * https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle
#
# If you want a more basic dataset to practice with here is another kernel which I wrote:
# * https://www.kaggle.com/tanulsingh077/what-s-cooking
#
# Below are some resources to get started with basic neural networks; they will make the upcoming parts easier to understand
# * https://www.youtube.com/watch?v=aircAruvnKk&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv
# * https://www.youtube.com/watch?v=IHZwWFHWa-w&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv&index=2
# * https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv&index=3
# * https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv&index=4
#
# For Learning how to visualize test data and what to use view:
# * https://www.kaggle.com/tanulsingh077/twitter-sentiment-extaction-analysis-eda-and-model
# * https://www.kaggle.com/jagangupta/stop-the-s-toxic-comments-eda
# %% [markdown]
# # Before We Begin
#
# Before we begin: if you are a complete beginner with NLP and have never worked with text data, I am attaching a few kernels that will serve as a starting point for your journey
# * https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial
# * https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle
#
# If you want a more basic dataset to practice with here is another kernel which I wrote:
# * https://www.kaggle.com/tanulsingh077/what-s-cooking
#
# Below are some resources to get started with basic neural networks; they will make the upcoming parts easier to understand
# * https://www.youtube.com/watch?v=aircAruvnKk&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv
# * https://www.youtube.com/watch?v=IHZwWFHWa-w&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv&index=2
# * https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv&index=3
# * https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PL_h2yd2CGtBHEKwEH5iqTZH85wLS-eUzv&index=4
#
# For Learning how to visualize test data and what to use view:
# * https://www.kaggle.com/tanulsingh077/twitter-sentiment-extaction-analysis-eda-and-model
# * https://www.kaggle.com/jagangupta/stop-the-s-toxic-comments-eda
# %% [markdown]
# # Simple RNN
#
# ## Basic Overview
#
# What is a RNN?
#
# Recurrent Neural Networks (RNNs) are a type of neural network in which the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in cases such as predicting the next word of a sentence, the previous words are required, so there is a need to remember them. RNNs solve this with the help of a hidden layer, whose state carries information from one step to the next.
#
# Why RNN's?
#
# https://www.quora.com/Why-do-we-use-an-RNN-instead-of-a-simple-neural-network
#
# ## In-Depth Understanding
#
# * https://medium.com/mindorks/understanding-the-recurrent-neural-network-44d593f112a2
# * https://www.youtube.com/watch?v=2E65LDnM2cA&list=PL1F3ABbhcqa3BBWo170U4Ev2wfsF7FN8l
# * https://www.d2l.ai/chapter_recurrent-neural-networks/rnn.html
#
# ## Code Implementation
#
# So first I will implement the model, and then I will explain the code step by step
# %% [markdown]
# # Simple RNN
#
# ## Basic Overview
#
# What is a RNN?
#
# Recurrent Neural Networks (RNNs) are a type of neural network in which the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other, but in cases such as predicting the next word of a sentence, the previous words are required, so there is a need to remember them. RNNs solve this with the help of a hidden layer, whose state carries information from one step to the next.
#
# Why RNN's?
#
# https://www.quora.com/Why-do-we-use-an-RNN-instead-of-a-simple-neural-network
#
# ## In-Depth Understanding
#
# * https://medium.com/mindorks/understanding-the-recurrent-neural-network-44d593f112a2
# * https://www.youtube.com/watch?v=2E65LDnM2cA&list=PL1F3ABbhcqa3BBWo170U4Ev2wfsF7FN8l
# * https://www.d2l.ai/chapter_recurrent-neural-networks/rnn.html
#
# ## Code Implementation
#
# So first I will implement the model, and then I will explain the code step by step
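To make the recurrence in the overview concrete, here is a minimal sketch of a one-unit recurrent cell in plain Python. The function `rnn_forward` and the weights `w_x`, `w_h`, `b` are hypothetical names chosen for illustration; this is not how Keras implements `SimpleRNN`, only the same idea at scalar scale.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a one-unit recurrent cell over a sequence of scalar inputs.

    Illustrative only: a real layer uses weight matrices, not scalars.
    """
    h = h0
    states = []
    for x in inputs:
        # The new state depends on the current input AND the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


states = rnn_forward([1.0, 0.0, 0.0], w_x=0.5, w_h=0.8, b=0.0)
print(states)
```

Only the first input is non-zero, yet the later states are non-zero too: the hidden state is what carries information about earlier inputs forward.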
# %% [code]
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 1500
# %% [code]
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 1500
token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)
token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)
# zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)
#zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)
word_index = token.word_index
word_index = token.word_index
# %% [code]
# %%time
with strategy.scope():
# A simpleRNN without any pretrained embeddings and one dense layer
model = Sequential()
model.add(Embedding(len(word_index) + 1, 300, input_length=max_len))
model.add(SimpleRNN(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# %% [code]
#%%time
with strategy.scope():
# A simpleRNN without any pretrained embeddings and one dense layer
model = Sequential()
model.add(Embedding(len(word_index) + 1,
300,
input_length=max_len))
model.add(SimpleRNN(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.summary()
return dict(
model=model,
xtrain_pad=xtrain_pad,
strategy=strategy,
xvalid_pad=xvalid_pad,
xtrain_seq=xtrain_seq,
token=token,
max_len=max_len,
xtrain=xtrain,
xvalid=xvalid,
ytrain=ytrain,
yvalid=yvalid,
)
return dict(
model=model,
xtrain_pad=xtrain_pad,
strategy=strategy,
xvalid_pad=xvalid_pad,
xtrain_seq=xtrain_seq,
token=token,
max_len=max_len,
xtrain=xtrain,
xvalid=xvalid,
ytrain=ytrain,
yvalid=yvalid,
)
def kernel_3(
o_2,
nb_epochs=None,
o_2,
nb_epochs=None,
):
if nb_epochs is None:
nb_epochs = 5
if nb_epochs is None:
nb_epochs = 5
# %% [markdown]
# Writing a function for getting auc score for validation
# %% [markdown]
# Writing a function for getting auc score for validation
# %% [code]
def roc_auc(predictions,target):
import sklearn.metrics
'''
# %% [code]
def roc_auc(predictions, target):
import sklearn.metrics
"""
This method returns the AUC score when given the predictions
and Labels
'''
"""
fpr, tpr, thresholds = sklearn.metrics.roc_curve(target, predictions)
roc_auc = sklearn.metrics.auc(fpr, tpr)
return roc_auc
fpr, tpr, thresholds = sklearn.metrics.roc_curve(target, predictions)
roc_auc = sklearn.metrics.auc(fpr, tpr)
return roc_auc
# %% [code]
if os.path.exists('model.h5'):
o_2['model'].load_weights('model.h5')
else:
o_2['model'].fit(
o_2['xtrain_pad'],
o_2['ytrain'],
nb_epoch=nb_epochs,
batch_size=64*o_2['strategy'].num_replicas_in_sync
) #Multiplying by Strategy to run on TPU's
o_2['model'].save_weights('model.h5')
# %% [code]
if os.path.exists('model.h5'):
o_2['model'].load_weights('model.h5')
else:
o_2['model'].fit(
o_2['xtrain_pad'], o_2['ytrain'], nb_epoch=nb_epochs, batch_size=64 * o_2['strategy'].num_replicas_in_sync
) # Multiplying by Strategy to run on TPU's
o_2['model'].save_weights('model.h5')
# %% [code]
scores = o_2['model'].predict(o_2['xvalid_pad'])
print(
"Auc: %.2f%%" % (
roc_auc(
scores,
o_2['yvalid']
)
)
)
# %% [code]
scores = o_2['model'].predict(o_2['xvalid_pad'])
print('Auc: %.2f%%' % (roc_auc(scores, o_2['yvalid'])))
# %% [code]
scores_model = []
scores_model.append(
{
'Model': 'SimpleRNN',
'AUC_Score': roc_auc(
scores,
o_2['yvalid']
)
}
)
# %% [code]
scores_model = []
scores_model.append({'Model': 'SimpleRNN', 'AUC_Score': roc_auc(scores, o_2['yvalid'])})
# %% [markdown]
# ## Code Explanation
# * Tokenization<br><br>
# So if you have watched the videos and referred to the links, you would know that in an RNN we input a sentence word by word. We represent every word as a one-hot vector of dimension: number of words in the vocabulary + 1. <br>
# The Keras Tokenizer takes all the unique words in the corpus and forms a dictionary with words as keys and their number of occurrences as values; it then sorts the dictionary in descending order of counts. It then assigns index 1 to the first word, 2 to the second, and so on. So suppose the word 'the' occurred the most in the corpus: it will be assigned index 1, and the vector representing 'the' would be a one-hot vector with value 1 at position 1 and zeros elsewhere.<br>
# Try printing the first 2 elements of xtrain_seq; you will see that every word is now represented as an integer
# %% [markdown]
# ## Code Explanation
# * Tokenization<br><br>
# So if you have watched the videos and referred to the links, you would know that in an RNN we input a sentence word by word. We represent every word as a one-hot vector of dimension: number of words in the vocabulary + 1. <br>
# The Keras Tokenizer takes all the unique words in the corpus and forms a dictionary with words as keys and their number of occurrences as values; it then sorts the dictionary in descending order of counts. It then assigns index 1 to the first word, 2 to the second, and so on. So suppose the word 'the' occurred the most in the corpus: it will be assigned index 1, and the vector representing 'the' would be a one-hot vector with value 1 at position 1 and zeros elsewhere.<br>
# Try printing the first 2 elements of xtrain_seq; you will see that every word is now represented as an integer
# %% [code]
o_2['xtrain_seq'][:1]
# %% [code]
o_2['xtrain_seq'][:1]
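The frequency-ranked indexing described in the tokenization notes can be sketched with the standard library alone. The helpers `build_word_index` and `texts_to_sequences` below are hypothetical simplifications: they only lowercase and split on whitespace, whereas the real Keras `Tokenizer` also filters punctuation.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next, ..."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # most_common() sorts by count, descending; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace every known word with its integer index; unknown words are dropped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]


corpus = ['the cat sat', 'the dog sat on the mat']
word_index = build_word_index(corpus)
seqs = texts_to_sequences(corpus, word_index)
print(word_index)
print(seqs)
```

Index 0 is never assigned to a word, which is why it stays free for padding and why the embedding layer is sized `len(word_index) + 1`.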
def kernel_4(
o_2,
input_texts=None,
o_2,
input_texts=None,
):
import keras.preprocessing.sequence
import keras.preprocessing.sequence
if input_texts is None:
input_texts = [
'blahb blahb blah',
'Hello World!',
'This is very good!',
'A very non toxic comment! This is so polite and polished one!'
]
if input_texts is None:
input_texts = ['blahb blahb blah', 'Hello World!', 'This is very good!', 'A very non toxic comment! This is so polite and polished one!']
t6 = []
for o in input_texts:
t1 = o
t2 = o_2['token'].texts_to_sequences(
[t1],
)
t3 = keras.preprocessing.sequence.pad_sequences(
t2,
maxlen=o_2['max_len']
)
t4 = o_2['model'].predict(
t3,
)
t6.append(
dict(
text=o,
score=t4[0][0],
)
)
pprint.pprint(
dict(
t1=t1,
t2=t2,
t3=t3,
t4=t4,
)
)
pprint.pprint(t6)
t6 = []
for o in input_texts:
t1 = o
t2 = o_2['token'].texts_to_sequences(
[t1],
)
t3 = keras.preprocessing.sequence.pad_sequences(t2, maxlen=o_2['max_len'])
t4 = o_2['model'].predict(
t3,
)
t6.append(
dict(
text=o,
score=t4[0][0],
)
)
pprint.pprint(
dict(
t1=t1,
t2=t2,
t3=t3,
t4=t4,
)
)
pprint.pprint(t6)
return dict(
t6=t6,
)
return dict(
t6=t6,
)
def kernel_5(
o_1=None,
o_2=None,
o_1=None,
o_2=None,
):
if o_1 is None:
o_1 = kernel_1_sample_scrap(max_articles=50)
if o_1 is None:
o_1 = kernel_1_sample_scrap(max_articles=50)
if o_2 is None:
o_2 = kernel_2()
o_3 = kernel_3(
o_2=o_2,
nb_epochs=1
)
if o_2 is None:
o_2 = kernel_2()
o_3 = kernel_3(o_2=o_2, nb_epochs=1)
t1 = sum(
[
[
o['text'] for o in o2['comments']
] for o2 in o_1['t8']
], []
)
t1 = sum([[o['text'] for o in o2['comments']] for o2 in o_1['t8']], [])
t2 = kernel_4(
o_2=o_2,
input_texts=t1
)
t2 = kernel_4(o_2=o_2, input_texts=t1)
t3 = sorted(
t2['t6'],
key=lambda x: x['score'],
)
pprint.pprint(t3)
t3 = sorted(
t2['t6'],
key=lambda x: x['score'],
)
pprint.pprint(t3)

File diff suppressed because it is too large

@ -3,34 +3,34 @@ import unittest
class TestCrypto(unittest.TestCase):
def test_password_utils(self) -> None:
salt = b'asdfasdfasdf'
def test_password_utils(self) -> None:
salt = b'asdfasdfasdf'
secret = 'blah'
secret = 'blah'
hash_res = crypto.PasswordUtils.secret_hash(
secret,
mode='bytes',
salt=salt,
)
self.assertEqual(
hash_res,
(
salt,
b'\xdak\xd15\xfa\x8e\xc8\r\xc3\xd2c\xf1m\xb0\xbf\xe6\x98\x01$!j\xc8\xc0Hh\x84\xea,\x91\x8b\x08\xce',
),
)
hash_res = crypto.PasswordUtils.secret_hash(
secret,
mode='bytes',
salt=salt,
)
self.assertEqual(
hash_res,
(
salt,
b'\xdak\xd15\xfa\x8e\xc8\r\xc3\xd2c\xf1m\xb0\xbf\xe6\x98\x01$!j\xc8\xc0Hh\x84\xea,\x91\x8b\x08\xce',
),
)
check_res = crypto.PasswordUtils.secret_check(
secret,
*hash_res,
)
check_res = crypto.PasswordUtils.secret_check(
secret,
*hash_res,
)
self.assertTrue(check_res)
self.assertTrue(check_res)
self.assertFalse(
crypto.PasswordUtils.secret_check(
secret + 'asdfasdfsdf',
*hash_res,
)
)
self.assertFalse(
crypto.PasswordUtils.secret_check(
secret + 'asdfasdfsdf',
*hash_res,
)
)