-
Notifications
You must be signed in to change notification settings - Fork 2
Expand file tree
/
Copy pathcoding tips ml
More file actions
454 lines (280 loc) · 12.6 KB
/
coding tips ml
File metadata and controls
454 lines (280 loc) · 12.6 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
- (regulatization) Lasso is great for feature selection, but when building regression models,
Ridge regression should be your first choice.
-Logistic regression is for classficiation, not regression.
- property DataFrame.values
Return a Numpy representation of the DataFrame.
Warning
We recommend using DataFrame.to_numpy() instead.
Only the values in the DataFrame will be returned, the axes labels will be removed.
-the greater the AUC (aread under the curve) the better the model
If the AUC is greater than 0.5, the model is better than random guessing. Always a good sign!
-After computing the GridSeachCV, I not just want the best params but also the best score:
# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))
-Note that RandomizedSearchCV will never outperform GridSearchCV.
Instead, it is valuable because it saves on computation time.
- format() function https://www.geeksforgeeks.org/python-format-function/
print ("Hello, I am {} years old !".format(18))
-Exploratory data analysis should always be the precursor to model building.
-axis=0 means columns
axis=1 means rows
-"
# Convert '?' to NaN
df[df == '?'] = np.nan
# Print the number of NaNs
print(df.isnull().sum())
"
--imputation == transformation
" # Setup the Imputation transformer: imp
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
"
-"
You can use the .fit() and .predict() methods on pipelines just as you did with your classifiers and regressors"
-" ROC (receiver operating characteristic) Curve
A curve of true positive rate vs. false positive rate at different classification thresholds. See also AUC.
-To loop through a set of code a specified number of times, we can use the ----range()---- function.
When you use a range loop you are saying that you want to count one by one from one number until you hit another.
-The ELSE keyword in a for loop specifies a block of code to be executed when the loop is finished:
-round () to round a float to an interger
-True es 1 y False es 0
-sorted () los ordena de menor a mayor
-help(len) opens up the documentation from inside the IPython Shell for the len() function
-numpy.column_stack() function is used to stack 1-D arrays as columns into a 2-D array.
It takes a sequence of 1-D arrays and stack them as columns to make a single 2-D array.
import numpy as geek
# input array
in_arr1 = geek.array(( 1, 2, 3 ))
print ("1st Input array : \n", in_arr1)
in_arr2 = geek.array(( 4, 5, 6 ))
print ("2nd Input array : \n", in_arr2)
# Stacking the two arrays
------>out_arr = geek.column_stack((in_arr1, in_arr2))
print ("Output stacked array:\n ", out_arr)
Output:
1st Input array :
[1 2 3]
2nd Input array :
[4 5 6]
------->Output stacked array:
[[1 4]
[2 5]
[3 6]]
- np.corrcoef Return correlation coefficients.
- (1. , 7. , 11. ) the dots after the number indicate that the numbers are floats
> type(2.)
float
> type(2)
int
--
The following statements are true of a tidy (also known as "long-format") dataset?
Each variable must have its own column.
Each observation must have its own row.
tidy data is an alternative name for the common statistical form called a model matrix
or data matrix. A data matrix is defined in [1] as follows:
A standard method of displaying a multivariate set of data
is in the form of a data matrix in which rows correspond to sample individuals
and columns to variables,
so that the entry in the ith row and jth column gives the value of the jth variate
as measured or observed on the ith individual.
Hadley Wickham later defined "Tidy Data" as data sets that are arranged such that
each variable is a column and each observation (or case) is a row.
--
Consider the Pandas DataFrame score below.
Identify any missing data, printing True for missing values, and False for non-missing values.
id math physics
0 101 90.0 88.0
1 102 NaN 89.0
2 103 85.0 NaN
Complete the code to return the output
>>>score.isnull()
--
An online games platform would like to assign a special status
to their users based on the number of experience points (xp) that users have.
The user_xp DataFrame contains the user_ids and xp per user.
Use Pandas method to create a new status column that indicates a user's special status.
The bins and labels have been defined for you based on the xp status table.
-- xp status
[0 - 125 000) Bronze
[125 000 - 250 000) Silver
[250 000 - 375 000) Gold
[375 000 - 500 000) Platinum
[500 000 - inf) Diamond
Complete the code to return the output
import pandas as pd
bins = [0, 125000, 250000, 375000, 500000, float('inf')]
labels = ["bronze", "silver", "gold", "platinum", "diamond"]
user_xp['status'] = pd.cut(user_xp['xp'], bins=bins, labels=labels)
user_xp.head()
Expected Output
user_id xp status
0 1520965 292382 gold
1 1577585 45308 bronze
2 706833 493393 platinum
3 1303891 342382 gold
4 650641 127259 silver
--
Consider the Pandas DataFrame df below.
Change the data type of the column height to integer.
name height
0 Heather 160.3
1 Chris 175.1
2 Eva 165.4
Complete the code to return the output
df.height.astype('int64')
--
You have imported the food data and now want to view the first few rows to perform a quick visual inspection of the data.
Print the first 5 rows of the data.
print(food.head())
--
import the JSON file data.json >> as a Pandas DataFrame.
import json
import pandas as pd
x = pd.read_json('data.json')
--
the following is the correct function from the pandas package to import a CSV file
read_csv()
--
Pickle Module
Python pickle module is used for serializing and de-serializing
a Python object structure.
Any object in Python can be pickled so that it can be saved on disk.
What pickle does is that it “serializes” the object first before writing it to file.
Pickling is a way to convert a python object (list, dict, etc.)
into a character stream.
The idea is that this character stream contains all the information necessary to reconstruct
the object in another python script.
--
pdb — The Python Debugger
Source code: Lib/pdb.py
The module pdb defines an interactive source code debugger for Python programs.
It supports setting (conditional) breakpoints and single stepping at the source line level,
inspection of stack frames, source code listing, and evaluation of arbitrary Python code
in the context of any stack frame.
It also supports post-mortem debugging
and can be called under program control.
The debugger is extensible – it is actually defined as the class Pdb.
This is currently undocumented but easily understood by reading the source.
The extension interface uses the modules bdb and cmd.
The debugger’s prompt is (Pdb). Typical usage to run a program under control of the debugger is:
>>>
>>> import pdb
>>> import mymodule
>>> pdb.run('mymodule.test()')
> <string>(0)?()
(Pdb) continue
> <string>(1)?()
(Pdb) continue
NameError: 'spam'
> <string>(1)?()
(Pdb)
--
Numpy
NumPy’s main object is the homogeneous multidimensional array.
It is a table with same type elements,
i.e, integers or string or characters (homogeneous), usually integers.
In NumPy, dimensions are called axes.
The number of axes is called the rank.
Some of the important attributes of a NumPy object are:
Ndim: displays the dimension of the array
Shape: returns a tuple of integers indicating the size of the array
Size: returns the total number of elements in the NumPy array
Dtype: returns the type of elements in the array, i.e., int64, character
Itemsize: returns the size in bytes of each item
Reshape: Reshapes the NumPy array
NumPy array elements can be accessed using indexing. Below are some of the useful examples:
A[2:5] will print items 2 to 4. Index in NumPy arrays starts from 0
A[2::2] will print items 2 to end skipping 2 items
A[::-1] will print the array in the reverse order
A[1:] will print from row 1 to end
Since Machine Learning requires lots of scientific calculations,
it is much better to use NumPy’s ndarray, which provides a lot of convenient and optimized
implementations of essential mathematical operations on vectors.
Vectorized operations perform faster than
matrix manipulation operations performed using loops in python.
ndarray slices are actually views on the same data buffer.
If you modify it, it is going to modify the original ndarray as well.
The way multidimensional arrays are accessed using NumPy
is different from how they are accessed in normal python arrays.
The generic format in NumPy multi-dimensional arrays is:
Array[row_start_index : row_end_index, column_start_index : column_end_index]
--
When to rebase? When to Merge?
If the feature branch you are getting changes from is shared with other developers,
rebasing is not recommended, because the rebasing process will create inconsistent repositories.
For individuals, rebasing makes a lot of sense.
If you want to see the history completely same as it happened, you should use merge.
>>Merge preserves history whereas rebase rewrites it.
--
The way multidimensional arrays are accessed using NumPy
is different from how they are accessed in normal python arrays.
The generic format in NumPy multi-dimensional arrays is:
Array[row_start_index:row_end_index, column_start_index: column_end_index]
--
What is Pandas?
Similar to NumPy, Pandas is one of the most widely used python libraries in data science.
It provides high-performance, easy to use structures and data analysis tools.
Unlike NumPy library which provides >>> objects for multi-dimensional arrays,
Pandas provides in-memory >>2d table object called Dataframe.
It is like a spreadsheet with column names and row labels.
Hence, with 2d tables, pandas is capable of providing many additional functionalities
like creating pivot tables, computing columns based on other columns and plotting graphs.
Like NumPy, Pandas also provide the basic mathematical functionalities like addition, subtraction and conditional operations and broadcasting.
Pandas dataframe object represents a spreadsheet with cell values, column names, and row index labels.
A Dataframe can be visualized as dictionaries of Series.
Dataframe rows and columns are simple and intuitive to access. Pandas also provide SQL-like functionality to filter, sort rows based on conditions.
--
Why Use Lambda Functions? in python
Lambda functions are used when you need a function for a short period of time.
This is commonly used when you want to pass a function as an argument to higher-order functions,
that is, functions that take other functions as their arguments.
------
Python Objects and Classes
https://www.programiz.com/python-programming/class
Python is an object oriented programming language.
Unlike procedure oriented programming,
where the main emphasis is on functions,
object oriented programming stresses on objects.
An object is simply
a collection of data (variables) and methods (functions) that act on those data.
Similarly, a class is a blueprint for that object.
We can think of class as a sketch (prototype) of a house.
It contains all the details about the floors, doors, windows etc.
Based on these descriptions we build the house. House is the object.
As many houses can be made from a house's blueprint,
we can create many objects from a class.
An object is also called an
instance of a class
and the process of creating this object is called instantiation.
Defining a Class in Python
Like function definitions begin with the -def- keyword in Python,
class definitions begin with a -class- keyword.
The first string inside the class is called docstring
and has a brief description about the class.
Although not mandatory, this is highly recommended.
Here is a simple class definition.
class MyNewClass:
'''This is a docstring. I have created a new class'''
pass
A class creates a new local -namespace-
where all its attributes are defined.
Attributes may be data or functions.
As soon as we define a class,
a new class object is created with the same name.
This class object allows us to access the different attributes
as well as to instantiate new objects of that class.
class Person:
"This is a person class"
age = 10
def greet(self):
print('Hello')
# Output: 10
print(Person.age)
# Output: <function Person.greet>
print(Person.greet)
# Output: 'This is my second class'
print(Person.__doc__)
Output
10
<function Person.greet at 0x7fc78c6e8160>
This is a person class