Numpy Python Name fads part37 [w/ subs]



00:00:00 – Our last task in this project is to
00:00:02 – identify name fads, that is, popular names
00:00:06 – that appear suddenly and then fade away
00:00:08 – quickly. As we do so, we will see how to
00:00:11 – group data with pandas groupby, how to
00:00:14 – compute aggregations, and how to combine
00:00:16 – Boolean masks.
00:00:18 – Let's go to the Jupyter notebook and
00:00:20 – let's select the 0705 fads begin
00:00:23 – exercise file. We will continue with
00:00:27 – our work from the last videos, so let's
00:00:30 – select Cell and Run All. Let
00:00:34 – us look at this plot for the popularity
00:00:36 – of the top six girl names between 1985
00:00:40 – and 1995. Most of these names
00:00:43 – were only popular for a relatively
00:00:44 – short period. This prompts the question
00:00:46 – of how we can identify a name fad. A
00:00:50 – fad will have a certain spikiness to
00:00:53 – the plot, more like Brittany here than
00:00:56 – Elizabeth. So what we need to do is to
00:00:59 – compute a single number for each name
00:01:01 – that will tell us how spiky the plot would
00:01:04 – be. However, the number should be
00:01:05 – insensitive to the total number of
00:01:08 – appearances for a given name. After all,
00:01:10 – a small, not very popular fad is still
00:01:12 – a fad. It turns out that the trick to
00:01:14 – computing the spikiness will be to sum
00:01:17 – the squares of the frequencies of the
00:01:18 – names. It is a mathematical fact that if you
00:01:21 – multiply a function by itself,
00:01:23 – this increases its contrast or, if you wish,
00:01:26 – its spikiness. Let's start by
00:01:29 – computing the total number of babies
00:01:31 – with a given name over all years. For this
00:01:34 – we will use the groupby function,
00:01:36 – grouping by sex and name, and then sum
00:01:38 – all the values in each group. So we take
00:01:41 – the DataFrame allyears, group by
00:01:44 – sex and name, take a sum, and look at the
00:01:48 – very first few records. This is what we
00:01:53 – need,
00:01:53 – although summing the years themselves in
00:01:56 – the column year is rather useless, so we can
00:01:59 – select only the column number. Let me
00:02:01 – copy the code from above and add the
00:02:03 – column selection before the sum. This is
00:02:08 – a Series, which is what we need for an
00:02:10 – assignment to the variable totals, so let
00:02:13 – me copy the code again and assign it. Now
00:02:17 – for the spikiness, the aggregation
00:02:19 – function sum is not quite what we need:
00:02:22 – we need the sum of the squares. pandas
00:02:25 – doesn't have a function for that, but we can
00:02:27 – define it ourselves.
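The totals computation described above (group by sex and name, select only the number column, then sum) might look like the following minimal sketch. The column names sex, name, number and year and the variable totals come from the transcript; the DataFrame name allyears and all the values are assumptions made up for illustration.

```python
import pandas as pd

# Toy stand-in for the course's names table. The column names follow the
# transcript; the DataFrame name and every value are invented for illustration.
allyears = pd.DataFrame({
    "sex":    ["F", "F", "F", "M", "M"],
    "name":   ["Ann", "Ann", "Bea", "Cal", "Cal"],
    "number": [10, 30, 5, 7, 3],
    "year":   [1990, 1991, 1990, 1990, 1991],
})

# Select the number column before summing so that the years are not added up.
# The result is a Series indexed by (sex, name).
totals = allyears.groupby(["sex", "name"])["number"].sum()

print(totals.head())
```

Selecting the column before calling sum is what turns the result into a Series rather than a DataFrame with a meaningless summed year column.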
00:02:29 – So let's do that: define sumsq of x to
00:02:34 – return the sum of x squared, and let's
00:02:38 – compute the spikiness by repeating the
00:02:40 – groupby operation, selecting number, and
00:02:43 – applying sumsq using the pandas
00:02:47 – function agg, for aggregation. Last, we want
00:02:51 – to divide by the totals. Doing that, we
00:02:54 – put very popular and less popular names
00:02:56 – on an even footing. Let me correct my
00:02:59 – typos and call this spikiness. Let's
00:03:04 – have a look. Indeed, this spikiness is a
00:03:06 – number between 0 and 1, with 1 reached
00:03:09 – when a name appears only in a single
00:03:11 – year. I will select only the names that
00:03:14 – appear relatively frequently, requesting
00:03:18 – that totals be greater than 5,000. I will
00:03:22 – call this spiky_common. Let me copy the
00:03:25 – subset so that we can sort it, and let's
00:03:29 – sort in descending order.
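The spikiness steps above can be sketched as follows. The transcript only says "divide by the totals"; for the result to lie between 0 and 1 and to equal 1 for a name seen in a single year, this sketch assumes the division is by the square of the totals, which may differ from the code shown in the video. The DataFrame name allyears, the exact spelling of sumsq, and all the values are likewise assumptions.

```python
import pandas as pd

# Toy data, invented for illustration: one single-year fad, one steady name,
# and one rare single-year name that should be filtered out as uncommon.
allyears = pd.DataFrame({
    "sex":    ["F", "M", "M", "M", "F"],
    "name":   ["Fadia", "Steady", "Steady", "Steady", "Rare"],
    "number": [6000, 2000, 2000, 2000, 10],
    "year":   [1995, 1994, 1995, 1996, 1995],
})

def sumsq(x):
    """Return the sum of the squares of the values in a group."""
    return (x ** 2).sum()

grouped = allyears.groupby(["sex", "name"])["number"]
totals = grouped.sum()

# Assumption: normalise by totals squared so that the value is at most 1.
spikiness = grouped.agg(sumsq) / totals ** 2

# Keep only names with more than 5,000 babies overall, copy, then sort.
spiky_common = spikiness[totals > 5000].copy()
spiky_common = spiky_common.sort_values(ascending=False)

print(spiky_common.head())   # most spiky common names
print(spiky_common.tail())   # least spiky common names
```

With this normalisation a name spread evenly over n years scores 1/n, so the steady three-year name in the toy data scores one third.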
00:03:32 – Finally, let's look at the most spiky
00:03:36 – common names. These are iconic Shaquille
00:03:39 – Jays Edelen Harper and so on.
00:03:43 – Let's look at the least spiky names by
00:03:47 – taking the tail of this Series.
00:03:48 – Indeed, Shaquille is more of a fad than
00:03:52 – Louisa. Let's see that in a plot: set a
00:03:55 – figure size for a slender plot and then
00:03:58 – plot Louisa and Shaquille. This is a
00:04:03 – better illustration of a faddish name
00:04:04 – against one that is not. Let's plot the
00:04:07 – top ten faddish names. We select the top
00:04:10 – ten, grab the index and the index values.
00:04:13 – These are the names.
00:04:15 – Let's make a figure again and loop over
00:04:19 – sex and name. Last, let's add a legend.
00:04:24 – I will use a list comprehension to
00:04:26 – select only the name from the tuples that
00:04:30 – contain sex and name. I'll put the legend
00:04:33 – in the upper left. Here we go.
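A rough sketch of the plotting loop just described: take the top entries of the sorted series, read the (sex, name) tuples from the index values, plot one line per name, and build the legend with a list comprehension. The figure size, the way each name's rows are selected, and the toy data are assumptions; the video may use its own plotting helper.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data, invented for illustration.
allyears = pd.DataFrame({
    "sex":    ["F", "F", "M", "M", "M"],
    "name":   ["Fadia", "Fadia", "Steady", "Steady", "Steady"],
    "number": [100, 5000, 2000, 2000, 2000],
    "year":   [1994, 1995, 1994, 1995, 1996],
})

# Hypothetical stand-in for the sorted spiky_common series.
spiky_common = pd.Series(
    [0.96, 0.33],
    index=pd.MultiIndex.from_tuples(
        [("F", "Fadia"), ("M", "Steady")], names=["sex", "name"]
    ),
)

# Top ten entries (fewer here); index.values is an array of (sex, name) tuples.
fads = spiky_common.head(10).index.values

# The figure size is an assumption chosen to give a wide, slender plot.
fig, ax = plt.subplots(figsize=(12, 2.5))
for sex, name in fads:
    rows = allyears[(allyears["sex"] == sex) & (allyears["name"] == name)]
    ax.plot(rows["year"], rows["number"])

# Keep only the name from each (sex, name) tuple for the legend labels.
ax.legend([name for sex, name in fads], loc="upper left")
```

In a notebook the figure is displayed at the end of the cell; in a script, a call to plt.show() would be needed.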
00:04:36 – The problem here is that most of these
00:04:38 – names are popular now, so we don't know
00:04:41 – whether they're fads just yet. They may
00:04:43 – have staying power.
00:04:44 – What we can do is to add another cut to
00:04:47 – the data that excludes names popular in
00:04:49 – the last 10 years. For that we need to
00:04:52 – compute totals over the last 10 years.
00:04:55 – I'll do that in totals_recent: group by
00:04:59 – sex and name, grab the number, take the
00:05:03 – sum, and now add the Boolean selection of the
00:05:06 – years based on the year value.
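The recent-totals step might be sketched like this: first select rows with a Boolean mask on the year column, then repeat the same group-and-sum. The transcript does not state the cutoff year, so this sketch derives it from the latest year present in the data; that choice, the name allyears, and the toy values are assumptions.

```python
import pandas as pd

# Toy data, invented for illustration.
allyears = pd.DataFrame({
    "sex":    ["F", "F", "M", "M", "M"],
    "name":   ["Fadia", "Newbie", "Steady", "Steady", "Steady"],
    "number": [6000, 6000, 2000, 2000, 2000],
    "year":   [1995, 2014, 1995, 2005, 2014],
})

# Assumption: "the last 10 years" is counted back from the latest year in the data.
last_year = allyears["year"].max()
recent = allyears[allyears["year"] > last_year - 10]   # Boolean selection on year

# Same grouping and summing as for totals, restricted to recent rows.
totals_recent = recent.groupby(["sex", "name"])["number"].sum()

print(totals_recent)
```

Names with no rows in the recent window do not appear in totals_recent at all, which matters when this series is later compared against the all-years totals.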
00:05:08 – OK, let me go back and grab my code for
00:05:12 – computing spiky_common. I'm going to add
00:05:18 – another Boolean condition. I do this with
00:05:21 – the ampersand, which stands for logical and. So we
00:05:27 – select common names that were
00:05:29 – not very common over the last 10 years.
00:05:30 – The list has changed.
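Combining the two conditions with the ampersand could look like the sketch below. The 5,000 threshold is from the transcript; the recent-popularity threshold of 1,000, the cutoff year, the normalisation by totals squared, and the reindex step that treats names absent from recent years as having a recent total of zero are all assumptions made for this illustration.

```python
import pandas as pd

# Toy data, invented for illustration.
allyears = pd.DataFrame({
    "sex":    ["F", "F", "M", "M", "M"],
    "name":   ["Fadia", "Newbie", "Steady", "Steady", "Steady"],
    "number": [6000, 6000, 2000, 2000, 2000],
    "year":   [1995, 2014, 1995, 2005, 2014],
})

def sumsq(x):
    """Return the sum of the squares of the values in a group."""
    return (x ** 2).sum()

grouped = allyears.groupby(["sex", "name"])["number"]
totals = grouped.sum()
spikiness = grouped.agg(sumsq) / totals ** 2   # assumed normalisation

# Totals over the last ten years, aligned to the full index; names that are
# missing from recent years get a total of 0 (a choice made in this sketch).
last_year = allyears["year"].max()
recent = allyears[allyears["year"] > last_year - 10]
totals_recent = (
    recent.groupby(["sex", "name"])["number"].sum()
    .reindex(totals.index, fill_value=0)
)

# Each comparison yields a Boolean Series; & combines them element by element.
# The parentheses are required because & binds more tightly than > and <.
mask = (totals > 5000) & (totals_recent < 1000)   # 1000 is a hypothetical cut

spiky_common = spikiness[mask].copy().sort_values(ascending=False)
print(spiky_common)
```

In the toy data only the name that peaked long ago survives both cuts; the currently popular name and the steady name are excluded by the recent-popularity condition.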
00:05:34 – Let's try plotting them. I grab my
00:05:36 – plotting code above, with one more cell
00:05:41 – defining
00:05:43 – fads from the index values. Let's try
00:05:46 – these.
00:05:48 – Yes, that's more like it.
00:05:51 – Shiki oh and Katina, indeed.


Video Url:
http://youtu.be/yls96JfpUBw
