In my last article, I showed you a simple Twitter script to connect to the API and clean some data. That script was horrible, as all the code resided in a single .go file and we didn't leverage concurrency.
I have now refactored the code and split it into separate files, as shown below; I've implemented concurrency using channels and goroutines, and I have added some functionality (like outputting my data to a dataframe).
My main.go file is now much cleaner than it was in the previous article. Here, I simply define the list of Twitter handles I want to loop through and define a couple of channels. Remember from the last article that channels let the goroutines communicate with the main routine, keeping the main function informed about the progress of each routine.
Once we have defined the channels, we get on to looping through each handle. We include the keyword go in front of our function calls so that each one launches as a goroutine, which gives us our concurrency.
Finally, we define 2 listeners that each wait for 2 responses from their channel before the application exits. Why 2 responses? Because each listener waits for as many messages as there are items in the handles slice – we have 2 handles to process and we don't want the program to stop until both have finished.
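A minimal sketch of that main.go structure is below. The function names SearchProfile and getPosts come from the article; the channel names, handle values, and the stand-in function bodies are my assumptions:

```go
package main

import "fmt"

// Stand-ins for the real functions in search.go and getposts.go;
// here they just report completion back on their channel.
func SearchProfile(handle string, userChannel chan string) {
	userChannel <- fmt.Sprintf("profile processed: %s", handle)
}

func getPosts(handle string, postChannel chan string) {
	postChannel <- fmt.Sprintf("posts processed: %s", handle)
}

func main() {
	handles := []string{"handle1", "handle2"}

	// Channels keep main informed about each goroutine's progress.
	userChannel := make(chan string)
	postChannel := make(chan string)

	// The go keyword launches each call as a goroutine.
	for _, handle := range handles {
		go SearchProfile(handle, userChannel)
		go getPosts(handle, postChannel)
	}

	// Listeners: block until one message per handle arrives on each
	// channel, so the program can't exit before the work is done.
	for i := 0; i < len(handles); i++ {
		fmt.Println(<-userChannel)
	}
	for i := 0; i < len(handles); i++ {
		fmt.Println(<-postChannel)
	}
}
```

The unbuffered channels double as synchronization: receiving len(handles) messages per channel is what holds main open until every goroutine reports in.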
The first function we call is SearchProfile, which is included in search.go. So, let’s look at that.
Here, the first thing we do is to define a new struct called user_details. In here, we include all of the fields we’re interested in from the Twitter API call response.
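The struct might look something like the sketch below. The exact field list is my assumption, based on the profile fields a Twitter user lookup typically returns:

```go
package main

import "fmt"

// user_details holds the profile fields we care about from the
// Twitter API response. The field names here are illustrative.
type user_details struct {
	Name           string
	ScreenName     string
	FollowersCount int
	FriendsCount   int
	TweetCount     int
	AccountAgeDays int // filled in later via CalcAge
}

func main() {
	u := user_details{Name: "Example", ScreenName: "example", FollowersCount: 100}
	fmt.Printf("%+v\n", u)
}
```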
Within the function itself, we connect to the API using the connect() function from api.go:
We then assign the results of the GetUsers API call to various variable names, and pass the account creation date into the CalcAge function, which returns the age of the account (today minus the account opening date).
Finally, we assign those values to the user_details struct type that we defined above and we pass a message back to the user_channel to say that this profile has been processed.
Next, we have getposts.go. Inside here, we define our getPosts function and this is where things get a bit more confusing.
We do our API call and then we start defining a bunch of slices. We will populate these with the data we want to store each time we iterate over one of the posts in the response.
So, for each of the tweets in the response, we store the values under particular variable names. For example, value.IdStr is stored in a variable called id. During this process, we run some cleanup on our data: extracting the hour from the datetime stamp, cleaning the date format, and so on.
Then we proceed to append the cleaned values to the slices we made earlier. By this point, we have a bunch of populated slices – if we have 10 posts, we will have 10 items in each slice. If you stuck all of those slices together side by side, you'd visually have something that looks a bit like a table.
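The loop-and-append pattern can be sketched as follows. The IdStr, RetweetCount and FavoriteCount field names match a typical Twitter client's tweet type, but the tweet struct and toColumns helper here are stand-ins I've made up for illustration:

```go
package main

import "fmt"

// tweet is a stand-in for the library's tweet type; the real code
// reads fields like value.IdStr straight off the API response.
type tweet struct {
	IdStr         string
	RetweetCount  int
	FavoriteCount int
}

// toColumns walks the response and appends each tweet's values to
// per-column slices: ten posts means ten items in every slice.
func toColumns(tweets []tweet) (ids []string, retweets, favorites []int) {
	for _, value := range tweets {
		id := value.IdStr // store the value under a simple name
		ids = append(ids, id)
		retweets = append(retweets, value.RetweetCount)
		favorites = append(favorites, value.FavoriteCount)
	}
	return ids, retweets, favorites
}

func main() {
	searchResult := []tweet{
		{IdStr: "1", RetweetCount: 5, FavoriteCount: 9},
		{IdStr: "2", RetweetCount: 3, FavoriteCount: 7},
	}
	ids, retweets, favorites := toColumns(searchResult)
	// Read the slices side by side and you get one row per tweet.
	for i := range ids {
		fmt.Println(ids[i], retweets[i], favorites[i])
	}
}
```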
Then, we calculate the totals for retweets and favorites using functions from aggregate.go. As you can see, these functions take the whole columns as slices and then iterate over those slices to calculate the totals.
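A sketch of such a totalling function is below; the name SumInts is my own, but the shape matches the description – take the whole column as a slice and iterate to accumulate:

```go
package main

import "fmt"

// SumInts totals a whole "column": it takes the slice and iterates
// over it, accumulating the values (used for retweets and favorites).
func SumInts(column []int) int {
	total := 0
	for _, v := range column {
		total += v
	}
	return total
}

func main() {
	retweets := []int{5, 3, 12}
	fmt.Println(SumInts(retweets)) // 20
}
```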
In the same aggregate.go file, we also have functions to find the maximum number of retweets or favorites. Here, we iterate over the whole column again and look for the largest number.
We then loop through the search results again and say: if a tweet's retweet count equals the max retweet count, then it is the most retweeted tweet in the list.
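The two steps can be sketched together like this. The function names are mine, and starting the max at zero assumes counts are never negative (true for retweets):

```go
package main

import "fmt"

// MaxInt scans the whole column for the largest value.
// Starting at 0 assumes counts are non-negative.
func MaxInt(column []int) int {
	max := 0
	for _, v := range column {
		if v > max {
			max = v
		}
	}
	return max
}

// MostRetweeted loops the results again: the tweet whose retweet
// count equals the max is the most retweeted in the list.
func MostRetweeted(ids []string, retweets []int) string {
	max := MaxInt(retweets)
	for i, count := range retweets {
		if count == max {
			return ids[i]
		}
	}
	return ""
}

func main() {
	ids := []string{"a", "b", "c"}
	retweets := []int{5, 12, 3}
	fmt.Println(MostRetweeted(ids, retweets)) // b
}
```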
Then, we create a new qframe, which takes all of our appended slices as input and generates a nice output dataframe.
We then run a group by on that dataframe to calculate the sum of retweets, favorites and posts per day.
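The article does this with qframe's GroupBy/Aggregate; as a dependency-free sketch of the same group-by-sum over a day column (with names and data I've made up for illustration):

```go
package main

import (
	"fmt"
	"sort"
)

// daily holds the per-day aggregates the group-by produces.
type daily struct {
	Retweets, Favorites, Posts int
}

// groupByDay sums retweets and favorites and counts posts per day –
// a map-based sketch of what the qframe group-by computes.
func groupByDay(days []string, retweets, favorites []int) map[string]daily {
	out := make(map[string]daily)
	for i, day := range days {
		d := out[day]
		d.Retweets += retweets[i]
		d.Favorites += favorites[i]
		d.Posts++
		out[day] = d
	}
	return out
}

func main() {
	days := []string{"2020-01-01", "2020-01-01", "2020-01-02"}
	retweets := []int{5, 3, 12}
	favorites := []int{9, 7, 1}

	grouped := groupByDay(days, retweets, favorites)

	// Sort the keys so the output order is stable.
	keys := make([]string, 0, len(grouped))
	for k := range grouped {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, grouped[k])
	}
}
```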
Finally, we tell the post_channel that we have finished processing this user's posts.
Below I have included the cleanup function so you can see the actions I was performing.
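The original listing isn't reproduced here, but a sketch of the cleanup steps described above – extracting the hour and reformatting the date from Twitter's created_at timestamp – might look like this (a reconstruction, not the article's own code):

```go
package main

import (
	"fmt"
	"time"
)

// cleanup parses Twitter's created_at timestamp (Ruby date layout)
// and returns the hour of day plus a cleaned yyyy-mm-dd date.
func cleanup(createdAt string) (hour int, date string, err error) {
	t, err := time.Parse(time.RubyDate, createdAt)
	if err != nil {
		return 0, "", err
	}
	return t.Hour(), t.Format("2006-01-02"), nil
}

func main() {
	hour, date, err := cleanup("Wed Jan 01 15:30:00 +0000 2020")
	if err != nil {
		panic(err)
	}
	fmt.Println(hour, date) // 15 2020-01-01
}
```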