Trying to organize my Twitter timeline, using unsupervised learning

Created 15 Nov 2017 04:47:06, last modified 15 Nov 2017 05:45:31

I'm a frequent user of Twitter, but I realize that among the major social networks it could be the hardest to get into. One of the big obstacles for me was that, as I followed more and more people representing my different interests, my timeline became an overcrowded mess with too many different types of content. For example, at one point I started following many tech-related accounts and comic book art-related accounts at the same time, and when I would go on Twitter I could never reasonably choose to consume content from only one of the groups.

drawing credit goes to u/Sinyuri

Even after learning to adapt to this, I still thought that it would be nice to be able to detect distinct groups among the twitter accounts that I followed. The impetus to finally start a project about this came when I started using cluster analysis algorithms in my machine learning class - the algorithms used seemed to be exactly the right idea for this kind of community detection. With that I set off on the task to collect and analyze the data from my own Twitter follow list, with clusters!

The work I've done since then is still in progress (mostly because the results I'm getting aren't that great yet), and as I make more progress I'll be making more posts about it!

All the code is available on Github.

More details below!

The simplest aim: data representation of a Twitter account

To set up the clustering, there has to be a way to represent each twitter account in my follow list in some form that can be analyzed. For this project, I decided to identify a Twitter account purely by the other accounts that they follow. I represent this in a vector representation that is just like a row of an adjacency matrix for a graph. This graph has a "central" account, and we build up the graph based on who that central account follows -- these are what we'll call the accounts of interest. These accounts of interest each follow other accounts (including some of each other), and we're interested in these accounts, too — we'll call these the accounts to examine.

If we take all the accounts that all our accounts of interest follow and deduplicate them, we have a set of nodes of size \(m\). If we take following as a (directional) edge between two nodes, we can represent a twitter account as a vector, the size of it being \(m\) by \(1\). If we have \(n\) accounts of interest, then our data is represented in the form of an \(n \times m\) matrix. If account of interest \(i\) follows account to examine \(j\), the cell in the \(i\)th row and the \(j\)th column will be a \(1\), otherwise it'll be \(0\). It's worth noting that, in almost all cases, \(m\) will be much larger than \(n\), which is a serious, almost existential problem to this project that I will go into more detail about later in the post.

Collecting the data

Now that I've decided on a representation for every twitter account in my follow list, I need to collect the data necessary to fill up the matrix with all of the edges present in our follow graph. The way that I originally was thinking of doing it was via the Twitter API, which would probably have been the most straightforward way to do it. However, Twitter heavily rate limits doing this: if you want to get a user's follow list, then you can only do so 15 times every 15 minutes (look at the GET friends/list row), which is pretty unwieldy to work with.

To work around this, I wrote a selenium script in python that would go through a user's follow list just by going to the /{user}/following endpoint and collect the names of the accounts that they follow just from the HTML. Here's the relevant function:

def get_follow_list(username, your_username, password):
    browser = webdriver.Firefox()

    browser.get("{username}/following".format(**{'username' : username}))

    username_field = browser.find_element_by_class_name("js-username-field")
    password_field = browser.find_element_by_class_name("js-password-field")


    ui.WebDriverWait(browser, 5)

        ui.WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "ProfileCard")))
        return []

    last_height = browser.execute_script("return document.body.scrollHeight;")
    num_cards = len(browser.find_elements_by_class_name("ProfileCard"))
    while True:
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight)");

            new_height = ui.WebDriverWait(browser, 1.5).until(new_height_reached(last_height))
            last_height = new_height
            # print(sys.exc_info())

    print("Loop exited; follow list extracted")

    ret = browser.execute_script("return document.documentElement.innerHTML;")
    soup = BeautifulSoup(ret, 'html.parser')
    cards = soup.findAll("div", {"class": "ProfileCard"})
    screennames = [card.get("data-screen-name") for card in cards]

    return screennames

And here's a video of it in action:

In the end, I get a load of text files, each pertaining to an account's follow list, and the information in those text files is just simply the names of all the accounts they follow. Having that all stuffed in a folder, I set out processing the data.

From a profile to a vector

I did all of the clustering in Julia, not Python, because it's something that I've grown familiar with since starting my school's machine learning class and have started really admiring because of its surprising ease of use at its performance point. I encourage everyone with even a passing interest in scientific computing to try and learn it, it really has been fun so far.

We eventually need to get to the matrix representation of our data that I was describing earlier. I did this very messily and without much consideration for performance beyond basic common sense, so bear with me.

The following code reads all the followed accounts from the files and loads them into a dictionary. The keys of this dictionary are the followed accounts, and the values are all the accounts of interest that follow them.

followed = Dict()

mutable struct FollowedAccount

for (n, file) in enumerate(validFiles)
    strm = open(file, "r")
    followedAccs = readlines(strm)
    for acc in followedAccs
        if haskey(followed, acc)
            followed[acc].numFollowing += 1
            push!(followed[acc].following, usernames[n])
            followed[acc] = FollowedAccount(1, [usernames[n]])

So, in followed, we now have a dictionary where we can find all the users following a particular account. We construct the data matrix by first initializing an \(n \times m\) matrix of \(0\)s and filling the cell in the \(i\)th row and the \(j\)th column with a \(1\) if account of interest \(i\) follows account-to-examine \(j\).

(dim,) = size(allAccounts)
(n,) = size(accountsOfInterest)

dataPoints = zeros(n,dim)

for (usernameIdx, username) in enumerate(accountsOfInterest)

    for (followedIdx, followedAcc) in enumerate(allAccounts)
        if in(username, followed[followedAcc].following)
            dataPoints[usernameIdx, followedIdx] = 1

(Note: some variable names have been changed for easier reading in this post)

What the cluster?

Clustering algorithms are a form of unsupervised learning, meaning that there's no single "right answer" that they need to try to guess at. Clustering algorithms take data in the form of a bunch of points and partitions them into some distinct groups. In general (in all clustering algorithms I've seen so far), "closer" points are commonly grouped together, while points that are further apart tend to go in different clusters.

The notion of "closer" is obviously centrally important to clustering algorithms. This is why it was important for us to represent the things we want to cluster, Twitter accounts, as vectors: we can actually measure a quantifiable between any two of them by taking the norm of their difference.

The clustering algorithm used for this project is my own artisanally handcrafted implementation of DBSCAN, which I learned about in class. DBSCAN is a density-based clustering algorithm, meaning that it finds regions where many points exist in a small radius. The minimum number of points that it searches for, and what radius it searches within, are the two parameters of a DBSCAN model, MINPTS and RADIUS.

The algorithm works in pseudocode as follows:

    For every point in the dataset
        If the point is already labelled as being part of the cluster, skip it
            If MINPTS points exist within RADIUS of the point:
                ExpandCluster around that point

    Label p as belonging to a new cluster
    For every unlabelled point within RADIUS of p:
        call ExpandCluster around those points

Here's the actual implementation:

function densityCluster(dataPoints)
    n = size(dataPoints, 1)

    MINPTS = 5
    RADIUS = sqrt(n * 3 / 10)

    y = zeros(n)

    currentClusterNum = 1

    for i in 1:n
        if y[i] != 0

        for j in 1:n
            if i !== j
                push!(distances, norm(dataPoints[j, :] - dataPoints[i, :]))

        if sort(distances)[MINPTS] <= RADIUS
            println("calling expandCluster on point ", i)
            currentClusterNum = expandCluster!(y, dataPoints, i, currentClusterNum, RADIUS, MINPTS)

    return y

function expandCluster!(y, dataPoints, idx, currentClusterNum, radius, minPts)

    queue = [idx]

    visited = Set()

    n = size(dataPoints, 1)

    count = 0

    for o in queue
        if y[o] == 0
            y[o] = currentClusterNum
            count += 1

        (sortedPoints, distances, perm) = closestPoints(dataPoints, idx, n-1)

        for j in 1:n
            if distances[j] <= radius && y[perm[j]] == 0
                push!(queue, perm[j])
                push!(visited, perm[j])

    if count < minPts
        for pt in visited
            y[pt] = 0
        return currentClusterNum

    return currentClusterNum + 1

Of central importance to the DBSCAN algorithm is the notion of getting the closest points to a particular point, in order. It takes \(O(nd)\) time to find these closest points (in my implementation I just call sort instead of just keeping track of the smallest points so I believe mine might actually be \(O(dn\log n)\)). There's definitely been some fantastic work in data structures and algorithms that use approximations or other clever workings to make finding closest points faster, which is vital when your dataset is huge (which is the case for things like the datasets at large software companies). In any case, here's my implementation of closestPoints, which may cause you to cringe:

function closestPoints(X::Array{Float64, 2}, i, numPoints)

    n = size(X, 1)

    for j in 1:n
        push!(distances, norm(dataPoints[j, :] - dataPoints[i, :]))

    return X[sortperm(distances)[1:numPoints], :], distances, sortperm(distances)

The Curse of Dimensionality

The DBSCAN algorithm is great, but it's sensitive to the choices of the two parameters MINPTS and RADIUS. I'm actually still trying to select the best choices for those two parameters, which can be difficult since it's hard to judge the output of the algorithm. Furthermore, like any data analysis algorithm, it won't produce any meaningful data if the data is unusable for whatever reason.

In this case, if I just ran DBSCAN on the matrix form of the data I collected from my Twitter follow list, it would be unlikely to produce any useful output. This is because of a problem known as the Curse of Dimensionality. DBSCAN is a spatial clustering algorithm, which depends on dense clusters existing in the space you're exploring. Since the space we're exploring has a number of dimensions (\(m\)) that is much, much higher than the number of data points (\(n\)), the points are in general too distant to find any sort of dense regions at all.

To deal with this, I have to reduce the dimensions (the number of columns) of the matrix. The algorithm that I used for this is called principal component analysis, or PCA. It does what the name suggests, figuring out the most principal components, the "most important" features, and reducing the dimensionality by transforming the matrix so that only the columns corresponding to those principal components are left.

I very naively use this algorithm without fully understanding details of how it works (I'll gain some understanding soon, though, and maybe write a post about it!), inputting a target dimensionality I want to achieve and just throwing PCA at my function. The specific library I used was MultivariateStats.jl.

I have about 300 accounts of interest, so I reduce the dimensionality of the matrix from some number much greater than 300 to something decently below 300, like 40. I also experimented with standardizing each of the data points so that they're generally closer together.

function rescale(A, dim::Integer=1)
    res = A .- mean(A, dim)
    res ./= map!(x -> x > 0.0 ? x : 1.0, std(A, dim))
    return res

reshapedData = convert(Matrix{Float64}, dataPoints)
X = rescale(reshapedData, 1)
M = fit(PCA, X'; maxoutdim = 40)

So now, instead of a \(n \times m\) matrix I now have an \(n \times 40\) matrix, and since my \(n\) is significantly greater than 40, I can do the DBSCAN on it without running into the Curse. I'm not totally sure if doing clustering on the output of PCA is entirely appropriate but with this particular set of parameters, I'm able to gain results that verge on the passable.

Struggles, and things I tried that didn't make the cut

The most difficult part of this experiment was being able to code a correct expandCluster function. At first, my algorithm was placing almost every data point in a single cluster. After much inspection of my code (with a regrettable lack of unit testing which I aim to fix sooner or later), I noticed that the algorithm wasn't handling the case where there were reachable points, but they were all already assigned to a cluster. I definitely had to rewrite the basic iteration through the queue a bunch of times.

Also, I didn't come up with this particular set of algorithms and representations on the first try. There were many things I tried before going ahead with them, some of which I may bring back when I go further into this.

At the outset, instead of using PCA, I used a much more naive metric for reducing dimensionality: eliminate accounts that weren't followed by at least, say, 15 people from my accounts of interest. I realized this wasn't a very smart approach when there may be a cluster of 14 people who each have in common that they follow an account that this threshold-based method would eliminate.

I also tried to use t-SNE, but ended up opting for PCA because it seemed like there was even less of a hope of me being able to understand it and tweaking it accordingly in a reasonable timeframe. Perhaps one day.

So, how'd it do?

The short answer is "Interestingly." After I coded up the solution, I spent about a week (on and off, of course) just trying out different values for the parameters:

  • The dimensionality reduction I want
  • The values for MINPTS and RADIUS for the DBSCAN algorithm

Eventually, I settled on values of \(40\) for the number of dimensions I wanted from PCA, MINPTS = 5, and RADIUS = sqrt(n * 3 / 10). It's not because there was any sort of metric by which these are optimized against — the "fitness" of the output of a clustering algorithm is entirely subjective, after all. It's just because I found that that these choices gave 6 clusters (and the "no-cluster" assignment) for 300 data points that looked subjectively passable.

By passable, unfortunately, I don't mean that I can always find out what the clusters actually mean. I found that, perhaps, the distinct groups in my follow list that I originally thought were pretty distinguishable may have more unclear borders after all. For example, if I were to subjectively tag certain accounts as "uni CS students", the algorithm found that the uni CS students were spread across the clusters it created.

Overall, although the clusters are pretty nonsensical, at least I was able to get from the algorithm putting everything in a single cluster to creating six distinct clusters.

Here are the six clusters. Note, there are actually only six clusters: "Cluster 0" consists of all the data points that can't be clustered together, i.e. "outliers" in some sense.

Cluster 0: String["3ncr1pt3d", "ACLU", "alexcruise", "allwantoo", "AOCrows", "biiigfoot", "BillSimmons", "brucesharpe", "casskhaw", "daviottenheimer", "desi_inda_house", "dom2d", "dragosr", "dylancuthbert", "EmilyDreyfuss", "EmilyGorcenski", "Eric_Hennenfent", "fl3uryz", "FreedomeVPN", "FunnyAsianDude", "generalelectric", "GonzoHacker", "hacks4pancakes", "HertzDevil", "hiromu1996", "hirosemaryhello", "InfoSecSherpa", "ivy_hollivana", "JJRedick", "josephfcox", "JossFong", "jstnkndy", "katgleason", "kelleyrobinson", "KenRoth", "KimDotcom", "KirkWJohnson", "larsvedo", "ltkilroy", "marktmaclean", "MichaelSSmithII", "MITEngineering", "MITSloan", "mzbat","natazilla", "netflix", "NGhoussoub", "NimkoAli", "nrrrdcore", "Nullbyte0x80", "ParameterHacker", "Patrick_McHale", "Pogue", "Rachel__Nichols", "rcallimachi", "robbyoconnor", "russwest44", "SadiqKhan", "screenjunkies", "ShamsCharania", "SHAQ", "shutupmikeginn", "skayyali1", "sovietvisuals", "swannodette", "thejoshlamb", "tinyrobots", "tpolecat", "t_sloughter", "verainstitute", "viiolaceus", "wojespn", "ZachLowe_NBA"]

Cluster 1: String["acciojackie", "actualrecruiter", "adultswim", "aleph7", "antirez", "__apf__", "aprilmpls", "ARetVet", "Arjyparjy", "avasdemon", "avery_katko", "AvidHacker", "AvVelocity", "azuresun", "b0rk", "baesedandy", "BBCBreaking", "beriwanravandi", "blakegriffin32", "Bourdain", "briankrebs", "campuscodi", "chriszhu12", "cjbecktech", "Colossal", "coolbho3k", "cryptonym0", "CryptoVillage", "CybertronVGC", "da_667", "dennisl_music", "dysinger", "elpritchos", "EtikaWNetwork", "fggosselin", "gommatt", "GoogleDoodles", "grisuy", "HamillHimself", "hannibalburess", "HariniLabs", "hemalchevli", "hrw", "ijlaal_", "irfansharifm", "JennyENicholson", "kelhutch17", "kevinmitnick", "kporzee", "Levus28", "MalwareTechBlog", "matthias_kaiser", "MichelleGhsoub", "MISSINGinCANADA", "MITEECS", "MonotoneTim", "oviosu", "pixpilgames", "pud", "radareorg", "rendevouz09", "rice_fry", "rocallahan", "saemg", "samhocevar", "sarahkendzior", "Schwarzenegger", "scottlynch78", "sean3z", "shadowproofcom", "SylviaEarle", "theQuietus", "the_suzerain", "Toby4WARD", "toriena", "trevorettenboro", "ubcprez", "UncleSego", "unkyoka", "user935", "x0rz", "xoreaxeaxeax"]

Cluster 2: String["benheck", "melissa05092", "pwnallthethings", "rishj09", "uhlane", "vgdunkey"]

Cluster 3: String["BerniceKing", "BetaList", "biancasubion", "BrandSanderson", "FastForwardLabs", "Jinichuu", "theTunnelBear"]

Cluster 4: String["brainpicker", "byrook1e", "CanEmbUSA", "carolynporco", "CatchaSCRapper", "charmaine_klee", "charmwitch", "christenrhule", "CockroachDB", "codeahoy", "colour", "CommitStrip", "compArch1", "coolgaltw", "countchrisdo", "cperciva", "cursedimages_2", "cursedimages", "DenisTrailin", "dennis_lysenko", "dronesec", "Ean_Dream", "elnathan_john", "encryptme", "e_sjule", "faceb00kpages", "Fox0x01", "FreedomofPress", "jessysaurusrex", "KamalaHarris", "mononcqc", "norm", "racholau", "reiver", "SWUBC", "Wugabuga_", "YashTag1"]

Cluster 5: String["GEResearch", "haikus_are_easy", "hasherezade", "i0n1c", "i_cannot_read", "ididmorepushups", "idiot_joke_teen", "INFILTRATION85", "internetofshit", "InthelifeofLuke", "InuaEllams", "jacobtwlee", "japesinator", "jeriellsworth", "Jiyanxa", "JoelEmbiid", "JohnBoyega", "Junk_lch", "justinschuh", "jxson", "kelseyhightower", "KingJames", "kinucakes", "klei", "kmuehmel", "Kosan_Takeuchi", "krypt3ia", "ksenish", "KurtBusiek", "l0stkn0wledge", "Legal_Briefs08", "LibyaLiberty", "MachinePix", "malwareunicorn", "motomaratai", "nwHacks", "rei_colina", "SciBugs", "skamille", "susanthesquark", "trufae"]

Cluster 6: String["Mechazawa", "MiraiAttacks", "MoebiusArt1", "msftmmpc", "musalbas", "naval", "ndneighbor", "nigguyen", "omniboi", "oocanime", "PanguTeam", "PengestMunch", "ProfSaraSeager", "progpaintings", "prozdkp", "ramonashelburne", "RascherMichael", "readingburgers", "regretsuko", "repjohnlewis", "riibrego", "RoguePOTUSStaff", "rowletbot", "Rutiggers", "s0lst1c3", "s7ephen", "sadserver", "seanswush", "SenJohnMcCain", "sheer_hope", "SheerPriya", "shiunken", "stinkbug", "swipathefox", "Syphichan", "tajfrancis", "tea_flea", "ternus", "thealexhoar", "thebenheckshow", "theHeckwKaren", "TheRujiK", "ThisIsJBird", "tokyomachine", "torbooks", "Tyler_J_Mallloy", "UBCAvocadoWatch", "VessOnSecurity", "VGArtAndTidbits", "weworkremotely", "WJ_VJ", "worrydream", "YO_SU_RA"]

What's next?

This is really still a work in progress so I wouldn't even consider this project to even be complete; the purpose of this post is more to share my first explorations into this space, as a relative novice to data science. The next steps for this project would probably involve a lot of research and a lot more implementation and parameter tweaking. Some specific areas are:

  • Understanding what PCA actually does and tweaking the parameter of the output dimension in a more informed way
    • Toggling rescaling
  • Understanding the implications of using a PCA result as the input to a clustering algorithm
  • Using different clustering algorithms on the data, like k-means
  • Possibly automating parameter tuning (in a brute-force way, rather than an optimization way because any optimization is still unclear)
  • Trying to use more features based on tweet, biography, and other forms of content in addition to the follow graph data
  • Try to use different algorithms for dimensionality reduction, like t-SNE or auto encoders

I'm going to take a bit of a break from this project for now to focus on other projects, but look forward to an eventual follow-up with my further experimentation in coming up with a better algorithm!