Two weeks ago I was delighted to be invited to speak at the ALPSP International Industry conference. ALPSP is one of the main industry associations for scholarly publishers and so the audience was composed mainly of publishers and those with direct interest in the publishing industry. I was included in the industry updates parallel session with Jan Reichelt (Mendeley), Ryan Jones (Pubget) and Daniel Mayer (Temis). All the sessions are provided here with mine excerpted below.
Scrazzl was featured this week as part of siliconrepublic’s one to watch series. Carmel Doyle explains that the “One to Watch series will look at Irish start-ups from the technology and scientific spaces that are fast making waves in their respective industries and could be on the cusp of something big for their niche innovations”. The video below contains some extracts from that interview.
I was recently invited to pitch at Tech All Stars. Instead of the usual investor deck that I use, I wanted to try something different. Inevitably, the focus of an investor deck is on the hard numbers, the tangibles: investors want to know what traction you have and how you are going to make lots of money for them. I guess that is fair enough given that you are usually asking those same investors to invest in your business.
On the face of it our business is a relatively simple proposition: for scientists we provide search for research materials, and we sell advertising solutions to companies that make research materials. In the middle are the publishers we partner with. Publishers give us access to their content/traffic and we make money for them. This is pretty much my standard investor deck in 3 sentences.
The frustrating thing is that it’s not the whole story and making money for publishers and suppliers is not what has motivated our team for the past three years. For Tech All Stars I prepared two slidedecks, one was for investors and one was for us. The following is the pitch I wanted to give but never got to….
At the core of what we do at scrazzl is data-mining and document analysis. With more than 5 million full-text articles and more than 20 million abstracts we have no shortage of raw data. Undoubtedly, analysis of content and content usage has the potential to provide insights of great benefit to the research community. Globally, efforts to develop alternative metrics (altmetrics) to supplement impact factors and citation counting are gathering pace and are being spearheaded by organisations like PLoS, CiteULike and Mendeley. In parallel, publisher-led initiatives like the Usage Factor project also promise to provide useful insights into article usage and quality.
It has been clear for some time that the traditional methods of establishing article quality and impact, namely peer review, JIF (journal impact factor) and citation count, are straining under the weight of new content being produced every year. The emergence of altmetrics as a supplementary measure of quality taking account of storage, links, bookmarks and conversations certainly has the potential to alleviate some of the problems caused by the continuing data tsunami. Collectively, journal and article level metrics are designed to help readers assess research quality at the level of the article.
Apart from the question “Is this article worth reading?” there are quite a number of other questions asked by researchers that could in the future benefit from improvements in analysis of content and content usage. For example, DataCite is working to promote citation and attribution of primary datasets, thus making it easier for researchers to source good primary data. At scrazzl our focus is on the provision of qualitative metrics to support decision making in experimental science and medicine. Our working hypothesis is that by collating data related to material usage in a large sample of articles, it is possible to derive qualitative measures and insights into the use and application of those materials. Through analysis of articles in combination with metrics related to article quality and researcher influence, we are working to help our users answer questions like:
- How good is this product, should I use it/prescribe it? Recommendation engines like TripAdvisor have revolutionised how we make decisions about the hotels we book or the places we visit. However, the fragmented nature of the scientific supply market coupled with the fact that scientists do not purchase like ordinary consumers (usually they buy through procurement systems) has meant that there is little by way of qualitative information relating to experimental materials. Knowing that other scientists have successfully used the tools and technologies that you are interested in can have a significant bearing on your decision to buy a specific product. In the competitive world of research the cost of failed experiments is high. An initial focus of our article analysis work has been on extracting usage data for experimental materials. We render these derivative metrics through our Product Metrics Widget.
- What are the optimal conditions for the experiment that I plan to run? Slightly more technically challenging than gathering usage statistics on material entities in articles is gathering optimal working conditions for those same materials. Take for example a specific antibody the use of which has been reported more than 500 times in published research. As a researcher, you are interested in the range of dilutions that have been successful for this antibody. Providing this normally fragmented and disparate information in a structured and actionable form is a current R&D focus.
- Who are the researchers that have the experience or resources that I need? Sites like LinkedIn have been successful because they have built a community of skilled individuals and made it really easy to create a network within that community. In science there have been many efforts to create a “Facebook for science”, most of which have failed. Those that are succeeding, Mendeley and CiteULike, for example, are doing so because their core value proposition has never been to create a Facebook for science. These platforms work to solve other issues such as reference management and social bookmarking and allow users to build a social network as a by-product. For scientists seeking to find specific material resources or technical experts, arguably the most complete source of information is the existing literature. At scrazzl, by mapping the associations between scientists and the experimental tools that they document in their research, we are making it easier for scientists to find the people and answers that they require.
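To make the first and last of these questions a little more concrete, here is a minimal sketch in Python of the two underlying ideas: counting how many articles report using a material, and mapping each material to the researchers who documented it. It is an illustration only, not our production code. The record layout (`authors`, `materials`), the function names and the sample data are all hypothetical, and it assumes the materials have already been extracted from each article.

```python
from collections import defaultdict


def usage_counts(articles):
    """Count the number of articles that report using each material."""
    counts = defaultdict(int)
    for article in articles:
        # set() ensures a material is counted once per article
        for material in set(article["materials"]):
            counts[material] += 1
    return dict(counts)


def material_to_researchers(articles):
    """Map each material to the set of authors who documented using it."""
    index = defaultdict(set)
    for article in articles:
        for material in article["materials"]:
            index[material].update(article["authors"])
    return dict(index)


# Hypothetical, hand-made records standing in for extracted article data
articles = [
    {"authors": ["A. Murphy", "B. Chen"], "materials": ["antibody X", "buffer Y"]},
    {"authors": ["C. Okafor"], "materials": ["antibody X"]},
]

counts = usage_counts(articles)
index = material_to_researchers(articles)
print(counts["antibody X"])          # number of articles using antibody X
print(sorted(index["antibody X"]))   # researchers who have reported using it
```

The hard part in practice is the extraction step that produces these records; once that exists, the counting and the lookup are simple.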
Currently we are at the start of a transformation in how the quality and impact of research is assessed. As new standards of research qualification (usage factor, altmetrics) mature it is likely that researchers will demand more tools that can unlock the collective insights housed within content and content usage to enable better decision making. There will be a greater demand for standardised metrics of quality in many areas of scientific enquiry. For our part we will be working hard to unlock the benefits of large-scale article analysis in all areas of experimental science.
At scrazzl.com we love big data, and recently we were having a discussion about how it might be cool to look at the number of research articles published by Reed-Elsevier that mention particularly nasty cancers. Why would we do this, I hear you ask? Well, our working hypothesis was that cancer research output, as measured by the “mention frequency” of different cancer types in peer-reviewed journal articles, should closely mirror incidence and/or mortality statistics. However, what we found was quite unexpected!
Figures 1 and 2 below trace the death rates and incidence, respectively, per 100,000 of population in the US between 2003 and 2007 as reported by the Centers for Disease Control and Prevention (CDC). The cancers resulting in the largest number of deaths over that period included lung cancer, prostate cancer, breast cancer, colon cancer, pancreatic cancer, ovarian cancer and leukemia. As expected, lung cancer resulted in the most deaths, and killed on average twice as many people as the next nearest cancers, breast and prostate. We next took a look at major cancer incidence per 100,000 and consistently, over the time period examined, prostate cancer had the highest incidence rates, with breast cancer coming in a close second. So, what conclusions can be drawn from this data?
- Lung cancer is a particularly problematic cancer and while the overall mortality trend is down it still kills a lot of people in the US every year.
- Prostate cancer incidence is trending upwards and is the highest of all cancers, albeit for a sub-set of the population (men).
- Breast cancer still has significantly high mortality rates but the overall mortality trend is downward (-7.14% between 2003 and 2007).
- Leukemia has the 7th highest mortality rate and the incidence of leukemia does not feature in the CDC top ten.
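A trend figure like the one quoted for breast cancer is a relative change between the first and last year's rate. As a quick sketch of that arithmetic in Python: the function name is our own, and the two rates below are made-up numbers chosen to keep the sum easy to follow. They are not the CDC values.

```python
def percent_change(start_rate, end_rate):
    """Relative change from start_rate to end_rate, as a percentage."""
    if start_rate == 0:
        raise ValueError("start_rate must be non-zero")
    return (end_rate - start_rate) / start_rate * 100.0


# Hypothetical rates per 100,000 (NOT the CDC figures)
change = percent_change(50.0, 45.0)
print(f"{change:.2f}%")  # prints -10.00%
```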
Figure 1 Age-Adjusted Cancer Death Rates for the 6 Primary Sites with the Highest Rates within Race- and Ethnic-Specific Categories (Data provided by Centers for Disease Control and Prevention).
Figure 2 Age-Adjusted Invasive Cancer Incidence Rates for the 10 Primary Sites with the Highest Rates within Race- and Ethnic-Specific Categories (Data provided by Centers for Disease Control and Prevention).
After bringing the CDC data together, we proceeded to run a “phrase analysis” on the text content of 4 million+ research articles using scrazzl Analytics™. We defined our keywords as the major cancers resulting in death as reported by the CDC, namely, lung cancer, prostate cancer, breast cancer, colon cancer and leukemia. Figure 3 documents our results.
Figure 3 scrazzl Analytics™ showing the number of Reed-Elsevier articles that specifically mention notable cancers between 2003 and 2007.
scrazzl Analytics™ was configured in this instance to count a single phrase occurrence per article, so for example if “leukemia” appeared 10 times in an article it would only be counted once by our technology. This is important as we are interested in article count and not simply phrase count. The results are striking: leukemia, which has in relative terms a low mortality rate and incidence rate, has the highest article count. Breast cancer, which has a significantly lower incidence rate than prostate cancer, also features prominently in article counts. It is important to note at this stage that our analysis has not taken account of phrase aliases, so there is a margin of error within our results. Nevertheless, the obvious disparity between the death/incidence rates and research efforts as measured by article count raises the question: are research efforts focussing on the wrong cancers?
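For the curious, the difference between article count and phrase count can be sketched in a few lines of Python. This is an illustrative stand-in, not how scrazzl Analytics™ is implemented: the function name and the sample texts are made up, and it assumes each article is available as a plain string.

```python
import re


def article_counts(articles, phrases):
    """Count how many articles mention each phrase at least once."""
    counts = {phrase: 0 for phrase in phrases}
    for text in articles:
        lowered = text.lower()
        for phrase in phrases:
            # Word boundaries stop a phrase matching inside a longer word
            pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
            if re.search(pattern, lowered):
                # One increment per article, however often the phrase appears
                counts[phrase] += 1
    return counts


# Made-up snippets standing in for full-text articles
sample_articles = [
    "Leukemia outcomes improved. Childhood leukemia and adult leukemia differ.",
    "We compare lung cancer and breast cancer screening.",
    "A study of breast cancer risk.",
]

counts = article_counts(sample_articles, ["leukemia", "lung cancer", "breast cancer"])
print(counts)
```

The first snippet mentions “leukemia” three times but contributes only one to the leukemia total, which is the behaviour described above.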
While this is a somewhat simplified approach to measuring trends in cancer research investment vs the most dangerous cancers in the population, the results prompt further discussion nonetheless. Similar to popular culture, scientific research can follow unexpected trends. Certain diseases become “popular” to research, perhaps because they are particularly emotive or have high incidence in population subsets that are vulnerable (leukemia and children). It is also likely that cancers such as breast cancer and leukemia attract disproportionate amounts of research funding and therefore attract greater numbers of researchers. It is also notable that the various charities that support research efforts on these types of cancers, leukemia and breast cancer, are very well organised and well supported within the global community.
When making decisions about research focus it is important that researchers, funding agencies and journal editorial staff are mindful of disease macrostatistics. When prioritising research, individual scientists should look at the cold hard facts, i.e. which cancers are killing the most people and occurring most frequently in populations. Our analysis of research output suggests that currently this may not be happening. Prostate cancer, lung cancer, and colon cancer are killing large numbers of people but appear to be under-researched when compared to the research output relating to breast cancer and leukemia.
To explore these findings further and improve the accuracy of this analysis we are currently looking at a number of solutions and are keen to get some suggestions from you guys. There are two issues that we think may be impacting the accuracy:
- Aliases and disease variants: Many cancers have more than one name or exist as multiple variants. For this reason we are seeking to incorporate a disease ontology into our analysis.
- Limited data coverage: This analysis has been performed on Elsevier data only and, while this represents a significant proportion of research content, it is not all-encompassing. Over the coming months we hope to add more data sources, allowing more accurate analysis.
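To show roughly what the alias fix would involve, here is a small Python sketch that folds several names for a disease into one canonical entry before counting articles. The alias table is a tiny hand-written stand-in, not a real disease ontology, and the names in it are only examples; the function name is likewise hypothetical.

```python
import re

# Hypothetical alias table: canonical name -> phrasings treated as equivalent.
# A real disease ontology would supply far more complete mappings.
ALIASES = {
    "leukemia": ["leukemia", "leukaemia"],
    "lung cancer": ["lung cancer", "cancer of the lung"],
}


def article_counts_with_aliases(articles, aliases):
    """Count articles mentioning any alias of each canonical disease name."""
    counts = {name: 0 for name in aliases}
    for text in articles:
        lowered = text.lower()
        for name, variants in aliases.items():
            # any() means an article counts once even if it uses several aliases
            if any(re.search(r"\b" + re.escape(v) + r"\b", lowered) for v in variants):
                counts[name] += 1
    return counts


# Made-up snippets standing in for full-text articles
sample_articles = [
    "Acute leukaemia, also written leukemia, was studied.",
    "Cancer of the lung remains common.",
    "Lung cancer screening trial.",
]

alias_counts = article_counts_with_aliases(sample_articles, ALIASES)
print(alias_counts)
```

Without the alias table, the second snippet would be missed entirely by a search for “lung cancer”, which is exactly the margin of error mentioned earlier.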
In my next post I’m going to take a look at some more interesting trends that we are seeing in disease related article data and how we are making it possible to view this data in new and useful ways.
Having just passed the 2 month line in my time here with scrazzl, I figure it’s about time I jot down a few thoughts about what we’re getting up to in here, and what it’s like to join such a young company early on.
Who am I?
My name is Daniel Hunt, and I’ve joined the scrazzl tech team as a senior engineer.
What do I do?
But, more specifically, I’m a dedicated techy who likes to keep his fingers in all the pies :)
What pies do I ♥?
I ♥ JS.
Ask me to choose between JS and PHP and I’ll cry. Ask me to choose between JS and anything else, and I’ll hug you with curly braces.
Naturally enough, this has recently developed into a heavy interest in all things Node, but I’ve yet to find an excuse to build anything scrazzl related in it :)
I ♥ databases
There’s just something about the hidden world of database schema design/maintenance/improvement that floats my boat.
The strictly normalised approach of standard relational databases (such as MySQL) has been my bread and butter for years now.
I ♥ JS & databases
As you’ll read a little later, I’ve recently come to love the combination of the other 2 things I love most.
I ♥ PHP
At this stage, PHP is a mature language. There are plenty of other upstarts claiming the mindshare amongst new developers, but I’ve yet to come across something to pull me away from its allure.
When it comes to web development, rapid prototyping and easy maintainability, PHP holds its own with elegant style.
I’ve been lucky enough to have been surrounded by many, many great PHP developers in the past, who have helped me climb the development ladder. As a result, now that I’ve joined a startup, I want to make sure that others get the same chances I got.
I ♥ the cloud
I’ve been lucky enough to have been exposed to Amazon’s EC2 regularly for the past few years - almost as soon as it was available, if I remember correctly.
Without it, nothing I’ve built over that time would have had the flexibility, capability or reliability that it has. The cloud, as they say, is the future.
But it’s not. It’s the present. We’ve already arrived at the future.
Why did I join scrazzl?
Since entering the workforce a number of years ago, I’ve found myself gravitating towards smaller companies with every job-hop I made. More recently, I had found myself happy with the work I had done in my previous role, as the lead developer on DeviceAtlas.
Having spent a number of years there, honing my (admittedly, considerable ;) ) skills, I decided it was time to look for a new challenge. scrazzl popped up on my radar not long after I started scouting for one of those challenges, piquing my interest immediately.
The combination of an interesting problem to solve, a clear requirement in the market for the product itself, an enthusiastic CTO and a chance to help build the tech-stack from the ground up turned out to be too good an opportunity for me to skip.
So, I took the plunge.
2 months in
As I mentioned above, I’m now 2 months into my time here. In that time, I’ve been exposed to a whole array of technologies that I’ve been looking forward to trying out for a very long time, some of which I’ve mentioned below.
We are, first and foremost, a PHP dev house.
Our core libraries are written in PHP, as are our toolkits and anything else we find needs doing.
The wealth of development knowledge in scrazzl is incredible, and the elegance with which the main systems have been created knows no bounds. (too much? yeah. too much…. ;) )
Our main website is built using Zend Framework
Having only had limited exposure to this framework before joining scrazzl, I was a little apprehensive about just how easily I’d be able to jump in and start adding to the codebase.
Luckily enough, scrazzl has a pretty strict coding standard, so not only was everything well documented, but the existing code more or less steered me through the forest that was the initial learning curve.
We’ve used MongoDB extensively
As it relies entirely on JSON-style documents (stored internally as BSON), I feel right at home when playing about in MongoDB data. Easy to understand, and easy to get started with, this has been one of the highlights of working at scrazzl so far.
Schemaless design feels very natural once you shake off the shackles of SQL normalisation - it’s obviously not suitable for everything, but we’ve found it ideal for our analytics system.
Its speed is truly remarkable: even when dealing with very large data sets I’ve yet to see it crumble under the strain.
Knowing what I know now about NoSQL data storage mechanisms, I’ll be very reluctant to jump straight into a standard MySQL install without first considering the reasons for using a relational database!
Our servers run on Amazon EC2
I mentioned how much I love EC2 above.
It’s all true.
- I love the flexibility it provides, by allowing us to bring up/tear down servers at will.
- I love the redundancy it allows you to build into your systems, by forcing you to think of WHEN something will fail, not IF.
- I love how easy it is to build scalable solutions (such an easy thing to say, I know….).
But most of all?
I love that scrazzl uses it.
What’s it like in scrazzl?
scrazzl is a young company. So young, in fact, that we’ve been looking for more people for a few months now, and are hoping to grow very fast, very soon.
The technology we get to play with daily is interesting, self-rewarding and, because we’re building our stack from the ground up, completely under our own control. I’m really looking forward to expanding what we have as the team grows - we have some very big plans in place for every corner of our systems! :)
The work itself is very rewarding, and absolutely every line of code written has a direct impact on what we do.
With so much work to do, it’s very important that everyone gets on well together. I think I’m very lucky to have managed to move all the way from huge multinational companies, down to a small startup and found myself on a great team of people. We pump through an amazing amount of work, are constantly evaluating one another to make sure that no one is getting left behind in the (organised) mayhem, and somehow we still manage to have a laugh each and every day.
If you’re interested in what we do, and would like to join us, feel free to contact us directly - email@example.com
We’re looking for great people to help the fantastic atmosphere we’ve already started to build here to grow - so don’t be shy!
I’m looking forward to the following months, where we will be launching our site officially, signing up a whole raft of customers, building our team significantly and partying until the small hours.
Well, I’ve got to have something to aim for, don’t I?
Over the past eight months we have participated in two well known business accelerator programmes, Propel and LaunchPad. For the greater good I felt it might be useful to share some of my thoughts on the pros/cons of each.
1. Propel is currently run on behalf of Enterprise Ireland (EI) by PA Consulting. In our case the programme was split into two phases. Phase 1 involved ~20 companies and was run over a number of one-day seminars. These seminars were designed to help promoters flesh out their propositions a bit and develop a first pass elevator pitch. Promoters were told from the start that only 10 companies would progress to phase 2, so there was a healthy competitive element to the programme from the start. In my experience the promoters who openly participated and were receptive to constructive criticism were the ones whose companies ultimately progressed to phase 2. Phase 2 was run over 6 months and involved 10 companies. Most companies received CORD funding and desk space at a location of their choosing. My phase 2 pros and cons:
- Great participation and support from other companies involved.
- Great speakers who were willing to share their experience.
- Challenging environment. The programme consistently challenged your assumptions about your target market.
- I would say the programme is too long. 12 weeks should be enough for a minimum viable product and market validation.
- A little heavy on business theory at times.
- There was no prize money on the table for the final pitches or overall winner. This took the competitive edge out of the programme a bit.
Overall opinion: Go for it, you’ll learn a lot and it will knock some corners off your proposition.
2. LaunchPad at the NDRC started in January 2011 and ran for 12 weeks. There were 10 companies selected to participate. Companies received €5k per promoter and €5k expenses, desk space and services. Unlike Propel, it was a requirement that all companies be based at the NDRC. Throughout the programme there was a heavy emphasis on applying the lean start-up methodology to projects. The brief from the LaunchPad team from day 1 was to focus on commercial viability because technical viability is pretty much a given. This forced many of us out of our comfort zone toward our market and customers. My LaunchPad pros and cons:
- Positive, constructive work space.
- The heavy emphasis on messaging and presentation was great for us.
- Focussed competitive energy between companies throughout the programme.
- You have to give up equity. Of course this can be viewed as a positive too as it gives you negotiating experience in a safe environment.
- Weak on connections and door-opening. This was our experience with both Propel and LaunchPad.
- There is no support for business plan preparation which most investors and EI require for any kind of funding.
Overall opinion: Go for it, great programme and a great place to accelerate an idea.
Final thoughts: In my view every first time start-up should look to do at least one accelerator programme. The reality is that there is so much that you need to know when you start a business and attempting it in isolation will be challenging in the extreme.
scrazzl is hiring! First up, we’re looking for a great PHP engineer. Below is a list of things we’d like to see from a technical skills point of view. On a personal level, we’d like to work with someone who deals well with pressure, has a good sense of humour and of course is a hard worker. ( You also like music, beer, and pretending to do extreme sports like mountain biking and surfing! )
You’d have a keen interest in various different technologies / languages and not just the usual PHP-verse, but we’d definitely like to see:
– Apache Solr
– NoSQL such as Mongo, Redis etc
– Appcelerator Titanium
– A brief email outlining who you are and what you are about.
– A C.V.
Don’t forget to include:
– LinkedIn profile
– GitHub profile
– Last.fm profile ;)
What happens then? Well, we’ll meet up with people either face to face or via Skype (we’re good with tele-working!) to talk about what they want, what we want and see if things match up. We’ll probably ask you some fairly technically challenging questions too.
So, drop us a line to firstname.lastname@example.org and we’ll kick things off from there.
This weekend, I got the chance to attend SciBarCamb in Cambridge, and thought I would write briefly about the experience. There were a number of very interesting talks and I’ve written about a few of them below.
In Ian Mulvany’s talk we discussed the limits of knowledge and whether we can build a computer which could accurately represent anything in existence, something which could pass the Turing test, a computer which could build other computers, etc. My immediate thoughts went to things that we believe to be impossible to replicate / predict in general, e.g. chaos theory – weather systems etc, and if these things hold true, then can we really build such a system?
We went on to talk about whether or not we are losing knowledge through abstraction of information. For example, most people don’t understand how a smartphone works. Does this reduce what we know or what we will know as a society, or does it enable us to advance at a greater pace? This could have been a day conference in itself to be honest!
Another interesting chat was in relation to data producers and data consumers. A number of people spoke about the lack of information flow within the scientific community at the moment and how to improve upon that. One interesting point was the lack of options for exposing your research data in conjunction with a particular publication. It was pointed out by Cameron Neylon that Data Dryad was one such service.
Apart from a lack of services, it seems that scientists are quite conservative about exposing their data, and that these tools won’t take off until this attitude changes. The attitude was summed up by Ian Mulvany’s comment: “I hope in my lifetime to be at a data in science discussion where the topic is favourite sharing tools not whether they should”
This chat connected naturally to Matt Wood’s demo of a tool which he has developed called Lark (not lark.com as it happens, which is a “silent waking system for couples”!). The tool was similar to a project management system for your data. The most interesting function was that the data could be replicated / forked by other users in a GitHub-like fashion and, more importantly, that in the event of a problem in the data, all users who replicated the data would automatically be notified. Pretty interesting.
All in all, it was a great event, and congrats to the organisers for a job well done. I’d definitely go again if only to try and steal some brain power from the incredibly smart people who attended :)
The last couple of months have been pretty crazy for us. In December, we launched a closed beta of scrazzl at UCD and have been busy managing feedback from those users.
Last month I was delighted to be asked to speak to a class of MBA students from Smurfit Business School on the topic of entrepreneurship. The following is the slidedeck and some notes from that talk.
Over the next few months we are going to be announcing some exciting new features for our customers and users so watch this space.