Not being a Com Sci graduate has never worried me before, despite the fact that a lot of my work revolves around code: I build web-based applications such as content management systems and learning support tools, and I teach coding skills to students learning to create Interactive Media projects. These projects all live in a domain fairly high up the hardware stack, and sure, it's a domain which is looked on as a poor relation by 'real' programmers who think of Flash and PHP and the like as the place where 'script-kiddies' hang out. And we kiddies rely on the hardcore skills of those 'real' programmers who interface with the chips, manage the hardware, and protect us from the buffer overrun. We're not likely to induce a BSOD (blue screen of death) with our AJAX fluff, and when the consequences of our code expand like gas because we don't care whether we're making copies or passing references, the admin-God just kills our allocated processes and all is well.
However, the lack of a formal grounding in some of the computing paradigms does rear its ugly head from time to time, such as when you first discover that your code doesn't work because you don't know that you're creating copies rather than passing references, or you have no idea why your floating point number is different every time. It's recently reared its head in a project I've been working on, because I'd never really thought very deeply about databases before. I mean, if you want to store something, what more do you need than a filesystem and an RDBMS?
I've been working on a data management system, which I've been hoping will make my life a great deal easier. The idea follows from the repetitive tasks I carry out in each project: reusing very similar pieces of code every time, but needing to make small adjustments in every case. For example, websites where users add content require code which matches input to a database schema and makes the transaction. Structurally these pieces of code are almost identical, but in each scenario the nature of the information stored is always slightly different. So in each project, some time needs to be spent designing the schema, and repurposing code to fit the nature of the information stored, how it is accessed, etc. So the idea is that if I build a data management layer, I can keep reusing the same code to manage the validation, form creation, retrieval and transactions. All that needs to change in each project is some set of templates which define what kind of information will be stored. So far, so good.
So, in the case of a blogging tool, the definition of what a blog post consists of is being moved from the functional code to a templating system. This system can then be reused in projects where the data entities are things other than blog posts - learning objects, for example, or journal articles, or images with metadata, etc. This all seemed so fine that an awesome idea occurred to me: if I were to make the templating system flexible enough that users could define their own templates, then the whole layer could become an open source software project which others might find useful for creating their own web applications, whether they were creating communal blogs, data repositories, picture galleries, or indeed anything which would require users to input data as defined by the owners of the site.
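To make the idea concrete, here is a minimal Python sketch of what "moving the definition into a template" might look like. It is only an illustration under my own assumptions: the template format, the field names and the `validate` function are all hypothetical, not code from the actual project.

```python
from datetime import date

# Hypothetical templates: field name -> (expected type, required?).
# Only these dictionaries would change from project to project.
BLOG_POST_TEMPLATE = {
    "title": (str, True),
    "body": (str, True),
    "published": (date, False),
}

IMAGE_TEMPLATE = {
    "filename": (str, True),
    "caption": (str, False),
}


def validate(template, record):
    """Return a list of error messages; an empty list means the record fits."""
    errors = []
    for name, (expected_type, required) in template.items():
        if name not in record:
            if required:
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(record[name], expected_type):
            errors.append(f"{name} should be {expected_type.__name__}")
    # Reject anything the template does not describe.
    for name in record:
        if name not in template:
            errors.append(f"unexpected field: {name}")
    return errors
```

The same `validate` function serves both templates, which is the whole point: `validate(BLOG_POST_TEMPLATE, {"title": "Hello", "body": "First post"})` and `validate(IMAGE_TEMPLATE, {"filename": "cat.png"})` both come back with no errors, without any blog-specific or image-specific code.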
This is the point at which my inexperience began to unravel my progress. In principle, there is nothing wrong with the notion of introducing this much flexibility into the data layer. However, I hadn't really considered the consequences that such flexibility would have for my idea of structure. Relational databases are the only way I really know how to store data on a server in a form that is quickly retrieved. Relational databases are designed for highly structured data (and of course, a 'template' is effectively a way of describing highly structured data). So I set about thinking about how I could make database schemas and transactions flexible enough to allow multiple templates with a potentially unlimited number of component parts (titles, authors, links, pictures, locations, tags, pet-names, ids, descriptions, paragraphs, dates, geo-tags, semantic-web-entities, files, bibliographical data, and so on). The solutions which presented themselves all seemed to increase complexity exponentially, and I feared creating a monster system which would hog resources, whether it was due to joining countless tables in the RDBMS, or accessing multiple flat XML files in each page request, or simply filling memory with endless bloat. Maybe none of these potential problems would be an issue. But given that I was trying to create a system for many possible uses, how could I possibly know? Here's part of the problem I've found in taking code you use to do one thing, and trying to make it flexible enough for others to use for other purposes: is it possible to write code that does one thing well, and then make it do anything well?
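One relational way of getting this kind of flexibility is an entity-attribute-value layout, where every field of every entity becomes a row in a single table. The sketch below uses Python's built-in `sqlite3` module; the table names and the `save`, `load` and `find_by` helpers are my own assumptions for illustration, not the schema I actually built. It shows where the worry about joins comes from: each extra search criterion costs another join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, template TEXT NOT NULL);
    CREATE TABLE attribute (
        entity_id INTEGER NOT NULL REFERENCES entity(id),
        name      TEXT NOT NULL,
        value     TEXT
    );
""")


def save(template, fields):
    """Store one entity as a row in entity plus one attribute row per field."""
    cur = conn.execute("INSERT INTO entity (template) VALUES (?)", (template,))
    entity_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO attribute (entity_id, name, value) VALUES (?, ?, ?)",
        [(entity_id, name, value) for name, value in fields.items()],
    )
    return entity_id


def load(entity_id):
    """Reassemble an entity's fields into a dictionary."""
    rows = conn.execute(
        "SELECT name, value FROM attribute WHERE entity_id = ?", (entity_id,)
    ).fetchall()
    return dict(rows)


def find_by(template, **criteria):
    """Find entity ids matching every criterion: one extra join per criterion."""
    sql = "SELECT e.id FROM entity e"
    params = []
    for i, (name, value) in enumerate(criteria.items()):
        sql += (f" JOIN attribute a{i} ON a{i}.entity_id = e.id"
                f" AND a{i}.name = ? AND a{i}.value = ?")
        params += [name, value]
    sql += " WHERE e.template = ? ORDER BY e.id"
    params.append(template)
    return [row[0] for row in conn.execute(sql, params)]
```

Saving and loading stay simple, but a search on author, tag and date already means three joins against the same table, and every value has been flattened to text along the way. Whether that matters in practice depends on data sizes I couldn't predict.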
A moment of clarity occurred recently: I'd been following Damien Katz's updates on CouchDB for a while and finally got round to reading the wiki documentation for his system. A sentence leapt out at me and made me realise why I was having so much trouble:
"Unlike SQL databases which are designed to store and report on highly structured, interrelated data, CouchDB is designed to store and report on large amounts of semi-structured, document oriented data. CouchDB greatly simplifies the development of document oriented applications, which make up the bulk of collaborative web applications."
Thank you for showing me the wood! How do you create flexible structures? What is a flexible structure? The fact that I'd never considered there to be any possibility other than storing data in an RDBMS meant I didn't even know that was the question I was trying to answer. It's pure Hobbes vs Rousseau, in code. And btw, I don't yet know the answer.