Store Technology meta data in BQ and icons in GCS #73

Merged 18 commits on Nov 18, 2024

@max-ostapenko commented Nov 8, 2024

  • metadata upload to wappalyzer.apps BQ table
  • icons upload to Cloud Storage
  • upload to BQ on push to main
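As a rough illustration of the metadata upload step listed above, the sketch below turns a technology definition map into newline-delimited JSON, a format BigQuery load jobs accept. The function name `toNdjsonRows`, the row fields, and the `default.svg` icon fallback are assumptions made for this example; the PR's actual table schema and script are not shown in this conversation.

```javascript
// Sketch only: the row shape below is hypothetical, not the schema used by this PR.
function toNdjsonRows(technologies) {
  return Object.entries(technologies)
    .map(([name, definition]) =>
      JSON.stringify({
        name,
        description: definition.description ?? null,
        website: definition.website ?? null,
        // Assumed fallback for technologies that do not define an icon.
        icon: definition.icon ?? 'default.svg',
      })
    )
    .join('\n')
}

// Hypothetical input shaped like one entry map from src/technologies/*.json.
const sample = {
  'web-vitals': {
    description: 'A tiny, modular library for measuring web vitals metrics.',
    website: 'https://github.com/GoogleChrome/web-vitals',
    icon: 'web-vitals.svg',
  },
  HSTS: { website: 'https://www.rfc-editor.org/rfc/rfc6797#section-6.1' },
}

// One JSON object per line, ready to be written to a file for a load job.
console.log(toNdjsonRows(sample))
```

Keeping the transformation a pure function makes it easy to check the rows before any network call is made.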

We will require the data about the technologies:

@max-ostapenko marked this pull request as ready for review November 8, 2024 18:50
@max-ostapenko changed the title from "Bq upload" to "BQ upload" Nov 8, 2024

Copilot AI left a comment

Copilot reviewed 9 out of 14 changed files in this pull request and generated 1 suggestion.

Files not reviewed (5)
  • package.json: Language not supported
  • src/technologies/r.json: Language not supported
  • src/technologies/s.json: Language not supported
  • src/technologies/u.json: Language not supported
  • src/technologies/w.json: Language not supported
Comments skipped due to low confidence (1)

scripts/upload_technology.js:165

  • [nitpick] The variable name 'app' is ambiguous. It should be renamed to 'technologyData' for clarity.
const app = {


if (job.status.errors && job.status.errors.length > 0) {
  console.error('Errors encountered:', job.status.errors)
  throw new Error('Error loading data into BigQuery')

Copilot AI Nov 15, 2024

The error message could be more descriptive. Consider including more context about the error.

Suggested change
- throw new Error('Error loading data into BigQuery')
+ throw new Error(`Error loading data into BigQuery table ${datasetName}.${tableName}: ${job.status.errors}`)
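One caveat with interpolating `job.status.errors` directly: it is an array of objects, so a template literal renders each entry as `[object Object]`. A minimal sketch of one way to keep the detail is below. The helper names `formatLoadErrors` and `assertJobSucceeded` and the `tableId` parameter are hypothetical; the `reason`, `location`, and `message` fields follow BigQuery's error object shape.

```javascript
// Sketch only: helper names are hypothetical and not part of this PR.
function formatLoadErrors(errors) {
  return errors
    // Keep whichever of the fields are present on each error entry.
    .map((e) => [e.reason, e.location, e.message].filter(Boolean).join(': '))
    .join('; ')
}

function assertJobSucceeded(job, tableId) {
  const errors = (job.status && job.status.errors) || []
  if (errors.length > 0) {
    throw new Error(
      `Error loading data into BigQuery table ${tableId}: ${formatLoadErrors(errors)}`
    )
  }
}

// Plain interpolation of an array of objects loses the detail.
console.log(String([{ message: 'Bad value' }])) // prints "[object Object]"
console.log(formatLoadErrors([{ reason: 'invalid', message: 'Bad value' }]))
```

`JSON.stringify(job.status.errors)` would also work if the full raw payload is preferred over a compact message.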

@max-ostapenko changed the title from "BQ upload" to "Store Technology meta data in BQ and icons in GCS" Nov 16, 2024

WPT test run for https://almanac.httparchive.org/en/2022/

WPT test run results: http://webpagetest.httparchive.org/results.php?test=241118_VV_1
Detected technologies:

{
    "detected": {
        "IaaS": "Google Cloud",
        "JavaScript libraries": "web-vitals",
        "RUM": "web-vitals",
        "Performance": "Priority Hints,Google Cloud Trace",
        "Security": "HSTS",
        "Webmail": "Google Workspace",
        "Email": "Google Workspace",
        "Analytics": "Google Analytics",
        "CDN": "Cloudflare",
        "Miscellaneous": "RSS,Open Graph"
    },
    "detected_apps": {
        "Google Cloud": "",
        "web-vitals": "",
        "Priority Hints": "",
        "HSTS": "",
        "Google Workspace": "",
        "Google Cloud Trace": "",
        "Google Analytics": "",
        "Cloudflare": "",
        "RSS": "",
        "Open Graph": ""
    },
    "detected_technologies": {
        "Google Cloud": {
            "name": "Google Cloud",
            "description": "Google Cloud is a suite of cloud computing services.",
            "slug": "google-cloud",
            "categories": [
                {
                    "id": 63,
                    "slug": "iaas",
                    "groups": [
                        7
                    ],
                    "name": "IaaS",
                    "priority": 8
                }
            ],
            "confidence": 100,
            "version": "",
            "icon": "Google Cloud.svg",
            "website": "https://cloud.google.com",
            "pricing": [],
            "cpe": "cpe:2.3:a:google:cloud_platform:*:*:*:*:*:*:*:*"
        },
        "web-vitals": {
            "name": "web-vitals",
            "description": "The web-vitals JavaScript is a tiny, modular library for measuring all the web vitals metrics on real users.",
            "slug": "web-vitals",
            "categories": [
                {
                    "id": 59,
                    "slug": "javascript-libraries",
                    "groups": [
                        9
                    ],
                    "name": "JavaScript libraries",
                    "priority": 9
                },
                {
                    "id": 78,
                    "slug": "rum",
                    "groups": [
                        2
                    ],
                    "name": "RUM",
                    "priority": 9
                }
            ],
            "confidence": 100,
            "version": "",
            "icon": "web-vitals.svg",
            "website": "https://github.com/GoogleChrome/web-vitals",
            "pricing": [],
            "cpe": null
        },
        "Priority Hints": {
            "name": "Priority Hints",
            "description": "Priority Hints exposes a mechanism for developers to signal a relative priority for browsers to consider when fetching resources.",
            "slug": "priority-hints",
            "categories": [
                {
                    "id": 92,
                    "slug": "performance",
                    "groups": [
                        7
                    ],
                    "name": "Performance",
                    "priority": 9
                }
            ],
            "confidence": 100,
            "version": "",
            "icon": "Priority Hints.svg",
            "website": "https://wicg.github.io/priority-hints/",
            "pricing": [],
            "cpe": null
        },
        "HSTS": {
            "name": "HSTS",
            "description": "HTTP Strict Transport Security (HSTS) informs browsers that the site should only be accessed using HTTPS.",
            "slug": "hsts",
            "categories": [
                {
                    "id": 16,
                    "slug": "security",
                    "groups": [
                        11
                    ],
                    "name": "Security",
                    "priority": 9
                }
            ],
            "confidence": 100,
            "version": "",
            "icon": "default.svg",
            "website": "https://www.rfc-editor.org/rfc/rfc6797#section-6.1",
            "pricing": [],
            "cpe": null
        },
        "Google Workspace": {
            "name": "Google Workspace",
            "description": "Google Workspace, formerly G Suite, is a collection of cloud computing, productivity and collaboration tools.",
            "slug": "google-workspace",
            "categories": [
                {
                    "id": 30,
                    "slug": "webmail",
                    "groups": [
                        4
                    ],
                    "name": "Webmail",
                    "priority": 2
                },
                {
                    "id": 75,
                    "slug": "email",
                    "groups": [
                        4,
                        2
                    ],
                    "name": "Email",
                    "priority": 9
                }
            ],
            "confidence": 100,
            "version": "",
            "icon": "Google.svg",
            "website": "https://workspace.google.com/",
            "pricing": [],
            "cpe": null
        },
        "Google Cloud Trace": {
            "name": "Google Cloud Trace",
            "description": "Google Cloud Trace is a distributed tracing system that collects latency data from applications and displays it in the Google Cloud Console.",
            "slug": "google-cloud-trace",
            "categories": [
                {
                    "id": 92,
                    "slug": "performance",
                    "groups": [
                        7
                    ],
                    "name": "Performance",
                    "priority": 9
                }
            ],
            "confidence": 100,
            "version": "",
            "icon": "google-cloud-trace.svg",
            "website": "https://cloud.google.com/trace",
            "pricing": [],
            "cpe": null
        },
        "Google Analytics": {
            "name": "Google Analytics",
            "description": "Google Analytics is a free web analytics service that tracks and reports website traffic.",
            "slug": "google-analytics",
            "categories": [
                {
                    "id": 10,
                    "slug": "analytics",
                    "groups": [
                        8
                    ],
                    "name": "Analytics",
                    "priority": 9
                }
            ],
            "confidence": 100,
            "version": "",
            "icon": "Google Analytics.svg",
            "website": "https://google.com/analytics",
            "pricing": [],
            "cpe": null
        },
        "Cloudflare": {
            "name": "Cloudflare",
            "description": "Cloudflare is a web-infrastructure and website-security company, providing content-delivery-network services, DDoS mitigation, Internet security, and distributed domain-name-server services.",
            "slug": "cloudflare",
            "categories": [
                {
                    "id": 31,
                    "slug": "cdn",
                    "groups": [
                        7
                    ],
                    "name": "CDN",
                    "priority": 9
                }
            ],
            "confidence": 100,
            "version": "",
            "icon": "CloudFlare.svg",
            "website": "https://www.cloudflare.com",
            "pricing": [],
            "cpe": null
        },
        "RSS": {
            "name": "RSS",
            "description": "RSS is a family of web feed formats used to publish frequently updated works—such as blog entries, news headlines, audio, and video—in a standardized format.",
            "slug": "rss",
            "categories": [
                {
                    "id": 19,
                    "slug": "miscellaneous",
                    "groups": [
                        6
                    ],
                    "name": "Miscellaneous",
                    "priority": 10
                }
            ],
            "confidence": 100,
            "version": "",
            "icon": "RSS.svg",
            "website": "https://www.rssboard.org/rss-specification",
            "pricing": [],
            "cpe": null
        },
        "Open Graph": {
            "name": "Open Graph",
            "description": "Open Graph is a protocol that is used to integrate any web page into the social graph.",
            "slug": "open-graph",
            "categories": [
                {
                    "id": 19,
                    "slug": "miscellaneous",
                    "groups": [
                        6
                    ],
                    "name": "Miscellaneous",
                    "priority": 10
                }
            ],
            "confidence": 100,
            "version": "",
            "icon": "Open Graph.png",
            "website": "https://ogp.me",
            "pricing": [],
            "cpe": null
        }
    }
}
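In the output above, the `detected` map appears to be derivable from `detected_technologies`: each category name maps to a comma-joined list of the technologies that carry it, in the order they were detected. The sketch below reproduces that relationship on a subset of the data. It is an observation about this output, not a description of how the test agent computes it, and the function name `groupByCategory` is made up for the example.

```javascript
// Sketch only: reconstructs the category -> technologies view from the per-technology data.
function groupByCategory(detectedTechnologies) {
  const byCategory = {}
  for (const tech of Object.values(detectedTechnologies)) {
    for (const category of tech.categories) {
      const names = byCategory[category.name]
      // Append to the existing comma-joined list, or start a new one.
      byCategory[category.name] = names ? `${names},${tech.name}` : tech.name
    }
  }
  return byCategory
}

// Subset of the detected_technologies shown above, reduced to the fields used here.
const detectedTechnologies = {
  'Priority Hints': { name: 'Priority Hints', categories: [{ name: 'Performance' }] },
  'Google Workspace': {
    name: 'Google Workspace',
    categories: [{ name: 'Webmail' }, { name: 'Email' }],
  },
  'Google Cloud Trace': { name: 'Google Cloud Trace', categories: [{ name: 'Performance' }] },
}

console.log(groupByCategory(detectedTechnologies))
```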

@max-ostapenko merged commit 603be50 into main Nov 18, 2024
2 checks passed
@max-ostapenko deleted the bq-upload branch November 18, 2024 22:04